Machine Learning Interview Questions

1. Why Was Machine Learning Introduced?

The simplest answer is to make our lives easier. In the early days of "intelligent" applications, many systems used hardcoded rules of "if" and "else" decisions to process data or adjust to user input. Consider a spam filter, whose job is to move the appropriate incoming email messages to a spam folder.

With machine learning algorithms, however, we give the algorithm enough data to learn and identify the patterns on its own.

Unlike traditional approaches, we don't need to write new rules for each machine learning problem; instead, we simply use the same workflow with a different dataset.



2. What Are Machine Learning Algorithms and What Are Their Different Types?

Machine learning algorithms come in various types. One general way to categorise them is by whether or not they are trained under human supervision (supervised, unsupervised, and reinforcement learning).

Such categories are not mutually exclusive; we can combine them in any way we see fit.


3. What is Supervised Learning and How Does It Work?

Supervised learning is a machine learning approach that infers a function from labelled training data. The training data consists of a set of training examples.


Example 01


Knowing a person's height and weight can help identify their gender. The most popular supervised learning algorithms are listed below.


Support Vector Machines (SVMs)

Naive Bayes

Decision Trees

K-nearest Neighbour Algorithm

Neural Networks

Example 02


To build a T-shirt classifier, we use labels such as "this is an S, this is an M, and this is an L," based on examples of S, M, and L shirts shown to the classifier.
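
The T-shirt example can be sketched with a k-nearest neighbour classifier, one of the supervised algorithms listed above. This is only an illustration: the height and weight features, the training values, and the function name `predict_size` are assumptions made for the example, not part of the original text.

```python
from collections import Counter
from math import dist

# Hypothetical labelled examples: (height_cm, weight_kg) -> T-shirt size.
TRAINING_DATA = [
    ((158, 52), "S"), ((160, 55), "S"), ((163, 58), "S"),
    ((168, 65), "M"), ((170, 68), "M"), ((173, 70), "M"),
    ((180, 82), "L"), ((183, 85), "L"), ((186, 90), "L"),
]

def predict_size(point, k=3):
    """Return the majority label among the k nearest training examples."""
    neighbours = sorted(TRAINING_DATA, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(predict_size((161, 56)))  # closest examples are all labelled "S"
print(predict_size((182, 84)))  # closest examples are all labelled "L"
```

The labels do the supervising here: the classifier never sees a rule for what makes a shirt an "S", only examples that carry the label.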


4. What is Unsupervised Learning and How Does It Work?

Unsupervised learning is a type of machine learning that searches for patterns in a set of data. There is no dependent variable or label to predict in this case. Algorithms for unsupervised learning include:

Clustering, Anomaly Detection, Neural Networks, and Latent Variable Models.

In the same T-shirt example, clustering would group the shirts into categories such as "collar style and V neck style," "crew neck style," and "sleeve types."
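
As a rough sketch of clustering, the snippet below runs plain k-means on unlabelled data. The measurements (chest and length in centimetres), the choice of two clusters, and the starting centroids are all assumptions for illustration; the text above groups shirts by style, which would need a different feature encoding.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabelled T-shirt measurements: (chest_cm, length_cm).
shirts = [(88, 66), (90, 67), (92, 68), (108, 76), (110, 77), (112, 78)]
centroids, clusters = k_means(shirts, centroids=[shirts[0], shirts[-1]])
print(centroids)
print(clusters)
```

No labels are supplied at any point; the two groups emerge only from the distances between the points.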


5. What does 'Naive' mean in the context of a Naive Bayes model?

The Naive Bayes method is a supervised learning algorithm. It is called naive because it applies Bayes' theorem under the assumption that all attributes are independent of each other.

Given the class variable y and the dependent feature vector x1 through xn, Bayes' theorem states the following relationship:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$$

The naive conditional independence assumption says that each xi is independent of the other features given the class:

$$P(x_i \mid y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y)$$

Applying this for every i, we can simplify the relationship to:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$

Because P(x1, ..., xn) is a constant given the input, we can apply the following classification rule:

$$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

We can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(xi | y). The former is then the relative frequency of class y in the training set.

The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution of P(xi | y): Bernoulli, binomial, Gaussian, and so on.
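
The classification rule can be illustrated with a small Bernoulli-style naive Bayes over word presence. This is a sketch under stated assumptions: the toy messages, the function names `train` and `predict`, and the add-one (Laplace) smoothing are choices made for this example rather than anything prescribed above. Log-probabilities are summed instead of multiplying probabilities, which gives the same arg max.

```python
from collections import Counter, defaultdict
from math import log

def train(messages):
    """Count what is needed to estimate P(y) and P(x_i | y) from (word_set, label) pairs."""
    class_counts = Counter(label for _, label in messages)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in messages:
        word_counts[label].update(words)
        vocabulary |= words
    return class_counts, word_counts, vocabulary

def predict(words, model):
    """Return the class maximising log P(y) + sum_i log P(x_i | y)."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        # P(y): relative frequency of the class in the training set.
        score = log(count / total)
        for word in vocabulary:
            # P(word present | y) with add-one smoothing (an assumption of this sketch).
            p_present = (word_counts[label][word] + 1) / (count + 2)
            score += log(p_present if word in words else 1 - p_present)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages, each reduced to a set of words.
messages = [
    ({"win", "money", "now"}, "spam"),
    ({"free", "money"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"lunch", "tomorrow", "now"}, "ham"),
]
model = train(messages)
print(predict({"free", "money", "now"}, model))
print(predict({"meeting", "tomorrow"}, model))
```

Each word contributes its own factor regardless of which other words are present, which is exactly the "naive" independence assumption.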


6. What exactly is PCA? When do you use it?

Principal component analysis (PCA) is the most commonly used method for dimensionality reduction.

PCA measures the variation in the data and finds the directions (principal components) along which it varies most. Directions with little variance are discarded, and the data is described by the remaining components.

As a result, the dataset becomes easier to visualise. PCA is used in a variety of fields, including finance, neuroscience, and pharmacology.

It is useful as a preprocessing step, especially when there are linear correlations between features.
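
To make the idea concrete, here is a minimal sketch that finds only the first principal component by power iteration on the covariance matrix. The function names, the toy data, and the use of power iteration (rather than a full eigendecomposition or SVD, which a library would normally use) are assumptions of this example. It also assumes the data is not constant and that the starting vector is not orthogonal to the leading direction.

```python
from math import sqrt

def first_principal_component(rows, steps=100):
    """Return the column means and the unit direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(steps):
        # Multiply by the covariance matrix and renormalise.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((x - m) * c for x, m, c in zip(row, means, component)) for row in rows]

# Two perfectly correlated columns: the second is always twice the first.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, pc1 = first_principal_component(data)
print(pc1)
print(project(data, means, pc1))
```

Because the two columns are linearly related, one coordinate along the first component is enough to describe every row, which is the situation where PCA helps most.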


7. Describe the SVM Algorithm in depth.

A Support Vector Machine (SVM) is a supervised machine learning model that can do linear and non-linear classification, regression, and even outlier detection.

Suppose we are given some data points, each of which belongs to one of two classes, and our goal is to separate the two classes based on a set of examples.

In SVM, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. This is called a linear classifier.

Many hyperplanes can classify the data. We choose the hyperplane that represents the largest separation, or margin, between the two classes.
If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier. Here the best hyperplane dividing the data is called H3.

We have data (x1, y1), ..., (xn, yn), where each xi is a vector of p features (xi1, ..., xip) and each yi is either 1 or -1.

The hyperplane H3 is the set of points x satisfying the equation:

$$w \cdot x - b = 0$$

where w is the normal vector of the hyperplane. The parameter b/||w|| determines the offset of the hyperplane from the origin along the normal vector w.

As a result, each xi lies on the side of the hyperplane for class 1 or for class -1. That is, each xi satisfies:

$$w \cdot x_i - b \ge 1 \quad \text{or} \quad w \cdot x_i - b \le -1$$
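
The hyperplane equation translates directly into a decision function. The sketch below assumes w and b are already known (finding them is the optimisation problem an SVM solver handles) and uses a made-up two-dimensional hyperplane; the function names are illustrative. The margin width 2/||w|| is the standard distance between the two planes on which the closest points of each class lie.

```python
from math import sqrt

def decision_value(w, b, x):
    """Compute w . x - b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def classify(w, b, x):
    """Points with w . x - b >= 0 are labelled 1, the rest -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the planes w . x - b = 1 and w . x - b = -1."""
    return 2 / sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane in two dimensions: x1 + x2 - 3 = 0.
w, b = (1.0, 1.0), 3.0
print(classify(w, b, (3.0, 2.0)))  # decision value is 2.0, so class 1
print(classify(w, b, (0.5, 0.5)))  # decision value is -2.0, so class -1
print(margin_width(w))
```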


8. What are SVM Support Vectors?

A Support Vector Machine (SVM) is an algorithm that tries to fit a line (or plane or hyperplane) between the distinct classes so that the distance from the line to the points of the classes is maximised.

In this way, it tries to find a robust separation between the classes. The support vectors are the points that lie on the edge of the margin around the dividing hyperplane.


9. What Are SVM's Different Kernels?

SVM supports several types of kernels. Four common ones are:

Linear kernel: used when the data is linearly separable.

Polynomial kernel: used when you have discrete data with no natural notion of smoothness.

Radial basis kernel: creates a decision boundary that can do a much better job of separating two classes than the linear kernel.

Sigmoid kernel: based on the same function that is used as an activation function in neural networks.
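
For reference, the kernels named above can be written as plain functions of two vectors. These follow the usual textbook forms; the parameter names and default values (`degree`, `gamma`, `coef0`) are assumptions chosen for this sketch, not values the text specifies.

```python
from math import exp, tanh

def dot(x, z):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # Depends only on the squared distance between the two points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    return tanh(gamma * dot(x, z) + coef0)

x, z = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, z))
print(polynomial_kernel(x, z))
print(rbf_kernel(x, z))
print(sigmoid_kernel(x, z))
```

Each kernel returns a similarity score for a pair of points; the SVM uses these scores in place of the plain dot product, which is what lets it draw non-linear boundaries.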


10. What is Cross-Validation, and how does it work?

Cross-validation is a technique for repeatedly splitting your data into training and testing parts. The data is divided into k subsets, and the model is trained on k-1 of them. The last subset is held out for testing. This is repeated for each subset. This is referred to as k-fold cross-validation. Finally, the overall score is calculated by averaging the scores from all k folds.
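
The procedure can be sketched in a few lines. The helper names, the toy "model" (which just predicts the mean of the training targets), and the mean-absolute-error score are assumptions for illustration; the folds here are contiguous and unshuffled, whereas in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / k

def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)

def mean_absolute_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = list(range(6))
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_validate(xs, ys, k=3, fit=fit_mean, score=mean_absolute_error))
```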


11. What is Machine Learning Bias?

Bias in data tells us that there is inconsistency in the data. The inconsistency can occur for a variety of reasons, which are not mutually exclusive.

For example, to speed up its hiring process, a tech giant like Amazon built an engine that would take in 100 resumes and spit out the top five candidates to hire.

When the company noticed that the software was not producing gender-neutral results, it adjusted the software to remove this bias.


12. What is the difference between regression and classification?

Classification is used to produce discrete results and to sort data into specific categories.
For example, classifying emails into spam and non-spam.

Regression, on the other hand, deals with continuous data.
For example, predicting stock prices at a certain point in time.

Classification predicts which of a set of classes the output belongs to.
For example, will it be hot or cold tomorrow?

Regression, on the other hand, predicts a quantity from the relationship the data represents.
For example, what will the temperature be tomorrow?


13. What is the difference between precision and recall?

Precision and recall are two metrics used to evaluate the effectiveness of a machine learning model. They are often used together.

Precision answers the question, "Of the items the classifier predicted to be relevant, how many are truly relevant?"

Recall, on the other hand, answers the question, "Of all the items that are truly relevant, how many did the classifier find?"

In general, precision means being exact and accurate, and the same holds for a machine learning model. If your model must predict a set of items as relevant, how many of those items are truly relevant?

Precision and recall can be defined mathematically as follows:

precision = (number of correct answers returned) / (total number of items returned by the ranker)

recall = (number of correct answers returned) / (total number of relevant answers)
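
A short worked example may help. The document identifiers and the function name `precision_recall` are made up for illustration; the sketch treats the ranker's output and the truly relevant items as two sets and counts their overlap.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for a set of returned items."""
    returned, relevant = set(returned), set(relevant)
    correct = returned & relevant  # items that were both returned and relevant
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranker output and ground truth.
returned = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5"}
precision, recall = precision_recall(returned, relevant)
print(precision)  # 2 correct out of 4 returned
print(recall)     # 2 correct out of 3 relevant
```

Returning more items can only raise or keep recall, but it tends to lower precision, which is why the two are reported together.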


14. What Should You Do If You're Overfitting or Underfitting?

Overfitting means the model fits the training data too well; in this case, we need to resample the data and estimate model accuracy using techniques such as k-fold cross-validation. Underfitting, by contrast, means the model is unable to understand or capture the patterns in the data; in this case, we need to change the algorithm or feed more data points to the model.


15. What is a Neural Network and How Does It Work?

A neural network is a simplified model of the human brain. Much like the brain, it has neurons that activate when they encounter something similar.

The neurons are linked by connections that allow information to flow from one neuron to the next.

16. What is the difference between a Loss Function and a Cost Function?

When computing loss, we consider only one data point; this is what a loss function does.

The cost function is used when determining the sum of errors over multiple data points. Otherwise, there isn't much of a difference.

To put it another way, a loss function captures the difference between the actual and predicted values for a single record, while a cost function aggregates that difference over the whole training set.

Mean-squared error and hinge loss are the most widely used loss functions.

The Mean-Squared Error (MSE) measures how close our model's predicted values are to the actual values:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\text{predicted value}_i - \text{actual value}_i)^2$$

Hinge loss is used to train a machine learning classifier:

$$L(\hat{y}) = \max(0, 1 - y \hat{y})$$

where y = -1 or 1 indicates the two classes and $\hat{y}$ represents the raw output of the classifier. In its most common form, a cost function represents the total cost as the sum of the fixed costs and the variable costs, as in the equation y = mx + b.
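
The loss-versus-cost distinction shows up clearly in code: one function scores a single record, and another averages it over many. The function names and sample numbers below are illustrative assumptions.

```python
def squared_error(predicted, actual):
    """Loss for a single record."""
    return (predicted - actual) ** 2

def mean_squared_error(predictions, actuals):
    """Cost: the squared-error loss averaged over all records."""
    return sum(squared_error(p, a) for p, a in zip(predictions, actuals)) / len(actuals)

def hinge_loss(output, label):
    """Hinge loss for one record; label is -1 or 1, output is the classifier's raw score."""
    return max(0.0, 1 - label * output)

print(mean_squared_error([2.0, 4.0, 6.0], [1.0, 4.0, 8.0]))  # (1 + 0 + 4) / 3
print(hinge_loss(2.5, 1))   # correct and outside the margin: no loss
print(hinge_loss(0.3, 1))   # correct but inside the margin: small loss
print(hinge_loss(0.3, -1))  # wrong side: larger loss
```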


17. What is Ensemble learning?

Ensemble learning is a method that combines multiple machine learning models to create more powerful models.

There are many reasons for models to differ. A few of them are:

Different populations
Different hypotheses
Different modelling techniques

When working with the model's training and testing data, we will encounter error. This error can come from bias, variance, or irreducible error.

A model should therefore always strike a balance between bias and variance, which we call the bias-variance trade-off.

Ensemble learning is one way to manage this trade-off.

There are a variety of ensemble approaches available, but there are two general strategies for aggregating several models:

Bagging, a simple approach, generates new training sets from an existing one by resampling it.
Boosting, a more elegant strategy, is similar to bagging but optimises the weighting scheme for the training set, so that later models concentrate on the examples earlier models got wrong.
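
Bagging can be sketched as follows: draw several bootstrap samples (sampling with replacement) from one training set, fit a model on each, and let the models vote. The one-dimensional data, the nearest-neighbour base model, and the function names are assumptions chosen to keep the example short; any base learner could take its place.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement to form a new training set."""
    return [rng.choice(data) for _ in data]

def nearest_neighbour_model(sample):
    """Toy base model: predict the label of the closest training value."""
    def predict(x):
        return min(sample, key=lambda item: abs(item[0] - x))[1]
    return predict

def bagging(data, n_models=25, seed=0):
    """Fit one base model per bootstrap sample."""
    rng = random.Random(seed)
    return [nearest_neighbour_model(bootstrap_sample(data, rng)) for _ in range(n_models)]

def vote(models, x):
    """Aggregate the ensemble by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Hypothetical training set: (value, label).
data = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
        (7.0, "high"), (8.0, "high"), (9.0, "high")]
models = bagging(data)
print(vote(models, 2.5), vote(models, 8.5))
```

Individual models may be trained on unlucky samples, but the vote averages those quirks out, which is how bagging reduces variance.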


18. How do you know which Machine Learning algorithm to use?

It depends entirely on the data we have. If the target is discrete, we can use a classifier such as SVM. If it is continuous, we can use linear regression.

So there is no one-size-fits-all way of knowing which machine learning algorithm to use; it all depends on exploratory data analysis (EDA).

EDA is similar to "interviewing" a dataset. We do the following as part of our interview:

Sort our variables into categories like continuous, categorical, and so on.
Use descriptive statistics to summarise our variables.
Use charts to visualise our variables.
Choose one best-fit algorithm for a dataset based on the above observations.


19. How Should Outlier Values Be Handled?

An outlier is an observation that differs significantly from the rest of the dataset. Tools used to find outliers include:

Box plot
Z-score
Scatter plot

To deal with outliers, we usually use one of three simple strategies:

We can drop them.
We can mark them as outliers and add that flag to the feature set.
We can transform the feature to lessen the impact of the outlier.
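
The three strategies can be sketched with a Z-score detector. The threshold of 2.0, the sample readings, and the choice of clipping as the transformation are assumptions for this example; other cut-offs and transformations (such as taking logarithms) are equally common. The sketch also assumes the values are not all identical, since the standard deviation would then be zero.

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose Z-score magnitude exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)  # stdev is the sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]

def handle_outliers(values, threshold=2.0):
    """Apply the three strategies: drop, flag, and transform (here, clip)."""
    outliers = set(z_score_outliers(values, threshold))
    dropped = [v for v in values if v not in outliers]
    flags = [v in outliers for v in values]
    # Clip every value to the range spanned by the non-outliers.
    low, high = min(dropped), max(dropped)
    clipped = [min(max(v, low), high) for v in values]
    return dropped, flags, clipped

# Hypothetical readings with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
dropped, flags, clipped = handle_outliers(readings)
print(z_score_outliers(readings))
print(dropped)
print(flags)
print(clipped)
```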


20. What is a Random Forest, exactly? What is the mechanism behind it?

Random forest is a machine learning method that can be used for both regression and classification.

Like bagging and boosting, random forest works by combining a number of different tree models. Random forest builds each tree using a random sample of the columns of the training data.

The steps for creating trees in a random forest are as follows:

Take a sample from the training data.
Begin with a single node.
Starting from that node, run the following algorithm:
   a. Stop if the number of observations is less than the node size.
   b. Choose variables at random.
   c. Determine which variable does the "best" job of splitting the observations.
   d. Split the observations into two nodes.
   e. Run step 'a' on each of these nodes.
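
The tree-building steps can be sketched roughly as below. Several details are assumptions of this example rather than part of the description above: Gini impurity as the measure of the "best" split, an extra stop when a node is already pure, bootstrap sampling for each tree, majority voting across trees, and all function names and toy data.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, features):
    """Among the given features, find the split with the lowest weighted Gini."""
    best = None
    for f in features:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def build_tree(rows, node_size, n_features, rng):
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop if the node has fewer observations than the node size (or is pure).
    if len(rows) < node_size or len(set(labels)) == 1:
        return majority
    # Choose variables at random.
    features = rng.sample(range(len(rows[0][0])), n_features)
    # Determine which variable does the best job of splitting the observations.
    split = best_split(rows, features)
    if split is None:
        return majority
    _, f, threshold = split
    # Split the observations into two nodes and repeat on each of them.
    left = [(x, y) for x, y in rows if x[f] <= threshold]
    right = [(x, y) for x, y in rows if x[f] > threshold]
    return (f, threshold,
            build_tree(left, node_size, n_features, rng),
            build_tree(right, node_size, n_features, rng))

def predict_tree(tree, x):
    """Walk from the root to a leaf; leaves are plain labels."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if x[f] <= threshold else right
    return tree

def build_forest(rows, n_trees=15, node_size=2, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample of the training data
        forest.append(build_tree(sample, node_size, n_features, rng))
    return forest

def predict_forest(forest, x):
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]

# Hypothetical training rows: ((feature_0, feature_1), label).
rows = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.5), "a"),
        ((6.0, 7.0), "b"), ((7.0, 6.5), "b"), ((6.5, 8.0), "b")]
forest = build_forest(rows)
print(predict_forest(forest, (0.5, 0.5)), predict_forest(forest, (9.0, 9.0)))
```

Each tree sees a different sample and a different random choice of variables, so the trees disagree in small ways, and the vote smooths those differences out.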
