Top 15 Data Science Interview Questions & Answers

Data Science Interview Questions

1. What exactly does the phrase "Data Science" imply?
Data Science is an interdisciplinary field that combines scientific methods, algorithms, tools, and machine learning techniques to uncover patterns and extract useful insights from raw data through statistical and mathematical analysis.

2. What is the distinction between data science and data analytics?
Data science transforms data using a variety of technical analysis methods to derive useful insights that data analysts can then apply to their business scenarios.
Data analytics is concerned with verifying existing hypotheses and facts, and with answering questions that support a more efficient and effective business decision-making process.
Data science fosters innovation by answering questions that build connections and solve future problems. Data analytics extracts present meaning from historical context, whereas data science focuses on predictive modeling.
Data science is a broad field that uses a variety of mathematical and scientific methods and algorithms to solve complex problems, whereas data analytics is a subset of data science.

4. Describe the conditions of overfitting and underfitting.
Overfitting: The model performs well only on the training data it was fitted to. When it is given new data as input, it generalizes poorly and produces inaccurate predictions. This happens when the model has low bias and high variance. Decision trees are more prone to overfitting.
Underfitting: The model is so simple that it cannot capture the underlying relationship in the data, so it performs poorly on both the training data and the test data. This happens when the model has high bias and low variance. Linear regression is more prone to underfitting.
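
The contrast can be illustrated with a small sketch. It assumes NumPy is available and uses made-up data (a noisy quadratic), with polynomial degrees 1, 2, and 9 chosen purely for illustration; none of these choices come from the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a quadratic relationship plus Gaussian noise.
    x = rng.uniform(-1, 1, n)
    y = x**2 + rng.normal(0, 0.1, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

The straight line (degree 1) is too simple for the curve, so its error is high on both sets, which is the underfitting pattern. Training error can only go down as the degree grows, so a very low training error paired with a higher test error is the usual sign of overfitting.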

5. Distinguish between data in long and wide formats.
Long format: Each row holds a subject's information at one point in time, so each subject's data is spread over multiple rows. The data can be recognized by treating rows as groups. This format is most commonly used in R analyses and for writing to log files after each trial.
Wide format: A subject's repeated responses are placed in separate columns, so each subject occupies a single row. The data can be recognized by treating columns as groups. This format is rarely used in R analyses and is most commonly used in statistics packages for repeated-measures ANOVAs.
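
As a rough illustration, the two layouts can be converted into each other with pandas. The column names (subject, trial_1, trial_2, score) are hypothetical and chosen only for this sketch.

```python
import pandas as pd

# Wide format: one row per subject, one column per repeated measurement.
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "trial_1": [10, 20],
    "trial_2": [11, 21],
})

# Wide -> long: one row per (subject, trial) observation.
long = wide.melt(id_vars="subject", var_name="trial", value_name="score")

# Long -> wide again: spread the trials back into columns.
back_to_wide = (
    long.pivot(index="subject", columns="trial", values="score")
        .reset_index()
)

print(long)
print(back_to_wide)
```

Here the long table has four rows (two subjects times two trials), while the wide table has two rows with the trials side by side.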

6. What is the difference between Eigenvectors and Eigenvalues?
An eigenvector of a square matrix A is a nonzero vector v that A maps to a scalar multiple of itself: Av = λv. Eigenvectors are usually written as column vectors and are often normalized to unit length (magnitude 1). They are also called right eigenvectors. The eigenvalue λ is the scalar coefficient by which the corresponding eigenvector is stretched, shrunk, or flipped.
Eigendecomposition is the process of breaking a matrix down into its eigenvectors and eigenvalues. These are then used in machine learning techniques such as PCA (Principal Component Analysis) to extract useful information from a matrix.
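
A minimal sketch of eigendecomposition, assuming NumPy and a small symmetric example matrix picked for illustration (np.linalg.eigh is the NumPy routine for symmetric matrices):

```python
import numpy as np

# Hypothetical symmetric 2x2 matrix; its eigenvalues are 1 and 3.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and unit-length
# eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Check the defining property A v = lambda v for every column at once.
print(np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))

# Rebuild A from its decomposition: A = V diag(lambda) V^T.
reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
print(np.allclose(reconstructed, A))
```

PCA applies the same decomposition to a covariance matrix and keeps the eigenvectors with the largest eigenvalues.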

7. What does it imply to have high and low p-values?
A p-value is the probability of obtaining results at least as extreme as those actually observed, assuming the null hypothesis is true. A small p-value means that the observed difference would be unlikely to arise by chance alone if the null hypothesis held.

A low p-value, i.e. a value less than 0.05 (the conventional significance level), is evidence against the null hypothesis, so the null hypothesis may be rejected.
A high p-value, i.e. a value greater than 0.05, means the data are consistent with the null hypothesis, so we fail to reject it. It does not prove that the null hypothesis is true.
A p-value of exactly 0.05 is marginal, and the hypothesis can go either way.
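
One way to see what a p-value measures is a permutation test, sketched below with only the standard library. The function name, the sample data, and the number of permutations are assumptions made for this example, not part of the original answer.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative)."""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

p_different = permutation_p_value([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16])
p_same = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(p_different, p_same)
```

Two clearly separated groups give a p-value well below 0.05, while two identical groups give a p-value of 1, because every shuffled difference is at least as large as the observed difference of zero.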

8. When does resampling take place?
Resampling is a technique that repeatedly draws samples from the available data to estimate the accuracy of sample statistics and quantify the uncertainty of population parameters. It is done to check that a model is adequate by training it on different subsets of a dataset, so that variation in the data is accounted for. It is also done when models need to be validated on random subsets (as in bootstrapping or cross-validation) or when running tests with the labels shuffled across data points (permutation tests).
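
As a sketch of one resampling method, the bootstrap below estimates a percentile confidence interval for a mean using only the standard library. The function name, defaults, and sample values are hypothetical.

```python
import random
from statistics import fmean

def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

sample = [2.1, 2.5, 2.8, 3.0, 3.3, 3.9, 4.2, 4.4, 5.0, 5.6]
low, high = bootstrap_ci(sample)
print(f"mean={fmean(sample):.2f}, 95% CI approx. [{low:.2f}, {high:.2f}]")
```

The width of the interval is the quantified uncertainty: a wider interval means the sample pins down the population mean less precisely.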

9. What do you mean when you say "imbalanced data"?
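When data is distributed unequally across categories, it is said to be imbalanced; for example, one class may account for the vast majority of the observations. Such datasets degrade model performance and lead to inaccurate results, particularly for the under-represented classes.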
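
The sketch below, using invented labels with a 95/5 split, shows why accuracy alone is misleading on imbalanced data, along with one common mitigation (inverse-frequency class weights). The numbers are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical labels: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class.
predictions = [0] * 100

counts = Counter(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / counts[1]

# Inverse-frequency weights: rare classes count for more during training.
class_weights = {cls: len(labels) / (len(counts) * n) for cls, n in counts.items()}

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, weights={class_weights}")
```

The majority-class predictor scores 95% accuracy yet never finds a single positive case, which is why metrics such as recall or precision are usually reported alongside accuracy for imbalanced datasets.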

10. Do the expected value and the mean value differ in any way?
There are not many differences between the two, but they are used in different contexts. In general, the term mean is used when describing a probability distribution or a sample of data, whereas the term expected value is used when dealing with random variables. For a given distribution, both refer to the same quantity.
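
A small sketch with a fair six-sided die (an example chosen here, not taken from the original) shows the two ideas side by side: the expected value is computed from the distribution, and the sample mean is computed from simulated rolls.

```python
import random
from fractions import Fraction
from statistics import fmean

# Expected value of a fair die: sum of each face times its probability 1/6.
expected_value = sum(Fraction(face, 6) for face in range(1, 7))

# Sample mean of simulated rolls; it approaches the expected value
# as the number of rolls grows.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(10_000)]
sample_mean = fmean(rolls)

print(f"expected value = {expected_value}, sample mean = {sample_mean:.3f}")
```

The expected value is exactly 7/2, while the sample mean only lands close to it and changes with every new set of rolls.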

11. What does Survivorship Bias mean to you?
Survivorship bias is the logical error of focusing on the things that survived some selection process while overlooking those that did not, typically because the failures are less visible. This bias can lead to incorrect conclusions being drawn.

12. Define the terms key performance indicator (KPI), lift, model fitting, robustness, and DOE.
KPI: Key Performance Indicator, a metric that measures how well a business achieves its objectives.
Lift: A measure of the target model's performance compared with a random-choice model. It indicates how much better the model predicts than if there were no model at all.
Model fitting: A measure of how well the model under consideration fits the observed data.
Robustness: The system's ability to handle differences and variations in its inputs effectively.
DOE: Design of Experiments, the systematic planning of experiments to determine how changes in input variables affect the outcome.
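
Of these terms, lift is the one most often computed by hand. A minimal sketch follows; the function name, the top_fraction parameter, and the toy data are assumptions made for this example.

```python
def lift(y_true, scores, top_fraction=0.1):
    """Response rate among the top-scored cases divided by the overall rate."""
    n_top = max(1, int(len(y_true) * top_fraction))
    # Rank cases from highest to lowest model score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

# Hypothetical data: 2 responders out of 10, both scored highest by the model.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

top_lift = lift(y_true, scores, top_fraction=0.2)
print(top_lift)
```

A lift of 5 here means that targeting the top 20% of cases by score finds responders five times as often as picking cases at random; a lift of 1 would mean the model adds nothing.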

13. What are confounding variables?
Confounding variables are also known as confounders. They are a type of extraneous variable that influences both the independent and the dependent variable, producing spurious associations: mathematical correlations between variables that are associated but not causally related.
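
The effect can be reproduced with a small simulation, sketched here with NumPy. The variables x, y, and z and the data-generating process are invented for illustration: z drives both x and y, while x and y have no direct effect on each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.normal(size=n)          # the confounder
x = z + rng.normal(size=n)      # depends on z only
y = z + rng.normal(size=n)      # depends on z only, not on x

# x and y look correlated even though neither causes the other.
raw_corr = np.corrcoef(x, y)[0, 1]

def residuals(target, control):
    # Remove the linear effect of the control variable from the target.
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)

# After controlling for z, the association between x and y disappears.
partial_corr = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"raw correlation={raw_corr:.2f}, after controlling for z={partial_corr:.2f}")
```

The raw correlation is clearly positive, while the correlation that remains after removing the influence of z is close to zero, which is the signature of a confounded relationship.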

14. What if a dataset contains variables with more than 30% missing values? How would you deal with such a dataset?
We use one of the following methods, depending on the size of the dataset:

If the dataset is small, the missing values are replaced with the mean of the remaining data. In pandas this can be done with mean = df.mean(), where df is the pandas DataFrame that holds the dataset and mean() computes the mean of each column. The missing values can then be filled with the computed mean using df.fillna(mean).
If the dataset is large, the rows with missing values can be dropped and the remaining data used for prediction.
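
Both approaches can be sketched in pandas as follows. The DataFrame contents and column names (age, income) are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with missing entries in both columns.
df = pd.DataFrame({
    "age": [25.0, None, 35.0],
    "income": [50.0, 60.0, None],
})

# Share of missing values per column, useful for spotting the 30% threshold.
missing_share = df.isna().mean()

# Small dataset: fill each column's gaps with that column's mean.
column_means = df.mean()
df_filled = df.fillna(column_means)

# Large dataset: drop every row that contains a missing value.
df_dropped = df.dropna()

print(missing_share)
print(df_filled)
print(df_dropped)
```

Passing the Series of column means to fillna fills each column with its own mean, so the two columns are treated independently.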

15. What is Cross-Validation, and how does it work?
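Cross-validation is a statistical technique for estimating how well a model will perform on data it was not trained on. The dataset is split into several parts (folds); the model is trained on all folds but one and evaluated on the held-out fold, and this is repeated so that every fold serves once as the validation set. The scores from all rounds are then averaged.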
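
A minimal k-fold sketch using only the standard library is shown below. The helper k_fold_indices, the toy values, and the deliberately trivial "model" (predicting the training mean) are all assumptions for illustration; in practice the data is usually shuffled first, which this sketch omits.

```python
from statistics import fmean

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

values = [2.0, 4.0, 6.0, 8.0, 10.0]
fold_errors = []
for train_idx, test_idx in k_fold_indices(len(values), k=5):
    # "Train": the model is just the mean of the training values.
    prediction = fmean(values[i] for i in train_idx)
    # "Validate": squared error on the held-out fold.
    fold_errors.append(fmean((values[i] - prediction) ** 2 for i in test_idx))

cv_mse = fmean(fold_errors)
print(cv_mse)
```

Each value is held out exactly once, so the averaged error reflects performance on data the model did not see during fitting.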
