Revise statistics concepts with this ultimate collection of the 15 most asked interview questions and answers.
Statistics for Data Science
1. What is the significance of the Central Limit Theorem (CLT) in data science?
In practice, the Central Limit Theorem (CLT) is often used for hypothesis testing, constructing confidence intervals, and understanding the behavior of statistical estimators. The CLT is crucial in fields such as financial analysis, quality control, social science experiments, etc.
There are three basic conditions for the CLT. According to these conditions, the sampling distribution of the sample mean will approach a normal distribution when:
i. The sample size is large.
ii. The population distribution has a finite variance, and
iii. The sample observations are independent and identically distributed.
The CLT helps researchers and scientists understand the behavior of data, making it easier to make informed decisions and predictions.
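Here is a minimal simulation sketch of the CLT (the population, sample size, and seed are arbitrary assumptions for illustration): sample means drawn from a skewed exponential population still form an approximately normal, bell-shaped distribution.

```python
# A minimal CLT simulation: means of samples from a skewed population
# behave approximately normally.
import numpy as np

rng = np.random.default_rng(42)
sample_size = 50          # a reasonably "large" sample
n_samples = 10_000        # number of repeated samples

# Draw many samples from an Exponential(1) population and record each sample's mean
sample_means = rng.exponential(scale=1.0, size=(n_samples, sample_size)).mean(axis=1)

print("Mean of sample means:", sample_means.mean())   # close to the population mean 1.0
print("Std of sample means:", sample_means.std())     # close to 1 / sqrt(50) ~ 0.141
# A histogram of sample_means looks bell-shaped even though the population is skewed.
```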
2. What is the difference between descriptive statistics and inferential statistics?
Descriptive statistics summarize and describe a dataset. The results are mostly presented in numerical or graphical form. Some numerical measures of descriptive statistics are the mean, median, mode, variance, standard deviation, range, etc. Graphical representations include bar charts, pie charts, line charts, etc.
Inferential statistics analyses sample data to draw conclusions about the population. It uses the sample to generalize to the population and make predictions. Hypothesis testing, ANOVA, and p-values are some of the tools of inferential statistics.
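The short sketch below contrasts the two on a small made-up sample: the first part only describes the sample, while the second uses it to test a claim about the population (the values and the hypothesized mean of 50 are assumptions for illustration).

```python
# Descriptive vs. inferential statistics on hypothetical sample data.
import numpy as np
from scipy import stats

sample = np.array([52, 48, 55, 60, 47, 53, 51, 58, 49, 54])

# Descriptive statistics: summarize the sample itself
print("Mean:", sample.mean())
print("Median:", np.median(sample))
print("Std deviation:", sample.std(ddof=1))
print("Range:", sample.max() - sample.min())

# Inferential statistics: use the sample to draw a conclusion about the
# population, e.g. test whether the population mean differs from 50.
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("t-statistic:", t_stat, "p-value:", p_value)
```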
3. How is the p-value used in hypothesis testing?
The p-value in statistics is used to validate a hypothesis against observed data. A p-value gives the probability of obtaining results at least as extreme as the observed data, assuming that the null hypothesis is true. The lower the p-value, the stronger the evidence against the null hypothesis.
If the p-value is less than the chosen significance level (often 0.05), it indicates that the data provides enough evidence to reject the null hypothesis. If the p-value is greater than the significance level, it suggests that the data does not provide strong evidence against the null hypothesis.
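A small sketch of this decision rule, using simulated control and treatment groups and the common 0.05 significance level (both groups and their parameters are assumptions for illustration):

```python
# Using the p-value from a two-sample t-test to decide on the null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=40)   # e.g. control group
group_b = rng.normal(loc=105, scale=10, size=40)   # e.g. treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05                                       # chosen significance level

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```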
4. What is the difference between Type-1 Error and Type-2 Error?
A Type-1 error occurs when the null hypothesis is incorrectly rejected. It is also called a False Positive (FP).
A Type-2 error occurs when the null hypothesis is incorrectly accepted. It is also called a False Negative (FN).
In other words:
Type-1 error: Rejecting the null hypothesis when it is true.
Type-2 error: Accepting the null hypothesis when it is false.
Both need to be balanced properly. Especially in the healthcare sector, either a Type-1 or a Type-2 error could have a severe impact on people's health and lives.
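A minimal simulation sketch of the Type-1 error rate (group sizes and the number of repetitions are arbitrary assumptions): when the null hypothesis is actually true, tests run at alpha = 0.05 should incorrectly reject it in roughly 5% of repetitions.

```python
# Simulating the Type-1 (false positive) error rate under a true null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
false_positives = 0
n_experiments = 5_000

for _ in range(n_experiments):
    # Both groups come from the same distribution, so H0 is true
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        false_positives += 1   # Type-1 error: rejecting a true H0

print("Observed Type-1 error rate:", false_positives / n_experiments)  # ~0.05
```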
5. What is hypothesis testing?
Hypothesis testing is a statistical procedure for checking whether sample data provides enough evidence to support or reject an assumption about the population. In statistical analysis, there are two types of hypotheses.
Null Hypothesis (H0): This is the default assumption that there is no significant difference in the observed values.
Alternative Hypothesis (H1): This is the hypothesis for which we gather evidence to support our assumptions.
6. What is A/B testing?
A/B testing is done to determine which of two versions of a model (or design) performs better. This type of testing is commonly done in the marketing and advertising industry.
Suppose we have to advertise a product online and have created two different sample ads. With the help of different metrics, we can determine which ad has attracted and engaged more users.
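One way to sketch such a comparison is a two-proportion z-test on click-through counts; all counts below are hypothetical, and the test function is from statsmodels.

```python
# A/B test sketch: comparing conversion rates of two ad versions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

clicks = np.array([200, 250])         # conversions for version A and version B
impressions = np.array([4000, 4100])  # users shown each version

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print("Conversion A:", clicks[0] / impressions[0])
print("Conversion B:", clicks[1] / impressions[1])
print("z-statistic:", z_stat, "p-value:", p_value)
# A small p-value (e.g. below 0.05) suggests the two versions perform differently.
```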
7. What is overfitting? How can the overfitting problem be overcome in data science?
Overfitting is a modelling error. It occurs when a model shows great results on training data but returns poor results on test data.
In overfitting, the model fits the training data so closely that it fails to make correct predictions on a new dataset. Some prevention measures are cross-validation, regularization, and simplifying the model.
Overfitting often misleads and leads to wasted resources, time, and money if not detected early.
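A minimal sketch of detecting and reducing overfitting with scikit-learn (the synthetic dataset and the choice of a decision tree are assumptions for illustration): an unconstrained tree scores far better on training data than on held-out data, while a regularized tree evaluated with cross-validation gives a more honest estimate.

```python
# Detecting overfitting via the train/test gap, then regularizing and cross-validating.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=0)            # no depth limit
deep_tree.fit(X_train, y_train)
print("Train accuracy:", deep_tree.score(X_train, y_train))   # often ~1.0
print("Test accuracy:", deep_tree.score(X_test, y_test))      # noticeably lower

# Simplifying the model (limiting depth) and using cross-validation
simple_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("CV accuracy:", cross_val_score(simple_tree, X, y, cv=5).mean())
```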
8. What is the difference between correlation and causation?
Correlation indicates an association or relationship between two or more variables, whereas in causation, one variable directly affects the other.
Example: Ice Cream Sales and Crime Rates
Correlation: There is a strong positive correlation between ice cream sales and crime rates. When ice cream sales increase, crime rates also tend to increase.
Causation: This does not mean that ice cream sales directly cause crime rates to rise. Instead, both ice cream sales and crime rates might independently increase during hot weather because more people are out in public places, leading to both more ice cream sales (due to the heat) and potentially more opportunities for crime.
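A small simulation sketch of this example (all numbers are made up): temperature acts as a hidden common cause that drives both series, producing a strong correlation without any causal link between them.

```python
# Correlation without causation: a shared driver (temperature) links two variables.
import numpy as np

rng = np.random.default_rng(7)
temperature = rng.normal(loc=25, scale=5, size=365)               # daily temperature
ice_cream_sales = 30 * temperature + rng.normal(0, 50, size=365)  # driven by heat
incidents = 2 * temperature + rng.normal(0, 10, size=365)         # also driven by heat

corr = np.corrcoef(ice_cream_sales, incidents)[0, 1]
print("Correlation between sales and incidents:", round(corr, 2))  # strongly positive
# The two series correlate only because both depend on temperature,
# not because ice cream sales cause incidents.
```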
9. What is confidence interval? Explain…
A confidence interval (CI) is a range of values derived from a sample dataset within which the unknown population parameter is expected to fall with a certain level of confidence (for example, 95%). Confidence intervals are used to generalize from sample data to the population. Certain parameters and statistical calculations are involved in finding the CI.
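A hedged sketch of computing a 95% confidence interval for a population mean from a simulated sample, using the t-distribution (the sample values are assumptions for illustration):

```python
# 95% confidence interval for a population mean from a simulated sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=170, scale=8, size=60)   # e.g. heights of 60 people

mean = sample.mean()
sem = stats.sem(sample)                          # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```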
10. What is the law of large numbers?
The law of large numbers states that a larger sample provides a more accurate estimate of the population parameters.
In other words, the larger the sample size, the closer the average outcome of random events will be to the expected value.
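A minimal simulation sketch: the running mean of fair coin flips gets closer to the expected value 0.5 as the sample size grows (the flip counts chosen are arbitrary).

```python
# Law of large numbers: the sample mean of coin flips converges to 0.5.
import numpy as np

rng = np.random.default_rng(5)
flips = rng.integers(0, 2, size=100_000)   # 0 = tails, 1 = heads

for n in (10, 100, 1_000, 100_000):
    print(f"Mean of first {n:>6} flips:", flips[:n].mean())
```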
11. What is the chi-square test, and when is it used?
The chi-square test is used to check whether two categorical variables are significantly associated.
There are two types of Chi-square test:
1. Chi-square test of independence: This test determines whether two categorical variables are independent of each other (see the sketch after this list).
2. Chi-square goodness-of-fit test: It is used to determine whether a sample matches an expected distribution.
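A hedged sketch of the test of independence on a hypothetical 2x2 contingency table (the groups, categories, and counts are all made up for illustration):

```python
# Chi-square test of independence on a hypothetical contingency table.
import numpy as np
from scipy.stats import chi2_contingency

#                    Prefers A   Prefers B
observed = np.array([[30,          20],    # group 1
                     [15,          35]])   # group 2

chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
# A small p-value suggests the two categorical variables are associated.
```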
12. What is Bayes’ theorem, and how is it applied in statistics?
Bayes’ theorem describes conditional probability. It is used to determine how the occurrence of one event affects the probability of another event.
Bayes’ theorem has wide applications in data science, medicine, and finance.
It is a mathematical framework for updating the probability of a hypothesis based on new data: P(A|B) = P(B|A) × P(A) / P(B).
Bayes’ theorem is fundamental in decision making, especially under uncertainty.
Bayes’ theorem is used for different purposes such as:
- Hypothesis Testing
- Machine Learning, especially the Naive Bayes algorithm
- Fraud Detection,
- A/B Testing, etc.
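A minimal numerical sketch of Bayes' theorem applied to a hypothetical disease-screening example (the prevalence, sensitivity, and false positive rate are all made-up values):

```python
# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease = 0.01            # prior: 1% of people have the disease
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test result
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print("P(disease | positive test) =", round(p_disease_given_pos, 3))  # ~0.161
```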
13. Define probability density function (PDF) and cumulative distribution function (CDF).
The probability density function (PDF) gives the probability density of a continuous random variable at each possible value.
The cumulative distribution function (CDF) gives the cumulative probability up to a specified point, i.e., P(X ≤ x).
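A small sketch contrasting the two for a standard normal distribution using scipy:

```python
# PDF vs. CDF of the standard normal distribution at x = 1.
from scipy.stats import norm

x = 1.0
print("PDF at x = 1:", norm.pdf(x))   # density at that point (~0.242)
print("CDF at x = 1:", norm.cdf(x))   # P(X <= 1) (~0.841)
```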
14. Explain Likelihood in data science.
In machine learning, likelihood specifies how well a model explains the observed data.
Likelihood focuses on the parameter values that best explain the observed data.
Likelihood is a function of the parameters of the model, given the observed data.
Likelihood plays a significant role in regression analysis, hypothesis testing, Bayesian inference, etc.
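A hedged sketch of the idea: comparing the log-likelihood of simulated observations under two candidate parameter values of a normal model (the data, the known standard deviation, and the candidate means are assumptions for illustration).

```python
# Log-likelihood of observed data under different candidate parameter values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # observed data

def log_likelihood(mu, sigma=2.0):
    # Sum of log densities of the data under a Normal(mu, sigma) model
    return norm.logpdf(data, loc=mu, scale=sigma).sum()

print("Log-likelihood at mu = 5:", log_likelihood(5.0))   # higher (better fit)
print("Log-likelihood at mu = 8:", log_likelihood(8.0))   # lower (worse fit)
```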
15. How do you determine if a dataset is normally distributed?
There are several ways to identify whether a dataset is normally distributed. Some of them are given below.
Check the mean, median, and mode of the dataset. Normally distributed data have approximately equal mean, median, and mode.
Check the empirical rule. If a dataset is normally distributed, then about 68% of the data falls within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
Visual Inspection
a. Histogram: Plot a histogram and check whether the data distribution is bell-shaped. A bell-shaped distribution is a property of the normal distribution.
b. Boxplot: A normal distribution typically has symmetric whiskers and no outliers.
There are different statistical tests to check for a normal data distribution, such as:
Shapiro-Wilk test
Kolmogorov-Smirnov test, and many others.
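A minimal sketch of a normality check with the Shapiro-Wilk test on simulated data (the two samples are assumptions for illustration); a large p-value means there is no evidence against normality, while a very small one suggests the data is not normal.

```python
# Shapiro-Wilk normality test on a normal sample and a skewed sample.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(11)
normal_data = rng.normal(loc=0, scale=1, size=200)
skewed_data = rng.exponential(scale=1, size=200)

for name, data in [("normal", normal_data), ("skewed", skewed_data)]:
    stat, p_value = shapiro(data)
    print(f"{name}: W = {stat:.3f}, p-value = {p_value:.4f}")
# The normal sample should give a large p-value; the skewed one a very small p-value.
```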