Linear Regression Interview Questions – Part 2
In the previous post, you saw some common interview questions on linear regression. Those questions mostly dealt with the essence of linear regression and its general concepts. This post covers common interview questions on the concepts learnt in multiple linear regression.
Q1. What is Multicollinearity? How does it affect the linear regression? How can you deal with it?
Multicollinearity occurs when some of the independent variables are highly correlated (positively or negatively) with each other. This is a problem because linear regression assumes that the independent variables are not strongly linearly related to one another. Multicollinearity does not, by itself, hurt the predictive capability of the model, so if you only want predictions, it is largely harmless. However, if you want to draw insights from the model and apply them in, let's say, some business model, it may cause problems.
One of the major problems caused by multicollinearity is that it leads to incorrect interpretations and provides wrong insights. A coefficient in linear regression represents the mean change in the target value when that feature is changed by one unit while all other features are held constant. If multicollinearity exists, this interpretation breaks down: changing one feature also changes the correlated features, so the effect of a single feature on the target cannot be isolated. This leads to wrong insights and can produce hazardous results for a business.
A widely used way of detecting multicollinearity is the VIF (Variance Inflation Factor). The higher the VIF of a feature, the more strongly that feature is linearly related to the other features. A common remedy is to remove the feature with the highest VIF, re-train the model on the remaining features, and check the VIF values again.
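As a rough illustration of the procedure above, the sketch below computes the VIF of each feature as 1 / (1 - R²), where R² comes from regressing that feature on all the other features. It uses only the Python standard library. The feature names `x1`, `x2`, `x3`, the toy data, and the helper functions are all hypothetical, chosen so that `x2` is almost exactly twice `x1`.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(target, predictors):
    """R-squared of an ordinary least squares fit of target on predictors."""
    n = len(target)
    rows = [[1.0] + [p[i] for p in predictors] for i in range(n)]  # intercept column
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, target)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


def vif(features):
    """Map each feature name to its VIF, i.e. 1 / (1 - R^2)."""
    result = {}
    for name, column in features.items():
        others = [col for other, col in features.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(column, others))
    return result


# Hypothetical toy data: x2 is roughly 2 * x1, x3 is unrelated to both.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
    "x3": [5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 6.0, 4.0],
}
vifs = vif(features)
for name, value in vifs.items():
    print(name, round(value, 2))
```

Here `x1` and `x2` should come out with very large VIF values and `x3` with a value close to 1. A common rule of thumb treats a VIF above roughly 5 or 10 as a warning sign, although the exact cut-off is a judgment call.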
Q2. How can you handle categorical variables present in the dataset?
Your dataset will often contain categorical variables that are potentially good predictors of the response variable, so handling them correctly is crucial.
One way to handle categorical data with just two levels is a binary mapping, in which one of the levels corresponds to 0 and the other to 1.
Another way of handling categorical variables with few levels is dummy encoding. The key idea behind dummy encoding is that a variable with, say, 'N' levels needs only 'N-1' new indicator variables. To see why, start with one indicator per level. For a variable, say 'Relationship', with three levels, namely 'Single', 'In a relationship', and 'Married', you would first create a table like the following:
| Relationship Status | Single | In a relationship | Married |
|---|---|---|---|
| Single | 1 | 0 | 0 |
| In a relationship | 0 | 1 | 0 |
| Married | 0 | 0 | 1 |
But you can clearly see that there is no need for three separate indicator columns. If you drop one of them, say 'Single', you can still represent all three levels.
Let’s drop the dummy variable ‘Single’ from the columns and see what the table looks like:
| Relationship Status | In a relationship | Married |
|---|---|---|
| Single | 0 | 0 |
| In a relationship | 1 | 0 |
| Married | 0 | 1 |
If both dummy variables, 'In a relationship' and 'Married', are equal to 0, the person is single. If 'In a relationship' is 1 and 'Married' is 0, the person is in a relationship. Finally, if 'In a relationship' is 0 and 'Married' is 1, the person is married.
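The mapping described above can be sketched in a few lines of plain Python. The function name `dummy_encode` is hypothetical, and treating the first level in the list as the dropped baseline is an assumption made for this example.

```python
def dummy_encode(values, levels):
    """Encode each value with N-1 indicators; the first level is the dropped baseline."""
    _baseline, *kept_levels = levels
    encoded = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unknown level: {value!r}")
        # Each kept level gets a 1 only when it matches the value.
        encoded.append({level: int(value == level) for level in kept_levels})
    return encoded


levels = ["Single", "In a relationship", "Married"]
rows = dummy_encode(["Single", "In a relationship", "Married"], levels)
for row in rows:
    print(row)
```

A row with all zeros stands for the baseline level, 'Single' in this case.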
Creating dummy variables is useful when the number of levels in a categorical variable is small, but if a categorical variable has a hundred levels, creating 99 new variables is impractical. In such cases, grouping the levels might be useful. For example, for the variable "Cities in India", you can group the cities in one of the following ways:
- Keep the ‘n’ largest cities, group the rest
- Use a geographical hierarchy: City < District < State < Zone
- Group cities with similar values for the outcome variable
- Cluster cities with similar values for the predictor variables
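The first option in the list above, keeping the 'n' largest cities and grouping the rest, might look like the sketch below. It assumes "largest" means "most frequent in the dataset", which is only one possible reading; the function name `group_rare_levels`, the label "Other", and the sample data are all hypothetical.

```python
from collections import Counter


def group_rare_levels(values, n_keep, other_label="Other"):
    """Keep the n_keep most frequent levels and replace the rest with other_label."""
    counts = Counter(values)
    kept = {level for level, _count in counts.most_common(n_keep)}
    return [value if value in kept else other_label for value in values]


# Hypothetical sample: three frequent cities and two rare ones.
cities = ["Mumbai"] * 5 + ["Delhi"] * 4 + ["Bengaluru"] * 3 + ["Mysuru", "Nashik"]
grouped = group_rare_levels(cities, n_keep=3)
print(Counter(grouped))
```

After grouping, the variable has four levels instead of five, so dummy encoding would need three indicator columns rather than four.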
Another way to deal with categorical variables is one-hot encoding, which hasn't been covered in our syllabus.
Q3. What is the major difference between R-squared and Adjusted R-squared/Why is it advised to use Adjusted R-squared in case of multiple linear regression?
The major difference between R-squared and Adjusted R-squared is that R-squared doesn't penalise the model for having more variables. If you keep adding variables to the model, the R-squared will always increase (or remain the same when the new variable explains none of the remaining variation in the dependent variable). In effect, R-squared treats every added variable as if it improved the model's predictive power.
Adjusted R-squared, on the other hand, penalises models based on the number of variables present in them. Its formula is given as:
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - k - 1}$$
where 'N' is the number of data points and 'k' is the number of features.
So if you add a variable and the Adjusted R-squared drops, it is a strong sign that the variable adds little to the model and is a candidate for removal. In multiple linear regression, you should therefore look at the Adjusted R-squared value in order to keep redundant variables out of your regression model.
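To make the penalty concrete, here is a small sketch that applies the usual formula, Adjusted R² = 1 - (1 - R²)(N - 1) / (N - k - 1), to two made-up models. The R-squared values, the sample size, and the function name `adjusted_r_squared` are hypothetical.

```python
def adjusted_r_squared(r_squared, n_points, n_features):
    """Adjusted R-squared for a model with n_features fitted on n_points."""
    if n_points - n_features - 1 <= 0:
        raise ValueError("need more data points than features plus one")
    return 1.0 - (1.0 - r_squared) * (n_points - 1) / (n_points - n_features - 1)


# Hypothetical models fitted on the same 50 data points.
before = adjusted_r_squared(0.800, n_points=50, n_features=5)
# A sixth feature nudges R-squared up only slightly.
after = adjusted_r_squared(0.801, n_points=50, n_features=6)
print(round(before, 4), round(after, 4))
```

R-squared rose from 0.800 to 0.801, yet the adjusted value falls, which suggests the sixth feature is not pulling its weight.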
Q4. Explain gradient descent with respect to linear regression.
Gradient descent is an optimisation algorithm. In linear regression, it is used to minimise the cost function and find the values of the coefficients (the βs, written as Θ below) at which the cost function is lowest.
Gradient descent works like a ball rolling down a graph (ignoring inertia). The ball moves in the direction of steepest descent and comes to rest on the flat surface at the bottom (the minimum).
Mathematically, the aim of gradient descent for linear regression is to find the solution of
ArgMin J(Θ0,Θ1), where J(Θ0,Θ1) is the cost function of the linear regression. It is given by
$$J(\Theta_0, \Theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$$
Here, h is the linear hypothesis model, h(x) = Θ0 + Θ1x, y is the true output, and m is the number of data points in the training set.
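Gradient descent starts with a random solution and then, based on the direction of the gradient, updates the solution to a new value where the cost function is lower.
The update is:
$$\Theta_j := \Theta_j - \alpha \frac{\partial}{\partial \Theta_j} J(\Theta_0, \Theta_1) \quad \text{for } j = 0, 1$$
where α is the learning rate (step size), and Θ0 and Θ1 are updated simultaneously.
Repeat until convergence.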
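A minimal sketch of this loop for simple linear regression is shown below. It assumes the squared-error cost J = (1 / 2m) Σ (h(x) - y)², so the gradients are the mean error and the mean of error times x. The function name, the learning rate of 0.1, the fixed iteration count, and the starting point of zero (used instead of a random start so the run is reproducible) are all choices made for this example.

```python
def gradient_descent(xs, ys, learning_rate=0.1, n_iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # fixed start for reproducibility
    for _ in range(n_iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Simultaneous update of both parameters.
        theta0, theta1 = (
            theta0 - learning_rate * grad0,
            theta1 - learning_rate * grad1,
        )
    return theta0, theta1


# Noise-free toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # approaches intercept 1 and slope 2
```

If the learning rate is too large, the updates overshoot and diverge; if it is too small, convergence takes many more iterations. A practical implementation would usually stop when the change in cost falls below a tolerance rather than after a fixed number of steps.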
To read more about gradient descent, refer to the additional resources on linear regression.
Q5. What is VIF? How do you calculate it?
Variance Inflation Factor (VIF) is used to check the presence of multicollinearity in a dataset. It is calculated as
$$VIF_i = \frac{1}{1 - R_i^2}$$
Here, $VIF_i$ is the value of VIF for the i-th variable, and $R_i^2$ is the R-squared value of the model in which that variable is regressed against all the other independent variables.
If the value of VIF is high for a variable, it implies that the R-squared value of the corresponding model is high, i.e. the other independent variables are able to explain that variable. In simple terms, the variable is linearly dependent on some other variables.
Q6. Explain the bias-variance trade-off.
Bias refers to the systematic difference between the values predicted by the model and the real values, typically because the model is too simple for the data. It is an error. One of the goals of an ML algorithm is to have low bias.
Variance refers to the sensitivity of the model to small fluctuations in the training dataset. Another goal of an ML algorithm is to have low variance.
For a dataset that is not exactly linear, it is not possible to have both bias and variance low at the same time. A straight line model will have low variance but high bias, whereas a high-degree polynomial will have low bias but high variance.
There is no escaping the relationship between bias and variance in machine learning.
- Decreasing the bias increases the variance.
- Decreasing the variance increases the bias.
So, there is a trade-off between the two; the ML specialist has to decide, based on the assigned problem, how much bias and variance can be tolerated. Based on this, the final model is built.
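The contrast between a straight line and a flexible polynomial can be checked with a small simulation. The sketch below repeatedly draws noisy training sets from a sine curve, fits both a straight line and a degree-4 polynomial that passes through all five training points, and compares the bias and variance of their predictions at one test point. The true function, noise level, test point, trial count, and all function names are assumptions made for this illustration.

```python
import math
import random
import statistics


def true_function(x):
    return math.sin(2 * math.pi * x)


def fit_line(xs, ys):
    """Least squares straight line; returns a prediction function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += weight * yi
        return total
    return predict


def simulate(fit, n_trials=2000, noise_sd=0.2, x_test=0.125, seed=0):
    """Return (bias, variance) of the fitted model's prediction at x_test."""
    rng = random.Random(seed)
    xs = [0.0, 0.25, 0.5, 0.75, 1.0]
    predictions = []
    for _ in range(n_trials):
        ys = [true_function(x) + rng.gauss(0.0, noise_sd) for x in xs]
        predictions.append(fit(xs, ys)(x_test))
    bias = statistics.mean(predictions) - true_function(x_test)
    variance = statistics.pvariance(predictions)
    return bias, variance


line_bias, line_variance = simulate(fit_line)
poly_bias, poly_variance = simulate(fit_interpolating_polynomial)
print("line:       bias", round(line_bias, 3), "variance", round(line_variance, 4))
print("polynomial: bias", round(poly_bias, 3), "variance", round(poly_variance, 4))
```

In this setup the straight line shows the larger bias and the smaller variance, while the interpolating polynomial shows the opposite pattern, which matches the trade-off described above.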