Violations of the regression model
20 important questions on Violations of the regression model
- Can we detect an indication of outliers with the information provided in Table 2?
1. Have a look at Table 6. Can you explain the difference between the columns of the results table?
2. Looking at the results provided in Panel A, interpret the coefficient of the variable TSTATUS in the model specification with dependent variable SCAR(-5,+5).
3. Can you select the best model specification with the information provided in Table 6?
2. The coefficient is significant at the 99% confidence level. Also, if TSTATUS (a dummy variable) increases by 1, SCAR(-5,+5) increases by 0.0135.
3. No, you cannot, because no R² is reported.
- Check for the presence of outliers in this analysis.
How do the authors check for potential multicollinearity in their model? Is there any other statistic that could have been used?
1. Is there any indication of multicollinearity in this analysis?
2. Which type of correction has been applied to the models to correct for potential heteroscedasticity?
2. Robust standard errors
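As an illustration (not the authors' actual procedure), multicollinearity is commonly checked with variance inflation factors (VIFs), and heteroskedasticity is corrected with robust (White/HC-type) standard errors. A minimal Python sketch on simulated data, with illustrative variable names (educ, exper, wage):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
educ = rng.normal(12, 2, n)    # hypothetical years of education
exper = rng.normal(10, 4, n)   # hypothetical years of experience
wage = 5 + 0.8 * educ + 0.3 * exper + rng.normal(0, 2, n)

X = sm.add_constant(pd.DataFrame({"educ": educ, "exper": exper}))

# Multicollinearity check: VIF per regressor (a VIF above ~10 is a common red flag)
for i in range(1, X.shape[1]):
    print(X.columns[i], variance_inflation_factor(X.values, i))

# Heteroskedasticity correction: robust (HC1 / White-type) standard errors
print(sm.OLS(wage, X).fit(cov_type="HC1").summary())
```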
1. Take a look at the variables. There are some dummies. You will see the variable “female” and also the variables “Hispanic” and “Black”. Why are there two dummies for ethnicity (“Hispanic” and “Black”) but only one for gender (“female”, and not also a “male” variable)?
2. How many females are in the sample?
3. How many individuals have a degree?
4. Take a look at the variable exper and try to determine whether you can find an indication of extreme values (outliers) in this variable.
2. Around 52% of the sample.
3. 0.3065208 + 0.0440633 ≈ 35% (summing the shares of the degree categories, BA and above).
4. Yes, there are extreme values.
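The dummy question in item 1 points at the dummy-variable trap: a categorical variable with k categories needs only k − 1 dummies next to the constant, so gender (two categories) gets one dummy, while ethnicity (more than two categories) gets several. A minimal sketch with hypothetical data:

```python
# The k-1 dummy rule (dummy-variable trap); data are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "ethnicity": ["Hispanic", "Black", "White", "White"],
})

# drop_first=True keeps k-1 dummies per categorical variable:
# 1 dummy for gender (2 categories), 2 dummies for ethnicity (3 categories).
# Including all k dummies plus a constant would be perfectly collinear.
dummies = pd.get_dummies(df, drop_first=True)
print(dummies)
```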
- Describe the distribution of the variable experience in terms of its symmetry, with the information provided in Table 3.
- Describe the distribution of the variable experience in terms of how peaked it is compared to the normal distribution, with the information provided in Table 3.
- What is the range of the variable experience?
2. Kurtosis = 3.37197; since this exceeds 3, the distribution is leptokurtic (Table 3).
3. Range = maximum − minimum = 166 − 3 = 163.
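The Table 3 statistics used above (skewness for symmetry, kurtosis for peakedness, range) can be computed as in the sketch below; the data here are simulated, not the actual sample:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
exper = pd.Series(rng.gamma(2.0, 20.0, size=500))  # right-skewed, for illustration

print("skewness:", exper.skew())          # > 0 -> right-skewed (asymmetric)
print("kurtosis:", exper.kurtosis() + 3)  # pandas reports excess kurtosis; +3 gives raw kurtosis (>3 = leptokurtic)
print("range:   ", exper.max() - exper.min())
```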
Keeping in mind that category 1 is female, can you conclude by looking at Figure 1 that the distribution of the variable experience differs between genders?
Select the best model in Table 4. Explain the statistics that you have chosen to select between these models. Justify your answer.
Model 3, as its (adjusted) R² is the highest.
***EXTRA TABLE 4***
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are also goodness-of-fit measures: the lower, the better. The same holds for the root mean squared error (RMSE): the lower it is, the better.
***EXTRA TABLE 4***
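A hedged sketch of such a model comparison on simulated data (illustrative variable names), putting adjusted R², AIC, BIC and RMSE side by side:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
educ = rng.normal(12, 2, n)
exper = rng.normal(10, 4, n)
wage = 5 + 0.8 * educ + 0.3 * exper + rng.normal(0, 2, n)

# Compare a smaller and a larger specification on the same data
for name, cols in [("Model 1", [educ]), ("Model 2", [educ, exper])]:
    X = sm.add_constant(np.column_stack(cols))
    fit = sm.OLS(wage, X).fit()
    rmse = np.sqrt(fit.ssr / fit.nobs)
    print(f"{name}: adj R2={fit.rsquared_adj:.3f}  AIC={fit.aic:.1f}  "
          f"BIC={fit.bic:.1f}  RMSE={rmse:.3f}")
```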
The survey was held in a specific year; all variables have been measured at the same point in time.
- a) Given that the data are collected at the same point in time, what might be a principal problem?
- b) 'Wage' is taken as the dependent variable. Is that reasonable or not?
a) With all variables measured at the same point in time, the direction of causality cannot be established (reverse causality is possible).
b) Yes, it is a logical relationship.
What does multivariate regression tell you?
- to what extent a set of variables is able to explain the outcome variable (e.g., R2)
- which variable(s) in the set are the best predictors for the outcome (significance and size of β’s)
- whether a variable still helps predict the outcome if other variables are also used as predictors (significance of the corresponding β); see the sketch below
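A minimal sketch of these three points on simulated data (the names x1, x2, y are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)   # x2 overlaps with x1
y = 1 + 2 * x1 + rng.normal(size=n)             # only x1 truly matters

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print("R^2:", fit.rsquared)        # explanatory power of the set of variables
print("betas:", fit.params)        # size of each effect
print("p-values:", fit.pvalues)    # does each variable still help, given the others?
```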
What are marginal effects?
Holding the other X variables constant, to see how one variable, independently of the others, influences Y (see the sketch below)
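For a linear model the marginal effect of a variable is simply its coefficient; once a quadratic term enters, it depends on the value of x. A tiny sketch with made-up coefficients:

```python
# Marginal effect in a quadratic model y = b0 + b1*x + b2*x^2: dy/dx = b1 + 2*b2*x
b0, b1, b2 = 1.0, 0.5, -0.05   # illustrative coefficients

def marginal_effect(x):
    """Derivative of y with respect to x, holding everything else constant."""
    return b1 + 2 * b2 * x

for x in (0, 5, 10):
    print(f"x={x}: dy/dx = {marginal_effect(x):.2f}")
```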
What is the difference between bivariate regression and multivariate regression?
Bivariate regression has a single independent variable; multivariate (multiple) regression has several, so each β measures the effect of one variable while the others are held constant.
What happens if we do OLS under multicollinearity?
› Variances / standard errors could be inflated
- t-ratio (= b/se(b)) deflated
- could imply that the parameter appears not significant
- could lead to wrongly failing to reject H0
› Size of individual coefficients (b’s) could be inflated
- t-ratio (= b/se(b)) inflated
- could imply that the parameter appears significant
- could lead to wrongly rejecting H0
› The signs of the coefficients could change
We cannot trust the results!
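A small simulation (illustrative only) showing how collinearity inflates the standard errors of the coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)

# Same true model, once with an independent x2 and once with a near-duplicate of x1
for label, x2 in [("independent x2", rng.normal(size=n)),
                  ("collinear x2", x1 + rng.normal(scale=0.05, size=n))]:
    y = 1 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(f"{label}: se(b1)={fit.bse[1]:.3f}  se(b2)={fit.bse[2]:.3f}")
```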
What is 'mean centering' against multicollinearity, and does it work?
Centering: subtracting a constant from every value of a variable
redefine the 0 point for that predictor to whatever value you subtract
shifts the scale over, but retains the units
› Mean centering: subtracting the mean from every value
› Common ‘solution’ for multicollinearity, but for a linear or multiplicative model this is just an algebraic transformation
different coefficients and standard errors
but not a better model!!
note: interpretation of marginal effects changes
› For a polynomial model (e.g. quadratic term) this may help
interpretation of marginal effects changes
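A quick demonstration on simulated data that mean centering a quadratic model changes the coefficients (and their standard errors) but not the fit itself, as the identical R² shows:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(10, 2, 200)
y = 1 + 0.5 * x + 0.2 * x**2 + rng.normal(size=200)

# Same quadratic model, estimated on raw and on mean-centered x
for label, xv in [("raw", x), ("centered", x - x.mean())]:
    X = sm.add_constant(np.column_stack([xv, xv**2]))
    fit = sm.OLS(y, X).fit()
    print(f"{label}: b={np.round(fit.params, 3)}, R2={fit.rsquared:.6f}")
```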
What is the problem with heteroskedasticity?
› Uneven distribution of errors in the scatterplot
› A few more large errors of the same sign in the area with large errors would tilt the regression line substantially
Causes:
1. different size of observations may result in different size of error terms
e.g. distance travelled by a rocket from take-off (measurement error)
2. groups of observations are different
- follow different processes
- with different error terms
- e.g. poorer people always buy the same food; wealthier people occasionally buy expensive food
What happens if we do OLS with heteroskedasticity?
› OLS no longer produces the minimum-variance (most efficient) estimates
› OLS underestimates the variances / standard errors of the estimated coefficients
too high t-values
may lead to the erroneous conclusion of significance, i.e. wrongly rejecting H0 (the null hypothesis that the variable has no effect)
How can we check whether we have this problem?
› Several tests (e.g. Breusch-Pagan, White)
› Graphical: the eyeball test (scatterplot of residuals), normal probability plot
What is the scatterplot of residuals?
› You want to see most of the scores concentrated in the center (around 0); no systematic patterns
› For each independent variable!
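A sketch of the eyeball test plus one of the formal tests (Breusch-Pagan) on simulated data in which the error spread grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)   # error spread grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Eyeball test: residuals should form an even band around 0 with no pattern
plt.scatter(x, fit.resid)
plt.axhline(0)
plt.xlabel("x"); plt.ylabel("residuals")
plt.show()

# Breusch-Pagan: a small p-value rejects homoskedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)
```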
How can we solve the problem of heteroskedasticity?
1. Weighted least squares
- more precise observations (with less variability) are given greater weight in determining the regression coefficients
2. Refine the variable
- transform into a form that does not suffer from heteroskedasticity
- e.g. rather than national income, use per capita income
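A minimal weighted-least-squares sketch (simulated data; here the true error variance is known, which it rarely is in practice):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)
sigma = 0.3 * x                           # error sd grows with x
y = 2 + 0.5 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)
# weight = 1 / variance: precise (low-variance) observations count more
wls = sm.WLS(y, X, weights=1.0 / sigma**2).fit()
ols = sm.OLS(y, X).fit()
print("OLS:", np.round(ols.params, 3), " WLS:", np.round(wls.params, 3))
```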
What is reverse causality?
› We usually assume that changes in the dependent variable are caused by changes in the independent variable(s)
› But: we only find a statistical relationship
says nothing about causality
says nothing about the direction of causality
› In some analyses, it could be that Y (also) causes X... : reverse causality
endogeneity, week 1
› We can test whether changes in X precede changes in Y (‘Granger causality’)
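A hedged sketch of such a test with statsmodels' grangercausalitytests, on simulated series where x leads y by one period:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * x[t - 1] + rng.normal()   # changes in x precede changes in y

# Column order matters: this tests whether the 2nd column Granger-causes the 1st
data = pd.DataFrame({"y": y, "x": x})
res = grangercausalitytests(data[["y", "x"]], maxlag=2)
```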