
Logistic regression

21 important questions on Logistic regression

What is the difference between OLS and logistic regression?

With OLS the dependent variable is continuous; with logistic regression it is binary.

Which methods let you estimate the model for logistic regression?

› Maximum likelihood estimation (MLE)
- statistical method for estimating the coefficients of a model
- selects the coefficients that make the observed values most likely to have occurred

› Likelihood function (L)
- L measures the probability of observing the particular set of dependent variable values that occur in the sample
- the higher the L, the higher the probability of observing the values in the sample
- MLE involves finding the coefficients β that maximize the (log) likelihood (note that often LL < 0)

› Log-likelihood statistic (LL < 0)
- in OLS, you minimize the sum of squared residuals; here you maximize the likelihood
- indicator of how much information the model explains after it has been fitted
- small (more negative) values indicate poorly fitting statistical models
- deviance statistic: -2LL, which has a chi-squared distribution

› Note
- we cannot interpret values of L (or LL) directly
- higher is better, but the critical value of the chi-squared distribution depends on the number of degrees of freedom
- so: look at the significance of the test
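The idea of maximizing the log-likelihood can be sketched in a few lines of Python. This is only an illustration under assumptions: the data are made up, there is a single regressor, and plain gradient ascent stands in for the iterative procedure that statistical software actually uses.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """LL = sum of y*log(p) + (1-y)*log(1-p); always below zero here."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

def fit_by_gradient_ascent(xs, ys, rate=0.1, steps=5000):
    """Move (b0, b1) uphill on the log-likelihood surface."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

# Made-up sample: the two outcome groups overlap, so the MLE exists.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_by_gradient_ascent(xs, ys)
# With four 1s and four 0s, b0 = b1 = 0 is the intercept-only (null) fit.
ll_null = log_likelihood(0.0, 0.0, xs, ys)
ll_fit = log_likelihood(b0, b1, xs, ys)
deviance = -2 * ll_fit  # the -2LL statistic

print(f"LL null = {ll_null:.3f}, LL fitted = {ll_fit:.3f}, -2LL = {deviance:.3f}")
```

The fitted LL is closer to zero than the null LL, which is the sense in which "higher is better".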

What is the Wald statistic (z)?

The Wald test can be used to test the true value of a parameter based on the sample estimate.

- similar to the t-statistic in normal regression
- tests the null hypothesis that b = 0
- biased when b is large (the standard error tends to be inflated, so z is underestimated)
- better to look at likelihood-ratio statistics (i.e. compare the specifications with and without this variable)
- or: correction
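As a rough sketch of the two tests mentioned above: the Wald statistic is the estimate divided by its standard error, and the likelihood-ratio statistic compares the log-likelihoods of the specifications without and with the variable. The function names and all numbers below are hypothetical, and the likelihood-ratio helper assumes that exactly one variable is added (one degree of freedom).

```python
import math

def wald_z(b, se):
    """Wald statistic: coefficient divided by its standard error."""
    return b / se

def two_sided_p_from_z(z):
    """P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

def lr_test_one_df(ll_without, ll_with):
    """Likelihood-ratio statistic and p-value for one added variable.

    Uses the fact that a chi-squared variable with 1 degree of freedom
    is the square of a standard normal variable.
    """
    stat = -2 * (ll_without - ll_with)
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical estimate and standard error
z = wald_z(1.2, 0.5)
p_wald = two_sided_p_from_z(z)

# Hypothetical log-likelihoods of the two specifications
lr_stat, p_lr = lr_test_one_df(ll_without=-50.0, ll_with=-48.08)

print(f"z = {z:.2f}, p = {p_wald:.4f}; LR = {lr_stat:.2f}, p = {p_lr:.4f}")
```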

Read these two slides

It is about assessing the regressor, where you can see how it differs from OLS

What are the potential problems with logistic regression?

› As in normal regression:
- linearity: logistic regression assumes a linear relationship between the regressors and the logit of the dependent variable
- independence of errors (no correlation between errors)
- multicollinearity (inflates standard errors)

› Unique problems:
- statistical software: the iterative procedure fails to converge
- two reasons:
  - incomplete information...
  - complete separation...
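A small sketch of why complete separation stops the procedure from converging, using made-up data in which every x below zero has y = 0 and every x above zero has y = 1. The log-likelihood keeps rising toward zero as the slope grows, so no finite coefficient maximizes it.

```python
import math

def log_sigmoid(z):
    """Numerically stable log of the logistic function."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(b0, b1, xs, ys):
    ll = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # log(p) for y = 1, log(1 - p) for y = 0
        ll += y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z)
    return ll

# Made-up, perfectly separated sample
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

slopes = (1.0, 5.0, 25.0)
lls = [log_likelihood(0.0, b1, xs, ys) for b1 in slopes]
for b1, ll in zip(slopes, lls):
    print(f"b1 = {b1:5.1f}  LL = {ll:.3e}")
```

Each larger slope gives a log-likelihood closer to zero, which is why the software keeps iterating.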

What is multiple (multinomial) logistic regression?

› Predicts membership of more than two categories
› Breaks the dependent variable down into a series of comparisons
› Example: three categories A, B, C
- the analysis consists of two comparisons
- select a baseline category, e.g. A
- comparisons: A vs. B, A vs. C
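The baseline idea can be sketched as follows. Each non-baseline category gets its own set of coefficients for the comparison with A, and the baseline's linear predictor is fixed at zero. The coefficients and the single regressor x are hypothetical.

```python
import math

def multinomial_probs(x, coefs, baseline="A"):
    """Category probabilities from baseline-category comparisons.

    coefs maps each non-baseline category to (intercept, slope) of its
    comparison with the baseline; the baseline score is fixed at 0.
    """
    scores = {baseline: 0.0}
    for category, (b0, b1) in coefs.items():
        scores[category] = b0 + b1 * x
    denom = sum(math.exp(s) for s in scores.values())
    return {category: math.exp(s) / denom for category, s in scores.items()}

# Hypothetical coefficients for the comparisons A vs. B and A vs. C
coefs = {"B": (0.5, 1.0), "C": (-1.0, 2.0)}
probs = multinomial_probs(0.0, coefs)
print(probs)
```

The log of probs["B"] / probs["A"] recovers the linear predictor of the A vs. B comparison, which is what each set of coefficients describes.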

Give your insights from this regression output

Note: the model estimates the probability of direct network entry (dependent variable = 1 if dyad).
Interpretation of a parameter estimate:

- positive (significant): this variable increases the probability of a dyad (direct network entry)
- negative (significant): ...decreases...

› No evidence for H1 and H4
› Evidence for other hypotheses
› Some controls matter, but not all

Looking at the information provided in table 1, explain the main characteristics of the sampled firms. Note that in table 1 (descriptive statistics in the sampled firms) there is a mistake. The title of the third column instead of Minimum should be Maximum.

The sample comprises 203 SMEs from manufacturing in B2B markets with fewer than 250 employees (the formal EU definition). The questionnaire was retrospective (gathering past information).
You can derive foreign sales and foreign markets as a percentage of total sales and markets. There also appear to be outliers (e.g. in employees, foreign sales and foreign markets).

The authors collected data 'on site'. What does it mean and what are advantages/disadvantages?

It means that they went there themselves to hand out the surveys. This raises the chance of a response. However, it is time-consuming and costly for the researchers.

How did authors check for the potential sample bias in the data?

Sample bias: they compared respondents with non-respondents, which showed no evidence of bias.
Market bias: how representative is the sample in terms of presence in the different host markets: new EU markets (92), Russia (61) and China (5).

Do you think the results will be influenced by the financial crisis of 2008?

The data was collected in 2004 before the crisis, so no.

Explain why Sandberg decided to use logistic regression rather than OLS. Explain the dependent variable of this analysis

The dependent variable is a dummy variable (1 if dyad, i.e. direct network entry). OLS is therefore not suited: its predicted values can fall outside the 0 to 1 range, so the results become inaccurate.

What does the model estimate given the dependent variable? From that point of view: how good are the models?

The model estimates how the independent variables (e.g. general internationalization knowledge, market-specific knowledge) and control variables (e.g. export share) influence whether a company chooses a dyad or a triad network node configuration.

Model = B0 + B1 firm size + B2 export share + B3 host market experience + … + E
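One way to read the specification above is the latent-variable form of the logit model. This is a sketch that assumes the error term E follows a standard logistic distribution, which the notes do not state; y* is a hypothetical unobserved propensity to choose a dyad.

```latex
y^{*} = B_0 + B_1\,\text{firm size} + B_2\,\text{export share}
      + B_3\,\text{host market experience} + \dots + E,
\qquad
y = \begin{cases} 1 \ (\text{dyad}) & \text{if } y^{*} > 0 \\ 0 & \text{otherwise} \end{cases}

P(y = 1) = \frac{1}{1 + e^{-(B_0 + B_1\,\text{firm size} + B_2\,\text{export share}
      + B_3\,\text{host market experience} + \dots)}}
```

Under that assumption the coefficients act on the log-odds of a dyad, not on the probability itself.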

What are the requirements of a logistic regression?

Binary logistic regression is the most suitable technique when the dependent variable is non-metric and dichotomous, when the independent variables are both qualitative and quantitative, and when the underlying assumption of multivariate normality may not be fulfilled.

What does it mean: hierarchical? (table 4 title)

You start with only the control variables, then add the independent variables, then the moderating variables, and finally everything together.

What types of knowledge do the authors consider to explain the propensity to choose a network entry configuration?

General internationalization, market-specific and customer-specific knowledge.
General internationalization knowledge (i.e. international experience of the firm): H1 direct effect, H4 moderator, H6 explanatory variable
Market-specific knowledge: H2 direct effect, H4 explanatory variable, H5 moderator
Customer-specific knowledge: H3 direct effect, H5 explanatory variable, H6 moderator

What is the assumption behind the 'host market' variable? Can that assumption be verified?

The assumption is that host markets affect the international strategy of the firms. According to table 4 they do. The host market variable (Baltic States, Poland, Russia, China) indicates the market in which the SME has the most experience.

E.g. China = 1 if the respondent has more experience in China than in the Baltic States, Poland or Russia; = 0 otherwise.

What is the baseline and group for the models? Is that a logical choice?

The host market dummies are among the control variables. Host market is a categorical variable with three categories:

Baltic & Poland
Russia
China

We cannot include all of them in the model, because that would cause perfect multicollinearity. We therefore exclude one of them (the base category); China is used as the base category.
The text explains why they chose China.
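Dropping the base category can be sketched as a small encoding helper. The function name and dictionary layout are illustrative assumptions; the three categories and the China baseline come from the notes above.

```python
def host_market_dummies(market, baseline="China"):
    """Encode host market as 0/1 dummies, omitting the base category.

    A firm whose host market is the baseline gets 0 on every dummy, so
    each coefficient is read relative to the baseline.
    """
    categories = ["Baltic & Poland", "Russia", "China"]
    if market not in categories:
        raise ValueError(f"unknown host market: {market!r}")
    return {c: int(market == c) for c in categories if c != baseline}

print(host_market_dummies("Russia"))
print(host_market_dummies("China"))
```

With all three dummies included, they would always sum to 1 and duplicate the intercept, which is the multicollinearity problem described above.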

Which is the statistic that is used in the logistic regression model to capture the individual significance of the explanatory variables?

Wald statistic (z).

Interpret the coefficient on Russia in model 1 table 4.

If the host market is Russia, the log-odds of a dyad network node configuration are lower by 1.74.

Other things being equal, when the host market is Russia, the probability of having a dyad is lower than if the host market is China (the base category).
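To make the size of this effect concrete, the coefficient can be converted into an odds ratio. The sketch below takes the coefficient as -1.74, as implied by the notes, and uses a hypothetical baseline probability of 0.6 for an otherwise identical firm whose host market is China.

```python
import math

coef_russia = -1.74  # read from the notes: "lower by 1.74"

# A change of -1.74 in the log-odds multiplies the odds by exp(-1.74).
odds_ratio = math.exp(coef_russia)

# Hypothetical probability of a dyad when the host market is China
p_china = 0.6
odds_china = p_china / (1 - p_china)

odds_russia = odds_china * odds_ratio
p_russia = odds_russia / (1 + odds_russia)

print(f"odds ratio = {odds_ratio:.3f}, P(dyad | Russia) = {p_russia:.3f}")
```

The odds of a dyad for Russia are roughly 0.18 times the odds for China; the change in probability depends on the baseline probability chosen.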

Based on the statistics provided in table 4 can you choose which is the best specification?

You don't have the degrees of freedom, so you cannot use the deviance (-2LL) to assess the reliability of the model. LL < 0: the closer to zero, the better. -2LL (deviance statistic): the lower, the better. So you have to choose between Nagelkerke R2 (the higher the better) and the correct classification (the higher the better), which suggest that model 8 is the best specification.

Extra: if the degrees of freedom (sample size, number of variables) are known, use the deviance to assess the reliability of the model. If not, use the Nagelkerke R2 and the correct classification.
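The two fit measures used for this comparison can be sketched as below. The formulas are the standard definitions of Nagelkerke R2 and the correct classification rate; the log-likelihoods, sample size, outcomes, predicted probabilities and the 0.5 cutoff are hypothetical and are not taken from table 4.

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Cox and Snell R2 rescaled so that its maximum equals 1."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_cox_snell = 1 - math.exp(2 * ll_null / n)
    return cox_snell / max_cox_snell

def correct_classification(ys, probs, cutoff=0.5):
    """Share of cases whose predicted class matches the observed one."""
    hits = sum(int((p >= cutoff) == bool(y)) for y, p in zip(ys, probs))
    return hits / len(ys)

# Hypothetical log-likelihoods for a sample of 200 firms
r2 = nagelkerke_r2(ll_null=-130.0, ll_model=-100.0, n=200)

# Hypothetical observed outcomes and predicted probabilities
rate = correct_classification([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])

print(f"Nagelkerke R2 = {r2:.3f}, correct classification = {rate:.0%}")
```

Both measures are higher for better-fitting specifications, which is how they are used to compare the models.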

The questions on this page originate from the summary of the following study material:

Empirical Research Project
