Example Code - Linear Model with Panel Data

Code file: linear_model-panel-ols.ipynb

Panel Data Analysis

Panel data = $[y, \mathbf{x}]_{i,t} = [y,x_1,x_2,\dots,x_p]_{i,t}$

$i \in I$ : index for grouping.

e.g. individual name, age, size, gender, location
Its numeric order may or may not be meaningful.

$t \in T$ : index for time.

Its numeric order is increasing by definition

$y_{i,t}$ : dependent variable of group $i$ , observed at time $t$

$\mathbf{x}_{i,t}=[x_1,x_2,\dots,x_p]_{i,t}$ : independent variables of group $i$ , observed at time $t$

```python
   id  time   y  x1  x2
0   1     1  10   5   3
1   1     2  12   6   4
2   1     3  14   7   5
3   2     1  20  10   6
4   2     2  22  11   7
```
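As a minimal sketch, the toy panel shown above can be built with pandas and indexed by group and time. The variable names `panel`, `panel_indexed`, and `obs_per_group` are chosen here for illustration and do not come from the notebook.

```python
import pandas as pd

# Toy panel matching the table above: two groups (id) observed over time.
panel = pd.DataFrame(
    {
        "id":   [1, 1, 1, 2, 2],
        "time": [1, 2, 3, 1, 2],
        "y":    [10, 12, 14, 20, 22],
        "x1":   [5, 6, 7, 10, 11],
        "x2":   [3, 4, 5, 6, 7],
    }
)

# Index by (group, time) so that each row is one observation [y, x]_{i,t}.
panel_indexed = panel.set_index(["id", "time"]).sort_index()

# Number of time periods observed per group (group 2 has fewer, so the
# panel is unbalanced).
obs_per_group = panel.groupby("id")["time"].count()

print(panel_indexed)
print(obs_per_group)
```

Indexing by `(id, time)` makes it easy to select one group's time series, for example `panel_indexed.loc[1]`.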

Comparing Linear Models for Investment amount of Firms

$e_{i,t}$ = error, assumed to be IID normally distributed across $i$ and $t$

Example 2. Gulen and Ion 2015.

Model: Increase in cash flow or Increase in investment opportunity $\implies$ Increase in investment

Statistical significance test for the positive relationships between investment and the two regressors.

Data frequency: quarterly or annual

Data sources: accounting data provided via Compustat and market price data via CRSP

$y_{i,t+1}$ = investment amount of firm $i$ , between time $t$ and time $t+1$

$(x_1)_{i,t}$ = operating cash flows of firm $i$ , observed at time $t$

$(x_2)_{i,t}$ = Tobin’s q of firm $i$ , observed at time $t$

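Data pre-processing: Winsorization replaces extreme outliers with the value of a given quantile on the respective end (e.g. values below the 0.01 quantile are set to the 0.01 quantile, and values above the 0.99 quantile are set to the 0.99 quantile).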
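A minimal sketch of winsorization with pandas follows. The helper name `winsorize`, the `cut` parameter, and the simulated series are assumptions made for illustration; the notebook may implement this step differently.

```python
import numpy as np
import pandas as pd

def winsorize(x: pd.Series, cut: float = 0.01) -> pd.Series:
    """Set values below the `cut` quantile and above the `1 - cut`
    quantile to those quantile values (hypothetical helper)."""
    lower = x.quantile(cut)
    upper = x.quantile(1 - cut)
    return x.clip(lower=lower, upper=upper)

# Simulated data with one injected outlier on each end.
rng = np.random.default_rng(0)
raw = pd.Series(rng.normal(size=1000))
raw.iloc[0] = 50.0
raw.iloc[1] = -50.0

clean = winsorize(raw, cut=0.01)
print(raw.min(), raw.max())      # extremes before winsorization
print(clean.min(), clean.max())  # extremes pulled in to the 1% / 99% quantiles
```

Unlike trimming, winsorization keeps every observation, so the panel structure (which firm is observed in which period) is unchanged.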

Linear Models.

Random effect model (estimated below by pooled OLS with a common intercept $\alpha$): $y_{i,t+1}=\alpha + \beta_1 \cdot (x_1)_{i,t}+\beta_2 \cdot(x_2)_{i,t}+e_{i,t}$

1-way Fixed effect model: $y_{i,t+1}=\alpha_i + \beta_1 \cdot (x_1)_{i,t}+\beta_2 \cdot(x_2)_{i,t}+e_{i,t}$

2-way Fixed effect model: $y_{i,t+1}=\alpha_i + \alpha_t+ \beta_1 \cdot (x_1)_{i,t}+\beta_2 \cdot(x_2)_{i,t}+e_{i,t}$

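```python
import pyfixest as pf

# data_investment is assumed to be the firm-level panel prepared earlier in the notebook.

# Pooled OLS: one common intercept, no fixed effects
model_ols = pf.feols(
  "investment_lead ~ cash_flows + tobins_q",
  vcov = "iid",
  data = data_investment
)
model_ols.summary()

# 1-way fixed effects: one intercept per firm (gvkey)
model_fe_firm = pf.feols(
  "investment_lead ~ cash_flows + tobins_q | gvkey",
  vcov = "iid",
  data = data_investment
)
model_fe_firm.summary()

# 2-way fixed effects: firm and year intercepts
model_fe_firmyear = pf.feols(
  "investment_lead ~ cash_flows + tobins_q | gvkey + year",
  vcov = "iid",
  data = data_investment
)
model_fe_firmyear.summary()

# Side-by-side comparison: coefficients with t-statistics in parentheses
pf.etable([model_ols, model_fe_firm, model_fe_firmyear], coef_fmt = "b (t)")
```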

The random effect (RE) model treats the unexplained group-level effects as random. It assumes that the variation due to omitted variables is not linearly correlated with the included independent variables (i.e. explanatory regressors). If this assumption fails, the omitted variables bias the OLS coefficients. A low adjusted R-squared signals that much of the variation is left unexplained, which leaves room for such omitted variables. If the bias in the OLS coefficients is severe, we cannot even be sure whether the true coefficients are positive or negative.

The panel linear regression focuses on how the regressors within each individual/group affect the dependent variable, controlling for the unobserved, time-invariant, long-term differences across these individuals/groups (e.g., cultural factors, political systems). We are interested in the explanatory power of a firm’s cash flow and Tobin’s q on top of the average investment of each firm or the time fixed effect.

The fixed effect (FE) model with $\alpha_i$ gives each group its own intercept (its own mean level), so unobserved factors specific to individuals/groups are allowed to be linearly correlated with the regressors.

The $\alpha_i$ represents the fixed mean effect (fixed and unique intercept parameter) for group $i$

The $\alpha_i$ represents the firm-specific average investment across all years (i.e. investment in the long run). The idea of the firm fixed effect is to remove the firm’s average investment, which might be affected by firm-specific variables that we do not observe. For example, firms in a specific industry might invest more on average. This sort of variation is unwanted because it is related to unobserved variables that can bias the OLS estimates in any direction.
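To illustrate what removing the firm’s average does, the sketch below simulates a panel in which an unobserved firm effect is correlated with the regressor. It then compares the pooled OLS slope with the slope after subtracting firm means (the within transformation). All names and parameter values are assumptions for this simulation, not taken from the Compustat data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_firms, n_years, beta = 200, 10, 2.0  # hypothetical simulation settings

firm = np.repeat(np.arange(n_firms), n_years)
alpha_i = rng.normal(size=n_firms)[firm]           # unobserved firm effect
x = alpha_i + rng.normal(size=firm.size)           # regressor correlated with alpha_i
y = alpha_i + beta * x + rng.normal(size=firm.size)
df = pd.DataFrame({"firm": firm, "x": x, "y": y})

def ols_slope(x, y):
    """Slope of a one-regressor OLS fit with intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c, y_c = x - x.mean(), y - y.mean()
    return float((x_c * y_c).sum() / (x_c ** 2).sum())

# Pooled OLS ignores alpha_i, so the slope absorbs part of the firm effect.
beta_pooled = ols_slope(df["x"], df["y"])

# Within transformation: subtract each firm's own mean from x and y.
x_within = df["x"] - df.groupby("firm")["x"].transform("mean")
y_within = df["y"] - df.groupby("firm")["y"].transform("mean")
beta_within = ols_slope(x_within, y_within)

print(f"true beta = {beta}, pooled = {beta_pooled:.3f}, within = {beta_within:.3f}")
```

In this setup the pooled slope is pushed upward by the omitted firm effect, while the demeaned regression stays close to the true coefficient.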

The $\alpha_t$ represents the fixed mean effect (fixed and unique intercept parameter) for time $t$

The $\alpha_t$ represents the time fixed effects. For example, average investment across firms might vary over time due to macroeconomic factors that affect all firms, such as economic crises. Investment may also be higher during an economic expansion, when cash flows are simultaneously high. If including time fixed effects changes the R-squared and the coefficients only marginally, this may indicate that the coefficients are not driven by an omitted variable that varies over time.
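The role of the time effect can be sketched in the same way. For a balanced panel, the two-way fixed effect estimator is equivalent to OLS on data from which both the group mean and the time mean have been removed, with the overall mean added back. The simulation below is an illustration under assumed parameter values, with a common time shock that moves both the regressor and the outcome.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_firms, n_years, beta = 100, 20, 1.5  # hypothetical simulation settings

firm = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(n_years), n_firms)
alpha_i = rng.normal(size=n_firms)[firm]   # firm effect
alpha_t = rng.normal(size=n_years)[year]   # common time shock (e.g. macro conditions)
x = alpha_i + alpha_t + rng.normal(size=firm.size)
y = alpha_i + alpha_t + beta * x + rng.normal(size=firm.size)
df = pd.DataFrame({"firm": firm, "year": year, "x": x, "y": y})

def slope(x, y):
    """Slope of a one-regressor OLS fit with intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c, y_c = x - x.mean(), y - y.mean()
    return float((x_c * y_c).sum() / (x_c ** 2).sum())

def demean_firm(col):
    return df[col] - df.groupby("firm")[col].transform("mean")

def demean_two_way(col):
    # Valid for balanced panels: z_it - mean_i - mean_t + overall mean.
    return (df[col]
            - df.groupby("firm")[col].transform("mean")
            - df.groupby("year")[col].transform("mean")
            + df[col].mean())

beta_firm_only = slope(demean_firm("x"), demean_firm("y"))
beta_two_way = slope(demean_two_way("x"), demean_two_way("y"))

print(f"true = {beta}, firm FE only = {beta_firm_only:.3f}, two-way FE = {beta_two_way:.3f}")
```

Here the firm-only estimate remains biased because the time shock is still omitted. A small gap between the two estimates in real data would instead suggest that time-varying common factors matter little.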

The Hausman test is a statistical test used to help researchers decide between fixed effects (FE) and random effects (RE) models. This decision is critical because choosing the incorrect model could lead to biased or inefficient estimates. The Hausman test examines whether there is a significant linear correlation between the unobserved individual effects (the unobserved heterogeneity specific to each individual or group) and the independent variables in the model. The null hypothesis of the Hausman test is that the difference in the coefficients of the FE and RE models is not statistically significant, meaning there is no linear correlation between the individual effects and the regressors (suggesting the RE model is appropriate). If the null hypothesis is rejected, it suggests the fixed effects model is more appropriate.

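Estimate the model using both FE and RE models to obtain two sets of parameter estimates, $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$.
Compute the difference between the FE and RE coefficients, $\hat{\beta}_{FE}-\hat{\beta}_{RE}$.
Calculate the test statistic: $H=(\hat{\beta}_{FE}-\hat{\beta}_{RE})'\,[\widehat{Var}(\hat{\beta}_{FE})-\widehat{Var}(\hat{\beta}_{RE})]^{-1}\,(\hat{\beta}_{FE}-\hat{\beta}_{RE})$, which under the null hypothesis is asymptotically $\chi^2$ distributed with degrees of freedom equal to the number of compared coefficients.
Interpret the result: if $H$ exceeds the $\chi^2$ critical value (i.e. the p-value is below the chosen significance level), reject the null hypothesis and prefer the FE model; otherwise the RE model is not rejected.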
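The steps above can be sketched with NumPy, taking the two coefficient vectors and their covariance matrices as given. The function name `hausman_statistic` and all numbers below are hypothetical and serve only to show the arithmetic; in practice the inputs would come from fitted FE and RE models.

```python
import numpy as np

def hausman_statistic(b_fe, b_re, v_fe, v_re):
    """Hausman statistic H = d' [V_FE - V_RE]^{-1} d with d = b_FE - b_RE.

    Returns (H, degrees of freedom). Hypothetical helper for illustration.
    """
    diff = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    v_diff = np.asarray(v_fe, dtype=float) - np.asarray(v_re, dtype=float)
    stat = float(diff @ np.linalg.solve(v_diff, diff))
    return stat, diff.size

# Made-up estimates for two regressors (illustration only).
b_fe = [0.50, 0.10]
b_re = [0.40, 0.12]
v_fe = [[0.0020, 0.0], [0.0, 0.0005]]
v_re = [[0.0010, 0.0], [0.0, 0.0003]]

stat, dof = hausman_statistic(b_fe, b_re, v_fe, v_re)

# The 5% critical value of a chi-squared distribution with 2 degrees of
# freedom is about 5.991.
reject_re = stat > 5.991
print(f"H = {stat:.2f}, df = {dof}, reject RE at 5%: {reject_re}")
```

In finite samples the difference of the two covariance matrices is not guaranteed to be positive definite, so a negative or unstable statistic is a sign that the test should be interpreted with care.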

Clustered Standard Errors

biased estimates vs. biased standard errors of the unbiased estimate

Linear Regression Residuals $\hat{e}_{i,t}:=y_{i,t}-\hat{\beta}'\cdot\mathbf{x}_{i,t}$

Even if we have solved the omitted variable problem so that the OLS estimate is unbiased, another problem remains. What if $Cov(\hat{e}_{i,t}, \hat{e}_{i,t+h}) \ne 0$? Such dependence in the residuals violates the i.i.d. assumption behind the usual OLS standard errors and leads to biased standard errors. With biased standard errors, we cannot reliably interpret the statistical significance of the estimated coefficients, even when the coefficients themselves are unbiased.

The residuals $\hat{e}_{i,t}$ may be correlated across years for a given group (time-series dependence), or, alternatively, the residuals may be correlated across different groups (cross-section dependence). One of the most common approaches to dealing with such dependence is the use of clustered standard errors (Petersen 2008) on the group level and/or on the time level. The idea behind clustering is that the correlation of residuals within a cluster can be of any form. As the number of clusters grows, the cluster-robust standard errors become consistent (Donald and Lang 2007; Wooldridge 2010). A natural requirement for clustering standard errors in practice is hence a sufficiently large number of clusters. Typically, at least 30 to 50 clusters are seen as sufficient (Cameron, Gelbach, and Miller 2011).

If the t-statistics of the OLS coefficients drop substantially once the standard errors are clustered on the firm level, this indicates a high correlation of residuals within firms.
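To make the difference between i.i.d. and clustered standard errors concrete, the sketch below computes both by hand for a simulated panel in which the regressor and the error each contain a firm-level component. The cluster-robust variance uses the usual sandwich form with a common small-sample correction. The simulation settings are assumptions, and the correction used by a given library may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(1)
n_firms, n_years = 200, 10  # hypothetical simulation settings
firm = np.repeat(np.arange(n_firms), n_years)

# Regressor and error both have a firm-level component, so the residuals
# are correlated within each firm.
x = rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
e = rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
y = 1.0 + 0.5 * x + e

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Standard errors under the i.i.d. assumption.
sigma2 = resid @ resid / (n - k)
se_iid = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust standard errors: sum the scores x_it * e_it within each
# firm, then form the sandwich (X'X)^-1 [sum_g s_g s_g'] (X'X)^-1.
scores = X * resid[:, None]
cluster_scores = np.zeros((n_firms, k))
np.add.at(cluster_scores, firm, scores)
meat = cluster_scores.T @ cluster_scores
correction = n_firms / (n_firms - 1) * (n - 1) / (n - k)
se_cluster = np.sqrt(np.diag(correction * XtX_inv @ meat @ XtX_inv))

print("slope estimate:", beta_hat[1])
print("i.i.d. SE:", se_iid[1], "clustered SE:", se_cluster[1])
print("t (i.i.d.):", beta_hat[1] / se_iid[1], "t (clustered):", beta_hat[1] / se_cluster[1])
```

Because the within-firm correlation is positive here, the clustered standard error of the slope is larger than the i.i.d. one and the t-statistic shrinks accordingly.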

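```python
import pyfixest as pf

# Baseline: 2-way fixed effects with i.i.d. standard errors
model_fe_firmyear = pf.feols(
  "investment_lead ~ cash_flows + tobins_q | gvkey + year",
  vcov = "iid",
  data = data_investment
)

# Same model, standard errors clustered by firm
model_cluster_firm = pf.feols(
  "investment_lead ~ cash_flows + tobins_q | gvkey + year",
  vcov = {"CRV1": "gvkey"},
  data = data_investment
)

# Same model, standard errors clustered by firm and year
model_cluster_firmyear = pf.feols(
  "investment_lead ~ cash_flows + tobins_q | gvkey + year",
  vcov = {"CRV1": "gvkey + year"},
  data = data_investment
)

# Coefficients are identical across columns; only the t-statistics change
pf.etable([model_fe_firmyear, model_cluster_firm, model_cluster_firmyear], coef_fmt = "b (t)")
```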