Instrumental variable regressions are used when one wants to establish a causal channel through which the explanatory variable affects the dependent variable, but one is worried that the error term in the OLS regression is correlated with the explanatory variable.

To elaborate, a statistically significant slope coefficient in an OLS regression in general merely indicates a significant correlation between the dependent variable (Y) and the explanatory variable (X). In order to claim that a change in X *causes* a change in Y, X must be uncorrelated with the unobserved error term (e) in the OLS regression. Otherwise, X and Y may co-move partly or entirely because X moves together with e, which in turn moves Y. Note that the fitted residuals from the OLS regression are uncorrelated with X by construction, so they cannot be used to argue for exogeneity of X. Endogeneity or exogeneity of X can never be proved mathematically; it can only be argued given the particular situation and the X and Y variables considered. Once a researcher is worried that X is endogenous, one possible solution is to use an instrumental variable Z that is uncorrelated with the error term e but correlated with the endogenous explanatory variable X. In other words, the instrument Z needs to be correlated with Y only through X.
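The argument above can be illustrated with a small simulation. The following is only a sketch on synthetic data, not an analysis of any real dataset: the data-generating process, the true slope of 2.0, the sample size and all function names are assumptions made for illustration. X is built to depend on the error term e, so the OLS slope is biased, while the instrument Z is independent of e and shifts X, so the simple IV estimate Cov(Z, Y) / Cov(Z, X) should land close to the true slope.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n=20000, beta=2.0, seed=0):
    """Generate synthetic (z, x, y) in which x is endogenous (hypothetical setup)."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)            # instrument: drawn independently of e
        ei = rng.gauss(0, 1)            # unobserved error in the Y equation
        xi = zi + ei + rng.gauss(0, 1)  # X depends on Z and on e, so X is endogenous
        yi = beta * xi + ei             # Z affects Y only through X
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()

# OLS slope: Cov(X, Y) / Var(X). Biased here because Cov(X, e) != 0.
ols_slope = cov(x, y) / cov(x, x)

# IV slope with a single instrument: Cov(Z, Y) / Cov(Z, X).
iv_slope = cov(z, y) / cov(z, x)

print("OLS slope:", round(ols_slope, 3))
print("IV slope: ", round(iv_slope, 3))
```

In this setup the OLS slope overstates the true effect because X and e are positively correlated by construction; with a negative correlation the bias would go the other way.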

### Example: Eswaran and Malhotra (2011)

The paper by Eswaran and Malhotra (2011) asks what determines domestic violence in developing countries, and whether this violence impinges on women’s autonomy or the reverse. Answering this question empirically is tricky because there is a clear endogeneity issue: greater female autonomy may affect domestic violence, and domestic violence may in turn affect female autonomy.

They use two instruments to deal with the endogeneity issue: the woman’s current breast-feeding status and an index of the woman’s height. The argument for breast-feeding status as an instrument is that this activity takes away time the woman could otherwise devote to her husband’s demands, and thus makes it more likely that the husband will engage in spousal violence. The reason for using the woman’s height as an instrument is that a husband is less likely to engage in violence if he feels he cannot physically overpower his wife. Moreover, the woman’s height is determined before marriage and is unlikely to be directly correlated with the socioeconomic status of her husband. For the instruments to be valid, the exclusion restriction requires that the relationship between the instrument and autonomy outcomes be completely mediated by domestic violence.

They estimate the following two-stage least squares model:

V = α_{1}*X + α_{2}* Z + ε_{1}

D = β_{1}*X + β_{2}*V + ε_{2}

and the STATA command to implement this will be:

**ivregress 2sls** D X (V = Z)

where V is domestic violence reported by the respondent, Z denotes the instrumental variable for domestic violence (either breast-feeding or height), D is the woman’s decision-making autonomy and X denotes a vector of exogenous regressors.

The authors find that domestic violence significantly reduces autonomy, and that once the endogeneity of domestic violence is accounted for using IVs, the coefficients are much larger in magnitude than the ordinary least squares estimates.

### Another Example

Suppose a researcher is interested in estimating the causal effect of smoking on health. A simple OLS regression of health on smoking does not establish that smoking causes poor health, because other variables, such as whether or not a person is mentally depressed, may affect both health and smoking. In other words, smoking is endogenous. One possible solution to the problem of endogeneity is to use the tax rate for tobacco products as an instrumental variable for smoking. The tax rate for tobacco products is a reasonable choice for an instrument because the researcher assumes that it can only be correlated with health through its effect on smoking. If the researcher then finds tobacco taxes and state of health to be correlated, this may be viewed as evidence that smoking causes changes in health.
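The last step of this argument can be made explicit with a short derivation. The notation below (π for the first-stage coefficients, β for the health equation, u and e for the error terms) is introduced here for illustration and is not taken from any particular paper; the model is assumed to be linear with a single instrument and no other regressors.

```latex
\begin{aligned}
\text{smoking}_i &= \pi_0 + \pi_1\,\text{tax}_i + u_i \\
\text{health}_i  &= \beta_0 + \beta_1\,\text{smoking}_i + e_i \\
\Rightarrow\quad \text{health}_i &= (\beta_0 + \beta_1\pi_0) + \beta_1\pi_1\,\text{tax}_i + (\beta_1 u_i + e_i) \\[4pt]
\operatorname{Cov}(\text{tax},\text{health}) &= \beta_1\pi_1\operatorname{Var}(\text{tax})
  \qquad \text{if } \operatorname{Cov}(\text{tax},u) = \operatorname{Cov}(\text{tax},e) = 0 \\
\beta_1 &= \frac{\operatorname{Cov}(\text{tax},\text{health})}{\operatorname{Cov}(\text{tax},\text{smoking})}
  \qquad \text{provided } \pi_1 \neq 0
\end{aligned}
```

Under these assumptions, a nonzero correlation between taxes and health can only arise when both π₁ and β₁ are nonzero, which is why it is read as evidence of an effect of smoking on health.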

In STATA, an instrumental variable regression can be implemented using the following command:

**ivregress 2sls** y x1 (x2 = z1 z2)

In the above STATA implementation, y is the dependent variable, x1 is an exogenous explanatory variable, x2 is the endogenous explanatory variable which is being instrumented by the variables z1, z2 and also x1. Mathematically, the above IV regression is equivalent to the following simultaneous-equations framework:

(1) x2_{i} = a_{0} + a_{1}z1_{i} + a_{2}z2_{i} + a_{3}x1_{i} + u_{i}

(2) y_{i} = b_{0} + b_{1}x1_{i} + b_{2}x2_{i} + e_{i}

The command option **2sls** (2-stage least squares) tells STATA to estimate equations (1) and (2) by least squares in two steps: equation (1) is fitted first, and its fitted values of x2 then replace x2 when equation (2) is fitted. Equation (1) is often referred to as the "first stage regression". Statistically significant coefficients on the instruments in the first stage are crucial, because otherwise the endogenous explanatory variable is not sufficiently correlated with the instruments (leading to the problem of '*weak instruments*').
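The two steps can be sketched by hand on synthetic data. This is a minimal illustration under stated assumptions, not a replacement for `ivregress 2sls`: the data-generating process, the true coefficients (0.5 on x1, 2.0 on x2), the sample size and the helper functions `solve`, `ols`, `predict` and `rss` are all hypothetical. The first stage regresses x2 on z1, z2 and x1; the second stage regresses y on x1 and the first-stage fitted values. An F statistic for the joint significance of z1 and z2 in the first stage serves as a simple check of instrument strength.

```python
import random


def solve(a, b):
    """Solve the linear system a @ coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """OLS of y on an intercept plus the given columns; returns [const, slopes...]."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, columns):
    """Fitted values for coefficients returned by ols()."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in zip(*columns)]


def rss(coef, columns, y):
    """Residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, predict(coef, columns)))


# Synthetic data (assumed setup): x2 is endogenous because it depends on e.
rng = random.Random(1)
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
e = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + 0.8 * b + 0.6 * c + d + w for a, b, c, d, w in zip(x1, z1, z2, e, u)]
y = [1.0 + 0.5 * a + 2.0 * b + d for a, b, d in zip(x1, x2, e)]

# First stage: x2 on the instruments z1, z2 and the exogenous regressor x1.
first = ols([z1, z2, x1], x2)
x2_hat = predict(first, [z1, z2, x1])

# Second stage: y on x1 and the fitted values of x2.
second = ols([x1, x2_hat], y)   # second[1] is the x1 slope, second[2] the x2 slope
naive = ols([x1, x2], y)        # plain OLS, for comparison

# First-stage F statistic for the two excluded instruments z1 and z2.
rss_u = rss(first, [z1, z2, x1], x2)
rss_r = rss(ols([x1], x2), [x1], x2)
f_stat = ((rss_r - rss_u) / 2) / (rss_u / (n - 4))

print("2SLS coefficient on x2:", round(second[2], 3))
print("OLS coefficient on x2: ", round(naive[2], 3))
print("First-stage F:         ", round(f_stat, 1))
```

Running the second stage by hand reproduces the 2SLS point estimates, but the standard errors from that second regression are not valid because they treat the fitted values as if they were data. A dedicated IV routine computes the standard errors correctly, which is one reason to prefer it in practice.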