Lesson 2 - Data and uncertainty

Kyle Carlson

Data-generating processes and models

In this course the data-generating process (DGP) is primary. A DGP is a completely specified set of random variables that can generate a data set. The parameters of the DGP include parameters of the random variables, e.g., the mean and variance of a normal distribution, and parameters in functions that relate random variables. The average treatment effect is also a parameter of the DGP! "Completely specified" means the process could be programmed into a computer to generate a data set, i.e., a table of numbers. We sometimes informally say that the DGP is "the true model." In practice, this framework runs into two challenges: (1) we make simplifying assumptions about the form of the DGP, relative to how we think it works in reality, and (2) we do not know or directly observe the parameters of the DGP, only a sample of data. We will put aside challenge 1. Challenge 2 is where statistics applies and where we will focus. We will study how to use a sample of data to make inferences about the DGP.

Example: A data-generating process and model

  • An example of a data-generating process is: \(Y^0 \sim \text{Bernoulli}(p=0.3)\) and \(Y^1 \sim \text{Bernoulli}(p=0.4)\). We could program this into a computer to generate 2 columns of binary variables.
  • With a causal interpretation, this implies that ATE = 0.4 - 0.3 = 0.1.
  • In practical cases, we do not know the parameter values. But, we suppose they exist, and we might write the DGP like this: \(Y^0 \sim \text{Bernoulli}(p^0), Y^1 \sim \text{Bernoulli}(p^1), (p^0, p^1) \in [0, 1]^2\).
  • A model is a parameterized set of data-generating processes. A model for the above example DGP would be the set \(\{(Y^0, Y^1) : Y^0 \sim \text{Bernoulli}(p^0), Y^1 \sim \text{Bernoulli}(p^1), (p^0, p^1) \in [0, 1]^2 \}\). The DGP is one specific element of this set.
  • In practical cases, we would assume that our data set was generated by some DGP in this model. Then we would use statistics and our data to do inference about the values of \(p^0\) and \(p^1\), which includes making estimates of their values and conducting hypothesis tests.
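The bullets above can be sketched in code. This is a minimal simulation (not from the lesson): we program the example DGP with \(p^0 = 0.3\) and \(p^1 = 0.4\), generate a sample, and then pretend we do not know the parameters and estimate them from the data. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # sample size, chosen for illustration

# True DGP parameters (known here because we wrote the DGP;
# unknown in practical cases)
p0, p1 = 0.3, 0.4

# Generate two columns of binary potential outcomes
y0 = rng.binomial(1, p0, size=N)
y1 = rng.binomial(1, p1, size=N)

# Inference: estimate p0, p1, and the ATE from the sample alone
p0_hat = y0.mean()
p1_hat = y1.mean()
ate_hat = p1_hat - p0_hat
print(p0_hat, p1_hat, ate_hat)  # close to 0.3, 0.4, 0.1
```

With a large sample, the estimates land near the true parameter values, which is the inference problem of challenge 2 in miniature.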

Example: A classic linear model

In econometrics we often write a linear model as \(Y = \mathbf{X}\beta + \epsilon\) where \(\mathbf{X}\) is a fixed, real-valued \(n\)-by-\(k\) matrix of regressors, \(\epsilon\) is a random vector of disturbances, \(Y\) is a random vector of outcomes, and \(\beta\) is a fixed, real-valued vector of \(k\) coefficients. We commonly assume the exogeneity condition \(E[\epsilon\vert \mathbf{X}] = 0\). To make a data-generating process in this model, we must set specific values for the regressors (\(\mathbf{X}^0\)) and the coefficients (\(\beta^0\)) and specify a distribution for \(\epsilon\), for example, Normal\((0, \sigma^2)\), consistent with the exogeneity condition. Note that, as written, this DGP can only generate data sets of a fixed size \(n\), but that can be relaxed if we allow \(\mathbf{X}\) to be a matrix of random variables. To estimate the parameters, we conventionally use ordinary least squares (OLS), which has some optimality properties under additional conditions on the model.
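As a sketch (the specific values of \(\mathbf{X}^0\), \(\beta^0\), and \(\sigma\) below are illustrative assumptions, not from the lesson), we can construct one DGP in this model and recover \(\beta^0\) by OLS:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 2  # fixed sample size and number of regressors

# Set specific values for the regressors X^0 (a column of ones plus
# one covariate) and the coefficients beta^0
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
beta = np.array([1.0, 2.0])

# Specify the disturbance distribution: Normal(0, sigma^2)
eps = rng.normal(0.0, 0.5, size=n)

# The DGP generates the outcomes
Y = X @ beta + eps

# OLS estimate of beta: solves min_b ||Y - Xb||^2
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # close to [1.0, 2.0]
```

Running the DGP again with the same \(\mathbf{X}^0\) and \(\beta^0\) but fresh disturbances would produce a different data set and a slightly different estimate, which is the sampling variation discussed below.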

You can contrast our approach with the typical, prediction-oriented framework in machine learning. That tradition does posit a joint distribution over the labels and features, but the emphasis is on prediction of labels rather than estimation of parameters in the joint distribution.

Uncertainty and random variation

In the previous lesson we assumed an infinite quantity of data, which removed sampling variation and simplified our analysis of average treatment effects. However, in any actual experiment there is a limited quantity of data, so we must face the complication of uncertainty. Statistics and probability give us the tools to make precise statements about the uncertainty in our data.

Random samples

A simple data set would look like the following table. Here \(N\) is the sample size, that is, the number of individuals in the sample. A more compact way to represent the data set is \(\{(y_i, d_i, \mathbf{x}_i): i=1,\dots,N\}\). Under a random sampling assumption, each data point \((y_i, d_i, \mathbf{x}_i)\) is an independent realization of the superpopulation variables \((Y, D, \mathbf{X})\). Note that we use lowercase symbols to refer to specific, fixed realized values. When we want to treat the data as being random we use uppercase: \(\{(Y_i, D_i, \mathbf{X}_i): i=1,\dots,N\}\).

Outcome    | Treatment status | Covariates
\(y_1\)    | \(d_1\)          | \(\mathbf{x}_1\)
\(y_2\)    | \(d_2\)          | \(\mathbf{x}_2\)
\(\ldots\) | \(\ldots\)       | \(\ldots\)
\(y_N\)    | \(d_N\)          | \(\mathbf{x}_N\)

Important terminological point to avoid confusion: Each \((y_i, d_i, \mathbf{x}_i)\) is not a sample. It is an observation, unit, or individual. The data set \(\{(y_i, d_i, \mathbf{x}_i): i=1,\dots,N\}\) does not contain "N samples". The entire data set is a single sample of size \(N\). This distinction is important for understanding sampling distributions, especially in more complex cases.
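A minimal sketch of random sampling (the superpopulation distributions below are illustrative assumptions, not from the lesson): each draw from the superpopulation \((Y, D, \mathbf{X})\) produces one observation, and the \(N\) observations together form one sample.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5  # sample size: ONE sample containing N observations

# Illustrative superpopulation: D ~ Bernoulli(0.5), X ~ Normal(0, 1),
# and Y depends on D and X plus noise
d = rng.binomial(1, 0.5, size=N)
x = rng.normal(size=N)
y = 1.0 + 0.5 * d + x + rng.normal(size=N)

# Each row printed below is one observation (unit, individual);
# the whole table is a single sample of size N
for i in range(N):
    print(f"i={i + 1}: y={y[i]:.2f}, d={d[i]}, x={x[i]:.2f}")
```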


Point estimators

Estimators are functions of random variables that we use to estimate a population parameter. For example, if we want to estimate the population parameter \(E[Y]\), a natural estimator is the sample mean of \(Y_i\). Conventionally, we use a bar symbol to denote the sample mean, so \(\bar{Y}=\tfrac{1}{N}\sum_{i=1,\ldots,N}{Y_i}\). Another convention is to use a "hat" symbol to indicate an estimator. For example, if our parameter is \(\mu = E[Y]\) then we could denote the estimator of the parameter by \(\hat{\mu} = \bar{Y}\).

Note that we use uppercase because estimators are functions of random variables. Therefore, estimators are themselves random variables with distributions, which we study to know how well the estimator performs. When we think of a particular, fixed sample of data, the corresponding quantity \(\bar{y}=\tfrac{1}{N}\sum_{i=1,\ldots,N}{y_i}\) is called an estimate and has a fixed value. Finally, you may also hear the term estimand, which refers to the parameter we want to estimate. In this example the estimand is the population mean \(E[Y]\).

Another way to think of this: An estimator is a procedure, and an estimate is the result of the procedure.
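The estimand/estimator/estimate distinction can be made concrete in code. In this sketch (the DGP and its parameter values are illustrative assumptions), the estimand is \(\mu = E[Y]\), the estimator is a procedure, and the estimate is the fixed number the procedure returns for one sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimand: mu = E[Y]. Here the illustrative DGP is Y ~ Normal(2, 1),
# so the true value is mu = 2 (unknown in practical cases)
mu = 2.0

def mu_hat(sample):
    """The estimator: a procedure mapping any sample to a number."""
    return sample.mean()

# Draw one particular sample and apply the procedure:
# the result is one estimate, a fixed value
y = rng.normal(mu, 1.0, size=1000)
estimate = mu_hat(y)
print(estimate)  # a fixed number, close to 2.0
```

Drawing a new sample and re-running `mu_hat` would give a different estimate; the distribution of those results across samples is the distribution of the estimator.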

Standard errors

As mentioned above, an estimator is a random variable with a distribution. The standard deviation of an estimator has a special name: standard error. The standard error tells us how precise our estimate is. A smaller standard error means our estimator has a narrower distribution and provides more precise estimates.

If our estimator is the sample mean \(\bar{Y}\), then the standard error is \(\text{sd}(\bar{Y}) = \sqrt{\tfrac{\text{Var}(Y)}{N}}\). Note that the standard error is a function of the parameters of the data-generating process (\(\text{Var}(Y)\)) and the sample size \(N\). Therefore, we typically do not know the true standard error and must instead estimate it. The typical point estimator of the standard error of the mean is \(\hat{\text{sd}}(\bar{Y}) = \sqrt{\tfrac{\hat{\sigma}^2}{N}}\) where \(\hat{\sigma}^2\) is the usual, unbiased estimator of the variance of \(Y\), that is, \(\tfrac{\sum_{i=1}^N (Y_i - \bar{Y})^2}{N-1}\).
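A simulation makes the two sides of this concrete (the DGP below is an illustrative assumption): drawing many samples shows that the spread of \(\bar{Y}\) matches \(\sqrt{\text{Var}(Y)/N}\), while a single sample gives only the estimated standard error.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50          # sample size
reps = 20_000   # number of simulated samples

# Illustrative DGP: Y ~ Normal(0, 4), so Var(Y) = 4
sigma2 = 4.0
true_se = np.sqrt(sigma2 / N)  # sd(Ybar) = sqrt(Var(Y) / N)

# Simulate the sampling distribution of the sample mean:
# one mean per simulated sample of size N
means = rng.normal(0.0, np.sqrt(sigma2), size=(reps, N)).mean(axis=1)
print(means.std(), true_se)  # the two should be close

# In practice we see ONE sample, so we estimate the standard error
# using the unbiased variance estimator (ddof=1 divides by N - 1)
y = rng.normal(0.0, np.sqrt(sigma2), size=N)
se_hat = np.sqrt(y.var(ddof=1) / N)
print(se_hat)  # an estimate of true_se from a single sample
```

The simulated spread of the means approximates the true standard error, while `se_hat` is the practical quantity reported alongside an estimate.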