Chapter 2. Simple Linear Regression
- The simple linear regression model consists of the mean function and the variance function.
- SLR mean function: E(Y|X = x) = β0 + β1x
- The values of the parameters are usually unknown and must be estimated using data.
- SLR variance function: Var(Y|X = x) = σ²
- In SLR, the variance function is assumed to be constant, with a positive value of σ² that is usually unknown.
- yi is the observed value of the response for case i, and it will typically not equal its expected value E(Y|X = xi) because σ² > 0.
- yi = E(Y|X = xi) + ei, where ei is the statistical error (an implicit definition of ei).
- ei can be defined explicitly as ei = yi − E(Y|X = xi) = yi − (β0 + β1xi).
- The errors, ei, depend on the unknown parameters in the mean function and so are not observable quantities. They are random variables.
- Assumption on ei: E(ei|xi) = 0 (the statistical errors have mean 0). So if we could draw a scatterplot of the ei versus the xi, we would see a null scatterplot, with no patterns.
- Assumption on ei: the errors are independent of each other.
- Assumption on ei: the errors are assumed to be normally distributed. If the errors are thought to follow some different distribution, such as the Poisson or the binomial, other methods besides OLS may be more appropriate (more in Chapter 12).
- Estimates of parameters are computable functions of the data and are therefore statistics.
- Estimates of parameters are denoted by putting a "hat" over the corresponding Greek letter.
- The fitted value for case i is given by Ê(Y|X = xi) = β̂0 + β̂1xi.
- The computations needed for least squares in SLR depend only on the averages of the variables and their sums of squares and sums of cross-products.
- SXY is based on the sum of cross-products; SXX and SYY are based on sums of squares.
- These sums of squares and cross-products are centered: the average is subtracted from each value before squaring or taking cross-products, as written out below.
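- In symbols (standard definitions): x̄ = Σxi/n, ȳ = Σyi/n, SXX = Σ(xi − x̄)², SYY = Σ(yi − ȳ)², and SXY = Σ(xi − x̄)(yi − ȳ).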
- The “hat” rule described earlier would suggest that different symbols should be used for these quantities; for example, ρ̂xy might be more appropriate for the sample correlation if the population correlation is ρxy. This inconsistency is deliberate since these sample quantities estimate population values only if the data used are a random sample from a population. The random sample condition is not required for regression calculations to make sense, and will often not hold in practice.
- Thus, the interpretation of the correlation coefficient depends on the method of sampling.
- In the heights example, the mother–daughter pairs can be viewed as a sample from a population, so the sample correlation is an estimate of a population correlation.
- The same applies to the interpretation of the sample variance.
- In the Forbes data, the measurements were collected at 17 selected locations, so the sample variance of the boiling points is not an estimate of any meaningful population variance.
- The normal equations for the simple linear regression mean function depend on the data only through the sufficient statistics Σxi, Σyi, Σxi², and Σxiyi.
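- In their standard form, the normal equations are nβ̂0 + β̂1Σxi = Σyi and β̂0Σxi + β̂1Σxi² = Σxiyi.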
- Numerically more stable sufficient statistics are given by x̄, ȳ, SXX, and SXY.
- Solving the normal equations (A.7) using these more stable sufficient statistics gives the OLS estimates β̂1 = SXY/SXX and β̂0 = ȳ − β̂1x̄, as sketched below.
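A minimal Python sketch of this computation, using hypothetical data and variable names of my own choosing (beta0_hat, beta1_hat):

```python
# Sketch: OLS estimates for simple linear regression from the centered
# sufficient statistics xbar, ybar, SXX, SXY (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])   # hypothetical responses

xbar, ybar = x.mean(), y.mean()
SXX = np.sum((x - xbar) ** 2)              # centered sum of squares of x
SXY = np.sum((x - xbar) * (y - ybar))      # centered sum of cross-products

beta1_hat = SXY / SXX                      # slope estimate
beta0_hat = ybar - beta1_hat * xbar        # intercept estimate
print(beta0_hat, beta1_hat)
```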
- Estimating the (error or residual) variance, σ²
- Since the variance σ² is essentially the average squared size of the ei, we should expect that its estimator σ̂² is obtained by averaging the squared residuals êi.
- Under the assumption that the errors are uncorrelated random variables with zero means and common variance σ², an unbiased estimate of σ² is obtained by dividing RSS = Σêi² by its degrees of freedom (df), where residual df = number of cases minus the number of parameters in the mean function.
- RSS is the residual sum of squares. The unbiased estimate of σ², RMS = RSS/df, is called the residual or error mean square.
- The square root of the residual mean square, √RMS, is called the standard error of regression; it is in the same units as the response variable.
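- In SLR, the residual df = n − 2, so σ̂² = RSS/(n − 2) and the standard error of regression is σ̂ = √(RSS/(n − 2)).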
- With the added assumption that the errors ei are normally distributed, σ̂² (the RMS) will be distributed as a multiple of a chi-squared variable with df = n − 2. This is used to obtain the distribution of test statistics and also to make confidence statements concerning σ².
- σ̂² is an unbiased estimate of σ² whether or not the errors are normally distributed; normality is not required for this result to hold.
- Expectations throughout this chapter condition on X to remind us that X is treated as fixed and the expectation is over the conditional distribution of Y|X, or equivalently of the conditional distribution of e|X.
- Properties of Least Squares Estimates
- The OLS estimates depend on the data only through the sufficient statistics, which makes computation easy. A limitation is that any data sets with the same sufficient statistics, including clearly nonlinear ones, yield the same fitted regression; hence it is essential to visualize the data.
- The estimates β̂0 and β̂1 can be written as linear combinations of the yi, i = 1, . . . , n.
- The fitted line passes through the center of the data, (x̄, ȳ).
- For a mean function with an intercept included, Σêi = 0; without an intercept, Σêi need not equal 0.
- The variances of the estimators
- Because the parameter estimates depend on the random errors ei, the estimates are themselves random variables.
- If all the ei have zero mean, then the least squares estimates are unbiased.
- The variances of the estimators, assuming Var(ei|X) = σ², i = 1, . . . , n, and Cov(ei, ej|X) = 0 for i ≠ j, are given below.
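- Standard results: Var(β̂1|X) = σ²/SXX, Var(β̂0|X) = σ²(1/n + x̄²/SXX), and Cov(β̂0, β̂1|X) = −σ²x̄/SXX.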
- Because β̂0 depends on β̂1, the two estimates are correlated.
- The means, variances, and covariances of the estimated regression coefficients do not require a distributional assumption concerning the errors. Since the estimates are linear combinations of the yi, and hence linear combinations of the errors ei, the central limit theorem shows that the coefficient estimates will be approximately normally distributed if the sample size is large enough.
- For smaller samples, if the errors e = y − E(Y|X = x) are independent and normally distributed, written in symbols as ei|X ∼ NID(0, σ²), i = 1, . . . , n, then the regression estimates β̂0 and β̂1 will have a joint normal distribution with the means, variances, and covariance given above.
- When the errors are normally distributed, the OLS estimates can be justified by a completely different argument, since they are then also the maximum likelihood estimates.
- Estimated Variances
- The square root of an estimated variance is called a standard error, for which we use the symbol se( ).
- The terms standard error and standard deviation are sometimes used interchangeably.
- In this book, an estimated standard deviation always refers to the variability between values of an observable random variable like the response yi, or of an unobservable random variable like the errors ei.
- The term standard error will always refer to the square root of the estimated variance of a statistic like a mean ybar, or a regression coefficient ˆ β1.
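- For the regression coefficients, the standard errors take the standard forms se(β̂1) = σ̂/√SXX and se(β̂0) = σ̂√(1/n + x̄²/SXX).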
- CONFIDENCE INTERVALS
- When the errors are NID(0, σ²), parameter estimates, fitted values, and predictions will be normally distributed because all of these are linear combinations of the yi and hence of the ei.
- Confidence intervals and tests can be based on the t-distribution, which is the appropriate distribution when the estimates are normally distributed but σ̂² is used in place of the unknown variance σ².
- For the slope, the test of interest is of NH: β1 = β1* versus AH: β1 ≠ β1* (often β1* = 0), based on t = (β̂1 − β1*)/se(β̂1) with n − 2 df.
- A 95% confidence interval for the slope, or for any of the partial slopes in multiple regression, is the set of β1 such that β̂1 − t(0.025, n − 2)·se(β̂1) ≤ β1 ≤ β̂1 + t(0.025, n − 2)·se(β̂1).
- For the intercept, the test of interest is of NH: β0 = β0* versus AH: β0 ≠ β0*, based on t = (β̂0 − β0*)/se(β̂0).
- A (1 − α) × 100% confidence interval for the intercept is the set of points β0 in the interval β̂0 − t(α/2, n − 2)·se(β̂0) ≤ β0 ≤ β̂0 + t(α/2, n − 2)·se(β̂0); a sketch of computing such intervals follows.
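A small sketch of computing these t-based intervals in Python; the estimates, standard errors, and sample size below are hypothetical stand-ins for values obtained from a fitted SLR model:

```python
# Sketch: t-based confidence intervals for the slope and intercept.
from scipy.stats import t

n = 17                                  # hypothetical number of cases
beta1_hat, se_beta1 = 0.90, 0.01        # hypothetical slope estimate and its se
beta0_hat, se_beta0 = -42.0, 3.3        # hypothetical intercept estimate and its se

alpha = 0.05
tval = t.ppf(1 - alpha / 2, n - 2)      # t multiplier with n - 2 df

ci_slope = (beta1_hat - tval * se_beta1, beta1_hat + tval * se_beta1)
ci_intercept = (beta0_hat - tval * se_beta0, beta0_hat + tval * se_beta0)
print(ci_slope, ci_intercept)
```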
- The estimated mean function can be used to obtain values of the response for given values of the predictor. The two important variants of this problem are prediction and estimation of fitted values. Since prediction is more important, we discuss it first.
- Prediction
- In prediction we have a new case, possibly a future value, not one used to estimate parameters, with observed value of the predictor x*.
- We would like to know the value y*, the corresponding response, but it has not yet been observed.
- Given this additional assumption (that the fitted model also applies to the new case), a point prediction of y*, say ỹ*, is just ỹ* = β̂0 + β̂1x*.
- ỹ* predicts the as yet unobserved y*.
- Assuming the model is correct, the true value of y* is y* = β0 + β1x* + e*, where e* is the random error attached to the future value, presumably with variance σ².
- Thus, even if β0 and β1 were known exactly, predictions would not match true values perfectly, but would be off by a random amount with standard deviation σ. In the more usual case where the coefficients are estimated, the prediction error variability will have a second component that arises from the uncertainty in the estimates of the coefficients.
- Combining these two sources of variation gives the prediction variance in (2.16), shown below.
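- In its standard form, (2.16) is Var(ỹ*|x*) = σ² + σ²(1/n + (x* − x̄)²/SXX).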
- The first σ² on the right of (2.16) corresponds to the variability due to e*; the remaining term is the error from estimating the coefficients.
- If x* is similar to the xi used to estimate the coefficients, then the second term will generally be much smaller than the first term.
- If x* is very different from the xi used in estimation, the second term can dominate.
- Taking square roots of both sides of (2.16) and estimating σ² by σ̂², we get the standard error of prediction (sepred) at x*, given below.
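- Standard form: sepred(ỹ*|x*) = σ̂√(1 + 1/n + (x* − x̄)²/SXX).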
- A prediction interval uses multipliers from the t-distribution with df equal to the df used in estimating σ².
- Estimation of Fitted Values
- In rare problems, one may be interested in obtaining an estimate of E(Y|X = x*).
- This quantity is estimated by the fitted value ŷ = β̂0 + β̂1x*, and its standard error is given below.
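- Standard form: sefit(ŷ|x*) = σ̂√(1/n + (x* − x̄)²/SXX).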
- To obtain confidence intervals, it is more usual to compute a simultaneous interval for all possible values of x.
- This is the same as first computing a joint confidence region for β0 and β1, and from these, computing the set of all possible mean functions with slope and intercept in the joint confidence set.
- The confidence region for the mean function is the set of all y such that the inequality below holds for every x.
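- In its standard (Working-Hotelling) form, this is the set of all y with |y − (β̂0 + β̂1x)| ≤ √(2F(α; 2, n − 2)) · sefit(ŷ|x), for every value of x.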
- This formula uses an F-distribution with 2 and n − 2 df in place of the t distribution to correct for the simultaneous inference about two estimates rather than just one.
- For multiple regression, replace 2F(α; 2, n − 2) by p′F(α; p′, n − p′), where p′ is the number of parameters estimated in the mean function including the intercept.
- The prediction intervals are much wider than the confidence intervals, because they also include the variability of the new error e*.
- THE COEFFICIENT OF DETERMINATION, R²
- R² is a scale-free one-number summary of the strength of the relationship between the xi and the yi in the data.
- It generalizes nicely to multiple regression, depends only on the sums of squares, and appears to be easy to interpret; the standard definition is given below.
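- Standard definition: R² = SSreg/SYY = 1 − RSS/SYY, where SSreg = SYY − RSS is the sum of squares due to regression; in SLR, R² is also the square of the sample correlation between x and y.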
- Adjusted R² differs from (2.21) by adding a correction for the df of the sums of squares, which can facilitate comparing models in multiple regression; there are better ways of making this comparison, discussed in Chapter 10.
- Residuals
- The most common plot, especially useful in simple regression, is the plot of residuals versus the fitted values.
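A minimal sketch of such a plot in Python, using hypothetical data; in a well-specified model the points should scatter evenly around zero with no pattern:

```python
# Sketch: residuals versus fitted values for a simple linear regression.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])  # hypothetical response

xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar

fitted = beta0_hat + beta1_hat * x               # fitted values
residuals = y - fitted                           # residuals

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")                   # reference line at zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```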