# Linear Regression

## Matrix Notation

For linear regression in matrix form:

$$Y = X\beta + \epsilon$$

The intercept is included in $\beta$ if we append a column of ones to $X$.

$$\begin{aligned} X &= \begin{pmatrix} x_1 & x_2 & \cdots & x_N \end{pmatrix}^{T} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}_{N\times p} \end{aligned}$$
$$Y=\begin{pmatrix}y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \qquad W=\begin{pmatrix}w_1 \\ w_2 \\ \vdots \\ w_p \end{pmatrix}$$
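In code, appending the column of ones so that the intercept is absorbed into the coefficient vector looks like the following NumPy sketch (the data here is randomly generated purely for illustration):

```python
import numpy as np

# Hypothetical toy data: N = 5 samples with 2 raw features each.
rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 2))

# Prepend a column of ones; the first coefficient then acts as the intercept.
X = np.hstack([np.ones((raw.shape[0], 1)), raw])

print(X.shape)  # (5, 3): each row is one sample x_i^T
```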

SSR (Sum of Squared Residuals):

$W^TX^TY$ and $Y^TXW$ are both scalars; each is the transpose of the other, and a scalar equals its own transpose, so the two terms are equal.

Because $\epsilon$ is a vector, the sum of squared residuals $\sum_i \epsilon_i^2$ can be written as $\epsilon^T\epsilon$.

$$\begin{aligned} L &= \epsilon^T\epsilon=(W^TX^T-Y^T)(XW-Y)\\ &= W^TX^TXW-W^TX^TY-Y^TXW+Y^TY\\ &= W^TX^TXW-2W^TX^TY+Y^TY \end{aligned}$$

Taking the partial derivative with respect to $W$ gives $L^\prime(W) = 2X^TXW-2X^TY$. Setting $L^\prime(W) = 0$ yields $W = (X^TX)^{-1}X^TY$.
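A minimal sketch of this closed-form solution, using synthetic data and cross-checking against NumPy's built-in least-squares routine:

```python
import numpy as np

# Synthetic data: Y is a linear function of X plus small noise.
rng = np.random.default_rng(1)
N, p = 50, 3
X = rng.normal(size=(N, p))
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + 0.01 * rng.normal(size=N)

# Closed form W = (X^T X)^{-1} X^T Y; np.linalg.solve is preferred
# over forming the explicit inverse.
W = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(W, W_lstsq))  # True
```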

## Geometric Perspective

$X\beta$ is a linear combination of the columns of $X$, and the residual $Y-X\beta$ is the component of $Y$ normal to the feature space: it is orthogonal to the column space of $X$, so $X^T(Y-X\beta)=0$.

$$\begin{aligned} X^TY&=X^TX\beta \\ \beta&=(X^TX)^{-1}X^TY \end{aligned}$$
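The orthogonality claim can be checked numerically on arbitrary (here randomly generated) data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
Y = rng.normal(size=20)

# Least-squares fit via the normal equation.
beta = np.linalg.solve(X.T @ X, X.T @ Y)
residual = Y - X @ beta

# The residual is orthogonal to every column of X: X^T (Y - X beta) = 0.
print(np.allclose(X.T @ residual, 0))  # True
```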

## Probabilistic Perspective

$$\epsilon \sim N(0, \sigma^2)$$

$$y = w^{T}x+\epsilon$$

$$y|x;w \sim N(w^{T}x, \sigma^2)$$

$$P(y|x;w)=\frac{1}{\sqrt{2\pi}\sigma}\exp \left\{-\frac{(y-w^Tx)^2}{2\sigma^2}\right\}$$

$$\begin{aligned} L(w)&=\log P(Y|X;w)=\log \prod_{i=1}^{N}P(y_i|x_{i};w)=\sum_{i=1}^{N}\log P(y_i|x_{i};w)\\ &=\sum_{i=1}^{N} \left(\log \frac{1}{\sqrt{2\pi}\sigma} -\frac{1}{2\sigma^2}(y_i-w^Tx_i)^2\right) \end{aligned}$$

$$\hat{w} = \operatorname{arg}\underset{w}{\operatorname{max}}\,L(w) = \operatorname{arg}\underset{w}{\operatorname{min}}\sum_{i=1}^{N}(y_i-w^Tx_i)^2$$

So maximum likelihood estimation under Gaussian noise is equivalent to least squares.
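Since maximizing the Gaussian log-likelihood is minimizing the squared error, plain gradient descent on the squared error should converge to the normal-equation solution. A minimal sketch on synthetic data (learning rate and iteration count are arbitrary choices for this example):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 2
X = rng.normal(size=(N, p))
Y = X @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=N)

# Gradient of sum_i (y_i - w^T x_i)^2 with respect to w is 2 X^T (Xw - Y).
w = np.zeros(p)
lr = 1e-3
for _ in range(2000):
    w -= lr * 2 * X.T @ (X @ w - Y)

# The iterates converge to the closed-form least-squares solution.
w_closed = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(w, w_closed))  # True
```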


Ridge regression:

$$W = (X^TX+\lambda I)^{-1}X^TY$$

A positive semi-definite matrix plus a positive diagonal matrix $\lambda I$ ($\lambda > 0$) is always invertible, so the ridge solution exists even when $X^TX$ is singular.
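A sketch of the ridge closed form on random data ($\lambda = 1$ is an arbitrary choice here), which also illustrates that ridge shrinks the coefficients relative to OLS:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=30)
lam = 1.0

# X^T X is positive semi-definite; adding lam * I (lam > 0) makes it
# strictly positive definite, hence invertible.
W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)

# Ridge shrinks the coefficient vector compared with plain OLS.
W_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.linalg.norm(W_ridge) < np.linalg.norm(W_ols))  # True
```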


MAP (maximum a posteriori) estimation with a Gaussian prior on $W$ yields ridge regression:

$$\hat W_{MAP}=\operatorname{arg}\underset{W}{\operatorname{min}}\sum\limits_{i=1}^N(y_i-W^Tx_i)^2+\frac{\sigma^2}{\sigma_0^2}||W||_2^2$$

where $\sigma^2$ is the variance of the noise and $\sigma_0^2$ is the variance of the prior on $W$.

## Q & A

### What is regression? Which models can you use to solve a regression problem?

Regression means "going backwards": we assume there is an unknown model that generated the data we observe in the given dataset, and we use the observed dataset to estimate the model behind it. Regression is a statistical method used to find the relationship between a set of independent variables and a dependent variable. Regression analysis can handle many things. For example, you can use it to:

- Model multiple independent variables
- Include continuous and categorical variables
- Use polynomial terms to model curvature
- Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable

### How to remove the effect of confounding?

You want to isolate the effect of each variable.

Reference:

- Confounding Variables Can Bias Your Results

### What's the normal distribution? Why do we care about it?

In two words, the "Bell Curve" is the **Normal Distribution**:

- mean = median = mode
- symmetric about the center
- 50% of values are less than the mean and 50% are greater than the mean

**Central Limit Theorem**: as you take more samples, especially large ones, the distribution of the sample means looks more and more like a normal distribution. The **essential component** of the Central Limit Theorem is that the average of your sample means approaches the population mean.

The normal distribution is easy for mathematical statisticians to work with; many statistical tests assume normally distributed data.

### How do we check if a variable follows the normal distribution?

QQ-plot.

### What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?

### What are the methods for solving linear regression do you know?
- Ordinary Least Squares (normal equation)
- Gradient Descent
- Maximum Likelihood Estimation
- Geometric (projection) approach

### What is the normal equation?

$$X^TXW = X^TY$$

Solving it gives the closed-form least-squares solution $W = (X^TX)^{-1}X^TY$.


$X^TX$ is known as the **Gramian matrix** of $X$, which is a **positive semi-definite matrix**. $X^Ty$ is known as the **moment matrix**.

### What is SGD (stochastic gradient descent)? What's the difference with the usual gradient descent?

GD updates the parameters after running through all the samples, while SGD updates on a subset of the data (a minibatch).

### Which metrics for evaluating regression models do you know?

- MSE (Mean Squared Error): $MSE=\frac{1}{n}\sum\limits_{i=1}^n(y_i-\hat y_i)^2$
- RMSE: the square root of the MSE
- MASE (Mean Absolute Scaled Error)
- RMSSE (Root Mean Squared Scaled Error)
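MSE and RMSE are straightforward to compute by hand; a small sketch with made-up predictions:

```python
import numpy as np

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSE: average of the squared errors.
mse = np.mean((y_true - y_pred) ** 2)
# RMSE: same units as the target, so easier to interpret.
rmse = np.sqrt(mse)

print(mse)  # 0.875
```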