
# Linear Regression

## Matrix Notation

For linear regression in matrix form:

$Y = X\beta+\epsilon$

The intercept is included in $\beta$ if we append a column of 1s to $X$.

\begin{aligned} X = & \begin{pmatrix} x_1 & x_2 & \cdots & x_N \end{pmatrix}^{T} \\ = & \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \\ = & \begin{pmatrix} x_{ 11 } & x_{ 12 } & \cdots & x_{ 1p } \\ x_{ 21 } & x_{ 22 } & \cdots & x_{ 2p } \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}_{N\times p} \end{aligned}
$Y=\begin{pmatrix}y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$
$W=\begin{pmatrix}w_1 \\ w_2 \\ \vdots \\ w_p \end{pmatrix}$

SSR (Sum of Squared Residuals):

$W^TX^TY$ and $Y^TXW$ are scalars, each equal to its own transpose.

Because $\epsilon$ is a vector of residuals, the sum of squared residuals can be written as $\epsilon^T\epsilon$.

\begin{aligned} L = & \epsilon^T\epsilon=(W^TX^T-Y^T)(XW-Y)\\ =&W^TX^TXW-W^TX^TY-Y^TXW+Y^TY\\ =&W^TX^TXW-2W^TX^TY+Y^TY \end{aligned}

Taking the partial derivative with respect to $W$ gives $L^\prime(W) = 2X^TXW-2X^TY$. Setting $L^\prime(W) = 0$ yields $W = (X^TX)^{-1}X^TY$.
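The closed-form solution above can be sketched in NumPy. The data here is synthetic (all names and values are illustrative); in practice $X$ would hold your features with a column of 1s appended for the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
# Feature matrix with an intercept column of 1s appended.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
true_w = np.array([1.0, 2.0, -0.5])
Y = X @ true_w + 0.1 * rng.normal(size=N)

# Solve the normal equations (X^T X) W = X^T Y.
# np.linalg.solve is preferred over forming the inverse explicitly.
W = np.linalg.solve(X.T @ X, X.T @ Y)
print(W)  # close to [1.0, 2.0, -0.5]
```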

## Geometric Perspective

$X\beta$ is a linear combination of the columns of $X$, and the residual $Y-X\beta$ is orthogonal to the column space of $X$: it is the component of $Y$ that the features cannot explain. Requiring $X^T(Y-X\beta)=0$ gives:

\begin{aligned} X^TY=&X^TX\beta \\ \beta=&(X^TX)^{-1}X^TY \end{aligned}
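The orthogonality condition can be checked numerically. This is a minimal sketch with random synthetic data: after solving for $\beta$, $X^T(Y-X\beta)$ should vanish up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=50)

# Least-squares coefficients via the normal equations.
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# The residual is orthogonal to every column of X.
residual = Y - X @ beta
print(X.T @ residual)  # ~ [0, 0, 0] up to floating-point error
```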

## Probabilistic Perspective

$\epsilon \sim N(0, \sigma^2)$

$y = w^{T}x+\epsilon$

$y|x;w \sim N(w^{T}x, \sigma^2)$

$P(y|x;w)=\frac{1}{\sqrt{2\pi}\sigma}\exp \{-\frac{(y-w^Tx)^2}{2\sigma^2}\}$

\begin{aligned} L(w)&=\log P(Y|X;w)=\log \prod_{i=1}^{N}P(y_i|x_{i};w)=\sum_{i=1}^{N}\log P(y_i|x_{i};w)\\ &=\sum_{i=1}^{N} \left(\log \frac{1}{\sqrt{2\pi}\sigma} -\frac{1}{2\sigma^2}(y_i-w^Tx_i)^2\right) \end{aligned}
$\hat{w} =\operatorname{arg}\underset{w}{\operatorname{max}}\,L(w)=\operatorname{arg}\underset{w}{\operatorname{min}}\sum\limits_{i=1}^N(y_i-w^Tx_i)^2$

So maximum likelihood estimation under Gaussian noise recovers least squares.
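The equivalence of MLE and least squares can be checked numerically. This sketch (synthetic data, $\sigma$ fixed to an arbitrary value since it does not change the maximizer in $w$) minimizes both objectives and compares the results.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
y = X @ np.array([0.7, -1.2]) + 0.3 * rng.normal(size=80)

sigma = 1.0  # any fixed value; the maximizer in w is unchanged

# Negative Gaussian log-likelihood, matching the formula above.
def neg_log_lik(w):
    return -np.sum(np.log(1 / (np.sqrt(2 * np.pi) * sigma))
                   - (y - X @ w) ** 2 / (2 * sigma ** 2))

# Sum of squared residuals.
def ssr(w):
    return np.sum((y - X @ w) ** 2)

w_mle = minimize(neg_log_lik, np.zeros(2)).x
w_ols = minimize(ssr, np.zeros(2)).x
print(np.allclose(w_mle, w_ols, atol=1e-3))  # True: same minimizer
```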

## Ridge Regression

$W = (X^TX+\lambda I)^{-1}X^TY$. A positive semi-definite matrix plus a positive multiple of the identity is positive definite, hence always invertible.
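A minimal sketch of this, with synthetic data deliberately constructed so that $X^TX$ is singular (two identical columns): the plain normal equations would fail, but the $\lambda I$ term makes the solve succeed.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
# Rank-deficient design: column 0 equals column 1, so X^T X is singular.
X = np.column_stack([x, x, rng.normal(size=50)])
Y = rng.normal(size=50)

lam = 0.1  # illustrative regularization strength
W = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)
print(W)  # finite coefficients despite the singular X^T X
```

By symmetry, ridge splits the weight equally between the two duplicated columns.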

## Bayesian Perspective

MAP estimation with a Gaussian prior on $W$ yields ridge regression:

$\hat W_{MAP}=\operatorname{arg}\underset{W}{\operatorname{min}}\sum\limits_{i=1}^N(y_i-W^Tx_i)^2+\frac{\sigma^2}{\sigma_0^2}||W||_2^2$

$\sigma^2$ is the variance of the noise.

$\sigma_0^2$ is the variance of the prior on $W$.

## Q & A

### What is regression? Which models can you use to solve a regression problem?

Regression means "going backwards": we assume there is an unknown model that generates the data we observed in the given dataset, and we use the observed dataset to estimate that model. Regression is a statistical method used to find the relationship between a set of independent variables and a dependent variable. Regression analysis can handle many things. For example, you can use regression analysis to do the following:

- Model multiple independent variables
- Include continuous and categorical variables
- Use polynomial terms to model curvature
- Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable

### How to remove the effect of confounding?

You want to isolate the effect of each variable.

reference:

- [Confounding Variables Can Bias Your Results](https://statisticsbyjim.com/regression/confounding-variables-bias/)

### What's the normal distribution? Why do we care about it?

In two words, the "Bell Curve" is a **Normal Distribution**.

- mean = median = mode
- symmetric about the center
- 50% of values less than the mean and 50% greater than the mean

**Central Limit Theorem**: as you take more samples, especially large ones, your graph of the sample means will look more like a normal distribution. The **essential component** of the Central Limit Theorem is that the average of your sample means will approach the population mean. The normal distribution is easy for mathematical statisticians to work with; many kinds of statistical tests assume normal distributions.

### How do we check if a variable follows the normal distribution?

QQ-Plot

### What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?
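The QQ-plot check mentioned above can be sketched without plotting: `scipy.stats.probplot` returns the theoretical normal quantiles against the sorted sample, plus the correlation $r$ of a straight-line fit. An $r$ close to 1 is consistent with normality; the skewed sample here (exponential, as prices often are) is illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)  # skewed, like many price datasets

# probplot returns ((theoretical quantiles, ordered sample), (slope, intercept, r)).
(_, _), (_, _, r_normal) = stats.probplot(normal_sample)
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample)
print(r_normal, r_skewed)  # r_normal near 1, r_skewed noticeably lower
```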
### What are the methods for solving linear regression do you know?

- Ordinary Least Squares
- Gradient Descent
- Maximum Likelihood Estimation
- Geometric

### What is the normal equation?

$(X^TX)\hat\beta=X^Ty$

$X^TX$ is known as the **Gramian matrix** of $X$, which is a **positive semi-definite matrix**. $X^Ty$ is known as the **moment matrix**.

### What is SGD (stochastic gradient descent)? What's the difference with the usual gradient descent?

GD updates the parameters after running through all the samples, while SGD updates on a subset of the data (a minibatch).

### Which metrics for evaluating regression models do you know?

- MSE (Mean Squared Error): $MSE=\frac{1}{n}\sum\limits_{i=1}^n(y_i-\hat y_i)^2$
- RMSE: square root of the MSE
- MASE (Mean Absolute Scaled Error)
- RMSSE (Root Mean Squared Scaled Error)
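The minibatch SGD and MSE/RMSE ideas above can be sketched together. This is a minimal illustration on synthetic data; the learning rate, batch size, and epoch count are arbitrary choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(3)
lr, batch_size = 0.05, 20  # illustrative hyperparameters
for epoch in range(200):
    idx = rng.permutation(len(y))  # shuffle each epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        # Gradient of the squared loss on the minibatch only.
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

mse = np.mean((y - X @ w) ** 2)
rmse = np.sqrt(mse)
print(w, mse, rmse)  # w near [1.0, 2.0, -1.0]
```

Plain gradient descent would compute `grad` on all 200 rows per step; here each step sees only 20.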