Ordinary Linear Regression (Linear Regression)

General Form

f(x)=w_1x_1+...+w_kx_k+b

The parameters to be optimized are the weights $w_1,\dots,w_k$ and the bias $b$; the model parameters are usually estimated with the method of least squares.

It is generally written in vector form:

f(x)=w^Tx+b

where $w=(w_1,...,w_k)^T$ and $x=(x_1,...,x_k)^T$.

Optimization: The Least Squares Method

The goal is to find a set of parameters $w,b$ that minimizes the mean squared error (MSE) between the predictions and the targets over all inputs.
Define $E$ as:

E = \sum_i^n(f(x_i)-y_i)^2

(Strictly speaking this $E$ is the sum of squared errors; dividing by $n$ gives the MSE, but both have the same minimizer.)

The optimization objective is

w^*,b^*=\argmin_{w,b} E
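To make the objective concrete, here is a minimal NumPy sketch (the toy data and the parameter values tried are invented for illustration):

```python
import numpy as np

def sse(w, b, x, y):
    """E = sum_i (f(x_i) - y_i)^2 for the one-dimensional model f(x) = w*x + b."""
    return np.sum((w * x + b - y) ** 2)

# Invented toy data: y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.shape)

print(sse(2.0, 1.0, x, y))  # near-optimal parameters -> small E
print(sse(0.0, 0.0, x, y))  # poor parameters -> large E
```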

Least Squares Derivation for Univariate Linear Regression

EEf(x)f(x)yy 相交时为 00,而交点左右逐渐增大,显然 EE 为一个凸函数

因为 EE 是凹函数,其导数为 0 时 EE 刚好为最小值,所以为了求一组 wwbb 使得 EE 最小可以通过对 EE 分别求 wwbb 的偏导并置 00 得到闭式解:

\begin{aligned} \frac{\partial E}{\partial w} &= \frac{\partial}{\partial w}[\sum_i^n(wx_i+b-y_i)^2] \\ & = \sum_i^n\frac{\partial}{\partial w}[(wx_i+b-y_i)^2]\\ & = \sum_i^n[2\cdot(wx_i+b-y_i)(x_i)] \\ &=2\cdot(w\sum_i^nx^2_i-\sum_i^nx_iy_i+b\sum_i^nx_i)\\ &=2\cdot(w\sum_i^nx^2_i-\sum_i^n(y_i-b)x_i) \end{aligned} \tag{1}

\begin{aligned} \frac{\partial E}{\partial b} & = \sum_i^n\frac{\partial}{\partial b}[(wx_i+b-y_i)^2]\\ & = \sum_i^n[2\cdot(wx_i+b-y_i)] \\ & = 2\cdot(nb - \sum_i^n(y_i-wx_i)) \end{aligned} \tag{2}

Setting equation $(1)$ to $0$:

\begin{aligned} 0 &= 2\cdot(w\sum_i^nx^2_i-\sum_i^n(y_i-b)x_i)\\ 0 &= w\sum_i^nx^2_i-\sum_i^n(y_i-b)x_i\\ w\sum_i^nx^2_i &= \sum_i^n(y_i-b)x_i \end{aligned} \tag{3}

Setting equation $(2)$ to $0$:

\begin{aligned} 0 & = 2\cdot(nb - \sum_i^n(y_i-wx_i))\\ b &= \frac{1}{n}\sum_i^n(y_i)-w(\frac{1}{n}\sum_i^nx_i)\\ b &= \bar{y}-w\bar{x} \end{aligned} \tag{4}

(4)(4) 代入 (3)(3)

\begin{aligned} w \sum_i^n x^2_i &= \sum_i^n (y_i-(\bar{y}-w \bar{x}))x_i \\ w\sum_i^n x^2_i &= \sum_i^n(x_iy_i-\bar{y}x_i+w\bar{x}x_i) \\ w\sum_i^n x^2_i &= \sum_i^n x_i y_i-\bar{y}\sum_i^n x_i+w\bar{x} \sum_i^n x_i \\ w(\sum_i^n x^2_i - \bar{x} \sum_i^n x_i) &= \sum_i^n x_i y_i-\bar{y}\sum_i^n x_i \\ w &= \frac{\sum_i^n x_i y_i-\bar{y}\sum_i^n x_i}{\sum_i^n x^2_i - \bar{x} \sum_i^n x_i} \\ w &= \frac{\sum_i^n x_i y_i-\frac{1}{n}\sum_i^n y_i \sum_i^n x_i}{\sum_i^n x^2_i - \frac{1}{n}\sum_i^n x_i \sum_i^n x_i} \\ w &= \frac{\sum_i^n x_i y_i-\bar{x}\sum_i^n y_i}{\sum_i^n x^2_i - \frac{1}{n}(\sum_i^n x_i)^2} \\ w &= \frac{\sum_i^n y_i (x_i-\bar{x})}{\sum_i^n x^2_i - \frac{1}{n}(\sum_i^n x_i)^2} \end{aligned} \tag{5}
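Transcribed directly into NumPy, equations $(4)$ and $(5)$ might look like the following sketch (the helper name `fit_simple` is hypothetical):

```python
import numpy as np

def fit_simple(x, y):
    """Closed-form least squares for f(x) = w*x + b via equations (4) and (5)."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    # (5): w = sum_i y_i (x_i - x_bar) / (sum_i x_i^2 - (sum_i x_i)^2 / n)
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    # (4): b = y_bar - w * x_bar
    b = y_bar - w * x_bar
    return w, b
```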

(5)(5) 向量化以能够使用矩阵运算加速库

Detailed derivation:

\begin{aligned} w &= \frac{\sum_i^n x_i y_i-\bar{y}\sum_i^n x_i}{\sum_i^n x^2_i - \bar{x} \sum_i^n x_i} \\ w &= \frac{\sum_i^n x_i y_i-\bar{y} (n \cdot \frac{1}{n} \sum_i^n{x_i})}{\sum_i^n x^2_i - \bar{x} (n\cdot \frac{1}{n}\sum_i^n x_i)} \\ w &= \frac{\sum_i^n x_i y_i-\bar{y} (n \bar{x})}{\sum_i^n x^2_i - \bar{x} (n \bar{x})} \\ w &= \frac{\sum_i^n x_i y_i-n\bar{x}\bar{y}}{\sum_i^n x^2_i - n \bar{x}^2} \\ w &= \frac{\sum_i^n x_i y_i-\sum_i^n\bar{x}\bar{y}}{\sum_i^n x^2_i - \sum_i^n \bar{x}^2} \\ w &= \frac{\sum_i^n (x_i y_i-(\bar{x}\bar{y}-x_i\bar{y}+\bar{x}y_i))}{\sum_i^n (x^2_i - (\bar{x}^2-x_i\bar{x}+\bar{x}x_i))} \\ w &= \frac{\sum_i^n (x_i y_i-(x_i\bar{y}+\bar{x}y_i-\bar{x}\bar{y}))}{\sum_i^n (x^2_i - (x_i\bar{x}+\bar{x}x_i-\bar{x}^2))} \\ w &= \frac{\sum_i^n (x_iy_i-x_i\bar{y}-\bar{x}y_i+\bar{x}\bar{y})}{\sum_i^n (x^2_i-x_i\bar{x}-\bar{x}x_i+ \bar{x}^2)} \\ w &= \frac{\sum_i^n (x_i(y_i-\bar{y})-\bar{x}(y_i-\bar{y}))}{\sum_i^n (x_i(x_i-\bar{x})-\bar{x}(x_i-\bar{x}))} \\ w &= \frac{\sum_i^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_i^n (x_i-\bar{x})^2} \end{aligned}

xd=(x1xˉ;...;xnxˉ)x_d=(x_1-\bar{x};...;x_n-\bar{x}),为去均值后的 xxyd=(y1yˉ;...;ynyˉ)y_d=(y_1-\bar{y};...;y_n-\bar{y}) 为去均值后的 yy,代入上式:

w=\frac{x_d^Ty_d}{x_d^T x_d}
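A sketch of this vectorized form, sanity-checked against NumPy's own least-squares fit on invented data:

```python
import numpy as np

def fit_simple_vectorized(x, y):
    """w = (x_d^T y_d) / (x_d^T x_d) with mean-centered vectors; b from equation (4)."""
    x_d = x - x.mean()            # de-meaned x
    y_d = y - y.mean()            # de-meaned y
    w = (x_d @ y_d) / (x_d @ x_d)
    b = y.mean() - w * x.mean()
    return w, b

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x - 2.0 + rng.normal(0, 0.1, size=100)
print(fit_simple_vectorized(x, y))  # roughly (3.0, -2.0)
print(np.polyfit(x, y, 1))          # [slope, intercept], for comparison
```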

Derivation for Multivariate Linear Regression

For ease of discussion, let

X= \left(\begin{matrix} x_{11} & ... & x_{1d} & 1\\ ... & ... & ... & ... \\ x_{n1} & ... & x_{nd} & 1 \end{matrix}\right) = \left(\begin{matrix} x_1^T & 1\\ ... & ... \\ x_n^T & 1 \end{matrix}\right) = \left(\begin{matrix} \hat{x}_1^T\\ ... \\ \hat{x}_n^T \end{matrix}\right)

\hat{w}=(w;b)

\begin{aligned} E&=\sum^n_i(y_i-\hat{x}_i^T \hat{w})^2\\ &=\left[\begin{matrix}y_1-\hat{x}_1^T \hat{w} &...& y_n-\hat{x}_n^T \hat{w}\end{matrix} \right] \left[\begin{matrix}y_1-\hat{x}_1^T \hat{w} \\ ... \\ y_n-\hat{x}_n^T \hat{w} \end{matrix}\right]\\ &=(y-X\hat{w})^T(y-X\hat{w}) \end{aligned}

EE 展开:

E=y^Ty-y^TX\hat{w}-\hat{w}^TX^Ty+\hat{w}^TX^TX\hat{w}

w^\hat{w} 求导得:

\frac{d E}{d\hat{w}}=\frac{d}{d\hat{w}}y^Ty-\frac{d}{d\hat{w}}y^TX\hat{w}-\frac{d}{d\hat{w}}\hat{w}^TX^Ty+\frac{d}{d\hat{w}}\hat{w}^TX^TX\hat{w} \tag{6}

Applying the rules of matrix calculus:

\begin{aligned} \frac{d E}{d\hat{w}}&=0-X^Ty-X^Ty+(X^TX+(X^TX)^T)\hat{w}\\ &=2X^T(X\hat{w}-y) \end{aligned} \tag{7}

The convexity proof is omitted; setting equation $(7)$ to $0$:

\begin{aligned} 0&=2X^T(X\hat{w}-y) \\ X^TX\hat{w}&=X^Ty\\ \hat{w}&=(X^TX)^{-1}X^Ty \end{aligned}

XTXX^TX为非正定矩阵时,可以引入正则项或使用伪逆矩阵,这都会导致出现多个解,使得没法直接求出全局最优解。

Generalized Linear Models

The idea is to let the linear model fit a transformed version of $y$:

\begin{aligned} g(y) &= w^Tx+b\\ y&=g^{-1}(w^Tx+b) \end{aligned}

The link function $g(\cdot)$ should not contain any parameters that need to be optimized.
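For example, taking $g(\cdot)=\ln(\cdot)$ gives a log-linear model $y=e^{w^Tx+b}$, which can be fitted by running ordinary least squares on the transformed targets $\ln y$; a sketch on invented, noiseless data:

```python
import numpy as np

# Log-linear model: ln(y) = w^T x + b, i.e. y = e^{w^T x + b} (requires y > 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(X @ np.array([0.5, -1.0, 2.0]) + 0.3)        # invented targets

Xb = np.hstack([X, np.ones((200, 1))])                  # constant-1 column for b
w_hat, *_ = np.linalg.lstsq(Xb, np.log(y), rcond=None)  # ordinary LS on ln(y)
print(w_hat)  # recovers (0.5, -1.0, 2.0, 0.3)
```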

Logistic Regression (Logit Regression)

Logistic regression is usually used for binary classification. Interpret $y$ as the probability that sample $x$ is a positive example, so $1-y$ is the probability that it is a negative example. Their ratio $\frac{y}{1-y}$ reflects the relative likelihood of $x$ being positive, and taking its logarithm gives the log-odds (logit) $\ln \frac{y}{1-y}$. Logistic regression is then:

\begin{aligned} \ln\frac{y}{1-y}&=w^Tx+b\\ y&=\frac{1}{1+e^{-(w^Tx+b)}} \end{aligned}

For binary classification, the error is usually measured not with MSE but with the binary cross-entropy loss (BCELoss):

\ell(y,\hat{y})=-(y\cdot\ln\hat{y}+(1-y)\ln(1-\hat{y}))
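A minimal NumPy sketch of this loss; the clipping constant `eps` is an implementation detail added here to keep $\ln$ away from $0$:

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy: -(y ln(y_hat) + (1 - y) ln(1 - y_hat))."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid ln(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1.0, 0.9))  # confident and correct -> small loss
print(bce(1.0, 0.1))  # confident and wrong  -> large loss
```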

The total error is then:

\begin{aligned} E &= -\sum^n_i(y_i\cdot\ln(\frac{1}{1+e^{-\hat{w}^T\hat{x}_i}})+(1-y_i)\ln(\frac{e^{-\hat{w}^T\hat{x}_i}}{1+e^{-\hat{w}^T\hat{x}_i}}))\\ &=-\sum^n_i(y_i\cdot\ln(\frac{1}{1+e^{-\hat{w}^T\hat{x}_i}})+\ln(\frac{e^{-\hat{w}^T\hat{x}_i}}{1+e^{-\hat{w}^T\hat{x}_i}})-y_i\cdot\ln(\frac{e^{-\hat{w}^T\hat{x}_i}}{1+e^{-\hat{w}^T\hat{x}_i}}))\\ &=-\sum^n_i(-y_i\cdot\ln(1+e^{-\hat{w}^T\hat{x}_i})-\ln(1+e^{\hat{w}^T\hat{x}_i})+y_i\cdot\ln(1+e^{\hat{w}^T\hat{x}_i}))\\ &=-\sum^n_i(y_i\cdot[\ln(1+e^{\hat{w}^T\hat{x}_i})-\ln(1+e^{-\hat{w}^T\hat{x}_i})]-\ln(1+e^{\hat{w}^T\hat{x}_i}))\\ &=-\sum^n_i(y_i\cdot\ln(\frac{1+e^{\hat{w}^T\hat{x}_i}}{1+e^{-\hat{w}^T\hat{x}_i}})-\ln(1+e^{\hat{w}^T\hat{x}_i}))\\ &=-\sum^n_i(y_i\cdot\ln(\frac{1+e^{\hat{w}^T\hat{x}_i}}{1+\frac{1}{e^{\hat{w}^T\hat{x}_i}}})-\ln(1+e^{\hat{w}^T\hat{x}_i}))\\ &=-\sum^n_i(y_i\cdot\ln(\frac{(1+e^{\hat{w}^T\hat{x}_i})\cdot e^{\hat{w}^T\hat{x}_i}}{e^{\hat{w}^T\hat{x}_i}+\frac{1}{e^{\hat{w}^T\hat{x}_i}}\cdot e^{\hat{w}^T\hat{x}_i}})-\ln(1+e^{\hat{w}^T\hat{x}_i}))\\ &=-\sum^n_i(y_i\cdot\ln(\frac{(1+e^{\hat{w}^T\hat{x}_i})\cdot e^{\hat{w}^T\hat{x}_i}}{1+e^{\hat{w}^T\hat{x}_i}})-\ln(1+e^{\hat{w}^T\hat{x}_i}))\\ &=\sum^n_i(\ln(1+e^{\hat{w}^T\hat{x}_i})-y_i\cdot \hat{w}^T\hat{x}_i)\\ \end{aligned}

The objective is:

\hat{w}^*=\argmin_{\hat{w}}E

The proof that $E$ is convex is omitted. The optimum can be obtained with gradient descent, Newton's method, or similar iterative methods.

Gradient Descent Derivation for Logistic Regression

\hat{y}=\frac{1}{1+e^{-(w^Tx+b)}}

E=\sum^n_i(\ln(1+e^{\hat{w}^T\hat{x}_i})-y_i\cdot \hat{w}^T\hat{x}_i)

EEwkw_k 求偏导:

\begin{aligned} \frac{\partial E}{\partial w_k}&=\sum^n_i(\frac{\partial}{\partial w_k}\ln(1+e^{\hat{w}^T\hat{x}_i})-\frac{\partial}{\partial w_k}y_i\cdot \hat{w}^T\hat{x}_i)\\ &=\sum^n_i(\frac{1}{1+e^{\hat{w}^T\hat{x}_i}}\cdot e^{\hat{w}^T\hat{x}_i}\cdot x_{ik}-y_i\cdot x_{ik}) \\ &=\sum^n_i(\frac{1}{1+\frac{1}{e^{-\hat{w}^T\hat{x}_i}}}\cdot \frac{1}{e^{-\hat{w}^T\hat{x}_i}}\cdot x_{ik}-y_i\cdot x_{ik}) \\ &=\sum^n_i(\frac{1}{e^{-\hat{w}^T\hat{x}_i}+1}\cdot x_{ik}-y_i\cdot x_{ik}) \\ &=\sum^n_i(\hat{y}_i-y_i)\cdot x_{ik} \end{aligned}

EEbb 求偏导:

\begin{aligned} \frac{\partial E}{\partial b}&=\sum^n_i(\frac{1}{e^{-\hat{w}^T\hat{x}_i}+1}-y_i) \\ &=\sum^n_i(\hat{y}_i-y_i) \end{aligned}

Gradient descent then iterates:

w_k := w_k - \eta\sum^n_i(\hat{y}_i-y_i)\cdot x_{ik},\quad k=1,\dots,d

b := b - \eta\sum^n_i(\hat{y}_i-y_i)
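Putting the updates together, a sketch of batch gradient descent for logistic regression on invented data; the $1/n$ scaling of the gradient is an addition here that merely rescales $\eta$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X_raw, y, lr=0.1, steps=2000):
    """Batch gradient descent for logistic regression using the updates above."""
    n, d = X_raw.shape
    X = np.hstack([X_raw, np.ones((n, 1))])  # x_hat: append 1 so that w_hat = (w; b)
    w_hat = np.zeros(d + 1)
    for _ in range(steps):
        y_hat = sigmoid(X @ w_hat)  # predicted probabilities
        grad = X.T @ (y_hat - y)    # sum_i (y_hat_i - y_i) * x_hat_i
        w_hat -= lr * grad / n      # dividing by n only rescales eta
    return w_hat

# Invented, roughly separable data: the label depends on 2*x1 - x2 + 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 > 0).astype(float)
print(fit_logit(X, y))  # direction roughly proportional to (2, -1, 0.5)
```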