Machine Learning Notes, Part 4: Linear Regression



Table of Contents

- Introduction
- Least Squares
- Implementing Simple Linear Regression
- A Custom SimpleLinearRegression
- A Vectorized SimpleLinearRegression
- Metrics for Regression Algorithms: MSE, RMSE, MAE
- MSE and MAE in scikit-learn
- R Squared
- Implementing Multiple Linear Regression
- A Custom Multiple Linear Regression Model
- Linear Regression in scikit-learn
- KNN Regressor
- Interpretability of Linear Regression
- Summary of Linear Regression
- References

Introduction

Linear regression is a classic algorithm for regression problems. Its main characteristics are:

- It solves regression problems.
- The idea is simple and the implementation straightforward.
- It is the foundation of many powerful nonlinear models.
- Its results are highly interpretable: from the fitted model we can learn something real about the world behind the data.
- It embodies many important ideas that recur throughout machine learning.

Take house-price prediction as an example.

Assume that house area and house price have some linear relationship. Based on this assumption, we look for a "straight line" that best fits the relationship between the sample feature (house area) and the sample label (house price).

Once we know the parameters of this line, say $y = x + 100$, we can use it as the mapping from area to price: for a new house of $250\,m^2$, the predicted price is $250 + 100 = 350$.

Linear regression with only a single feature like this is called simple linear regression.

Suppose we have found the best-fitting line $y = ax + b$. Then for every sample point $x^{(i)}$, the line gives the prediction $\hat{y}^{(i)} = a x^{(i)} + b$, while the true value is $y^{(i)}$.

We define the discrepancy between $\hat{y}^{(i)}$ and $y^{(i)}$ as $\left(y^{(i)}-\hat{y}^{(i)}\right)^2$. Summing over all samples gives $\sum_{i=1}^{m}\left(y^{(i)}-\hat{y}^{(i)}\right)^2$, and we want this sum to be as small as possible. Substituting $\hat{y}^{(i)} = a x^{(i)} + b$ yields
$$\sum_{i=1}^{m}\left(y^{(i)}-a x^{(i)}-b\right)^2$$
A function of this kind, which measures how badly (or how well) the model fits the data, is called a loss function when we minimize it, or a utility function when we maximize it.
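To make the objective concrete, here is a minimal sketch (toy data and a hand-picked candidate $a$, $b$, not part of the original notes) that evaluates this sum of squared differences for one candidate line:

```python
import numpy as np

# five toy samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 3.0, 5.0])

# one candidate line y = a*x + b (values picked by hand)
a, b = 0.8, 0.4
y_hat = a * x + b

# the loss: sum of squared differences between true and predicted values
loss = np.sum((y - y_hat) ** 2)
print(loss)   # smaller values mean a better-fitting line
```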

1. By analyzing the problem, determine its loss function or utility function;

2. Optimize that loss or utility function to obtain the regression model.

Nearly all parametric learning algorithms follow this pattern, for example:

- Linear regression
- Polynomial regression
- Logistic regression
- SVM
- Neural networks
- ...

Precisely because most machine learning algorithms share this structure, the field of optimization is extremely important; within it, a branch called convex optimization deals with a particularly well-behaved class of optimization problems.

Least Squares

Returning to the simple linear regression problem for house prices, our goal is to make $\sum_{i=1}^{m}\left(y^{(i)}-a x^{(i)}-b\right)^2$ as small as possible. The standard tool for this is the method of least squares.

Derivation of the least-squares solution

In the objective $\sum_{i=1}^{m}\left(y^{(i)}-a x^{(i)}-b\right)^2$ the unknowns are $a$ and $b$, so define
$$J(a,b)=\sum_{i=1}^{m}\left(y^{(i)}-a x^{(i)}-b\right)^2$$
Since a minimum of $J$ must have zero partial derivatives, we set
$$\frac{\partial J(a,b)}{\partial a}=0, \qquad \frac{\partial J(a,b)}{\partial b}=0$$
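Working the second condition out explicitly (a step the notes skip) already yields the expression for $b$:
$$\frac{\partial J(a,b)}{\partial b}=\sum_{i=1}^{m}2\left(y^{(i)}-ax^{(i)}-b\right)(-1)=0
\;\Longrightarrow\;
\sum_{i=1}^{m}y^{(i)}-a\sum_{i=1}^{m}x^{(i)}-mb=0
\;\Longrightarrow\;
b=\bar{y}-a\bar{x}$$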

Treating the derivative with respect to $a$ in the same way (again using the chain rule) and solving the two equations gives
$$b=\bar{y}-a\bar{x}$$
$$a=\frac{\sum_{i=1}^{m}\left(x^{(i)}y^{(i)}-x^{(i)}\bar{y}\right)}{\sum_{i=1}^{m}\left(\left(x^{(i)}\right)^2-\bar{x}x^{(i)}\right)}$$

The expression for $a$ can be simplified. First note that
$$\sum_{i=1}^{m}x^{(i)}\bar{y}=\bar{y}\sum_{i=1}^{m}x^{(i)}=m\,\bar{y}\,\bar{x}=\bar{x}\sum_{i=1}^{m}y^{(i)}=\sum_{i=1}^{m}\bar{x}y^{(i)}=\sum_{i=1}^{m}\bar{x}\,\bar{y}$$
so subtracting $\bar{x}y^{(i)}$ and adding $\bar{x}\bar{y}$ inside the numerator (and the analogous terms in the denominator) does not change the value:
$$a=\frac{\sum_{i=1}^{m}\left(x^{(i)}y^{(i)}-x^{(i)}\bar{y}-\bar{x}y^{(i)}+\bar{x}\bar{y}\right)}{\sum_{i=1}^{m}\left(\left(x^{(i)}\right)^2-\bar{x}x^{(i)}-\bar{x}x^{(i)}+\bar{x}^2\right)}=\frac{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)\left(y^{(i)}-\bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)^2}$$

The final result of the derivation is
$$a=\frac{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)\left(y^{(i)}-\bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)^2}, \qquad b=\bar{y}-a\bar{x}$$

Implementing Simple Linear Regression

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()
```

The formulas to implement are
$$a=\frac{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)\left(y^{(i)}-\bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)^2}, \qquad b=\bar{y}-a\bar{x}$$

```python
x_mean = np.mean(x)
y_mean = np.mean(y)

num = 0.0  # numerator
d = 0.0    # denominator
for x_i, y_i in zip(x, y):
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2

a = num / d
b = y_mean - a * x_mean

a   # 0.8
b   # 0.39999999999999947
```

Visualizing the fit and a prediction:

```python
y_hat = a * x + b

plt.scatter(x, y)
plt.plot(x, y_hat, color='r')

# predict for a new input
x_predict = 6
y_predict = a * x_predict + b
plt.scatter(x_predict, y_predict, color='green')
plt.show()

print(y_predict)
```

    5.2

The red line is the fitted line obtained by simple linear regression, and the green point is the prediction for the new input $x = 6$.

A Custom SimpleLinearRegression

```python
import numpy as np


class SimpleLinearRegression1(object):

    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the model on the training set."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only process single-feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        num = 0.0  # numerator
        d = 0.0    # denominator
        for x_i, y_i in zip(x_train, y_train):
            num += (x_i - x_mean) * (y_i - y_mean)
            d += (x_i - x_mean) ** 2

        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a test set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only process single-feature data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Return the prediction for a single sample x_single."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"
```

Testing the custom SimpleLinearRegression1:

```python
from relatedCode.SimpleLinearRegression import SimpleLinearRegression1

reg1 = SimpleLinearRegression1()
reg1.fit(x, y)                        # SimpleLinearRegression1()
reg1.predict(np.array([x_predict]))   # array([5.2])
```

Plotting its predictions:

```python
y_hat1 = reg1.predict(x)

plt.scatter(x, y)
plt.plot(x, y_hat1, color='r')
plt.show()
```

A Vectorized SimpleLinearRegression

Suppose we have two vectors
$$w=\left(w^{(1)},w^{(2)},\dots,w^{(m)}\right), \qquad v=\left(v^{(1)},v^{(2)},\dots,v^{(m)}\right)$$
Then
$$\sum_{i=1}^{m}w^{(i)}\cdot v^{(i)}=w\cdot v$$
that is, the sum can be computed as a single dot product; we say such an expression can be vectorized.
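In NumPy this is just a dot product. A minimal sketch (toy vectors, not from the original notes) showing that the element-wise loop and the vectorized form give the same result:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# sum of element-wise products, written as an explicit loop
loop_sum = sum(w_i * v_i for w_i, v_i in zip(w, v))

# the same quantity as a single vectorized dot product
vec_sum = w.dot(v)

print(loop_sum, vec_sum)   # 32.0 32.0
```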

Since both the numerator and the denominator of $a=\frac{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)\left(y^{(i)}-\bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)^2}$ have exactly this form (with $w = x-\bar{x}$ and $v = y-\bar{y}$), we can improve SimpleLinearRegression1 by computing $a$ and $b$ with dot products instead of a for loop.

```python
class SimpleLinearRegression2(object):

    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the model on the training set."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only process single-feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # num = 0.0  # numerator
        # d = 0.0    # denominator
        # for x_i, y_i in zip(x_train, y_train):
        #     num += (x_i - x_mean) * (y_i - y_mean)
        #     d += (x_i - x_mean) ** 2

        # replace the for loop with vectorized dot products
        num = (x_train - x_mean).dot(y_train - y_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a test set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only process single-feature data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Return the prediction for a single sample x_single."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression2()"
```

Testing the difference in runtime efficiency between the two implementations

```python
from relatedCode.SimpleLinearRegression import SimpleLinearRegression2

reg2 = SimpleLinearRegression2()
reg2.fit(x, y)   # SimpleLinearRegression2()

y_hat2 = reg2.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat2, color='r')
plt.show()
```

The fit is identical to SimpleLinearRegression1. Now we benchmark the two implementations.

```python
# generate a large dataset
m = 1000000
big_x = np.random.random(size=m)
big_y = big_x * 2.0 + 3.0 + np.random.normal(0, 1, size=m)

# compare training times with the %timeit magic command
%timeit reg1.fit(big_x, big_y)
%timeit reg2.fit(big_x, big_y)
```

```
781 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.8 ms ± 552 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Both implementations learn essentially the same parameters:

```python
reg1.a_   # 1.9994703899928086
reg2.a_   # 1.9994703899928812
reg1.b_   # 2.9997681141058745
reg2.b_   # 2.999768114105838
```

Metrics for Regression Algorithms

As described above, linear regression is evaluated through its loss function. Looking at $\sum_{i=1}^{m}\left(y_{test}^{(i)}-\hat{y}_{test}^{(i)}\right)^2$, however, its magnitude grows with $m$, the number of samples, which makes it unsuitable as a metric on its own. Dividing by $m$ gives the mean squared error (MSE):
$$MSE=\frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)}-\hat{y}_{test}^{(i)}\right)^2$$

To keep the error in the same units as $y$ itself, we take the square root of MSE, giving the root mean squared error (RMSE):
$$RMSE=\sqrt{MSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)}-\hat{y}_{test}^{(i)}\right)^2}$$
Similarly, another common metric is the mean absolute error (MAE):
$$MAE=\frac{1}{m}\sum_{i=1}^{m}\left|y_{test}^{(i)}-\hat{y}_{test}^{(i)}\right|$$

Unlike classification accuracy, which always lies between 0 (worst) and 1 (best), none of these three metrics has a fixed scale: their values depend on the units and range of $y$, so a raw MSE, RMSE, or MAE by itself does not tell us how good a model is.
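As a quick illustration (a sketch on made-up numbers, not the Boston data used below): identical predictions, re-expressed in a different unit for $y$, produce RMSE and MAE values ten times larger, so the raw numbers say little on their own.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(20.0, 5.0, 100)           # "prices" in some unit
y_pred = y_true + rng.normal(0.0, 2.0, 100)   # an imperfect model's predictions

for scale in (1.0, 10.0):                     # the same data expressed in two different units
    err = (y_true - y_pred) * scale
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    print(scale, rmse, mae)                   # both errors grow with the unit of y
```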

Next, let's implement the three metrics in code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
```

Load the Boston housing data:

```python
boston = datasets.load_boston()

boston.feature_names
# array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
#        'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

x = boston.data[:, 5]   # use only the RM feature (number of rooms)
x.shape                 # (506,)

y = boston.target
y.shape                 # (506,)

plt.scatter(x, y)
plt.show()
```

The scatter plot shows a band of points at $y = 50$: the target was capped at that value, so these samples are biased and we remove them.

```python
np.max(y)   # 50.0

x = x[y < 50.0]
y = y[y < 50.0]

plt.scatter(x, y)
plt.show()
```

Fit simple linear regression:

```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
x_train.shape   # (392,)
x_test.shape    # (98,)

from relatedCode.SimpleLinearRegression import SimpleLinearRegression2

reg = SimpleLinearRegression2()
reg.fit(x_train, y_train)   # SimpleLinearRegression2()
reg.a_                      # 7.8608543562689555
reg.b_                      # -27.459342806705543

plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
```

The red line is the model's fitted line. Now use the model to predict on the test set:

```python
y_predict = reg.predict(x_test)
```

    MSE

```python
mse_test = np.sum((y_predict - y_test) ** 2) / len(y_test)
mse_test   # 24.156602134387438
```

    RMSE

```python
rmse_test = np.sqrt(mse_test)
rmse_test   # 4.914936635846635
```

    MAE

```python
mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
mae_test   # 3.5430974409463873
```

Wrap the three error computations in metrics.py:

```python
import numpy as np
from math import sqrt


def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict."""
    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict."""
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)
```

Using them:

```python
from relatedCode.metrics import mean_squared_error
from relatedCode.metrics import root_mean_squared_error
from relatedCode.metrics import mean_absolute_error

mse = mean_squared_error(y_test, y_predict)
rmse = root_mean_squared_error(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)

print("mean_squared_error:{}".format(mse))
print("root_mean_squared_error:{}".format(rmse))
print("mean_absolute_error:{}".format(mae))
```

```
mean_squared_error:24.156602134387438
root_mean_squared_error:4.914936635846635
mean_absolute_error:3.5430974409463873
```

MSE and MAE in scikit-learn

```python
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

mse_skl = mean_squared_error(y_test, y_predict)
mae_skl = mean_absolute_error(y_test, y_predict)

print("mean_squared_error of sklearn:{}".format(mse_skl))
print("mean_absolute_error of sklearn:{}".format(mae_skl))
```

```
mean_squared_error of sklearn:24.156602134387438
mean_absolute_error of sklearn:3.5430974409463873
```

    R Squared

Because none of the metrics above gives a normalized score the way classification accuracy does, we introduce R squared, which is close to an ideal metric for linear regression:
$$R^2=1-\frac{\sum_{i}\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\sum_{i}\left(\bar{y}-y^{(i)}\right)^2}$$
In the fraction $\frac{\sum_{i}\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\sum_{i}\left(\bar{y}-y^{(i)}\right)^2}$:

- the numerator is the error made by our model's predictions;
- the denominator is the error made by always predicting $y=\bar{y}$, which is called the baseline model.

R squared therefore has the following properties:

- $R^2 \le 1$.
- Larger is better: if the model makes no errors at all, $R^2$ reaches its maximum value of 1.
- If the model does no better than the baseline model, $R^2 = 0$.
- If $R^2 < 0$, our model is actually worse than the baseline (predicting the mean), which very likely means the data has no linear relationship at all.

With a small rearrangement, $R^2$ can be rewritten as
$$R^2=1-\frac{\sum_{i}\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\sum_{i}\left(\bar{y}-y^{(i)}\right)^2}=1-\frac{\left(\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\right)/m}{\left(\sum_{i=1}^{m}\left(y^{(i)}-\bar{y}\right)^2\right)/m}=1-\frac{MSE\left(\hat{y},y\right)}{Var\left(y\right)}$$

In code:

```python
1 - mean_squared_error(y_test, y_predict) / np.var(y_test)
# 0.6129316803937322
```

Wrap this computation in metrics.py as well:

```python
def r2_score(y_true, y_predict):
    """Compute the R squared between y_true and y_predict."""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
```

```python
from relatedCode.metrics import r2_score

r2_score(y_test, y_predict)   # 0.6129316803937322
```

scikit-learn provides the same function:

```python
from sklearn.metrics import r2_score

r2_score(y_true=y_test, y_pred=y_predict)   # 0.6129316803937324
```

In fact, the `score` method of scikit-learn's LinearRegression computes exactly this $R^2$:

| method | description |
| --- | --- |
| fit(self, X, y[, sample_weight]) | Fit linear model. |
| get_params(self[, deep]) | Get parameters for this estimator. |
| predict(self, X) | Predict using the linear model. |
| score(self, X, y[, sample_weight]) | Return the coefficient of determination R^2 of the prediction. |
| set_params(self, **params) | Set the parameters of this estimator. |

Implementing Multiple Linear Regression

For multiple linear regression the goal is still to make $\sum_{i=1}^{m}\left(y^{(i)}-\hat{y}^{(i)}\right)^2$ as small as possible, but now $\hat{y}^{(i)}$ has a richer form:
$$\hat{y}^{(i)}=\theta_0+\theta_1 X_1^{(i)}+\theta_2 X_2^{(i)}+\dots+\theta_n X_n^{(i)}$$
So the objective is to find $\theta_0,\theta_1,\dots,\theta_n$ that minimize $\sum_{i=1}^{m}\left(y^{(i)}-\hat{y}^{(i)}\right)^2$. Let $\theta=\left(\theta_0,\theta_1,\dots,\theta_n\right)^T$ and write

$$\hat{y}^{(i)}=\theta_0 X_0^{(i)}+\theta_1 X_1^{(i)}+\theta_2 X_2^{(i)}+\dots+\theta_n X_n^{(i)}$$

where $X_0^{(i)}\equiv 1$, and let $X^{(i)}=\left(X_0^{(i)},X_1^{(i)},X_2^{(i)},\dots,X_n^{(i)}\right)$.

Then $\hat{y}^{(i)}$ has the vectorized form $\hat{y}^{(i)}=X^{(i)}\cdot\theta$, where $X^{(i)}$ holds all features of a single sample. Stacking all sample rows into one matrix,
$$X=\begin{bmatrix}1 & X_1^{(1)} & X_2^{(1)} & \cdots & X_n^{(1)}\\ 1 & X_1^{(2)} & X_2^{(2)} & \cdots & X_n^{(2)}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & X_1^{(m)} & X_2^{(m)} & \cdots & X_n^{(m)}\end{bmatrix}=\begin{bmatrix}X^{(1)}\\ X^{(2)}\\ \vdots\\ X^{(m)}\end{bmatrix}$$
we obtain $\hat{y}=X\cdot\theta$.

The objective therefore changes from $\min\sum_{i=1}^{m}\left(y^{(i)}-\hat{y}^{(i)}\right)^2$ to
$$\min\;\left(y-X\theta\right)^T\left(y-X\theta\right)$$
Differentiating with respect to $\theta$ (matrix calculus) and setting the gradient to zero gives the value of $\theta$ that minimizes the prediction error, known as the normal equation solution of multiple linear regression:
$$\theta=\left(X^TX\right)^{-1}X^Ty$$
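The closed-form solution is easy to sanity-check numerically. Below is a minimal sketch (synthetic data, not the author's code) that compares the normal-equation $\theta$ with the result of NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(42)
X_raw = rng.random((100, 3))                          # 100 samples, 3 features
true_theta = np.array([4.0, 1.0, -2.0, 3.0])          # intercept followed by coefficients
y = true_theta[0] + X_raw @ true_theta[1:] + rng.normal(0.0, 0.1, 100)

# prepend the column of ones, i.e. X_0 ≡ 1
X = np.hstack([np.ones((len(X_raw), 1)), X_raw])

theta_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)     # θ = (XᵀX)⁻¹ Xᵀ y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # NumPy's least-squares solver

print(np.allclose(theta_normal, theta_lstsq))         # True
print(theta_normal)                                   # close to [4, 1, -2, 3]
```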

A Custom Multiple Linear Regression Model

```python
import numpy as np
from .metrics import r2_score


class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model."""
        self.coef_ = None          # coefficients
        self.interception_ = None  # intercept
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Train the model on the training set using the normal equation."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        X = np.hstack([np.ones(shape=(len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_pre):
        """Given a matrix of samples, return the vector of predictions."""
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_pre.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X = np.hstack([np.ones(shape=(len(X_pre), 1)), X_pre])
        return X.dot(self._theta)

    def score(self, X_test, y_test):
        """Return the R squared of the model on a test set."""
        y_predict = self.predict(X_test)
        return r2_score(y_true=y_test, y_predict=y_predict)

    def __repr__(self):
        return "LinearRegression()"
```

Testing it:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from relatedCode.LinearRegression import LinearRegression
from relatedCode.model_selection import train_test_split

if __name__ == '__main__':
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]

    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

    reg = LinearRegression()
    reg.fit_normal(X_train, y_train)
    print("coefficients:", reg.coef_)
    print("intercept:", reg.interception_)

    score = reg.score(X_test, y_test)
    print("R^2 on the test set:", score)
```

Result:

```
coefficients: [-1.20354261e-01  3.64423279e-02 -3.61493155e-02  5.12978140e-02
 -1.15775825e+01  3.42740062e+00 -2.32311760e-02 -1.19487594e+00
  2.60101728e-01 -1.40219119e-02 -8.35430488e-01  7.80472852e-03
 -3.80923751e-01]
intercept: 34.11739972322438
R^2 on the test set: 0.8129794056212711
```

Linear Regression in scikit-learn

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from relatedCode.model_selection import train_test_split


def prepareData():
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]
    return train_test_split(X, y, seed=666)


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = prepareData()

    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)

    score = lin_reg.score(X_test, y_test)
    print("R^2 of scikit-learn's linear model:", score)
```

Result:

```
R^2 of scikit-learn's linear model: 0.8129794056212809
```

    KNN Regressor

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn import datasets
from relatedCode.model_selection import train_test_split


def prepareData():
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]
    return train_test_split(X, y, seed=666)


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = prepareData()

    knn_reg = KNeighborsRegressor()
    knn_reg.fit(X_train, y_train)
    knn_reg_score = knn_reg.score(X_test, y_test)
    print("R^2 of the KNN Regressor:", knn_reg_score)
```

Result:

```
R^2 of the KNN Regressor: 0.5865412198300899
```

Improving the KNN regressor with grid search:

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn import datasets
from relatedCode.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV


def prepareData():
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]
    return train_test_split(X, y, seed=666)


def searchParams(knn_reg):
    param_grid = [
        {
            "weights": ["uniform"],
            "n_neighbors": [i for i in range(1, 11)]
        },
        {
            "weights": ["distance"],
            "n_neighbors": [i for i in range(1, 11)],
            "p": [i for i in range(1, 6)]
        }
    ]
    grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_estimator_


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = prepareData()

    knn_reg = searchParams(KNeighborsRegressor())
    knn_reg_score = knn_reg.score(X_test, y_test)
    print("R^2 of the KNN Regressor:", knn_reg_score)
```

Result:

```
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    1.2s finished
R^2 of the KNN Regressor: 0.7160666820548708
```

Interpretability of Linear Regression

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
import numpy as np

boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]

lin_reg = LinearRegression()
lin_reg.fit(X, y)

sorted_coef_index = np.argsort(lin_reg.coef_)
print("coefficient indices sorted by value:", sorted_coef_index)
print("features ordered by their effect on price (ascending):", boston.feature_names[sorted_coef_index])
print(boston.DESCR)
```

Result:

```
coefficient indices sorted by value: [ 4  7 10 12  0  2  6  9 11  1  8  3  5]
features ordered by their effect on price (ascending): ['NOX' 'DIS' 'PTRATIO' 'LSTAT' 'CRIM' 'INDUS' 'AGE' 'TAX' 'B' 'ZN' 'RAD'
 'CHAS' 'RM']

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    ...
```

Comparing the ordered coefficients with the feature descriptions, the result matches everyday intuition: the way real-world house prices respond to each factor is consistent with what linear regression has learned from the data.

For example, NOX is the nitric oxides concentration in the air (a pollution indicator), and its coefficient is negative: more pollution, lower price. RM is the average number of rooms per dwelling, and its coefficient is positive: more rooms, higher price.

This is what we mean by the interpretability of linear regression.

From the trained parameters of a linear regression model we can read off how strongly, and in which direction, each feature influences the target.
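As a small follow-up sketch (reusing the lin_reg and boston objects fitted above), we can pair each feature name with its actual coefficient instead of only looking at the ordering; the negative NOX and positive RM coefficients are exactly the pattern described above:

```python
# assumes lin_reg and boston from the previous snippet are still in scope
for name, coef in sorted(zip(boston.feature_names, lin_reg.coef_), key=lambda t: t[1]):
    print(f"{name:>8s}: {coef:+.3f}")   # most negative first, most positive last
```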

Summary of Linear Regression

- Linear regression is a typical parametric learning algorithm, whereas KNN is non-parametric.
- It only solves regression problems, although it is the basis of many classification methods (such as logistic regression); KNN, by contrast, can handle both classification and regression.
- Its premise is that the data has a linear relationship; KNN makes no assumption about the data.
- Its strength is interpretability: it is a white-box algorithm, and from the trained model we can learn real knowledge about the problem.

References

    liuyubobo:https://github.com/liuyubobobo/Play-with-Machine-Learning-Algorithms

    liuyubobo:https://coding.imooc.com/class/169.html

    莫烦Python:https://www.bilibili.com/video/BV1xW411Y7Qd?from=search&seid=11773783205368546023
