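Machine Learning (2): Simple Logistic Regression in Python, Algorithm and Code (Example: Predicting Disease from Positive/Negative Checkup Results)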

Technology 2022-07-12

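Dataset: a plain txt file. Any three columns of numbers will do (two checkup measurements followed by a 0/1 label). The images could not be uploaded, and the exact values do not matter, because the point is to implement the algorithm. The text file contains only the data; the header row and row index shown in the printed view are added by the program.

[Figure: raw text file (left) and the same data as loaded by the program, with header and row index (right)]

[Figure: hand-drawn sketch of the algorithm]

Python implementation:

1. Import the Python libraries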

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # font with Chinese glyphs, needed for Chinese text in plots

2. Load and label the data

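# Raw string so the backslash in the Windows path is not treated as an escape
path = r'andrew_ml_ex22391逻辑回归数据集\ex2data1.txt'
# Columns: 体检1 = checkup 1, 体检2 = checkup 2, 患病 = diseased (1) or not (0)
data = pd.read_csv(path, header=None, names=['体检1', '体检2', '患病'])
阳性 = data[data['患病'].isin([1])]  # positive cases
阴性 = data[data['患病'].isin([0])]  # negative cases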
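Since the dataset itself is not attached, a stand-in file in the same three-column layout can be generated. The sketch below is only an illustration: the function name `make_checkup_file`, the file name `demo_checkups.txt`, the value range and the labelling rule are all assumptions, not the author's data.

```python
import os
import tempfile
import numpy as np

def make_checkup_file(path, n_rows=100, seed=0):
    """Write n_rows lines of 'checkup1,checkup2,label' to path (synthetic data)."""
    rng = np.random.default_rng(seed)
    scores = rng.uniform(30, 100, size=(n_rows, 2))
    # Hypothetical labelling rule: positive when the two scores sum to more than 125
    labels = (scores.sum(axis=1) > 125).astype(int)
    rows = np.column_stack([scores, labels])
    # One format per column: two floats and an integer label, comma separated
    np.savetxt(path, rows, delimiter=',', fmt=['%.4f', '%.4f', '%d'])
    return rows

demo_path = os.path.join(tempfile.mkdtemp(), 'demo_checkups.txt')
written = make_checkup_file(demo_path)

# Read the file back to confirm the layout: no header, three numeric columns
loaded = np.loadtxt(demo_path, delimiter=',')
print(loaded.shape)
```

A file written this way has no header row, so it can be read with `pd.read_csv(..., header=None, names=[...])` exactly as in the loading step.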

3. Visualize the data

fig, ax = plt.subplots(figsize=(12, 8))
# Red circles: positive cases; green squares: negative cases
ax.scatter(阳性['体检1'], 阳性['体检2'], s=50, c='r', marker='o', label='Diseased')
ax.scatter(阴性['体检1'], 阴性['体检2'], s=50, c='g', marker='s', label='Not diseased')
ax.legend()
ax.set_xlabel('Checkup 1 data')
ax.set_ylabel('Checkup 2 data')
plt.show()

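Output:

[Figure: scatter plot of the two classes]

4. ① The sigmoid function and updating θ with gradient descent

Sigmoid function: g(z) = 1 / (1 + e^(-z))

Partial derivative: ∂J(θ)/∂θ_j = (1/m) Σ_i (h_θ(x^(i)) - y^(i)) x_j^(i), where h_θ(x) = g(θ^T x) and m is the number of samples

Python code: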

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Cost function: cross-entropy loss averaged over the whole dataset
def Cost(theta, X, y):
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)

# Batch gradient descent
def gradientDescent(X, y, theta, alpha, iters):
    parameters = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        # Fresh buffer each iteration so theta is never modified in place
        temp = np.matrix(np.zeros(theta.shape))
        error = sigmoid(X * theta.T) - y
        for j in range(parameters):
            term = np.multiply(error, X[:, j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = Cost(theta, X, y)
    return theta, cost

# Add a constant column for the intercept term
data.insert(0, 'Ones', 1)

# Initialize X, y, θ
cols = data.shape[1]
X = data.iloc[:, 0:cols-1]
y = data.iloc[:, cols-1:cols]
X = np.matrix(X.values)
y = np.matrix(y.values)
theta = np.matrix(np.zeros(3))

# If the learning rate is too large, the updates overshoot, the cost diverges
# and the result becomes NaN; the learning rate is analyzed in a later post
alpha = 0.0001
iters = 1500
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g, cost)
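One way to gain confidence in the partial-derivative formula used above is to compare it against a finite-difference estimate of the cost. The following is a minimal sketch on random stand-in data, written with plain NumPy arrays rather than `np.matrix`; the names `cost`, `gradient`, `numeric_gradient` and the `*_demo` variables are introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Mean cross-entropy, the same quantity as Cost() in the article
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # Vectorized form of (1/m) * sum_i (h(x_i) - y_i) * x_ij
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numeric_gradient(theta, X, y, eps=1e-6):
    # Central finite difference, one coordinate at a time
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Random stand-in data: 20 samples, an intercept column plus two features
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_demo = (rng.random(20) > 0.5).astype(float)
theta_demo = np.array([0.1, -0.2, 0.3])

analytic = gradient(theta_demo, X_demo, y_demo)
numeric = numeric_gradient(theta_demo, X_demo, y_demo)
print(np.max(np.abs(analytic - numeric)))  # should be very close to zero
```

If the two gradients disagree noticeably, the cost and the gradient are not implementing the same function, which is a common source of a cost that refuses to decrease.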

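Output: (the final cost is 0.62909, which is still rather high)

② Using scipy.optimize.fmin_tnc: there is no learning rate or iteration count to set by hand; the optimizer chooses its own step sizes and iterates to the optimum.

Python code: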

import scipy.optimize as opt

# Cost function (accepts plain arrays, since the optimizer passes theta as a 1-D array)
def Cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)

# Add a constant column, unless method ① already added it
if 'Ones' not in data.columns:
    data.insert(0, 'Ones', 1)

# Initialize X, y, θ
cols = data.shape[1]
X = data.iloc[:, 0:cols-1]
y = data.iloc[:, cols-1:cols]
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)

# Gradient of the cost with respect to each parameter
def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    error = sigmoid(X * theta.T) - y
    for i in range(parameters):
        term = np.multiply(error, X[:, i])
        grad[i] = np.sum(term) / len(X)
    return grad

result = opt.fmin_tnc(func=Cost, x0=theta, fprime=gradient, args=(X, y))

# Plug the fitted θ back into the cost function
print(result[0])
print(Cost(result[0], X, y))
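The value returned by `fmin_tnc` is a tuple of three items: the solution, the number of function evaluations, and a return code, which is why the code above reads `result[0]`. The sketch below shows the unpacking on synthetic stand-in data; the data, `true_theta` and the helper names are assumptions made for the example, and SciPy must be installed.

```python
import numpy as np
import scipy.optimize as opt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Synthetic stand-in data drawn from a known logistic model (not the article's data)
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_theta = np.array([0.5, 2.0, -1.0])  # hypothetical parameters
y_demo = (rng.random(200) < sigmoid(X_demo @ true_theta)).astype(float)

# fmin_tnc returns (solution, number of function evaluations, return code)
theta_hat, n_evals, return_code = opt.fmin_tnc(
    func=cost, x0=np.zeros(3), fprime=gradient, args=(X_demo, y_demo)
)

print(theta_hat)
print(n_evals, return_code)
print(cost(theta_hat, X_demo, y_demo))
```

Because the labels are noisy, the fitted values will only roughly resemble `true_theta`; the point is the shape of the return value, not the numbers.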

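Output:

Comparing the two, method ② gives the better result. To reach a similar result with method ①, wrap the gradient descent in a loop and stop once the cost (the loss averaged over the whole dataset) drops below 0.3. This can take a very long time to run.

Python code: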

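# gradientDescent expects np.matrix inputs, as set up in method ①
X = np.matrix(X)
y = np.matrix(y)
theta = np.matrix(np.zeros(3))

# If the learning rate is too large, the updates overshoot, the cost diverges
# and the result becomes NaN; the learning rate is analyzed in a later post
alpha = 0.0005
iters = 15000
while True:
    # Each pass restarts from the initial theta with a larger iteration budget
    g, cost = gradientDescent(X, y, theta, alpha, iters)
    if cost[-1] < 0.3:
        print(g, cost)
        break
    else:
        print(cost)
        iters = iters + 15000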
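A likely reason plain gradient descent is so slow here is that the features are far from unit scale, which forces a tiny learning rate. Standardizing the features is a common remedy, though it is not what the article does. The sketch below compares the two on synthetic stand-in data; the data-generating rule and all names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Clip to keep log() finite if a prediction saturates at 0 or 1
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, alpha, iters):
    # Vectorized batch gradient descent starting from theta = 0
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

# Synthetic stand-in data in a 30..100 range with noisy labels (not the article's data)
rng = np.random.default_rng(0)
raw = rng.uniform(30, 100, size=(200, 2))
prob = sigmoid(0.1 * (raw.sum(axis=1) - 130))  # hypothetical ground truth
y_demo = (rng.random(200) < prob).astype(float)

ones = np.ones((200, 1))
X_raw = np.hstack([ones, raw])
# Standardize each feature to zero mean and unit variance
X_scaled = np.hstack([ones, (raw - raw.mean(axis=0)) / raw.std(axis=0)])

# Same iteration budget; the scaled problem tolerates a much larger step
theta_raw = gradient_descent(X_raw, y_demo, alpha=0.0001, iters=1500)
theta_scaled = gradient_descent(X_scaled, y_demo, alpha=1.0, iters=1500)
cost_raw = cost(theta_raw, X_raw, y_demo)
cost_scaled = cost(theta_scaled, X_scaled, y_demo)
print(cost_raw, cost_scaled)
```

Note that a θ fitted on standardized features applies to standardized inputs, so new samples must be transformed with the same mean and standard deviation before prediction.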

Iteration output:

[Figure: iteration output]

5. Plot the decision boundary

Python code:

print(result[0])
print(Cost(result[0], X, y))

# The boundary is where θ0 + θ1*x1 + θ2*x2 = 0, solved here for x2
plotting_x1 = np.linspace(30, 100, 100)
plotting_h1 = (-result[0][0] - result[0][1] * plotting_x1) / result[0][2]

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(plotting_x1, plotting_h1, 'y', label='Prediction')
ax.scatter(阳性['体检1'], 阳性['体检2'], s=50, c='r', marker='o', label='Diseased')
ax.scatter(阴性['体检1'], 阴性['体检2'], s=50, c='g', marker='s', label='Not diseased')
ax.legend()
ax.set_xlabel('Checkup 1 data')
ax.set_ylabel('Checkup 2 data')
plt.show()
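The line drawn above is the set of points where the model outputs a probability of exactly 0.5, because sigmoid(0) = 0.5. A small check of that relationship, using a made-up parameter vector (`theta_demo` is hypothetical, not a fitted result from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_x2(theta, x1):
    # Solve theta0 + theta1*x1 + theta2*x2 = 0 for x2
    return -(theta[0] + theta[1] * x1) / theta[2]

theta_demo = np.array([-25.0, 0.2, 0.2])  # hypothetical parameters
x1 = np.linspace(30, 100, 5)
x2 = boundary_x2(theta_demo, x1)

# Every point on the boundary should map to a probability of 0.5
probs = sigmoid(theta_demo[0] + theta_demo[1] * x1 + theta_demo[2] * x2)
print(probs)
```

The formula divides by the third parameter, so it only works when that parameter is nonzero; a model that ignores the second feature would need a vertical line instead.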

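Output:

① Decision boundary drawn with the θ from gradient descent plus the stopping loop:

[Figure: decision boundary, method ①]

② Decision boundary drawn with the θ from scipy.optimize.fmin_tnc:

[Figure: decision boundary, method ②]

6. Predict the probability of disease and the resulting class (1 = diseased, 0 = not diseased) for checkup 1 = 60 and checkup 2 = 70

Python code: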

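# Probability of disease for one sample X = [1, checkup 1, checkup 2]
def hfunc1(theta, X):
    return sigmoid(np.dot(theta.T, X))

# Class prediction: 1 if the probability is at least 0.5, otherwise 0
def predict(theta, X):
    probability = sigmoid(np.dot(theta.T, X))
    return [1 if probability >= 0.5 else 0]

print('Probability of disease:', hfunc1(result[0], [1, 60, 70]))
print('Predicted class:', predict(result[0], [1, 60, 70]))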
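The same thresholding can be applied to many samples at once, which also gives a quick accuracy figure. The sketch below uses a made-up θ and four made-up samples; `theta_demo`, `X_demo`, `y_demo` and `predict_all` are illustrative names, not part of the article's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_all(theta, X):
    # One prediction per row of X: 1 when the probability is at least 0.5
    return (sigmoid(X @ theta) >= 0.5).astype(int)

theta_demo = np.array([-25.0, 0.2, 0.2])  # hypothetical parameters
X_demo = np.array([
    [1, 60, 70],   # z =  1 -> predicted 1
    [1, 40, 50],   # z = -7 -> predicted 0
    [1, 80, 90],   # z =  9 -> predicted 1
    [1, 50, 60],   # z = -3 -> predicted 0
], dtype=float)
y_demo = np.array([1, 0, 1, 1])  # made-up labels; the last one is misclassified

preds = predict_all(theta_demo, X_demo)
accuracy = np.mean(preds == y_demo)
print(preds, accuracy)  # [1 0 1 0] 0.75
```

Accuracy measured on the training data tends to be optimistic, so a held-out split gives a fairer estimate.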

Output:

[Figure: predicted probability and class]

Corrections are welcome if you spot any problems. Thanks!
