Example 1, compiled by Xiao Yu

Technology 2022-07-10

SKNN + train_test_split example

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets

1. Load the iris dataset

    iris = datasets.load_iris()
    x = iris.data
    y = iris.target

    x.shape
    # (150, 4)
    y.shape
    # (150,)
    y
    # array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    #        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    #        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    #        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    #        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    #        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

2. The train_test_split process

2.1 Shuffle the samples in x: np.random.permutation(len(x)) returns the row indices in random order

    shuffle_indexes = np.random.permutation(len(x))
    shuffle_indexes
    # array([ 65,   9, 132,  90, 128,  40,  63,  28, 130,   3,  27, 108, 105,  71,  26,
    #         94,  37,  43,  73,  22,  31, 138,  15,  52, 131,  29,  84,  93,  55, 149,
    #         49,  91,   7, 116, 127,  36, 106, 137, 115,  46, 124,  96,  77,   4,   8,
    #         57, 136,  21, 113,  82, 134, 143, 114,  42, 112,  88,  85, 118, 147,  50,
    #         13,  14,  48,  69,  67,  12,  16,  11, 141, 117, 142,   5, 126, 121,  19,
    #         17, 122,  39,  30,  38,  45,  75, 144, 123,  34,  51,  23, 109, 148, 110,
    #         56,  81,  54,  68,  61,  35,  41,  78, 103,  32,  99,   0, 145, 140,  58,
    #         10,  47,  72, 104,  87, 111,  64, 107, 102,  33,  80,  74,  83,  59,  95,
    #        135,  20,  89, 146,  18,  24,  86,  92,  66,  76,  25,   2,  98, 101,  53,
    #         79,  70,  60, 129, 133, 139,  62, 119,   1,  97, 125, 120,   6,  44, 100])

2.2 Split off the test set

    test_ratio = 0.2
    test_size = int(len(x) * test_ratio)
    test_size
    # 30

    # The first test_size shuffled indices form the test set, the rest the training set.
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]

    x_test = x[test_indexes]
    y_test = y[test_indexes]
    x_train = x[train_indexes]
    y_train = y[train_indexes]

    print(x_train.shape)
    print(y_train.shape)
    # (120, 4)
    # (120,)
    print(x_test.shape)
    print(y_test.shape)
    # (30, 4)
    # (30,)
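The shuffle-and-slice steps above can be collected into one reusable function. This is only a minimal sketch: the name manual_train_test_split and its seed parameter are made up for illustration and are not part of the original notes, and a small synthetic array stands in for the iris data so the block runs on its own.

```python
import numpy as np


def manual_train_test_split(x, y, test_ratio=0.2, seed=None):
    """Shuffle row indices, then slice x and y into train and test parts.

    Hypothetical helper for illustration; `seed` is an added assumption
    that makes the shuffle repeatable.
    """
    assert x.shape[0] == y.shape[0], "x and y must have the same number of rows"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be in [0, 1]"

    rng = np.random.default_rng(seed)            # seeded random generator
    shuffle_indexes = rng.permutation(len(x))    # row indices in random order

    test_size = int(len(x) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]

    return x[train_indexes], x[test_indexes], y[train_indexes], y[test_indexes]


# Synthetic stand-in with the same shape as the iris arrays: 150 rows, 4 features.
x_demo = np.arange(600, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

x_tr, x_te, y_tr, y_te = manual_train_test_split(x_demo, y_demo, test_ratio=0.2, seed=666)
print(x_tr.shape, y_tr.shape)  # (120, 4) (120,)
print(x_te.shape, y_te.shape)  # (30, 4) (30,)
```

Because x and y are indexed with the same shuffled indices, each sample keeps its own label after the split.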

3. train_test_split in sklearn

    from sklearn.model_selection import train_test_split

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)

    print(x_train.shape)
    print(y_train.shape)
    # (120, 4)
    # (120,)
    print(x_test.shape)
    print(y_test.shape)
    # (30, 4)
    # (30,)
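A short sketch of what random_state does in the call above: two calls with the same value should return identical splits. The stratify=y argument shown at the end is an optional extra that the original notes do not use; it asks train_test_split to keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target

# Same random_state twice: the four returned arrays should match pairwise.
split_a = train_test_split(x, y, test_size=0.2, random_state=666)
split_b = train_test_split(x, y, test_size=0.2, random_state=666)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True

# Optional (not in the original notes): stratify on y so every class
# contributes the same share of samples to the test set.
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.2, random_state=666, stratify=y
)
print(np.bincount(y_test_s))  # [10 10 10]
```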

4. Predict y_predict with the SKNN algorithm

    from sklearn.neighbors import KNeighborsClassifier

    kNNClassifier = KNeighborsClassifier(n_neighbors=3)
    kNNClassifier.fit(x_train, y_train)
    y_predict = kNNClassifier.predict(x_test)
    y_predict
    # array([1, 2, 1, 2, 0, 1, 1, 2, 1, 1, 1, 0, 0, 0, 2, 1, 0, 2, 2, 2, 1, 0,
    #        2, 0, 1, 1, 0, 1, 2, 2])

5. Compute the accuracy by comparing y_test with y_predict

    y_test
    # array([1, 2, 1, 2, 0, 1, 1, 2, 1, 1, 1, 0, 0, 0, 2, 1, 0, 2, 2, 2, 1, 0,
    #        2, 0, 1, 1, 0, 1, 2, 2])

    # Fraction of test samples whose predicted label equals the true label.
    sum(y_predict == y_test) / len(y_test)
    # 1.0
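scikit-learn also provides this accuracy calculation directly. The sketch below repeats the same split and classifier settings so that it runs on its own, then compares the manual ratio with accuracy_score and with the classifier's own score method. The variable names clf, manual_acc, metric_acc and score_acc are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666
)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)

manual_acc = np.sum(y_predict == y_test) / len(y_test)  # ratio computed by hand
metric_acc = accuracy_score(y_test, y_predict)          # same ratio from sklearn.metrics
score_acc = clf.score(x_test, y_test)                   # predicts and scores in one call

print(manual_acc, metric_acc, score_acc)
```

All three values should agree; score is convenient when the predictions themselves are not needed.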

A step-by-step example of how SKNN works

    import numpy as np
    import matplotlib.pyplot as plt

    raw_data_x = [[3.3, 2.3],
                  [3.1, 1.7],
                  [1.3, 3.6],
                  [3.5, 4.6],
                  [2.2, 2.8],
                  [7.4, 4.6],
                  [5.7, 3.5],
                  [9.1, 2.5],
                  [7.7, 3.4],
                  [7.9, 0.7]]
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

    x_train = np.array(raw_data_x)
    y_train = np.array(raw_data_y)

    x_train
    # array([[3.3, 2.3],
    #        [3.1, 1.7],
    #        [1.3, 3.6],
    #        [3.5, 4.6],
    #        [2.2, 2.8],
    #        [7.4, 4.6],
    #        [5.7, 3.5],
    #        [9.1, 2.5],
    #        [7.7, 3.4],
    #        [7.9, 0.7]])
    y_train
    # array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

    # The new sample to classify.
    x = np.array([8.0, 3.3])

1. Compute the Euclidean distance from each training sample to x

    from math import sqrt

    distances = []
    # Use a separate loop variable: "for x_train in x_train" would overwrite
    # x_train with its last row once the loop finishes.
    for row in x_train:
        d = sqrt(np.sum((row - x) ** 2))
        distances.append(d)
    distances
    # [4.805205510693586, 5.154609587543949, 6.706713054842886,
    #  4.684015371452148, 5.821511831131154, 1.431782106327635,
    #  2.308679276123039, 1.360147050873544, 0.31622776601683783,
    #  2.601922366251537]

    # The same computation as a list comprehension:
    distances = [sqrt(np.sum((row - x) ** 2)) for row in x_train]
    distances
    # [4.805205510693586, 5.154609587543949, 6.706713054842886,
    #  4.684015371452148, 5.821511831131154, 1.431782106327635,
    #  2.308679276123039, 1.360147050873544, 0.31622776601683783,
    #  2.601922366251537]
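The distances can also be computed without a Python loop. This sketch is an alternative to the loop version, not something from the original notes: NumPy broadcasting subtracts x from every row, and np.linalg.norm with axis=1 takes the Euclidean norm of each row. The names distances_vec and distances_loop are chosen here for illustration.

```python
import numpy as np

x_train = np.array([[3.3, 2.3], [3.1, 1.7], [1.3, 3.6], [3.5, 4.6], [2.2, 2.8],
                    [7.4, 4.6], [5.7, 3.5], [9.1, 2.5], [7.7, 3.4], [7.9, 0.7]])
x = np.array([8.0, 3.3])

# x (shape (2,)) is broadcast against every row of x_train (shape (10, 2)).
distances_vec = np.linalg.norm(x_train - x, axis=1)

# Loop version for comparison.
distances_loop = np.array([np.sqrt(np.sum((row - x) ** 2)) for row in x_train])

print(np.allclose(distances_vec, distances_loop))  # True
print(int(np.argmin(distances_vec)))               # 8
```

The nearest training sample is row 8, [7.7, 3.4], at a distance of about 0.316.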

2. Find the labels of the k training samples nearest to x

    # argsort returns the indices that would sort distances in ascending order.
    np.argsort(distances)
    # array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2], dtype=int64)

    nearest = np.argsort(distances)
    k = 6
    topK_y = [y_train[i] for i in nearest[:k]]
    topK_y
    # [1, 1, 1, 1, 1, 0]

3. The most frequent label among them is the prediction

    from collections import Counter

    Counter(topK_y)
    # Counter({1: 5, 0: 1})

    votes = Counter(topK_y)
    votes.most_common(1)
    # [(1, 5)]

    # This is the (label, count) tuple (1, 5), not yet the label itself.
    y_predict = votes.most_common(1)[0]

4. votes.most_common(n) returns the n labels with the most votes as a list of (label, count) tuples, so index it twice to get the label

    votes = Counter(topK_y)
    votes.most_common(1)
    # [(1, 5)]

    y_predict = votes.most_common(1)[0][0]
    y_predict
    # 1
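The four steps of the worked example can be put together in a single function. This is a minimal sketch: the name knn_classify and its argument order are made up for illustration, and the distances are computed with np.linalg.norm rather than with the explicit loop.

```python
from collections import Counter

import numpy as np


def knn_classify(k, x_train, y_train, x):
    """Predict the label of one sample x by majority vote of its k nearest neighbours.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    assert 1 <= k <= x_train.shape[0], "k must be between 1 and the number of samples"
    assert x_train.shape[0] == y_train.shape[0], "x_train and y_train sizes must match"
    assert x_train.shape[1] == x.shape[0], "x must have the same number of features"

    distances = np.linalg.norm(x_train - x, axis=1)   # step 1: distance to every sample
    nearest = np.argsort(distances)                   # step 2: indices, nearest first
    topK_y = [int(y_train[i]) for i in nearest[:k]]   # labels of the k nearest samples
    votes = Counter(topK_y)                           # step 3: count the votes
    return votes.most_common(1)[0][0]                 # step 4: label with the most votes


x_train = np.array([[3.3, 2.3], [3.1, 1.7], [1.3, 3.6], [3.5, 4.6], [2.2, 2.8],
                    [7.4, 4.6], [5.7, 3.5], [9.1, 2.5], [7.7, 3.4], [7.9, 0.7]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
x = np.array([8.0, 3.3])

y_predict = knn_classify(6, x_train, y_train, x)
print(y_predict)  # 1
```

With an even k such as 6, two labels can receive the same number of votes. In that case most_common returns the tied labels in the order they were first counted, so the label of the nearer neighbour wins.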
