Chapter 1: k-Nearest Neighbors (kNN) with R


    DataCamp: kNN with R

    Contents: Overview of k-Nearest Neighbors (kNN) · What is "near"? · What is a "neighbor"? · What is "K"? · Core idea · Algorithm steps · What about kNN in R · Recognizing a road sign with kNN · Thinking like kNN · Exploring the traffic sign dataset · Classifying a collection of road signs · Understanding the impact of 'K' · Testing other 'k' values · Seeing how the neighbors voted · Why normalize data · Pros and cons of kNN · Pros · Cons

    Overview of k-Nearest Neighbors (kNN)

            kNN (k-Nearest Neighbors) is a theoretically mature classification algorithm in data mining and one of the simplest machine learning algorithms. To get a first feel for it, the name can be broken into three parts: k, "near," and "neighbor." We discuss each in turn below.

    What is "near"?

            "Near" refers to the distance between pairs of samples. As in classical distance-based discriminant methods, many distance measures can be used to decide class membership, such as the Euclidean, Manhattan, and Chebyshev distances; if two samples are closer to each other than to the other samples, they are judged to belong to the same class. The k-nearest-neighbors algorithm likewise uses distance as one of its decision criteria, and the usual choices are the Euclidean and Manhattan distances:

    Euclidean distance: $d(x,y)=\sqrt{\sum_{k=1}^{n}(x_k-y_k)^2}$

    Manhattan distance: $d(x,y)=\sum_{k=1}^{n}|x_k-y_k|$
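
    As a quick illustration, both metrics are one-liners in base R; the vectors a and b below are made-up three-dimensional points, not data from this chapter.

    # Two hypothetical observations in a 3-dimensional feature space
    a <- c(1, 2, 3)
    b <- c(4, 0, 3)

    # Euclidean distance: square root of the sum of squared differences
    sqrt(sum((a - b)^2))   # 3.605551

    # Manhattan distance: sum of absolute differences
    sum(abs(a - b))        # 5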

    What is a "neighbor"?

            A "neighbor" is any of the correctly labeled samples in the training set. How many neighbors should the algorithm consult when making a decision? That is governed by the K in its name.

    What is "K"?

            K is a variable giving the number of neighbors to consider when classifying. In R the default is k = 1, but k = 1 is not best in every situation, nor is a larger k always better. Consider the following example:

    [Figure: a pedestrian-crossing sign to be classified, shown with its nearest neighbors in the training data]

            The sign we need to classify is a pedestrian crossing, yet its single nearest neighbor is a speed-limit sign, because the speed-limit sign shares a similar background color and other features with the sign in question. With k = 1, then, the sign is misclassified. The second-, third-, and fourth-nearest neighbors are all pedestrian-crossing signs, so with k = 3 the two crossing signs outvote the speed-limit sign and win 2 to 1. Likewise, with k = 5 the crossing sign wins 3 to 2. The choice of k therefore matters. Note that in the event of a tie, the winner is usually chosen at random.

            A small k creates small neighborhoods, so the classifier can pick up very subtle patterns; a large k ignores fine detail and potential noise points in favor of broader, more general patterns. There is no universal rule for setting k: it depends on the patterns to be learned and on the influence of noisy data. One common suggestion is to take the square root of the number of observations in the training data; with 100 training samples, for example, k would be 10. Usually k is an integer no larger than 20. This is only a guideline, though, since a classifier is ultimately judged by its accuracy, so it pays to try several values of k and keep the best-performing classifier.
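
    As a rough starting point in code, the square-root rule of thumb from the paragraph above could be written as follows (signs here stands for the training data used later in this chapter):

    # Rule-of-thumb starting value for k: the square root of the number of
    # training observations, rounded down and capped at 20
    k_guess <- min(20, floor(sqrt(nrow(signs))))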

    Core idea

            In sum, the core idea of kNN is this: if, among the K training samples nearest to a sample in feature space, the majority belong to one class, then the sample is assigned to that class and takes on the characteristics of that class's samples. The method bases its classification decision only on the classes of the one or few nearest samples, so it depends on a very small number of neighbors. Because kNN determines class membership from this limited set of nearby samples rather than from discriminant class boundaries, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.

    Algorithm steps

    (1) Compute the distance between the test observation and every training observation.
    (2) Sort the training observations by increasing distance.
    (3) Select the K nearest points.
    (4) Count how often each class appears among those K points.
    (5) Return the most frequent class as the predicted class for the test observation.
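
    The five steps above map almost line for line onto a from-scratch implementation in base R. The sketch below is illustrative only (knn_predict and its arguments are names made up here, and ties are broken by table order rather than at random); the exercises that follow use the ready-made knn() function from the class package instead.

    # A bare-bones kNN classifier for a single test observation
    # train:    data frame or matrix of numeric features
    # labels:   vector of class labels for the training rows
    # test_row: numeric vector with one value per feature
    knn_predict <- function(train, labels, test_row, k = 1) {
      # (1) Euclidean distance from the test row to every training row
      dists <- sqrt(rowSums(sweep(as.matrix(train), 2, test_row)^2))
      # (2)-(3) Sort by distance and keep the labels of the k nearest rows
      nearest <- labels[order(dists)[1:k]]
      # (4)-(5) Return the most frequent class among the k neighbors
      names(which.max(table(nearest)))
    }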

    What about kNN in R

    Recognizing a road sign with kNN

    After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.

    As it begins to drive away, its camera captures an image of a road sign ahead. Can you apply a kNN classifier to help the car recognize this sign? The dataset signs is loaded in your workspace along with the data frame next_sign, which holds the observation you want to classify.

    # Load the 'class' package
    library(class)

    # Create a vector of labels
    sign_types <- signs$sign_type

    # Classify the next sign observed
    knn(train = signs[-1], test = next_sign, cl = sign_types)

    Thinking like kNN

    With your help, the test car successfully identified the sign and stopped safely at the intersection.

    How did the knn() function correctly classify the stop sign?
    —————————————————————————————————————————————
    The correct answer: The sign was in some way similar to another stop sign.
    —————————————————————————————————————————————

    Exploring the traffic sign dataset

    To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.

    Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue levels of each of the 16 center pixels were recorded. The result is a dataset that records the sign_type as well as 16x3=48 color properties of each sign.

    # Examine the structure of the signs dataset
    str(signs)

    # Count the number of signs of each type
    table(signs$sign_type)

    # Check r10's average red level by sign type
    aggregate(r10 ~ sign_type, data = signs, mean)

    Classifying a collection of road signs

    Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

    The test course includes 59 additional road signs divided into three types: stop, speed limit, and pedestrian crossing. At the conclusion of the trial, you are asked to measure the car’s overall performance at recognizing these signs.

    The class package and the dataset signs are already loaded in your workspace. So is the data frame test_signs, which holds a set of observations you’ll test your model on.

    # Use kNN to identify the test road signs
    sign_types <- signs$sign_type
    signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)

    # Create a confusion matrix of the predicted versus actual values
    signs_actual <- test_signs$sign_type
    table(signs_actual, signs_pred)

    # Compute the accuracy
    mean(signs_actual == signs_pred)
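
    Beyond overall accuracy, the same confusion matrix supports a quick per-class breakdown; this extra step is not part of the original exercise:

    # Per-class accuracy: correct predictions sit on the diagonal;
    # divide by the number of true signs of each type
    conf <- table(signs_actual, signs_pred)
    diag(conf) / rowSums(conf)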

    Understanding the impact of ‘K’

    There is a complex relationship between k and classification accuracy. Bigger is not always better.

    Which of these is a valid reason for keeping k as small as possible (but no smaller)?
    —————————————————————————————————————————————
    The correct answer: A smaller k may utilize more subtle patterns.
    —————————————————————————————————————————————

    Testing other ‘k’ values

    By default, the knn() function in the class package uses only the single nearest neighbor.

    Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class.

    Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.

    The class package is already loaded in your workspace along with the datasets signs, signs_test, and sign_types. The object signs_actual holds the true values of the signs.

    # Compute the accuracy of the baseline model (default k = 1)
    k_1 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types)
    mean(signs_actual == k_1)

    # Modify the above to set k = 7
    k_7 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7)
    mean(signs_actual == k_7)

    # Set k = 15 and compare to the above
    k_15 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 15)
    mean(signs_actual == k_15)
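
    Rather than repeating the call by hand for every candidate value, a short sapply() loop can compare a whole range of k values at once; this sketch goes beyond the original exercise:

    # Try several k values and report the accuracy of each
    ks <- c(1, 3, 5, 7, 9, 11, 15)
    accuracy <- sapply(ks, function(k) {
      pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = k)
      mean(signs_actual == pred)
    })
    setNames(accuracy, ks)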

    Seeing how the neighbors voted

    When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

    For example, knowing more about the voters’ confidence in the classification could allow an autonomous vehicle to use caution if there is any chance at all that a stop sign is ahead.

    In this exercise, you will learn how to obtain the voting results from the knn() function.

    The class package has already been loaded in your workspace along with the datasets signs, sign_types, and signs_test.

    # Use the prob parameter to get the proportion of votes for the winning class
    sign_pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types,
                     k = 7, prob = TRUE)

    # Get the "prob" attribute from the predicted classes
    sign_prob <- attr(sign_pred, "prob")

    # Examine the first several predictions
    head(sign_pred)

    # Examine the proportion of votes for the winning class
    head(sign_prob)
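
    One hypothetical way to act on those vote proportions is to flag any prediction whose winning share falls below a chosen threshold, so the vehicle can slow down and take another look; the 0.75 cutoff below is an arbitrary illustration, not part of the exercise.

    # Flag predictions where fewer than 75% of the k = 7 neighbors agreed
    low_confidence <- sign_prob < 0.75
    head(data.frame(prediction = sign_pred, prob = sign_prob, low_confidence))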

    Why normalize data

    Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization. What is the purpose of this step?
    —————————————————————————————————————————————
    The correct answer: To ensure all data elements may contribute equal shares to distance.
    —————————————————————————————————————————————
    The min-max normalization function most commonly used with kNN is:

    # Min-max normalization: rescale each value of x to the [0, 1] range
    Normalization <- function(x) {
      return((x - min(x)) / (max(x) - min(x)))
    }
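
    Applied to the signs data, the function would be mapped over every predictor column before calling knn(); the sketch below assumes, as in the exercises above, that column 1 holds the sign_type labels.

    # Rescale all 48 color columns to [0, 1], leaving the labels untouched
    signs_n <- as.data.frame(lapply(signs[-1], Normalization))
    summary(signs_n$r10)   # the rescaled r10 now runs from 0 to 1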

    Pros and cons of kNN

    Pros

            kNN is conceptually simple, easy to understand, and easy to implement; it requires no parameter estimation and no training phase.

    Cons

           The algorithm's main shortcoming when classifying is sample imbalance: when one class has a very large number of samples and the other classes have few, the K nearest neighbors of a new sample are likely to be dominated by the large class. Its other shortcoming is computational cost: to find the K nearest neighbors of each item to be classified, its distance to every known sample must be computed.
    —————————————————————————————————————————————
    That’s all. Thank you
