Contents
1. Naive Bayes
2. NB vs. Logistic Regression
These are study notes for the book scikit-learn机器学习 (2nd edition).
For background, see the chapter on the Naive Bayes (NB) method in 《统计学习方法》.
1. Naive Bayes
The model predicts the class with the highest probability:
$$
y=\arg\max_{c_k} P(Y=c_k) \prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_k)
$$
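One way to see where this rule comes from is Bayes' theorem. The following is a sketch, assuming the features are conditionally independent given the class; the index $k'$ is introduced here only to write the sum over classes.

```latex
\begin{aligned}
P(Y=c_k \mid X=x)
&= \frac{P(Y=c_k)\,P(X=x \mid Y=c_k)}{\sum_{k'} P(Y=c_{k'})\,P(X=x \mid Y=c_{k'})} \\
&= \frac{P(Y=c_k)\prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_{k'} P(Y=c_{k'})\prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_{k'})}
\end{aligned}
```

The denominator is the same for every class $c_k$, so maximizing the posterior is equivalent to maximizing the numerator alone.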
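Model assumptions:
The samples are independent and identically distributed (i.i.d.).
Conditional independence: the features $X^{(j)}$ are conditionally independent given the class: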
$$
P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\ldots,X^{(n)}=x^{(n)} \mid Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)
$$
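To make the decision rule and the factorization concrete, here is a minimal from-scratch sketch on a small toy dataset with two categorical features. The data, the function names `fit` and `predict`, and the use of exact fractions are assumptions made for illustration; the probabilities are plain relative frequencies with no smoothing, so an unseen feature value would zero out a class score.

```python
from collections import Counter
from fractions import Fraction

# Toy data (hypothetical): each sample is (X1, X2), labels are 1 or -1.
X = [(1, "S"), (1, "M"), (1, "M"), (1, "S"), (1, "S"),
     (2, "S"), (2, "M"), (2, "M"), (2, "L"), (2, "L"),
     (3, "L"), (3, "M"), (3, "M"), (3, "L"), (3, "L")]
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]


def fit(X, y):
    """Estimate P(Y=c) and P(X^(j)=v | Y=c) by counting."""
    class_counts = Counter(y)
    priors = {c: Fraction(k, len(y)) for c, k in class_counts.items()}
    cond_counts = Counter()  # key: (feature index j, value v, class c)
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond_counts[(j, v, yi)] += 1

    def likelihood(j, v, c):
        return Fraction(cond_counts[(j, v, c)], class_counts[c])

    return priors, likelihood


def predict(x, priors, likelihood):
    """Return the argmax class and the unnormalized score of every class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= likelihood(j, v, c)  # product over features
        scores[c] = score
    return max(scores, key=scores.get), scores


priors, likelihood = fit(X, y)
label, scores = predict((2, "S"), priors, likelihood)
print(label, scores)
```

For the query `(2, "S")`, the score of class 1 is 9/15 × 3/9 × 1/9 = 1/45 and the score of class -1 is 6/15 × 2/6 × 3/6 = 1/15, so the sketch predicts -1.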
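Model variants:
Multinomial NB: suited to discrete features such as categories or counts.
Gaussian NB: suited to continuous features; assumes that each feature is normally distributed within each class.
Bernoulli NB: suited to the case where all features are binary.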
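As a rough illustration of matching each variant to a feature type, the sketch below fits the three scikit-learn estimators on synthetic data. The data-generating choices (Gaussian means, Poisson rates, the binarization threshold) are assumptions made up for this example, and the scores are training accuracy only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: class-dependent mean (0 or 3), unit variance.
X_cont = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(100, 2))

# Count features: class-dependent Poisson rates (hypothetical values).
rates = np.where(y[:, None] == 0, [8, 1, 1], [1, 1, 8])
X_counts = rng.poisson(lam=rates, size=(100, 3))

# Binary features: whether each count exceeds 2.
X_bin = (X_counts > 2).astype(int)

scores = {
    "GaussianNB": GaussianNB().fit(X_cont, y).score(X_cont, y),
    "MultinomialNB": MultinomialNB().fit(X_counts, y).score(X_counts, y),
    "BernoulliNB": BernoulliNB().fit(X_bin, y).score(X_bin, y),
}
for name, acc in scores.items():
    print(f"{name}: training accuracy = {acc:.2f}")
```

Each estimator shares the same `fit` / `predict` / `score` interface, so switching variants mostly comes down to choosing the one whose likelihood matches the feature type.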
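The naive Bayes assumptions rarely hold in practice, yet NB models can still discriminate linearly separable classes effectively.
When training data is scarce, NB often outperforms discriminative models.
The model is simple, fast to run, and easy to implement.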
2. NB vs. Logistic Regression
%matplotlib inline
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the data and hold out a stratified test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=11)

lr = LogisticRegression()
nb = GaussianNB()
lr_scores = []
nb_scores = []

# Train on progressively larger stratified slices of the training set
train_sizes = range(10, len(X_train), 10)
for train_size in train_sizes:
    X_slice, _, y_slice, _ = train_test_split(
        X_train, y_train, train_size=train_size,
        stratify=y_train, random_state=11)
    nb.fit(X_slice, y_slice)
    nb_scores.append(nb.score(X_test, y_test))
    lr.fit(X_slice, y_slice)
    lr_scores.append(lr.score(X_test, y_test))

# Plot test accuracy against training set size for both models
plt.plot(train_sizes, nb_scores, label='Naive Bayes')
plt.plot(train_sizes, lr_scores, linestyle='--', label='Logistic Regression')
plt.title("NB vs. Logistic Regression")
plt.xlabel("Number of training samples")
plt.ylabel("Test set accuracy")
plt.legend()
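On small training sets, the NB model outperforms logistic regression.
As the number of training samples grows, the performance of the logistic regression model gradually improves.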
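A curve drawn from a single random split can be noisy, so one way to check the trend is to repeat the experiment over several splits and average. The sketch below is only that, a sketch: the grid of training sizes, the number of seeds, and `max_iter=5000` (used to reduce convergence warnings on the unscaled features) are assumptions that are not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
train_sizes = [10, 20, 50, 100, 200, 400]  # hypothetical grid
seeds = range(5)                           # hypothetical number of repeats

nb_runs, lr_runs = [], []
for seed in seeds:
    # A fresh stratified train/test split for every seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=seed)
    nb_row, lr_row = [], []
    for n in train_sizes:
        # Stratified slice of n training samples
        X_slice, _, y_slice, _ = train_test_split(
            X_train, y_train, train_size=n,
            stratify=y_train, random_state=seed)
        nb = GaussianNB().fit(X_slice, y_slice)
        lr = LogisticRegression(max_iter=5000).fit(X_slice, y_slice)
        nb_row.append(nb.score(X_test, y_test))
        lr_row.append(lr.score(X_test, y_test))
    nb_runs.append(nb_row)
    lr_runs.append(lr_row)

# Average test accuracy over seeds, one value per training size
nb_mean = np.mean(nb_runs, axis=0)
lr_mean = np.mean(lr_runs, axis=0)
for n, a, b in zip(train_sizes, nb_mean, lr_mean):
    print(f"n={n:3d}  NB={a:.3f}  LR={b:.3f}")
```

Comparing `nb_mean` and `lr_mean` row by row shows whether the ordering of the two models at small and large training sizes holds on average rather than for one particular split.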