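Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent. It first learns the joint probability distribution $P(X,Y)$ of input and output under the conditional independence assumption; then, for a given input $x$, it uses Bayes' theorem to find the output $y$ with the largest posterior probability $P(Y|X)$. Because it models the joint distribution, naive Bayes is a generative model. Conditional independence of features means that, once the class is fixed, the features of an instance are independent of one another.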
Let the training set be $T=\lbrace (x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\rbrace$, where $x_i \in X \subseteq R^n$ and $y_i \in Y=\lbrace c_1,c_2,\cdots,c_K\rbrace$. The first thing to learn is the joint probability distribution $P(X,Y)$, which by the definition of conditional probability can be written as

$$P(X,Y)=P(X|Y)\cdot P(Y)$$

where the prior distribution is

$$P(Y=c_k),\quad k=1,2,\cdots,K$$

and the conditional distribution is

$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},X^{(2)}=x^{(2)},\cdots,X^{(n)}=x^{(n)}|Y=c_k),\quad k=1,2,\cdots,K$$

Applying the conditional independence assumption gives

$$P(X^{(1)}=x^{(1)},X^{(2)}=x^{(2)},\cdots,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$

and therefore

$$P(X=x,Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)\cdot P(Y=c_k)$$

By Bayes' theorem,

$$P(Y=c_k|X=x)=\frac{\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)\cdot P(Y=c_k)}{P(X=x)}$$

Expanding the denominator with the law of total probability,

$$P(Y=c_k|X=x)=\frac{\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)\cdot P(Y=c_k)}{\sum_{k'=1}^{K}P(X=x|Y=c_{k'})P(Y=c_{k'})}$$

and applying the conditional independence assumption once more in the denominator,

$$P(Y=c_k|X=x)=\frac{\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)\cdot P(Y=c_k)}{\sum_{k'=1}^{K}P(Y=c_{k'})\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_{k'})},\quad k=1,2,\cdots,K$$

This is the basic formula of naive Bayes. What we finally want, given an input instance $x$, is the $c_k$ that maximizes $P(Y=c_k|X=x)$; that $c_k$ is the class assigned to $x$. In mathematical form:

$$y=f(x)=\arg\max_{c_k} P(Y=c_k|X=x)=\arg\max_{c_k}\frac{\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)\cdot P(Y=c_k)}{\sum_{k'=1}^{K}P(Y=c_{k'})\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_{k'})}$$

The denominator is the same for every $c_k$, so it does not affect the maximization and the expression simplifies to

$$y=\arg\max_{c_k}\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)\cdot P(Y=c_k)$$

This is the quantity the naive Bayes procedure actually computes. For discrete data, every parameter in it can be estimated by the corresponding relative frequency in the training set, which is the maximum likelihood estimate. A maximum likelihood estimate can come out as 0 when some feature value never occurs together with some class, and a single zero factor makes the whole product zero. For this reason Bayesian estimation is used instead. The Bayesian estimate of the conditional probability is

$$P_{\lambda}(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},y_i=c_k)+\lambda}{\sum_{i=1}^{N}I(y_i=c_k)+S_j\lambda}$$

where $I$ is the indicator function, $a_{jl}$ is the $l$-th value that the $j$-th feature can take, and $\lambda \ge 0$ is a hyperparameter. With $\lambda=0$ this reduces to the maximum likelihood estimate; with $\lambda=1$ it is called Laplace smoothing. $S_j$ is the number of distinct values that feature $X^{(j)}$ can take.
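
As an illustration, the decision rule and the smoothed conditional estimate can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `fit` and `predict` and the toy weather data are hypothetical, $S_j$ is taken to be the number of distinct values of feature $j$ seen in the training set, and the prior is estimated by plain relative frequency because the text only gives the smoothed form of the conditional probability.

```python
from collections import Counter


def fit(samples, labels, lam=1.0):
    """Estimate P(Y=c_k) and the smoothed P(X^(j)=a | Y=c_k) from discrete data."""
    n_samples = len(labels)
    n_features = len(samples[0])
    class_counts = Counter(labels)  # sum_i I(y_i = c_k)

    # Assumption: S_j is the number of distinct values of feature j seen in training.
    values = [set(row[j] for row in samples) for j in range(n_features)]

    # joint[(j, a, c)] = sum_i I(x_i^(j) = a, y_i = c)
    joint = Counter()
    for row, c in zip(samples, labels):
        for j, a in enumerate(row):
            joint[(j, a, c)] += 1

    # Assumption: the prior is the plain relative frequency (no smoothing).
    prior = {c: n_c / n_samples for c, n_c in class_counts.items()}

    cond = {}
    for c, n_c in class_counts.items():
        for j in range(n_features):
            s_j = len(values[j])
            for a in values[j]:
                # With lam=0 this is the maximum likelihood estimate.
                cond[(j, a, c)] = (joint[(j, a, c)] + lam) / (n_c + s_j * lam)
    return prior, cond


def predict(x, prior, cond):
    """Return the class maximizing P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)."""
    scores = {}
    for c, p in prior.items():
        score = p
        for j, a in enumerate(x):
            # Assumes every value in x was seen in training; otherwise KeyError.
            score *= cond[(j, a, c)]
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data: (outlook, temperature) -> play
samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
           ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes", "yes", "no"]

prior, cond = fit(samples, labels, lam=1.0)
label, scores = predict(("sunny", "cool"), prior, cond)
print(label)  # yes
```

In this toy set the value `cool` never occurs with class `no`, so with `lam=0` the estimate of $P(X^{(2)}=\text{cool}|Y=\text{no})$ is exactly 0 and the score for `no` vanishes; with `lam=1` it becomes $1/6$. For long feature vectors, summing logarithms of the probabilities instead of multiplying them is a common way to avoid floating-point underflow.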