Hotel Booking Dataset: Cancellation Analysis and Prediction

Technology 2022-07-13

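Contents:
1. Background
2. Questions
3. Understanding the data
3.1 Data cleaning
4. Visual analysis
4.1 Overall cancellation rates of the two hotels
4.2 Characteristics of cancelling users
4.2.1 User demand attributes
4.2.2 User behavior attributes
4.2.3 User profile attributes
4.3 Hotel price attributes
5. Summary and recommendations
6. Cancellation prediction
6.1 Feature selection
6.2 Feature encoding
6.3 Model building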

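1. Background

The hotel booking dataset contains reservations for two hotels (a city hotel and a resort hotel) made between 2015-07-01 and 2017-08-31. It records, among other things, the booking date, the length of stay, the number of adults, children, and babies, the meal type, and the number of parking spaces requested. A high cancellation rate costs a hotel revenue, so this article analyzes how the various fields relate to cancellation, builds a predictive model, and offers recommendations based on the findings.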

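2. Questions

(1) What are the overall cancellation rates of the two hotels?
(2) Overall, what characteristics do cancelling users have in common?
(3) Can a suitable model be found to predict which users will cancel?
(4) What targeted measures could help prevent cancellations?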

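3. Understanding the data

According to its description, the dataset has 32 fields and 119,390 rows, each row holding the basic information of one booking. Our goal is to uncover how the other features relate to is_canceled.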

3.1 Data cleaning

- Inspecting the data

Load the dataset file:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

# Font settings used to render Chinese labels and minus signs in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

data_path = r'hotel_bookings.csv'
data = pd.read_csv(data_path)

Check the size of the dataset and take a first look at the first 10 rows.

# Data cleaning
# Check the dataset size
data.shape
# Result: (119390, 32)

# Show all columns without truncation
pd.set_option('display.max_columns', None)

# View the first 10 rows
data.head(10)

- Handling missing values

Check whether the data contains nulls:

# Count nulls per column
data.isnull().sum().sort_values(ascending=False)

out:
company                           112593
agent                              16340
country                              488
children                               4
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
is_canceled                            0
market_segment                         0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
babies                                 0
meal                                   0
reservation_status_date                0
distribution_channel                   0
reservation_status                     0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
hotel                                  0
dtype: int64

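The children, country, agent, and company columns contain nulls:
- children: a null presumably means no children, so fill with 0
- country: fill with the mode
- agent: no travel agency ID is recorded, presumably an individual customer, so fill with 0
- company: mostly missing and of little relevance to the analysis, so drop the column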
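The filling rules above can be tried without the dataset. The following is a minimal sketch in plain Python on a few made-up rows; the rows and the helper name `fill_missing` are hypothetical and only illustrate the rules, while the article's own cleaning is done with pandas.

```python
from collections import Counter

# Made-up rows; None marks a missing value
rows = [
    {"children": 0, "agent": 9.0, "country": "PRT", "company": None},
    {"children": None, "agent": None, "country": "PRT", "company": None},
    {"children": 2, "agent": 240.0, "country": None, "company": 40.0},
    {"children": 1, "agent": None, "country": "GBR", "company": None},
]

def fill_missing(rows):
    """Drop company, fill children/agent with 0, fill country with the mode."""
    known = [r["country"] for r in rows if r["country"] is not None]
    mode_country = Counter(known).most_common(1)[0][0]
    cleaned = []
    for r in rows:
        # Copy the row without the company column
        filled = {k: v for k, v in r.items() if k != "company"}
        for column in ("children", "agent"):
            if filled[column] is None:
                filled[column] = 0
        if filled["country"] is None:
            filled["country"] = mode_country
        cleaned.append(filled)
    return cleaned

cleaned = fill_missing(rows)
for row in cleaned:
    print(row)
```

Filling country with the mode keeps the row but slightly inflates the most common country, which is acceptable here because only 488 of 119,390 rows are affected.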

# Handle missing values
df = data.copy()
df.drop(['company'], axis=1, inplace=True)
# Assign the filled column back instead of calling fillna(inplace=True) on a column
df['children'] = df['children'].fillna(0)
df['agent'] = df['agent'].fillna(0)
df['country'] = df['country'].fillna(df['country'].mode()[0])

- Validating the data and removing anomalies

# Inspect the values in every column
for x in df.columns:
    test = df[x].value_counts()
    print('{0} non-null rows: {1}'.format(x, test.sum()))
    print('{0} dtype: {1}'.format(x, df[x].dtypes))
    print('{0} values:\n{1}\n'.format(x, test))

df.describe()

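Based on the output above, two kinds of anomalous rows are removed: (1) bookings whose total number of guests (adults + children + babies) is 0; (2) bookings whose total number of nights (stays_in_weekend_nights + stays_in_week_nights) is 0.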

# Remove bookings with zero guests
zero_people = df[df[['adults', 'children', 'babies']].sum(axis=1) == 0]
df.drop(zero_people.index, inplace=True)
# Remove bookings with zero nights
zero_date = df[df[['stays_in_weekend_nights', 'stays_in_week_nights']].sum(axis=1) == 0]
df.drop(zero_date.index, inplace=True)

- Removing duplicates

Hotel data exported from several databases can contain duplicated records. Each row here has 30+ columns, so two distinct bookings are unlikely to match on every column; rows that are identical in all columns are therefore dropped.

df.drop_duplicates(inplace=True)
df.info()

out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 86719 entries, 2 to 119389
Data columns (total 31 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   hotel                           86719 non-null  object
 1   is_canceled                     86719 non-null  int64
 2   lead_time                       86719 non-null  int64
 3   arrival_date_year               86719 non-null  int64
 4   arrival_date_month              86719 non-null  object
 5   arrival_date_week_number        86719 non-null  int64
 6   arrival_date_day_of_month       86719 non-null  int64
 7   stays_in_weekend_nights         86719 non-null  int64
 8   stays_in_week_nights            86719 non-null  int64
 9   adults                          86719 non-null  int64
 10  children                        86719 non-null  float64
 11  babies                          86719 non-null  int64
 12  meal                            86719 non-null  object
 13  country                         86719 non-null  object
 14  market_segment                  86719 non-null  object
 15  distribution_channel            86719 non-null  object
 16  is_repeated_guest               86719 non-null  int64
 17  previous_cancellations          86719 non-null  int64
 18  previous_bookings_not_canceled  86719 non-null  int64
 19  reserved_room_type              86719 non-null  object
 20  assigned_room_type              86719 non-null  object
 21  booking_changes                 86719 non-null  int64
 22  deposit_type                    86719 non-null  object
 23  agent                           86719 non-null  float64
 24  days_in_waiting_list            86719 non-null  int64
 25  customer_type                   86719 non-null  object
 26  adr                             86719 non-null  float64
 27  required_car_parking_spaces     86719 non-null  int64
 28  total_of_special_requests       86719 non-null  int64
 29  reservation_status              86719 non-null  object
 30  reservation_status_date         86719 non-null  object
dtypes: float64(3), int64(16), object(12)

86,719 rows remain, and the data type of every field matches its meaning. Data cleaning is complete.
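The three filters (zero guests, zero nights, exact duplicates) can be sketched on a handful of made-up bookings. This is plain Python for illustration only; the helper names `booking` and `clean` are hypothetical and not part of the article's code.

```python
def booking(adults, children, babies, weekend_nights, week_nights):
    """Build one made-up booking row with only the fields the filters need."""
    return {
        "adults": adults,
        "children": children,
        "babies": babies,
        "stays_in_weekend_nights": weekend_nights,
        "stays_in_week_nights": week_nights,
    }

def clean(rows):
    """Drop zero-guest rows, zero-night rows, and exact duplicates (first copy kept)."""
    seen = set()
    kept = []
    for r in rows:
        guests = r["adults"] + r["children"] + r["babies"]
        nights = r["stays_in_weekend_nights"] + r["stays_in_week_nights"]
        if guests == 0 or nights == 0:
            continue
        # A row counts as a duplicate only if every field matches
        key = tuple(sorted(r.items()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept

rows = [
    booking(2, 0, 0, 1, 2),
    booking(0, 0, 0, 1, 1),  # no guests
    booking(2, 1, 0, 0, 0),  # no nights
    booking(2, 0, 0, 1, 2),  # exact duplicate of the first row
    booking(1, 0, 0, 0, 3),
]
cleaned = clean(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```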

4. Visual analysis

4.1 Overall cancellation rates of the two hotels

Overall share of cancelled bookings:

# Overall cancellation share
df['is_canceled'] = df['is_canceled'].astype('str').replace(['0', '1'], ['no', 'yes'])
plt.rcParams['figure.figsize'] = 6, 6
counts = df['is_canceled'].value_counts()
# explode sets each wedge's offset from the center: the default (0, 0) keeps both
# wedges in place, and (0.1, 0) pulls out the first wedge
plt.pie(counts, labels=counts.index, autopct='%1.2f%%', explode=(0.1, 0))
plt.title('is_canceled(Yes/No) Ratio')
plt.show()

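Across both hotels, the overall cancellation rate is 27.69%. Next we compare the city hotel and the resort hotel to see whether their cancellation rates differ noticeably.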

def cancel_share(hotel_name):
    # Share of 'no' / 'yes' bookings for one hotel; reindex fixes the bar order
    return (df.loc[df['hotel'] == hotel_name, 'is_canceled']
              .value_counts(normalize=True)
              .reindex(['no', 'yes']))

fig = plt.figure(figsize=(10, 6))

# Cancellation share for the city hotel
ax_city = fig.add_subplot(1, 2, 1)
y1 = cancel_share('City Hotel')
ax_city.bar(y1.index, y1.values, width=0.5, color='c')
ax_city.set_title('City Hotel')

# Cancellation share for the resort hotel
ax_resort = fig.add_subplot(1, 2, 2)
y2 = cancel_share('Resort Hotel')
ax_resort.bar(y2.index, y2.values, width=0.5, color='c')
ax_resort.set_title('Resort Hotel')

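The city hotel's cancellation rate is slightly higher than the resort hotel's, which may be related to the facilities, location, or even service of the two hotels. Overall, though, both fall between 20% and 30%, which is quite high.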

Cancellation rate over time:

import seaborn as sns

ordered_months = [
    "January", "February", "March", "April", "May", "June", "July", "August",
    "September", "October", "November", "December"
]
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

# Average bookings per calendar month: July and August occur in three years
# of the data (2015-2017), every other month in two
monthly = df.groupby('arrival_date_month').size().astype(float)
monthly = monthly / 2  # the original "monthly/2" discarded its result
monthly.loc[['July', 'August']] = monthly.loc[['July', 'August']] * 2 / 3

# Back to 0/1 so that the mean of is_canceled is the cancellation rate
df['is_canceled'] = df['is_canceled'].map({'yes': 1, 'no': 0}).astype('int')

sns.barplot(x=list(range(1, 13)), y=monthly[ordered_months].values, ax=ax1)
ax2.plot(
    range(12),
    df.groupby('arrival_date_month')['is_canceled'].mean()[ordered_months].values,
    'ro-')
ax1.set_xlabel('Month')
ax2.set_ylabel('Cancellation rate')

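The chart above covers July 2015 through August 2017: the bars show the average number of bookings per calendar month and the line shows the cancellation rate. April through August is the booking peak each year, and the cancellation rate is also at its highest during that period.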
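The monthly averages depend on how many times each calendar month occurs in the data window. A small sketch using only the standard library counts those occurrences for 2015-07-01 through 2017-08-31; the helper name `months_covered` is hypothetical.

```python
from collections import Counter
from datetime import date

def months_covered(start, end):
    """Count how often each calendar month occurs between two dates (inclusive)."""
    counts = Counter()
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        counts[month] += 1
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return counts

occurrences = months_covered(date(2015, 7, 1), date(2017, 8, 31))
for month in range(1, 13):
    print(month, occurrences[month])
```

July and August occur three times and every other month twice, which is why the raw monthly totals are divided by 3 for those two months and by 2 for the rest before they are compared.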

4.2 Characteristics of cancelling users

4.2.1 User demand attributes

Effect of lead time (days between booking and arrival) on cancellation:

# Kernel density plot of a feature, split by cancellation status
def kdeplot(feature, xlabel):
    plt.figure(figsize=(9, 4))
    plt.title("KDE for {0}".format(feature))
    # fill=True replaces the deprecated shade argument (seaborn 0.11+)
    sns.kdeplot(df[df['is_canceled'] == 0][feature].dropna(), color='navy', label='is_canceled: No', fill=True)
    sns.kdeplot(df[df['is_canceled'] == 1][feature].dropna(), color='orange', label='is_canceled: Yes', fill=True)
    plt.xlabel(xlabel)
    # Set font sizes
    plt.rcParams.update({'font.size': 20})
    plt.legend(fontsize=10)

%matplotlib inline
kdeplot('lead_time', 'lead_time')

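Bookings made 1-10 days ahead are relatively unlikely to be cancelled, while the probability of cancellation grows for lead times beyond 30 days. The hotels could consider accepting bookings only up to half a month or one month in advance.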
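The density plot can be complemented by a cancellation rate per lead-time bucket. The sketch below is plain Python on made-up (lead_time, is_canceled) pairs; the bucket edges of 10 and 30 days follow the thresholds mentioned above, and the helper name `rate_by_bucket` and the sample values are hypothetical.

```python
from bisect import bisect_left

def rate_by_bucket(bookings, edges, labels):
    """Cancellation rate per bucket; a value equal to an edge falls in the lower bucket."""
    totals = [0] * len(labels)
    cancels = [0] * len(labels)
    for lead_time, is_canceled in bookings:
        i = bisect_left(edges, lead_time)
        totals[i] += 1
        cancels[i] += is_canceled
    return {label: cancels[i] / totals[i]
            for i, label in enumerate(labels) if totals[i]}

# Made-up (lead_time in days, is_canceled) pairs
bookings = [(3, 0), (7, 0), (10, 1), (15, 0), (25, 1), (45, 1), (90, 1), (120, 0)]
rates = rate_by_bucket(bookings, edges=[10, 30], labels=["0-10", "11-30", "31+"])
for label, rate in rates.items():
    print(label, round(rate, 2))
```

On the real data, the same grouping would give one number per bucket that can be compared directly, which is easier to act on than the shape of two density curves.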

Weekend and weekday stays:

# Cancellation rate among bookings that include weekend nights / weekday nights
weekend_rate = df.loc[df['stays_in_weekend_nights'] != 0, 'is_canceled'].mean()
week_rate = df.loc[df['stays_in_week_nights'] != 0, 'is_canceled'].mean()
print('Weekend cancellation rate: %.2f%%' % (weekend_rate * 100))
print('Weekday cancellation rate: %.2f%%' % (week_rate * 100))

out:
Weekend cancellation rate: 29.40%
Weekday cancellation rate: 28.20%

Whether a stay includes weekend nights or weekday nights has little effect on cancellation.

Meal type:

# Cancelled bookings of each category as a share of ALL bookings at the city hotel
def is_canceled_rate_city(feature):
    print('city hotel')
    city_total = len(df[df['hotel'] == 'City Hotel'])
    for i in df[feature].value_counts().index:
        canceled = len(df.loc[(df['is_canceled'] == 1) & (df['hotel'] == 'City Hotel') & (df[feature] == i), :])
        print('%s: %.2f%%' % (i, canceled / city_total * 100))

# Same calculation for the resort hotel
def is_canceled_rate_resort(feature):
    print('resort hotel')
    resort_total = len(df[df['hotel'] == 'Resort Hotel'])
    for i in df[feature].value_counts().index:
        canceled = len(df.loc[(df['is_canceled'] == 1) & (df['hotel'] == 'Resort Hotel') & (df[feature] == i), :])
        print('%s: %.2f%%' % (i, canceled / resort_total * 100))

# Effect of meal type
is_canceled_rate_city('meal')

out:
city hotel
BB: 22.35%
SC: 6.28%
HB: 1.57%
Undefined: 0.00%
FB: 0.00%

is_canceled_rate_resort('meal')

out:
resort hotel
BB: 20.81%
SC: 3.85%
HB: 2.83%
Undefined: 0.09%
FB: 0.11%

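By meal type, the two hotels do not differ much. Note that each figure above is a meal type's cancelled bookings as a share of all bookings at that hotel, so it also reflects how popular the meal type is, not only how often it is cancelled.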
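Two different denominators appear in this article: cancelled bookings of a category divided by all bookings, and cancelled bookings of a category divided by that category's own bookings. A minimal plain-Python sketch on made-up data shows how the two can disagree; the helper name `cancellation_shares` and the counts are hypothetical.

```python
from collections import Counter

def cancellation_shares(bookings):
    """Return (share of all bookings, rate within each category) for cancelled bookings."""
    total = len(bookings)
    per_category = Counter(category for category, _ in bookings)
    cancelled = Counter(category for category, is_canceled in bookings if is_canceled)
    share_of_all = {c: cancelled[c] / total for c in per_category}
    within_category = {c: cancelled[c] / per_category[c] for c in per_category}
    return share_of_all, within_category

# Made-up data: 8 BB bookings (2 cancelled) and 2 SC bookings (1 cancelled)
bookings = ([("BB", True)] * 2 + [("BB", False)] * 6
            + [("SC", True)] + [("SC", False)])
share_of_all, within_category = cancellation_shares(bookings)
print(share_of_all)
print(within_category)
```

Here BB has the larger share of all bookings (0.2 versus 0.1) simply because it is more common, while SC has the higher rate within its own category (0.5 versus 0.25). The second measure is the one that answers "how likely is a booking of this type to be cancelled".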

Reserved room type:

# Effect of the reserved room type
reserved_room_type = list(
    df.groupby('reserved_room_type').size().sort_values(ascending=False).head(20).index)
# Share of bookings covered by these room types
df[df.reserved_room_type.isin(reserved_room_type)].shape[0] / df.shape[0]

fig, ax1 = plt.subplots(figsize=(10, 6))
ax2 = ax1.twinx()
plt.xticks(range(len(reserved_room_type)), reserved_room_type)
# Use the cleaned df and index both series by the same room-type order
room_groups = df[df.reserved_room_type.isin(reserved_room_type)].groupby('reserved_room_type')
ax1.bar(range(len(reserved_room_type)), room_groups.size().loc[reserved_room_type])
ax1.set_xlabel('reserved_room_type')
ax1.set_ylabel('Total Number of Booking')
ax2.plot(
    range(len(reserved_room_type)),
    room_groups['is_canceled'].mean().loc[reserved_room_type],
    'ro-')
ax2.set_ylabel('Cancellation rate')

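Room type A has relatively high demand and a relatively high cancellation rate. The choice of room type usually depends on the kind of trip (single, couple, family, and so on), which is consistent with the later finding that couples are the main customer group.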

4.2.2 User behavior attributes

Number of booking changes:

# Effect of the number of booking changes on the cancellation rate
book_changes = list(
    df.groupby('booking_changes').size().sort_values(ascending=False).head(20).index)
# Share of bookings covered by these change counts
df[df.booking_changes.isin(book_changes)].shape[0] / df.shape[0]

fig, ax1 = plt.subplots(figsize=(10, 6))
ax2 = ax1.twinx()
plt.xticks(range(len(book_changes)), book_changes)
# Use the cleaned df and index both series by the same order
change_groups = df[df.booking_changes.isin(book_changes)].groupby('booking_changes')
ax1.bar(range(len(book_changes)), change_groups.size().loc[book_changes])
ax1.set_xlabel('Change_times')
ax1.set_ylabel('Total Number of Booking')
ax2.plot(
    range(len(book_changes)),
    change_groups['is_canceled'].mean().loc[book_changes],
    'ro-')
ax2.set_ylabel('Cancellation rate')

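The number of changes shows no clear positive correlation with eventual cancellation, but bookings that were changed at least once are cancelled less often, which matches common experience. The hotels could consider trialling a feature that lets guests modify their bookings.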

Deposit type:

from matplotlib.ticker import PercentFormatter

def barplot_percentages(feature, orient='v', axis_name="percentage of guests"):
    # axis_name is the label of the percentage axis.
    # Group by the feature and compute each (feature value, is_canceled) pair's
    # share of all bookings. value_counts sorts by count, so the result is
    # turned into columns with reset_index rather than relying on position.
    g = (df.groupby(feature)['is_canceled'].value_counts() / len(df)).rename(axis_name).reset_index()
    if orient == 'v':
        ax = sns.barplot(x=feature, y=axis_name, hue='is_canceled', data=g, orient=orient)
        ax.yaxis.set_major_formatter(PercentFormatter(1.0, decimals=0))
        plt.rcParams.update({'font.size': 13})
    else:
        ax = sns.barplot(x=axis_name, y=feature, hue='is_canceled', data=g, orient=orient)
        ax.xaxis.set_major_formatter(PercentFormatter(1.0, decimals=0))
        plt.legend(fontsize=6)
    plt.title('is_canceled(Yes/No) Ratio as {0}'.format(feature))
    plt.show()

%matplotlib inline
barplot_percentages('deposit_type', orient='h')

Almost all bookings at both hotels are made without a deposit. A deposit mechanism could be tried as a restraint, but it may cost the hotels some customers, so it is best applied selectively to high-quality rooms that also have a high cancellation rate.

Repeat bookings (returning guests):

barplot_percentages('is_repeated_guest')

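Returning guests are a small share of the total, only about 3.6%, but they cancel far less often than new guests. Membership or loyalty-point programs could be introduced to improve retention.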

4.2.3 User profile attributes

Guest type per booking:

# Classify each booking by who is staying
# Single guest
df.loc[(df['adults'] == 1) & (df['children'] == 0) & (df['babies'] == 0), 'guest_kind'] = 'single'
# Couple
df.loc[(df['adults'] == 2) & (df['children'] == 0) & (df['babies'] == 0), 'guest_kind'] = 'couple'
# Family (with children or babies)
df.loc[(df['children'] != 0) | (df['babies'] != 0), 'guest_kind'] = 'family'
# Group
df.loc[(df['adults'] > 2) & (df['children'] == 0) & (df['babies'] == 0), 'guest_kind'] = 'group'

# Distribution of guest types across all bookings, to find the main customer group
plt.figure(figsize=(15, 10))
plt.subplot(121)
kind_counts = df['guest_kind'].value_counts()
plt.pie(kind_counts, labels=kind_counts.index, autopct="%.2f%%")
plt.title('Share of bookings by guest type', fontsize=20)
plt.subplot(122)
barplot_percentages('guest_kind', orient='h')

# Share of summed adr (average daily rate) by guest type
plt.figure(figsize=(6, 6))
adr_by_kind = df.groupby('guest_kind')['adr'].sum()
plt.pie(adr_by_kind, labels=adr_by_kind.index, autopct="%.2f%%")
plt.title('Share of spending by guest type')

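Both by number of bookings and by spending, couples are the main customer group, and they also contribute most of the cancellations. Next we check whether couple bookings show a seasonal pattern.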
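The guest-type rules can also be written as a single function, which makes their precedence explicit: any booking with children or babies is a family regardless of the number of adults. This is a plain-Python sketch; the function name `classify_guest` is hypothetical and mirrors the pandas assignments used in the article.

```python
def classify_guest(adults, children, babies):
    """Return the guest type for one booking, or None if no rule applies."""
    if children != 0 or babies != 0:
        return "family"
    if adults == 1:
        return "single"
    if adults == 2:
        return "couple"
    if adults > 2:
        return "group"
    return None  # zero guests; such rows were removed during cleaning

examples = [(1, 0, 0), (2, 0, 0), (2, 1, 0), (3, 0, 0), (1, 0, 1)]
for adults, children, babies in examples:
    print((adults, children, babies), classify_guest(adults, children, babies))
```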

# Total bookings by month (average per calendar month)
plt.figure(figsize=(20, 5))
guest_kind_monthly = df.groupby(['arrival_date_month', 'guest_kind'])['guest_kind'].count().unstack().loc[ordered_months, :]
guest_kind_monthly = guest_kind_monthly / 2
guest_kind_monthly.loc[['July', 'August'], :] = guest_kind_monthly.loc[['July', 'August'], :] * 2 / 3
plt.plot(guest_kind_monthly)
plt.legend(guest_kind_monthly.columns)
plt.xlabel('Month')
plt.ylabel('Number per month')

# Cancelled bookings by month (average per calendar month)
plt.figure(figsize=(20, 5))
guest_kind_monthly = df[df['is_canceled'] == 1].groupby(['arrival_date_month', 'guest_kind'])['guest_kind'].count().unstack().loc[ordered_months, :]
guest_kind_monthly = guest_kind_monthly / 2
guest_kind_monthly.loc[['July', 'August'], :] = guest_kind_monthly.loc[['July', 'August'], :] * 2 / 3
plt.plot(guest_kind_monthly)
plt.legend(guest_kind_monthly.columns)
plt.xlabel('Month')
plt.ylabel('Number per month')

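As noted earlier, demand peaks in April through July while the cancellation rate also rises. For couples, both total bookings and cancellations peak in April through July; family bookings rise slightly; the other guest types show no clear change. The peak may reflect the travel high season, when guests have more free time, or it may reflect market conditions: hotel prices tend to rise in this season, and some guests may be lost to competitors because of price. The actual cause needs to be confirmed against the real situation.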

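Country:

# Top 20 countries by number of cancelled bookings
plt.figure(figsize=(12, 6))
plt.plot(df[df['is_canceled'] == 1]['country'].value_counts().head(20))

Both hotels are located in Portugal, so cancellations come mainly from European countries.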

Market segment:

# Helper functions for cancellation ratios
# Cancelled bookings of each category as a share of all bookings
def is_canceled_rate(feature):
    for i in df[feature].value_counts().index:
        rate = len(df.loc[(df['is_canceled'] == 1) & (df[feature] == i), :]) / len(df)
        print('%s: %.2f%%' % (i, rate * 100))

# Cancelled bookings of each category as a share of that category's bookings
def is_canceled_rate_v2(feature):
    for i in df[feature].value_counts().index:
        rate = len(df.loc[(df['is_canceled'] == 1) & (df[feature] == i), :]) / len(df.loc[df[feature] == i, :])
        print('%s: %.2f%%' % (i, rate * 100))

# Effect of market segment
market_segment_number = df.groupby('market_segment').size().sort_values(ascending=False)
plt.figure(figsize=(15, 6))
plt.bar(market_segment_number.index, market_segment_number.values)
plt.tick_params(labelsize=13)  # tick label font size
plt.title('Number of bookings by market segment')
is_canceled_rate_v2('market_segment')

out:
Online TA: 35.55%
Offline TA/TO: 14.95%
Direct: 14.88%
Groups: 27.22%
Corporate: 12.23%
Complementary: 12.17%
Aviation: 19.82%
Undefined: 100.00%

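Online TA (online travel agencies) has both the most bookings and the highest cancellation rate. Next we check whether cancellations are concentrated in particular agency IDs.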

Travel agency ID:

# Check whether cancellations are concentrated in particular agency IDs
is_canceled_rate_v2('agent')
agent_counts = df['agent'].value_counts()
plt.pie(agent_counts, labels=agent_counts.index, autopct="%.2f%%")
plt.show()

out:
9.0: 40.19%
240.0: 38.29%
14.0: 17.48%
7.0: 13.28%
250.0: 17.82%
241.0: 13.33%
28.0: 5.89%
8.0: 28.08%
1.0: 37.28%
6.0: 16.09%
40.0: 7.84%
....

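Agency ID 9.0 has a relatively high cancellation rate and accounts for 38.37% of all bookings, so it deserves close attention and improvement. One option is a credit-scoring mechanism in which agencies with high scores receive extra benefits, while agencies with extremely high cancellation rates could even be blacklisted.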
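A first step toward such a scoring mechanism could be to flag agencies that combine a high cancellation rate with enough volume to matter. The sketch below is plain Python on made-up counts; the function name `flag_agents`, the agency labels, and the thresholds `min_bookings=100` and `max_rate=0.35` are all assumptions for illustration, not values derived from the dataset.

```python
def flag_agents(stats, min_bookings=100, max_rate=0.35):
    """Return (agent, rate) pairs above max_rate, ignoring low-volume agents."""
    flagged = []
    for agent, (bookings, cancellations) in stats.items():
        rate = cancellations / bookings
        if bookings >= min_bookings and rate > max_rate:
            flagged.append((agent, rate))
    # Highest cancellation rate first
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Made-up (bookings, cancellations) per agency
agent_stats = {
    "A": (1000, 400),  # high volume, high rate
    "B": (500, 190),   # high volume, high rate
    "C": (300, 50),    # high volume, low rate
    "D": (10, 6),      # high rate but too few bookings to judge
}
flagged = flag_agents(agent_stats)
for agent, rate in flagged:
    print(agent, round(rate, 2))
```

The volume threshold avoids penalizing an agency on the basis of a handful of bookings, where a high rate may be noise.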

4.3 Hotel price attributes

%matplotlib auto
plt.figure(figsize=(9, 4))
plt.title("KDE for {0}".format('adr'))
# fill=True replaces the deprecated shade argument (seaborn 0.11+)
sns.kdeplot(df[df['is_canceled'] == 0]['adr'].dropna(), color='navy', label='is_canceled: No', fill=True)
sns.kdeplot(df[df['is_canceled'] == 1]['adr'].dropna(), color='orange', label='is_canceled: Yes', fill=True)
plt.xlabel('adr')
plt.xlim(-1, 500)
# Set font sizes
plt.rcParams.update({'font.size': 20})
plt.legend(fontsize=10)

# Price distribution by hotel type
df_city = df[df['hotel'] == 'City Hotel']
df_resort = df[df['hotel'] == 'Resort Hotel']

%matplotlib auto
plt.figure(figsize=(9, 4))
plt.title('city KDE for adr')
sns.kdeplot(df_city[df_city['is_canceled'] == 0]['adr'].dropna(), color='navy', label='is_canceled: No', fill=True)
sns.kdeplot(df_city[df_city['is_canceled'] == 1]['adr'].dropna(), color='orange', label='is_canceled: Yes', fill=True)
plt.xlabel('adr')
plt.xlim(-1, 200)
# Set font sizes
plt.rcParams.update({'font.size': 20})
plt.legend(fontsize=10)

plt.figure(figsize=(9, 4))
plt.title('resort KDE for adr')
sns.kdeplot(df_resort[df_resort['is_canceled'] == 0]['adr'].dropna(), color='navy', label='is_canceled: No', fill=True)
sns.kdeplot(df_resort[df_resort['is_canceled'] == 1]['adr'].dropna(), color='orange', label='is_canceled: Yes', fill=True)
plt.xlabel('adr')
plt.xlim(-1, 500)
# Set font sizes
plt.rcParams.update({'font.size': 20})
plt.legend(fontsize=10)

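For both the city hotel and the resort hotel, the higher the price, the more likely a booking is to be cancelled than kept. Overall, bookings priced below 100 euros are relatively unlikely to be cancelled.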

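5. Summary and recommendations

Overall:

- Both hotels have high overall cancellation rates, and the city hotel's rate is higher than the resort hotel's.
- April through August is the peak travel period for couples and families each year. Booking demand is high in this period, but so is the cancellation rate.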

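User demand attributes:

- Lead time affects the cancellation rate: the longer the interval between booking and arrival, the more likely the booking is cancelled. Bookings made 1-10 days ahead are relatively unlikely to be cancelled, while the cancellation probability rises clearly beyond 30 days. The hotels could consider accepting bookings only up to half a month or one month in advance.
- Demand attributes such as weekend versus weekday stays and meal type have no clear effect on the cancellation rate.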

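User behavior attributes:

1. Bookings that were changed at least once are cancelled less often. Allowing guests to modify bookings could prevent cancellations caused by, for example, mistakes in the booking details.
2. Neither hotel currently appears to use a deposit mechanism. For high-quality rooms, and for agencies or individuals with high cancellation rates (possibly identified through credit scoring), a deposit or a cancellation fee could be introduced where appropriate.
3. Guests who have booked before are very likely to complete their stay, so improving retention is worthwhile, for example through membership or loyalty-point programs.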

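User profile attributes:

- Couples are the main customer group. The hotels could offer rooms and amenities suited to two guests, especially during the April-July high season, to attract them to stay.
- Cancellations are concentrated in certain travel agencies. A credit-scoring mechanism is recommended: agencies with high scores receive extra benefits, while agencies or individuals with extremely high cancellation rates could even be blacklisted.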

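Hotel price attributes:

- For both the city hotel and the resort hotel, the higher the price, the more likely a booking is to be cancelled than completed. According to the data, bookings priced below 100 euros are relatively unlikely to be cancelled.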

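6. Cancellation prediction

This section builds a model to predict whether a booking will be cancelled. After selecting and encoding features and trying several models, a first exploration suggests that random forest performs best on this dataset.

6.1 Feature selection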

# Drop columns that are not used for modeling
df_predict = df.copy()
df_predict.drop(['adults', 'children', 'babies', 'meal', 'country', 'distribution_channel',
                 'reserved_room_type', 'assigned_room_type', 'arrival_date_day_of_month',
                 'reservation_status', 'reservation_status_date'], axis=1, inplace=True)

# Collect the categorical features
cateCols = [c for c in df_predict.columns
            if df_predict[c].dtype == 'object' or c in ('is_canceled', 'is_repeated_guest')]
dfCate = df_predict[cateCols].copy()
dfCate.columns

# Map month names to month numbers (1-12)
month_numbers = {name: number for number, name in enumerate(ordered_months, start=1)}
dfCate['arrival_date_month'] = df_predict['arrival_date_month'].map(month_numbers).astype('int')
cateCols.remove('arrival_date_month')

6.2 Feature encoding

# Encode the features
for col in cateCols:
    if col == 'is_canceled':
        continue  # keep the target as 0/1 so that 1 still means "cancelled"
    if dfCate[col].nunique() == 2:
        # nunique() returns the number of distinct values (unique() returns the values).
        # Two distinct values: map them to integer codes with factorize()
        dfCate[col] = pd.factorize(dfCate[col])[0]
    else:
        # More than two values: one-hot encode (dummy variables)
        dfCate = pd.get_dummies(dfCate, columns=[col])

# agent is an identifier rather than a true numeric quantity, so map it to integer codes.
# It is read from df_predict because dfCate does not contain it yet.
dfCate['agent'] = pd.factorize(df_predict['agent'])[0]

# Add the remaining numeric features (agent is excluded so its codes are not overwritten)
col = ['lead_time', 'arrival_date_year', 'arrival_date_week_number', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'previous_cancellations', 'previous_bookings_not_canceled',
       'booking_changes', 'days_in_waiting_list', 'adr', 'required_car_parking_spaces',
       'total_of_special_requests']
for i in col:
    dfCate[i] = df_predict[i]

# Correlation of each feature with cancellation
%matplotlib inline
plt.figure(figsize=(16, 8))
dfCate.corr()['is_canceled'].sort_values(ascending=False).plot(kind='bar')
plt.show()
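The two encodings used here behave differently, and a tiny plain-Python imitation makes that visible. The functions `factorize` and `one_hot` below are simplified stand-ins written for illustration, not the pandas implementations; the sample values are taken from column values that appear in this dataset.

```python
def factorize(values):
    """Integer codes assigned in order of first appearance, plus the unique values."""
    codes, uniques, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(uniques)
            uniques.append(v)
        codes.append(index[v])
    return codes, uniques

def one_hot(values, prefix):
    """One 0/1 column per category, with categories in sorted order."""
    categories = sorted(set(values))
    return {f"{prefix}_{c}": [1 if v == c else 0 for v in values] for c in categories}

hotel_codes, hotel_uniques = factorize(["Resort Hotel", "City Hotel", "Resort Hotel"])
print(hotel_codes, hotel_uniques)

deposit = one_hot(["No Deposit", "Non Refund", "No Deposit", "Refundable"], "deposit_type")
for column, flags in deposit.items():
    print(column, flags)
```

Because the integer codes depend on which value is seen first, the same column can receive different codes on differently ordered data. That is harmless for a two-valued feature but worth keeping in mind for any column whose 0 and 1 carry a fixed meaning. One-hot encoding avoids implying an order among categories at the cost of one column per category.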

6.3 Model building

# Select the target and the feature columns
target = dfCate['is_canceled'].values
columns = dfCate.columns.to_list()

# Import the scikit-learn pieces
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC as SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Build the feature matrix from every column except the target
columns.remove('is_canceled')
features = dfCate[columns].values

# Hold out 30% as the test set and train on the rest.
# random_state=1 makes the split reproducible.
# stratify=target keeps the class proportions equal in both sets.
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=0.30, stratify=target, random_state=1)

# Feature scaling: standardize to zero mean and unit variance
scaler = StandardScaler()
x = scaler.fit_transform(train_x)
y = train_y

# Candidate models
MLA = [
    LogisticRegression(n_jobs=-1, random_state=42),
    SVC(random_state=42),
    DecisionTreeClassifier(random_state=1, criterion='gini'),
    RandomForestClassifier(random_state=1, criterion='gini'),
    KNeighborsClassifier(n_jobs=-1),
]

# Compare the models with 3-fold cross-validation
rows = []
for alg in MLA:
    cv_results = cross_validate(alg, x, y, scoring=('accuracy', 'f1'), cv=3, return_train_score=True)
    rows.append({
        'MLA Name': alg.__class__.__name__,
        'Train Accuracy Mean': round(cv_results['train_accuracy'].mean() * 100, 3),
        'Test Accuracy Mean': round(cv_results['test_accuracy'].mean() * 100, 3),
        'Test F1 Score': round(cv_results['test_f1'].mean() * 100, 3),
        'Time': cv_results['fit_time'].mean(),
    })
MLA_compare = pd.DataFrame(rows)
MLA_compare

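From the comparison above, random forest performs best, although it appears to overfit somewhat, so later tuning moves toward lower model complexity. Random forests do not require feature scaling, so the unscaled training features could be used from here on; the code below keeps using x and y as defined above, which should not change the results materially because tree splits are insensitive to monotonic rescaling of a feature.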

# First look at model performance with n_estimators set to 25
from sklearn.model_selection import cross_val_score
rfc = RandomForestClassifier(n_estimators=25)
rfc_s = cross_val_score(rfc, x, y, cv=10)
rfc_s

out:
array([0.81461323, 0.81065479, 0.81296388, 0.8045522 , 0.81246907,
       0.8116444 , 0.81131453, 0.81560284, 0.8078509 , 0.8091389 ])

The cross-validation scores hover around 81%. Next we tune the model's hyperparameters.
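The ten fold scores reported above can be summarized with the standard library to see how tightly they cluster. The scores are copied from the output above; the variable names are only illustrative.

```python
from statistics import mean, stdev

# Fold scores copied from the 10-fold cross-validation output above
scores = [0.81461323, 0.81065479, 0.81296388, 0.8045522, 0.81246907,
          0.8116444, 0.81131453, 0.81560284, 0.8078509, 0.8091389]

summary = {
    "mean": mean(scores),
    "std": stdev(scores),  # sample standard deviation across folds
    "min": min(scores),
    "max": max(scores),
}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The spread between the worst and best fold is about one percentage point, so differences of a few tenths of a point between hyperparameter settings in the following steps are within fold-to-fold variation and should be read with caution.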

# Learning curve for n_estimators (1, 11, ..., 91)
superpa = []
from sklearn.model_selection import cross_val_score
for i in range(0, 100, 10):
    rfc = RandomForestClassifier(n_estimators=i + 1, n_jobs=-1)
    rfc_s = cross_val_score(rfc, x, y, cv=3).mean()
    superpa.append(rfc_s)
print(max(superpa), superpa.index(max(superpa)))

out:
0.8135512144023972 9

plt.figure(figsize=[20, 5])
plt.plot(range(1, 101, 10), superpa)
plt.show()
# Adding trees beyond a few dozen brings only a very small gain;
# the 20-40 range is examined more closely next

# Refine n_estimators over 21-40 and pick the best value
superpa = []
from sklearn.model_selection import cross_val_score
for i in range(20, 40, 1):
    rfc = RandomForestClassifier(n_estimators=i + 1, n_jobs=-1)
    rfc_s = cross_val_score(rfc, x, y, cv=3).mean()
    superpa.append(rfc_s)
plt.figure(figsize=[20, 5])
# The x values are the n_estimators actually used (i + 1)
plt.plot(range(21, 41, 1), superpa)
plt.show()

The score peaks at 81.3% with n_estimators=37.

# Learning curve for the maximum tree depth max_depth (1-30)
superpa = []
for i in range(0, 30, 1):
    rfc = RandomForestClassifier(n_estimators=37, max_depth=i + 1, n_jobs=-1)
    rfc_s = cross_val_score(rfc, x, y, cv=3).mean()
    superpa.append(rfc_s)
print(max(superpa), superpa.index(max(superpa)))
plt.figure(figsize=[20, 5])
# The x values are the max_depth actually used (i + 1)
plt.plot(range(1, 31, 1), superpa)
plt.show()

The best score is reached at a maximum depth of 20, which improves accuracy by roughly 0.6%.

# Grid search over min_samples_leaf and min_samples_split
param_grid = {'min_samples_leaf': np.arange(1, 1 + 10, 1),
              'min_samples_split': np.arange(2, 2 + 20, 1)}
rfc = RandomForestClassifier(n_estimators=37, max_depth=20, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=3)
GS.fit(x, y)
GS.best_params_
GS.best_score_

out:
0.8186251367199141
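It helps to know how much work this grid implies before running it. The short sketch below enumerates the same two parameter ranges with the standard library and counts the cross-validation fits; the variable names are only illustrative.

```python
from itertools import product

# Same ranges as the grid above: np.arange(1, 11) and np.arange(2, 22)
min_samples_leaf = range(1, 11)    # 1..10
min_samples_split = range(2, 22)   # 2..21
cv_folds = 3

combinations = list(product(min_samples_leaf, min_samples_split))
cv_fits = len(combinations) * cv_folds

print(len(combinations), "parameter combinations")
print(cv_fits, "cross-validation fits")
```

That is 200 combinations and 600 cross-validation fits of a 37-tree forest, plus one final refit on all of the training data, which GridSearchCV performs by default. With a gain this small, a coarser grid or a randomized search would likely reach a similar score at a fraction of the cost.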

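The improvement is very limited, which suggests the model is already close to its best for this feature set. The earlier feature selection was fairly crude and discarded a good deal of information; since the aim here was mainly to walk through the modeling process, there is still room to improve feature selection and encoding if higher accuracy is the goal. Gradient-boosting learners could also be considered to improve predictive performance.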
