Kaggle house price prediction, score = 0.12986


    As a machine learning beginner, I had only practiced on the Titanic dataset before. When I first met the Kaggle House Prices dataset (81 columns), I was a bit lost, mainly because I didn't know where to start processing the data. After consulting some references I managed to get a full pipeline running, and this post records it. Experts can safely skip it.

    1. Load the dataset

    The dataset can be downloaded from the Kaggle website.
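
    If the official kaggle package is installed and an API token is configured, the files can also be fetched programmatically. This is a sketch, not part of the original post; it assumes the competition slug house-prices-advanced-regression-techniques and that the download lands as a zip named after the slug:

    # Sketch: download the competition data with the official kaggle package.
    # Assumes `pip install kaggle` and an API token in ~/.kaggle/kaggle.json.
    import zipfile
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.competition_download_files("house-prices-advanced-regression-techniques",
                                   path="./datasets/housing_price")
    with zipfile.ZipFile("./datasets/housing_price/"
                         "house-prices-advanced-regression-techniques.zip") as zf:
        zf.extractall("./datasets/housing_price")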

    %matplotlib inline
    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt

    df_train = pd.read_csv("./datasets/housing_price/train.csv")
    df_test = pd.read_csv("./datasets/housing_price/test.csv")
    df_train.info()

    The dtype of each column, together with its non-null count, is summarized below:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1460 entries, 0 to 1459
    Data columns (total 81 columns):
    Id               1460 non-null int64
    MSSubClass       1460 non-null int64
    MSZoning         1460 non-null object
    LotFrontage      1201 non-null float64
    LotArea          1460 non-null int64
    Street           1460 non-null object
    Alley            91 non-null object
    LotShape         1460 non-null object
    LandContour      1460 non-null object
    Utilities        1460 non-null object
    LotConfig        1460 non-null object
    LandSlope        1460 non-null object
    Neighborhood     1460 non-null object
    Condition1       1460 non-null object
    Condition2       1460 non-null object
    BldgType         1460 non-null object
    HouseStyle       1460 non-null object
    OverallQual      1460 non-null int64
    OverallCond      1460 non-null int64
    YearBuilt        1460 non-null int64
    YearRemodAdd     1460 non-null int64
    RoofStyle        1460 non-null object
    RoofMatl         1460 non-null object
    Exterior1st      1460 non-null object
    Exterior2nd      1460 non-null object
    MasVnrType       1452 non-null object
    MasVnrArea       1452 non-null float64
    ExterQual        1460 non-null object
    ExterCond        1460 non-null object
    Foundation       1460 non-null object
    BsmtQual         1423 non-null object
    BsmtCond         1423 non-null object
    BsmtExposure     1422 non-null object
    BsmtFinType1     1423 non-null object
    BsmtFinSF1       1460 non-null int64
    BsmtFinType2     1422 non-null object
    BsmtFinSF2       1460 non-null int64
    BsmtUnfSF        1460 non-null int64
    TotalBsmtSF      1460 non-null int64
    Heating          1460 non-null object
    HeatingQC        1460 non-null object
    CentralAir       1460 non-null object
    Electrical       1459 non-null object
    1stFlrSF         1460 non-null int64
    2ndFlrSF         1460 non-null int64
    LowQualFinSF     1460 non-null int64
    GrLivArea        1460 non-null int64
    BsmtFullBath     1460 non-null int64
    BsmtHalfBath     1460 non-null int64
    FullBath         1460 non-null int64
    HalfBath         1460 non-null int64
    BedroomAbvGr     1460 non-null int64
    KitchenAbvGr     1460 non-null int64
    KitchenQual      1460 non-null object
    TotRmsAbvGrd     1460 non-null int64
    Functional       1460 non-null object
    Fireplaces       1460 non-null int64
    FireplaceQu      770 non-null object
    GarageType       1379 non-null object
    GarageYrBlt      1379 non-null float64
    GarageFinish     1379 non-null object
    GarageCars       1460 non-null int64
    GarageArea       1460 non-null int64
    GarageQual       1379 non-null object
    GarageCond       1379 non-null object
    PavedDrive       1460 non-null object
    WoodDeckSF       1460 non-null int64
    OpenPorchSF      1460 non-null int64
    EnclosedPorch    1460 non-null int64
    3SsnPorch        1460 non-null int64
    ScreenPorch      1460 non-null int64
    PoolArea         1460 non-null int64
    PoolQC           7 non-null object
    Fence            281 non-null object
    MiscFeature      54 non-null object
    MiscVal          1460 non-null int64
    MoSold           1460 non-null int64
    YrSold           1460 non-null int64
    SaleType         1460 non-null object
    SaleCondition    1460 non-null object
    SalePrice        1460 non-null int64
    dtypes: float64(3), int64(35), object(43)
    memory usage: 924.0+ KB

    2. Count the missing values per feature

    missing = df_train.isnull().sum()
    fig, ax = plt.subplots(1, 2, figsize=(10, 6))
    # print(ax.shape)
    # Left panel: count of missing values per feature
    ax[0].set_ylabel("missing count")
    missing[missing > 0].sort_values().plot.bar(ax=ax[0])
    # Right panel: fraction of missing values per feature
    ax[1].set_ylabel("missing percent")
    missing_percent = missing[missing > 0].sort_values() / len(df_train)
    missing_percent.plot.bar(ax=ax[1])

    As the bar charts show, a few features (Alley, PoolQC, Fence, and MiscFeature) have very high missing ratios, above 50%.

    3. Drop the features that are more than 50% missing

    missing_percent_gt_50 = missing_percent[missing_percent > 0.5].index.values
    df_train_drop_missing = df_train.drop(missing_percent_gt_50, axis=1)
    df_train_drop_missing.drop(["Id"], axis=1, inplace=True)
    df_train_drop_missing.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1460 entries, 0 to 1459
    Data columns (total 76 columns):
    MSSubClass       1460 non-null int64
    MSZoning         1460 non-null object
    LotFrontage      1201 non-null float64
    LotArea          1460 non-null int64
    Street           1460 non-null object
    LotShape         1460 non-null object
    LandContour      1460 non-null object
    Utilities        1460 non-null object
    LotConfig        1460 non-null object
    LandSlope        1460 non-null object
    Neighborhood     1460 non-null object
    Condition1       1460 non-null object
    Condition2       1460 non-null object
    BldgType         1460 non-null object
    HouseStyle       1460 non-null object
    OverallQual      1460 non-null int64
    OverallCond      1460 non-null int64
    YearBuilt        1460 non-null int64
    YearRemodAdd     1460 non-null int64
    RoofStyle        1460 non-null object
    RoofMatl         1460 non-null object
    Exterior1st      1460 non-null object
    Exterior2nd      1460 non-null object
    MasVnrType       1452 non-null object
    MasVnrArea       1452 non-null float64
    ExterQual        1460 non-null object
    ExterCond        1460 non-null object
    Foundation       1460 non-null object
    BsmtQual         1423 non-null object
    BsmtCond         1423 non-null object
    BsmtExposure     1422 non-null object
    BsmtFinType1     1423 non-null object
    BsmtFinSF1       1460 non-null int64
    BsmtFinType2     1422 non-null object
    BsmtFinSF2       1460 non-null int64
    BsmtUnfSF        1460 non-null int64
    TotalBsmtSF      1460 non-null int64
    Heating          1460 non-null object
    HeatingQC        1460 non-null object
    CentralAir       1460 non-null object
    Electrical       1459 non-null object
    1stFlrSF         1460 non-null int64
    2ndFlrSF         1460 non-null int64
    LowQualFinSF     1460 non-null int64
    GrLivArea        1460 non-null int64
    BsmtFullBath     1460 non-null int64
    BsmtHalfBath     1460 non-null int64
    FullBath         1460 non-null int64
    HalfBath         1460 non-null int64
    BedroomAbvGr     1460 non-null int64
    KitchenAbvGr     1460 non-null int64
    KitchenQual      1460 non-null object
    TotRmsAbvGrd     1460 non-null int64
    Functional       1460 non-null object
    Fireplaces       1460 non-null int64
    FireplaceQu      770 non-null object
    GarageType       1379 non-null object
    GarageYrBlt      1379 non-null float64
    GarageFinish     1379 non-null object
    GarageCars       1460 non-null int64
    GarageArea       1460 non-null int64
    GarageQual       1379 non-null object
    GarageCond       1379 non-null object
    PavedDrive       1460 non-null object
    WoodDeckSF       1460 non-null int64
    OpenPorchSF      1460 non-null int64
    EnclosedPorch    1460 non-null int64
    3SsnPorch        1460 non-null int64
    ScreenPorch      1460 non-null int64
    PoolArea         1460 non-null int64
    MiscVal          1460 non-null int64
    MoSold           1460 non-null int64
    YrSold           1460 non-null int64
    SaleType         1460 non-null object
    SaleCondition    1460 non-null object
    SalePrice        1460 non-null int64
    dtypes: float64(3), int64(34), object(39)
    memory usage: 867.0+ KB

    4. Split numeric and categorical features

    Numeric and categorical features need different processing, so they are separated here. The split simply checks whether a column's dtype is object. One caveat: in the info() summary above, MSSubClass has dtype int, but it is actually a categorical feature.

    num_features = [feature for feature in df_train_drop_missing.columns
                    if df_train_drop_missing.dtypes[feature] != "object"]
    cate_features = [feature for feature in df_train_drop_missing.columns
                     if df_train_drop_missing.dtypes[feature] == "object"]
    num_features.remove("MSSubClass")
    cate_features.append("MSSubClass")
    df_train_num = df_train_drop_missing[num_features].copy()
    df_train_cate = df_train_drop_missing[cate_features].copy()
    # Convert MSSubClass from int to str so that it is treated as categorical
    df_train_cate["MSSubClass"] = df_train_cate["MSSubClass"].apply(lambda x: str(x))

    5. Examine the relationship between each numeric feature and SalePrice

    for feature in df_train_num.columns[:-1]:
        df_train_num.plot.scatter(x=feature, y="SalePrice")

    The full set of plots is long, so only the LotFrontage and OverallQual vs. SalePrice plots are shown here:

    The first plot shows a roughly linear relationship between LotFrontage and SalePrice, and LotFrontage appears to have two outliers at very large values. OverallQual also has a clear linear relationship with SalePrice, and since it takes only integer values from 1 to 10 it could also be treated as a categorical feature.
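
    The outliers are not removed in this walkthrough. If you wanted to inspect or drop them, a minimal sketch (the threshold of 250 is a guess read off the scatter plot, not a value from the original analysis):

    # Hypothetical: inspect suspected LotFrontage outliers; the threshold is eyeballed
    outliers = df_train_num[df_train_num["LotFrontage"] > 250]
    print(outliers[["LotFrontage", "SalePrice"]])
    # To drop them: df_train_num = df_train_num.drop(outliers.index)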

    6. Examine correlations among the numeric features

    import seaborn as sns

    num_corr = df_train_num.corr()
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(num_corr, square=True, ax=ax)
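
    The heatmap gives the overall picture; a numeric ranking of which features correlate most strongly with the target can be read straight out of the same matrix. This step is not in the original post, just a convenience:

    # Features most correlated (positively or negatively) with SalePrice, strongest first
    print(num_corr["SalePrice"].abs().sort_values(ascending=False).head(10))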

    7. Transform the target variable so that its distribution is closer to normal

    First, look at the distribution of SalePrice:

    df_train_drop_missing["SalePrice"].hist(bins=30)

    The histogram has a long right tail (a heavy-tailed distribution), so a log transform can bring the distribution closer to normal. Because log and exp are inverses of each other, transforming the target values does not affect what the model can ultimately predict: train on the log-transformed target, then apply the exponential to the model's predictions to get real prices back.
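
    One way to quantify the "long tail" rather than eyeballing histograms (an addition, not from the original post) is to compare the sample skewness before and after the transform; values near 0 indicate a roughly symmetric distribution:

    print("skew before:", df_train_drop_missing["SalePrice"].skew())
    print("skew after :", df_train_drop_missing["SalePrice"].apply(np.log1p).skew())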

    df_train_drop_missing["SalePrice"].apply(lambda x: np.log1p(x)).hist(bins=30)

    Let's verify that the log and exp transforms really are inverses: applying log first and then exp should leave the values unchanged.

    df_train_drop_missing["SalePrice"].apply(lambda x: np.log1p(x)).apply(lambda x: np.expm1(x)).hist(bins=30)

    The histogram matches the original, as expected.
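
    A stricter check than comparing histograms (again an addition, not from the original) is a direct numeric comparison:

    # expm1 undoes log1p, so this prints True up to floating-point tolerance
    sale_price = df_train_drop_missing["SalePrice"]
    print(np.allclose(np.expm1(np.log1p(sale_price)), sale_price))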

    8. Process the categorical features

    df_train_cate.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1460 entries, 0 to 1459
    Data columns (total 40 columns):
    MSZoning         1460 non-null object
    Street           1460 non-null object
    LotShape         1460 non-null object
    LandContour      1460 non-null object
    Utilities        1460 non-null object
    LotConfig        1460 non-null object
    LandSlope        1460 non-null object
    Neighborhood     1460 non-null object
    Condition1       1460 non-null object
    Condition2       1460 non-null object
    BldgType         1460 non-null object
    HouseStyle       1460 non-null object
    RoofStyle        1460 non-null object
    RoofMatl         1460 non-null object
    Exterior1st      1460 non-null object
    Exterior2nd      1460 non-null object
    MasVnrType       1452 non-null object
    ExterQual        1460 non-null object
    ExterCond        1460 non-null object
    Foundation       1460 non-null object
    BsmtQual         1423 non-null object
    BsmtCond         1423 non-null object
    BsmtExposure     1422 non-null object
    BsmtFinType1     1423 non-null object
    BsmtFinType2     1422 non-null object
    Heating          1460 non-null object
    HeatingQC        1460 non-null object
    CentralAir       1460 non-null object
    Electrical       1459 non-null object
    KitchenQual      1460 non-null object
    Functional       1460 non-null object
    FireplaceQu      770 non-null object
    GarageType       1379 non-null object
    GarageFinish     1379 non-null object
    GarageQual       1379 non-null object
    GarageCond       1379 non-null object
    PavedDrive       1460 non-null object
    SaleType         1460 non-null object
    SaleCondition    1460 non-null object
    MSSubClass       1460 non-null object
    dtypes: object(40)
    memory usage: 456.3+ KB

    Plot bar charts of the category counts for each categorical feature:

    import warnings
    warnings.filterwarnings("ignore")

    for cate_col in df_train_cate.columns.values:
        fig = plt.figure(figsize=(3, 3))
        df_train_cate[cate_col].value_counts().plot.bar(legend=True)

    The resulting figure is long, so it is omitted here.

    9. Find the categorical features that contain missing values

    df_train_cate_missing_sum = df_train_cate.isnull().sum()
    cate_missing_features = df_train_cate_missing_sum[df_train_cate_missing_sum > 0].index.values
    cate_missing_features

    array(['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
           'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu',
           'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'],
          dtype=object)

    These twelve features contain missing values.

    10. Fill the missing values; here we simply use each feature's most frequent category

    # Most frequent value of each categorical column, as a Series indexed by column name
    fill_missing_value = df_train_cate.mode().loc[0]
    for feature in cate_missing_features:
        print(feature, fill_missing_value[feature])
        df_train_cate[feature].fillna(fill_missing_value[feature], inplace=True)
    df_train_cate.info()

    The result:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1460 entries, 0 to 1459
    Data columns (total 40 columns):
    MSZoning         1460 non-null object
    Street           1460 non-null object
    LotShape         1460 non-null object
    LandContour      1460 non-null object
    Utilities        1460 non-null object
    LotConfig        1460 non-null object
    LandSlope        1460 non-null object
    Neighborhood     1460 non-null object
    Condition1       1460 non-null object
    Condition2       1460 non-null object
    BldgType         1460 non-null object
    HouseStyle       1460 non-null object
    RoofStyle        1460 non-null object
    RoofMatl         1460 non-null object
    Exterior1st      1460 non-null object
    Exterior2nd      1460 non-null object
    MasVnrType       1460 non-null object
    ExterQual        1460 non-null object
    ExterCond        1460 non-null object
    Foundation       1460 non-null object
    BsmtQual         1460 non-null object
    BsmtCond         1460 non-null object
    BsmtExposure     1460 non-null object
    BsmtFinType1     1460 non-null object
    BsmtFinType2     1460 non-null object
    Heating          1460 non-null object
    HeatingQC        1460 non-null object
    CentralAir       1460 non-null object
    Electrical       1460 non-null object
    KitchenQual      1460 non-null object
    Functional       1460 non-null object
    FireplaceQu      1460 non-null object
    GarageType       1460 non-null object
    GarageFinish     1460 non-null object
    GarageQual       1460 non-null object
    GarageCond       1460 non-null object
    PavedDrive       1460 non-null object
    SaleType         1460 non-null object
    SaleCondition    1460 non-null object
    MSSubClass       1460 non-null object
    dtypes: object(40)
    memory usage: 456.3+ KB

    All the missing values have been filled in.

    11. One-hot encode the categorical features; some models can only handle numeric inputs, so the conversion is necessary

    df_train_cate_dummies = pd.get_dummies(df_train_cate)
    df_train_cate_dummies.head()

    12. Fill the missing values in the numeric features; here we simply use each feature's mean

    # Column means of the numeric features
    num_feature_mean = df_train_num.mean(axis=0)
    # Fill each column's missing values with that column's mean;
    # passing a Series to fillna matches values to columns by name
    df_train_num = df_train_num.fillna(num_feature_mean)
    df_train_num.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1460 entries, 0 to 1459
    Data columns (total 36 columns):
    LotFrontage      1460 non-null float64
    LotArea          1460 non-null int64
    OverallQual      1460 non-null int64
    OverallCond      1460 non-null int64
    YearBuilt        1460 non-null int64
    YearRemodAdd     1460 non-null int64
    MasVnrArea       1460 non-null float64
    BsmtFinSF1       1460 non-null int64
    BsmtFinSF2       1460 non-null int64
    BsmtUnfSF        1460 non-null int64
    TotalBsmtSF      1460 non-null int64
    1stFlrSF         1460 non-null int64
    2ndFlrSF         1460 non-null int64
    LowQualFinSF     1460 non-null int64
    GrLivArea        1460 non-null int64
    BsmtFullBath     1460 non-null int64
    BsmtHalfBath     1460 non-null int64
    FullBath         1460 non-null int64
    HalfBath         1460 non-null int64
    BedroomAbvGr     1460 non-null int64
    KitchenAbvGr     1460 non-null int64
    TotRmsAbvGrd     1460 non-null int64
    Fireplaces       1460 non-null int64
    GarageYrBlt      1460 non-null float64
    GarageCars       1460 non-null int64
    GarageArea       1460 non-null int64
    WoodDeckSF       1460 non-null int64
    OpenPorchSF      1460 non-null int64
    EnclosedPorch    1460 non-null int64
    3SsnPorch        1460 non-null int64
    ScreenPorch      1460 non-null int64
    PoolArea         1460 non-null int64
    MiscVal          1460 non-null int64
    MoSold           1460 non-null int64
    YrSold           1460 non-null int64
    SalePrice        1460 non-null int64
    dtypes: float64(3), int64(33)
    memory usage: 410.7 KB

    The numeric missing values are filled in too. Now look at the value ranges of the numeric features:

    df_train_num.describe()

    The full table spans all 36 columns; an excerpt gives the flavor:

                 LotArea  OverallQual    GrLivArea      SalePrice
    count    1460.000000  1460.000000  1460.000000    1460.000000
    mean    10516.828082     6.099315  1515.463699  180921.195890
    std      9981.264932     1.382997   525.480383   79442.502883
    min      1300.000000     1.000000   334.000000   34900.000000
    25%      7553.500000     5.000000  1129.500000  129975.000000
    50%      9478.500000     6.000000  1464.000000  163000.000000
    75%     11601.500000     7.000000  1776.750000  214000.000000
    max    215245.000000    10.000000  5642.000000  755000.000000

    The value ranges are all over the place, from single digits to the hundreds of thousands, so let's deal with that.

    13. Standardize the numeric features, so that all features cover a similar numeric range and therefore have a comparable influence on the model

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    df_train_num_scaled = scaler.fit_transform(df_train_num.values[:, :-1])
    df_train_num_scaled = pd.DataFrame(df_train_num_scaled, columns=df_train_num.columns[:-1])
    df_train_num_scaled.describe()

    (The describe() output is omitted here: after standardization, every one of the 35 feature columns has mean ≈ 0, up to floating-point error, and standard deviation ≈ 1.)

    After standardization the feature ranges are consistent: each column has mean 0 and standard deviation 1.

    14. Assemble the training set

    df_train_y = df_train_num["SalePrice"].apply(lambda x: np.log1p(x))
    df_train_y_matrix = df_train_y.values
    df_train_x_matrix = pd.concat([df_train_num_scaled, df_train_cate_dummies], axis=1).values
    train_X = df_train_x_matrix.copy()
    train_y = df_train_y_matrix.copy()
    print(train_X.shape, train_y.shape)

    (1460, 301) (1460,)

    With the training set ready, is it finally time for the main event? Not so fast. The original data had only 81 columns including SalePrice, but after all the processing above the training set has 301 features. What are they all? Are they all useful, or will some of them hurt the final model? And won't this many features make training much more expensive?

    15. Feature selection, using a Lasso model (linear regression with L1 regularization); for feature-selection utilities in sklearn, see "Introduction to common feature selection methods in scikit-learn"

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV

    lasso = LassoCV()
    lasso.fit(train_X, train_y)
    model = SelectFromModel(lasso, prefit=True)
    train_X_new = model.transform(train_X)
    print(train_X_new.shape)
    print("selected-feature mask:", model.get_support())

    (1460, 108)
    selected-feature mask: [False  True  True  True  True  True  True  True  True False ...
     ...  True False False False False False]
    (301 booleans in total, one per feature; True marks a feature that is kept.)

    The features whose mask entry is True are the ones that were selected. After feature selection, 108 features remain.
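
    To see which columns survived, the boolean mask can be mapped back onto the column names of the concatenated feature DataFrame (a convenience step, not in the original post):

    feature_names = pd.concat([df_train_num_scaled, df_train_cate_dummies], axis=1).columns
    selected_features = feature_names[model.get_support()]
    print(len(selected_features))
    print(list(selected_features[:10]))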

    16. Now for the main event: model training

    1) First, try a RandomForest model:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    rfr_params = {
        "n_estimators": [100, 200, 300],
        # "criterion": ["gini"],
        "max_depth": [2, 3, 4],
        "min_samples_split": [2, 3, 5],
        "min_samples_leaf": [2, 3, 5],
        "max_features": ["sqrt", "log2", "auto"]
    }
    rfr = RandomForestRegressor(n_jobs=-1)
    rfr_grid_search = GridSearchCV(rfr, rfr_params, cv=5, scoring="neg_mean_squared_error")
    rfr_grid_search.fit(train_X_new, train_y)
    best_params = rfr_grid_search.best_params_
    best_scores = rfr_grid_search.best_score_

    Check the cross-validated performance on the training set with the best RandomForest hyperparameters found by grid search:

    from sklearn.model_selection import cross_val_score

    best_rfr = RandomForestRegressor(**best_params)
    scores = cross_val_score(best_rfr, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error")
    scores_rmse = np.sqrt(-scores)
    print("score_rmse: {}, {} +/- {}".format(scores_rmse, np.mean(scores_rmse), np.std(scores_rmse)))

    score_rmse: [0.16991974 0.18253832 0.17872277 0.16521411 0.17719227], 0.17471744250579152 +/- 0.006271601381559145

    Plot the RandomForest error learning curve (some of the parameters were set by hand here):

    from sklearn.model_selection import learning_curve

    best_params = {
        'max_depth': 4,
        'max_features': 'auto',
        'min_samples_leaf': 10,
        'min_samples_split': 10,
        'n_estimators': 200
    }
    best_rfr = RandomForestRegressor(**best_params)
    train_sizes, train_neg_mse, test_neg_mse = learning_curve(
        best_rfr, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error",
        train_sizes=np.linspace(0.1, 1.0, 20))
    train_rmse = np.mean(np.sqrt(-train_neg_mse), axis=1)
    test_rmse = np.mean(np.sqrt(-test_neg_mse), axis=1)
    plt.plot(train_sizes, train_rmse, color="r", marker="o", label="train error")
    plt.plot(train_sizes, test_rmse, color="b", marker="^", label="test error")
    plt.xlabel("Train Size")
    plt.ylabel("rmse")
    plt.legend(loc="upper right")

    Judging from the error curves, the model is basically not overfitting, and the result is acceptable.

    2) Train a GBDT model

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV

    gbdt_params = {
        "n_estimators": [50, 100],
        "max_depth": [3, 5, 7],
        "min_samples_split": [3, 5, 7],
        "min_samples_leaf": [3, 5, 7],
        "max_features": ["sqrt", "log2"],
        # "loss": ["deviance", "exponential"],
        "learning_rate": [0.1, 0.5, 0.05, 0.01, 0.005],
        "subsample": [0.5, 0.7]
    }
    gbdt = GradientBoostingRegressor()
    gbdt_grid_search = GridSearchCV(gbdt, gbdt_params, cv=5, scoring="neg_mean_squared_error")
    gbdt_grid_search.fit(train_X_new, train_y)
    gbdt_best_params = gbdt_grid_search.best_params_
    gbdt_best_scores = gbdt_grid_search.best_score_
    gbdt_best_params, gbdt_best_scores

    Check GBDT's cross-validated performance on the training set:

    from sklearn.model_selection import cross_val_score

    best_gbdt = GradientBoostingRegressor(**gbdt_best_params)
    scores = cross_val_score(best_gbdt, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error")
    scores_rmse = np.sqrt(-scores)
    print("score_rmse: {}, {} +/- {}".format(scores_rmse, np.mean(scores_rmse), np.std(scores_rmse)))

    score_rmse: [0.11242787 0.13998242 0.13887749 0.11183915 0.13257266], 0.1271399183125363 +/- 0.012512177359738105

    RMSE = 0.127, a solid improvement over RandomForest. Since the target is log-transformed, this RMSE is roughly comparable to the Kaggle leaderboard score, which is the RMSE of the log sale prices.

    Plot the error learning curve.

    The error curves drawn with the grid-search best hyperparameters showed overfitting, so some of the parameters were adjusted by hand:

    from sklearn.model_selection import learning_curve

    best_params = {
        'learning_rate': 0.03,
        'max_depth': 5,
        'max_features': 'sqrt',
        'min_samples_leaf': 7,
        'min_samples_split': 7,
        'n_estimators': 100,
        'subsample': 0.7
    }
    best_gbdt = GradientBoostingRegressor(**best_params)
    train_sizes, train_neg_mse, test_neg_mse = learning_curve(
        best_gbdt, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error",
        train_sizes=np.linspace(0.1, 1.0, 20))
    train_rmse = np.mean(np.sqrt(-train_neg_mse), axis=1)
    test_rmse = np.mean(np.sqrt(-test_neg_mse), axis=1)
    plt.plot(train_sizes, train_rmse, color="r", marker="o", label="train error")
    plt.plot(train_sizes, test_rmse, color="b", marker="^", label="test error")
    plt.xlabel("Train Size")
    plt.ylabel("rmse")
    plt.legend(loc="upper right")

    The curves still look slightly overfit, but the result is acceptable.

    17. With the models trained, it's time to predict on the test set. Let's go!

    1) First, fill the missing values in the test data the same way as in the training data.

    # Drop the features with too many missing values, as was done for the training set
    df_test_drop_missing = df_test.drop(missing_percent_gt_50, axis=1)
    df_test_drop_missing.drop(["Id"], axis=1, inplace=True)
    # df_test_drop_missing.info()
    df_test_num = df_test_drop_missing[num_features[:-1]].copy()  # [:-1] excludes SalePrice
    df_test_cate = df_test_drop_missing[cate_features].copy()
    df_test_cate["MSSubClass"] = df_test_cate["MSSubClass"].apply(lambda x: str(x))
    # Fill categorical missing values with the modes computed on the training set
    for feature in df_test_cate.columns.values:
        print(feature, fill_missing_value[feature])
        df_test_cate[feature].fillna(fill_missing_value[feature], inplace=True)
    # Fill numeric missing values with the column means computed on the training set
    df_test_num = df_test_num.fillna(num_feature_mean)

    2) Then one-hot encode the categorical features of the test data

    df_test_cate_dummies = pd.get_dummies(df_test_cate)
    df_test_cate_dummies.info()

    3) Standardize the test data's numeric features with the same scaler fitted on the training data

    df_test_num_scaled = scaler.transform(df_test_num.values)
    df_test_num_scaled = pd.DataFrame(df_test_num_scaled, columns=df_test_num.columns)
    df_test_num_scaled.info()

    4) Assemble the numeric and categorical features into the test set

    df_test_x_matrix = pd.concat([df_test_num_scaled, df_test_cate_dummies], axis=1).values
    df_test_x_matrix.shape

    (1459, 274)

    What? You thought that was it, that this can go straight into the model? Remember that the processed training data has 301 features, but here there are only 274. Where did the missing 27 features go? Which features are they, and why are they missing?

    The reason is that a categorical feature in the test set may have fewer distinct categories than in the training set. Suppose both sets have a feature called color: the training set contains "red, yellow, blue", but the test set contains only "red, yellow". Then get_dummies produces three columns for the training set (color_red, color_yellow, color_blue) but only two for the test set (color_red, color_yellow), leaving the test set one feature short. That is where the missing features come from; knowing the cause, the fix is simply to add the missing columns back. Onward.
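
    Here is a tiny self-contained demonstration of the effect, and of the reindex fix used below (the color data is made up for illustration):

    toy_train = pd.DataFrame({"color": ["red", "yellow", "blue"]})
    toy_test = pd.DataFrame({"color": ["red", "yellow"]})
    toy_train_d = pd.get_dummies(toy_train)  # columns: color_blue, color_red, color_yellow
    toy_test_d = pd.get_dummies(toy_test)    # columns: color_red, color_yellow
    # Align the test columns to the training columns: missing columns are added
    # and filled with 0, extra columns are dropped, and the order matches training
    toy_test_d = toy_test_d.reindex(columns=toy_train_d.columns, fill_value=0)
    print(toy_test_d)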

    index = 0
    for feature in df_train_cate_dummies.columns:
        if feature not in df_test_cate_dummies:
            index += 1
            print(index, feature)
            df_test_cate_dummies[feature] = 0
    df_test_cate_dummies.info()

    All the missing features are filled in with 0.

    What? You thought that was enough? Think about it: if the training set can have more features than the test set, can the test set also have features the training set lacks? Of course it can. In that case, simply drop the extra test-set columns, as sketched below.
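
    A sketch for detecting such extra columns explicitly (in this walkthrough the reindex in the next step drops them anyway, so this is just for inspection):

    extra_features = [feature for feature in df_test_cate_dummies.columns
                      if feature not in df_train_cate_dummies.columns]
    print(extra_features)
    # df_test_cate_dummies = df_test_cate_dummies.drop(extra_features, axis=1)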

    Can we predict now? The result would still be terrible. The test set now has the same number and the same set of features as the training set, but the column order is almost certainly different, so the test set's columns must be aligned to the training set's order.

    df_test_cate_dummies = df_test_cate_dummies.reindex(columns=df_train_cate_dummies.columns)
    df_test_x_matrix = pd.concat([df_test_num_scaled, df_test_cate_dummies], axis=1).values
    df_test_x_matrix.shape

    (1459, 301)

    Then apply the same feature selection to the test set:

    test_X_new = model.transform(df_test_x_matrix)
    test_X_new.shape

    (1459, 108)

    18. Predict. The last step, at long last!

    # cross_val_score and learning_curve fit clones of the estimator, so the model
    # must still be fitted on the full training set before it can predict
    best_rfr.fit(train_X_new, train_y)
    y_pred = best_rfr.predict(test_X_new)
    y_pred.shape
    y_pred[:10]

    array([11.74767867, 11.87027992, 12.03688675, 12.09013487, 12.26120576, 12.07010157, 11.89011736, 12.05931896, 12.14915021, 11.73820217])

    Remember that the SalePrice target was log-transformed earlier; now the predictions have to be transformed back.

    y_pred = np.expm1(y_pred)
    y_pred[:10]

    array([126458.66356193, 142953.24835482, 168869.38535629, 178105.12766462,  211335.25272186, 174572.57668388, 145817.40958437, 172700.3291329 , 188932.46869204, 125265.92920833])

    Save the predictions to a CSV file:

    from datetime import datetime

    output = pd.DataFrame({
        "Id": df_test["Id"],
        "SalePrice": y_pred
    })
    now_time = datetime.now().strftime("%Y%m%d%H%M%S%f")
    output.to_csv("./outputs/random_forest_%s.csv" % now_time, index=False)

    Finally, submit the prediction file to Kaggle and see how it scores!
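
    Submitting through the website works fine; if the kaggle package from earlier is set up, the submission can also be scripted. A sketch under the same competition-slug assumption as before; the message text is arbitrary:

    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    submission_file = "./outputs/random_forest_%s.csv" % now_time  # file written above
    api.competition_submit(submission_file, "random forest baseline",
                           "house-prices-advanced-regression-techniques")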
