
news 2026/1/21 22:11:17

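House Price Prediction Case (Advanced Version)

This is the advanced version of the notebook. Its main purpose is to compare several modeling frameworks, so I have left the earlier feature-engineering part unchanged; the focus is on the model-building section at the end.

Step 1: Inspect the source data

    import numpy as np
    import pandas as pd

Reading the data

The index column of the source data is usually of no use by itself, so we can use it as the index of our pandas DataFrame. That also makes lookups easier later on.

Wherever there are people there is a pecking order, just like on Zhihu, and Kaggle is no exception. Kaggle puts data in the input folder by default, so when we write a tutorial we may as well follow that convention and look the part.

    train_df = pd.read_csv('../input/train.csv', index_col=0)
    test_df = pd.read_csv('../input/test.csv', index_col=0)

Inspecting the data

    train_df.head()

    Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  LotConfig  ...
    1   60          RL        65.0         8450     Pave    NaN    Reg       Lvl          AllPub     Inside     ...
    2   20          RL        80.0         9600     Pave    NaN    Reg       Lvl          AllPub     FR2        ...
    3   60          RL        68.0         11250    Pave    NaN    IR1       Lvl          AllPub     Inside     ...
    4   70          RL        60.0         9550     Pave    NaN    IR1       Lvl          AllPub     Corner     ...
    5   60          RL        84.0         14260    Pave    NaN    IR1       Lvl          AllPub     FR2        ...

    Id  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
    1   ...  0         NaN     NaN    NaN          0        2       2008    WD        Normal         208500
    2   ...  0         NaN     NaN    NaN          0        5       2007    WD        Normal         181500
    3   ...  0         NaN     NaN    NaN          0        9       2008    WD        Normal         223500
    4   ...  0         NaN     NaN    NaN          0        2       2006    WD        Abnorml        140000
    5   ...  0         NaN     NaN    NaN          0        12      2008    WD        Normal         250000

    5 rows × 80 columns

At this point we should have a rough idea of which parts need manual handling so that the source data becomes easier to process.

Step 2: Merge the data

We merge the training and test sets mainly because it makes preprocessing with a DataFrame more convenient. Once all the preprocessing is done, we split them apart again.

First of all, SalePrice is our training target. It appears only in the training set and not in the test set (otherwise, what would you be testing?). So we take the SalePrice column out first, so that it does not get in the way.

Let's look at what SalePrice looks like:

    %matplotlib inline
    prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price + 1)': np.log1p(train_df['SalePrice'])})
    prices.hist()

(Figure: two histograms, one of price and one of log(price + 1).)

As the histograms show, the label itself is not smooth. To help the model learn more accurately, we first "smooth" the label, that is, bring it closer to a normal distribution. Many people miss this step, and as a result their scores never reach a decent level.

Here we use log1p, that is, log(x + 1), which avoids the trouble plain log() has with zero and with values between 0 and 1 (undefined or negative results). Remember: if we smooth the label here, then the predicted smoothed values have to be transformed back when we compute the final result. The rule is "undo it the same way you did it": log1p() needs expm1(); likewise log() needs exp(), and so on.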
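As a quick check of the "undo it the same way you did it" rule, here is a minimal sketch that uses the five SalePrice values visible in the head() output. The variable names are illustrative and are not part of the original notebook.

```python
import numpy as np

# The five SalePrice values shown in train_df.head()
sale_price = np.array([208500.0, 181500.0, 223500.0, 140000.0, 250000.0])

# Forward transform: log(1 + x); this is the scale the model is trained on
log_price = np.log1p(sale_price)

# Inverse transform: exp(x) - 1; apply this to predictions before reporting them
recovered = np.expm1(log_price)

print(np.round(log_price, 6))
print(np.allclose(recovered, sale_price))
```

The same pairing applies to predictions: whatever a model trained on the log1p scale outputs should go through np.expm1 before it is read as a price.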
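Take the SalePrice column out of train_df and log-transform it:

    y_train = np.log1p(train_df.pop('SalePrice'))

Then we merge what is left:

    all_df = pd.concat((train_df, test_df), axis=0)

all_df is now the combined DataFrame:

    all_df.shape

    (2919, 79)

and y_train is the SalePrice column:

    y_train.head()

    Id
    1    12.247699
    2    12.109016
    3    12.317171
    4    11.849405
    5    12.429220
    Name: SalePrice, dtype: float64

Step 3: Variable transformation

This is similar to feature engineering: we unify the data that is inconvenient to process or inconsistent.

Correcting variable types

First, notice that the values of MSSubClass are really categories, but pandas has no way of knowing that. In a DataFrame, numeric codes like these are treated as numbers by default, which is misleading, so we convert the column back to strings.

    all_df['MSSubClass'].dtypes

    dtype('int64')

    all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)

Once it is a string column, a simple count makes things clear:

    all_df['MSSubClass'].value_counts()

    20     1079
    60      575
    50      287
    120     182
    30      139
    70      128
    160     128
    80      118
    90      109
    190      61
    85       48
    75       23
    45       18
    180      17
    40        6
    150       1
    Name: MSSubClass, dtype: int64

Turning categorical variables into a numerical representation

When we express a categorical variable with numbers, we have to remember that numbers carry an ordering, so assigning them carelessly will cause trouble for the model later. Instead, we can express the categories with One-Hot encoding.

The get_dummies method that ships with pandas does One-Hot encoding in a single call.

    pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass').head()

(Column headers are abbreviated: 120 stands for MSSubClass_120, and so on.)

    Id  120  150  160  180  190  20  30  40  45  50  60  70  75  80  85  90
    1     0    0    0    0    0   0   0   0   0   0   1   0   0   0   0   0
    2     0    0    0    0    0   1   0   0   0   0   0   0   0   0   0   0
    3     0    0    0    0    0   0   0   0   0   0   1   0   0   0   0   0
    4     0    0    0    0    0   0   0   0   0   0   0   1   0   0   0   0
    5     0    0    0    0    0   0   0   0   0   0   1   0   0   0   0   0

MSSubClass has now been split into 16 columns, one per category: 1 if the row belongs to that category, 0 if not.

In the same way, we One-Hot encode all the categorical data:

    all_dummy_df = pd.get_dummies(all_df)
    all_dummy_df.head()

    Id  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  ...
    1   65.0         8450     7            5            2003       2003          196.0       706.0       0.0         150.0      ...
    2   80.0         9600     6            8            1976       1976          0.0         978.0       0.0         284.0      ...
    3   68.0         11250    7            5            2001       2002          162.0       486.0       0.0         434.0      ...
    4   60.0         9550     7            5            1915       1970          0.0         216.0       0.0         540.0      ...
    5   84.0         14260    8            5            2000       2000          350.0       655.0       0.0         490.0      ...

    Id  ...  SaleType_ConLw  SaleType_New  SaleType_Oth  SaleType_WD  SaleCondition_Abnorml  SaleCondition_AdjLand  SaleCondition_Alloca  SaleCondition_Family  SaleCondition_Normal  SaleCondition_Partial
    1   ...  0               0             0             1            0                      0                      0                     0                     1                     0
    2   ...  0               0             0             1            0                      0                      0                     0                     1                     0
    3   ...  0               0             0             1            0                      0                      0                     0                     1                     0
    4   ...  0               0             0             1            1                      0                      0                     0                     0                     0
    5   ...  0               0             0             1            0                      0                      0                     0                     1                     0

    5 rows × 303 columns

Handling the numerical variables

Even the numerical variables still have a few small problems. For example, some of the data is missing.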
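One detail worth keeping in mind when looking for missing values after One-Hot encoding: with its default settings (dummy_na left at False), pd.get_dummies encodes a missing category as a row of all zeros rather than as NaN, so only the originally numerical columns can still contain NaN afterwards. The sketch below shows this on a toy frame; the three rows are made up for illustration, only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows, real column names)
toy = pd.DataFrame(
    {
        "MSSubClass": ["60", "20", "60"],     # category stored as strings
        "Alley": ["Pave", np.nan, np.nan],    # categorical column with missing values
        "LotFrontage": [65.0, np.nan, 68.0],  # numerical column with a missing value
    }
)

# One-Hot encode every string column; numerical columns pass through unchanged
toy_dummies = pd.get_dummies(toy)

print(toy_dummies.columns.tolist())
# Missing categories became all-zero rows, so only LotFrontage still reports a NaN
print(toy_dummies.isnull().sum())
```

If the missing categories should be kept as their own column instead, get_dummies accepts dummy_na=True; the notebook does not do this, so it is mentioned only as an option.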
Count the missing values per column and list the ten worst:

    all_dummy_df.isnull().sum().sort_values(ascending=False).head(10)

    LotFrontage     486
    GarageYrBlt     159
    MasVnrArea       23
    BsmtHalfBath      2
    BsmtFullBath      2
    BsmtFinSF2        1
    GarageCars        1
    TotalBsmtSF       1
    BsmtUnfSF         1
    GarageArea        1
    dtype: int64

The column with the most missing values is LotFrontage.

Handling this missing information requires reading the problem carefully. Usually the dataset description states clearly what each kind of missing value means. If it really does not, we can only fall back on our own assumptions.

Here we fill the gaps with the column means.

    mean_cols = all_dummy_df.mean()
    mean_cols.head(10)

    LotFrontage        69.305795
    LotArea         10168.114080
    OverallQual         6.089072
    OverallCond         5.564577
    YearBuilt        1971.312778
    YearRemodAdd     1984.264474
    MasVnrArea        102.201312
    BsmtFinSF1        441.423235
    BsmtFinSF2         49.582248
    BsmtUnfSF         560.772104
    dtype: float64

    all_dummy_df = all_dummy_df.fillna(mean_cols)

Check that no gaps are left:

    all_dummy_df.isnull().sum().sum()

    0

Standardizing the numerical data

This step is not strictly necessary; it depends on which model you want to use. Generally speaking, regression models are rather picky, and it is best to put the source data on a standard scale so that the differences between features are not too large.

Of course, we do not need to standardize the 0/1 One-Hot columns. Our target is the data that was numerical to begin with.

First, let's see which columns are numerical:

    numeric_cols = all_df.columns[all_df.dtypes != 'object']
    numeric_cols

    Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
           'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
           'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
           'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
           'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
           'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
           'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
           '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold',
           'YrSold'],
          dtype='object')

We compute the standard score, (X - mean(X)) / s, where s is the column's standard deviation. This makes the data points smoother and easier to compute with.

Note that we could also keep using log here; I am just showing several ways of "smoothing the data".

    numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
    numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
    all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std

Step 4: Build the models

Splitting the data back into training and test sets

    dummy_train_df = all_dummy_df.loc[train_df.index]
    dummy_test_df = all_dummy_df.loc[test_df.index]

    dummy_train_df.shape, dummy_test_df.shape

    ((1460, 303), (1459, 303))

    X_train = dummy_train_df.values
    X_test = dummy_test_df.values

A bit of more advanced ensembling
Generally speaking, the performance of a single model is quite limited. We tend to combine many models into one "combined model" to get the best result.

From the earlier experiments we know that Ridge(alpha=15) gave us the best result.

    from sklearn.linear_model import Ridge
    ridge = Ridge(15)

Bagging

Bagging puts many small models together, trains each one on a random part of the data, and then combines their final results (by averaging for regression, by majority vote for classification).

Sklearn already provides this framework, so we can call it directly:

    from sklearn.ensemble import BaggingRegressor
    from sklearn.model_selection import cross_val_score

Here we use the CV result to test how the number of models affects the final score.

Note that when setting up Bagging, you have to pass your small model, ridge, as the base_estimator argument.

    params = [1, 10, 15, 20, 25, 30, 40]
    test_scores = []
    for param in params:
        # In scikit-learn 1.2 and later this argument is named `estimator`
        clf = BaggingRegressor(n_estimators=param, base_estimator=ridge)
        test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))

    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.plot(params, test_scores)
    plt.title('n_estimator vs CV Error');

(Figure: n_estimator vs CV Error.)

In the previous version, the best ridge result was only about 0.135, whereas here a bagging of 25 small ridge models gets below 0.132.

Of course, if you have not tested a ridge model beforehand, you can also use the DecisionTree model that Bagging uses by default. The code is the same; just drop base_estimator:

    params = [10, 15, 20, 25, 30, 40, 50, 60, 70, 100]
    test_scores = []
    for param in params:
        clf = BaggingRegressor(n_estimators=param)
        test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))

    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.plot(params, test_scores)
    plt.title('n_estimator vs CV Error');

(Figure: n_estimator vs CV Error.)

Hmm, it looks like plain decision trees do not work that well. The best result is only about 0.140.

Boosting

In theory, Boosting is a bit more sophisticated than Bagging. It also gathers a bunch of models, but arranges them in sequence: each model gives more weight to the samples the previous model handled badly, so that it learns that part more "deeply".

    from sklearn.ensemble import AdaBoostRegressor

    params = [10, 15, 20, 25, 30, 35, 40, 45, 50]
    test_scores = []
    for param in params:
        # The scraped text showed BaggingRegressor here; this section describes AdaBoost
        clf = AdaBoostRegressor(n_estimators=param, base_estimator=ridge)
        test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))

    plt.plot(params, test_scores)
    plt.title('n_estimator vs CV Error');

(Figure: n_estimator vs CV Error.)

With 25 small models, Adaboost + Ridge also gets close to 0.132 here.

Likewise, you do not have to pass base_estimator; you can use the decision tree that Adaboost uses by default.

    params = [10, 15, 20, 25, 30, 35, 40, 45, 50]
    test_scores = []
    for param in params:
        clf = AdaBoostRegressor(n_estimators=param)
        test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))

    plt.plot(params, test_scores)
    plt.title('n_estimator vs CV Error');

(Figure: n_estimator vs CV Error.)

It looks like we may need to tune our decision tree model first before running this experiment.

XGBoost

Finally, let's look at the mighty XGBoost, nicknamed the Kaggle secret weapon.

It is still a Boosting-style model, but with many improvements.

    from xgboost import XGBRegressor

We test the model with Sklearn's built-in cross-validation:

    params = [1, 2, 3, 4, 5, 6]
    test_scores = []
    for param in params:
        clf = XGBRegressor(max_depth=param)
        test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))

We store all the CV values and see which max_depth value works better (in other words, "parameter tuning").

    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.plot(params, test_scores)
    plt.title('max_depth vs CV Error');

(Figure: max_depth vs CV Error.)

Wow: at a depth of 5, the CV error drops to 0.127.

That is why everyone in the restless competition scene is using XGBoost.
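All the parameter sweeps in this section repeat the same loop: build a model for each parameter value, cross-validate it, and record the mean RMSE. A minimal sketch of that loop as a reusable helper follows. The function name cv_rmse_sweep, the synthetic data from make_regression, and the small grid are assumptions made for illustration; they are not part of the original notebook. The ridge model is passed to BaggingRegressor positionally, which, as far as I know, sidesteps the base_estimator / estimator rename between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def cv_rmse_sweep(make_model, params, X, y, cv=5):
    """Return the mean cross-validated RMSE for each value in params."""
    scores = []
    for param in params:
        model = make_model(param)
        # scikit-learn reports negated MSE, so flip the sign before the square root
        mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
        scores.append(float(np.mean(np.sqrt(mse))))
    return scores


# Synthetic stand-in for X_train / y_train (hypothetical data, not the housing set)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

n_estimators_grid = [1, 5, 10]
rmse_by_n = cv_rmse_sweep(
    lambda n: BaggingRegressor(Ridge(alpha=15), n_estimators=n, random_state=0),
    n_estimators_grid,
    X_demo,
    y_demo,
)

# Pick the grid value with the lowest mean RMSE
best_n = n_estimators_grid[int(np.argmin(rmse_by_n))]
print(list(zip(n_estimators_grid, rmse_by_n)), best_n)
```

With the real data, the same helper would be called with X_train, y_train and cv=10, and the lambda swapped for AdaBoostRegressor or XGBRegressor(max_depth=...) to reproduce the other sweeps.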


http://www.yutouwan.com/news/409094/