Machine Learning: Regression and Classification 4-3 (AdaBoost Algorithm)


Predicting Black Friday Spending with AdaBoost

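Main steps:

  • 1. Import packages
  • 2. Import the dataset
  • 3. Data preprocessing
    • 3.1 Detect and handle missing values
    • 3.2 Drop unused columns
    • 3.3 Check categorical variables
    • 3.4 Label encoding & one-hot encoding
    • 3.5 Get the independent and dependent variables
    • 3.6 Split into training and test sets
    • 3.7 Feature scaling
  • 4. Build AdaBoost regression models with different parameters
    • 4.1 Model 1: build an AdaBoost regression model
      • 4.1.1 Build the model
      • 4.1.2 Predict on the test set
      • 4.1.3 Evaluate model performance
    • 4.2 Model 2: build an AdaBoost regression model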

Dataset link:

 

1. Import packages

In [2]:
# Import packages
import numpy as np
import pandas as pd

 

2. Import the dataset

In [3]:
# Import the dataset
data = pd.read_csv('BlackFriday.csv')
data
Out[3]:
       User_ID  Product_ID  Gender  Age    Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
0      1000001  P00069042   F       0-17   10          A              2                           0               3                   NaN                 NaN                 8370
1      1000001  P00248942   F       0-17   10          A              2                           0               1                   6.0                 14.0                15200
2      1000001  P00087842   F       0-17   10          A              2                           0               12                  NaN                 NaN                 1422
3      1000001  P00085442   F       0-17   10          A              2                           0               12                  14.0                NaN                 1057
4      1000002  P00285442   M       55+    16          C              4+                          0               8                   NaN                 NaN                 7969
...    ...      ...         ...     ...    ...         ...            ...                         ...             ...                 ...                 ...                 ...
49995  1001649  P00102642   M       18-25  19          C              2                           1               4                   8.0                 9.0                 1374
49996  1001649  P00035842   M       18-25  19          C              2                           1               5                   6.0                 9.0                 5372
49997  1001649  P00052842   M       18-25  19          C              2                           1               10                  15.0                NaN                 18879
49998  1001649  P00183142   M       18-25  19          C              2                           1               15                  NaN                 NaN                 17029
49999  1001650  P00155642   M       26-35  19          C              1                           0               8                   NaN                 NaN                 6093

50000 rows × 12 columns

 

3. Data preprocessing

3.1 Detect and handle missing values

In [4]:
# Count missing values per column
null_df = data.isnull().sum()
null_df
Out[4]:
User_ID                           0
Product_ID                        0
Gender                            0
Age                               0
Occupation                        0
City_Category                     0
Stay_In_Current_City_Years        0
Marital_Status                    0
Product_Category_1                0
Product_Category_2            15721
Product_Category_3            34817
Purchase                          0
dtype: int64
In [5]:
# Drop the two columns that contain missing values
data = data.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)
In [6]:
# Count missing values again to confirm none remain
null_df = data.isnull().sum()
null_df
Out[6]:
User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Purchase                      0
dtype: int64
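
In the output above, Product_Category_2 is missing in 15721 of 50000 rows (about 31%) and Product_Category_3 in 34817 of 50000 rows (about 70%), which is why the notebook drops both columns. The sketch below shows the same check expressed as a missing-value ratio. The toy frame and the variable names toy, missing_ratio, cols_with_na and toy_clean are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for BlackFriday.csv (values are illustrative only)
toy = pd.DataFrame({
    'Product_Category_1': [3, 1, 12, 12, 8],
    'Product_Category_2': [np.nan, 6.0, np.nan, 14.0, np.nan],
    'Purchase': [8370, 15200, 1422, 1057, 7969],
})

# Fraction of missing values in each column (mean of a boolean mask)
missing_ratio = toy.isnull().mean()

# Names of the columns that contain at least one missing value
cols_with_na = missing_ratio[missing_ratio > 0].index.tolist()

# Drop those columns, mirroring what the notebook does
toy_clean = toy.drop(columns=cols_with_na)

print(missing_ratio)
print(toy_clean.columns.tolist())
```

Dropping is the simplest option; imputing the missing categories would be an alternative, but the notebook does not explore it.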

3.2 Drop unused columns

In [7]:
# Drop the identifier columns, which are not used as features
data = data.drop(['User_ID', 'Product_ID'], axis = 1)

3.3 Check categorical variables

In [8]:
# Check which variables are categorical
print(data.dtypes)
Gender                        object
Age                           object
Occupation                     int64
City_Category                 object
Stay_In_Current_City_Years    object
Marital_Status                 int64
Product_Category_1             int64
Purchase                       int64
dtype: object
In [9]:
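# Convert variable types
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].replace('4+', '4')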
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].astype('int64')
data['Product_Category_1'] = data['Product_Category_1'].astype('object')
data['Occupation'] = data['Occupation'].astype('object')
data['Marital_Status'] = data['Marital_Status'].astype('object')
In [10]:
# Check the variable types again
print(data.dtypes)
Gender                        object
Age                           object
Occupation                    object
City_Category                 object
Stay_In_Current_City_Years     int64
Marital_Status                object
Product_Category_1            object
Purchase                       int64
dtype: object
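
The conversion above does two things: it turns the text column Stay_In_Current_City_Years into integers (treating '4+' as 4), and it marks integer-coded categories such as Occupation as object so that they are one-hot encoded later. A minimal sketch on a made-up frame (the name toy and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative data only: years stored as text, occupation stored as integer codes
toy = pd.DataFrame({
    'Stay_In_Current_City_Years': ['2', '4+', '1', '0'],
    'Occupation': [10, 16, 19, 7],
})

# Map the open-ended bucket '4+' to '4', then convert the whole column to integers
toy['Stay_In_Current_City_Years'] = (
    toy['Stay_In_Current_City_Years'].replace('4+', '4').astype('int64')
)

# Treat the occupation code as a category rather than a number
toy['Occupation'] = toy['Occupation'].astype('object')

print(toy.dtypes)
```

Note that mapping '4+' to 4 discards the information that the bucket is open-ended; this is a simplification the notebook accepts.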

3.4 Label encoding & one-hot encoding

In [11]:
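# One-hot encode every object column; drop_first=True drops the first level of each
# dtype=int keeps the 0/1 output shown below (recent pandas versions default to booleans)
data = pd.get_dummies(data, drop_first = True, dtype = int)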
data
Out[11]:
 Stay_In_Current_City_Years Purchase Gender_M Age_18-25 Age_26-35 Age_36-45 Age_46-50 Age_51-55 Age_55+ Occupation_1 ... Product_Category_1_9 Product_Category_1_10 Product_Category_1_11 Product_Category_1_12 Product_Category_1_13 Product_Category_1_14 Product_Category_1_15 Product_Category_1_16 Product_Category_1_17 Product_Category_1_18
0 2 8370 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 15200 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 2 1422 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 2 1057 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 4 7969 1 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2 1374 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49996 2 5372 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49997 2 18879 1 1 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
49998 2 17029 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
49999 1 6093 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

50000 rows × 49 columns
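
To make the effect of drop_first=True concrete, here is a small sketch on a made-up frame (toy and encoded are illustrative names). Each categorical column with k levels becomes k-1 indicator columns, and the dropped first level is represented by all of its indicators being 0.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'M'],
    'City_Category': ['A', 'C', 'B'],
    'Purchase': [8370, 7969, 6093],
})

# Numeric columns are kept as they are; text columns are expanded into indicators.
# 'F' and 'A' are the first levels, so they get no column of their own.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded)
```

This is why the Gender column of the real dataset appears as the single column Gender_M in the output above.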

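3.5 Get the independent and dependent variables

In [12]:
# Purchase is the dependent variable (y); all remaining columns are the independent variables (x)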
y = data['Purchase'].values
data = data.drop(['Purchase'], axis = 1)
x = data.values

3.6 Split into training and test sets

In [13]:
# Hold out 30% of the rows as the test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(35000, 48)
(15000, 48)
(35000,)
(15000,)
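
The split above puts 70% of the rows in the training set and 30% in the test set, and random_state=1 makes the shuffle reproducible. A minimal sketch with made-up arrays (all names and values here are illustrative) shows both properties:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data only: 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 30% of 10 samples go to the test set, the remaining 7 to the training set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1
)

# Repeating the call with the same random_state gives the same split
x_train_again, _, _, _ = train_test_split(x, y, test_size=0.3, random_state=1)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_train, x_train_again))
```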

3.7 Feature scaling

In [14]:
# Fit the scalers on the training set only, then reuse them on the test set
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
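
Because y_train is standardized here, the model later predicts on the standardized scale and its predictions have to be mapped back with inverse_transform before they are compared with y_test. The sketch below checks that round trip on a few made-up values (the numbers are illustrative). StandardScaler works on 2-D input, hence the reshape(-1, 1) and np.ravel calls.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative target values only
y_train = np.array([8370.0, 15200.0, 1422.0, 1057.0, 7969.0])

sc_y = StandardScaler()

# Scale to zero mean and unit variance; the scaler expects a 2-D column
y_scaled = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))

# Map the scaled values back to the original units
y_restored = np.ravel(sc_y.inverse_transform(y_scaled.reshape(-1, 1)))

print(y_scaled)
print(y_restored)
```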

 

4. Build AdaBoost regression models with different parameters

4.1 Model 1: build an AdaBoost regression model

4.1.1 Build the model

In [15]:
# Build AdaBoost regression models with different parameters
# Model 1: AdaBoost regression model (base_estimator=None, n_estimators=50, learning_rate=1)
from sklearn.ensemble import AdaBoostRegressor
regressor = AdaBoostRegressor(n_estimators=50, learning_rate=1, loss='linear', random_state=0)
regressor.fit(x_train, y_train)
Out[15]:
AdaBoostRegressor(learning_rate=1, random_state=0)
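
AdaBoost adds its base estimators one at a time, so it can be useful to look at how the test score changes as the ensemble grows. The sketch below does this with staged_predict on synthetic data from make_regression; the dataset, its size, and the names model, staged_r2 and final_r2 are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only to keep the example self-contained
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Same settings as model 1
model = AdaBoostRegressor(n_estimators=50, learning_rate=1.0,
                          loss='linear', random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble prediction after 1, 2, ... estimators
staged_r2 = [r2_score(y_test, pred) for pred in model.staged_predict(X_test)]

# The last stage is the prediction of the full ensemble
final_r2 = r2_score(y_test, model.predict(X_test))

print(len(staged_r2), staged_r2[0], staged_r2[-1])
```

Boosting can stop before n_estimators is reached, so the number of stages equals len(model.estimators_) rather than always 50.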

4.1.2 Predict on the test set

In [16]:
# Predict on the test set (predictions are on the standardized scale)
y_pred = regressor.predict(x_test)
y_pred[:5]
Out[16]:
array([ 0.65004612,  0.18805047,  0.55246384,  0.55246384, -0.01299079])
In [17]:
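# Convert y_pred back to the original scale
# inverse_transform expects a 2-D array, so reshape first and flatten afterwards
y_pred = np.ravel(sc_y.inverse_transform(y_pred.reshape(-1, 1)))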
y_pred[:5]
Out[17]:
array([12501.03102111, 10206.94045722, 12016.4753753 , 12016.4753753 ,
        9208.64781524])

4.1.3 Evaluate model performance

In [18]:
# Evaluate model performance
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)
R2 Score: 0.3075923325444474
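
For reference, the R2 score reported above is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand on a few made-up numbers (y_true and y_hat are illustrative) and compares the result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only
y_true = np.array([8370.0, 15200.0, 1422.0, 1057.0, 7969.0])
y_hat = np.array([9000.0, 13000.0, 3000.0, 2000.0, 8000.0])

# Residual sum of squares: error left after using the predictions
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: error of always predicting the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than predicting the mean.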

4.2 Model 2: build an AdaBoost regression model

In [29]:
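# Model 2: AdaBoost regression model (base_estimator=DecisionTreeRegressor, n_estimators=1000, learning_rate=0.2)
# Note: scikit-learn 1.2 renamed base_estimator to estimator; use estimator=... on newer versions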
from sklearn.tree import DecisionTreeRegressor
regressor = AdaBoostRegressor(base_estimator = DecisionTreeRegressor(min_samples_split=100, max_depth=10, min_samples_leaf=10), 
                              n_estimators=1000, learning_rate=0.2, loss='linear', random_state=0)
regressor.fit(x_train, y_train)
Out[29]:
AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=10,
                                                       min_samples_leaf=10,
                                                       min_samples_split=100),
                  learning_rate=0.2, n_estimators=1000, random_state=0)
In [30]:
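# Predict on the test set
y_pred = regressor.predict(x_test)
# Convert y_pred back to the original scale (inverse_transform expects a 2-D array)
y_pred = np.ravel(sc_y.inverse_transform(y_pred.reshape(-1, 1)))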
In [31]:
# Evaluate model performance
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)
R2 Score: 0.6070474824774648
 

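Conclusion: the two models show that hyperparameter choice has a large effect on performance. Model 1 (default base estimator, 50 estimators, learning rate 1) reaches an R2 of about 0.31 on the test set, while model 2 (a depth-10 decision tree as the base estimator, 1000 estimators, learning rate 0.2) reaches about 0.61.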
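
Instead of trying settings by hand, the comparison could also be run as a cross-validated grid search. The sketch below is one possible setup, not the notebook's method: it uses synthetic data from make_regression and a deliberately tiny, arbitrary grid so that it runs quickly. All names and grid values are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to keep the example self-contained
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A small, arbitrary grid over two of the hyperparameters varied in this notebook
param_grid = {
    'n_estimators': [20, 50],
    'learning_rate': [0.2, 1.0],
}

# 3-fold cross-validation, scored with R2 like the evaluation above
search = GridSearchCV(AdaBoostRegressor(random_state=0), param_grid,
                      scoring='r2', cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

On the real dataset the grid would also need to cover the base tree's depth, since that is one of the settings that differs between model 1 and model 2.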