主成分回归
思路
利用主成分分析,从变量中提取主成分,进而利用主成分做回归
关于主成分分析的基本步骤,可参考> >
步骤
Step1. 原始数据标准化
Step2. 计算自变量之间的相关矩阵
Step3. 检验是否适合做主成分分析
Step4. 主成分提取
Step5. 计算主成分得分
Step6. 做主成分回归
实例
首先我们导入数据
ce <- read.table(file.choose(), header = T)
ce
## X1 X2 X3 X4 Y
## 1 7 26 6 60 78.5
## 2 1 29 15 52 74.3
## 3 11 56 8 20 104.3
## 4 11 31 8 47 87.6
## 5 7 52 6 33 95.9
## 6 11 55 9 22 109.2
## 7 3 71 17 6 102.7
## 8 1 31 22 44 72.5
## 9 2 54 18 22 93.1
## 10 21 47 4 26 115.9
## 11 1 40 23 34 83.8
## 12 11 66 9 12 113.3
## 13 10 68 8 12 109.4
Step1. 原始数据标准化
scale(ce) -> ce.s
ce.s
## X1 X2 X3 X4 Y
## 1 -0.07846099 -1.42368840 -0.9007209 1.7923096 -1.12492615
## 2 -1.09845379 -1.23089726 0.5044037 1.3143603 -1.40411237
## 3 0.60153422 0.50422298 -0.5884710 -0.5974365 0.59007490
## 4 0.60153422 -1.10236984 -0.5884710 1.0156421 -0.52002268
## 5 -0.07846099 0.24716813 -0.9007209 0.1792310 0.03170246
## 6 0.60153422 0.43995926 -0.4323460 -0.4779492 0.91579215
## 7 -0.75845619 1.46817866 0.8166536 -1.4338477 0.48371824
## 8 -1.09845379 -1.10236984 1.5972783 0.8364111 -1.52376360
## 9 -0.92845499 0.37569555 0.9727785 -0.4779492 -0.15442168
## 10 2.30152223 -0.07415044 -1.2129708 -0.2389746 1.36116064
## 11 -1.09845379 -0.52399643 1.7534033 0.2389746 -0.77261973
## 12 0.60153422 1.14686010 -0.4323460 -1.0753857 1.18833108
## 13 0.43153542 1.27538753 -0.5884710 -1.0753857 0.92908673
## attr(,"scaled:center")
## X1 X2 X3 X4 Y
## 7.461538 48.153846 11.769231 30.000000 95.423077
## attr(,"scaled:scale")
## X1 X2 X3 X4 Y
## 5.882394 15.560881 6.405126 16.738180 15.043723
Step2. 计算自变量之间的相关矩阵
cor(ce[,1:4]) -> ce.cor # 使自变量之间的相关矩阵,因此选前四列
Step3. 检验是否适合做主成分分析
- KMO法
KMO(ce.cor)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = ce.cor)
## Overall MSA = 0.24
## MSA for each item =
## X1 X2 X3 X4
## 0.22 0.26 0.20 0.26
发现 \(MSA = 0.24 < 0.7\)
- bartlett法
cortest.bartlett(ce.cor, n = 13) # n为样本个数,即有几行
## $chisq
## [1] 67.28248
##
## $p.value
## [1] 1.473373e-12
##
## $df
## [1] 6
发现 \(p << 0.05\).
虽然 KMO 检验未通过,但 bartlett 检验通过了,我们主要看bartlett检验,因此认为样本适合做主成分.
Step4. 主成分提取
princomp(ce[,1:4], cor = T) -> ce.pca
summary(ce.pca, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.495227 1.2554147 0.43197934
## Proportion of Variance 0.558926 0.3940165 0.04665154
## Cumulative Proportion 0.558926 0.9529425 0.99959406
## Comp.4
## Standard deviation 0.0402957285
## Proportion of Variance 0.0004059364
## Cumulative Proportion 1.0000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## X1 0.476 0.509 0.676 0.241
## X2 0.564 -0.414 -0.314 0.642
## X3 -0.394 -0.605 0.638 0.268
## X4 -0.548 0.451 -0.195 0.677
因此可以选两个或三个主成分,这里我们选三个. 那么保留了 99.96% 的样本信息.
Step5. 计算主成分得分
as.matrix(ce.s[, 1:4]) %*% ce.pca$loadings[, 1:3] -> Z1
Z1
## Comp.1 Comp.2 Comp.3
## 1 -1.4672378 1.9030357 -0.53000021
## 2 -2.1358287 0.2383537 -0.29018631
## 3 1.1298705 0.1838772 -0.01071260
## 4 -0.6598955 1.5767742 0.17920364
## 5 0.3587646 0.4835379 -0.74012228
## 6 0.9666396 0.1699440 0.08570239
## 7 0.9307051 -2.1348165 -0.17298608
## 8 -2.2321380 -0.6916707 0.45971977
## 9 -0.3515156 -1.4322451 -0.03156438
## 10 1.6625430 1.8280966 0.85119311
## 11 -1.6401800 -1.2951128 0.49417845
## 12 1.6925941 -0.3922488 -0.01981007
## 13 1.7456787 -0.4375255 -0.27461543
Step6. 做主成分回归
ZZ1 <- data.frame(Z1) # 化为数据框
lm(ce.s[, 5] ~ z1 + z2 + z3, data = ZZ1) -> ce.fit1 # data 必须要是数据框
summary(ce.fit1)
##
## Call:
## lm(formula = ce.s[, 5] ~ z1 + z2 + z3, data = ZZ1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20228 -0.12766 0.02411 0.07911 0.25621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.831e-16 4.281e-02 0.000 1.0000
## z1 6.570e-01 2.980e-02 22.045 3.84e-09 ***
## z2 -8.309e-03 3.549e-02 -0.234 0.8202
## z3 3.028e-01 1.031e-01 2.935 0.0166 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1544 on 9 degrees of freedom
## Multiple R-squared: 0.9821, Adjusted R-squared: 0.9762
## F-statistic: 164.9 on 3 and 9 DF, p-value: 3.5e-08
常数项明显未通过,剔除
lm(ce.s[, 5] ~ 0 + z1 + z2 + z3, data = ZZ1) -> ce.fit2 #data 必须要是数据框
summary(ce.fit2)
##
## Call:
## lm(formula = ce.s[, 5] ~ 0 + z1 + z2 + z3, data = ZZ1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20228 -0.12766 0.02411 0.07911 0.25621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z1 0.656958 0.028271 23.238 4.93e-10 ***
## z2 -0.008309 0.033671 -0.247 0.8101
## z3 0.302770 0.097856 3.094 0.0114 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1464 on 10 degrees of freedom
## Multiple R-squared: 0.9821, Adjusted R-squared: 0.9768
## F-statistic: 183.2 on 3 and 10 DF, p-value: 4.895e-09
z2 检验未通过,但是主成分回归不太在乎显著性检验,因此可以剔除也可以不剔除,这里我们不剔除.
于是
\[y = 0.656958z_1 - 0.008309z_2 + 0.302770z_3 \]下面得到 \(y\) 关于 \(x\) 的回归方程
rep(ce.fit2$coefficients, each = 4)*ce.pca$loadings[, 1:3] -> tempp
apply(tempp, 1, sum) -> ce.fit3
ce.fit3
## X1 X2 X3 X4
## 0.51297502 0.27868114 -0.06078483 -0.42288461
于是
\[y = 0.51297502x_1 + 0.27868114x_2 ? 0.06078483x_3 ? 0.42288461x_4 \]