Robust OLS Regression
1 What Is Robust OLS
When running OLS (ordinary least squares) regression, outliers or extreme values that are not recording errors pose a problem: there is usually no sound justification for deleting them, yet keeping them distorts the fit. Robust OLS offers a compromise between dropping and retaining extreme observations. Instead of treating every observation "equally" as OLS does, it assigns each observation a weight according to how extreme it is, turning OLS into a weighted least squares (WLS) problem. This is also called robust least squares estimation.
2 Implementing Robust OLS in R
The idea of robust WLS is to give observations with large residuals small weights and observations with small residuals large weights. Subject to the moment condition
$$\sum_{i=1}^{n} w_{i}\left(y_{i}-x_{i}^{\prime} b\right) x_{i}^{\prime}=0$$
we need to minimize the objective function

$$\sum_{i=1}^{n} w_{i}^{2} e_{i}^{2},$$

where $e_i = y_i - x_i^{\prime}b$. Here the weight $w_i$ depends on the size of the residual, while the residual in turn depends on the weights, so the WLS fit must be iterated until convergence. In R, the rlm command in the MASS package provides two fitting options, Huber and bisquare. The Huber method gives weight 1 to observations with small residuals and progressively smaller weights as residuals grow; the bisquare method gives progressively decreasing weights to every observation with a nonzero residual. The code is as follows:
x = seq(1,100,1)
y = numeric()
set.seed(1110)
# Simulate the data: a linear trend plus noise, with two extreme values appended
for (i in 1:98){
y[i] <- 1 + 2*x[i] + rnorm(1,0,8)
}
y[99] = 300; y[100] = 310
mydata = data.frame(x,y)
# Descriptive statistics
summary(mydata)
# x y
# Min. : 1.00 Min. : 0.8483
# 1st Qu.: 25.75 1st Qu.: 58.0826
# Median : 50.50 Median : 98.6099
# Mean : 50.50 Mean :106.0761
# 3rd Qu.: 75.25 3rd Qu.:157.4425
# Max. :100.00 Max. :310.0000
plot(x,y)
Next, estimate the model by OLS:
OLS = lm(y~1+x)
summary(OLS)
# Call:
# lm(formula = y ~ 1 + x)
#
# Residuals:
# Min 1Q Median 3Q Max
# -25.260 -7.035 -2.377 4.342 98.452
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.52615 3.28683 -0.464 0.643
# x 2.13074 0.05651 37.708 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.31 on 98 degrees of freedom
# Multiple R-squared: 0.9355, Adjusted R-squared: 0.9349
# F-statistic: 1422 on 1 and 98 DF, p-value: < 2.2e-16
opar = par(mfrow = c(2,2),oma = c(0,0,1.1,0))
plot(OLS,las =1)
Use Cook's distance to flag outliers:
library(MASS)
dl = cooks.distance(OLS)
# Cook's distance flags outliers: the larger the value, the more extreme
# the observation; a common cutoff is 4/n, where n is the number of observations
r = stdres(OLS) # standardized residuals
a = cbind(mydata,dl,r) # bind into a data frame
n = nrow(mydata)
a[dl > 4/n,] # observations with large Cook's distance
# x y dl r
# 94 94 173.5033 0.04192049 -1.574604
# 99 99 300.0000 0.63732420 5.662780
# 100 100 310.0000 0.77793006 6.158492
# Outliers can also be flagged by the size of the residuals
# Absolute values of the standardized residuals
absr = abs(r) # abs() takes absolute values
b = cbind(a,absr) # bind into the data frame
# Sort by absr in descending order
b[order(-absr),]
# x y dl r absr
# 100 100 310.0000000 7.779301e-01 6.15849188 6.15849188
# 99 99 300.0000000 6.373242e-01 5.66278028 5.66278028
# 94 94 173.5032991 4.192049e-02 -1.57460371 1.57460371
# 86 86 158.5037817 2.677241e-02 -1.44139991 1.44139991
# 27 27 77.4304636 1.483583e-02 1.32469026 1.32469026 (output truncated)
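As a cross-check on what cooks.distance() reports, Cook's distance can also be computed from its definition, $D_i = r_i^2 h_i / \big(p(1-h_i)\big)$, where $r_i$ is the internally studentized residual, $h_i$ the leverage, and $p$ the number of estimated coefficients. A sketch on a small simulated fit (independent of the data above):

```r
# Cook's distance from its definition, checked against cooks.distance()
set.seed(1)
x <- 1:30
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)
h <- hatvalues(fit)      # leverages
r <- rstandard(fit)      # internally studentized residuals
p <- length(coef(fit))   # number of estimated coefficients
D <- r^2 * h / (p * (1 - h))
all.equal(as.vector(D), as.vector(cooks.distance(fit)))  # TRUE
```

This makes explicit that an observation needs both a large residual and high leverage to receive a large Cook's distance, which is why observations 99 and 100 (extreme y at extreme x) dominate the table above.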
Now apply robust OLS. The first option is the Huber method; the code is as follows:
# Robust OLS (fit by iteration)
# Huber weight function
rhuber = rlm(y~1+x,data = mydata)
summary(rhuber)
# Call: rlm(formula = y ~ 1 + x, data = mydata)
# Residuals:
# Min 1Q Median 3Q Max
# -19.3314 -5.7089 -0.9472 5.5412 104.9583
#
# Coefficients:
# Value Std. Error t value
# (Intercept) 1.5907 1.7747 0.8964
# x 2.0345 0.0305 66.6849
#
# Residual standard error: 8.621 on 98 degrees of freedom
# Inspect the weights and residuals
hweight = data.frame(x = mydata$x,resid = rhuber$resid,
weight = rhuber$w)
hweight2 = hweight[order(rhuber$w),]
hweight2
# x resid weight
# 100 100 104.9582697 0.1104698
# 99 99 96.9927798 0.1195422
# 27 27 20.9079675 0.5545405
# 67 67 19.6591831 0.5897938
# 94 94 -19.3313709 0.5997452
# 86 86 -18.0548078 0.6421543
# 54 54 -17.6629891 0.6564270
# 51 51 -16.9657262 0.6834076
# 6 6 15.5185894 0.7470942
# 24 24 15.3279583 0.7564064
# 21 21 -15.1943676 0.7631149
# 14 14 -14.8653757 0.7800132
# 41 41 13.7075274 0.8458463
# 7 7 13.6350060 0.8502942
# 20 20 -13.5334190 0.8567773
# 44 44 -12.6116221 0.9193613 (output truncated)
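The weights printed above come from rlm's internal iteratively reweighted least squares (IRLS). A minimal sketch of that iteration with the Huber weight function, assuming the tuning constant k = 1.345 (rlm's documented default) and the MAD as the scale estimate; this is an illustration of the idea, not an exact replication of rlm's internals:

```r
# Minimal IRLS sketch with Huber weights; an illustration, not a
# replacement for MASS::rlm
irls_huber <- function(x, y, k = 1.345, tol = 1e-8, maxit = 100) {
  X <- cbind(1, x)                    # design matrix with intercept
  b <- solve(t(X) %*% X, t(X) %*% y)  # start from the OLS estimate
  w <- rep(1, length(y))
  for (it in 1:maxit) {
    e <- as.vector(y - X %*% b)
    u <- abs(e) / mad(e)              # residuals scaled by a robust spread
    w <- ifelse(u <= k, 1, k / u)     # Huber weights: 1 inside, decaying outside
    b_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
    if (max(abs(b_new - b)) < tol) { b <- b_new; break }
    b <- b_new
  }
  list(coef = as.vector(b), weights = w)
}

# Same simulated data as above
set.seed(1110)
x <- 1:100
y <- 1 + 2 * x + rnorm(100, 0, 8)
y[99] <- 300; y[100] <- 310
fit <- irls_huber(x, y)
fit$coef          # close to rlm's estimates (1.59, 2.03)
tail(fit$weights) # the two outliers get by far the smallest weights
```

Each pass refits WLS with weights computed from the previous pass's residuals, which is exactly the circularity between $w_i$ and $e_i$ described earlier.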
The second option is the bisquare method:
rbisquare = rlm(y~1+x,data = mydata,psi = psi.bisquare)
summary(rbisquare)
# Call: rlm(formula = y ~ 1 + x, data = mydata, psi = psi.bisquare)
# Residuals:
# Min 1Q Median 3Q Max
# -18.1435 -5.2241 -0.3856 6.1269 106.2587
#
# Coefficients:
# Value Std. Error t value
# (Intercept) 2.1666 1.7440 1.2424
# x 2.0157 0.0300 67.2331
#
# Residual standard error: 8.636 on 98 degrees of freedom
# Inspect the weights and residuals
bweight = data.frame(x = mydata$x,resid = rbisquare$resid,
weight = rbisquare$w)
bweight2 = bweight[order(rbisquare$w),]
bweight2
# x resid weight
# 99 99 98.27447921 0.0000000
# 100 100 106.25873293 0.0000000
# 27 27 20.83867476 0.5398275
# 67 67 20.34044153 0.5584386
# 94 94 -18.14349030 0.6381676
# 54 54 -17.22565979 0.6702838
# 86 86 -17.01703753 0.6774221
# 51 51 -16.58468819 0.6921409
# 21 21 -15.37624301 0.7319834
# 24 24 15.20237417 0.7375744
# 14 14 -15.17859764 0.7383154
# 6 6 15.05525725 0.7422384 (output truncated)
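For reference, the two weight functions can be written out directly. The tuning constants below (k = 1.345 for Huber, c = 4.685 for bisquare) are rlm's documented defaults; evaluating both at a few standardized residuals shows why bisquare zeroes out the two outliers while Huber merely downweights them:

```r
# Huber: weight 1 for small residuals, then decays as k/|u|
huber_w <- function(u, k = 1.345) ifelse(abs(u) <= k, 1, k / abs(u))
# Bisquare: weight below 1 for every nonzero residual, exactly 0 once |u| >= c
bisq_w <- function(u, c = 4.685) ifelse(abs(u) < c, (1 - (u / c)^2)^2, 0)

u <- c(0, 1, 2, 5, 10)   # standardized residuals
round(huber_w(u), 3)     # stays at 1, then decays but never reaches 0
round(bisq_w(u), 3)      # already below 1 at u = 1, exactly 0 beyond c
```

This matches the fitted results above: under bisquare, observations 99 and 100 receive weight exactly 0 (they are effectively dropped), whereas under Huber they keep small positive weights around 0.11-0.12.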