R worksheet: influential datapoints
Influential datapoints are datapoints whose inclusion has an outsized effect on the estimated coefficients. Functions like dfbeta() and which.influence() let you check whether your data contain such points. What should you do if you find some? Report the coefficients for your whole dataset, state that there were influential datapoints, and show how the coefficients would change without each of them.
# influential datapoints in regression
# linear regression
# languageR provides the regularity data used for the logistic regression below
library(languageR)
# Hinton "smiling" data again
# which was created explicitly to demonstrate outsize influence of datapoints
hinton.smile = data.frame(smile.time = c(0.4, 0.8, 0.8, 1.2, 1.4, 1.8, 2.2, 2.6, 3.0),
                          sold = c(16, 12, 20, 16, 34, 30, 26, 22, 38))
lm.obj = lm(smile.time ~ sold, data = hinton.smile)
summary(lm.obj)
# coefficients:
# Intercept -0.07252
# sold 0.06941
# We are looking for datapoints
# whose removal would change a coefficient
# by more than t times the size of that coefficient,
# where some recommend t = 0.5, others t = 0.2.
# dfbeta() gives, for each datapoint, the adjustment to each coefficient
# that results from leaving that datapoint out.
# We transform its output to a data frame
# so we can more conveniently check whether
# any datapoint leads to an outsized adjustment.
# The filter below returns the rows (if any) that exceed the threshold for sold.
dfbeta.df = data.frame(dfbeta(lm.obj))
# large adjustment: |adjustment| > 0.2 * |coefficient|
# (an empty result means no datapoint exceeds the threshold)
dfbeta.df[abs(dfbeta.df$sold) > 0.2 * abs(coef(lm.obj)["sold"]), ]
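# Sketch (an addition to the worksheet, not part of the original analysis):
# what dfbeta() measures, checked by brute force. We refit the model once per
# datapoint, leaving that datapoint out, and compare with the full-data fit.
# The names hinton.check, full.fit, loo.coef and loo.change are made up for
# this illustration; the data are repeated so that the snippet runs on its own.
hinton.check = data.frame(smile.time = c(0.4, 0.8, 0.8, 1.2, 1.4, 1.8, 2.2, 2.6, 3.0),
                          sold = c(16, 12, 20, 16, 34, 30, 26, 22, 38))
full.fit = lm(smile.time ~ sold, data = hinton.check)
# row i: coefficients estimated without datapoint i
loo.coef = t(sapply(seq_len(nrow(hinton.check)), function(i) {
  coef(lm(smile.time ~ sold, data = hinton.check[-i, ]))
}))
# row i: how far the coefficients move when datapoint i is dropped
loo.change = sweep(loo.coef, 2, coef(full.fit))
# apart from the sign convention, this should match dfbeta()
all.equal(abs(unname(loo.change)), abs(unname(dfbeta(full.fit))))
# a table one could report: the coefficients without each datapoint,
# plus the change in the slope relative to the full-data slope
data.frame(loo.coef, rel.change.sold = loo.change[, "sold"] / coef(full.fit)["sold"])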
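# For models fitted with ols() from the rms package, we would use the functions
# which.influence and show.influence.
# Note that their cutoff applies to dfbetas, the change in a coefficient
# scaled by its standard error, not to the raw dfbeta adjustment used above.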
library(rms)
# We have to specify x = TRUE and y = TRUE (store the design matrix
# and the response in the fit) to be able to run which.influence
ols.obj = ols(smile.time ~ sold, data = hinton.smile, x = TRUE, y = TRUE)
which.influence(ols.obj, cutoff = 0.2)
# inspecting the values for influential datapoints
show.influence(which.influence(ols.obj, cutoff = 0.2), hinton.smile)
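# Sketch (an addition to the worksheet): a similar check with base R only.
# dfbetas() (note the final s) is the standardized counterpart of dfbeta():
# the change in each coefficient divided by an estimate of its standard error.
# The helper name flag.dfbetas is made up for this illustration, and whether
# it flags exactly the same rows as which.influence() is an assumption that
# should be checked rather than taken for granted.
flag.dfbetas = function(fit, cutoff = 0.2) {
  std.change = dfbetas(fit)
  # keep the rows where at least one coefficient moves by more than the cutoff
  flagged = apply(abs(std.change) > cutoff, 1, any)
  std.change[flagged, , drop = FALSE]
}
# usage with the lm fit from the linear regression section:
# flag.dfbetas(lm.obj, cutoff = 0.2)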
# logistic regression: we can again use
# which.influence and show.influence
lrm.obj = lrm(Regularity ~ WrittenFrequency + Auxiliary, data = regularity, x = TRUE, y = TRUE)
which.influence(lrm.obj)
# again, inspecting the values for the influential datapoints
show.influence(which.influence(lrm.obj), regularity)