R worksheet: influential datapoints

Influential datapoints are datapoints whose inclusion has an outsize influence on the estimated coefficients. Functions like dfbeta() and which.influence() let you test whether you have such datapoints. What should you do if you find some? Report the coefficients for your whole dataset, state that there were influential datapoints, and show how the coefficients would change without each of them.

# influential datapoints in regression

# linear regression

# languageR provides the regularity dataset used in the logistic regression below
library(languageR)

# Hinton "smiling" data again

# which was created explicitly to demonstrate outsize influence of datapoints

hinton.smile = data.frame(smile.time = c(0.4, 0.8, 0.8, 1.2, 1.4, 1.8, 2.2, 2.6, 3.0),

                          sold = c(16, 12, 20, 16, 34, 30, 26, 22, 38))

lm.obj = lm(smile.time ~ sold, data = hinton.smile)

summary(lm.obj)

# coefficients:

# Intercept -0.07252

# sold 0.06941

# We are looking for datapoints whose elimination would change
# a coefficient by more than t times the size of that coefficient.
# Some recommend t = 0.5, others t = 0.2.

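# dfbeta() gives, for each datapoint, the adjustment to each coefficient
# that results from leaving that datapoint out.
# We transform its output to a data frame so we can more conveniently
# check whether any datapoint leads to an outsized adjustment.
# The filter further down returns the rows that exceed the cutoff;
# an empty result means there is none.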

dfbeta.df = data.frame(dfbeta(lm.obj))
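
# A minimal sketch (not part of the original worksheet) of what dfbeta()
# measures: refit the model without each datapoint in turn and record how far
# the coefficients move. The names loo.change and refit are made up for this
# illustration, and the data are repeated so the snippet runs on its own.
# The comparison uses absolute values so that it does not depend on the sign
# convention of the reported change.
hinton.smile = data.frame(smile.time = c(0.4, 0.8, 0.8, 1.2, 1.4, 1.8, 2.2, 2.6, 3.0),
                          sold = c(16, 12, 20, 16, 34, 30, 26, 22, 38))
lm.obj = lm(smile.time ~ sold, data = hinton.smile)
loo.change = t(sapply(seq_len(nrow(hinton.smile)), function(i) {
  # model fitted to all rows except row i
  refit = lm(smile.time ~ sold, data = hinton.smile[-i, ])
  coef(lm.obj) - coef(refit)
}))
# TRUE if the manual leave-one-out refits agree with dfbeta()
all.equal(abs(loo.change), abs(dfbeta(lm.obj)), check.attributes = FALSE)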

# large adjustment: |adjustment| > 0.2 * |coefficient|

dfbeta.df[abs(dfbeta.df$sold) > 0.2 * abs(coef(lm.obj)["sold"]), ]
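
# A short sketch, assuming the same model as above, of how the choice of t
# matters: count the datapoints whose slope adjustment exceeds each of the two
# recommended cutoffs. slope.change and n.flagged are illustrative names, and
# the data are repeated so the snippet runs on its own.
hinton.smile = data.frame(smile.time = c(0.4, 0.8, 0.8, 1.2, 1.4, 1.8, 2.2, 2.6, 3.0),
                          sold = c(16, 12, 20, 16, 34, 30, 26, 22, 38))
lm.obj = lm(smile.time ~ sold, data = hinton.smile)
# size of the adjustment to the slope for each datapoint
slope.change = abs(dfbeta(lm.obj)[, "sold"])
# number of datapoints above t * |slope| for t = 0.2 and t = 0.5
n.flagged = sapply(c(t0.2 = 0.2, t0.5 = 0.5), function(t) {
  sum(slope.change > t * abs(coef(lm.obj)[["sold"]]))
})
n.flagged
# The larger t can never flag more datapoints than the smaller one,
# so the choice of cutoff should be stated alongside the result.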

# for models fitted with ols() from the rms package,
# we use the functions which.influence() and show.influence() instead

library(rms)

# which.influence() needs the design matrix and the response stored in the fit,
# so we have to specify x = TRUE and y = TRUE

ols.obj = ols(smile.time ~ sold, data = hinton.smile, x = TRUE, y = TRUE)

which.influence(ols.obj, cutoff = 0.2)

# inspecting the values for influential datapoints

show.influence(which.influence(ols.obj, cutoff = 0.2), hinton.smile)
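
# One possible way to produce the report recommended at the top of this
# worksheet: the coefficients for the whole dataset, followed by the
# coefficients obtained without each flagged datapoint. This is a sketch that
# flags datapoints with dfbeta() and the t = 0.2 cutoff so that it needs only
# base R; flagged and report are illustrative names, and the data are repeated
# so the snippet runs on its own.
hinton.smile = data.frame(smile.time = c(0.4, 0.8, 0.8, 1.2, 1.4, 1.8, 2.2, 2.6, 3.0),
                          sold = c(16, 12, 20, 16, 34, 30, 26, 22, 38))
lm.obj = lm(smile.time ~ sold, data = hinton.smile)
# row numbers of datapoints whose slope adjustment exceeds the cutoff
flagged = which(abs(dfbeta(lm.obj)[, "sold"]) > 0.2 * abs(coef(lm.obj)[["sold"]]))
names(flagged) = sprintf("without.%d", flagged)
# coefficients of the model refitted without each flagged datapoint
# (do.call(rbind, ...) yields NULL when nothing is flagged)
without = do.call(rbind, lapply(flagged, function(i) {
  coef(lm(smile.time ~ sold, data = hinton.smile[-i, ]))
}))
# first row: whole dataset; remaining rows: one per flagged datapoint
report = rbind(all.data = coef(lm.obj), without)
report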

# logistic regression: we can again use

# which.influence and show.influence

lrm.obj = lrm(Regularity ~ WrittenFrequency + Auxiliary, data = regularity, x = TRUE, y = TRUE)

which.influence(lrm.obj)

# again, inspecting the values for influential datapoints

show.influence(which.influence(lrm.obj), regularity)