R code snippets

Useful R code snippets

Transforming the output of xtabs to a dataframe

First method, which changes formatting:

> realization.xt = xtabs(~ verbs$Verb + verbs$RealizationOfRec) > real.xt.df = data.frame(realization.xt) > head(real.xt.df) verbs.Verb verbs.RealizationOfRec Freq 1 accord NP 1 2 allocate NP 0 3 allow NP 6 4 assess NP 1 5 assure NP 2 6 award NP 7

Second method, keeps formatting:

> real.xt.df.2 = data.frame(realization.xt[,]) > head(real.xt.df2) NP PP accord 1 0 allocate 0 3 allow 6 0 assess 1 0 assure 2 0 award 7 11

Computing the mode of a distribution

First method, only gives you one mode, even if the distribution has multiple items that occur with maximum frequency:

> x = c(3,6,8,5,2,1,1,9,34,23,4567,1) > freq.df = data.frame(xtabs(~x)) > freq.df[which.max(freq.df$Freq),]$x [1] 1

Second method, which gives you all modes:

> x = c(1,1,2,2) > freq.df = data.frame(xtabs(~x)) > freq.df[which(freq.df$Freq == max(freq.df$Freq)),]$x [1] 1 2 Levels: 1 2

Controlling the number of digits printed

Use options() to set general options, among them the number of digits printed:

> pi [1] 3.141593 > options(digits=3) > pi [1] 3.14

Switching the levels of the dependent variable in logistic regression

In logistic regression, if the values of the dependent variable are a factor, they are re-coded into 0 and 1.By default, the alphabetically prior level is coded as 0, and the other as 1:

> library(languageR)

> library(rms)

> lrm(RealizationOfRecipient ~ LengthOfTheme,data=dative)

Logistic Regression Model

Model Likelihood Discrimination Rank Discrim.

Ratio Test Indexes Indexes

Obs 3263 LR chi2 157.66 R2 0.069 C 0.655

0 849 d.f. 1 g 0.655 Dxy 0.311

1 2414 Pr(> chi2) <0.0001 gr 1.925 gamma 0.367

max |deriv| 3e-08 gp 0.095 tau-a 0.120

Brier 0.184

Coef S.E. Wald Z Pr(>|Z|)

Intercept -0.4430 0.0644 -6.88 <0.0001

LengthOfTheme -0.1668 0.0160 -10.39 <0.0001

Here is how to switch them:

> switch.dependent.levels <- function(vec) { abs(as.numeric(vec) - 2)}

> lrm(switch.dependent.levels(dative$RealizationOfRecipient) ~ LengthOfTheme, data = dative)

Logistic Regression Model

lrm(formula = switch.dependent.levels(dative$RealizationOfRecipient) ~

LengthOfTheme, data = dative)

Model Likelihood Discrimination Rank Discrim.

Ratio Test Indexes Indexes

Obs 3263 LR chi2 157.66 R2 0.069 C 0.655

0 849 d.f. 1 g 0.655 Dxy 0.311

1 2414 Pr(> chi2) <0.0001 gr 1.925 gamma 0.367

max |deriv| 3e-08 gp 0.095 tau-a 0.120

Brier 0.184

Coef S.E. Wald Z Pr(>|Z|)

Intercept 0.4430 0.0644 6.88 <0.0001

LengthOfTheme 0.1668 0.0160 10.39 <0.0001

Or, more succinctly:

> lrm(abs(as.numeric(RealizationOfRecipient) - 2) ~ LengthOfTheme,data=dative)

Plotting partial effects of predictors when using ols()

Baayen's code for plotting partial effects (p.175), which worked for the Design package, does not work for the rms package. Here is how you need to adapt your code. We use as an example the same regression model with a nonlinearity that Baayen uses: predicting reaction time in the "english" data frame.

First, in this case you do need a datadist object.

english.dd = datadist(english)

options(datadist = "english.dd"

The regression call is the same:

english.olsB = ols(RTlexdec ~ pol(WrittenFrequency, 2) + AgeSubject + LengthInLetters, data = english)

Now instead of plotting english.olsB directly, you just plot the Predict object.

plot(Predict(english.olsB))

Converting categorical variables to a binary vector for use in clustering

To convert a many-valued categorical nominal (not ordered) variable for a dataset to a series of binary vectors, for e.g. clustering, you can do use the dummies package as follows:

install.packages("dummies")

library(dummies)

dummy(dative$SemanticClass)

Say you want to do get distances between datapoints with multiple many-valued features(e.g. in order to allow you to do clustering), you will need to convert them all and combine them. For example to convert and combine two categorical variables in the first 100 datapoints from the dative dataset, you can do the following:

recodeddativecats = cbind(dummy(dative[1:100,]$SemanticClass),dummy(dative[1:100,]$AccessOfTheme))

This represents each data-point as a series of binary values. You can then find the distances between the datapoints as follows:

x = dist(recodeddativecats,method="binary")

Comparing two nested lrm models M1 and M2 using a likelihood ratio test

LRtest <- function(model1,model2){ print(model1$dev[2] - model2$dev[2]) print(1 - pchisq((model1$dev[2] - model2$dev[2]),(model2$stats['d.f.'] - model1$stats['d.f.']))) } LRtest(M1,M2)

Calculating AIC for an lrm model M1

aic <- function(model) { round(unname(model$dev[2] + 2*(model$stats['d.f.']+1)),2) } aic(M1)

Model comparison for mlogit models M1, M2

For nested mlogit models you can use the command lrtest to perform a likelihood ratio test. So for the example models from class you might do the following:

> lrtest(M1,M2) Likelihood ratio test Model 1: EtymAge ~ 1 Model 2: EtymAge ~ 1 | NcountStem #Df LogLik Df Chisq Pr(>Chisq) 1 4 -332.39 2 8 -328.63 4 7.5124 0.1112

For non-nested models, the following function can be used to get AIC values. Just paste this into the console and you can then do get to AIC value for a model M1 by calling mlogitAIC(M1).

mlogitAIC <- function(model){ (-model$logLik[1]*2) + 2*length(model$coeff) }