R code snippets

Useful R code snippets

Transforming the output of xtabs to a dataframe

First method, which changes formatting:

> realization.xt = xtabs(~ verbs$Verb + verbs$RealizationOfRec) > real.xt.df = data.frame(realization.xt) > head(real.xt.df)   verbs.Verb verbs.RealizationOfRec Freq 1     accord                     NP    1 2   allocate                     NP    0 3      allow                     NP    6 4     assess                     NP    1 5     assure                     NP    2 6      award                     NP    7

Second method, keeps formatting:

> real.xt.df.2 = data.frame(realization.xt[,]) > head(real.xt.df2)          NP PP accord    1  0 allocate  0  3 allow     6  0 assess    1  0 assure    2  0 award     7 11

Computing the mode of a distribution

First method, only gives you one mode, even if the distribution has multiple items that occur with maximum frequency:

> x = c(3,6,8,5,2,1,1,9,34,23,4567,1) > freq.df = data.frame(xtabs(~x)) > freq.df[which.max(freq.df$Freq),]$x [1] 1

Second method, which gives you all modes:

> x = c(1,1,2,2) > freq.df = data.frame(xtabs(~x)) > freq.df[which(freq.df$Freq == max(freq.df$Freq)),]$x [1] 1 2 Levels: 1 2

Controlling the number of digits printed

Use options() to set general options, among them the number of digits printed:

> pi [1] 3.141593 > options(digits=3) > pi [1] 3.14

Switching the levels of the dependent variable in logistic regression

 In logistic regression, if the values of the dependent variable are a factor, they are re-coded into 0 and 1.By default, the alphabetically prior level is coded as 0, and the other as 1:

> library(languageR)

> library(rms)

> lrm(RealizationOfRecipient ~ LengthOfTheme,data=dative)

Logistic Regression Model

                      Model Likelihood     Discrimination    Rank Discrim.   

                         Ratio Test            Indexes          Indexes      

Obs          3263    LR chi2     157.66    R2       0.069    C       0.655   

 0            849    d.f.             1    g        0.655    Dxy     0.311   

 1           2414    Pr(> chi2) <0.0001    gr       1.925    gamma   0.367   

max |deriv| 3e-08                          gp       0.095    tau-a   0.120   

                                           Brier    0.184                    

              Coef    S.E.   Wald Z Pr(>|Z|)

Intercept     -0.4430 0.0644  -6.88 <0.0001

LengthOfTheme -0.1668 0.0160 -10.39 <0.0001

Here is how to switch them:

> switch.dependent.levels <- function(vec) { abs(as.numeric(vec) - 2)}

> lrm(switch.dependent.levels(dative$RealizationOfRecipient) ~ LengthOfTheme, data = dative)

Logistic Regression Model

lrm(formula = switch.dependent.levels(dative$RealizationOfRecipient) ~

    LengthOfTheme, data = dative)

                      Model Likelihood     Discrimination    Rank Discrim.   

                         Ratio Test            Indexes          Indexes      

Obs          3263    LR chi2     157.66    R2       0.069    C       0.655   

 0            849    d.f.             1    g        0.655    Dxy     0.311   

 1           2414    Pr(> chi2) <0.0001    gr       1.925    gamma   0.367   

max |deriv| 3e-08                          gp       0.095    tau-a   0.120   

                                           Brier    0.184                    

              Coef   S.E.   Wald Z Pr(>|Z|)

Intercept     0.4430 0.0644  6.88  <0.0001

LengthOfTheme 0.1668 0.0160 10.39  <0.0001

Or, more succinctly:

> lrm(abs(as.numeric(RealizationOfRecipient) - 2) ~ LengthOfTheme,data=dative)

Plotting partial effects of predictors when using ols()

Baayen's code for plotting partial effects (p.175), which worked for the Design package, does not work for the rms package. Here is how you need to adapt your code. We use as an example the same regression model with a nonlinearity that Baayen uses: predicting reaction time in the "english" data frame.

First, in this case you do need a datadist object.

english.dd = datadist(english)

options(datadist = "english.dd"

The regression call is the same:

english.olsB = ols(RTlexdec ~ pol(WrittenFrequency, 2) + AgeSubject + LengthInLetters, data = english)

Now instead of plotting english.olsB directly, you just plot the Predict object.

plot(Predict(english.olsB))

Converting categorical variables to a binary vector for use in clustering

To convert a many-valued categorical nominal (not ordered) variable for a dataset to a series of binary vectors, for e.g. clustering, you can do use the dummies package as follows:

install.packages("dummies")

library(dummies)

dummy(dative$SemanticClass)

Say you want to do get distances between datapoints with multiple many-valued features(e.g. in order to allow you to do clustering), you will need to convert them all and combine them. For example to convert and combine two categorical variables in the first 100 datapoints from the dative dataset, you can do the following: 

recodeddativecats = cbind(dummy(dative[1:100,]$SemanticClass),dummy(dative[1:100,]$AccessOfTheme))

This represents each data-point as a series of binary values. You can then find the distances between the datapoints as follows:

x = dist(recodeddativecats,method="binary")

Comparing two nested lrm models M1 and M2 using a likelihood ratio test

LRtest <- function(model1,model2){ print(model1$dev[2] - model2$dev[2]) print(1 - pchisq((model1$dev[2] - model2$dev[2]),(model2$stats['d.f.'] - model1$stats['d.f.']))) }  LRtest(M1,M2)

Calculating AIC for an lrm model M1

aic <- function(model) { round(unname(model$dev[2] + 2*(model$stats['d.f.']+1)),2)  }  aic(M1)

Model comparison for mlogit models M1, M2

For nested mlogit models you can use the command lrtest to perform a likelihood ratio test. So for the example models from class you might do the following:

> lrtest(M1,M2) Likelihood ratio test  Model 1: EtymAge ~ 1 Model 2: EtymAge ~ 1 | NcountStem #Df LogLik Df Chisq Pr(>Chisq) 1 4 -332.39 2 8 -328.63 4 7.5124 0.1112

For non-nested models, the following function can be used to get AIC values. Just paste this into the console and you can then do get to AIC value for a model M1 by calling mlogitAIC(M1).

mlogitAIC <- function(model){ (-model$logLik[1]*2) + 2*length(model$coeff) }