R code snippets
Useful R code snippets
Transforming the output of xtabs to a dataframe
First method, which changes formatting:
> realization.xt = xtabs(~ verbs$Verb + verbs$RealizationOfRec) > real.xt.df = data.frame(realization.xt) > head(real.xt.df) verbs.Verb verbs.RealizationOfRec Freq 1 accord NP 1 2 allocate NP 0 3 allow NP 6 4 assess NP 1 5 assure NP 2 6 award NP 7
Second method, keeps formatting:
> real.xt.df.2 = data.frame(realization.xt[,]) > head(real.xt.df2) NP PP accord 1 0 allocate 0 3 allow 6 0 assess 1 0 assure 2 0 award 7 11
Computing the mode of a distribution
First method, only gives you one mode, even if the distribution has multiple items that occur with maximum frequency:
> x = c(3,6,8,5,2,1,1,9,34,23,4567,1) > freq.df = data.frame(xtabs(~x)) > freq.df[which.max(freq.df$Freq),]$x [1] 1
Second method, which gives you all modes:
> x = c(1,1,2,2) > freq.df = data.frame(xtabs(~x)) > freq.df[which(freq.df$Freq == max(freq.df$Freq)),]$x [1] 1 2 Levels: 1 2
Controlling the number of digits printed
Use options() to set general options, among them the number of digits printed:
> pi [1] 3.141593 > options(digits=3) > pi [1] 3.14
Switching the levels of the dependent variable in logistic regression
In logistic regression, if the values of the dependent variable are a factor, they are re-coded into 0 and 1.By default, the alphabetically prior level is coded as 0, and the other as 1:
> library(languageR)
> library(rms)
> lrm(RealizationOfRecipient ~ LengthOfTheme,data=dative)
Logistic Regression Model
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 3263 LR chi2 157.66 R2 0.069 C 0.655
0 849 d.f. 1 g 0.655 Dxy 0.311
1 2414 Pr(> chi2) <0.0001 gr 1.925 gamma 0.367
max |deriv| 3e-08 gp 0.095 tau-a 0.120
Brier 0.184
Coef S.E. Wald Z Pr(>|Z|)
Intercept -0.4430 0.0644 -6.88 <0.0001
LengthOfTheme -0.1668 0.0160 -10.39 <0.0001
Here is how to switch them:
> switch.dependent.levels <- function(vec) { abs(as.numeric(vec) - 2)}
> lrm(switch.dependent.levels(dative$RealizationOfRecipient) ~ LengthOfTheme, data = dative)
Logistic Regression Model
lrm(formula = switch.dependent.levels(dative$RealizationOfRecipient) ~
LengthOfTheme, data = dative)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 3263 LR chi2 157.66 R2 0.069 C 0.655
0 849 d.f. 1 g 0.655 Dxy 0.311
1 2414 Pr(> chi2) <0.0001 gr 1.925 gamma 0.367
max |deriv| 3e-08 gp 0.095 tau-a 0.120
Brier 0.184
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.4430 0.0644 6.88 <0.0001
LengthOfTheme 0.1668 0.0160 10.39 <0.0001
Or, more succinctly:
> lrm(abs(as.numeric(RealizationOfRecipient) - 2) ~ LengthOfTheme,data=dative)
Plotting partial effects of predictors when using ols()
Baayen's code for plotting partial effects (p.175), which worked for the Design package, does not work for the rms package. Here is how you need to adapt your code. We use as an example the same regression model with a nonlinearity that Baayen uses: predicting reaction time in the "english" data frame.
First, in this case you do need a datadist object.
english.dd = datadist(english)
options(datadist = "english.dd"
The regression call is the same:
english.olsB = ols(RTlexdec ~ pol(WrittenFrequency, 2) + AgeSubject + LengthInLetters, data = english)
Now instead of plotting english.olsB directly, you just plot the Predict object.
plot(Predict(english.olsB))
Converting categorical variables to a binary vector for use in clustering
To convert a many-valued categorical nominal (not ordered) variable for a dataset to a series of binary vectors, for e.g. clustering, you can do use the dummies package as follows:
install.packages("dummies")
library(dummies)
dummy(dative$SemanticClass)
Say you want to do get distances between datapoints with multiple many-valued features(e.g. in order to allow you to do clustering), you will need to convert them all and combine them. For example to convert and combine two categorical variables in the first 100 datapoints from the dative dataset, you can do the following:
recodeddativecats = cbind(dummy(dative[1:100,]$SemanticClass),dummy(dative[1:100,]$AccessOfTheme))
This represents each data-point as a series of binary values. You can then find the distances between the datapoints as follows:
x = dist(recodeddativecats,method="binary")
Comparing two nested lrm models M1 and M2 using a likelihood ratio test
LRtest <- function(model1,model2){ print(model1$dev[2] - model2$dev[2]) print(1 - pchisq((model1$dev[2] - model2$dev[2]),(model2$stats['d.f.'] - model1$stats['d.f.']))) } LRtest(M1,M2)
Calculating AIC for an lrm model M1
aic <- function(model) { round(unname(model$dev[2] + 2*(model$stats['d.f.']+1)),2) } aic(M1)
Model comparison for mlogit models M1, M2
For nested mlogit models you can use the command lrtest to perform a likelihood ratio test. So for the example models from class you might do the following:
> lrtest(M1,M2) Likelihood ratio test Model 1: EtymAge ~ 1 Model 2: EtymAge ~ 1 | NcountStem #Df LogLik Df Chisq Pr(>Chisq) 1 4 -332.39 2 8 -328.63 4 7.5124 0.1112
For non-nested models, the following function can be used to get AIC values. Just paste this into the console and you can then do get to AIC value for a model M1 by calling mlogitAIC(M1).
mlogitAIC <- function(model){ (-model$logLik[1]*2) + 2*length(model$coeff) }