### R code: clustering

 `# K-means clustering``# first example: made-up data``# Draw 100 numbers, normally distributed, with a mean of 0,``# and organize them in 2 columns so we have 50 points with 2 coordinates each.``cl1 = matrix(rnorm(100, sd = 0.3), ncol = 2)``# Draw 100 numbers with a mean of 1.5,``# again organized in 2 columns for 50 datapoints with 2 coordinates each``cl2 = matrix(rnorm(100, mean=1.5, sd = 0.3), ncol = 2)``# plotting both``plot(cl1, xlim = range(cl1, cl2), ylim = range(cl1, cl2))``points(cl2)``# combine the two sets of points, and let's see ``# if the clustering method can take them apart again``data = rbind(cl1, cl2)``# doing the clustering``clustering = kmeans(data, 2)``# inspecting the result``clustering``# this looks something like this.``# "Cluster means" are the cluster centers,``# one center per line.``# "Clustering vector" assigns each datapoint in "data"``# a cluster, 1 or 2.``# "Within cluster sum of squares by cluster" is the sum of squared distances``# of the data points from their cluster mean, by cluster.``##``# K-means clustering with 2 clusters of sizes 50, 50``# ``# Cluster means:``##         [,1]       [,2]``## 1 0.01778026 0.01465019``## 2 1.47658108 1.46337011``## Clustering vector:``##    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1``##   1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2``##   2 2 2 2 2 2 2 2 2 2 2 2``## Within cluster sum of squares by cluster:``##  9.814072 9.629475``##  (between_SS / total_SS =  84.5 %)``## Available components:``##  "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   ``##  "size"         "iter"         "ifault"      ``# how good is the clustering``# withinss: sum of squared distances between each datapoint and its``# cluster mean``# betweenss: sum of squared distances between each cluster mean and``# the data mean``clustering\$withinss``clustering\$betweenss``# visualizing the results of clustering``plot(data, col = clustering\$cluster, pch=clustering\$cluster)``# adding the cluster centers``points(clustering\$centers, col = 1:2, pch = 8, cex = 2)``# for a more challenging dataset, set the second mean to 1.0, ``# or try 3-4 clusters``# In that case do nstart = 25``#######``# K-means clustering, real dataset``# NYT opinion articles``nyt = read.table("nyt_opinion.txt", header=T)``content.words = nyt[,13:29]``# do the clustering of articles by content words.``# 4 clusters, 25 attempts with different random starting points.``nyt.cl = kmeans(content.words, 4, nstart = 25)``# inspecting the clusters``nyt\$cl = nyt.cl\$cluster``# How do the authors distribute over the clusters?``xtabs(~nyt\$cl + nyt\$Author)``# how do content word counts look like by cluster?``aggregate(content.words, list(nyt\$cl), sum)``#############``# Hierarchical clustering``# clustering languages by similarity of morphosyntactic characteristics``library(languageR)``# make an object with pairwise distances of languages``dist.languages = dist(phylogeny[3:ncol(phylogeny)], method="binary")``# do the clustering, plot``# Suggestion: try out methods "ward.D", "ward.D2", "single", "complete", "average" and see how``# the clustering changes.``clustering = hclust(dist.languages, method="ward.D")``plot(clustering)``# make it so we can see which languages got clustered``plotnames = as.character(phylogeny\$Language)``plot(clustering, labels=plotnames, cex=0.8)``###############``# Dimensionality reduction:``# Code from the Baayen book``library(languageR)``# affixProductivity dataset: stems can be combined with different``# affixes, for example warm+th, clever+ness``#``# Question: affix productivity related to stylistic factors?``# affixProductivity dataset lists, for 44 texts with varying``# authors and genres, a productivity index for 27 affixes. Genres:``# religious, children's books, literary texts, and others. ``# Here: use PCA to explore what dimensions give relevant information``# about the data. All columns except the last three have affixes. Last three``# columns encode info about author, genre, and birth year of author. ``# doing the principal component analysis (PCA)``# run with scale=T, as PCA is sensitive to``# different values being on different scales``affixes.pr = prcomp(affixProductivity[,1:(ncol(affixProductivity)-3)], scale=T)``# what do we have in the resulting object?``names(affixes.pr)``# Result: ``#  "sdev"     "rotation" "center"   "scale"    "x"``# summary(affixes.pr) prints the standard deviation for each``# principal component, and the amount of variance explained by``# it. Proportion of variance: squared sdev divided by sum of squared``# sdevs. standard deviation is also accessible via``# affixes.pr\$sdev ``summary(affixes.pr)``# this is the cumulative variance explained for principal components 1-k.``# This is in the summary, but you can also get it as a vector like this:``affixes.pr\$sdev^2 / sum(affixes.pr\$sdev^2)``# Rules of thumb:``# * keep only principal components that account for at``#    least 5% of the variance.``# * visualize proportion of variance explained by each principal``#   component. Only keep components up to first major drop in the graph``# visualization:``plot(affixes.pr)``# The original dataset affixProductivity has one book per line.``# Each of these books has coordinates on the new principal components.``# Here are the coordinates on the first three principal components``# for some books``affixes.pr\$x[c("Mormon", "Austen", "Carroll", "Gao"), 1:3]``# How can we interpret principal components?``# One way is to look at the ``loadings'' of the affixes on the``# principal components. Flip the matrix around, view these loadings as``# coordinates of affixes with new dimensions. ``# Interpret a principal component by checking which affixes``# have a large positive or negative value with it. ``# Loadings are available in affixes.pr\$rotation``affixes.pr\$rotation[1:10, 1:3]``####``# PCA as soft clustering``# texts that have high values on the same``# principal dimensions, or affixes that have high loadings on the same``# principal dimensions, can be viewed as more or less clustered. Each``# principal component can then be viewed as a soft cluster. ``####``# Visualization``# Visualize clustering of texts and clustering of affixes in the same``# plot: biplot.``biplot(affixes.pr, scale = 0, var.axes = F, ``col = c("darkgrey", "black"), cex = c(0.9, 1.2))`