This tutorial introduces Semantic Vector Space Models (VSMs) in R. The entire R Markdown document for this tutorial can be downloaded here.
VSMs are used to find groups or patterns in data and to predict group membership. As such, they are widely applied in machine learning. In linguistics, VSMs are frequently used in distributional semantics to identify and analyze synonymy, and in grammar-based analyses to determine the group membership of specific words or word classes.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below execute without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code; it may take some time (between 1 and 5 minutes to install all of the libraries), so do not worry if it takes a while.
# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install libraries
install.packages(c("cluster", "factoextra", "cluster",
"seriation", "pvclust", "ape", "vcd",
"exact2x2", "factoextra", "seriation",
"NbClust", "pvclust"))
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.
Vector Space Models are particularly useful when dealing with language data as they provide very accurate estimates of semantic similarity based on word embeddings (or co-occurrence profiles). Word embeddings are vectors that store information about how frequently a given word co-occurs with other words. If the ordering of the co-occurring words remains constant, these vectors can be used to determine which words have similar profiles.
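To illustrate the basic idea, here is a minimal sketch with made-up counts: three target words and their co-occurrence frequencies with three context words. Words with similar co-occurrence profiles, such as very and really below, end up with similar vectors.
# toy example with made-up counts: rows are target words,
# columns are co-occurring context words
toy <- matrix(c(10, 2, 8,
                9, 1, 7,
                0, 12, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("very", "really", "completely"),
                              c("good", "different", "nice")))
# very and really have near-identical profiles; completely does not
toy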
To show how vector space models work, we will follow the procedure described in Levshina (2015). However, to calculate cosine similarities, we will not use her Rling package, which is not supported by R version 4.0.2, but rather the coop package (see Schmidt and Heckendorf 2019). In this tutorial, we investigate similarities among amplifiers based on their co-occurrences (word embeddings) with adjectives. Adjective amplifiers are elements such as those in 1. to 5.
The similarity among adjective amplifiers can then be used to find clusters or groups of amplifiers that “behave” similarly and are interchangeable. To elaborate, adjective amplifiers are interchangeable with some variants but not with others (consider 6. to 8.; the question mark signifies that the example is unlikely to be used or not grammatically acceptable to L1 speakers of English).
We start by loading the required packages and the data, and then display the data, which is called “vsmdata”, consists of 5,000 observations of adjectives, and contains two columns: one with the adjectives (Adjective) and one with the amplifiers (Amplifier; “0” means that the adjective occurred without an amplifier).
# load packages
library(coop)
library(dplyr)
library(tm)
library(cluster)
library(DT)
# load data
vsmdata <- read.delim("https://slcladal.github.io/data/vsmdata.txt", sep = "\t", header = T)
# inspect data
datatable(vsmdata, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
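If you are not working in an interactive HTML document, you can also inspect the data non-interactively, as in the simple alternative to datatable sketched below.
# non-interactive inspection of the data
head(vsmdata, 10)
# how many adjectives occur with vs. without an amplifier
table(vsmdata$Amplifier == 0)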
For this tutorial, we will reduce the number of amplifiers and adjectives and thus simplify the data to make it easier to understand what is going on. To simplify the data, we remove all instances in which the adjective was not amplified as well as the items many and much.
In addition, we collapse all amplifiers that occur 20 times or fewer into a bin category (other), and we remove all adjectives that occur 10 times or fewer.
# simplify data
vsmdata_simp <- vsmdata %>%
# remove unamplified instances and the items many and much
dplyr::filter(Amplifier != 0,
Adjective != "many",
Adjective != "much") %>%
# collapse infrequent amplifiers
dplyr::group_by(Amplifier) %>%
dplyr::mutate(AmpFreq = dplyr::n()) %>%
dplyr::ungroup() %>%
dplyr::mutate(Amplifier = ifelse(AmpFreq > 20, Amplifier, "other")) %>%
# collapse infrequent adjectives
dplyr::group_by(Adjective) %>%
dplyr::mutate(AdjFreq = dplyr::n()) %>%
dplyr::ungroup() %>%
dplyr::mutate(Adjective = ifelse(AdjFreq > 10, Adjective, "other")) %>%
dplyr::filter(Adjective != "other") %>%
dplyr::select(-AmpFreq, -AdjFreq)
# inspect data
datatable(vsmdata_simp, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
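As a quick sanity check (not part of Levshina's procedure), we can tabulate the remaining amplifier types to confirm that infrequent amplifiers have indeed been collapsed into other.
# check the frequencies of the remaining amplifier types
table(vsmdata_simp$Amplifier)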
In a next step, we create a matrix from this data frame which maps how often a given amplifier co-occurred with a given adjective. In text mining, this format is called a term-document matrix (tdm), which is the transpose of a document-term matrix (dtm).
# tabulate data (create term-document matrix)
tdm <- ftable(vsmdata_simp$Adjective, vsmdata_simp$Amplifier)
# extract amplifiers and adjectives
amplifiers <- as.vector(unlist(attr(tdm, "col.vars")[1]))
adjectives <- as.vector(unlist(attr(tdm, "row.vars")[1]))
# attach row and column names to tdm
rownames(tdm) <- adjectives
colnames(tdm) <- amplifiers
# inspect data
tdm[1:5, 1:5]
## other pretty really so very
## bad 2 1 8 3 2
## big 0 3 4 2 4
## clear 2 1 2 2 4
## different 8 0 2 0 3
## difficult 4 1 2 1 18
Now that we have a term-document matrix, we want to remove adjectives that were never amplified. Note, however, that if we were interested in classifying adjectives (rather than amplifiers) according to their co-occurrence with amplifiers, we would, of course, not do this, as not being amplified would be a relevant feature for adjectives. But since we are interested in classifying amplifiers, adjectives that are never amplified carry no information value.
# convert frequencies greater than 1 into 1
tdm <- t(apply(tdm, 1, function(x){ifelse(x > 1, 1, x)}))
# remove adjectives that were never amplified
tdm <- tdm[which(rowSums(tdm) > 1),]
# transpose tdm because we are interested in amplifiers not adjectives
tdm <- t(tdm)
# inspect data
tdm[1:5, 1:5]
##
## bad big clear different difficult
## other 1 0 1 1 1
## pretty 1 1 1 0 1
## really 1 1 1 1 1
## so 1 1 1 0 1
## very 1 1 1 1 1
In a next step, we extract the expected values of the co-occurrences (i.e., the values we would expect if the amplifiers were distributed homogeneously), calculate Pointwise Mutual Information (PMI) scores, and use these to calculate Positive Pointwise Mutual Information (PPMI) scores. According to Levshina (2015, 327), referring to Bullinaria and Levy (2007), PPMI performs better than PMI as negative values are replaced with zeros. In a next step, we calculate the cosine similarity, which will form the basis for the subsequent clustering.
# compute expected values
tdm.exp <- chisq.test(tdm)$expected
## Warning in chisq.test(tdm): Chi-squared approximation may be incorrect
# calculate PMI and PPMI
PMI <- log2(tdm/tdm.exp)
PPMI <- ifelse(PMI < 0, 0, PMI)
# calculate cosine similarity
cosinesimilarity <- cosine(PPMI)
# inspect cosine values
cosinesimilarity[1:5, 1:5]
## bad big clear different difficult
## bad 1.0000 0.6764 1.0000 0.6496 1.0000
## big 0.6764 1.0000 0.6764 0.0000 0.6764
## clear 1.0000 0.6764 1.0000 0.6496 1.0000
## different 0.6496 0.0000 0.6496 1.0000 0.6496
## difficult 1.0000 0.6764 1.0000 0.6496 1.0000
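To make the cosine measure transparent, we can verify a single value by hand: the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. The sketch below assumes the PPMI matrix from above and recomputes the similarity between the columns for bad and big.
# manual check: dot product divided by the product of the vector norms
x <- PPMI[, "bad"]
y <- PPMI[, "big"]
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
# this should match cosinesimilarity["bad", "big"]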
As we have now obtained a similarity measure, we can go ahead and perform a cluster analysis on these similarity values. First, however, we have to extract the maximum value in the similarity matrix that is not 1, as we will use this value to create a distance matrix. While we could also simply have subtracted the cosine similarity values from 1 to convert the similarity matrix into a distance matrix, we follow the procedure proposed by Levshina (2015).
# find max value that is not 1
cosinesimilarity.test <- apply(cosinesimilarity, 1, function(x){
x <- ifelse(x == 1, 0, x) } )
maxval <- max(cosinesimilarity.test)
# create distance matrix
amplifier.dist <- 1 - (cosinesimilarity/maxval)
clustd <- as.dist(amplifier.dist)
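For comparison, the simpler conversion mentioned above, subtracting the similarities from 1, would look as follows; we do not use it here, but it also yields a valid distance object.
# alternative conversion: distance = 1 - similarity
clustd_alt <- as.dist(1 - cosinesimilarity)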
In a next step, we want to determine the optimal number of clusters. There are two reasons for this: firstly, we need to establish that we have reason to assume that the data are not homogeneous (which would be the case if the optimal number of clusters were 1), and, secondly, we want to check how many meaningful clusters there are in our data.
# find optimal number of clusters
asw <- as.vector(unlist(sapply(2:(nrow(tdm)-1), function(x) pam(clustd, k = x)$silinfo$avg.width)))
# determine the optimal number of clusters (max width is optimal)
optclust <- which(asw == max(asw))+1 # optimal number of clusters
# inspect clustering with optimal number of clusters
amplifier.clusters <- pam(clustd, optclust)
# inspect cluster solution
amplifier.clusters$clustering
## bad big clear different difficult good hard important interesting nice strong
## 1 2 1 3 1 1 2 3 4 4 4
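Rather than relying only on the maximum, it can be instructive to plot the average silhouette widths for all candidate cluster numbers; a simple base R plot, as sketched below, suffices for this.
# plot the average silhouette widths against the number of clusters
plot(2:(nrow(tdm) - 1), asw, type = "b",
     xlab = "Number of clusters",
     ylab = "Average silhouette width")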
In a next step, we visualize the results of the semantic vector space model as a dendrogram.
# create cluster object
cd <- hclust(clustd, method="ward.D")
# plot cluster object
plot(cd, main = "", sub = "", yaxt = "n", ylab = "", xlab = "", cex = .8)
# add colored rectangles around clusters
rect.hclust(cd, k = 6, border = "gray60")
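If you want to assess how robust the clusters are, the pvclust package (installed above) can attach bootstrap p-values to a dendrogram. Note that pvclust computes its own distances from the input matrix, so the sketch below uses Euclidean distances over the PPMI columns rather than our cosine-based distance matrix, and the resulting tree may therefore differ slightly; nboot is kept low here to reduce runtime.
# optional: bootstrap support for the clusters
# (pvclust clusters the columns of the input matrix)
library(pvclust)
pvc <- pvclust(PPMI, method.hclust = "ward.D",
               method.dist = "euclidean", nboot = 100)
plot(pvc, cex = .8)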
The clustering solution shows that, as expected, completely and totally, while similar to each other and thus interchangeable with each other, form a separate cluster from all other amplifiers. In addition, very and really form a cluster together with the zero variant. This is likely because really, very, and the zero variant are not only the most frequent “variants” but also co-occur with the widest variety of adjectives. The results can be interpreted to suggest that really and very are “default” amplifiers that lack distinct semantic profiles.
There are many more useful methods for classifying and grouping data, and the tutorial by Gede Primahadi Wijaya Rajeg, Karlina Denistia, and Simon Musgrave (Rajeg, Denistia, and Musgrave 2019) is highly recommended for a better understanding of VSMs, but this should suffice to get you started.
Schweinberger, Martin. 2020. Semantic Vector Space Models in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/svm.html (Version 2020.12.03).
@manual{schweinberger2020svm,
author = {Schweinberger, Martin},
title = {Semantic Vector Space Models in R},
note = {https://slcladal.github.io/svm.html},
year = {2020},
organization = "The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2020/12/03}
}
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] userfriendlyscience_0.7.2 viridis_0.5.1 viridisLite_0.3.0 MASS_7.3-53
## [5] sentimentr_2.7.1 zoo_1.8-8 gapminder_0.3.0 rvest_0.3.6
## [9] xml2_1.3.2 textrank_0.3.1 sjPlot_2.8.6 ggfortify_0.4.11
## [13] car_3.0-10 carData_3.0-4 lmerTest_3.1-3 lme4_1.1-25
## [17] vip_0.3.2 rms_6.0-1 SparseM_1.78 Hmisc_4.4-1
## [21] Formula_1.2-4 survival_3.2-7 lexRankr_0.5.2 janeaustenr_0.1.5
## [25] hashr_0.1.3 stringdist_0.9.6.3 koRpus.lang.de_0.1-1 coop_0.6-2
## [29] hunspell_3.0 koRpus.lang.en_0.1-4 koRpus_0.13-3 sylly_0.1-6
## [33] textdata_0.4.1 here_0.1 tokenizers_0.2.1 readxl_1.3.1
## [37] cowplot_1.1.0 magick_2.5.2 gutenbergr_0.2.0 wordcloud_2.6
## [41] RColorBrewer_1.1-2 ggstatsplot_0.7.1 EnvStats_2.4.0 ggridges_0.5.2
## [45] likert_1.3.5 xtable_1.8-4 SnowballC_0.7.0 scales_1.1.1
## [49] Rmisc_1.5 plyr_1.8.6 lattice_0.20-41 psych_2.0.9
## [53] DescTools_0.99.38 boot_1.3-25 pdftools_2.3.1 collostructions_0.1.2
## [57] igraph_1.2.6 GGally_2.0.0 network_1.16.1 ggdendro_0.1.22
## [61] slam_0.1-47 Matrix_1.2-18 tm_0.7-7 NLP_0.2-1
## [65] tidytext_0.2.6 quanteda_2.1.2 gplots_3.1.0 FactoMineR_2.3
## [69] exact2x2_1.6.5 exactci_1.3-3 ssanv_1.1 vcd_1.4-8
## [73] ape_5.4-1 pvclust_2.2-0 NbClust_3.0 seriation_1.2-9
## [77] factoextra_1.0.7 cluster_2.1.0 cfa_0.10-0 gridExtra_2.3
## [81] fGarch_3042.83.2 fBasics_3042.89.1 timeSeries_3062.100 timeDate_3043.102
## [85] e1071_1.7-4 ggpubr_0.4.0 flextable_0.5.11 forcats_0.5.0
## [89] stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.4.0
## [93] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.3 tidyverse_1.3.0
## [97] DT_0.16 kableExtra_1.3.1 knitr_1.30
##
## loaded via a namespace (and not attached):
## [1] PMCMRplus_1.9.0 textshape_1.7.1 minpack.lm_1.2-1 pander_0.6.3
## [5] pbapply_1.4-3 haven_2.3.1 vctrs_0.3.4 expm_0.999-5
## [9] usethis_1.6.3 mgcv_1.8-33 gmp_0.6-1 prodlim_2019.11.13
## [13] later_1.1.0.1 nloptr_1.2.2.2 DBI_1.1.0 rappdirs_0.3.1
## [17] selectr_0.4-2 jpeg_0.1-8.1 MatrixModels_0.4-1 sjmisc_2.8.5
## [21] htmlwidgets_1.5.3 mvtnorm_1.1-1 leaps_3.1 pairwiseComparisons_3.1.3
## [25] parallel_4.0.3 Rcpp_1.0.5 KernSmooth_2.23-17 promises_1.1.1
## [29] kSamples_1.2-9 ggeffects_0.16.0 statsExpressions_1.0.0 RcppParallel_5.0.2
## [33] fs_1.5.0 fastmatch_1.1-0 mnormt_2.0.2 digest_0.6.27
## [37] png_0.1-7 polspline_1.1.19 pkgconfig_2.0.3 gower_0.2.2
## [41] estimability_1.3 iterators_1.0.13 minqa_1.2.4 statnet.common_4.4.1
## [45] lavaan_0.6-7 xfun_0.19 tidyselect_1.1.0 performance_0.5.1
## [49] reshape2_1.4.4 rlang_0.4.8 hexbin_1.28.1 isoband_0.2.2
## [53] syuzhet_1.0.4 Rmpfr_0.8-1 glue_1.4.2 gdtools_0.2.2
## [57] registry_0.5-1 modelr_0.1.8 matrixStats_0.57.0 emmeans_1.5.2-1
## [61] ggcorrplot_0.1.3 multcompView_0.1-8 lava_1.6.8.1 ggsignif_0.6.1
## [65] bayestestR_0.8.2 recipes_0.1.14 labeling_0.4.2 httpuv_1.5.4
## [69] class_7.3-17 TH.data_1.0-10 webshot_0.5.2 jsonlite_1.7.1
## [73] tmvnsim_1.0-2 mime_0.9 systemfonts_0.3.2 Exact_2.1
## [77] stringi_1.5.3 insight_0.13.1 BWStest_0.2.2 bitops_1.0-6
## [81] cli_2.1.0 spatial_7.3-12 data.table_1.13.2 correlation_0.6.0
## [85] officer_0.3.15 rstudioapi_0.11 TSP_1.1-10 nlme_3.1-149
## [89] miniUI_0.1.1.1 textclean_0.9.3 dbplyr_2.0.0 lexicon_1.2.1
## [93] lifecycle_0.2.0 GPArotation_2014.11-1 munsell_0.5.0 cellranger_1.1.0
## [97] proxyC_0.1.5 visNetwork_2.0.9 caTools_1.18.0 codetools_0.2-16
## [ reached getOption("max.print") -- omitted 104 entries ]
Bullinaria, J. A., and J. P. Levy. 2007. “Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study.” Behavior Research Methods 39: 510–26.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.
Rajeg, Gede Primahadi Wijaya, Karlina Denistia, and Simon Musgrave. 2019. “R Markdown Notebook for Vector Space Model and the Usage Patterns of Indonesian Denominal Verbs.” https://doi.org/10.6084/m9.figshare.9970205. https://figshare.com/articles/R_Markdown_Notebook_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/9970205.
Schmidt, Drew, and Christian Heckendorf. 2019. Coop: Co-Operation: Fast Covariance, Correlation, and Cosine Similarity Operations. https://CRAN.R-project.org/package=coop.