This tutorial introduces lexicography with R and shows how R can be used to create dictionaries and to find synonyms by determining semantic similarity. While the initial example focuses on English, subsequent sections show how easily this approach can be generalized to languages other than English (e.g. German, French, Spanish, Italian, or Dutch). The entire R-markdown document for the sections below can be downloaded here.
Traditionally, dictionaries are listings of words, commonly arranged alphabetically, which may include information on definitions, usage, etymologies, pronunciations, translations, etc. (see Agnes, Goldman, and Soltis 2002; Steiner 1985). If such dictionaries, which are typically published as books, contain translations of words into other languages, they are referred to as lexicons. Lexicographical references thus show the inter-relationships among lexical data, i.e. words.
Similarly, in computational linguistics, dictionaries represent a specific format of data where elements are linked to or paired with other elements in a systematic way. Computational lexicology refers to a branch of computational linguistics which is concerned with the use of computers in the study of lexicons. Hence, computational lexicology has been defined as the use of computers in the study of machine-readable dictionaries (see e.g. Amsler 1981). Computational lexicology is distinguished from computational lexicography, which can be defined as the use of computers in the construction of dictionaries and which is the focus of this tutorial. It should be noted, though, that computational lexicology and computational lexicography are often used synonymously.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes), so you do not need to worry if it takes a while.
# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages("tidyverse")
install.packages("tidytext")
install.packages("textdata")
install.packages("quanteda")
install.packages("koRpus")
install.packages("koRpus.lang.en", dependencies = T)
install.packages("koRpus.lang.de", dependencies = T)
install.packages("hunspell")
install.packages("coop")
install.packages("tm")
install.packages("cluster")
install.packages("tidyr")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
In a next step, we load the packages.
# load packages
library(tidyverse)
library(tidytext)
library(textdata)
library(quanteda)
library(koRpus)
library(koRpus.lang.en)
library(hunspell)
library(coop)
library(tm)
library(cluster)
library(tidyr)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed R and RStudio and once you have initiated the session by executing the code shown above, you are good to go.
In a first step, we load the necessary packages from the library and define the location of the engine which we will use for the part-of-speech tagging. In this case, we will use the TreeTagger (see Schmid 1994, 2013; Schmid et al. 2007). How to install and then use the TreeTagger for English as well as for German, French, Spanish, Italian, and Dutch is demonstrated and explained here.
NOTE
You will have to install TreeTagger and change the path used below ("C:\\TreeTagger\\bin\\tag-english.bat") to the location where you have installed TreeTagger on your machine. If you do not know how to install TreeTagger or encounter problems, read this tutorial!
In addition, you can download the pos-tagged text here so you can simply skip the next code chunk and load the data as shown below.
# define location of pos-tagger engine
# WARNING: this needs to be the path to the TreeTagger on YOUR machine!
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en")
In a next step, we load and process the data, which in this tutorial is the text of George Orwell’s Nineteen Eighty-Four. We will not pre-process the data (for instance, by repairing broken or otherwise compromised words) but continue directly with the part-of-speech tagging.
# load and pos-tag data
orwell_pos <- treetag("https://slcladal.github.io/data/orwell.txt")
# select data frame
orwell_pos <- orwell_pos@tokens
doc_id | token | tag | lemma | lttr | wclass | desc | stop | stem | idx | sntc |
orwell.txt | 1984 | CD | @card@ | 4 | number | | | | 1 | 1 |
orwell.txt | George | NP | George | 6 | name | | | | 2 | 1 |
orwell.txt | Orwell | NP | Orwell | 6 | name | | | | 3 | 1 |
orwell.txt | Part | NP | Part | 4 | name | | | | 4 | 1 |
orwell.txt | 1 | CD | @card@ | 1 | number | | | | 5 | 1 |
orwell.txt | , | , | , | 1 | comma | | | | 6 | 1 |
orwell.txt | Chapter | NP | Chapter | 7 | name | | | | 7 | 1 |
orwell.txt | 1 | CD | @card@ | 1 | number | | | | 8 | 1 |
orwell.txt | It | PP | it | 2 | pronoun | | | | 9 | 1 |
orwell.txt | was | VBD | be | 3 | verb | | | | 10 | 1 |
orwell.txt | a | DT | a | 1 | determiner | | | | 11 | 1 |
orwell.txt | bright | JJ | bright | 6 | adjective | | | | 12 | 1 |
orwell.txt | cold | JJ | cold | 4 | adjective | | | | 13 | 1 |
orwell.txt | day | NN | day | 3 | noun | | | | 14 | 1 |
orwell.txt | in | IN | in | 2 | preposition | | | | 15 | 1 |
If you could not pos-tag the text, you can simply execute the following code chunk which loads the pos-tagged text from the LADAL repository.
# load pos-tagged data
orwell_pos <- base::readRDS(url("https://slcladal.github.io/data/orwell_pos.rda", "rb"))
We can now use the resulting table to generate a first, basic dictionary that holds information about the word form (token), the part-of-speech tag (tag), the lemmatized word type (lemma), the general word category (wclass), and the frequency with which the word form is used as that part of speech.
# generate dictionary
orwell_dic_raw <- orwell_pos %>%
dplyr::select(token, tag, lemma, wclass) %>%
dplyr::group_by(token, tag, lemma, wclass) %>%
dplyr::summarise(frequency = dplyr::n()) %>%
dplyr::arrange(lemma)
token | tag | lemma | wclass | frequency |
' | '' | ' | punctuation | 1,826 |
' | POS | ' | possessive | 104 |
's | POS | 's | possessive | 272 |
- | : | - | punctuation | 9 |
-'complete | NN | -'complete | noun | 1 |
-- | : | -- | punctuation | 363 |
-if | NN | -if | noun | 1 |
-said | JJ | -said | adjective | 1 |
-that | NP | -that | name | 1 |
! | SENT | ! | fullstop | 250 |
( | ( | ( | punctuation | 35 |
) | ) | ) | punctuation | 35 |
, | , | , | comma | 5,854 |
. | SENT | . | fullstop | 5,607 |
... | : | ... | punctuation | 7 |
However, as the resulting table shows, the data is still very noisy, i.e. it contains a lot of non-words, i.e. words that may be misspelled, broken, or otherwise compromised. In order to get rid of these, we can simply check if the lemma exists in an established dictionary. If you aimed to identify exactly those words that are not yet part of an established dictionary, you could of course do it the other way around and remove all words that are already present in an existing dictionary (see the sketch further below).
# generate dictionary
orwell_dic_clean <- orwell_dic_raw %>%
dplyr::filter(hunspell_check(lemma)) %>%
dplyr::filter(!stringr::str_detect(lemma, "\\W\\w{1,}"))
token | tag | lemma | wclass | frequency |
. | SENT | . | fullstop | 5,607 |
... | : | ... | punctuation | 7 |
3rd | JJ | 3rd | adjective | 1 |
4th | JJ | 4th | adjective | 3 |
a | DT | a | determiner | 2,277 |
A | DT | a | determiner | 110 |
A | NP | A | name | 3 |
aback | RB | aback | adverb | 2 |
abandon | VV | abandon | verb | 3 |
abandoned | VVD | abandon | verb | 1 |
abandoned | VVN | abandon | verb | 3 |
abashed | VVN | abash | verb | 1 |
abbreviated | VVN | abbreviate | verb | 1 |
abiding | JJ | abiding | adjective | 1 |
ability | NN | ability | noun | 1 |
We have now checked the entries against an existing dictionary and removed non-word elements. As such, we are left with a clean dictionary based on George Orwell’s Nineteen Eighty-Four.
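If, as mentioned above, you wanted to identify exactly those elements that are not (yet) listed in an established dictionary, e.g. to collect potential neologisms, you can simply invert the hunspell filter. A minimal sketch, based on the orwell_dic_raw table generated above:
# extract candidate non-words (items not recognized by hunspell) instead of removing them
orwell_nonwords <- orwell_dic_raw %>%
  dplyr::filter(!hunspell_check(lemma))
# inspect candidates
head(orwell_nonwords)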
Extending dictionaries, that is, adding additional layers of information or other types of annotation (e.g. URLs to relevant references or sources), is fortunately very easy in R and can be done without much additional computing.
We will begin to extend our dictionary by adding an additional column (called annotation) in which we will add information.
# generate dictionary
orwell_dic_ext <- orwell_dic_clean %>%
dplyr::mutate(annotation = NA) %>%
dplyr::mutate(annotation = ifelse(token == "3rd", "also 3.",
ifelse(token == "4th", "also 4.", annotation)))
token | tag | lemma | wclass | frequency | annotation |
. | SENT | . | fullstop | 5,607 | |
... | : | ... | punctuation | 7 | |
3rd | JJ | 3rd | adjective | 1 | also 3. |
4th | JJ | 4th | adjective | 3 | also 4. |
a | DT | a | determiner | 2,277 | |
A | DT | a | determiner | 110 | |
A | NP | A | name | 3 | |
aback | RB | aback | adverb | 2 | |
abandon | VV | abandon | verb | 3 | |
abandoned | VVD | abandon | verb | 1 | |
abandoned | VVN | abandon | verb | 3 | |
abashed | VVN | abash | verb | 1 | |
abbreviated | VVN | abbreviate | verb | 1 | |
abiding | JJ | abiding | adjective | 1 | |
ability | NN | ability | noun | 1 |
To make it a bit more interesting, but also to keep this tutorial simple and straightforward, we will add information about the polarity and emotionality of the words in our dictionary. We can do this by performing a sentiment analysis on the lemmas using the tidytext package.
The tidytext package contains three sentiment dictionaries (nrc, bing, and afinn). For the present purpose, we use the nrc dictionary, which represents the Word-Emotion Association Lexicon (Mohammad and Turney 2013). The Word-Emotion Association Lexicon comprises 10,170 terms in which lexical elements are assigned scores based on ratings gathered through the crowd-sourced Amazon Mechanical Turk service. For the Word-Emotion Association Lexicon, raters were asked whether a given word was associated with one of eight emotions. The resulting associations between terms and emotions are based on 38,726 ratings from 2,216 raters who answered a sequence of questions for each word which were then fed into the emotion association rating (cf. Mohammad and Turney 2013). Each term was rated 5 times. For 85 percent of the words, at least 4 raters provided identical ratings. For instance, words such as cry or tragedy are more readily associated with SADNESS, while words such as happy or beautiful are indicative of JOY, and words like fit or burst may indicate ANGER. This means that the sentiment analysis here allows us to investigate the expression of certain core emotions rather than merely classifying statements along the lines of a crude positive-negative distinction.
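If you would like to inspect the Word-Emotion Association Lexicon before applying it, you can load it via the tidytext package as sketched below (the textdata package will ask you to confirm the download of the lexicon the first time you access it).
# inspect the NRC Word-Emotion Association Lexicon (two columns: word and sentiment)
nrc <- tidytext::get_sentiments("nrc")
head(nrc)
# number of terms per emotion / polarity category
table(nrc$sentiment)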
To be able to use the Word-Emotion Association Lexicon, we need to add another column to our data frame called word which simply contains the lemmatized word. The reason is that the lexicon expects this column and only works if it finds a word column in the data. The code below shows how to add the emotion and polarity entries to our dictionary.
# generate dictionary
orwell_dic_ext <- orwell_dic_ext %>%
dplyr::mutate(word = lemma) %>%
dplyr::left_join(get_sentiments("nrc")) %>%
tidyr::spread(sentiment, sentiment)
token | tag | lemma | wclass | frequency | annotation | word | anger | anticipation | disgust | fear | joy | negative | positive | sadness | surprise | trust | <NA> |
. | SENT | . | fullstop | 5,607 | | . | | | | | | | | | | | |
... | : | ... | punctuation | 7 | | ... | | | | | | | | | | | |
3rd | JJ | 3rd | adjective | 1 | also 3. | 3rd | | | | | | | | | | | |
4th | JJ | 4th | adjective | 3 | also 4. | 4th | | | | | | | | | | | |
a | DT | a | determiner | 2,277 | | a | | | | | | | | | | | |
A | DT | a | determiner | 110 | | a | | | | | | | | | | | |
A | NP | A | name | 3 | | A | | | | | | | | | | | |
aback | RB | aback | adverb | 2 | | aback | | | | | | | | | | | |
abandon | VV | abandon | verb | 3 | | abandon | | | | fear | | negative | | sadness | | | |
abandoned | VVD | abandon | verb | 1 | | abandon | | | | fear | | negative | | sadness | | | |
abandoned | VVN | abandon | verb | 3 | | abandon | | | | fear | | negative | | sadness | | | |
abashed | VVN | abash | verb | 1 | | abash | | | | | | | | | | | |
abbreviated | VVN | abbreviate | verb | 1 | | abbreviate | | | | | | | | | | | |
abiding | JJ | abiding | adjective | 1 | | abiding | | | | | | | | | | | |
ability | NN | ability | noun | 1 | | ability | | | | | | | positive | | | | |
The resulting extended dictionary now contains not only the token, the pos-tag, the lemma, and the generalized word class, but also the emotion and polarity annotations from the Word-Emotion Association Lexicon.
As mentioned above, the procedure for generating dictionaries can easily be applied to languages other than English. If you want to follow exactly the procedure described above, then the language set of the TreeTagger is the limiting factor, as its R implementation only supports English, German, French, Italian, Spanish, and Dutch. If a part-of-speech tagged text in another language is already available to you and you do not require the TreeTagger for the part-of-speech tagging, then you can skip the code chunk that is related to the tagging and modify the procedure described above for virtually any language.
We will now briefly create a German dictionary based on a subsection of the fairy tales collected by the brothers Grimm to show how the above procedure can be applied to a language other than English. In a first step, we load a German text into R.
# install german language package
#install.koRpus.lang("de")
# activate german language package
library(koRpus.lang.de)
# define location of pos-tagger engine
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-german.bat", lang="de")
NOTE
You will need to adapt the path to the TreeTagger engine (see previous code chunk) as well as the path to the data (see following code chunk)! The paths below are specific to my computer.
# pos-tag data
grimm_pos <- koRpus::treetag(file = here::here("data", "grimm.txt"),
lang = "de")
For other languages, you would need to adapt the path to the pos-tagger engine to match the language of your text. The code chunk below shows how you would do that for French, Spanish, Italian, and Dutch. Again, note that you would need to adapt the path to the TreeTagger engine as well as the path to the data! The paths below are specific to my computer.
# French
# install french language package
install.koRpus.lang("fr")
# activate french language package
library(koRpus.lang.fr)
# define location of pos-tagger engine
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-french.bat", lang="fr")
# pos-tag data
text_pos <- treetag("PathToFrenchText.txt")
# Spanish
# install spanish language package
install.koRpus.lang("es")
# activate spanish language package
library(koRpus.lang.es)
# define location of pos-tagger engine
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-spanish.bat", lang="es")
# pos-tag data
text_pos <- treetag("PathToSpanishText.txt")
# Italian
# install italian language package
install.koRpus.lang("it")
# activate italian language package
library(koRpus.lang.it)
# define location of pos-tagger engine
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-italian.bat", lang="it")
# pos-tag data
text_pos <- treetag("PathToItalianText.txt")
# Dutch
# install dutch language package
install.koRpus.lang("nl")
# activate dutch language package
library(koRpus.lang.nl)
# define location of pos-tagger engine
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-dutch.bat", lang="nl")
# pos-tag data
text_pos <- treetag("PathToDutchText.txt")
We will now continue with generating the dictionary based on the Brothers Grimm fairy tales. We go through the same steps as for the English dictionary and collapse all the steps into a single code block.
# select data frame
grimm_pos <- grimm_pos@tokens
# generate dictionary
grimm_dic_raw <- grimm_pos %>%
dplyr::select(token, tag, lemma, wclass) %>%
dplyr::group_by(token, tag, lemma, wclass) %>%
dplyr::summarise(frequency = dplyr::n()) %>%
dplyr::arrange(lemma)
# clean dictionary
grimm_dic_clean <- grimm_dic_raw %>%
dplyr::filter(!stringr::str_detect(lemma, "\\W\\w{1,}"),
!stringr::str_detect(lemma, "\\W{1,}"),
!stringr::str_detect(lemma, "[:digit:]{1,}"),
nchar(token) > 1)
token | tag | lemma | wclass | frequency |
Aanten | NN | Aanten | noun | 1 |
Aar | NN | Aar | noun | 1 |
ab | APPR | ab | preposition | 1 |
ab | PTKVZ | ab | particle | 85 |
abgebissen | VVPP | abbeißen | verb | 1 |
abblasen | VVINF | abblasen | verb | 1 |
abbrechen | VVFIN | abbrechen | verb | 1 |
abbrechen | VVINF | abbrechen | verb | 4 |
abgebrochen | VVPP | abbrechen | verb | 2 |
Abend | ADV | abend | adverb | 2 |
Abend | NN | Abend | noun | 49 |
Abends | NN | Abend | noun | 28 |
Abendessen | NN | Abendessen | noun | 1 |
Abends | ADV | abends | adverb | 6 |
Abentheuer | NN | Abentheuer | noun | 3 |
As with the English dictionary, we have created a customized German dictionary based on a subsample of the Brothers Grimm fairy tales holding the word form (token), the part-of-speech tag (tag), the lemmatized word type (lemma), the general word class (wclass), and the frequency with which a word form occurs as a part-of-speech in the data (frequency).
Another task that is quite common in lexicography is to determine if words share some form of relationship, such as whether they are synonyms or antonyms. In computational linguistics, this is commonly determined based on the collocational profiles of words. These collocational profiles are also called word vectors or word embeddings, and approaches which determine semantic similarity based on collocational profiles or word embeddings are called distributional approaches (or distributional semantics). The basic assumption of distributional approaches is that words that occur in the same contexts, and therefore have similar collocational profiles, are also semantically similar. In fact, various packages, such as qdap or wordnet, already provide synonyms for terms (all of which are based on similar collocational profiles), but here we would like to determine if words are similar without knowing it in advance.
In this example, we want to determine if two degree adverbs (such as very, really, so, completely, totally, amazingly, etc.) are synonymous and can therefore be exchanged without changing the meaning of the sentence (or, at least, not changing it dramatically). This is relevant in lexicography as such terms can then be linked to each other and inform readers that these words are interchangeable.
As a first step, we load the data which contains three columns:
one column holding the degree adverbs which is called pint
one column called adjs holding the adjectives that the degree adverbs have modified
one column called remove which contains the word keep and which we will remove as it is not relevant for this tutorial
When loading the data, we
remove the remove column
rename the pint column as degree_adverb
rename the adjs column as adjective
filter out all instances where the degree adverb column has the value 0 (which means that the adjective was not modified)
remove instances where well functions as a degree adverb (because it behaves rather differently from other degree adverbs)
# load data
degree_adverbs <- base::readRDS(url("https://slcladal.github.io/data/dad.rda", "rb")) %>%
dplyr::select(-remove) %>%
dplyr::rename(degree_adverb = pint,
adjective = adjs) %>%
dplyr::filter(degree_adverb != "0",
degree_adverb != "well")
degree_adverb | adjective |
real | bad |
really | nice |
very | good |
really | early |
really | bad |
really | bad |
so | long |
really | wonderful |
pretty | good |
really | easy |
pretty | strong |
really | long |
really | funny |
really | good |
really | nice |
In a next step, we create a matrix from this data frame which maps how often a given degree adverb (amplifier) co-occurred with a given adjective. In text mining, this format is called a term-document matrix or tdm (which is the transpose of a document-term matrix, or dtm).
# tabulate data (create term-document matrix)
tdm <- ftable(degree_adverbs$adjective, degree_adverbs$degree_adverb)
# extract amplifiers and adjectives
amplifiers <- as.vector(unlist(attr(tdm, "col.vars")[1]))
adjectives <- as.vector(unlist(attr(tdm, "row.vars")[1]))
# attach row and column names to tdm
rownames(tdm) <- adjectives
colnames(tdm) <- amplifiers
# inspect data
tdm[1:5, 1:5]
## completely extremely pretty real really
## able 0 1 0 0 0
## actual 0 0 0 1 0
## amazing 0 0 0 0 4
## available 0 0 0 0 1
## bad 0 0 1 2 3
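As mentioned above, a document-term matrix (dtm) is simply the transpose of this term-document matrix, so if you needed the data in dtm format, a minimal sketch would be:
# a document-term matrix (dtm) is the transpose of the term-document matrix (tdm)
dtm <- t(tdm)
# inspect data
dtm[1:5, 1:5]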
In a next step, we extract the expected values of the co-occurrences, i.e. the values we would expect if the amplifiers were distributed homogeneously, calculate the Pointwise Mutual Information (PMI) score, and use that to calculate the Positive Pointwise Mutual Information (PPMI) scores. According to Levshina (2015, 327) - referring to Bullinaria and Levy (2007) - PPMI performs better than PMI as negative values are replaced with zeros. In a next step, we calculate the cosine similarity which will form the basis for the subsequent clustering.
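To see the logic behind the PMI calculation in the code below, consider a single, purely hypothetical cell: if a pair is observed 4 times but only expected 1.5 times under independence, its PMI is positive; pairs that occur less often than expected receive negative PMI values (or -Inf for unattested pairs), which the PPMI transformation replaces with 0.
# toy illustration with hypothetical counts (not taken from the data above)
observed <- 4     # observed co-occurrence frequency of a pair
expected <- 1.5   # frequency expected if the two words were independent
pmi <- log2(observed / expected)  # positive: the pair occurs more often than expected
ppmi <- ifelse(pmi < 0, 0, pmi)   # negative values are replaced with 0
pmi; ppmi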
# compute expected values
tdm.exp <- chisq.test(tdm)$expected
# calculate PMI and PPMI
PMI <- log2(tdm/tdm.exp)
PPMI <- ifelse(PMI < 0, 0, PMI)
# calculate cosine similarity
cosinesimilarity <- cosine(PPMI)
# inspect cosine values
cosinesimilarity[1:5, 1:5]
## completely extremely pretty real really
## completely 1.0000000000000 0.20418872529547 0.00000000000000 0.0530435361915 0.12666843354700
## extremely 0.2041887252955 1.00000000000000 0.00731931640239 0.0000000000000 0.00423534563153
## pretty 0.0000000000000 0.00731931640239 1.00000000000000 0.0944129947050 0.06232327122570
## real 0.0530435361915 0.00000000000000 0.09441299470503 1.0000000000000 0.13195747269597
## really 0.1266684335470 0.00423534563153 0.06232327122570 0.1319574726960 1.00000000000000
As we have now obtained a similarity measure, we can go ahead and perform a cluster analysis on these similarity values. To do so, we first have to extract the maximum value in the similarity matrix that is not 1, as we will use this value to create a distance matrix. While we could also have simply subtracted the cosine similarity values from 1 to convert the similarity matrix into a distance matrix, we follow the procedure proposed by Levshina (2015).
# find max value that is not 1
cosinesimilarity.test <- apply(cosinesimilarity, 1, function(x){
x <- ifelse(x == 1, 0, x) } )
maxval <- max(cosinesimilarity.test)
# create distance matrix
amplifier.dist <- 1 - (cosinesimilarity/maxval)
clustd <- as.dist(amplifier.dist)
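If you preferred the simpler alternative mentioned above, i.e. subtracting the cosine similarities from 1 directly, the corresponding sketch would be:
# alternative (not used here): convert similarities into distances by subtracting from 1
clustd_alt <- as.dist(1 - cosinesimilarity)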
In a next step, we visualize the results of the semantic vector space model as a dendrogram.
# create cluster object
cd <- hclust(clustd, method="ward.D")
# plot cluster object
plot(cd, main = "", sub = "", yaxt = "n", ylab = "", xlab = "", cex = .8)
The clustering solution shows that, as expected, completely, extremely, and totally - while similar to each other and thus interchangeable with each other - form a separate cluster from all other amplifiers. In addition, real and really form a cluster together. The clustering of very, pretty, so, really, and real suggests that these amplifiers are more or less interchangeable with each other but not with totally, completely, and extremely.
To extract synonyms automatically, we can use the cosine similarity matrix that we generated before. This is what we need to do:
syntb <- cosinesimilarity %>%
as.data.frame() %>%
dplyr::mutate(word = colnames(cosinesimilarity)) %>%
dplyr::mutate_each(funs(replace(., . == 1, 0))) %>%
dplyr::mutate(synonym = colnames(.)[apply(.,1,which.max)]) %>%
dplyr::select(word, synonym)
syntb
## word synonym
## completely completely extremely
## extremely extremely completely
## pretty pretty real
## real real really
## really really real
## so so real
## totally totally completely
## very very so
NOTE
Remember that this is only a tutorial! A proper study would have to take the syntactic context into account because, while we can say This really great tutorial helped me a lot., we probably would not say This so great tutorial helped me a lot. This is because so is syntactically more restricted and is strongly disfavored in attributive contexts. Therefore, the syntactic context would have to be considered in a more thorough study.
There are many more useful methods for identifying semantic similarity. A very useful method (which we have implemented here, but only superficially) is Semantic Vector Space Modeling. If you want to know more about this, this tutorial by Gede Primahadi Wijaya Rajeg, Karlina Denistia, and Simon Musgrave (Rajeg, Denistia, and Musgrave 2019) is highly recommended and will give you a better understanding of semantic vector space models, but the above should suffice to get you started.
Dictionaries commonly contain information about elements. Bilingual or translation dictionaries represent a sub-category of dictionaries that provide a specific type of information about a given word: the translation of that word into another language. In principle, generating translation dictionaries is relatively easy and straightforward. However, not only is the devil hidden in the details, but the generation of data-driven translation dictionaries also requires a substantial data set consisting of sentences and their translations. This is often quite tricky, as well-aligned translations are, unfortunately and unexpectedly, rather hard to come by.
Despite these issues, if you have access to clean and well-aligned parallel multilingual data, then you simply need to check which word in language B correlates most strongly with a given word in language A, and you have a likely candidate for its translation. The same procedure can be extended to generate multilingual dictionaries. Problems arise due to grammatical differences between languages, idiomatic expressions, homonymy and polysemy, as well as due to word class issues. The latter, word class issues, can be solved by part-of-speech tagging and then only considering words that belong to the same (or a realistically similar) part of speech. The other issues can also be solved but require substantial amounts of (annotated) data.
To explore how to generate a multilingual lexicon, we load a sample of English sentences and their German translations.
# load translations
translations <- readLines("https://slcladal.github.io/data/translation.txt",
encoding = "UTF-8", skipNul = T)
. |
Guten Tag! — Good day! |
Guten Morgen! — Good morning! |
Guten Abend! — Good evening! |
Hallo! — Hello! |
Wo kommst du her? — Where are you from? |
Woher kommen Sie? — Where are you from? |
Ich bin aus Hamburg. — I am from Hamburg. |
Ich komme aus Hamburg. — I come from Hamburg. |
Ich bin Deutscher. — I am German. |
Schön Sie zu treffen. — Pleasure to meet you! |
Wie lange lebst du schon in Brisbane? — How long have you been living in Brisbane? |
Leben Sie schon lange hier? — Have you been living here for long? |
Welcher Bus geht nach Brisbane? — Which bus goes to Brisbane? |
Von welchem Gleis aus fährt der Zug? — Which platform is the train leaving from? |
Ist dies der Bus nach Toowong? — Is this the bus going to Toowong? |
In a next step, we generate separate tables which hold the German and English sentences. The sentences and their translations are linked via a sentence identification number so that we retain the information about which sentence is linked to which translation.
# german sentences
german <- str_remove_all(translations, " — .*") %>%
str_remove_all(., "[:punct:]")
# english sentences
english <- str_remove_all(translations, ".* — ") %>%
str_remove_all(., "[:punct:]")
# sentence id
sentence <- 1:length(german)
# combine into table
germantb <- data.frame(sentence, german)
# sentence id
sentence <- 1:length(english)
# combine into table
englishtb <- data.frame(sentence, english)
sentence | german |
1 | Guten Tag |
2 | Guten Morgen |
3 | Guten Abend |
4 | Hallo |
5 | Wo kommst du her |
6 | Woher kommen Sie |
7 | Ich bin aus Hamburg |
8 | Ich komme aus Hamburg |
9 | Ich bin Deutscher |
10 | Schön Sie zu treffen |
11 | Wie lange lebst du schon in Brisbane |
12 | Leben Sie schon lange hier |
13 | Welcher Bus geht nach Brisbane |
14 | Von welchem Gleis aus fährt der Zug |
15 | Ist dies der Bus nach Toowong |
We now unnest the tokens (split the sentences into words) and subsequently add the translations which we again unnest. The resulting table consists of two columns holding German and English words. The relevant point here is that each German word is linked with each English word that occurs in the translated sentence.
library(plyr)
# tokenize by sentence: german
transtb <- germantb %>%
unnest_tokens(word, german) %>%
# add english data
plyr::join(., englishtb, by = "sentence") %>%
unnest_tokens(trans, english) %>%
dplyr::rename(german = word,
english = trans) %>%
dplyr::select(german, english) %>%
dplyr::mutate(german = factor(german),
english = factor(english))
german | english |
guten | good |
guten | day |
tag | good |
tag | day |
guten | good |
guten | morning |
morgen | good |
morgen | morning |
guten | good |
guten | evening |
abend | good |
abend | evening |
hallo | hello |
wo | where |
wo | are |
Based on this table, we can now generate a term-document matrix which shows how frequently each word co-occurred in the translation of any of the sentences. For instance, the German word alles occurred one time in a translation of a sentence which contained the English word all.
# tabulate data (create term-document matrix)
tdm <- ftable(transtb$german, transtb$english)
# extract amplifiers and adjectives
german <- as.vector(unlist(attr(tdm, "col.vars")[1]))
english <- as.vector(unlist(attr(tdm, "row.vars")[1]))
# attach row and column names to tdm
rownames(tdm) <- english
colnames(tdm) <- german
# inspect data
tdm[1:10, 1:10]
## a accident all am ambulance an and any anything are
## ab 0 0 0 0 0 0 0 0 0 0
## abend 0 0 0 0 0 0 0 0 0 0
## allem 0 0 0 0 0 0 0 0 0 0
## alles 0 0 1 0 0 0 0 0 0 0
## am 0 0 0 0 0 0 0 0 0 0
## an 0 0 0 0 0 0 0 0 0 0
## anderen 1 0 0 0 0 0 0 0 0 0
## apotheke 1 0 0 1 0 0 0 0 0 0
## arzt 1 0 0 0 0 0 0 0 0 0
## auch 3 0 0 0 0 0 0 0 1 0
Now, we reformat this co-occurrence matrix so that we have the frequency information that is necessary for setting up 2x2 contingency tables which we will use to calculate the co-occurrence strength between each word and its potential translation.
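To make the a, b, c, and d columns generated below more tangible, here is a minimal sketch of the 2x2 contingency table for the pair abend ~ evening, using the values shown in the table further below: a is the number of co-occurrences of the two words, b the occurrences of the German word without the English word, c the occurrences of the English word without the German word, and d all remaining co-occurrence counts.
# 2x2 contingency table for "abend" ~ "evening" (values taken from the table below)
abend_evening <- matrix(c(1, 1,      # a, b
                          1, 3501),  # c, d
                        ncol = 2, byrow = TRUE)
# association measures (the same tests are applied to every word pair below)
# (chisq.test may warn that the approximation is imprecise for such small counts)
fisher.test(abend_evening)$p.value
chisq.test(abend_evening)$statistic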
coocdf <- as.data.frame(as.matrix(tdm))
cooctb <- coocdf %>%
dplyr::mutate(German = rownames(coocdf)) %>%
tidyr::gather(English, TermCoocFreq,
colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]) %>%
dplyr::mutate(German = factor(German),
English = factor(English)) %>%
dplyr::mutate(AllFreq = sum(TermCoocFreq)) %>%
dplyr::group_by(German) %>%
dplyr::mutate(TermFreq = sum(TermCoocFreq)) %>%
dplyr::ungroup(German) %>%
dplyr::group_by(English) %>%
dplyr::mutate(CoocFreq = sum(TermCoocFreq)) %>%
dplyr::arrange(German) %>%
dplyr::mutate(a = TermCoocFreq,
b = TermFreq - a,
c = CoocFreq - a,
d = AllFreq - (a + b + c)) %>%
dplyr::mutate(NRows = nrow(coocdf))%>%
dplyr::filter(TermCoocFreq > 0)
German | English | TermCoocFreq | AllFreq | TermFreq | CoocFreq | a | b | c | d | NRows |
ab | departing | 1 | 3,504 | 5 | 5 | 1 | 4 | 4 | 3,495 | 215 |
ab | is | 1 | 3,504 | 5 | 116 | 1 | 4 | 115 | 3,384 | 215 |
ab | the | 1 | 3,504 | 5 | 125 | 1 | 4 | 124 | 3,375 | 215 |
ab | train | 1 | 3,504 | 5 | 16 | 1 | 4 | 15 | 3,484 | 215 |
ab | when | 1 | 3,504 | 5 | 27 | 1 | 4 | 26 | 3,473 | 215 |
abend | evening | 1 | 3,504 | 2 | 2 | 1 | 1 | 1 | 3,501 | 215 |
abend | good | 1 | 3,504 | 2 | 16 | 1 | 1 | 15 | 3,487 | 215 |
allem | döner | 1 | 3,504 | 5 | 10 | 1 | 4 | 9 | 3,490 | 215 |
allem | everything | 1 | 3,504 | 5 | 5 | 1 | 4 | 4 | 3,495 | 215 |
allem | one | 1 | 3,504 | 5 | 30 | 1 | 4 | 29 | 3,470 | 215 |
allem | please | 1 | 3,504 | 5 | 111 | 1 | 4 | 110 | 3,389 | 215 |
allem | with | 1 | 3,504 | 5 | 22 | 1 | 4 | 21 | 3,478 | 215 |
alles | all | 1 | 3,504 | 6 | 5 | 1 | 5 | 4 | 3,494 | 215 |
alles | for | 1 | 3,504 | 6 | 93 | 1 | 5 | 92 | 3,406 | 215 |
alles | no | 1 | 3,504 | 6 | 7 | 1 | 5 | 6 | 3,492 | 215 |
In a final step, we extract those potential translations that correlate most strongly with each given term. The results then form a list of words and their most likely translation.
translationtb <- cooctb %>%
dplyr::rowwise() %>%
dplyr::mutate(p = round(as.vector(unlist(fisher.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1])), 5),
x2 = round(as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1])), 3)) %>%
dplyr::mutate(phi = round(sqrt((x2/(a + b + c + d))), 3),
expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$expected[1]))) %>%
dplyr::filter(TermCoocFreq > expected) %>%
dplyr::arrange(-phi) %>%
dplyr::select(-AllFreq, -a, -b, -c, -d, -NRows, -expected)
German | English | TermCoocFreq | TermFreq | CoocFreq | p | x2 | phi |
hallo | hello | 1 | 1 | 1 | 0.00029 | 875.500 | 0.500 |
abend | evening | 1 | 2 | 2 | 0.00114 | 218.250 | 0.250 |
ja | yes | 1 | 2 | 2 | 0.00114 | 218.250 | 0.250 |
morgen | morning | 1 | 2 | 2 | 0.00114 | 218.250 | 0.250 |
tag | day | 1 | 2 | 2 | 0.00114 | 218.250 | 0.250 |
guten | good | 4 | 13 | 16 | 0.00000 | 201.086 | 0.240 |
brauche | need | 5 | 20 | 27 | 0.00000 | 124.215 | 0.188 |
nein | no | 2 | 9 | 7 | 0.00012 | 122.721 | 0.187 |
bier | beer | 2 | 8 | 8 | 0.00013 | 120.757 | 0.186 |
hamburg | hamburg | 2 | 8 | 8 | 0.00013 | 120.757 | 0.186 |
braucht | he | 1 | 3 | 3 | 0.00257 | 96.501 | 0.166 |
braucht | medication | 1 | 3 | 3 | 0.00257 | 96.501 | 0.166 |
braucht | needs | 1 | 3 | 3 | 0.00257 | 96.501 | 0.166 |
deutscher | german | 1 | 3 | 3 | 0.00257 | 96.501 | 0.166 |
er | he | 1 | 3 | 3 | 0.00257 | 96.501 | 0.166 |
The results show that even the very limited database used here can produce some very reasonable results. In fact, based on the data that we used, the first translations appear to be very sensible, but the mismatches also show that more data is required to disambiguate potential translations.
While this method still requires manual correction, it is a very handy and useful tool for generating custom bilingual dictionaries that can be extended to any set of languages as long as these languages can be represented as distinct words and as long as parallel data is available.
While it would go beyond the scope of this tutorial, it should be noted that this approach to creating dictionaries can also be applied to crowd-sourced dictionaries. To do this, you could, e.g., upload your dictionary to a Git repository such as GitHub or GitLab, which would then allow everybody with an account on either of these platforms to add content to the dictionary.
To add to the dictionary, contributors would simply have to fork the repository of the dictionary and then merge their changes with the existing, original dictionary repository. The quality of the data would meanwhile remain under the control of the owner of the original repository, who can decide on a case-by-case basis which changes they would like to accept. In addition, and because Git is a version control environment, the owner could also go back to previous versions if they think they erroneously accepted a change (merge).
This option is particularly interesting for the approach to creating dictionaries presented here because RStudio has an integrated and very easy to use pipeline to Git (see, e.g., here and here).
We have reached the end of this tutorial and you now know how to create, clean, and extend dictionaries in R, and how to determine semantic similarity to find synonyms and likely translations.
Schweinberger, Martin. 2021. Lexicography with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/lex.html (Version 2021.10.02).
@manual{schweinberger2021lex,
author = {Schweinberger, Martin},
title = {Lexicography with R},
note = {https://slcladal.github.io/lex.html},
year = {2021},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2021.10.02}
}
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
## [4] LC_NUMERIC=C LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] koRpus.lang.de_0.1-2 cluster_2.1.2 tm_0.7-8 NLP_0.2-1 coop_0.6-3
## [6] hunspell_3.0.1 koRpus.lang.en_0.1-4 koRpus_0.13-8 sylly_0.1-6 textdata_0.4.1
## [11] tidytext_0.3.1 plyr_1.8.6 flextable_0.6.8 forcats_0.5.1 stringr_1.4.0
## [16] dplyr_1.0.7 purrr_0.3.4 readr_2.0.1 tidyr_1.1.3 tibble_3.1.4
## [21] ggplot2_3.3.5 tidyverse_1.3.1 gutenbergr_0.2.1 quanteda_3.1.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.0 lubridate_1.7.10 bit64_4.0.5 httr_1.4.2 rprojroot_2.0.2 SnowballC_0.7.0
## [7] tools_4.1.1 backports_1.2.1 utf8_1.2.2 R6_2.5.1 DBI_1.1.1 lazyeval_0.2.2
## [13] colorspace_2.0-2 withr_2.4.2 tidyselect_1.1.1 bit_4.0.4 compiler_4.1.1 cli_3.0.1
## [19] rvest_1.0.1 xml2_1.3.2 officer_0.4.0 slam_0.1-48 scales_1.1.1 rappdirs_0.3.3
## [25] systemfonts_1.0.2 digest_0.6.27 rmarkdown_2.5 base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.2
## [31] dbplyr_2.1.1 fastmap_1.1.0 highr_0.9 rlang_0.4.11 readxl_1.3.1 rstudioapi_0.13
## [37] generics_0.1.0 jsonlite_1.7.2 vroom_1.5.5 zip_2.2.0 tokenizers_0.2.1 magrittr_2.0.1
## [43] Matrix_1.3-4 Rcpp_1.0.7 munsell_0.5.0 fansi_0.5.0 gdtools_0.2.3 lifecycle_1.0.1
## [49] stringi_1.7.4 yaml_2.2.1 grid_4.1.1 parallel_4.1.1 crayon_1.4.1 lattice_0.20-44
## [55] haven_2.4.3 hms_1.1.0 knitr_1.34 klippy_0.0.0.9500 pillar_1.6.3 uuid_0.1-4
## [61] stopwords_2.2 fastmatch_1.1-3 reprex_2.0.1.9000 glue_1.4.2 evaluate_0.14 data.table_1.14.0
## [67] RcppParallel_5.1.4 modelr_0.1.8 vctrs_0.3.8 tzdb_0.1.2 cellranger_1.1.0 gtable_0.3.0
## [73] assertthat_0.2.1 xfun_0.26 sylly.en_0.1-3 broom_0.7.9 janeaustenr_0.1.5 sylly.de_0.1-2
## [79] ellipsis_0.3.2 here_1.0.1
Agnes, Michael, Jonathan L Goldman, and Katherine Soltis. 2002. Webster’s New World Compact Desk Dictionary and Style Guide. Hungry Minds.
Amsler, Robert Alfred. 1981. The Structure of the Merriam-Webster Pocket Dictionary. Austin, TX: The University of Texas at Austin.
Bullinaria, J. A., and J. P. Levy. 2007. “Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study.” Behavior Research Methods 39: 510–26.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.
Mohammad, Saif M, and Peter D Turney. 2013. “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65.
Rajeg, Gede Primahadi Wijaya, Karlina Denistia, and Simon Musgrave. 2019. “R Markdown Notebook for Vector Space Model and the Usage Patterns of Indonesian Denominal Verbs.” https://doi.org/10.6084/m9.figshare.9970205. https://figshare.com/articles/R_Markdown_Notebook_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/9970205.
Schmid, Helmut. 1994. “TreeTagger - a Language Independent Part-of-Speech Tagger.” http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
———. 2013. “Probabilistic Part-of-Speech Tagging Using Decision Trees.” In New Methods in Language Processing, 154.
Schmid, Helmut, M Baroni, E Zanchetta, and A Stein. 2007. “The Enriched Treetagger System.” In Proceedings of the Evalita 2007 Workshop.
Steiner, Roger J. 1985. “Dictionaries. The Art and Craft of Lexicography.” Dictionaries: Journal of the Dictionary Society of North America 7 (1): 294–300.