This tutorial introduces regular expressions and how they can be used when working with language data. The entire R markdown document for the sections below can be downloaded here.
How can you search texts for complex patterns or combinations of patterns? This question will be answered in this tutorial, and by the end you will be able to perform very complex searches yourself. The key concept of this tutorial is the regular expression. A regular expression (also called regex or regexp for short) is a special sequence of characters (or string) that describes a search pattern. You can think of regular expressions as very powerful combinations of wildcards, or as wildcards on steroids.
If you would like to go deeper into regular expressions, I can recommend Friedl (2006) and, in particular, chapter 17 of Peng (2020) for further study (although the latter uses base R rather than tidyverse functions, this does not affect the utility of its discussion of regular expressions in any meaningful way). Also, here is a so-called cheatsheet about regular expressions written by Ian Kopacka and provided by RStudio.
Preparation and session set up
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries), so do not worry if it takes a while.
# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages("tidyverse")
install.packages("flextable")
# install remotes (required to install packages from GitHub)
install.packages("remotes")
# install klippy for copy-to-clipboard button in code chunks
remotes::install_github("rlesur/klippy")
Next, we load the packages.
library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed RStudio and have initiated the session by executing the code shown above, you are good to go.
To put regular expressions into practice, we need some text to perform our searches on. In this tutorial, we will use texts about grammar from Wikipedia.
# read in first text
text1 <- readLines("https://slcladal.github.io/data/testcorpus/linguistics02.txt")
et <- paste(text1, sep = " ", collapse = " ")
# inspect example text
et
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
In addition, we will split the example text into words to have another resource we can use to understand regular expressions.
# split example text
set <- str_split(et, " ") %>%
unlist()
# inspect
head(set)
## [1] "Grammar" "is" "a" "system" "of" "rules"
Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.
There are three basic types of regular expressions:
regular expressions that stand for individual symbols and determine frequencies
regular expressions that stand for classes of symbols
regular expressions that stand for structural properties
The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.
RegEx Symbol/Sequence | Explanation | Example |
--- | --- | --- |
? | The preceding item is optional and will be matched at most once | walk[a-z]? = walk, walks |
* | The preceding item will be matched zero or more times | walk[a-z]* = walk, walks, walked, walking |
+ | The preceding item will be matched one or more times | walk[a-z]+ = walks, walked, walking |
{n} | The preceding item is matched exactly n times | walk[a-z]{2} = walked |
{n,} | The preceding item is matched n or more times | walk[a-z]{2,} = walked, walking |
{n,m} | The preceding item is matched at least n times, but not more than m times | walk[a-z]{2,3} = walked, walking |
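To see the quantifiers in action, here is a small sketch using str_detect from the stringr package (loaded with the tidyverse); the word vector is made up for illustration, and the patterns are anchored with ^ and $ so that the whole word must match.

```r
library(stringr)

# a toy vector of word forms (made up for illustration)
words <- c("walk", "walks", "walked", "walking", "talk")

# zero or more lower case letters after "walk": all walk-forms match
str_detect(words, "^walk[a-z]*$")
# -> TRUE TRUE TRUE TRUE FALSE

# two or more lower case letters after "walk": only walked and walking match
str_detect(words, "^walk[a-z]{2,}$")
# -> FALSE FALSE TRUE TRUE FALSE
```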
The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.
RegEx Symbol/Sequence | Explanation |
--- | --- |
[ab] | lower case a and b |
[a-z] | all lower case characters from a to z |
[AB] | upper case A and B |
[A-Z] | all upper case characters from A to Z |
[12] | digits 1 and 2 |
[0-9] | digits: 0 1 2 3 4 5 6 7 8 9 |
[:digit:] | digits: 0 1 2 3 4 5 6 7 8 9 |
[:lower:] | lower case characters: a–z |
[:upper:] | upper case characters: A–Z |
[:alpha:] | alphabetic characters: a–z and A–Z |
[:alnum:] | digits and alphabetic characters |
[:punct:] | punctuation characters: . , ; etc. |
[:graph:] | graphical characters: [:alnum:] and [:punct:] |
[:blank:] | blank characters: Space and tab |
[:space:] | space characters: Space, tab, newline, and other space characters |
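As a quick sketch of how these classes behave on the first few words of the split example text (note that a POSIX class such as [:lower:] must itself be wrapped in brackets when used in a pattern, e.g. [[:lower:]]):

```r
library(stringr)

x <- c("Grammar", "is", "a", "system", "of", "rules")

# words containing at least one upper case character
str_detect(x, "[A-Z]")
# -> TRUE FALSE FALSE FALSE FALSE FALSE

# words consisting only of lower case characters
str_detect(x, "^[[:lower:]]+$")
# -> FALSE TRUE TRUE TRUE TRUE TRUE
```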
The regular expressions that denote classes of symbols are enclosed in square brackets and colons; note that, when used in a pattern, such a class must itself be placed within an additional pair of square brackets, e.g. [[:digit:]] rather than [:digit:]. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.
RegEx Symbol/Sequence | Explanation |
--- | --- |
\\w | Word characters: [[:alnum:]_] |
\\W | No word characters: [^[:alnum:]_] |
\\s | Space characters: [[:space:]] |
\\S | No space characters: [^[:space:]] |
\\d | Digits: [[:digit:]] |
\\D | No digits: [^[:digit:]] |
\\b | Word boundary |
\\B | No word boundary |
\\< | Word beginning (base R regex) |
\\> | Word end (base R regex) |
^ | Beginning of a string |
$ | End of a string |
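A brief sketch of the structural expressions, again using stringr functions on a made-up string:

```r
library(stringr)

s <- "Grammar is a system of 10 rules."

# \\b marks a word boundary: match "is" as a word, not inside another word
str_extract_all(s, "\\bis\\b")[[1]]
# -> "is"

# \\d matches digits
str_extract(s, "\\d+")
# -> "10"

# ^ anchors a pattern to the beginning of a string
str_detect(s, "^Grammar")
# -> TRUE
```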
In this section, we will explore how to use regular expressions. At the end, we will go through some exercises to help you understand how you can best utilize regular expressions.
Show all words in the split example text that contain a or n.
set[str_detect(set, "[an]")]
## [1] "Grammar" "a" "governs" "production" "and"
## [6] "utterances" "in" "a" "given" "language."
## [11] "apply" "sound" "as" "as" "meaning,"
## [16] "and" "include" "componential" "as" "pertaining"
## [21] "phonology" "organisation" "phonetic" "sound" "formation"
## [26] "and" "composition" "and" "syntax" "formation"
## [31] "and" "composition" "phrases" "and" "sentences)."
## [36] "Many" "modern" "that" "deal" "principles"
## [41] "grammar" "are" "based" "on" "Noam"
## [46] "framework" "generative" "linguistics."
Show all words in the split example text that begin with a lower case a.
set[str_detect(set, "^a")]
## [1] "a" "and" "a" "apply" "as" "as" "and" "as" "and"
## [10] "and" "and" "and" "are"
Show all words in the split example text that end in a lower case s.
set[str_detect(set, "s$")]
## [1] "is" "rules" "governs" "utterances" "rules"
## [6] "as" "as" "subsets" "as" "phrases"
## [11] "theories" "principles" "Chomsky's"
Show all words in the split example text in which there is an e, then any other character, and then another n.
set[str_detect(set, "e.n")]
## [1] "governs" "meaning," "modern"
Show all words in the split example text in which there is an e, then two other characters, and then another n.
set[str_detect(set, "e.{2,2}n")]
## [1] "utterances"
Show all words that consist of exactly three alphabetical characters in the split example text.
set[str_detect(set, "^[:alpha:]{3,3}$")]
## [1] "the" "and" "use" "and" "and" "and" "and" "and" "the" "are"
Show all words that consist of six or more alphabetical characters in the split example text.
set[str_detect(set, "^[:alpha:]{6,}$")]
## [1] "Grammar" "system" "governs" "production" "utterances"
## [6] "include" "componential" "subsets" "pertaining" "phonology"
## [11] "organisation" "phonetic" "morphology" "formation" "composition"
## [16] "syntax" "formation" "composition" "phrases" "modern"
## [21] "theories" "principles" "grammar" "framework" "generative"
Replace all lower case a's with upper case E's in the example text.
str_replace_all(et, "a", "E")
## [1] "GrEmmEr is E system of rules which governs the production End use of utterEnces in E given lEnguEge. These rules Epply to sound Es well Es meEning, End include componentiEl subsets of rules, such Es those pertEining to phonology (the orgEnisEtion of phonetic sound systems), morphology (the formEtion End composition of words), End syntEx (the formEtion End composition of phrEses End sentences). MEny modern theories thEt deEl with the principles of grEmmEr Ere bEsed on NoEm Chomsky's frEmework of generEtive linguistics."
Remove all non-word characters (punctuation and other symbols) in the split example text.
str_remove_all(set, "\\W")
## [1] "Grammar" "is" "a" "system" "of"
## [6] "rules" "which" "governs" "the" "production"
## [11] "and" "use" "of" "utterances" "in"
## [16] "a" "given" "language" "These" "rules"
## [21] "apply" "to" "sound" "as" "well"
## [26] "as" "meaning" "and" "include" "componential"
## [31] "subsets" "of" "rules" "such" "as"
## [36] "those" "pertaining" "to" "phonology" "the"
## [41] "organisation" "of" "phonetic" "sound" "systems"
## [46] "morphology" "the" "formation" "and" "composition"
## [51] "of" "words" "and" "syntax" "the"
## [56] "formation" "and" "composition" "of" "phrases"
## [61] "and" "sentences" "Many" "modern" "theories"
## [66] "that" "deal" "with" "the" "principles"
## [71] "of" "grammar" "are" "based" "on"
## [76] "Noam" "Chomskys" "framework" "of" "generative"
## [81] "linguistics"
Remove all white spaces in the example text.
str_remove_all(et, " ")
## [1] "Grammarisasystemofruleswhichgovernstheproductionanduseofutterancesinagivenlanguage.Theserulesapplytosoundaswellasmeaning,andincludecomponentialsubsetsofrules,suchasthosepertainingtophonology(theorganisationofphoneticsoundsystems),morphology(theformationandcompositionofwords),andsyntax(theformationandcompositionofphrasesandsentences).ManymoderntheoriesthatdealwiththeprinciplesofgrammararebasedonNoamChomsky'sframeworkofgenerativelinguistics."
Highlighting patterns
We use the str_view and str_view_all functions to show the occurrences of regular expressions in the example text.
To begin with, we match an exactly defined pattern (ang).
str_view_all(et, "ang")
Now, we include ., which stands for any symbol (except a newline symbol).
str_view_all(et, ".n.")
EXERCISE TIME!

The pattern [Ww][Aa][Ll][Kk].* matches walk in any combination of upper and lower case letters, followed by any number of characters.

More exercises will follow - bear with us ;)
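As a quick check, this sketch applies the exercise pattern [Ww][Aa][Ll][Kk].* to a few made-up word forms:

```r
library(stringr)

# the pattern matches "walk" in any case, followed by any characters
str_detect(c("Walking", "walked", "WALK", "talk"), "[Ww][Aa][Ll][Kk].*")
# -> TRUE TRUE TRUE FALSE
```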
Schweinberger, Martin. 2021. Regular Expressions in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/regex.html (Version 2021.09.28).
@manual{schweinberger2021regex,
author = {Schweinberger, Martin},
title = {Regular Expressions in R},
note = {https://slcladal.github.io/regex.html},
year = {2021},
organization = {The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2021.09.28}
}
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] flextable_0.6.8 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
## [5] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3 tibble_3.1.4
## [9] ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 lubridate_1.7.10 assertthat_0.2.1 digest_0.6.27
## [5] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.2.1
## [9] reprex_2.0.1.9000 evaluate_0.14 httr_1.4.2 highr_0.9
## [13] pillar_1.6.2 gdtools_0.2.3 rlang_0.4.11 uuid_0.1-4
## [17] readxl_1.3.1 rstudioapi_0.13 data.table_1.14.0 klippy_0.0.0.9500
## [21] rmarkdown_2.5 htmlwidgets_1.5.4 munsell_0.5.0 broom_0.7.9
## [25] compiler_4.1.1 modelr_0.1.8 xfun_0.26 pkgconfig_2.0.3
## [29] systemfonts_1.0.2 base64enc_0.1-3 htmltools_0.5.2 tidyselect_1.1.1
## [33] fansi_0.5.0 crayon_1.4.1 tzdb_0.1.2 dbplyr_2.1.1
## [37] withr_2.4.2 grid_4.1.1 jsonlite_1.7.2 gtable_0.3.0
## [41] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
## [45] zip_2.2.0 cli_3.0.1 stringi_1.7.4 fs_1.5.0
## [49] xml2_1.3.2 ellipsis_0.3.2 generics_0.1.0 vctrs_0.3.8
## [53] tools_4.1.1 glue_1.4.2 officer_0.4.0 hms_1.1.0
## [57] fastmap_1.1.0 yaml_2.2.1 colorspace_2.0-2 rvest_1.0.1
## [61] knitr_1.34 haven_2.4.3
Friedl, Jeffrey E. F. 2006. Mastering Regular Expressions. Sebastopol, CA: O'Reilly Media.
Peng, Roger D. 2020. R Programming for Data Science. Leanpub. https://bookdown.org/rdpeng/rprogdatascience/.