This tutorial shows how to download and clean works from the Project Gutenberg archive using R. Project Gutenberg is a database which contains roughly 60,000 texts for which the US copyright has expired. The entire R-markdown document for the sections below can be downloaded here.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed these packages, you can skip this section. Otherwise, run the following code; installing all of the packages may take between 1 and 5 minutes, so there is no need to worry if it takes some time.
# install libraries
install.packages("tidyverse")
install.packages("gutenbergr")
install.packages("DT")
install.packages("flextable")
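# install klippy for copy-to-clipboard button in code chunks
# (install_github comes from the remotes package, so install that first)
install.packages("remotes")
remotes::install_github("rlesur/klippy")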
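If some of these packages are already present, reinstalling them is unnecessary. The sketch below is one possible way to check which CRAN packages are still missing before installing anything. The helper name `find_missing` is made up for illustration, and the installation step is guarded so that it only runs in an interactive session.

```r
# CRAN packages used in this tutorial (klippy comes from GitHub and is handled separately)
pkgs <- c("tidyverse", "gutenbergr", "DT", "flextable")

# hypothetical helper: return the packages that are not in the local library
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing(pkgs)

# install only what is missing, and only when running interactively
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

When everything is already installed, `missing_pkgs` is an empty character vector and nothing is downloaded.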
Now that we have installed the packages, we activate them as shown below.
# set options
options(stringsAsFactors = FALSE) # no automatic conversion of strings to factors
options("scipen" = 100, "digits" = 4) # suppress scientific notation
# activate packages
library(tidyverse)
library(gutenbergr)
library(DT)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.
In a first step, we inspect which works are available for download. We can do this by typing gutenberg_works() (which by default lists English-language, public-domain works that have text) or simply gutenberg_metadata (the full table) into the console, which will output a table containing the available texts.
gutenberg_metadata
The table below shows the first 15 rows of this overview table. As there are currently 51,997 texts available, we limit the output here to 15 rows.
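A 15-row overview of this kind could be produced along the following lines. This is a minimal sketch, not necessarily the code behind the original table; it assumes that gutenbergr is installed and relies on the column names of its bundled `gutenberg_metadata` table. The name `overview` is chosen for illustration.

```r
library(dplyr)
library(gutenbergr)

# gutenberg_metadata ships with gutenbergr, so no download is needed here
overview <- gutenberg_metadata %>%
  # keep a few descriptive columns for readability
  select(gutenberg_id, title, author, language) %>%
  # limit the output to the first 15 rows
  head(15)

overview
```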
To find all works by a specific author, you need to specify the author in the gutenberg_works function as shown below.
# load data
darwin <- gutenberg_works(author == "Darwin, Charles")
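The exact match above requires the author string in the "Surname, Forename" form used by the metadata. When the exact spelling is unknown, a partial match may help. The sketch below assumes stringr is available (it is part of the tidyverse); `darwin_any` is a name chosen for illustration.

```r
library(gutenbergr)
library(stringr)

# match every author entry that contains "Darwin";
# this can also return other authors named Darwin, not only Charles
darwin_any <- gutenberg_works(str_detect(author, "Darwin"))

# list the distinct author strings that were matched
unique(darwin_any$author)
```

Inspecting the matched author strings first makes it easier to pick the exact value for a subsequent `author == ...` filter.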
To find all texts in, for example, German, you need to specify the language in the gutenberg_works function as shown below.
# count the available German texts
gutenberg_works(languages = "de", all_languages = TRUE) %>%
  dplyr::count(language)
## # A tibble: 1 x 2
## language n
## <chr> <int>
## 1 de 1342
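To see which language codes occur in the metadata, and therefore which values could be passed to the languages argument, one option is to count the language column of the full table. This is a sketch under the assumption that gutenbergr is installed; the counts depend on the installed package version, so no output is shown. As far as the package documentation goes, works in several languages carry combined codes separated by a slash.

```r
library(dplyr)
library(gutenbergr)

# count how often each language code occurs, most frequent first
language_counts <- gutenberg_metadata %>%
  count(language, sort = TRUE)

# show the ten most frequent language codes
head(language_counts, 10)
```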
To download any of these texts, you first need to identify the text you want, e.g. by specifying its title. In a next step, you can then use the gutenberg_download function to download the text. To exemplify how this works, we download William Shakespeare’s Romeo and Juliet.
# load data
romeo <- gutenberg_works(title == "Romeo and Juliet") %>%
  gutenberg_download(meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
gutenberg_id | text | title |
1,513 | THE TRAGEDY OF ROMEO AND JULIET | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | by William Shakespeare | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | Contents | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | THE PROLOGUE. | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | ACT I | Romeo and Juliet |
1,513 | Scene I. A public place. | Romeo and Juliet |
1,513 | Scene II. A Street. | Romeo and Juliet |
1,513 | Scene III. Room in Capulet’s House. | Romeo and Juliet |
We could also use the gutenberg_id to download this text.
# load data
romeo <- gutenberg_works(gutenberg_id == 1513) %>%
  gutenberg_download(meta_fields = "gutenberg_id")
gutenberg_id | text |
1,513 | THE TRAGEDY OF ROMEO AND JULIET |
1,513 | |
1,513 | |
1,513 | |
1,513 | by William Shakespeare |
1,513 | |
1,513 | |
1,513 | Contents |
1,513 | |
1,513 | THE PROLOGUE. |
1,513 | |
1,513 | ACT I |
1,513 | Scene I. A public place. |
1,513 | Scene II. A Street. |
1,513 | Scene III. Room in Capulet’s House. |
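As the table shows, the downloaded text contains many empty lines. A simple first cleaning step is to trim whitespace and drop those lines. The sketch below uses a small stand-in tibble (`romeo_demo`, built from the rows shown above) so that it runs without a download; the same pipeline should apply to the `romeo` object. It assumes dplyr, stringr and tibble, which are all part of the tidyverse.

```r
library(dplyr)
library(stringr)

# stand-in for the tibble returned by gutenberg_download() (first rows only)
romeo_demo <- tibble::tibble(
  gutenberg_id = 1513L,
  text = c("THE TRAGEDY OF ROMEO AND JULIET", "", "", "",
           "by William Shakespeare", "", "", "Contents", "",
           "THE PROLOGUE.", "", "ACT I")
)

romeo_clean <- romeo_demo %>%
  # remove leading/trailing whitespace and collapse repeated spaces
  mutate(text = str_squish(text)) %>%
  # drop lines that are empty after trimming
  filter(text != "")

romeo_clean
```

Whether empty lines should be removed depends on the later analysis; for verse or drama they may mark stanza or speech boundaries that are worth keeping.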
To load more than one text, you can pass a vector of IDs to the gutenberg_download function. Below, we download the text with the gutenberg_id 768 (Wuthering Heights, by Emily Brontë) and the text with the gutenberg_id 1260 (Jane Eyre, by Charlotte Brontë).
texts <- gutenberg_download(c(768, 1260), meta_fields = "title",
mirror = "http://mirrors.xmission.com/gutenberg/")
## Text NumberOfLines
## 1 Wuthering Heights 12314
## 2 Jane Eyre 21001
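A summary like the one above, with one row per text and its number of lines, could be computed by counting rows per title. The following is a sketch under assumptions: `count_lines` is a hypothetical helper, and `texts_demo` is a tiny stand-in with the same columns as `texts`, so its counts are not the real ones.

```r
library(dplyr)

# hypothetical helper: number of lines (rows) per title
count_lines <- function(texts) {
  texts %>%
    count(title, name = "NumberOfLines") %>%
    rename(Text = title)
}

# stand-in with the same columns as `texts` (demo rows only)
texts_demo <- tibble::tibble(
  gutenberg_id = c(768L, 768L, 1260L),
  text = c("line one", "line two", "line one"),
  title = c("Wuthering Heights", "Wuthering Heights", "Jane Eyre")
)

line_counts <- count_lines(texts_demo)
line_counts
```

Applied to the downloaded data, the call would be `count_lines(texts)`.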
Feel free to have a look at the other texts provided by Project Gutenberg!
Schweinberger, Martin. 2021. Downloading Texts from Project Gutenberg using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/gutenberg.html (Version 2021.10.02).
@manual{schweinberger2021gb,
author = {Schweinberger, Martin},
title = {Downloading Texts from Project Gutenberg using R},
note = {https://slcladal.github.io/gutenberg.html},
year = {2021},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2021.10.02}
}
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
## [4] LC_NUMERIC=C LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] flextable_0.6.8 DT_0.19 gutenbergr_0.2.1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
## [7] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3 tibble_3.1.4 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 bit64_4.0.5 vroom_1.5.5 jsonlite_1.7.2 modelr_0.1.8 assertthat_0.2.1
## [7] triebeard_0.3.0 urltools_1.7.3 highr_0.9 cellranger_1.1.0 yaml_2.2.1 gdtools_0.2.3
## [13] pillar_1.6.3 backports_1.2.1 glue_1.4.2 uuid_0.1-4 digest_0.6.27 rvest_1.0.1
## [19] colorspace_2.0-2 htmltools_0.5.2 pkgconfig_2.0.3 broom_0.7.9 haven_2.4.3 scales_1.1.1
## [25] officer_0.4.0 tzdb_0.1.2 generics_0.1.0 ellipsis_0.3.2 withr_2.4.2 klippy_0.0.0.9500
## [31] lazyeval_0.2.2 cli_3.0.1 magrittr_2.0.1 crayon_1.4.1 readxl_1.3.1 evaluate_0.14
## [37] fs_1.5.0 fansi_0.5.0 xml2_1.3.2 tools_4.1.1 data.table_1.14.0 hms_1.1.0
## [43] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1.9000 zip_2.2.0 compiler_4.1.1 jquerylib_0.1.4
## [49] systemfonts_1.0.2 rlang_0.4.11 grid_4.1.1 rstudioapi_0.13 htmlwidgets_1.5.4 crosstalk_1.1.1
## [55] base64enc_0.1-3 rmarkdown_2.5 gtable_0.3.0 curl_4.3.2 DBI_1.1.1 R6_2.5.1
## [61] lubridate_1.7.10 knitr_1.34 fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 stringi_1.7.4
## [67] parallel_4.1.1 Rcpp_1.0.7 vctrs_0.3.8 dbplyr_2.1.1 tidyselect_1.1.1 xfun_0.26