Introduction

This tutorial shows how to convert PDFs to plain text (txt) files with R. The R Notebook for this tutorial can be downloaded here.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed these packages, you can skip this section. Otherwise, run the following code to install them. Installation may take between 1 and 5 minutes, so there is no need to worry if it takes some time.

# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)
# install packages
install.packages(c("pdftools", "dplyr", "stringr", "httr", "jsonlite"))

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.
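
If you rerun the set-up code regularly, reinstalling every package each time is unnecessary. The following is a minimal sketch, not part of the original tutorial, that installs only those packages that are not yet available. The helper name missing_packages is hypothetical and introduced here purely for illustration.

```r
# hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# packages used in this tutorial
needed <- c("pdftools", "dplyr", "stringr", "httr", "jsonlite")
to_install <- missing_packages(needed)

# install only what is missing (does nothing if everything is already there)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Packages that are already installed are left untouched, so the code is cheap to run again at the start of each session.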

How to use the R Notebook for this tutorial

To follow this tutorial interactively (by using the R Notebook), follow the instructions listed below.

  1. Create a folder somewhere on your computer
  2. Download the R Notebook and save it in the folder you have just created
  3. Open RStudio
  4. Click on File in the upper left corner of the RStudio interface
  5. Click on New Project...
  6. Select Existing Directory
  7. Browse to the folder you have just created and click on Open
  8. Now click on Files above the lower right panel
  9. Click on the file convertpdf2txt.Rmd
  • The Markdown file of this tutorial should now be open in the upper left panel of RStudio. To execute the code, simply click on the green arrows in the top right corner of the code boxes.
  • To render a PDF of this tutorial, simply click on Knit above the upper left panel in RStudio.

Converting PDFs into txt files

Now, we load the packages.

# activate packages
library(pdftools)
library(dplyr)
library(stringr)

Next, we define a path and convert the PDF located at that path into text.

# you can use a URL or a path that leads to a PDF document
pdf_path <- "https://slcladal.github.io/data/PDFs/pdf0.pdf"
# extract text
txt_output <- pdftools::pdf_text(pdf_path) %>%
  paste(sep = " ") %>%
  stringr::str_replace_all(fixed("\n"), " ") %>%
  stringr::str_replace_all(fixed("\r"), " ") %>%
  stringr::str_replace_all(fixed("\t"), " ") %>%
  stringr::str_replace_all(fixed("\""), " ") %>%
  paste(sep = " ", collapse = " ") %>%
  stringr::str_squish() %>%
  stringr::str_replace_all("- ", "") 
# inspect
str(txt_output)
##  chr "Corpus linguistics Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguis"| __truncated__
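
To see what the individual cleaning steps do, it can help to run them on a small, made-up example instead of a real PDF. The sketch below assumes a hypothetical two-element vector raw_pages that stands in for the output of pdf_text() (one element per page) and applies the same replacements one step at a time.

```r
library(stringr)

# made-up stand-in for the output of pdftools::pdf_text(): one string per page
raw_pages <- c("Corpus lin-\nguistics is the\tstudy of \"language\"",
               "as expressed\r\nin corpora.")

# replace line breaks, carriage returns, tabs, and double quotes with spaces
step1 <- str_replace_all(raw_pages, fixed("\n"), " ")
step2 <- str_replace_all(step1, fixed("\r"), " ")
step3 <- str_replace_all(step2, fixed("\t"), " ")
step4 <- str_replace_all(step3, fixed("\""), " ")
# combine all pages into a single string
step5 <- paste(step4, collapse = " ")
# collapse repeated whitespace and trim both ends
step6 <- str_squish(step5)
# rejoin words that were hyphenated across line breaks
clean <- str_replace_all(step6, "- ", "")

clean
## [1] "Corpus linguistics is the study of language as expressed in corpora."
```

Note that the last step removes every hyphen that is followed by a space, so an intentional sequence such as "pre- and post-test" would also be joined. Whether this trade-off is acceptable depends on your texts.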

To convert many PDF files, we write a function that performs the conversion for every PDF in a folder.

convertpdf2txt <- function(dirpath){
  # list all PDF files in the folder (with their full paths)
  files <- list.files(dirpath, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
  # extract and clean the text of each PDF
  txts <- sapply(files, function(f){
    pdftools::pdf_text(f) %>%
      paste(sep = " ") %>%
      stringr::str_replace_all(fixed("\n"), " ") %>%
      stringr::str_replace_all(fixed("\r"), " ") %>%
      stringr::str_replace_all(fixed("\t"), " ") %>%
      stringr::str_replace_all(fixed("\""), " ") %>%
      paste(sep = " ", collapse = " ") %>%
      stringr::str_squish() %>%
      stringr::str_replace_all("- ", "")
  })
  # return a named character vector (one element per PDF)
  return(txts)
}

We can now apply the function to the folder in which we have stored the PDFs. The output is a vector with the texts of the PDFs.

# apply function
txts <- convertpdf2txt("data/PDFs/")
# inspect the structure of the txts element
str(txts)
##  Named chr [1:4] "Corpus linguistics Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguis"| __truncated__ ...
##  - attr(*, "names")= chr [1:4] "data/PDFs/pdf0.pdf" "data/PDFs/pdf1.pdf" "data/PDFs/pdf2.pdf" "data/PDFs/pdf3.pdf"

The output of the str() function shows that we have extracted the text of 4 PDFs.
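
Before saving the texts, a quick overview of their length can reveal problems, for example PDFs that consist only of scanned images and therefore tend to yield little or no text. The following sketch is an illustration under assumptions: txts_demo is a made-up stand-in for the txts vector, and the column names are arbitrary.

```r
library(stringr)

# made-up stand-in for the named character vector returned by convertpdf2txt()
txts_demo <- c("data/PDFs/a.pdf" = "Corpus linguistics is fun",
               "data/PDFs/b.pdf" = "")

# one row per document: file name, number of characters, number of words
overview <- data.frame(
  file       = basename(names(txts_demo)),
  characters = unname(nchar(txts_demo)),
  words      = str_count(txts_demo, "\\S+"),
  row.names  = NULL
)
# flag documents for which no text was extracted
overview$empty <- overview$characters == 0

overview
```

To use this with your own data, replace txts_demo with txts. Documents flagged as empty may need optical character recognition rather than plain text extraction.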

Saving the txt-files on your computer

To save the txt files on your disk, replace the predefined location (“D:\Uni\UQ\SLC\LADAL\SLCLADAL.github.io\data/”) with the folder where you want to store the txt files and then execute the code below. We also name the elements of txts “text” plus their number.

# add names to txt files
names(txts) <- paste0("text", seq_along(txts))
# save result to disk (replace the path below with your own folder)
lapply(seq_along(txts), function(i) writeLines(text = unlist(txts[i]),
    con = paste0("D:\\Uni\\UQ\\SLC\\LADAL\\SLCLADAL.github.io\\data/", names(txts)[i], ".txt")))
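
A hard-coded Windows path only works on the machine it was written for. As a more portable alternative, the sketch below builds the output paths with file.path() and creates the output folder if it does not exist yet. The folder name txt_output, its location in the temporary directory, and the txts_demo vector are assumptions chosen for illustration.

```r
# made-up stand-in for the named txts vector
txts_demo <- c(text1 = "first text", text2 = "second text")

# hypothetical output folder; replace with the folder of your choice
out_dir <- file.path(tempdir(), "txt_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# one output path per text, e.g. <out_dir>/text1.txt
out_files <- file.path(out_dir, paste0(names(txts_demo), ".txt"))

# write each text to its own file
for (i in seq_along(txts_demo)) {
  writeLines(text = txts_demo[[i]], con = out_files[i])
}

# check that the files were written
file.exists(out_files)
```

Because file.path() inserts the separator itself, the same code runs unchanged on Windows, macOS, and Linux.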

Citation & Session Info

Schweinberger, Martin. 2020. Converting PDFs to txt files with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/convertpdf2txt.html (Version 2020.12.03).

@manual{schweinberger2020conv,
  author = {Schweinberger, Martin},
  title = {Converting PDFs to txt files with R},
  note = {https://slcladal.github.io/convertpdf2txt.html},
  year = {2020},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/12/03}
}

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] pdftools_2.3.1        collostructions_0.1.2 igraph_1.2.6          GGally_2.0.0          network_1.16.1       
##  [6] ggdendro_0.1.22       slam_0.1-47           Matrix_1.2-18         tm_0.7-7              NLP_0.2-1            
## [11] tidytext_0.2.6        quanteda_2.1.2        gplots_3.1.0          FactoMineR_2.3        exact2x2_1.6.5       
## [16] exactci_1.3-3         ssanv_1.1             vcd_1.4-8             ape_5.4-1             pvclust_2.2-0        
## [21] NbClust_3.0           seriation_1.2-9       factoextra_1.0.7      cluster_2.1.0         cfa_0.10-0           
## [26] gridExtra_2.3         fGarch_3042.83.2      fBasics_3042.89.1     timeSeries_3062.100   timeDate_3043.102    
## [31] e1071_1.7-4           ggpubr_0.4.0          flextable_0.5.11      forcats_0.5.0         stringr_1.4.0        
## [36] dplyr_1.0.2           purrr_0.3.4           readr_1.4.0           tidyr_1.1.2           tibble_3.0.4         
## [41] ggplot2_3.3.3         tidyverse_1.3.0       DT_0.16               kableExtra_1.3.1      knitr_1.30           
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1         uuid_0.1-4           backports_1.1.10     fastmatch_1.1-0      systemfonts_0.3.2    plyr_1.8.6          
##   [7] crosstalk_1.1.0.1    SnowballC_0.7.0      usethis_1.6.3        digest_0.6.27        foreach_1.5.1        htmltools_0.5.0     
##  [13] fansi_0.4.1          rle_0.9.2            magrittr_1.5         openxlsx_4.2.3       sna_2.6              modelr_0.1.8        
##  [19] RcppParallel_5.0.2   officer_0.3.15       askpass_1.1          colorspace_1.4-1     rvest_0.3.6          ggrepel_0.8.2       
##  [25] haven_2.3.1          xfun_0.19            crayon_1.3.4         jsonlite_1.7.1       zoo_1.8-8            iterators_1.0.13    
##  [31] glue_1.4.2           registry_0.5-1       stopwords_2.0        gtable_0.3.0         webshot_0.5.2        car_3.0-10          
##  [37] abind_1.4-5          scales_1.1.1         qpdf_1.1             DBI_1.1.0            rstatix_0.6.0        Rcpp_1.0.5          
##  [43] viridisLite_0.3.0    flashClust_1.01-2    foreign_0.8-80       htmlwidgets_1.5.3    httr_1.4.2           RColorBrewer_1.1-2  
##  [49] ellipsis_0.3.1       spatial_7.3-12       reshape_0.8.8        pkgconfig_2.0.3      farver_2.0.3         dbplyr_2.0.0        
##  [55] utf8_1.1.4           tidyselect_1.1.0     labeling_0.4.2       rlang_0.4.8          reshape2_1.4.4       munsell_0.5.0       
##  [61] cellranger_1.1.0     tools_4.0.3          cli_2.1.0            generics_0.1.0       statnet.common_4.4.1 broom_0.7.2         
##  [67] evaluate_0.14        yaml_2.2.1           fs_1.5.0             zip_2.1.1            caTools_1.18.0       nlme_3.1-149        
##  [73] leaps_3.1            xml2_1.3.2           tokenizers_0.2.1     compiler_4.0.3       rstudioapi_0.11      curl_4.3            
##  [79] ggsignif_0.6.1       reprex_0.3.0         stringi_1.5.3        highr_0.8            gdtools_0.2.2        lattice_0.20-41     
##  [85] vctrs_0.3.4          pillar_1.4.6         lifecycle_0.2.0      lmtest_0.9-38        data.table_1.13.2    cowplot_1.1.0       
##  [91] bitops_1.0-6         R6_2.5.0             TSP_1.1-10           KernSmooth_2.23-17   rio_0.5.16           janeaustenr_0.1.5   
##  [97] codetools_0.2-16     MASS_7.3-53          gtools_3.8.2         assertthat_0.2.1     withr_2.3.0          parallel_4.0.3      
## [103] hms_0.5.3            coda_0.19-4          class_7.3-17         rmarkdown_2.5        carData_3.0-4        scatterplot3d_0.3-41
## [109] lubridate_1.7.9      base64enc_0.1-3
