This tutorial shows how to convert PDFs to simple txt (editor) files. The R Notebook for this tutorial can be downloaded here.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip this section. Installing all of the packages may take some time (between 1 and 5 minutes), so there is no need to worry if it does not finish immediately.
# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)
# install libraries
install.packages(c("pdftools", "dplyr", "stringr", "httr", "jsonlite"))
Once you have installed RStudio and have initiated the session by executing the code shown above, you are good to go.
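Calling install.packages() on packages that are already present simply installs them again. As a minimal sketch (the helper name missing_packages is hypothetical and not part of the original tutorial), the base R functions installed.packages() and setdiff() can be used to check which of the required packages are still missing, so that only those need to be installed.

```r
# packages used in this tutorial
required <- c("pdftools", "dplyr", "stringr", "httr", "jsonlite")

# hypothetical helper: return the packages from pkgs that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # uncomment the next line to install only the missing packages
  # install.packages(to_install)
} else {
  message("All required packages are already installed.")
}
```

The install call is left commented out here so that the check can be run without changing the local library.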
To follow this tutorial interactively (by using the R Notebook), follow the instructions listed below.
1. Click on File in the upper left corner of the RStudio interface and then on New Project...
2. Select Existing Directory, browse to the folder that contains the R Notebook, and click on Open.
3. Click on Files above the lower right panel and open the file convertpdf2txt.Rmd.
4. To render the notebook, click on Knit above the upper left panel in RStudio.
Now, we load the packages and inspect the data.
# activate packages
library(pdftools)
library(dplyr)
library(stringr)
Next, we define a path and convert the PDF located at that path into a txt.
# you can use a URL or a path that leads to a PDF document
pdf_path <- "https://slcladal.github.io/data/PDFs/pdf0.pdf"
# extract text
txt_output <- pdftools::pdf_text(pdf_path) %>%
  paste(sep = " ") %>%
  stringr::str_replace_all(fixed("\n"), " ") %>%
  stringr::str_replace_all(fixed("\r"), " ") %>%
  stringr::str_replace_all(fixed("\t"), " ") %>%
  stringr::str_replace_all(fixed("\""), " ") %>%
  paste(sep = " ", collapse = " ") %>%
  stringr::str_squish() %>%
  stringr::str_replace_all("- ", "")
# inspect
str(txt_output)
## chr "Corpus linguistics Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguis"| __truncated__
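To make the individual cleaning steps easier to follow, here is a minimal sketch that applies the same sequence of operations to a small made-up example. It rests on two assumptions: the input vector pages is invented and merely stands in for the output of pdf_text(), which returns one character string per page, and the helper clean_text is hypothetical and uses base R equivalents (gsub, trimws) instead of the stringr functions used in the tutorial.

```r
# invented two-page example standing in for the output of pdftools::pdf_text()
pages <- c("Corpus lin-\nguistics is the\tstudy of \"language\"",
           "as expressed   in corpora.\r\n")

# hypothetical helper that mirrors the cleaning steps with base R functions
clean_text <- function(pages) {
  txt <- paste(pages, collapse = " ")      # join all pages into one string
  txt <- gsub("[\n\r\t\"]", " ", txt)      # line breaks, tabs, and quotes become spaces
  txt <- gsub("\\s+", " ", trimws(txt))    # collapse repeated whitespace, trim the ends
  gsub("- ", "", txt, fixed = TRUE)        # rejoin words that were hyphenated at line ends
}

clean_text(pages)
## [1] "Corpus linguistics is the study of language as expressed in corpora."
```

Note that removing every "- " also affects hyphens that are followed by a space for other reasons (for example in "pre- and post-test"), so the last step may need adjusting depending on the texts.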
To convert many PDF files, we write a function that performs the conversion for all documents in a folder.
convertpdf2txt <- function(dirpath){
  # list all PDF files in the folder (with their full paths)
  files <- list.files(dirpath, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
  # extract and clean the text of each PDF
  x <- sapply(files, function(x){
    x <- pdftools::pdf_text(x) %>%
      paste(sep = " ") %>%
      stringr::str_replace_all(fixed("\n"), " ") %>%
      stringr::str_replace_all(fixed("\r"), " ") %>%
      stringr::str_replace_all(fixed("\t"), " ") %>%
      stringr::str_replace_all(fixed("\""), " ") %>%
      paste(sep = " ", collapse = " ") %>%
      stringr::str_squish() %>%
      stringr::str_replace_all("- ", "")
    return(x)
  })
  # return a named character vector with one element per PDF
  return(x)
}
We can now apply the function to the folder in which we have stored the PDFs. The output is a vector with the texts of the PDFs.
# apply function
txts <- convertpdf2txt("data/PDFs/")
# inspect the structure of the txts element
str(txts)
## Named chr [1:4] "Corpus linguistics Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguis"| __truncated__ ...
## - attr(*, "names")= chr [1:4] "data/PDFs/pdf0.pdf" "data/PDFs/pdf1.pdf" "data/PDFs/pdf2.pdf" "data/PDFs/pdf3.pdf"
The output of the str() function shows that we have converted 4 PDFs into texts, stored as the elements of a named character vector.
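When many files are converted in one go, a single damaged or unreadable file would stop the whole run with an error. The following is a minimal sketch of one way to guard against this with tryCatch(), assuming that a failed conversion should yield NA rather than abort. The names convert_safely and fake_converter are hypothetical: fake_converter only stands in for pdftools::pdf_text so that the example runs without any PDF files.

```r
# hypothetical helper: apply a converter to each file and keep going on errors
convert_safely <- function(files, converter) {
  vapply(files, function(f) {
    tryCatch(
      paste(converter(f), collapse = " "),   # join the pages of one document
      error = function(e) {
        warning("Could not convert ", f, ": ", conditionMessage(e), call. = FALSE)
        NA_character_                        # mark the failed file as missing
      }
    )
  }, character(1))
}

# stand-in converter for illustration only; it fails for files named "broken"
fake_converter <- function(f) {
  if (grepl("broken", f)) stop("file is damaged")
  c("page one", "page two")
}

res <- suppressWarnings(convert_safely(c("a.pdf", "broken.pdf"), fake_converter))
res
```

With real data, the second argument would be pdftools::pdf_text, and the elements that are NA in the result identify the files that could not be read.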
To save the txt files on your disk, simply replace the predefined location (“D:\Uni\UQ\SLC\LADAL\SLCLADAL.github.io\data/”) with the folder where you want to store the txt files and then execute the code below. We will also name the elements of txts as text plus their number.
# add names to txt files
names(txts) <- paste("text", seq_along(txts), sep = "")
# save result to disk (one txt file per element of txts)
lapply(seq_along(txts), function(i) writeLines(text = unlist(txts[i]),
  con = paste("D:\\Uni\\UQ\\SLC\\LADAL\\SLCLADAL.github.io\\data/", names(txts)[i], ".txt", sep = "")))
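Hard-coded paths with backslashes only work on the machine they were written for. As a minimal sketch of a more portable variant, the following example builds the output paths with file.path() and writes to a temporary folder. The vector example_txts and the folder name txt_output are invented for illustration and are not part of the original tutorial.

```r
# invented example texts standing in for the converted PDFs
example_txts <- c(text1 = "First example text.", text2 = "Second example text.")

# build an output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "txt_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# one output file per text, named after the element names
out_files <- file.path(out_dir, paste0(names(example_txts), ".txt"))

# write each text to its file
for (i in seq_along(example_txts)) {
  writeLines(example_txts[[i]], con = out_files[i])
}

# check which files were written
list.files(out_dir)
```

To keep the files permanently, out_dir would be replaced with a folder of your choice.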
Schweinberger, Martin. 2020. Converting PDFs to txt files with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/convertpdf2txt.html (Version 2020.12.03).
@manual{schweinberger2020conv,
author = {Schweinberger, Martin},
title = {Converting PDFs to txt files with R},
note = {https://slcladal.github.io/convertpdf2txt.html},
year = {2020},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2020/12/03}
}
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pdftools_2.3.1 collostructions_0.1.2 igraph_1.2.6 GGally_2.0.0 network_1.16.1
## [6] ggdendro_0.1.22 slam_0.1-47 Matrix_1.2-18 tm_0.7-7 NLP_0.2-1
## [11] tidytext_0.2.6 quanteda_2.1.2 gplots_3.1.0 FactoMineR_2.3 exact2x2_1.6.5
## [16] exactci_1.3-3 ssanv_1.1 vcd_1.4-8 ape_5.4-1 pvclust_2.2-0
## [21] NbClust_3.0 seriation_1.2-9 factoextra_1.0.7 cluster_2.1.0 cfa_0.10-0
## [26] gridExtra_2.3 fGarch_3042.83.2 fBasics_3042.89.1 timeSeries_3062.100 timeDate_3043.102
## [31] e1071_1.7-4 ggpubr_0.4.0 flextable_0.5.11 forcats_0.5.0 stringr_1.4.0
## [36] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4
## [41] ggplot2_3.3.3 tidyverse_1.3.0 DT_0.16 kableExtra_1.3.1 knitr_1.30
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 uuid_0.1-4 backports_1.1.10 fastmatch_1.1-0 systemfonts_0.3.2 plyr_1.8.6
## [7] crosstalk_1.1.0.1 SnowballC_0.7.0 usethis_1.6.3 digest_0.6.27 foreach_1.5.1 htmltools_0.5.0
## [13] fansi_0.4.1 rle_0.9.2 magrittr_1.5 openxlsx_4.2.3 sna_2.6 modelr_0.1.8
## [19] RcppParallel_5.0.2 officer_0.3.15 askpass_1.1 colorspace_1.4-1 rvest_0.3.6 ggrepel_0.8.2
## [25] haven_2.3.1 xfun_0.19 crayon_1.3.4 jsonlite_1.7.1 zoo_1.8-8 iterators_1.0.13
## [31] glue_1.4.2 registry_0.5-1 stopwords_2.0 gtable_0.3.0 webshot_0.5.2 car_3.0-10
## [37] abind_1.4-5 scales_1.1.1 qpdf_1.1 DBI_1.1.0 rstatix_0.6.0 Rcpp_1.0.5
## [43] viridisLite_0.3.0 flashClust_1.01-2 foreign_0.8-80 htmlwidgets_1.5.3 httr_1.4.2 RColorBrewer_1.1-2
## [49] ellipsis_0.3.1 spatial_7.3-12 reshape_0.8.8 pkgconfig_2.0.3 farver_2.0.3 dbplyr_2.0.0
## [55] utf8_1.1.4 tidyselect_1.1.0 labeling_0.4.2 rlang_0.4.8 reshape2_1.4.4 munsell_0.5.0
## [61] cellranger_1.1.0 tools_4.0.3 cli_2.1.0 generics_0.1.0 statnet.common_4.4.1 broom_0.7.2
## [67] evaluate_0.14 yaml_2.2.1 fs_1.5.0 zip_2.1.1 caTools_1.18.0 nlme_3.1-149
## [73] leaps_3.1 xml2_1.3.2 tokenizers_0.2.1 compiler_4.0.3 rstudioapi_0.11 curl_4.3
## [79] ggsignif_0.6.1 reprex_0.3.0 stringi_1.5.3 highr_0.8 gdtools_0.2.2 lattice_0.20-41
## [85] vctrs_0.3.4 pillar_1.4.6 lifecycle_0.2.0 lmtest_0.9-38 data.table_1.13.2 cowplot_1.1.0
## [91] bitops_1.0-6 R6_2.5.0 TSP_1.1-10 KernSmooth_2.23-17 rio_0.5.16 janeaustenr_0.1.5
## [97] codetools_0.2-16 MASS_7.3-53 gtools_3.8.2 assertthat_0.2.1 withr_2.3.0 parallel_4.0.3
## [103] hms_0.5.3 coda_0.19-4 class_7.3-17 rmarkdown_2.5 carData_3.0-4 scatterplot3d_0.3-41
## [109] lubridate_1.7.9 base64enc_0.1-3