Text Generation: Appendix

13 minute read


This is the appendix for the series on text generation with recurrent neural networks. Here, we show how we gathered, cleaned, and transformed the data necessary to train the models we considered. All of the code used in this analysis is available on GitHub.


First, we load in the necessary packages:

library("tidyverse")
library("here")

Downloading Books

To begin, we create a function that automates the process of downloading .txt files from Project Gutenberg.

download_gutenberg <- function(url, author, title) {
  ROOT_DIR <- "Data/Gutenberg/raw"

  if(!dir.exists(here(ROOT_DIR, author))) dir.create(here(ROOT_DIR, author), recursive = TRUE)
  download.file(url, here(ROOT_DIR, author, paste0(title, ".txt")))
}

Now, we download a few files:

# Mary Shelley's Frankenstein
download_gutenberg("https://www.gutenberg.org/files/84/84-0.txt", "Mary Shelley", "Frankenstein")

# Lewis Carroll's Alice in Wonderland Series
download_gutenberg("https://www.gutenberg.org/files/11/11-0.txt", "Lewis Carroll", "Alice_in_Wonderland")
download_gutenberg("https://www.gutenberg.org/files/12/12-0.txt", "Lewis Carroll", "Through_the_Looking_Glass")

# The works of Jane Austen
download_gutenberg("https://www.gutenberg.org/files/158/158-0.txt", "Jane Austen", "Emma")
download_gutenberg("https://www.gutenberg.org/files/1342/1342-0.txt", "Jane Austen", "Pride_and_Prejudice")
download_gutenberg("https://www.gutenberg.org/files/141/141-0.txt", "Jane Austen", "Mansfield_Park")
download_gutenberg("https://www.gutenberg.org/cache/epub/105/pg105.txt", "Jane Austen", "Persuasion")
download_gutenberg("https://www.gutenberg.org/files/121/121-0.txt", "Jane Austen", "Northanger_Abbey")
download_gutenberg("https://www.gutenberg.org/files/161/161-0.txt", "Jane Austen", "Sense_and_Sensibility")

Below, we include the first few lines of the downloaded Frankenstein file:

read_lines(here("Data/Gutenberg/raw/Mary Shelley/Frankenstein.txt"))[1:85]
##  [1] "The Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft (Godwin) Shelley"                    
##  [2] ""                                                                                                        
##  [3] "This eBook is for the use of anyone anywhere in the United States and"                                   
##  [4] "most other parts of the world at no cost and with almost no restrictions"                                
##  [5] "whatsoever. You may copy it, give it away or re-use it under the terms"                                  
##  [6] "of the Project Gutenberg License included with this eBook or online at"                                  
##  [7] "www.gutenberg.org. If you are not located in the United States, you"                                     
##  [8] "will have to check the laws of the country where you are located before"                                 
##  [9] "using this eBook."                                                                                       
## [10] ""                                                                                                        
## [11] "Title: Frankenstein"                                                                                     
## [12] "       or, The Modern Prometheus"                                                                        
## [13] ""                                                                                                        
## [14] "Author: Mary Wollstonecraft (Godwin) Shelley"                                                            
## [15] ""                                                                                                        
## [16] "Release Date: 31, 1993 [eBook #84]"                                                                      
## [17] "[Most recently updated: November 13, 2020]"                                                              
## [18] ""                                                                                                        
## [19] "Language: English"                                                                                       
## [20] ""                                                                                                        
## [21] "Character set encoding: UTF-8"                                                                           
## [22] ""                                                                                                        
## [23] "Produced by: Judith Boss, Christy Phillips, Lynn Hanninen, and David Meltzer. HTML version by Al Haines."
## [24] "Further corrections by Menno de Leeuw."                                                                  
## [25] ""                                                                                                        
## [26] "*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***"                                               
## [27] ""                                                                                                        
## [28] ""                                                                                                        
## [29] ""                                                                                                        
## [30] ""                                                                                                        
## [31] "Frankenstein;"                                                                                           
## [32] ""                                                                                                        
## [33] "or, the Modern Prometheus"                                                                               
## [34] ""                                                                                                        
## [35] "by Mary Wollstonecraft (Godwin) Shelley"                                                                 
## [36] ""                                                                                                        
## [37] ""                                                                                                        
## [38] " CONTENTS"                                                                                               
## [39] ""                                                                                                        
## [40] " Letter 1"                                                                                               
## [41] " Letter 2"                                                                                               
## [42] " Letter 3"                                                                                               
## [43] " Letter 4"                                                                                               
## [44] " Chapter 1"                                                                                              
## [45] " Chapter 2"                                                                                              
## [46] " Chapter 3"                                                                                              
## [47] " Chapter 4"                                                                                              
## [48] " Chapter 5"                                                                                              
## [49] " Chapter 6"                                                                                              
## [50] " Chapter 7"                                                                                              
## [51] " Chapter 8"                                                                                              
## [52] " Chapter 9"                                                                                              
## [53] " Chapter 10"                                                                                             
## [54] " Chapter 11"                                                                                             
## [55] " Chapter 12"                                                                                             
## [56] " Chapter 13"                                                                                             
## [57] " Chapter 14"                                                                                             
## [58] " Chapter 15"                                                                                             
## [59] " Chapter 16"                                                                                             
## [60] " Chapter 17"                                                                                             
## [61] " Chapter 18"                                                                                             
## [62] " Chapter 19"                                                                                             
## [63] " Chapter 20"                                                                                             
## [64] " Chapter 21"                                                                                             
## [65] " Chapter 22"                                                                                             
## [66] " Chapter 23"                                                                                             
## [67] " Chapter 24"                                                                                             
## [68] ""                                                                                                        
## [69] ""                                                                                                        
## [70] ""                                                                                                        
## [71] ""                                                                                                        
## [72] "Letter 1"                                                                                                
## [73] ""                                                                                                        
## [74] "_To Mrs. Saville, England._"                                                                             
## [75] ""                                                                                                        
## [76] ""                                                                                                        
## [77] "St. Petersburgh, Dec. 11th, 17—."                                                                        
## [78] ""                                                                                                        
## [79] ""                                                                                                        
## [80] "You will rejoice to hear that no disaster has accompanied the"                                           
## [81] "commencement of an enterprise which you have regarded with such evil"                                    
## [82] "forebodings. I arrived here yesterday, and my first task is to assure"                                   
## [83] "my dear sister of my welfare and increasing confidence in the success"                                   
## [84] "of my undertaking."                                                                                      
## [85] ""

Cleaning Books

Now, we need to clean the files. As illustrated above, Project Gutenberg uses a standardized system for marking the beginning and end of the original text: it is delimited by the first and second lines containing the string "***". Some metadata still remains after cleaning, such as the table of contents and chapter headings. Considering the relatively small volume this takes up, we are okay with leaving it in (Project Gutenberg does not standardize these elements across its files).
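As a minimal sketch of this delimiting logic, consider a made-up five-line stand-in for a raw Project Gutenberg file:

```r
library(stringr)

# Toy stand-in for a raw Project Gutenberg file
book <- c(
  "Title: Some Book",
  "*** START OF THE PROJECT GUTENBERG EBOOK ***",
  "Actual text of the book.",
  "*** END OF THE PROJECT GUTENBERG EBOOK ***",
  "License boilerplate."
)

# Lines containing "***" mark the boundaries of the original text
markers <- which(str_detect(book, "\\*{3}"))
book[(markers[1] + 1):(markers[2] - 1)]
## [1] "Actual text of the book."
```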

Also, the data is currently stored as a character vector whose elements correspond to arbitrary groups of words. The models we will be considering process data on the character level, so we need to break the elements up into individual characters.
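The character-splitting step below relies on stringr's boundary("character") matcher, which splits a string at every character boundary. A small illustration:

```r
library(stringr)

# Split a string into its individual characters
str_extract_all("Hi!", boundary("character"))[[1]]
## [1] "H" "i" "!"
```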

clean_gutenberg <- function(author, file) {
  # Vectorize the clean_gutenberg() function
  if (length(file) > 1) return(walk(file, \(x) clean_gutenberg(author, x)))
  
  book <- read_lines(file = here("Data/Gutenberg/raw", author, file), skip_empty_rows = FALSE)

  # Project Gutenberg indicates original text with "***"
  text_between <- which(str_detect(book, "\\*{3}"))
  book <- book[(text_between[1] + 1) : (text_between[2] - 1)]
  
  # Remove blank lines
  book <- book[book != ""]

  # Want vector of characters, not words
  book <- map_chr(book, \(x) paste0(x, " ")) |>
    str_extract_all(boundary("character")) |>
    unlist()

  # Write the cleaned file to disk
  if(!dir.exists(here("Data/Gutenberg/cleaned", author))) {
    dir.create(here("Data/Gutenberg/cleaned", author), recursive = TRUE)
  }
  write(book, here("Data/Gutenberg/cleaned", author, file))
}
authors <- list.files(here("Data/Gutenberg/raw")) 
works <- map(authors, \(x) list.files(here("Data/Gutenberg/raw", x)))

walk2(authors, works, clean_gutenberg)

Below, we include the first few lines of the cleaned Frankenstein file:

read_lines(file = here("Data/Gutenberg/cleaned/Mary Shelley/Frankenstein.txt"))[1:79]
##  [1] "F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" ";" " " "o" "r" "," " " "t"
## [20] "h" "e" " " "M" "o" "d" "e" "r" "n" " " "P" "r" "o" "m" "e" "t" "h" "e" "u"
## [39] "s" " " "b" "y" " " "M" "a" "r" "y" " " "W" "o" "l" "l" "s" "t" "o" "n" "e"
## [58] "c" "r" "a" "f" "t" " " "(" "G" "o" "d" "w" "i" "n" ")" " " "S" "h" "e" "l"
## [77] "l" "e" "y"

Creating Training Data

Now, we are ready to create the training data for our keras models. Getting the data into the correct form takes a fair amount of work. As we will eventually train four models for each author (one per sequence length: 1, 5, 10, and 30), we need to create multiple training data sets per author.

Additionally, the keras models need numerical inputs, not characters. Importantly, the ordering and scale of the numerical inputs are irrelevant; all that matters is that the function mapping characters to integers is invertible. For each author, we create an encoder/decoder pair whose (otherwise arbitrary) mapping is based on the empirical frequency distribution of characters.
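The frequency-ranking step that underlies the character table can be previewed on a toy vector (counts chosen so the ordering is unambiguous):

```r
# Rank characters from most to least frequent
chars <- c("e", "e", "e", "t", "t", "a")
table(chars) |>
  sort(decreasing = TRUE) |>
  names()
## [1] "e" "t" "a"
```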

Below, we include the helper functions that perform these tasks; the end goal is to create a directory for each author containing the four data sets, the original corpus, and the encoder and decoder functions.

# Function factories for vectorized encoders and decoders:
encoder_fun <- function(char_table) {
  encoder <- function(char) {
    if (length(char) > 1) return(map_dbl(char, encoder))
    which(char_table == char)
  } 
  
  encoder
}

decoder_fun <- function(char_table) {
  decoder <- function(i) {
    if (length(i) > 1) return(map_chr(i, decoder))
    char_table[i]
  } 
  
  decoder
}
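A quick round trip, using the factories defined above on a toy character table (values chosen purely for illustration), confirms that the encoder and decoder are inverses:

```r
library(purrr)

# Assumes encoder_fun() and decoder_fun() from above are in scope
char_table <- c(" ", "e", "t", "a")  # most- to least-frequent
encode <- encoder_fun(char_table)
decode <- decoder_fun(char_table)

encode(c("t", "e", "a"))
## [1] 3 2 4
decode(encode(c("t", "e", "a")))
## [1] "t" "e" "a"
```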

# Bundle all of an author's works into a single list
create_training_data <- function(author) {
  # Get list of books corresponding to author
  files <- list.files(here("Data/Gutenberg/cleaned", author))
  
  # Concatenate books into a single corpus
  books <- map(here("Data/Gutenberg/cleaned", author, files), \(x) read_lines(x, skip_empty_rows = FALSE))
  books <- unlist(books)
  
  # Empirical frequency distribution for encoder/decoder
  char_table <- table(books) |>
    sort(decreasing = TRUE) |>
    names()
  
  encoder <- encoder_fun(char_table)
  decoder <- decoder_fun(char_table)
  
  # Encode corpus
  books <- encoder(books)
  
  list(
    author = author,
    books = books,
    encoder = encoder, 
    decoder = decoder
  )
}

make_col <- function(index, books, batch) {
  col <- tibble(x = books[index:(length(books) - (batch + 1) + index)])
  
  # last column is response
  if (index <= batch) {
    names(col) <- paste0("X", index)
  } else {
    names(col) <- "Y"
  }
  
  col
}
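To see what make_col() produces, here is a toy encoded corpus with batch = 2, which yields three sliding-window columns (X1, X2, and the target Y). This sketch assumes make_col() from above is in scope and the tidyverse is loaded:

```r
library(purrr)
library(tibble)

# Toy encoded corpus
books <- c(10, 20, 30, 40, 50)

# Each row is a length-2 sequence plus the character that follows it
map_dfc(1:3, make_col, books, batch = 2)
## # A tibble: 3 × 3
##      X1    X2     Y
##   <dbl> <dbl> <dbl>
## 1    10    20    30
## 2    20    30    40
## 3    30    40    50
```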

# Create df of sequences of specified batch length
write_seq_df <- function(batch, data, replace = FALSE) {
  if (file.exists(here("Data/Training_Data", data$author, paste0("df_seq_", batch, ".csv"))) && !replace) {
    return(NULL)
  } 
  
  map_dfc(1:(batch + 1), make_col, data$books, batch) |>
    write_csv(here("Data/Training_Data", data$author, paste0("df_seq_", batch, ".csv")))
  
  NULL # Don't want to return anything -- these boots are made for walkin'
}

# Create directory with all files necessary for training model and decoding output
write_train_data <- function(author, batches, replace = FALSE) {
  if(!dir.exists(here("Data/Training_Data", author))) {
    dir.create(here("Data/Training_Data", author), recursive = TRUE)
  }
  
  data <- create_training_data(author)
  saveRDS(data, here("Data/Training_Data", author, "data.RDS"))
  walk(batches, write_seq_df, data, replace)
}
walk(authors, write_train_data, c(1, 5, 10, 30))

Below, we include the first few training observations for Mary Shelley from the data sets of sequence lengths 1, 5, and 10. The first set of observations is encoded in the machine-readable format keras needs; the following set is decoded and presented in a human-readable format. Notice that in each case the \(Y\) column is the target: given the previous characters (\(X_i\)), our goal is to accurately predict the next character (\(Y\)).

At this point, the data is in the format necessary to train our models; see Text Generation 1 and Text Generation 2 for details on implementation and results!

Shelley_data <- read_rds(here("Data/Training_Data/Mary Shelley/data.RDS"))
Shelley_df_seq_1 <- read_csv(here("Data/Training_Data/Mary Shelley/df_seq_1.csv"))
Shelley_df_seq_5 <-  read_csv(here("Data/Training_Data/Mary Shelley/df_seq_5.csv"))
Shelley_df_seq_10 <-  read_csv(here("Data/Training_Data/Mary Shelley/df_seq_10.csv"))
# Printing the first few encoded rows

head(Shelley_df_seq_1)
## # A tibble: 6 × 2
##      X1     Y
##   <dbl> <dbl>
## 1    45     9
## 2     9     4
## 3     4     6
## 4     6    26
## 5    26     2
## 6     2     6

head(Shelley_df_seq_5)
## # A tibble: 6 × 6
##      X1    X2    X3    X4    X5     Y
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    45     9     4     6    26     2
## 2     9     4     6    26     2     6
## 3     4     6    26     2     6     8
## 4     6    26     2     6     8     3
## 5    26     2     6     8     3     2
## 6     2     6     8     3     2     7

head(Shelley_df_seq_10)
## # A tibble: 6 × 11
##      X1    X2    X3    X4    X5    X6    X7    X8    X9   X10     Y
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    45     9     4     6    26     2     6     8     3     2     7
## 2     9     4     6    26     2     6     8     3     2     7     6
## 3     4     6    26     2     6     8     3     2     7     6    27
## 4     6    26     2     6     8     3     2     7     6    27     1
## 5    26     2     6     8     3     2     7     6    27     1     5
## 6     2     6     8     3     2     7     6    27     1     5     9
# Printing the above rows, decoded

head(Shelley_df_seq_1) |> mutate(across(everything(), Shelley_data$decoder))
## # A tibble: 6 × 2
##   X1    Y    
##   <chr> <chr>
## 1 F     r    
## 2 r     a    
## 3 a     n    
## 4 n     k    
## 5 k     e    
## 6 e     n

head(Shelley_df_seq_5) |> mutate(across(everything(), Shelley_data$decoder))
## # A tibble: 6 × 6
##   X1    X2    X3    X4    X5    Y    
##   <chr> <chr> <chr> <chr> <chr> <chr>
## 1 F     r     a     n     k     e    
## 2 r     a     n     k     e     n    
## 3 a     n     k     e     n     s    
## 4 n     k     e     n     s     t    
## 5 k     e     n     s     t     e    
## 6 e     n     s     t     e     i

head(Shelley_df_seq_10) |> mutate(across(everything(), Shelley_data$decoder))
## # A tibble: 6 × 11
##   X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   Y    
##   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 F     r     a     n     k     e     n     s     "t"   "e"   "i"  
## 2 r     a     n     k     e     n     s     t     "e"   "i"   "n"  
## 3 a     n     k     e     n     s     t     e     "i"   "n"   ";"  
## 4 n     k     e     n     s     t     e     i     "n"   ";"   " "  
## 5 k     e     n     s     t     e     i     n     ";"   " "   "o"  
## 6 e     n     s     t     e     i     n     ;     " "   "o"   "r"

References

Allaire, JJ, and François Chollet. 2021. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.
Allaire, JJ, and Yuan Tang. 2021. Tensorflow: R Interface to ’TensorFlow’. https://CRAN.R-project.org/package=tensorflow.
Chollet, François. 2018. Deep Learning with R. Manning Publications.
Goodfellow, Ian J., Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.