Text Generation: Model Evaluation




This is the second post in a series on generating text with recurrent neural networks (RNNs). In the previous post, we trained sets of networks based on works by Mary Shelley, Jane Austen, and Lewis Carroll to make predictions based on character sequences of lengths 1, 5, 10, and 30. In this post we will evaluate these networks, comparing the standard method of padding input seeds with the proposed bootstrapping method.

For details on the data wrangling process, see the Appendix. Additionally, all of the code used in this analysis is available on GitHub.

First, we load several R packages necessary for our analysis:

library("tidyverse"); theme_set(theme_bw()); theme_update(panel.grid.minor = element_blank())
library("keras") # Provides load_model_hdf5()
library("here") # Provides here()


Making predictions

The following function takes an input sequence of arbitrary length (seed), loads the relevant RNN models for a specified author (author), and generates a number of additional characters (steps). It does so either by padding the input seed or via the bootstrapping method, according to the bootstrap parameter.

run_model <- function(seed, author, steps = 250, bootstrap = TRUE) {
  # Load the four models trained on sequences of length 1, 5, 10, and 30
  model_1 <- load_model_hdf5(here("Models", author, "model_1.h5"))
  model_5 <- load_model_hdf5(here("Models", author, "model_5.h5"))
  model_10 <- load_model_hdf5(here("Models", author, "model_10.h5"))
  model_30 <- load_model_hdf5(here("Models", author, "model_30.h5"))
  data <- read_rds(here("Data/Training_Data/", author, "data.RDS"))

  # Dispatch to the model trained on sequences of length k
  make_pred <- function(X, k) {
    if (k == 1) return(predict(model_1, X))
    if (k == 5) return(predict(model_5, X))
    if (k == 10) return(predict(model_10, X))
    if (k == 30) return(predict(model_30, X))
  }

  # Standard method: left-pad the seed with spaces to the full input length
  if (!bootstrap) {
    seed <- str_pad(seed, 30, side = "left")
  }

  # Split the seed into characters and encode them as integers
  seed <- str_extract_all(seed, ".") |>
    unlist() |>
    map_dbl(data$encoder) # Assumption: data$encoder maps a character to its integer index (counterpart of data$decoder)
  seed <- seed - 1 # Off-by-one between R and Python

  for (i in 1:steps) {
    # Use the largest model whose input length fits the current sequence
    seq_length <- length(seed)
    mod_index <- max(which(seq_length - c(1, 5, 10, 30) >= 0))
    k <- c(1, 5, 10, 30)[mod_index]
    X <- rev(seed)[1:k]
    X <- rev(X)
    X <- matrix(X, ncol = k)

    # Sample the next character from the predicted distribution
    probs <- make_pred(X, k) |>
      as.vector()
    choices <- 1:(length(probs)) - 1 # Off-by-one between R and Python
    pred <- sample(x = choices, size = 1, prob = probs)

    seed <- c(seed, pred)
  }

  seed <- seed + 1 # Off-by-one between R and Python
  map_chr(seed, data$decoder) |>
    paste(collapse = "")
}
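
The model-selection step inside the loop can be seen in isolation with a small sketch. It assumes only the four context lengths named in the post; the helper name select_context is hypothetical and is not part of the original code. It returns the largest context length that fits the current sequence, which is what the max(which(...)) expression in run_model computes.

```r
# Context lengths of the trained models, as described in the post.
context_lengths <- c(1, 5, 10, 30)

# Hypothetical helper (not in the original code): return the largest
# context length that fits within a sequence of `seq_length` characters.
select_context <- function(seq_length, lengths = context_lengths) {
  stopifnot(seq_length >= min(lengths))
  max(lengths[lengths <= seq_length])
}

# Context length used at each step as a one-character seed grows.
schedule <- vapply(1:35, select_context, numeric(1))

# Run-length encoding summarises how many steps use each model.
print(rle(schedule))
```

Starting from a one-character seed, this schedule uses the length-1 model for sequences of 1 to 4 characters, the length-5 model for 5 to 9, the length-10 model for 10 to 29, and the length-30 model from then on.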

Comparing standard and bootstrapped models

First, we compare the output of the standard and bootstrap prediction methods. We use the models trained on Mary Shelley’s Frankenstein to generate 10 sequences of length 100 from the input seed "A". While neither method produces great text, subjectively the bootstrap method seems to produce slightly more realistic content. In the first few characters especially, the bootstrapped output seems more coherent than the padded output. This makes intuitive sense: the effects of padding would be most pronounced in the first few predictions.
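
To make that intuition concrete, here is a rough sketch of how much of the model input is padding at each step under the standard method. It assumes a one-character seed and the 30-character input window used by the longest model; the variable names are illustrative and do not come from the original code.

```r
window <- 30      # Input length expected by the longest model
seed_length <- 1  # A one-character seed such as "A"
steps <- 1:35     # Prediction steps

# Standard method: number of padding spaces still inside the
# 30-character input window at each prediction step.
padding_chars <- pmax(0, window - seed_length - (steps - 1))

# Share of the input window that is padding rather than real text.
padding_share <- padding_chars / window

print(round(padding_share[c(1, 5, 10, 30)], 2))
```

Under these assumptions, 29 of the 30 input characters are padding at the first step, and the padding only leaves the window at step 30. The bootstrap method never feeds padding to a model, since it switches to a shorter-context model instead.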


map(1:10, \(.) run_model("A", "Mary Shelley", 100, bootstrap = TRUE)) |>
  walk2(1:10, \(pred, i) cat(i, ": ", pred, "\n\n", sep = ""))
1: Asen, I followed impelte with me. If suppose the russlika and the aspect of prepose me on the dashing

2: Ager; yet, my dear sir, you are my infisint which path me the oclurest palta light informating them h

3: As our lips, and wandered or to me, instant to my next me animated towards the bloom of the smiles an

4: Ag of Elizabeth thave by many werem into intence him enchanting selfishness which steppold, never det

5: Ag Warted such seemed to promind thencefing beneath, the by the cows during the own, the conione was 

6: Arsom uncolmon and hopronged find a few journeys to several cause. Succled understood me bus its expr

7: Atth ures, her unberief of thember, when liable worth with circumstances by a creature was sought tra

8: Amugrt suffor the precepity. One hands, and the Argh, but I found that we Are think that whill should

9: Af Be assured himself when I thought of afford you surnes his hand raises; the frien delly which can 

10: Afre and thersen were ruding these longer happened to Lonis, he said evileeted her knowlesged with a

map(1:10, \(.) run_model("A", "Mary Shelley", 100, bootstrap = FALSE)) |>
  walk2(1:10, \(pred, i) cat(i, ": ", pred, "\n\n", sep = ""))
1: A aguins. Stils, continually relation is so miserad thies. Mared or sleep. You have renothed to whom 

2: A have related by whates, of the greatest disgust? Wereed by the home. Even in the exhaustion was so 

3: A pretard to her, as your coupse with her an Erscable undertaking which her before the murder had not

4: A haish began house, unable to brave eated and wasted an end from the woods, of letters from the idea

5: A abyed that I have pircuis only devetting that my heart and regular monshe that I have memares worwh

6: A think who murdet had turned voyage in my hatred can cas unbeg, they had praviden. At one thought be

7: A hadd of the among its deeper, but I will revenge, whither has atseccive! These I shiped by so stran

8: A participate inminged irounden, but the Tcrespick and revenge firings I hastened to gratufuble, in t

9: A follow his attenate last midd information? You thas man, whilst the barbarous and is so long longed

10: A handshaps of life thus noithy to too he knows continual; and we could with considerable existence a

Comparing models trained on different authors

Below, we include the predictions from the bootstrap models trained on each author, based on the input seed "I". Each model seems to have a unique “style,” corresponding to the voice of the original texts. Interestingly, the output from the Jane Austen model seems to be the most coherent. This is likely because her corpus was much larger than the others, consisting of the text of 6 books.

authors <- list.files(here("Data/Training_Data"))

results <- map(authors, run_model, seed = "I", steps = 1000) |>
  (\(x) tibble(output = x))() |>
  mutate(author = authors) |>
  mutate(output = str_replace_all(output, "\\s+", " "))

print_results <- function(output, author) {
  cat(author, ":\n\t", output, "\n\n", sep = "")
}

pwalk(results, print_results)
Jane Austen:
    I happen with them? Stood in return from Northanger, the time. She could not finish them to the insensicly styla, which Isabella at the Parsonage. Why is it not passing towards frequently unprevious with the calmerness; and it had a very different man, perhaps you will be gone at the three years be to anybody they aedness.—This was _that_ was an “Give her for what is she not, to be satisfied, Miss Tilney’ had come away.” “That is a part of consequence to Berkeley Street. Yet, is never he was existed. She could not wrate to a chapel opening your notice; and, in making round as her walk instery on coming as long as his character near on, not to get away. The brother says we seem so hatelbed a pity to make her time to marry, and she restrained Colonel Forster's chay. “On a wife’s way; what this is good; they reached, most commoney Willoughby was deserved assiduite; but the word, the subject of me; chiefly only side, a most much close with it abusably thought; and — The w

Lewis Carroll:
    I m Starte grazn, and peeped over some in it: they’se half or intenth?” “Mewning about that was very much sure that it was lespen him for them, and so bean?” “It _shall_ ditsen in the most in up: Her fundery in the weitt interestor to her undor cudnies of them all as ears it it. Porrapidy all she stak dulbring down, and then stom your nass is the twingly Dunw,” she said to herself. “It’s no came from,” he said to bevin.” “I had knowing off timidly, “you wore hear when you like,” said the White Rabbit, “but that right’s hear as I shat’s moing, with the King’s coarsu to burr menwy wanted to be created my sibteak: to having sieden!” “Can to call about for thiss; and I’m sure justsucte’s little thing about it, or soldisulas. “That’s a qoent ot muth commertas out bodes,” said the King englisadded to this was heaving the limt griend Alice’s creatures.” “Not I’m her?” Alice was getting quicely turned to be no use into things turnly in by things, but out to you so!” said the Red Queen. “—I dren

Mary Shelley:
    I St to me that I was frietes, then in every thoughts would be human metsers present to hum on the faces of wonders arisned me and to his kindness and gave near horror; and my first appearance, shall covered with the events which no man is renotfical age was tey thought; but ufter passed my own feelings shone on the restless that I have doubts savaucheded seving me from my syudy in the same time that you surrutions vast intended by riscessol instantly effected in its singres of William, Jyet-roubted _than many hat her enterprise. But alluring rendersed the brudking fortnd—and immease with inamisents of this delay and indecent epent, and talines performed the forder of wonder and tererness and misery. She passed through our soul made the pace she had reasonous departure presurplied several plaises, and would not expect the sail. Amazis ridem; but I felt by the numse; you will deep deare to my chinks we iple. During this moments I, and such my only children. Seated with shiding greatest s

Final thoughts

Looking at the output of our models, it appears that padding inputs with whitespace may have a negative effect on an RNN’s predictions. Note that we chose to train models on sequences of length 1, 5, 10, and 30. These values were arbitrary, and performance could likely be improved by tuning this hyperparameter. As we mentioned previously, the idea of bootstrapping predictions using different models is relevant to other applications of RNNs. We are interested in comparing model performance in other contexts, especially those where there are well-defined measures of model quality.

A challenge in this analysis was the lack of objective measures of quality. When comparing the two methods, we had to resort to subjective judgements, comparing outputs based on our opinions of what “looks” most like English text. If we had better models, we might be able to compare methods using automated tools such as a spell checker. However, given the quality of the output of the models we trained, this would not be an effective measure of comparison.
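
For illustration only, a crude stand-in for a spell checker could measure the share of generated words that appear in a reference vocabulary. The sketch below is not part of the original analysis: the function name word_hit_rate and the toy vocabulary are hypothetical, and in practice the vocabulary might be built from the training text or a dictionary file.

```r
# Hypothetical helper: proportion of generated words found in a vocabulary.
word_hit_rate <- function(generated, vocabulary) {
  # Lower-case the text and split on anything that is not a letter
  # or an apostrophe.
  tokens <- unlist(strsplit(tolower(generated), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(NA_real_)
  mean(tokens %in% tolower(vocabulary))
}

# Toy vocabulary, for illustration only.
vocabulary <- c("i", "followed", "with", "me", "if", "the", "and")

# Four of the six tokens ("i", "followed", "with", "me") are in the vocabulary.
print(word_hit_rate("Asen, I followed impelte with me.", vocabulary))
```

A measure like this ignores grammar and meaning entirely, so it would at best complement the subjective comparison rather than replace it.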

The code used to conduct this analysis was written to be easy to use, understand, and extend. It would be very easy to create models for different authors, try different network architectures, or better tune the networks we proposed. We encourage others to fork our repository and explore these or other questions!

