Visualizing BoardGameGeek Data with ggdensity

We’re going to look at the BoardGameGeek data from week 4 of TidyTuesday 2022. This data set consists of community ratings and other stats for just over 20,000 board games. The first thing we need to do is load the data and perform some basic cleaning, joining the ratings and details data on the id column:

library("tidyverse")

data <- tidytuesdayR::tt_load('2022-01-25')

df <- data$ratings |>
  left_join(data$details, by = "id")
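
To make the join step concrete, here is a minimal sketch on two tiny hand-made tables. The values and the pared-down columns are hypothetical stand-ins, not rows from the TidyTuesday data. It shows that left_join() keeps every row of the left table and fills in NA where the right table has no matching id:

```r
library(dplyr)

# Hypothetical toy tables standing in for data$ratings and data$details
ratings <- tibble(id = c(1, 2, 3), average = c(7.6, 7.4, 7.1))
details <- tibble(id = c(1, 2), playingtime = c(45, 60))

# Every rating is kept; id 3 has no details, so its playingtime is NA
joined <- left_join(ratings, details, by = "id")
joined
```

If the row count after a join like this differs from the row count of the left table, that usually points to duplicated ids on the right-hand side, which is worth checking before moving on.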

Looking at boardgamecategory

Something that immediately stands out to me is the variable boardgamecategory. Comparing stats across different types of board games could end up being really interesting! But there is a problem: this column isn’t “tidy”, since each row packs several categories into a single string:

select(df, name, boardgamecategory)
## # A tibble: 21,831 × 2
##    name              boardgamecategory                                          
##    <chr>             <chr>                                                      
##  1 Pandemic          ['Medical']                                                
##  2 Carcassonne       ['City Building', 'Medieval', 'Territory Building']        
##  3 Catan             ['Economic', 'Negotiation']                                
##  4 7 Wonders         ['Ancient', 'Card Game', 'City Building', 'Civilization', …
##  5 Dominion          ['Card Game', 'Medieval']                                  
##  6 Ticket to Ride    ['Trains']                                                 
##  7 Codenames         ['Card Game', 'Deduction', 'Party Game', 'Spies/Secret Age…
##  8 Terraforming Mars ['Economic', 'Environmental', 'Industry / Manufacturing', …
##  9 7 Wonders Duel    ['Ancient', 'Card Game', 'City Building', 'Civilization', …
## 10 Agricola          ['Animals', 'Economic', 'Farming']                         
## # … with 21,821 more rows

Luckily, this is an easy fix with some string processing. We can use stringr::str_extract_all() to extract the categories from each row into a list, then use tidyr::unnest() to flatten out the resulting list column.

df <- df |>
  filter(!is.na(boardgamecategory)) |>
  mutate(boardgamecategory = str_extract_all(boardgamecategory, "(?<=')[^,]*(?=')")) |>
  unnest(boardgamecategory)

select(df, name, boardgamecategory)
## # A tibble: 54,997 × 2
##    name        boardgamecategory 
##    <chr>       <chr>             
##  1 Pandemic    Medical           
##  2 Carcassonne City Building     
##  3 Carcassonne Medieval          
##  4 Carcassonne Territory Building
##  5 Catan       Economic          
##  6 Catan       Negotiation       
##  7 7 Wonders   Ancient           
##  8 7 Wonders   Card Game         
##  9 7 Wonders   City Building     
## 10 7 Wonders   Civilization      
## # … with 54,987 more rows
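
To see what the pattern is doing, it can help to run it on a single string. The sketch below uses one string copied from the output above and one hypothetical string of my own. The second string assumes that a category containing an apostrophe would be written with double quotes, which I have not verified against the data:

```r
library(stringr)

# Lookbehind for a single quote, a run of non-comma characters,
# then a lookahead for the closing single quote
pattern <- "(?<=')[^,]*(?=')"

x <- "['City Building', 'Medieval', 'Territory Building']"
cats <- str_extract_all(x, pattern)[[1]]
cats
# "City Building" "Medieval" "Territory Building"

# Hypothetical entry: a double-quoted category with an apostrophe inside
y <- "[\"Children's Game\", 'Dice']"
skipped <- str_extract_all(y, pattern)[[1]]
skipped
# "Dice"
```

Under that assumption, the double-quoted entry is silently dropped, so it may be worth scanning the raw column for double quotes before relying on the category counts.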

Great! Now, let’s see what the most popular categories are:

top_categories <- df |>
  group_by(boardgamecategory) |>
  summarize(n = n()) |>
  arrange(desc(n)) |>
  slice_head(n = 10)

top_categories
## # A tibble: 10 × 2
##    boardgamecategory     n
##    <chr>             <int>
##  1 Card Game          6402
##  2 Wargame            3820
##  3 Fantasy            2681
##  4 Party Game         1968
##  5 Dice               1847
##  6 Science Fiction    1666
##  7 Fighting           1658
##  8 Abstract Strategy  1545
##  9 Economic           1503
## 10 Animals            1354
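
As an aside, the group_by() / summarize() / arrange() chain can also be written with dplyr::count(). The sketch below compares the two on a small hypothetical table rather than on the real data. Keep in mind that after unnesting, each game contributes one row per category, so n counts category memberships rather than distinct games:

```r
library(dplyr)

# Hypothetical toy data in the unnested, one-row-per-category shape
toy <- tibble(
  name = c("A", "A", "B", "C", "C"),
  boardgamecategory = c("Card Game", "Dice", "Card Game", "Card Game", "Dice")
)

# The chain used in the post
long_form <- toy |>
  group_by(boardgamecategory) |>
  summarize(n = n()) |>
  arrange(desc(n))

# The shorter spelling: count rows per category, largest first
short_form <- toy |>
  count(boardgamecategory, sort = TRUE)

long_form
short_form
```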

Surprisingly, more games are categorized as “Card Game” than as any other category! We can create a simple visual showing the prevalence of each of these top 10 categories:

top_categories |>
  mutate(boardgamecategory = fct_reorder(boardgamecategory, n, .desc = TRUE)) |>
  ggplot(aes(x = boardgamecategory, y = n)) +
  geom_col() +
  labs(
    x = "Category",
    y = NULL
  )

Looking at playingtime, minplayers, and maxplayers

Let’s put the work that we’ve done on the categories field on hold for a minute and look at how a game’s average number of players (the midpoint of minplayers and maxplayers) relates to its average play time. Before making any plots, I would suspect that the average play time increases with the number of players. That is to say, I would expect a positive correlation between the two variables.

# First, we need to do a little more cleaning:
# filter out some outliers, convert play time to hours, compute avg_players
df <- df |>
  filter(maxplayers < 20, playingtime < 1000) |>
  mutate(
    playingtime = playingtime / 60, # minutes -> hours
    avg_players = (minplayers + maxplayers) / 2 # midpoint of the player range
  )

df |>
  distinct(name, .keep_all = TRUE) |> # Don't care about categories right now
  ggplot(aes(x = avg_players, y = playingtime)) +
  geom_jitter(height = .5, width = .5, size = .1, alpha = .5) +
  scale_x_continuous(breaks = seq(0, 12, by = 2)) +
  scale_y_continuous(breaks = seq(0, 14, by = 2)) +
  coord_cartesian(ylim = c(0, 14), expand = FALSE) +
  labs(
    x = "Average no. of players",
    y = "Average play time (Hours)"
  )

Interestingly, this does not seem to be the case! In fact, it may be the opposite: play time appears to peak for games with between 2 and 4 players and drops off as the number of players increases.

Unfortunately, the above plot has a few issues that stand in the way of making useful observations. First, I had to jitter heavily to eliminate graphical artifacts resulting from the discrete nature of the data. Notice that several of the points seem to correspond to games with fewer than 0 average players! Second, there is pretty severe overplotting. Although I tried to mitigate this by setting both the size and alpha arguments, the plot is still very crowded, especially near the horizontal axis between the 2 and 4 player ticks.
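
The negative player counts are a side effect of the jitter rather than of the data. As far as I understand it, a jitter width of .5 adds uniform noise of up to half a unit in either direction, which base R's jitter() with amount = 0.5 mimics. A rough sketch with hypothetical player ranges (not rows from the real data):

```r
# Hypothetical player ranges, chosen only for illustration
minplayers <- c(0, 1, 2, 2, 3)
maxplayers <- c(0, 2, 2, 4, 6)
avg_players <- (minplayers + maxplayers) / 2 # midpoints land on a 0.5 grid

set.seed(42) # only to make the sketch reproducible
jittered <- jitter(avg_players, amount = 0.5) # uniform noise in [-0.5, 0.5]

# Each point can move up to half a player in either direction,
# so a game sitting at 0 can end up drawn below zero
range(jittered - avg_players)
any(jittered < 0)
```

Because the midpoints are spaced 0.5 apart, noise of this size also lets neighbouring values overlap, which is what hides the grid pattern in the scatterplot.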

Fortunately, I know of a tool that can help with both of these issues—ggdensity!

library("ggdensity")

df |> 
  distinct(name, .keep_all = TRUE) |> # Don't care about categories right now
  ggplot(aes(x = avg_players, y = playingtime)) +
  geom_hdr(adjust = c(2, 4)) + # Need to set adjust b/c of discreteness
  scale_x_continuous(breaks = seq(0, 12, by = 2)) +
  scale_y_continuous(breaks = seq(0, 14, by = 2)) +
  coord_cartesian(ylim = c(0, 14), expand = FALSE) +
  labs(
    x = "Average no. of players",
    y = "Average play time (Hours)"
  )
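
One caveat on the adjust argument: to the best of my knowledge, ggdensity 1.0.0 and later moved estimator settings such as adjust out of geom_hdr() and into method_*() helper functions. If the call above complains about adjust, a version along the following lines may be what is needed. This is a sketch under that assumption, using the built-in faithful data as a stand-in for the board game data:

```r
library(ggplot2)
library(ggdensity)

# Assumes a ggdensity release that provides method_kde();
# adjust scales the default bandwidth in the x and y directions
p <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_hdr(method = method_kde(adjust = c(2, 4)))

p
```

Larger adjust values smooth the density estimate more, which is the same lever the original call uses to keep the discreteness of the data from showing up as separate blobs.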