Visualizing BoardGameGeek data with ggdensity

Author

James Otto

Published

February 15, 2022

The Data

In this blog post, we’re going to be looking at the BoardGameGeek data from week 4 of TidyTuesday 2022. This data set consists of community ratings and other stats for just over 20,000 board games. The first thing we need to do is load in the data and perform some basic cleaning, joining the ratings and details data on the id column:

library("tidyverse")

data <- tidytuesdayR::tt_load('2022-01-25')

df <- data$ratings |>
  left_join(data$details, by = "id")
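To make the join behaviour concrete, here is a minimal sketch on two made-up tables (the values below are hypothetical, not taken from the TidyTuesday data). It shows that left_join() keeps every row of the left table and fills in NA where the right table has no match, which is one way missing values can end up in columns such as boardgamecategory.

```r
library(dplyr)

# Hypothetical stand-ins for data$ratings and data$details
ratings <- tibble::tibble(id = c(1, 2, 3), average = c(8.6, 7.4, 7.1))
details <- tibble::tibble(
  id = c(1, 2),
  boardgamecategory = c("['Medical']", "['Medieval']")
)

# Every row of `ratings` is kept; id 3 has no match in `details`,
# so its boardgamecategory is NA
joined <- left_join(ratings, details, by = "id")
joined
```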

Looking at `boardgamecategory`

Something that immediately stands out to me is the variable boardgamecategory. Comparing stats across different types of board games could end up being really interesting! But, there is a problem—this column isn’t “tidy”:

select(df, name, boardgamecategory) |>
  slice_head(n = 10)

name	boardgamecategory
Pandemic	['Medical']
Carcassonne	['City Building', 'Medieval', 'Territory Building']
Catan	['Economic', 'Negotiation']
7 Wonders	['Ancient', 'Card Game', 'City Building', 'Civilization', 'Economic']
Dominion	['Card Game', 'Medieval']
Ticket to Ride	['Trains']
Codenames	['Card Game', 'Deduction', 'Party Game', 'Spies/Secret Agents', 'Word Game']
Terraforming Mars	['Economic', 'Environmental', 'Industry / Manufacturing', 'Science Fiction', 'Space Exploration', 'Territory Building']
7 Wonders Duel	['Ancient', 'Card Game', 'City Building', 'Civilization', 'Economic']
Agricola	['Animals', 'Economic', 'Farming']

Luckily, this is an easy fix with some string processing. We can use stringr::str_extract_all() to extract the categories from each row into a list, then use tidyr::unnest() to flatten out the resulting list column.

df <- df |>
  filter(!is.na(boardgamecategory)) |>
  mutate(boardgamecategory = str_extract_all(boardgamecategory, "(?<=')[^,]*(?=')")) |>
  unnest(boardgamecategory)

select(df, name, boardgamecategory) |>
  slice_head(n = 10)

name	boardgamecategory
Pandemic	Medical
Carcassonne	City Building
Carcassonne	Medieval
Carcassonne	Territory Building
Catan	Economic
Catan	Negotiation
7 Wonders	Ancient
7 Wonders	Card Game
7 Wonders	City Building
7 Wonders	Civilization
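To see what the regular expression is doing in isolation, here is a small sketch on a single hand-written string (the toy tibble is hypothetical and only mimics the format of boardgamecategory). The lookbehind (?<=') and lookahead (?=') require a quote on either side of the match without including it, and [^,]* matches the run of non-comma characters in between.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical raw value, formatted like the entries in boardgamecategory
raw <- "['City Building', 'Medieval', 'Territory Building']"

# str_extract_all() returns a list with one character vector per input string
cats <- str_extract_all(raw, "(?<=')[^,]*(?=')")[[1]]
cats

# The same idea on a one-row tibble: the list column is flattened by unnest(),
# giving one row per (game, category) pair
toy <- tibble::tibble(name = "Carcassonne", boardgamecategory = raw)
tidy <- toy |>
  mutate(boardgamecategory = str_extract_all(boardgamecategory, "(?<=')[^,]*(?=')")) |>
  unnest(boardgamecategory)
tidy
```

Note that after unnesting, a game appears once per category, so row counts no longer correspond to the number of distinct games.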

Great! Now, let’s see what the most popular categories are:

top_categories <- df |>
  group_by(boardgamecategory) |>
  summarize(n = n()) |>
  arrange(desc(n)) |>
  slice_head(n = 10)

top_categories

boardgamecategory	n
Card Game	6402
Wargame	3820
Fantasy	2681
Party Game	1968
Dice	1847
Science Fiction	1666
Fighting	1658
Abstract Strategy	1545
Economic	1503
Animals	1354

Surprisingly, the most popular board game category is “Card Game”! We can create a simple visual showing the prevalence of each of these top 10 categories:

top_categories |>
  mutate(boardgamecategory = fct_reorder(boardgamecategory, n, .desc = TRUE)) |>
  ggplot(aes(x = boardgamecategory, y = n)) +
  geom_col() +
  labs(
    x = "Category",
    y = NULL
  )
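As an aside, the group_by() / summarize() / arrange() tally used to build top_categories can also be written with dplyr::count(). The sketch below checks that the two forms agree on a small made-up data frame (the games and categories are hypothetical).

```r
library(dplyr)

# Hypothetical long-format data: one row per (game, category) pair
toy <- tibble::tibble(
  name = c("A", "A", "B", "C", "C", "C"),
  boardgamecategory = c("Card Game", "Dice", "Card Game",
                        "Card Game", "Dice", "Wargame")
)

# The explicit three-step tally
long_way <- toy |>
  group_by(boardgamecategory) |>
  summarize(n = n()) |>
  arrange(desc(n))

# count() groups, tallies, and (with sort = TRUE) orders by descending n
short_way <- toy |>
  count(boardgamecategory, sort = TRUE)

short_way
```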

Looking at `playingtime`, `minplayers`, and `maxplayers`

Let’s put the work that we’ve done on the categories field on hold for a minute and look at how a game’s average number of players (the midpoint of minplayers and maxplayers) relates to its play time. Before making any plots, I would suspect that play time increases with the number of players. That is to say, I would expect a positive correlation between the two variables.

# First, we need to do a little more cleaning
# Filter out some outliers, compute avg_players
df <- df |> 
  filter(maxplayers < 20) |>
  filter(playingtime < 1000) |> 
  mutate(playingtime = playingtime / 60) |>
  mutate(avg_players = (minplayers + maxplayers)/2) 
  
df |>
  distinct(name, .keep_all = TRUE) |> # Don't care about categories right now
  ggplot(aes(x = avg_players, y = playingtime)) +
  geom_jitter(height = .5, width = .5, size = .1, alpha = .5) +
  scale_x_continuous(breaks = seq(0, 12, by = 2)) +
  scale_y_continuous(breaks = seq(0, 14, by = 2)) +
  coord_cartesian(ylim = c(0, 14), expand = FALSE) +
  labs(
    x = "Average no. of players",
    y = "Average play time (Hours)"
  )

Interestingly, this does not seem to be the case! In fact it seems like it may be the opposite—play time appears to be maximized when there are between 2 and 4 players and drops off as the number of players increases.

Unfortunately, the above plot has a few issues that stand in the way of us making useful observations. First, I had to do some severe jittering to eliminate graphical artifacts resulting from the discrete nature of the data. Notice that several of the points seem to correspond to games with fewer than 0 average players, which is purely an artifact of the jitter! Second, there is pretty severe overplotting. Although I have attempted to mitigate this by setting both the size and alpha arguments, the plot is still very crowded, especially around the horizontal axis between the 2 and 4 player ticks.

Fortunately, I know of a tool that can help with both of these issues—ggdensity!

library("ggdensity")

df |> 
  distinct(name, .keep_all = TRUE) |> # Don't care about categories right now
  ggplot(aes(x = avg_players, y = playingtime)) +
  geom_hdr(adjust = c(2, 4)) + # Need to set adjust b/c of discreteness
  scale_x_continuous(breaks = seq(0, 12, by = 2)) +
  scale_y_continuous(breaks = seq(0, 14, by = 2)) +
  coord_cartesian(ylim = c(0, 14), expand = FALSE) +
  labs(
    x = "Average no. of players",
    y = "Average play time (Hours)"
  )

Above, we are plotting estimated “Highest Density Regions” (HDRs). These are, essentially, the smallest regions containing 50%, 80%, 95%, and 99% of the data. For more information, check out the ggdensity repo. Notice that the issues of overplotting and jittering are eliminated, as we have abandoned the strategy of plotting individual points. Now that we’ve taken care of these problems, we can see that there is a negative association between the average number of players and average play time. This is unexpected! Let’s look a little deeper, leveraging our previous work on boardgamecategory.
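For intuition about what an HDR is, here is a rough sketch of how such regions can be derived from a kernel density estimate. This is an illustration under my own assumptions (simulated data, MASS::kde2d() as the estimator, a simple grid approximation) and is not necessarily how ggdensity computes them internally. The idea is to find, for each probability, the density cutoff such that the grid cells at or above that cutoff hold the desired share of the estimated mass.

```r
library(MASS)  # kde2d(); MASS ships with standard R installations

set.seed(1)
# Simulated stand-in data (hypothetical, not the board game data)
x <- rnorm(2000)
y <- rnorm(2000)

# Kernel density estimate on a 100 x 100 grid
est <- kde2d(x, y, n = 100, lims = c(-5, 5, -5, 5))

# Sort grid cells from highest to lowest density and accumulate their
# share of the total estimated mass
dens <- sort(as.vector(est$z), decreasing = TRUE)
cum_mass <- cumsum(dens) / sum(dens)

# For each probability, the density value at which the accumulated
# share first reaches that probability
probs <- c(0.5, 0.8, 0.95, 0.99)
cutoffs <- sapply(probs, function(p) dens[which(cum_mass >= p)[1]])
names(cutoffs) <- paste0(probs * 100, "%")
cutoffs
```

Each HDR is then the set of points whose estimated density is at least the corresponding cutoff, so the 50% region sits inside the 80% region, and so on.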

Putting it all together

Does this negative association hold true across the most popular categories? Or is this yet another example of Simpson’s paradox? There’s only one way to find out—faceting!

df |> 
  filter(boardgamecategory %in% top_categories$boardgamecategory[1:5]) |>
  mutate(boardgamecategory = fct_reorder(boardgamecategory, playingtime, mean, .desc = TRUE)) |>
  ggplot(aes(x = avg_players, y = playingtime, fill = boardgamecategory)) +
  geom_hdr(adjust = 2) + # Need to set adjust b/c of discreteness
  facet_wrap(vars(boardgamecategory), ncol = 5) +
  scale_x_continuous(breaks = seq(0, 12, by = 2)) +
  scale_y_continuous(breaks = seq(0, 14, by = 2)) +
  scale_fill_brewer(type = "qual", palette = 2, guide = NULL) +
  coord_cartesian(ylim = c(0, 14), expand = FALSE) +
  labs(
    x = "Average no. of players",
    y = "Average play time (Hours)"
  )

This plot offers a new perspective. Within each category, average play time and number of players appear to be roughly independent. Also, it looks like “Wargame” board games tend to involve fewer players and last significantly longer than games in other categories. Combining board games across categories creates the illusion that play time and number of players are negatively correlated. Another point for Simpson!
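The effect described here is easy to reproduce with simulated data. In the sketch below (all numbers are made up for illustration), two hypothetical groups each have independent player counts and play times, but the groups differ in their averages. Pooling them produces a clearly negative correlation that is absent within either group.

```r
set.seed(42)
n <- 500

# Hypothetical group A: few players, long games
a <- data.frame(group = "A", players = rnorm(n, 2, 0.5), hours = rnorm(n, 4, 1))
# Hypothetical group B: more players, short games
b <- data.frame(group = "B", players = rnorm(n, 5, 0.5), hours = rnorm(n, 1, 1))
sim <- rbind(a, b)

# Correlation ignoring the grouping: strongly negative
pooled <- cor(sim$players, sim$hours)

# Correlation computed separately within each group: close to zero
within <- sapply(split(sim, sim$group), function(g) cor(g$players, g$hours))

pooled
within
```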
