
Improved density visualization in R

James Otto, David Kahle

Baylor University



  • ggplot2 includes several ways to estimate and visualize densities for uni- and bivariate data
    • Limited by the difficulty of interpreting density height
  • ggdensity provides interpretable visualizations via highest density regions

A Motivating Example

Defining the HDR

Let \(f(x, y)\) be the pdf of a random vector \(\left( X, Y \right) \in \mathbb{R}^2\). Then for \(\alpha \in (0,1)\) the \(100(1 - \alpha)\%\) highest density region (HDR) is the subset \(R(f_{\alpha}) \subset \mathbb{R}^2\) such that \(R(f_{\alpha}) = \{(x, y): f (x, y) \geq f_{\alpha}\}\) where \(f_{\alpha}\) is the largest constant such that \(\mathrm{P}\left[(X, Y) \in R(f_{\alpha})\right] \geq 1 - \alpha\).
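The definition can be checked numerically in a case where the answer is known. The sketch below assumes a standard bivariate normal, for which \(f(X, Y)\) is uniform on \((0, 1/(2\pi))\) and hence \(f_{\alpha} = \alpha / (2\pi)\). It estimates \(f_{\alpha}\) as the \(\alpha\)-quantile of the density evaluated at simulated draws. The sample size, seed, and variable names are illustrative choices, not part of ggdensity.

```r
set.seed(1)

alpha <- 0.05
n <- 1e5

# draws from a standard bivariate normal and its pdf
x <- rnorm(n)
y <- rnorm(n)
f <- function(x, y) dnorm(x) * dnorm(y)

# Monte Carlo estimate: f_alpha is the alpha-quantile of f(X, Y)
f_alpha_mc <- unname(quantile(f(x, y), probs = alpha))

# closed form for this particular density
f_alpha_exact <- alpha / (2 * pi)

# share of draws falling inside the estimated HDR, should be close to 1 - alpha
coverage <- mean(f(x, y) >= f_alpha_mc)
```

Comparing `f_alpha_mc` with `f_alpha_exact` shows how close the simulated cutoff is, and `coverage` confirms that the region holds roughly 95% of the probability.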


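library(ggplot2); library(ggdensity); library(tibble)
df <- tibble(x = rnorm(1000), y = rnorm(1000))

ggplot(df, aes(x, y)) +
  geom_hdr()  # assumed default layer; the layer call is missing in the source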

df <- tibble(x = rnorm(1000), y = rnorm(1000))

ggplot(df, aes(x, y)) +
  geom_hdr(method = "kde")

df <- tibble(x = rnorm(1000), y = rnorm(1000))

ggplot(df, aes(x, y)) +
  geom_hdr(method = "mvnorm")

df <- tibble(x = rnorm(1000), y = rnorm(1000))

ggplot(df, aes(x, y)) +
  geom_hdr(method = "histogram")
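To make the parametric case concrete, here is a minimal base-R sketch of the idea behind a bivariate normal HDR. It is not ggdensity's internal code. Under a fitted normal model the HDR with probability p is the ellipse of points whose squared Mahalanobis distance from the mean is at most the chi-squared quantile `qchisq(p, df = 2)`. The probability levels and variable names below are illustrative.

```r
set.seed(42)
df <- data.frame(x = rnorm(1000), y = rnorm(1000))

# fit the bivariate normal: sample mean vector and covariance matrix
mu <- colMeans(df)
Sigma <- cov(df)

# squared Mahalanobis distance of every observation from the fitted mean
d2 <- mahalanobis(as.matrix(df), center = mu, cov = Sigma)

# a point lies in the HDR with probability p when d2 <= qchisq(p, df = 2)
probs <- c(0.5, 0.8, 0.95, 0.99)
cutoffs <- qchisq(probs, df = 2)

# empirical share of the data inside each fitted HDR
coverage <- sapply(cutoffs, function(cut) mean(d2 <= cut))
```

If the normal model fits, each entry of `coverage` should sit near the matching entry of `probs`. Large gaps suggest that a nonparametric estimator such as the kernel density estimate is a better choice.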

Palmer Penguins


f <- function(x, y) dnorm(x) * dgamma(y, 5, 3)

ggplot() +
  geom_hdr_fun(fun = f, xlim = c(-4, 4), ylim = c(0, 5))
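When the pdf is known, the HDR cutoffs can be approximated on a grid. The following is a rough sketch of that computation in base R, assuming the same `f` and plotting window as above. It illustrates the definition and makes no claim about how `geom_hdr_fun()` is implemented. The grid size and the helper `hdr_cutoff()` are hypothetical.

```r
f <- function(x, y) dnorm(x) * dgamma(y, 5, 3)

# evaluate the pdf on a regular grid over the plotting window
xs <- seq(-4, 4, length.out = 200)
ys <- seq(0, 5, length.out = 200)
grid <- expand.grid(x = xs, y = ys)
dens <- f(grid$x, grid$y)

# approximate cell probabilities (all cells have the same area)
w <- dens / sum(dens)

# accumulate probability from the highest-density cells downwards
ord <- order(dens, decreasing = TRUE)
cum <- cumsum(w[ord])

# hypothetical helper: density height that bounds the HDR with probability p
hdr_cutoff <- function(p) dens[ord][which(cum >= p)[1]]

cut95 <- hdr_cutoff(0.95)
cut50 <- hdr_cutoff(0.50)

# probability mass of the cells inside the 95% region
mass95 <- sum(w[dens >= cut95])
```

The cutoff for the 50% region is higher than the one for the 95% region, which is why the HDRs are nested.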

df <- data.frame(x = rexp(100, 1), y = rexp(100, 1))

# pdf for parametric density estimate
f <- \(x, y, lambda) dexp(x, lambda[1]) * dexp(y, lambda[2])

# estimate the rate parameters of the joint pdf
# (dexp() takes a rate; the MLE of an exponential rate is 1 / sample mean)
lambda_hat <- 1 / colMeans(df)

# make plot
ggplot(df, aes(x, y)) +
  geom_hdr_fun(
    fun = f, args = list(lambda = lambda_hat),
    xlim = c(0, 7), ylim = c(0, 7)
  ) +
  geom_point(fill = "lightgreen", shape = 21)
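Because `dexp()` is parameterised by rate rather than by mean, the maximum-likelihood estimate of its parameter is the reciprocal of the sample mean. The short simulation below is a sketch of that fact. The true rate of 2, the seed, and the sample size are arbitrary illustrative values.

```r
set.seed(1)

# simulate from an exponential distribution with a known rate
true_rate <- 2
x <- rexp(1e4, rate = true_rate)

# maximum-likelihood estimate of the rate
rate_hat <- 1 / mean(x)

# the sample mean instead estimates 1 / rate
mean_hat <- mean(x)
```

Here `rate_hat` lands near 2 while `mean_hat` lands near 0.5, so passing the sample mean as the rate would only be harmless when the true rate is close to 1.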

Miscellaneous geoms

p <- ggplot(faithful, aes(eruptions, waiting))

p + geom_hdr_lines()
p + geom_hdr_points()
p + geom_hdr_rug()

p <- ggplot(faithful, aes(eruptions, waiting))

p + 
    aes(level = after_stat(probs)),
    stat = "hdr_lines", alpha = 1


