Ways to demonstrate the CLT with purrr

The tidyverse is one of my favourite suites of R packages, and its idiomatic approach to data manipulation, analysis and visualization has heavily influenced how I approach data problems in R. If one tidyverse package best embodies these qualities, my vote goes to purrr and its clean approach to iteration. purrr goes well beyond map and its variants (check out this presentation for a great sampler), but I've found it far too easy to ignore anything beyond purrr's simplest functions.

A while ago at work I had a great opportunity to dive deeper into purrr. While demonstrating the Central Limit Theorem to some colleagues, I whipped up a simulation of the dice-rolling experiment in R. I didn’t save the code, but it looked a lot like this:

library(tidyverse)
library(purrr)
library(rlang)
library(daniel) # see https://gitlab.com/danielspracklin/daniel

simulate_dice <- function(n, number_of_dice, ...) {
  # Roll each die n times, giving one column per die
  seq.int(number_of_dice) %>%
    map(~sample(1:6, n, replace = TRUE)) %>%
    as.data.frame(row.names = NULL) %>%
    # Average across the dice within each row (each row is one throw)
    rowwise() %>%
    summarize(average_of_dice = sum(c_across(where(is.numeric))) / {{number_of_dice}},
              number_of_dice = {{number_of_dice}})
}

clt_plotter <- function(df, x_value, facet_value) {
  df %>%
    ggplot() +
    geom_density(aes({{x_value}}, fill = "1"), colour = "black") +
    scale_fill_daniel() +
    facet_wrap(vars({{facet_value}})) +
    theme_daniel() +
    guides(fill = "none") +
    labs(x = "Sampling distribution of the mean",
         y = "Density")
}

And here’s the output. The wiggles in the number_of_dice = 1 facet are a necessary evil if we want to show the smooth behaviour for larger numbers of dice without customizing the density plots’ bandwidth and kernel.

set.seed(613)
1:9 %>%
  map_dfr(~simulate_dice(10000, .x)) %>%
  clt_plotter(., x_value = average_of_dice, facet_value = number_of_dice)
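
As a rough sanity check on what the facets should show: the mean of k fair dice is centred on 3.5 with standard deviation sqrt((35/12) / k), so each facet should be narrower than the last. The sketch below is not part of the original simulation. It uses base R only, and the names die_var, theoretical_sd and simulated_sd are made up for illustration.

```r
# Moments of a single fair die
faces <- 1:6
die_mean <- mean(faces)                  # 3.5
die_var <- mean(faces^2) - die_mean^2    # 35/12

# Standard deviation of the mean of k dice, for k = 1..9
k <- 1:9
theoretical_sd <- sqrt(die_var / k)

# Simulate 10000 throws of k dice and measure the spread of the means
set.seed(613)
simulated_sd <- sapply(k, function(d) {
  rolls <- matrix(sample(faces, 10000 * d, replace = TRUE), ncol = d)
  sd(rowMeans(rolls))
})

round(cbind(k, theoretical_sd, simulated_sd), 3)
```

The two columns should agree closely, which is the narrowing the density plots display.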

The simulate_dice code is functional but hacky:

Heavy reliance on dataframes here is a crutch, since we have to use the very slow rowwise operation to create a column containing the mean of each set of thrown dice.
Inserting the number of dice thrown as its own column is an ugly way to prepare to facet the clt_plotter visualization.

We can do better than this, and in the process learn more about the power of purrr.

We’ll start by writing a function that doesn’t immediately coerce everything to a dataframe. Instead, we’ll keep the results as a list for as long as possible. To add the vectors element-wise, we can use purrr::reduce, which avoids rowwise and summarize. (Since addition is associative, we don’t need to specify the .dir argument for reduce.) I also wrote a division function, vector_division, to get the mean of each set of thrown dice, although there’s likely a more idiomatic way to do this.

simulate_dice_list <- function(n, number_of_dice, ...) {

  # Helper: divide every element of x by divisor
  vector_division <- function(x, divisor) {x / divisor}

  seq.int(number_of_dice) %>%
    map(~sample(1:6, n, replace = TRUE)) %>%   # one vector of n rolls per die
    reduce(`+`) %>%                            # element-wise sum across dice
    map(~vector_division(.x, number_of_dice)) %>%
    unlist()
}
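
To make the reduce step concrete, here is a toy example, followed by one possible simplification of the division step. This is a sketch rather than the version benchmarked in this post: simulate_dice_vec is a hypothetical name, and it relies only on the fact that division in R is already vectorized. For the same seed it should return the same values as simulate_dice_list.

```r
library(purrr)

# Toy input: three "dice", four rolls each
rolls <- list(c(1, 6, 3, 2), c(4, 4, 5, 1), c(2, 3, 6, 6))

# reduce(`+`) adds the vectors element-wise: (die1 + die2) + die3
totals <- reduce(rolls, `+`)
totals
#> [1]  7 13 14  9

# Division is vectorized, so the means need neither a helper nor map()
totals / length(rolls)

# Hypothetical variant of simulate_dice_list built on that idea
simulate_dice_vec <- function(n, number_of_dice) {
  totals <- seq_len(number_of_dice) %>%
    map(~ sample(1:6, n, replace = TRUE)) %>%
    reduce(`+`)
  totals / number_of_dice
}
```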

Now let’s compare the outputs to confirm that they’re identical. Since the new version doesn’t produce a dataframe, we’ll have to wrangle it into shape: enframe gives us a nested dataframe, which we then flatten with unnest. With the same seed, the two versions return identical results.

set.seed(613)
old_version <- 1:9 %>%
  map_dfr(~simulate_dice(10000, .x))

set.seed(613)
new_version <- 1:9 %>%
  map(~simulate_dice_list(10000, .x)) %>%
  enframe(.) %>%
  unnest(value)

identical(old_version$average_of_dice,
          new_version$value)

## [1] TRUE
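
For readers who haven't used enframe and unnest together, here is a small sketch of what the two steps do to a list of numeric vectors. The object names toy, nested and flat are invented for this example.

```r
library(tibble)
library(tidyr)

# Two "simulations" of two values each, held in an unnamed list
toy <- list(c(1.5, 2), c(3, 3.5))

# enframe() gives one row per list element: `name` holds the position,
# `value` is a list column holding each vector
nested <- enframe(toy)

# unnest() expands the list column to one row per individual value
flat <- unnest(nested, value)
flat
```

The name column repeats for every value that came from the same list element, which is what later lets it serve as the facetting variable.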

So which is better? It all comes down to performance, which I assessed quickly with microbenchmark.

library(microbenchmark)

set.seed(613)
benchmark <- microbenchmark(
  old_version = 1:9 %>%
    map_dfr(~simulate_dice(100, .x)),
  new_version = 1:9 %>%
    map(~simulate_dice_list(100, .x)) %>%
    enframe(.) %>%
    unnest(value))

benchmark

## Unit: milliseconds
##         expr       min        lq     mean   median       uq       max neval
##  old_version 23.097471 36.879219 57.42355 54.53358 72.07341 131.73285   100
##  new_version  5.350244  7.995392 15.33288 13.58652 19.48240  67.29409   100

autoplot(benchmark) +
  theme_daniel()

The difference is drastic: the median run time drops from about 54.5 ms to about 13.6 ms, roughly a fourfold speedup. rowwise, while easy to pluck from the grab bag of dplyr tools, is indeed much slower than using lists.
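
rowwise isn't the only way to average across columns, though. Another option, which I haven't benchmarked here, is to bind the draws into a matrix and let base R's rowMeans do the work. The function name simulate_dice_rowmeans is hypothetical, and this is only a sketch of the idea.

```r
library(purrr)

simulate_dice_rowmeans <- function(n, number_of_dice) {
  # One vector of n rolls per die, as in the list-based version
  draws <- seq_len(number_of_dice) %>%
    map(~ sample(1:6, n, replace = TRUE))

  # Bind into an n x number_of_dice matrix, one column per die
  rolls <- do.call(cbind, draws)

  # Vectorized mean of each row (each row is one throw of all the dice)
  rowMeans(rolls)
}

set.seed(613)
head(simulate_dice_rowmeans(100, 4))
```

Because the dice are drawn in the same order, the means should match the reduce-based version up to floating-point rounding.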

Here’s the final code. As a bonus, we can avoid the summarize(number_of_dice = {{number_of_dice}}) code above by naming the input vector with setNames.

set.seed(613)
setNames(1:9, 1:9) %>%
  map(~simulate_dice_list(10000, .x)) %>%
  enframe(.) %>%
  unnest(value) %>%
  clt_plotter(., x_value = value, facet_value = name)
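
One thing to watch with this trick: names are always character, so the name column that enframe produces from a named list is character too. That makes no difference for 1 to 9 dice, but with ten or more, a character sort would put "10" ahead of "2". The sketch below uses purrr::set_names as a shorthand for setNames and a stand-in for simulate_dice_list. The objects named_input and means are invented for illustration.

```r
library(purrr)
library(tibble)
library(tidyr)

# set_names() with no second argument names the vector after its own values
named_input <- set_names(1:12)

means <- named_input %>%
  map(~ rep(.x, 2)) %>%   # stand-in for simulate_dice_list(10000, .x)
  enframe() %>%
  unnest(value)

class(means$name)         # character, because it comes from the list names

# Converting back to integer restores numeric ordering for sorting or facets
means$name <- as.integer(means$name)
```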

No difference in output, but faster and cleaner. It pays to use purrr in conjunction with more than just dataframes!