How to do cumulative unique count with R and dplyr

August 23, 2018

Couple of days ago I had to find a way to visualize in R cumulative number of unique customers over period of time. Having data with dates and customer ids ready I felt pretty confident that I will finish the task in seconds and most of the time will be eaten by loading tidyverse package. Damn I could have felt the coffee break LOOMING behind my screen.

Little did I know back then but apparently there is no function yet to do just that one task. I am always a bit shocked or/and sad when I am not able to do a task with purrr dplyr and this seemed like a case.

So checked cumstats…. none.

Maybe TTR?…… none.

…Plyr?

DataTable? Probably, but I want to keep code sort of in one format.

All of a sudden instead LOOMING break I was facing a danger of me LOOPING the data. Which is dreadful.

I searched and then I searched, then I tried some things and then tried some more. At the end of the day (not literally, it did not take me so much time…) I got result which satisfied my needs:

This is the solution with minimal example:

suppressPackageStartupMessages(library(tidyverse))

df <- tibble(
  item = c(1,2,3,4,5,6,1,2,3,7,4,8,9),
  day = c(1,1,1,1,1,1,2,2,2,2,3,3,3)
)

df
## # A tibble: 13 x 2
##     item   day
##    <dbl> <dbl>
##  1     1     1
##  2     2     1
##  3     3     1
##  4     4     1
##  5     5     1
##  6     6     1
##  7     1     2
##  8     2     2
##  9     3     2
## 10     7     2
## 11     4     3
## 12     8     3
## 13     9     3

Dataset has date-alike column (day number) and unique item id which we want to accumulate, first we need to group-nest ids over date (I have no clue how well or badly it scales…).

suppressPackageStartupMessages(library(tidyverse))

stage <- df %>% 
  group_by(day) %>%
  summarise(item =  list(unique(item)))

stage
## # A tibble: 3 x 2
##     day item     
##   <dbl> <list>   
## 1     1 <dbl [6]>
## 2     2 <dbl [4]>
## 3     3 <dbl [3]>

Then map function written below…

cumulative_n_distinct = function(x){
  require(purrr)
  require(dplyr)
                   y <- x %>%
                   accumulate(append) %>%
                   map(.,n_distinct) %>%
                     as.numeric
  
  return(y)
}

..to our nested dataset:

stage %>%
  mutate(n_uniqe_cum = cumulative_n_distinct(item)) %>%
  select(day, n_uniqe_cum) 
## # A tibble: 3 x 2
##     day n_uniqe_cum
##   <dbl>       <dbl>
## 1     1           6
## 2     2           7
## 3     3           9

Voila!

Take in mind that this example is quite minimal and my task also wasn’t including very big dataset (around 1000 customers). Still feel free to copy-paste, leave a comment or maybe improve the solution!

Cheers