I use the data.table package in preference to the tidyverse, mainly because it is fast at doing complex manipulations on intermediate-sized data sets and I’m impatient. data.table handles tasks in seconds that might take much longer to run (or in extreme circumstances be impossible) with the tidyverse.

I was looking at some comparative performance benchmarks for data.table and other tools here recently (there is a different version here) and was interested to see that there is now a library called Polars that is faster than data.table. It can be used via Python or R.

Polars is fast because it is written in an (increasing popular) language called Rust (fast), uses the Apache Arrow optimised memory format (fast), uses parallelisation (fast) and can optimise queries to its API (also fast). The Arrow (in memory) format also works well with the Parquet (compressed file) format. data.table doesn’t directly support reading Parquet files though there are workarounds. I will do a blog post on new data/file formats and new tools for importing/exporting data in R at some point.

As I am primarily an R user, I was interested to learn more about polars and find out if data.table was no longer the fastest option for me.

The instructions I used for installing polars (version 0.5.0.9000) are here. I installed some additional packages as explained later.

library(data.table)
# install.packages('polars', repos = 'https://rpolars.r-universe.dev')
library(polars)
# install.packages('wakefield')
# install.packages('microbenchmark')

Now I needed to generate some random data. I used the appropriately-named wakefield package to make up some data and then saved it as a CSV file.

Note that I will show tables as Markdown with knitr::kable, and will measure how long code takes to run using the microbenchmark package.

dummydata <- wakefield::r_data_frame(
  n = 5000,
  name,
  dob,
  age,
  sex,
  height,
  smokes,
  marital,
  education,
  children,
  employment,
  death
)
write.csv(dummydata, 'dummydata.csv')
knitr::kable(head(dummydata))
NameDOBAgeSexHeightSmokesMaritalEducationChildrenEmploymentDeath
Angely2010-02-0529Female69FALSEDivorcedBachelor’s Degree0Full TimeFALSE
Raelyn2008-07-2471Male64FALSEDivorced9th Grade to 12th Grade, No Diploma1Full TimeTRUE
Breale2009-01-2675Female76TRUESeparatedMaster’s Degree1Full TimeTRUE
Earth2009-12-2668Male70FALSEDivorcedMaster’s Degree1StudentFALSE
Cameran2010-02-1327Female75FALSENever MarriedRegular High School Diploma2Full TimeFALSE
Daejohn2008-05-3036Male70TRUENever MarriedMaster’s Degree1Full TimeFALSE

Next I needed to learn the polars syntax. Some of the documentation is still in the early stages and may be being outpaced by development, as some of the examples given didn’t work for me. If you know Python (and particularly the pandas module),polars syntax will look familiar, as it’s a bit like Python code with the dots replaced by dollar signs. polars functions are prefixed with pl$, whereas polars methods (methods are just functions attached to an object) are prefixed with the name of the object. Because polars uses chaining you can present your code in a tidyverse way as you will see below.

I first compared the time by polars to read in a CSV file (5,000 rows) compared to data.table:

microbenchmark::microbenchmark(
  'polars' = {
    dd_po <- pl$read_csv('dummydata.csv')
    },
  'data.table' = {
    dd_dt <- fread('dummydata.csv')
  })
## Unit: milliseconds
##        expr    min      lq     mean  median      uq     max neval cld
##      polars 2.6008 2.95305 3.678027 3.32280 3.98600  9.5663   100  a
##  data.table 5.1824 5.72105 6.865466 6.26545 6.99405 22.0428   100   b

polars was almost twice as fast (using median time) as data.table::fread at reading in a CSV file with 5,000 rows. This is an impressive start, and using Parquet files to save data might speed that up even more.

However polars was much slower at converting an existing data frame into its own format than data.table:

dd_df <- as.data.frame(dd_dt)
microbenchmark::microbenchmark(
  'polars' = {
    dd_po2 <- pl$DataFrame(dd_df)
  },
  'data.table' = {
    setDT(dd_df)
  }
)
## Unit: microseconds
##        expr    min      lq     mean median     uq     max neval cld
##      polars 2204.3 2810.75 4387.394 3565.6 4828.4 22119.3   100  a
##  data.table   63.5  112.15  196.147  159.5  225.3   880.6   100   b

Functions like class, head, tail, names and dim seem to work as expected (see here for a list), though as you can see polars has its own way of printing abbreviated tabular data to the console.

# Use dd_po2$to_data_frame() to convert back
class(dd_po2)
## [1] "DataFrame"
head(dd_po2)
## shape: (6, 12)
## ┌─────┬─────────┬────────────┬─────┬─────┬─────────────────────┬──────────┬────────────┬───────┐
## │ V1  ┆ Name    ┆ DOB        ┆ Age ┆ ... ┆ Education           ┆ Children ┆ Employment ┆ Death │
## │ --- ┆ ---     ┆ ---        ┆ --- ┆     ┆ ---                 ┆ ---      ┆ ---        ┆ ---   │
## │ i32 ┆ str     ┆ date       ┆ i32 ┆     ┆ str                 ┆ i32      ┆ str        ┆ bool  │
## ╞═════╪═════════╪════════════╪═════╪═════╪═════════════════════╪══════════╪════════════╪═══════╡
## │ 1   ┆ Angely  ┆ 2010-02-05 ┆ 29  ┆ ... ┆ Bachelor's Degree   ┆ 0        ┆ Full Time  ┆ false │
## │ 2   ┆ Raelyn  ┆ 2008-07-24 ┆ 71  ┆ ... ┆ 9th Grade to 12th   ┆ 1        ┆ Full Time  ┆ true  │
## │     ┆         ┆            ┆     ┆     ┆ Grade, No Dipl...   ┆          ┆            ┆       │
## │ 3   ┆ Breale  ┆ 2009-01-26 ┆ 75  ┆ ... ┆ Master's Degree     ┆ 1        ┆ Full Time  ┆ true  │
## │ 4   ┆ Earth   ┆ 2009-12-26 ┆ 68  ┆ ... ┆ Master's Degree     ┆ 1        ┆ Student    ┆ false │
## │ 5   ┆ Cameran ┆ 2010-02-13 ┆ 27  ┆ ... ┆ Regular High School ┆ 2        ┆ Full Time  ┆ false │
## │     ┆         ┆            ┆     ┆     ┆ Diploma             ┆          ┆            ┆       │
## │ 6   ┆ Daejohn ┆ 2008-05-30 ┆ 36  ┆ ... ┆ Master's Degree     ┆ 1        ┆ Full Time  ┆ false │
## └─────┴─────────┴────────────┴─────┴─────┴─────────────────────┴──────────┴────────────┴───────┘

data.table was comparable for selecting columns, whether or not you used the “lazy” option with polars:

microbenchmark::microbenchmark(
 'polars' = {
   dd_po$select(c('Name', 'DOB'))
 },
 'polars lazy' = {
   dd_po$lazy()$select(c('Name', 'DOB'))$collect()
 },
 'data.table' = {
   dd_dt[, .(Name, DOB)]
 }
)
## Unit: microseconds
##         expr   min     lq    mean median      uq    max neval cld
##       polars 490.8 641.85 823.833 784.65  931.80 2083.5   100   a
##  polars lazy 527.4 662.80 847.492 822.60 1019.95 1660.3   100   a
##   data.table 519.4 714.10 941.730 822.05  975.75 8226.5   100   a

data.table was much faster for applying a function to a column of data:

microbenchmark::microbenchmark(
  'polars'= {
    dd_po$select('Children')$sum()
  },
  'polars lazy'= {
    dd_po$lazy()$select('Children')$sum()$collect()
  },
  'data.table' = {
    dd_dt[, sum(Children)]
  }
)
## Unit: microseconds
##         expr    min      lq     mean  median      uq    max neval cld
##       polars 1129.1 1611.10 1895.107 1813.35 2114.20 5452.8   100 a
##  polars lazy  766.4 1002.55 1158.662 1111.20 1278.00 2137.7   100  b
##   data.table  262.9  486.95  609.637  609.15  734.85 1433.0   100   c

Note that you can make your polars code look like tidyverse:

dd_po $
  select('Children') $
  sum()
## shape: (1, 1)
## ┌──────────┐
## │ Children │
## │ ---      │
## │ i64      │
## ╞══════════╡
## │ 10523    │
## └──────────┘

With the select method you can do arithmetic. data.table is fastest.

microbenchmark::microbenchmark(
 'polars' = {
   dd_po$
      select(pl$col('Children') ^ 2)
 },
 'polars lazy' = {
   dd_po$
     lazy()$
     select(pl$col('Children') ^ 2)$
     collect()
 },
 'data.table' = {
   dd_dt[, Children ^ 2]
 }
)
## Unit: microseconds
##         expr   min     lq    mean median      uq    max neval cld
##       polars 592.2 759.35 937.684 913.15 1047.10 2041.6   100  a
##  polars lazy 599.2 759.50 951.802 904.50 1027.05 4162.8   100  a
##   data.table 289.0 418.05 503.875 487.15  548.25 1027.5   100   b

You can apply aggregations by groupings. polars is clearly faster.

microbenchmark::microbenchmark(
  'polars'= {
    dd_po$
      groupby('Sex')$
      agg(pl$count(),
          pl$col('Age')$mean())
    # NB dd_po$groupby('Sex')$count() did not work for me
  },
  'polars lazy'= {
    dd_po$
      lazy()$
      groupby('Sex')$
      agg(pl$count(),
          pl$col('Age')$mean())$
      collect()
  },
  'data.table' = {
    dd_dt[, mean(Age), Sex]
  }
)
## Unit: microseconds
##         expr    min     lq     mean  median      uq     max neval cld
##       polars  900.6 1117.8 1473.581 1368.05 1667.05  2877.8   100  a
##  polars lazy  931.4 1115.8 1574.055 1386.55 1699.25  6072.2   100  a
##   data.table 1750.1 2227.7 3043.317 2575.05 3234.10 10125.1   100   b

polars was faster for applying a function to filtered values, though the code is starting to resemble the worst type of tidyverse or pandas code to me - the data.table code is a thing of beauty and simplicity in comparison.

microbenchmark::microbenchmark(
  'polars' = {
    dd_po$
      filter(pl$
               col('Sex')$
               str$
               contains('Male'))$
      select('Children')$
      sum()
  },
  'polars lazy' = {
    dd_po$
      lazy()$
      filter(pl$
               col('Sex')$
               str$
               contains('Male'))$
      select('Children')$
      sum()$
      collect()
  },
  'data.table' = {
    dd_dt[Sex %in% 'Male', sum(Children)]
  }
)
## Unit: milliseconds
##         expr    min      lq     mean  median     uq    max neval cld
##       polars 2.5272 3.00085 3.404755 3.33485 3.7385 6.1626   100 a
##  polars lazy 1.2505 1.49705 1.776377 1.64650 1.9012 6.9724   100  b
##   data.table 1.7016 2.08345 2.379780 2.20760 2.5546 5.4575   100   c

Below is my first attempt to add a column, by selecting all the existing columns and adding a new conditional field derived from another variable, but it doesn’t make the changes “in place” and I would need to reassign to the object, which might be slow.

dd_po$
  lazy()$
  select(
    pl$all(),
    pl$when(
      pl$col('Age') > 16)$
        then('Adult')$
        otherwise('Child')$
      alias('agegrp')
)$
  collect()
## shape: (5000, 13)
## ┌──────┬──────────┬────────────┬─────┬─────┬──────────┬────────────┬───────┬────────┐
## │      ┆ Name     ┆ DOB        ┆ Age ┆ ... ┆ Children ┆ Employment ┆ Death ┆ agegrp │
## │ ---  ┆ ---      ┆ ---        ┆ --- ┆     ┆ ---      ┆ ---        ┆ ---   ┆ ---    │
## │ i64  ┆ str      ┆ str        ┆ i64 ┆     ┆ i64      ┆ str        ┆ bool  ┆ str    │
## ╞══════╪══════════╪════════════╪═════╪═════╪══════════╪════════════╪═══════╪════════╡
## │ 1    ┆ Angely   ┆ 2010-02-05 ┆ 29  ┆ ... ┆ 0        ┆ Full Time  ┆ false ┆ Adult  │
## │ 2    ┆ Raelyn   ┆ 2008-07-24 ┆ 71  ┆ ... ┆ 1        ┆ Full Time  ┆ true  ┆ Adult  │
## │ 3    ┆ Breale   ┆ 2009-01-26 ┆ 75  ┆ ... ┆ 1        ┆ Full Time  ┆ true  ┆ Adult  │
## │ 4    ┆ Earth    ┆ 2009-12-26 ┆ 68  ┆ ... ┆ 1        ┆ Student    ┆ false ┆ Adult  │
## │ ...  ┆ ...      ┆ ...        ┆ ... ┆ ... ┆ ...      ┆ ...        ┆ ...   ┆ ...    │
## │ 4997 ┆ Sherah   ┆ 2008-09-01 ┆ 55  ┆ ... ┆ 2        ┆ Full Time  ┆ true  ┆ Adult  │
## │ 4998 ┆ Salbador ┆ 2010-03-15 ┆ 65  ┆ ... ┆ 3        ┆ Full Time  ┆ true  ┆ Adult  │
## │ 4999 ┆ Keavion  ┆ 2008-08-04 ┆ 67  ┆ ... ┆ 0        ┆ Full Time  ┆ false ┆ Adult  │
## │ 5000 ┆ Anacely  ┆ 2009-07-01 ┆ 76  ┆ ... ┆ 1        ┆ Unemployed ┆ true  ┆ Adult  │
## └──────┴──────────┴────────────┴─────┴─────┴──────────┴────────────┴───────┴────────┘

Apparently with_column/with_columns is the correct approach, though this would also seem to return a new DataFrame requiring reassignment. Yet polars is still almost as fast as data.table.

microbenchmark::microbenchmark(
  'polars lazy' = {
    dd_po <- dd_po$
      lazy()$
      with_column(
        pl$when(
          pl$col('Age') > 16)$
            then('Adult')$
            otherwise('Child')$
          alias('agegrp')
      )$
      collect()
  },
  'data.table' = {
    dd_dt[, agegrp := ifelse(Age > 16, 'Adult', 'Child')]
  }
)
## Unit: milliseconds
##         expr min      lq     mean median      uq    max neval cld
##  polars lazy 1.6 1.97725 2.247649 2.1671 2.48025 4.0360   100   a
##   data.table 1.6 1.95320 2.230975 2.1003 2.43435 4.5123   100   a

In summary, polars is an exciting development with a number of features enabling speed, but at least for the data sets I typically work with, may not yet offer must-have advantages over data.table, and there are still a few rough edges here and there. I will continue to play with it and see how it performs for joins or reshaping, and if there are better ways I can do any of the above things. But data.table’s concise syntax still has me hooked for the moment.

Postscript: data.table does some optimised functions such as %chin% and fifelse which I did briefly look at, but which did not seem to have a major impact on relative timings.

More info here:

sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] polars_0.5.0.9000 data.table_1.14.8
##
## loaded via a namespace (and not attached):
##  [1] pillar_1.9.0         compiler_4.2.3       tools_4.2.3
##  [4] digest_0.6.31        evaluate_0.20        lifecycle_1.0.3
##  [7] tibble_3.2.1         lattice_0.20-45      pkgconfig_2.0.3
## [10] rlang_1.1.0          Matrix_1.5-4         cli_3.6.1
## [13] rstudioapi_0.14      microbenchmark_1.4.9 yaml_2.3.7
## [16] mvtnorm_1.1-3        xfun_0.39            fastmap_1.1.1
## [19] dplyr_1.1.2          knitr_1.42           generics_0.1.3
## [22] vctrs_0.6.2          wakefield_0.3.6      grid_4.2.3
## [25] tidyselect_1.2.0     glue_1.6.2           R6_2.5.1
## [28] fansi_1.0.4          survival_3.5-5       rmarkdown_2.21
## [31] multcomp_1.4-23      TH.data_1.1-2        magrittr_2.0.3
## [34] codetools_0.2-19     htmltools_0.5.5      splines_4.2.3
## [37] MASS_7.3-59          sandwich_3.0-2       utf8_1.2.3
## [40] zoo_1.8-12