Notes on the data.table package
Work in progress
Advantages of data.table:
- speed (especially with large data sets)
- concise code (useful when working interactively)
- some conceptual similarities with SQL syntax
- syntax builds on R data frame syntax (“data frame on steroids”)
- mature and actively developed package
- chaining (equivalent to pipes)
- very fast function for reading CSV files (fread)
- a data.table object behaves like a data frame in many ways
  - data.table can be used alongside the tidyverse where required
  - or use dtplyr to write tidyverse code and get almost data.table speed (e.g. for speeding up large lookup tables)
- minimal dependencies
- can do almost everything the tidyverse can (with the notable exception of the dbplyr functionality for reading from databases, which is mostly syntactic sugar anyway), with only one package (vs 10-20 typically for the tidyverse)
- lots of help on StackOverflow, with other users typically competing to achieve complex data manipulations in a single line of data.table code
Disadvantages:
- something else to learn
- may require more familiarity with base R or R internals than the tidyverse (e.g. lapply, NA_integer_ and 1L may not be your thing)
- you also need to understand what a data frame is, i.e. a list of vectors with special properties (names must be unique; all elements must be vectors of the same length, and a vector can only contain one type of data; some matrix-like properties allow extraction of rows/columns; it has row numbers; etc.)
  - a data.table is also a list, and you will see list-related functions applied to it
- code can seem cryptic
- various traps for the unwary
Faster than:
- tidyverse
- pandas (Python)
- Julia
Not necessarily faster than:
- polars
- reading data in optimised data formats like Apache Parquet
- collapse: uses similar optimisations but has more statistical functionality
  - main page: Advanced and Fast Data Transformation in R • collapse
  - can be used with the tidyverse: collapse for tidyverse Users • collapse
- fst package: may be faster for reading/writing data files
Why is data.table so fast?
- optimisations of data structures
- functions compiled from C code
- efficient indexing
- parallel processing
- typically changes R data objects “in place”
  - instead of copying the object in RAM, changing the copy and then replacing the original object, which is the memory-hungry default
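A minimal sketch of modification by reference (column names invented for illustration):

```{r}
library(data.table)

DT <- data.table(x = 1:5)

# `:=` adds or updates a column by reference: no copy of DT is made
DT[, y := x * 2]

# the base-R equivalent, DT$y <- DT$x * 2, would first copy the whole
# object in RAM (the memory-hungry default behaviour)
```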
Dependencies:
- mostly core R packages
- the bit64 package is suggested if working with large integers (you will need it for NHS numbers)
Associated packages (rarely needed):
- kit: adds some optimised data manipulation functions
- dataPreparation: automated data preparation for data science
- the “fastverse” (metapackage)
Various options can be set for data.table:
- after loading the package, run options() and look at the ones beginning $datatable. (I have never touched these)
- data.table will automatically identify the number of CPU cores it can use for parallel processing
Installation: available on CRAN
Reading a CSV file:

```{r}
DT <- fread("datafile.csv")
```

- automatically detects data types
  - NB this can fail if there are many blanks at the start of the file - see colClasses if so
- useful options: select, drop
- can use command line tools to preprocess data
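For example (hypothetical file and column names; select and colClasses are real fread arguments):

```{r}
library(data.table)

DT <- fread("datafile.csv",
            select = c("id", "age"),              # read only these columns
            colClasses = list(character = "id"))  # override type detection
```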
Convert a data frame or tibble to a data.table (in place):

```{r}
setDT(mydataframe)
setDT(mytibble)
```

Can also use (makes a copy):

```{r}
DT <- as.data.table(DF)
```
Read an Excel file into a data.table:

```{r}
library(readxl)
DT <- read_excel("datafile.xlsx") |> setDT()
```

Workaround for reading in Parquet files: fread feather · Issue #2026 · Rdatatable/data.table · GitHub
- only works if a data.table is saved as Parquet
Creating a data.table from vectors:

```{r}
DT <- data.table(x = rnorm(100), y = rnorm(100))
```

Copying a data.table:

```{r}
newDT <- copy(oldDT) # deep copy
newDT <- oldDT       # shallow copy
```

- Beginner mistake 1: not distinguishing between shallow and deep copies
  - with a shallow copy, any changes made to newDT by reference will also be made to oldDT
  - this is because they are just different names for the same object in memory
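The difference can be sketched like this (toy data; `:=` modifies by reference, so the shallow “copy” and the original change together):

```{r}
library(data.table)

oldDT <- data.table(x = 1:3)

shallowDT <- oldDT       # just another name for the same object
deepDT    <- copy(oldDT) # a genuinely independent object

shallowDT[, x := x * 10] # modifies by reference...
oldDT$x                  # ...so oldDT has changed too: 10 20 30
deepDT$x                 # unaffected: 1 2 3
```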
Syntax for a data.table object simplifies and extends data frame syntax:
- Beginner mistake 2: using dollar notation with data.tables
  - as with a data frame, you can select rows of data with an integer, logical or character vector, or an expression that evaluates to one of these
  - but unlike with a data frame, names are assumed to be columns of the data.table unless otherwise indicated
  - so you should rarely need to use dollar notation:

```{r}
DT[DT$age > 17, ] # will work but unnecessary typing
DT[age > 17]      # all that is needed, and slightly more readable IMHO
```

  - note no comma is needed
  - logical NAs are treated as false (usually safest)
- you can put a lot more in i - see ?data.table
  - joins with other data.tables - ? with data frames
  - lists, matrices - why?
- the general form is DT[i, j, by]
- grouping: by, keyby
- list() and its alias .()
- special symbols: .N, .SD
  - NB: := (assignment by reference)
- others which I have rarely used, e.g. .I, .GRP, .BY, .NGRP, .EACHI
  - .I (row number)
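A short sketch of the DT[i, j, by] form with toy data (column names invented for illustration):

```{r}
library(data.table)

DT <- data.table(age = c(15, 22, 35, 41), sex = c("F", "M", "F", "M"))

DT[age > 17]                   # i: filter rows
DT[, .(mean_age = mean(age))]  # j: compute on columns; .() is an alias for list()
DT[, .N, by = sex]             # by: group; .N is the number of rows per group
DT[age > 17, .(mean_age = mean(age)), by = sex]  # all three together
```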
Keys, primary and secondary
Renaming variables in place
Ordering variables in place
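These in-place operations can be sketched as follows (toy table; all of these are real data.table functions):

```{r}
library(data.table)

DT <- data.table(b = 4:6, a = 1:3)

setkey(DT, a)                   # primary key (also sorts the table by a)
setindex(DT, b)                 # secondary index; no reordering
setnames(DT, "b", "beta")       # rename a column in place
setcolorder(DT, c("a", "beta")) # reorder columns in place
```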
Use “fast” versions of functions, prefixed with f, e.g. fifelse

Beginner mistake 3: trying to translate literally from tidyverse syntax
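For instance, fifelse is a faster, type-checked replacement for base ifelse (toy vector for illustration):

```{r}
library(data.table)

x <- c(1, NA, 3)

fifelse(x > 2, "big", "small")                  # NA in, NA out
fifelse(x > 2, "big", "small", na = "missing")  # optional na argument
```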
Differences between data frame and data.table:
- with = FALSE
- using correct NA and 1L and coercion
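A sketch of with = FALSE (toy table; ..cols is the real data.table shorthand):

```{r}
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)
cols <- c("a", "b")

DT[, cols, with = FALSE] # treat j as column names, data-frame style
DT[, ..cols]             # equivalent: look cols up outside the data.table
```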
Tricks:
- lapply, mapply, which.max
- append [] to a command to print the result
- row-wise operations are a bit kludgy - ? use collapse
- “piping” with chains
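Chaining can be sketched like this (toy data): the result of one [...] feeds straight into the next, like a pipe.

```{r}
library(data.table)

DT <- data.table(g = c("a", "a", "b"), v = 1:3)

DT[, .(total = sum(v)), by = g][order(-total)]

# a trailing [] just prints the result, useful after a := assignment:
DT[, w := v * 2][]
```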
data.table in R packages:
- data.table should be in Imports, not Depends (you shouldn’t use the Depends field for anything anyway)
  - if you put it in Suggests instead, then add .datatable.aware = TRUE to one of the R/* files
- if you use the usethis package then this has a handy use_data_table function
- you can also just import the bits you want if you are concerned about masking
  - e.g. you could just import fread
- more detail here: Importing data.table
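If you build your NAMESPACE with roxygen2, the import might look like this (a sketch; assumes you use roxygen2):

```{r}
# in any R/*.R file:

#' @import data.table
NULL

# or, to import only the bits you want:
#' @importFrom data.table fread
NULL
```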
Still to write up: joins, merges, rollup, dicing, pivot, reshape, interval joins, inequality (non-equi) joins
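As a starting point, a keyed join can be sketched like this (toy tables; X[Y] and on= are real data.table syntax):

```{r}
library(data.table)

patients <- data.table(id = 1:3, age = c(30, 40, 50))
visits   <- data.table(id = c(1, 1, 3), n = c(2, 1, 4))

setkey(patients, id)
patients[visits]             # keyed join: look up visits' ids in patients

visits[patients, on = "id"]  # same idea without a key, using on=
```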
look at pinboard for other ideas
Documentation: Enhanced data.frame — data.table-package • data.table
FAQ: datatable-faq.pdf
Cheatsheets:
Useful links: