Work in progress
- Advantages of `data.table`:
  - speed (especially with large data sets)
  - concise code (useful when working interactively)
  - some conceptual similarities with SQL syntax
  - syntax builds on R data frame syntax (“data frame on steroids”)
  - mature and actively developed package
  - chaining (equivalent to pipes)
  - very fast function for reading CSV files (`fread`)
  - a `data.table` object behaves like a data frame in many ways
  - `data.table` can be used alongside tidyverse where required
    - or use `dtplyr` to write tidyverse code and get almost `data.table` speed (e.g. for speeding up large lookup tables)
  - minimal dependencies
    - can do almost everything tidyverse can (with the notable exception of `dbplyr` functionality for reading from databases, which is mostly syntactic sugar anyway), with only one package (vs 10-20 typically for tidyverse)
  - lots of help on StackOverflow, with other users typically competing to achieve complex data manipulations with a single line of `data.table` code
- Disadvantages:
  - something else to learn
  - may require more familiarity with base R or R internals than tidyverse (e.g. `lapply`, `NA_integer_` and `1L` may not be your thing)
    - you also need to understand what a data frame is, i.e. a list of vectors with special properties: names must be unique, it has some matrix-like properties allowing extraction of rows/columns, all elements must be vectors of the same length (each vector can only contain one type of data), it has row numbers, etc.
    - a `data.table` is also a list and you will see list-related functions applied to it
  - code can seem cryptic
  - various traps for the unwary
- Faster than:
  - tidyverse
  - pandas (Python)
  - Julia
- Not necessarily faster than:
  - polars
  - reading data in optimised data formats like Apache Parquet
  - `collapse`: uses similar optimisations but has more statistical functionality
    - main page: Advanced and Fast Data Transformation in R • collapse
    - can be used with tidyverse: collapse for tidyverse Users • collapse
  - `fst` package: may be faster for reading/writing data files
- Why is `data.table` so fast?
  - optimisations of data structures
  - core functions are written in compiled C code
  - efficient indexing
  - parallel processing
  - typically changes R data objects “in place”
    - instead of copying the object in RAM, changing the copy and then replacing the original object, which is the memory-hungry default
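As a rough illustration of the “in place” point, the sketch below checks that the memory address of a table is unchanged after it has been modified by reference. The table and its column names are made up for the example; `address()`, `set()` and `:=` are `data.table` functions.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(id = 1:5, value = c(10, 20, 30, 40, 50))

# Record where the object lives in memory before modifying it
addr_before <- address(DT)

# `:=` adds a column by reference, so DT is not copied
DT[, doubled := value * 2]

# set() is a lower-level way of updating by reference
set(DT, i = 1L, j = "value", value = 99)

# Compare the address after the changes with the one recorded earlier
addr_after <- address(DT)
same_object <- identical(addr_before, addr_after)
same_object
```

With a plain data frame, an equivalent `DF$doubled <- DF$value * 2` would normally leave you with a modified copy instead.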
- Dependencies:
  - mostly core R packages
  - the `bit64` package is suggested if working with large integers (you will need it for NHS numbers)
- Associated packages (rarely needed):
  - `kit`: adds some optimised data manipulation functions
  - `dataPreparation`: automated data preparation for data science
  - the “`fastverse`” (metapackage)
- various options can be set for `data.table`
  - after loading the package, run `options()` and look at the ones beginning `$datatable.`
    - I have never touched these
  - `data.table` will automatically identify the number of CPU cores it can use for parallel processing
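A minimal sketch of how the options and the thread count could be inspected. It only uses base R plus `getDTthreads()` and `setDTthreads()`; the variable names are arbitrary.

```r
library(data.table)

# Names of the options whose names start with "datatable."
dt_option_names <- grep("^datatable\\.", names(options()), value = TRUE)
head(dt_option_names)

# Number of threads data.table will currently use
n_threads <- getDTthreads()
n_threads

# The thread count can be lowered, e.g. on a shared machine;
# setDTthreads() returns the previous setting invisibly
old_threads <- setDTthreads(1L)
setDTthreads(old_threads)  # put the previous setting back
```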
- Installation: available on CRAN
- Reading a CSV file: `DT <- fread("datafile.csv")`
  - automatically detects data types
    - NB: detection can fail if there are many blanks at the start of the file; see `colClasses` if so
  - useful options: `select`, `drop`
  - can use command line tools to preprocess data
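The `fread` options mentioned above might be used roughly as follows. The file is a temporary one written just for the example, and the column names are invented.

```r
library(data.table)

# Write a small hypothetical CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
fwrite(
  data.table(id = 1:3, name = c("a", "b", "c"), age = c(17L, 42L, 65L)),
  csv_path
)

# Default: column types are detected automatically
all_cols <- fread(csv_path)

# select: read only the named columns
two_cols <- fread(csv_path, select = c("id", "age"))

# drop: read everything except the named columns
no_name <- fread(csv_path, drop = "name")

# colClasses: override type detection for a column
id_as_text <- fread(csv_path, colClasses = c(id = "character"))

unlink(csv_path)
```

Preprocessing with command line tools goes through the `cmd` argument of `fread`; it is left out here because the available tools depend on the operating system.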
- Convert a data frame or tibble to a `data.table` (in place): `setDT(mydataframe)` or `setDT(mytibble)`
  - can also use (makes a copy):
    ```{r}
    DT <- as.data.table(DF)
    ```
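To make the difference between the two conversions concrete, here is a small sketch with a made-up data frame. `as.data.table()` leaves the original alone, while `setDT()` changes it by reference.

```r
library(data.table)

# Hypothetical data frame
DF <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

# as.data.table() returns a new data.table; DF itself is untouched
DT_copy <- as.data.table(DF)
still_plain_df <- !is.data.table(DF)

# setDT() converts DF itself, without copying the data
setDT(DF)
class(DF)
```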
- Read an Excel file into a `data.table`: `library(readxl)`, then `DT <- read_excel("datafile.xlsx") |> setDT()`
- Workaround for reading in Parquet files: fread feather · Issue #2026 · Rdatatable/data.table · GitHub
  - only works if a `data.table` is saved as Parquet
- Creating a `data.table` from vectors: `DT <- data.table(x = rnorm(100), y = rnorm(100))`
- Copying a `data.table`: `newDT <- copy(oldDT)` makes a deep copy
  - Beginner mistake 1: not distinguishing between shallow and deep copies
    - `newDT <- oldDT` only makes a shallow copy
    - with a shallow copy, any changes made to `newDT` by reference (e.g. with `:=` or the `set*` functions) will also be made to `oldDT`
    - this is because they are just different names for the same object in memory
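Beginner mistake 1 can be reproduced in a few lines. This is only a sketch; the object and column names are arbitrary.

```r
library(data.table)

oldDT <- data.table(x = 1:3)

# Plain assignment: aliasDT is just another name for the same object
aliasDT <- oldDT
aliasDT[, y := x * 10L]

# The new column has appeared in oldDT as well
alias_changed_original <- "y" %in% names(oldDT)

# copy() makes an independent object, so oldDT is not affected
deepDT <- copy(oldDT)
deepDT[, z := 0L]
copy_changed_original <- "z" %in% names(oldDT)
```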
- Syntax for a `data.table` object simplifies and extends data frame syntax
  - Beginner mistake 2: using dollar notation with `data.table`s
    - as with a data frame, you can select rows of data with an integer, logical or character vector, or an expression that evaluates to one of these
    - but unlike with a data frame, columns are assumed to be in the `data.table` unless otherwise indicated
    - so you should rarely need to use dollar notation: `DT[DT$age > 17, ]` will work but is unnecessary typing; `DT[age > 17]` is all that is needed, and slightly more readable IMHO (note that no comma is needed)
    - logical `NA`s are treated as false (usually safest)
  - you can put a lot more in `i` - see `?data.table`
    - joins with other `data.table`s - ? with data frames
    - lists, matrices - why?
  - `i`, `j`, `by` (the general form is `DT[i, j, by]`)
  - grouping: `by`, `keyby`
  - `list` and `.`
  - special symbols:
    - `.N`
    - `.SD` - NB: assignment
    - others which I have rarely used, e.g. `.I`, `.GRP`, `.BY`, `.NGRP`, `.EACHI`
      - `.I` (row number)
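The pieces of the syntax listed above could be put together along these lines. The data and column names are invented, and this is a sketch of common usage, not a complete tour.

```r
library(data.table)

# Hypothetical example data, including one missing age
DT <- data.table(
  region = c("north", "north", "south", "south", "south"),
  age    = c(16L, 30L, NA, 45L, 70L),
  income = c(0, 28000, 31000, 52000, 19000)
)

# i: filter rows; columns are referred to directly, no DT$ needed,
# and the row with a missing age is dropped
adults <- DT[age > 17]

# j: compute on columns; .() is shorthand for list()
overall <- DT[, .(mean_income = mean(income))]

# by: one result row per group; .N is the number of rows in the group
per_region <- DT[, .(n = .N, mean_income = mean(income)), by = region]

# keyby: like by, but the result is also sorted and keyed by the grouping column
per_region_sorted <- DT[, .(n = .N), keyby = region]

# .SD is the subset of data for each group; .SDcols limits its columns
means_by_region <- DT[, lapply(.SD, mean, na.rm = TRUE),
                      by = region, .SDcols = c("age", "income")]

# .I holds row numbers: here, the row with the highest income in each region
top_rows <- DT[, .I[which.max(income)], by = region]$V1
```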
- Keys, primary and secondary
- Renaming variables in place
- Ordering variables in place
- Use “fast” versions of functions, prefixed with `f`, e.g. `fifelse`
- Beginner mistake 3: trying to translate literally from tidyverse syntax
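The headings above are still stubs, so the following is only a guess at what they might cover: `setkey()` for a primary key, `setindex()` for a secondary index, `setnames()` for renaming, `setcolorder()` for reordering columns (assuming “ordering variables” means columns; `setorder()` would sort rows instead) and `fifelse()` as a fast `ifelse()`. The table is made up.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(
  id  = c(3L, 1L, 2L),
  val = c(-1.5, 0, 2.5),
  grp = c("b", "a", "a")
)

# Primary key: sorts the rows by id, by reference
setkey(DT, id)

# Secondary index: records an ordering by grp without moving any rows
setindex(DT, grp)

# Rename a column in place
setnames(DT, old = "val", new = "value")

# Reorder columns in place; columns not listed keep their relative order
setcolorder(DT, c("grp", "id"))

# fifelse() is data.table's faster, stricter version of ifelse()
DT[, sign_label := fifelse(value < 0, "negative", "non-negative")]
```

For beginner mistake 3, the dplyr-style `filter()` then `mutate()` sequence is usually better written as a single `DT[i, j, by]` call than as a line-by-line translation.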
- Differences between data frame and `data.table`:
  - `with = FALSE`
  - using the correct `NA` type and `1L`, and coercion
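A possible illustration of these two differences, with invented data. It shows selecting columns whose names are held in a variable, and using typed constants (`1L`, `NA_integer_`) so that every group returns the same column type.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(g = c("a", "a", "b"), x = c(1L, 5L, 2L), y = c(0.1, 0.2, 0.3))

# Column names stored in a character vector
cols <- c("g", "x")

# with = FALSE makes j behave like data frame column selection
sel_with <- DT[, cols, with = FALSE]

# The .. prefix is an alternative way of doing the same thing
sel_dots <- DT[, ..cols]

# Typed constants: 1L is an integer and NA_integer_ is an integer missing value,
# so the result column is integer for every group
first_big <- DT[, .(first_big = if (any(x > 3L)) x[x > 3L][1L] else NA_integer_),
                by = g]
```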
- Tricks:
  - `lapply`, `mapply`, `which.max`
  - `[]` to print it out
  - row-wise operations a bit kludgy ? use `collapse`
  - “piping” with chains
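These tricks are only listed by name, so the sketch below is one guess at how each is typically used. The data are invented.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(g = c("a", "a", "b", "b"), x = c(1, 4, 3, 2), y = c(10, 20, 30, 40))

# lapply over .SD: apply one function to every non-grouping column, by group
col_means <- DT[, lapply(.SD, mean), by = g]

# which.max: keep the row with the largest x in each group
top_x <- DT[, .SD[which.max(x)], by = g]

# mapply: a simple (if kludgy) row-wise operation across two columns
DT[, xy_max := mapply(max, x, y)]

# Chaining: each [...] works on the result of the previous one, like a pipe
chained <- DT[x > 1][order(-x)][, .(g, x)]

# := returns its result invisibly; a trailing [] prints the updated table
DT[, z := x + y][]
```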
- `data.table` in R packages:
  - `data.table` should be in Imports, not Depends (you shouldn’t use the Depends field for anything anyway)
    - if you put it in Suggests, then add `.datatable.aware = TRUE` to one of the R/* files
  - if you use the `usethis` package then this has a handy `use_data_table` function
  - you can also just import the bits you want if you are concerned about masking
    - e.g. you could just import `fread`
  - more detail here: Importing data.table
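As a hedged sketch, the two set-ups described above might look like this inside one of a package's `R/` files. It assumes roxygen2 is being used to generate `NAMESPACE`; you would keep only the scenario that applies to your package.

```r
# Scenario 1: data.table is listed under Imports in DESCRIPTION.
# This roxygen2 tag imports only fread(), which avoids masking other functions.
#' @importFrom data.table fread
NULL

# Scenario 2: data.table is listed under Suggests in DESCRIPTION.
# This flag tells data.table that the package's code uses data.table syntax.
.datatable.aware <- TRUE
```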
- joins, merges, rollup, dicing, pivot, reshape, interval join, inequality join
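This item is a placeholder, so the following only sketches a few of the operations it names (a join, a merge, an inequality join and a reshape) on invented tables. The other operations in the list are not shown.

```r
library(data.table)

# Hypothetical example tables
patients <- data.table(id = 1:3, name = c("Ann", "Bob", "Cy"))
visits   <- data.table(id = c(1L, 1L, 3L, 4L), cost = c(50, 70, 20, 10))

# Join: every row of visits, with the matching patient columns (NA if no match)
joined <- patients[visits, on = "id"]

# merge(): an inner join by default
inner <- merge(patients, visits, by = "id")

# Inequality (non-equi) join: assign each visit to a cost band
bands  <- data.table(band = c("low", "high"), lower = c(0, 60), upper = c(60, Inf))
banded <- visits[bands, on = .(cost >= lower, cost < upper), .(id, band), nomatch = NULL]

# Reshape: wide to long with melt(), and back to wide with dcast()
wide_in  <- data.table(id = 1:2, a = c(1, 2), b = c(3, 4))
long     <- melt(wide_in, id.vars = "id")
wide_out <- dcast(long, id ~ variable, value.var = "value")
```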
- look at pinboard for other ideas
- Documentation: Enhanced data.frame — data.table-package • data.table
- FAQ: datatable-faq.pdf
- Cheatsheets:
- Useful links:
  - Getting started · Rdatatable/data.table Wiki · GitHub
  - Articles · Rdatatable/data.table Wiki · GitHub
  - Convenience features of fread · Rdatatable/data.table Wiki · GitHub
  - Do’s and Don’ts · Rdatatable/data.table Wiki · GitHub
  - A data.table and dplyr tour · Home
  - A gentle introduction to data.table · Home