Benchmarking file imports

Recently, Hadley Wickham announced the release of readr package v0.1.0. Having been written using Rcpp, this package claims to read tabular data into R in a fast and friendly manner.

I decided to benchmark read_csv() from the readr package for a large csv file with base read.csv() function and fread() from the data.table package, which is written in C.

I use the flights data from nycflights13 package.

library(nycflights13)
write.csv(flights, "flights.csv")

The flights.csv file gets extracted as a 25.5MB CSV file on my Windows machine.

library(rbenchmark)
library(data.table); library(readr)

## Read using read.csv() function from base
read.base <- function(x){
  read.csv("flights.csv")
}

## Read using read_csv() function from readr
read.readr <- function(x){
  read_csv("flights.csv")
}

## Read using fread() function from data.table
read.DT <- function(x){
  read_csv("flights.csv")
}

benchmark(
  read.base(),
  read.readr(),
  read.DT(),
  replications = 10
  )
##           test replications elapsed relative user.self sys.self user.child
## 1  read.base()           10   31.79    3.315     31.39     0.33         NA
## 3    read.DT()           10    9.59    1.000      9.53     0.07         NA
## 2 read.readr()           10    9.61    1.002      9.50     0.10         NA
##   sys.child
## 1        NA
## 3        NA
## 2        NA

Both fread() and read_csv() provide us with significant improvement in timings.

Let's tweak the read.csv() function to read the all the columns as characters (which supposedly improves performance).

## Read using read.csv() function from base
read.base2 <- function(x){
  read.csv("flights.csv", colClasses = "character")
}

benchmark(
  read.base(),
  read.base2(),
  read.readr(),
  read.DT(),
  replications = 10
  )
##           test replications elapsed relative user.self sys.self user.child
## 1  read.base()           10   30.41    2.993     29.95     0.37         NA
## 2 read.base2()           10   26.14    2.573     25.46     0.42         NA
## 4    read.DT()           10   10.21    1.005     10.01     0.07         NA
## 3 read.readr()           10   10.16    1.000     10.08     0.04         NA
##   sys.child
## 1        NA
## 2        NA
## 4        NA
## 3        NA

Though the performance of read.csv() functions improves, it does not even come closer to that of function from readr or data.table packages.

Thanks, Hadley Wickham, Romain Francois for readr; Matt Dowle et.al. for data.table. Now I can read my data much more quickly and efficiently.

Written on April 19, 2015
comments powered by Disqus