dplyr Compared to data.table

A few of my recent posts included manipulation of data inside the data.table by using special characters in the j and by positions. Another very popular approach is to use the dplyr package, rather than data.table. dplyr is a grammar of data manipulation, which includes arrange, filter, mutate, select, and summarise, all of which can be used with other functions, such as the group_by function or even logical operators. For comparison, I provide the similar action with data.table.

arrange

Used to sort the order of rows
Note the use of as_tibble, which simpliifes the data frame and improves the appearance
Note the use of forward-pipe operator %>%

data("iris")
iris <- as_tibble(iris)

df <- iris %>% # uses the pipe operator
  arrange(desc(Sepal.Length)) 
pander(head(df)) # note dplyr resets the rownames/order

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
7.9	3.8	6.4	2	virginica
7.7	3.8	6.7	2.2	virginica
7.7	2.6	6.9	2.3	virginica
7.7	2.8	6.7	2	virginica
7.7	3	6.1	2.3	virginica
7.6	3	6.6	2.1	virginica

setorder is similar to arrange
Make sure to use the negative-sign to set descending

dt <- data.table(iris)
setorder(dt, -Sepal.Length)  # note the '-' for descending
pander(head(dt))  # note data.table preserves row name order

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
7.9	3.8	6.4	2	virginica
7.7	3.8	6.7	2.2	virginica
7.7	2.6	6.9	2.3	virginica
7.7	2.8	6.7	2	virginica
7.7	3	6.1	2.3	virginica
7.6	3	6.6	2.1	virginica

filter

Picks rows based on their values

df <- iris %>%
    filter(Sepal.Length == 7.7)
pander(head(df))  # note dplyr reset the row name order

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
7.7	3.8	6.7	2.2	virginica
7.7	2.6	6.9	2.3	virginica
7.7	2.8	6.7	2	virginica
7.7	3	6.1	2.3	virginica

filter is similar to setting a row filter in the i position of the data.table
Note the positions in the data.table: > dt[i,j,by]

dt_filt <- dt[Sepal.Length == 7.7, ]  # filter values in the i position
pander(dt_filt)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
7.7	3.8	6.7	2.2	virginica
7.7	2.6	6.9	2.3	virginica
7.7	2.8	6.7	2	virginica
7.7	3	6.1	2.3	virginica

mutate

Adds a new variable as a function of existing variables

df <- iris %>%
    mutate(Sepal.Length, Sepal.Length_new = Sepal.Length/2)
pander(head(df))  # note the new column is added to the end of the table

Table continues below
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

Sepal.Length_new
2.55
2.45
2.35
2.3
2.5
2.7

for data.table, use the impute special character ‘:=’ in the j position
‘:=’ creates a new variable in the data.table

dt_mut <- dt[, `:=`(Sepal.Length_new, Sepal.Length/2)]  # note the new column is added to the end of the data.table
pander(head(dt_mut))

Table continues below
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
7.9	3.8	6.4	2	virginica
7.7	3.8	6.7	2.2	virginica
7.7	2.6	6.9	2.3	virginica
7.7	2.8	6.7	2	virginica
7.7	3	6.1	2.3	virginica
7.6	3	6.6	2.1	virginica

Sepal.Length_new
3.95
3.85
3.85
3.85
3.85
3.8

select

Picks the columns you want in the final table.
Below, I’ve combine this function with mutate, which shows how various functions can be piped together

df <- iris %>%
  as_tibble() %>% mutate( # pipe several manipulationa together
    Sepal.Length_new = Sepal.Length/2, # adds column
    Sepal.Width_new = Sepal.Width/2, # adds column
    Petal.Width = NULL)  %>% # removes a column
  dplyr::select(Sepal.Length_new, Sepal.Width_new) # will potentially conflict with other packages

pander(head(df)) # note the new column is added to the end of the table

Sepal.Length_new	Sepal.Width_new
2.55	1.75
2.45	1.5
2.35	1.6
2.3	1.55
2.5	1.8
2.7	1.95

For the data.table, you string operations together with additional brackets, much like the pipe operators
Use the list feature ‘.(x, y, z)’ to extract the features of interest - note the period in front of the parentheses, which results in lists

dt <- data.table(iris)

dt_mut <- dt[, `:=`(Sepal.Length_new, Sepal.Length/2)][, `:=`(Sepal.Width_new, Sepal.Width/2)][,
    `:=`(Sepal.Width, NULL)][, .(Sepal.Length_new, Sepal.Width_new)]  # note the new column is added to the end of the data.table, the final operations selects only the columns of interest
pander(head(dt_mut))

Sepal.Length_new	Sepal.Width_new
2.55	1.75
2.45	1.5
2.35	1.6
2.3	1.55
2.5	1.8
2.7	1.95

summarise

Reduces multiple values to a single summary

df <- iris %>%
    dplyr::summarise(mean = mean(Sepal.Length), n = n())

pander(df)

mean	n
5.843	150

dt_sum <- dt[, .(mean = mean(Sepal.Length), n = length(Sepal.Length))]

pander(dt_sum)

mean	n
5.843	150

Both dplyr and data.table have tons more features to explore. Checkout dplyr here. And, checkout data.table here. A great post that considers differences in Speed, Memory Usage, Syntax and Features can be found here. Until next time…