A few of my recent posts included manipulation of data inside the data.table by using special characters in the j and by positions. Another very popular approach is to use the dplyr package, rather than data.table. dplyr is a grammar of data manipulation, which includes arrange, filter, mutate, select, and summarise, all of which can be used with other functions, such as the group_by function or even logical operators. For comparison, I provide the similar action with data.table.
{{ instagram BgaKgO-HKF- }}
arrange
- Used to sort the order of rows
- Note the use of as_tibble, which simpliifes the data frame and improves the appearance
- Note the use of forward-pipe operator %>%
data("iris")
iris <- as_tibble(iris)
df <- iris %>% # uses the pipe operator
arrange(desc(Sepal.Length))
pander(head(df)) # note dplyr resets the rownames/order
7.9 |
3.8 |
6.4 |
2 |
virginica |
7.7 |
3.8 |
6.7 |
2.2 |
virginica |
7.7 |
2.6 |
6.9 |
2.3 |
virginica |
7.7 |
2.8 |
6.7 |
2 |
virginica |
7.7 |
3 |
6.1 |
2.3 |
virginica |
7.6 |
3 |
6.6 |
2.1 |
virginica |
- setorder is similar to arrange
- Make sure to use the negative-sign to set descending
dt <- data.table(iris)
setorder(dt, -Sepal.Length) # note the '-' for descending
pander(head(dt)) # note data.table preserves row name order
7.9 |
3.8 |
6.4 |
2 |
virginica |
7.7 |
3.8 |
6.7 |
2.2 |
virginica |
7.7 |
2.6 |
6.9 |
2.3 |
virginica |
7.7 |
2.8 |
6.7 |
2 |
virginica |
7.7 |
3 |
6.1 |
2.3 |
virginica |
7.6 |
3 |
6.6 |
2.1 |
virginica |
filter
- Picks rows based on their values
df <- iris %>%
filter(Sepal.Length == 7.7)
pander(head(df)) # note dplyr reset the row name order
7.7 |
3.8 |
6.7 |
2.2 |
virginica |
7.7 |
2.6 |
6.9 |
2.3 |
virginica |
7.7 |
2.8 |
6.7 |
2 |
virginica |
7.7 |
3 |
6.1 |
2.3 |
virginica |
- filter is similar to setting a row filter in the i position of the data.table
- Note the positions in the data.table:
> dt[i,j,by]
dt_filt <- dt[Sepal.Length == 7.7, ] # filter values in the i position
pander(dt_filt)
7.7 |
3.8 |
6.7 |
2.2 |
virginica |
7.7 |
2.6 |
6.9 |
2.3 |
virginica |
7.7 |
2.8 |
6.7 |
2 |
virginica |
7.7 |
3 |
6.1 |
2.3 |
virginica |
mutate
- Adds a new variable as a function of existing variables
df <- iris %>%
mutate(Sepal.Length, Sepal.Length_new = Sepal.Length/2)
pander(head(df)) # note the new column is added to the end of the table
Table continues below
5.1 |
3.5 |
1.4 |
0.2 |
setosa |
4.9 |
3 |
1.4 |
0.2 |
setosa |
4.7 |
3.2 |
1.3 |
0.2 |
setosa |
4.6 |
3.1 |
1.5 |
0.2 |
setosa |
5 |
3.6 |
1.4 |
0.2 |
setosa |
5.4 |
3.9 |
1.7 |
0.4 |
setosa |
2.55 |
2.45 |
2.35 |
2.3 |
2.5 |
2.7 |
- for data.table, use the impute special character ‘:=’ in the j position
- ‘:=’ creates a new variable in the data.table
dt_mut <- dt[, `:=`(Sepal.Length_new, Sepal.Length/2)] # note the new column is added to the end of the data.table
pander(head(dt_mut))
Table continues below
7.9 |
3.8 |
6.4 |
2 |
virginica |
7.7 |
3.8 |
6.7 |
2.2 |
virginica |
7.7 |
2.6 |
6.9 |
2.3 |
virginica |
7.7 |
2.8 |
6.7 |
2 |
virginica |
7.7 |
3 |
6.1 |
2.3 |
virginica |
7.6 |
3 |
6.6 |
2.1 |
virginica |
3.95 |
3.85 |
3.85 |
3.85 |
3.85 |
3.8 |
select
- Picks the columns you want in the final table.
- Below, I’ve combine this function with mutate, which shows how various functions can be piped together
df <- iris %>%
as_tibble() %>% mutate( # pipe several manipulationa together
Sepal.Length_new = Sepal.Length/2, # adds column
Sepal.Width_new = Sepal.Width/2, # adds column
Petal.Width = NULL) %>% # removes a column
dplyr::select(Sepal.Length_new, Sepal.Width_new) # will potentially conflict with other packages
pander(head(df)) # note the new column is added to the end of the table
2.55 |
1.75 |
2.45 |
1.5 |
2.35 |
1.6 |
2.3 |
1.55 |
2.5 |
1.8 |
2.7 |
1.95 |
- For the data.table, you string operations together with additional brackets, much like the pipe operators
- Use the list feature ‘.(x, y, z)’ to extract the features of interest - note the period in front of the parentheses, which results in lists
dt <- data.table(iris)
dt_mut <- dt[, `:=`(Sepal.Length_new, Sepal.Length/2)][, `:=`(Sepal.Width_new, Sepal.Width/2)][,
`:=`(Sepal.Width, NULL)][, .(Sepal.Length_new, Sepal.Width_new)] # note the new column is added to the end of the data.table, the final operations selects only the columns of interest
pander(head(dt_mut))
2.55 |
1.75 |
2.45 |
1.5 |
2.35 |
1.6 |
2.3 |
1.55 |
2.5 |
1.8 |
2.7 |
1.95 |
summarise
- Reduces multiple values to a single summary
df <- iris %>%
dplyr::summarise(mean = mean(Sepal.Length), n = n())
pander(df)
dt_sum <- dt[, .(mean = mean(Sepal.Length), n = length(Sepal.Length))]
pander(dt_sum)
Both dplyr and data.table have tons more features to explore. Checkout dplyr here. And, checkout data.table here. A great post that considers differences in Speed, Memory Usage, Syntax and Features can be found here. Until next time…