rowwise manipulation of data frames in R


The data frame is the fundamental structure for storing tabular data in R, especially in the tidyverse. Tidy data, in which each variable is a column and each observation is a row, is used wherever possible throughout the tidyverse packages. Working on columns is a natural and common operation, and it is the key idea behind dplyr, a core package of the tidyverse. But how do we perform rowwise manipulation? In this post, I will show how to do rowwise operations on data frames using both base R and the tidyverse.

sample data

library(tidyverse)
library(rvest)

url <- "https://www.pesmaster.com/arsenal/pes-2020/team/101/"

html <- read_html(url)

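players <- html_nodes(html, ".squad-table") %>% 
  html_table() %>% 
  `[[`(1) %>% # first matching table
  arrange(desc(Ovr)) %>% 
  slice(1:10) %>% # top ten players by overall rating
  select(Name, Pas:Dri) %>% 
  column_to_rownames("Name") %>% 
  t() %>% # abilities become rows, players become columns
  as.data.frame()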
names(players) <- stringi::stri_trans_general(names(players), "latin-ascii") %>% # convert to latin-ascii
  make.names() %>% # valid names
  str_replace_all(".*\\.+", "") # family name

We want to find the most skillful player for each ability (Pas, Sht, Phy, Def, Spd, and Dri) among ten Arsenal players in PES 2020.

base R

apply()

When rowwise operations are mentioned, the first function that comes to mind is apply().

index <- apply(players, 1, which.max)
max_ability <- apply(players, 1, max)
skillful_player <- data.frame(
  name = names(players)[index],
  value = max_ability,
  ability = row.names(players)
)

We can use ggplot2 to visualize the results.

player_label <- data.frame(
  label = skillful_player$name,
  x = skillful_player$ability,
  y = 100
)
ggplot(skillful_player, aes(x = ability, y = value)) +
  geom_col(aes(fill = ability)) +
  geom_text(aes(x = x, y = y, label = label), data = player_label)

The result shows that the best dribbler is the new player Pepe, and Ozil is still the best passer, although he has not played in any official match for a long time.
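
To try the apply() approach without scraping the site, here is a minimal sketch on a hypothetical toy data frame. The names toy, player_a, player_b, player_c and all values are made up for illustration and are not PES ratings.

```r
# Hypothetical toy ratings (made-up values, not scraped data):
# abilities in rows, players in columns, like `players` above
toy <- data.frame(
  player_a = c(80, 70, 65),
  player_b = c(75, 85, 60),
  player_c = c(78, 72, 90),
  row.names = c("Pas", "Sht", "Dri")
)

# MARGIN = 1 means "apply the function to each row";
# apply() coerces the data frame to a matrix first
best_index <- apply(toy, 1, which.max) # column position of each row maximum
best_value <- apply(toy, 1, max)       # the row maximum itself

toy_best <- data.frame(
  ability = row.names(toy),
  name = names(toy)[best_index],
  value = best_value
)
print(toy_best)
```

Because apply() converts its input to a matrix, this works cleanly only when all columns share one type, as they do here.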

for loop

Of course, someone has to write loops.

# preallocate the results, then fill them row by row
index_loop <- integer(nrow(players))
max_ability_loop <- numeric(nrow(players))
for (i in seq_len(nrow(players))) {
  row_values <- unlist(players[i, ]) # one row as a named numeric vector
  index_loop[i] <- which.max(row_values)
  max_ability_loop[i] <- max(row_values)
}

Apparently, a for loop is more intuitive, but it requires more typing.

split, then apply and combine

The next method is to split the data frame by row, apply a function to each piece, and combine the results.

players_split <- split(players, seq_len(nrow(players)))
max_ability <- sapply(players_split, max)
index <- sapply(players_split, which.max)
skillful_player_split <- data.frame(
  value = max_ability,
  name = names(players)[index],
  ability = row.names(players)
)

ggplot(skillful_player_split, aes(x = ability, y = value)) +
  geom_col(aes(fill = ability)) +
  geom_text(aes(x = x, y = y, label = label), data = player_label)
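
The same split, apply, and combine steps can be sketched on hypothetical toy data (all names and values below are made up). This variant uses vapply() instead of sapply(), which is a substitution on my part: vapply() checks that each result has the declared type and length.

```r
# Hypothetical toy ratings (made-up values, not scraped data)
toy <- data.frame(
  player_a = c(80, 70, 65),
  player_b = c(75, 85, 60),
  player_c = c(78, 72, 90),
  row.names = c("Pas", "Sht", "Dri")
)

# split: a list of one-row data frames, one per ability
rows <- split(toy, seq_len(nrow(toy)))

# apply: unlist() turns each one-row data frame into a named numeric vector
best_value <- vapply(rows, function(r) max(unlist(r)), numeric(1))
best_index <- vapply(rows, function(r) which.max(unlist(r)), integer(1))

# combine: collect the per-row results into one data frame
toy_best <- data.frame(
  ability = row.names(toy),
  name = names(toy)[best_index],
  value = best_value,
  row.names = NULL
)
print(toy_best)
```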

tidyverse

pmap() in purrr

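purrr::pmap() iterates over multiple arguments simultaneously. Since a data frame is a list of columns, passing it to pmap() calls the function once per row, with that row's values as arguments.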

which_max <- function(...) {
  which.max(c(...))
}
skillful_player_pmap <- players %>% 
  mutate(index = pmap_int(., which_max),
    value = pmap_int(., max),
    names = names(players)[index],
    ability = row.names(players)
  ) 


ggplot(skillful_player_pmap, aes(x = ability, y = value)) +
  geom_col(aes(fill = ability)) +
  geom_text(aes(x = x, y = y, label = label), data = player_label)
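
To see what pmap() passes to the function on each iteration, here is a minimal sketch on hypothetical toy data (names and values are made up). It uses pmap_dbl() for the values, an assumption that keeps the example working whether the ratings are stored as integers or doubles.

```r
library(purrr)

# Hypothetical toy ratings (made-up values, not scraped data)
toy <- data.frame(
  player_a = c(80, 70, 65),
  player_b = c(75, 85, 60),
  player_c = c(78, 72, 90),
  row.names = c("Pas", "Sht", "Dri")
)

# pmap() treats the data frame as a list of columns and calls the function
# once per row; the first call is max(player_a = 80, player_b = 75, player_c = 78)
best_value <- pmap_dbl(toy, max)

# which.max() takes a single vector, so gather the row's values with c(...)
best_index <- pmap_int(toy, function(...) which.max(c(...)))
best_name <- names(toy)[best_index]

print(data.frame(ability = row.names(toy), name = best_name, value = best_value))
```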

rowwise() in dplyr

dplyr provides the rowwise() function to perform row-wise operations. However, as mentioned in this issue, we cannot use the tidyselect operator : directly in the expression, which means that all variables must be listed explicitly. rowwise() is therefore not suitable when there are many variables.

skillful_player_rowwise <- players %>% 
  rowwise() %>% 
  # mutate(value = max(Aubameyang:Kolasinac)) does not work well
  mutate(value = max(Aubameyang, Lacazette, Leno, Sokratis, 
    Luiz, Ozil, Torreira, Xhaka, Pepe, Kolasinac)
  )
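
In dplyr 1.0.0 and later, c_across() accepts tidyselect syntax inside a rowwise() pipeline, so the columns do not have to be listed one by one. The sketch below assumes such a dplyr version and uses hypothetical toy data (names and values are made up).

```r
library(dplyr)

# Hypothetical toy ratings (made-up values, not scraped data)
toy <- data.frame(
  player_a = c(80, 70, 65),
  player_b = c(75, 85, 60),
  player_c = c(78, 72, 90),
  row.names = c("Pas", "Sht", "Dri")
)

toy_best <- toy %>%
  tibble::rownames_to_column(var = "ability") %>% # keep abilities as a column
  rowwise() %>%
  mutate(
    # c_across() collects the selected columns of the current row into a vector
    value = max(c_across(player_a:player_c)),
    name = names(toy)[which.max(c_across(player_a:player_c))]
  ) %>%
  ungroup()

print(toy_best)
```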

Furthermore, we can transpose the data frame first and then use apply(), dplyr, or purrr::map() on what are now columns. Intuitively, this method is more complicated than the methods mentioned above, so the code is not detailed here.
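
For reference, the idea can still be sketched in a few lines of base R on hypothetical toy data (names and values are made up); this is one possible reading of the transpose approach, not a worked-out solution.

```r
# Hypothetical toy ratings (made-up values, not scraped data)
toy <- data.frame(
  player_a = c(80, 70, 65),
  player_b = c(75, 85, 60),
  player_c = c(78, 72, 90),
  row.names = c("Pas", "Sht", "Dri")
)

# After transposing, players are rows and abilities are columns,
# so the rowwise problem becomes an ordinary column-wise one
toy_t <- as.data.frame(t(toy))

best_value <- vapply(toy_t, max, numeric(1))
best_index <- vapply(toy_t, which.max, integer(1))
best_name <- row.names(toy_t)[best_index]

print(data.frame(ability = names(toy_t), name = best_name, value = best_value))
```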

In summary, purrr::pmap() belongs to the tidyverse and is easier to use as part of a pipe. apply() and the for loop are more intuitive and efficient (as shown here), but require more typing. dplyr::rowwise() is not suitable when there are many variables.


Choyang

Bioinformatics, R enthusiast. Thoughts on research, personal experience and other distractions.
