Feature request: Order factor by (multiple functions of) multiple variables #16

Open
huftis opened this issue Aug 14, 2016 · 16 comments · May be fixed by #220
Labels
feature a feature request or enhancement

Comments

@huftis

huftis commented Aug 14, 2016

For plotting and tables, it’s useful to reorder levels of a factor according to other variables, first by one variable, and then by other variables to break any ties. The summary function used for each variable may be different. Example:

d = data.frame(
  name = factor(c("A", "A", "B", "B", "C", "C", "C", "D"),
                levels = LETTERS[4:1]),
  quality = c(5, 3, 4, 4, 7, 7, 7, 7),
  year = c(2000, 2001, 2013, 2014, 2015, 2015, 2015, 2015),
  weight = c(50, 45, 60, 57, 47, 50, 500, 63)
)

I want to order the name factor by 1) average quality, then 2) the first year the product appeared, and then 3) its median weight. If any ties remain, keep the original label order for these ties. In this example, the levels would be ordered ABCD.

To do this reordering, I have to think backwards, reordering by the last tie-breaker first, using either fct_reorder() or reorder():

d$name = fct_reorder(d$name, d$weight, median)
d$name = fct_reorder(d$name, d$year, min)
d$name = fct_reorder(d$name, d$quality, mean)

It would be very convenient to be able to do this in one go, using something like this (I’m dropping the d$ prefix to make the code clearer):

name = fct_reordern(name,
                    vars = list(quality, year, weight),
                    funs = list(mean, min, median))
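
For illustration only, the proposed interface can be approximated in base R. The helper name `reorder_levels_n` is hypothetical and is not part of forcats; this is a minimal sketch that assumes each function returns a single number per level.

```r
# Hypothetical helper (not part of forcats): reorder the levels of `f` by
# several per-level summaries, with earlier summaries taking precedence.
reorder_levels_n <- function(f, vars, funs) {
  f <- as.factor(f)
  stopifnot(length(vars) == length(funs))
  # One summary value per level for each (variable, function) pair
  keys <- Map(function(x, fun) as.vector(tapply(x, f, fun)), vars, funs)
  # order() uses later keys to break ties; remaining ties keep the
  # current level order
  new_order <- do.call(order, unname(keys))
  factor(f, levels = levels(f)[new_order])
}

d <- data.frame(
  name = factor(c("A", "A", "B", "B", "C", "C", "C", "D"),
                levels = LETTERS[4:1]),
  quality = c(5, 3, 4, 4, 7, 7, 7, 7),
  year = c(2000, 2001, 2013, 2014, 2015, 2015, 2015, 2015),
  weight = c(50, 45, 60, 57, 47, 50, 500, 63)
)

d$name <- reorder_levels_n(d$name,
                           vars = list(d$quality, d$year, d$weight),
                           funs = list(mean, min, median))
levels(d$name)
```

With the example data this gives the levels in the order A, B, C, D, the same as the three backwards fct_reorder() calls.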

It would be even better if the functions were shown along with the variable names, e.g. something like this (if possible, or perhaps using some form of formula syntax?):

name = fct_reordern(name,
                    mean(quality), min(year), median(weight))

For descending order, perhaps by using a desc() function, like this?

name = fct_reordern(name,
                    desc(mean(quality)), min(year), median(weight))
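
As a rough base R sketch of the descending idea (not the proposed forcats syntax): for numeric summaries, negating one key reverses the direction of that key only. The variable names below are illustrative.

```r
d <- data.frame(
  name = factor(c("A", "A", "B", "B", "C", "C", "C", "D"),
                levels = LETTERS[4:1]),
  quality = c(5, 3, 4, 4, 7, 7, 7, 7),
  year = c(2000, 2001, 2013, 2014, 2015, 2015, 2015, 2015),
  weight = c(50, 45, 60, 57, 47, 50, 500, 63)
)

# Per-level summaries, in the current level order
mean_quality  <- tapply(d$quality, d$name, mean)
min_year      <- tapply(d$year, d$name, min)
median_weight <- tapply(d$weight, d$name, median)

# Descending by mean quality, then ascending by first year and median weight
new_order <- order(-mean_quality, min_year, median_weight)
d$name <- factor(d$name, levels = levels(d$name)[new_order])
levels(d$name)
```

Here the two highest-quality products (C and D) come first, with C ahead of D because of its lower median weight.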
@hadley
Member

hadley commented Aug 15, 2016

Changing the fun for each variable seems pretty confusing to me. Do you have a compelling use case?

@huftis
Author

huftis commented Aug 15, 2016

Not off the top of my head, but ordering factors first by the mean of one variable and then by the min of another variable (usually a date variable), or the other way around, is something I’ve done several times. Doing stuff like this is more common when you have lots of ties on the first variable, so that you’re not really using a second variable just for breaking ties, but because you’re really interested in the ordering for this second variable.

I’ve also used different summary functions for the same variable, e.g. first ordering by the median of one variable and then by the standard deviation of the same variable.

And finally, sometimes I have data in long format where a variable is repeated but unique within the factor I’m reordering. Then I sometimes abuse the mean function by reordering first on the mean (which is equivalent to reordering on x[1], since the values are unique within the factor) and then by (perhaps a different summary function on) a different variable. For example, I had long data on patients receiving blood transfusions, with longitudinal follow-up (one or more measurements/rows). One variable was the number of bags of blood received by each patient (unique within each patient), and I wanted to graph the longitudinal data, where the patients (as separate panels) would be ordered by the number of bags of blood (where many patients received the same number of bags) and then by the maximum value obtained on the measurement variable (some sort of measure of the effect of the blood transfusion). Then ordering first by the mean and then by max was useful.
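
A small base R sketch of this pattern, using made-up data (the data frame `tx` and its columns are hypothetical and only mirror the description above):

```r
# Hypothetical long-format data: one row per measurement; `bags` is constant
# within each patient
tx <- data.frame(
  patient = factor(c("p1", "p1", "p2", "p2", "p3", "p3", "p4")),
  bags    = c(2, 2, 1, 1, 2, 2, 1),
  effect  = c(10, 14, 9, 12, 16, 11, 8)
)

# mean() of a value repeated within a patient is just that value
bags_per_patient <- tapply(tx$bags, tx$patient, mean)
max_effect       <- tapply(tx$effect, tx$patient, max)

# Order patients by number of bags, then by their maximum measured effect
new_order <- order(bags_per_patient, max_effect)
tx$patient <- factor(tx$patient, levels = levels(tx$patient)[new_order])
levels(tx$patient)
```

The rows of `tx` stay where they are; only the level order of `patient` changes.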

Sometimes reordering the actual data set using other functions can be done as a workaround (followed by fct_inorder()). But frequently I don’t want to change the order of the rows, because that messes up the data. And sometimes I’m just experimenting with different first- and second-degree orderings, e.g. inside a facet_wrap(), and having to do major data frame reordering just for this is a chore.

@hadley
Member

hadley commented Aug 15, 2016

Thanks - that's useful.

That somehow feels a bit big for forcats, and seems like somehow it might be an interaction with dplyr. Let me think about it for a bit.

@acnb

acnb commented Apr 27, 2017

Let me just support @huftis, I think he described a common use case very well. Here is another example on stackoverflow.

It seems to me that the order of factor levels is most relevant for creating nicer plots. In fact, I'm not sure whether the order of factor levels has any meaning if the factor itself is not ordered. I understand dplyr as a tool to permanently manipulate my data. Therefore a fct_reordern would be better placed inside forcats, to just temporarily reorder levels (e.g. ggplot(aes(x=fct_reordern(a,...)))).

@shabbybanks

I wandered here looking for a 'lexicographic reordering' of factors. In my use case, there is a hierarchy to my factors, say f is a coarse classification, and g is a fine classification. I will make a plot (a bar plot actually) with colors (and x axis) determined by g, but facets determined by f. I want the colors to be essentially in order across the facets and within each facet. I need to reorder f according to a numeric order of corresponding g (there are many ties), and then another variable. First pass at code, which is not terribly general, looks like:

fct_lexi_reord <- function(f, ..., .desc=FALSE) {
  # cannot seem to include this in the function list
  fun <- median
  numbys <- list(...)
  f <- forcats:::check_factor(f)
  stopifnot(rep(length(f),length(numbys))==unlist(lapply(numbys,length)))
  allsumma <- lapply(numbys,function(anx) {
    summary <- tapply(anx, f, fun)
    if (!is.numeric(summary)) {
        stop("`fun` must return a single number per group", call. = FALSE)
    }
    summary
  })
  neworder <- do.call(order,args=c(allsumma,list(decreasing=.desc)))
  lvls_reorder(f, neworder)
}

Again, this is probably not general enough for inclusion in forcats, but it gives the general idea of what I am looking for.
(To be absolutely clear, this code solves my problem, but might not be what the OP needs, and is not "ready for prime time" yet.)

@shabbybanks

Some context, here is a MWE for something like what I am doing, using the above fct_lexi_reord. Basically I want facets by wine color, axis by varietal, color in order across the x axis.

library(tibble)
library(dplyr)
library(ggplot2)
library(forcats)
set.seed(123)
wines <- tibble::tribble(~color,    ~varietal,
                         'white',    'riesling',
                         'white',    'chardonnay',
                         'white',    'sauv blanc',
                         'rose',     '2 buck chuck',
                         'red',      'barbera',
                         'red',      'grenache',
                         'red',      'zinfandel',
                         'red',      'merlot',
                         'red',      'pinot noir',
                         'red',      'syrah',
                         'red',      'cab sauv') %>%
  mutate(points=runif(length(color),min=0,max=100))

ph <- wines %>%
  mutate(color_ord=forcats::fct_reorder(color,points)) %>%
  mutate(varietal_ord=fct_lexi_reord(varietal,as.numeric(color_ord),points)) %>%
  ggplot(aes(varietal_ord,points,fill=varietal_ord)) +
  geom_bar(stat='identity') +
  facet_grid(.~color_ord,space='free',scale='free') +
  labs(x='varietal',y='points')
print(ph)

(BTW, awful things happen when you try to flip this to the y axis via coord_flip, as the direction of facet_grid and the erstwhile x axis become internally inconsistent. That is, the following is just wrong:

ph <- wines %>%
  mutate(color_ord=forcats::fct_reorder(color,points)) %>%
  mutate(varietal_ord=fct_lexi_reord(varietal,as.numeric(color_ord),points)) %>%
  ggplot(aes(varietal_ord,points,fill=varietal_ord)) +
  geom_bar(stat='identity') +
  coord_flip() +
  facet_grid(color_ord~.,space='free',scale='free') +
  labs(x='varietal',y='points')
print(ph)

But I suppose that is an issue for ggplot2 instead.)

@mwiesweg

mwiesweg commented Aug 8, 2017

I had the similar problem of ordering a factor by multiple other variables. In this use case, the other variables served as a sort of hierarchical index, where I needed to sort by level 1 first, then by level 2 within level 1 groups.
Intuitively I could solve this by order() which takes multiple arguments as I want, and
df[order(level1, level2),]
works just fine. But fct_reorder did not; in fact, it does not want an order, but an ordered index (such that df[order(ordered_index),] == df[order(level1, level2),]).
Unintuitively, ordered_index <- order(order(...)), so

level1 <- c(7,6,9,8,10,6,6,7)
level2 <- c(1,2,1,2,1,1,3,2)
my_unordered_factor <- paste(level1, level2, sep="_")
my_factor_properly_ordered <- fct_reorder(my_unordered_factor, order(order(level1, level2)))
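
To see why order(order(...)) is the right input, here is a base R check (a sketch; the variable names `sort_index` and `rank_index` are illustrative). order() returns the row indices in sorted order, while order(order()) returns, for each row, its position in that sorted order, which is the kind of per-element score fct_reorder() expects.

```r
level1 <- c(7, 6, 9, 8, 10, 6, 6, 7)
level2 <- c(1, 2, 1, 2, 1, 1, 3, 2)

# Row indices that would sort the data by level1, then level2
sort_index <- order(level1, level2)

# Position of each row in that sorted order (a rank, since there are no ties)
rank_index <- order(sort_index)

# Sorting by the rank reproduces the two-key sort
identical(order(rank_index), sort_index)
```

Because every (level1, level2) pair is unique in this example, the rank is tie-free; with repeated pairs, rows sharing a pair would get different ranks and the summary function used by fct_reorder() would decide the level's position.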

@hadley hadley added the feature a feature request or enhancement label Feb 10, 2018
@billdenney
Contributor

I ran into the same attempt at using order() to solve the fct_reordern() problem today.

I tend to agree with @hadley that multiple functions seem complex, but I wonder if something like the following could work (borrowing from purrr::pmap()):

Rather than:

name = fct_reordern(name,
                    vars = list(quality, year, weight),
                    funs = list(mean, min, median))

How about:

myorder <- function(quality, year, weight) {
  c(mean(quality), min(year), median(weight))
}
name = fct_reordern(.f = name,
                    .l = list(quality, year, weight),
                    .fun = myorder)

Then, fct_reordern() would do the sort based on the vector (instead of scalar) output using the first then second then third then ... value to sort. (If myorder() could sufficiently describe a scalar summary, that would just be a degenerate case of the vector option.)
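
One possible reading of this idea, sketched in base R. The function name `reorder_by_summary_vector` is hypothetical, and the sketch assumes `.fun` returns a numeric vector of the same length for every level and that no level is empty.

```r
# Hypothetical sketch (not a forcats function): `.fun` is called once per
# level with that level's slice of each variable and returns a vector of
# sort keys; levels are then ordered by key 1, then key 2, and so on.
reorder_by_summary_vector <- function(.f, .l, .fun) {
  .f <- as.factor(.f)
  # pieces[[i]] holds variable i split into one element per level
  pieces <- lapply(.l, split, f = .f)
  # Call .fun(level slice of var 1, level slice of var 2, ...) for each level
  summaries <- do.call(Map, c(list(f = .fun), unname(pieces)))
  # One row per level, one column per key
  keys <- do.call(rbind, summaries)
  key_cols <- lapply(seq_len(ncol(keys)), function(j) keys[, j])
  new_order <- do.call(order, key_cols)
  factor(.f, levels = levels(.f)[new_order])
}

d <- data.frame(
  name = factor(c("A", "A", "B", "B", "C", "C", "C", "D"),
                levels = LETTERS[4:1]),
  quality = c(5, 3, 4, 4, 7, 7, 7, 7),
  year = c(2000, 2001, 2013, 2014, 2015, 2015, 2015, 2015),
  weight = c(50, 45, 60, 57, 47, 50, 500, 63)
)

myorder <- function(quality, year, weight) {
  c(mean(quality), min(year), median(weight))
}

res <- reorder_by_summary_vector(d$name,
                                 list(d$quality, d$year, d$weight),
                                 myorder)
levels(res)
```

A scalar-valued `.fun` simply produces a single key column, which matches the "degenerate case" mentioned above.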

@Cameron-Fairfield

Will have a go at this for the tidyverse developer day

@esperluette

I think I ran into a similar problem as I wanted to make a plot where my 'name' variable is ordered first for min of 'value' and then in descending order for 'delta'. Directly nesting fct_reorder in aes worked for me:

ggplot(data) + geom_path(aes(x = value, y = fct_reorder(fct_reorder(name, delta, .desc=TRUE), value, min)))

@billdenney
Contributor

As I need this all the time, and it's the oldest open issue in forcats, how about this for an implementation, @hadley? It drops the argument for .fun, but it allows for a common use case where each factor level has a single value of the .data items. It also supports vectors of .desc and has no dependencies other than base R and forcats. Based on the help page for order, it may fail above 2^31 elements in .f.

library(forcats)
fct_reordern <- function(.f, .data, .desc=FALSE, ordered=FALSE) {
  stopifnot(nrow(.data) == length(.f))
  .f_name <- paste0(max(names(.data)), "X")
  fct_inorder(
    f=
      .f[
        do.call(
          base::order,
          append(
            unname(.data),
            list(method="radix", decreasing=.desc)
          )
        )
        ],
    ordered=ordered
  )
}

mydata <-
  data.frame(
    A=c(3, 3, 2, 1),
    B=c("A", "B", "C", "D"),
    stringsAsFactors=FALSE
  )
fct_reordern(.f=c("A", "B", "C", "D"), .data=mydata)
#> [1] D C A B
#> Levels: D C A B
fct_reordern(.f=c("A", "B", "C", "D"), .data=mydata, .desc=TRUE)
#> [1] B A C D
#> Levels: B A C D
fct_reordern(.f=c("A", "B", "C", "D"), .data=mydata, .desc=c(FALSE, TRUE))
#> [1] D C B A
#> Levels: D C B A

Created on 2019-11-17 by the reprex package (v0.3.0)

@hadley
Member

hadley commented Nov 18, 2019

@billdenney unfortunately I don't follow your explanation; I don't see how this solves the original problem. I also find the implementation rather hard to understand because of the giant nested call inside of fct_inorder().

@billdenney
Contributor

@hadley, it solves the issue that the factor is ordered by an arbitrary number of variables (given in .data). It does not solve the "multiple functions of" part of the issue; it relies on the order being defined in the raw data. As you noted, making it work for multiple functions of the variables turns it into a dplyr-dependent solution, and in my mind, if a more complex factor ordering is required, then it is not too large of a step to require the user to make that simpler data set for inclusion (such as a group_by(.f) followed by a summarize() with the summarization functions of interest). And, in case of multiple values being present at one factor level, it will use the lowest because that is the way that order() will work (so it is like .fun=min in fct_reorder()).
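
The "like .fun=min" remark can be checked with a small base R sketch (toy data; `unique()` on the sorted values stands in for what fct_inorder() does with level order):

```r
f <- c("a", "b", "a", "b")
x <- c(5, 2, 4, 3)

# Sort the factor values by x, then take levels in order of first appearance
f_sorted <- f[order(x)]
by_first_appearance <- unique(f_sorted)

# Order the levels by the minimum of x within each level
by_min <- names(sort(sapply(split(x, f), min)))

identical(by_first_appearance, by_min)
```

In an ascending sort each level first appears at its smallest value, so the two orderings agree.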

Here is an un-nested version of the function that is more in line with forcats programming methods which will hopefully clarify things:

#' @param .f A factor (or character vector)
#' @param .data A tbl with the same number of rows as the length of \code{.f}
#' @param .desc Order in descending order?  It may either be a scalar or a
#'   vector with the same length as the number of columns of \code{.data}.
#' @inheritParams fct_inorder
fct_reordern <- function(.f, .data, .desc=FALSE, ordered=NA) {
  stopifnot(nrow(.data) == length(.f))
  stopifnot(length(.desc) %in% c(1, ncol(.data)))
  stopifnot(all(.desc %in% c(TRUE, FALSE)))
  f <- forcats:::check_factor(.f)
  # .data is unnamed so that its names do not clash with named arguments to
  # order().  The radix method is used to support a vector of .desc (other
  # methods only support scalar values for .desc).
  order_args <-
    append(
      unname(.data),
      list(method="radix", decreasing=.desc)
    )
  new_order <- do.call(base::order, order_args)
  f_sorted <- f[new_order]
  fct_inorder(f=f_sorted, ordered=ordered)
}
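
As a quick check of the comment about the radix method, base::order() accepts one `decreasing` value per sort key when `method = "radix"` is used. A minimal sketch with the same toy values as the reprex:

```r
x <- c(3, 3, 2, 1)
y <- c("A", "B", "C", "D")

# Ascending on x, descending on y; a vector for `decreasing` needs
# method = "radix"
idx <- order(x, y, decreasing = c(FALSE, TRUE), method = "radix")
y[idx]
```

The tie on `x == 3` is broken in favour of "B" over "A" because the second key is sorted in descending order.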

@billdenney
Contributor

I just realized that you were probably getting at something a bit different. Here is a way that you can pass in an arbitrary number of vectors. It has the side benefit that it also indirectly exposes the na.last argument from order():

library(forcats)
#' @param .f A factor (or character vector)
#' @param ... Arguments passed to \code{base::order()}.  (\code{method} may not
#'   be modified, and \code{decreasing} is handled through the \code{.desc}
#'   argument.)
#' @param .desc Order in descending order?  It may either be a scalar or a
#'   vector with one element per sorting vector passed in \code{...}.
#' @inheritParams fct_inorder
fct_reordern <- function(.f, ..., .desc=FALSE, ordered=NA) {
  stopifnot(length(.desc) %in% c(1, ...length()))
  stopifnot(all(.desc %in% c(TRUE, FALSE)))
  f <- forcats:::check_factor(.f)
  # The radix method is used to support a vector of .desc (other methods only
  # support scalar values for .desc).
  new_order <- base::order(..., method="radix", decreasing=.desc)
  f_sorted <- f[new_order]
  fct_inorder(f=f_sorted, ordered=ordered)
}

  mydata <-
    data.frame(
      A=c(3, 3, 2, 1),
      B=c("A", "B", "C", "D"),
      stringsAsFactors=FALSE
    )
  fct_reordern(.f=c("A", "B", "C", "D"), mydata$A, mydata$B)
#> [1] D C A B
#> Levels: D C A B
  fct_reordern(.f=c("A", "B", "C", "D"), mydata$A, mydata$B, .desc=TRUE)
#> [1] B A C D
#> Levels: B A C D
  fct_reordern(.f=c("A", "B", "C", "D"), mydata$A, mydata$B, .desc=c(FALSE, TRUE))
#> [1] D C B A
#> Levels: D C B A

Created on 2019-11-18 by the reprex package (v0.3.0)

@hadley
Member

hadley commented Nov 18, 2019

Ah, ok, that's starting to make sense to me. I'd suggest dropping the .desc argument in favour of using dplyr::desc(). And shouldn't you be ordering the levels, rather than the actual data? (And can you please omit the names of data arguments in calls?)
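
A base R sketch of what those two suggestions might look like together. Both function names are hypothetical: `reorder_levels_by` is not the function from the eventual PR, and `desc_key` is a local stand-in with the same definition as dplyr::desc() (`-xtfrm(x)`). The sketch assumes each key is constant within a level and uses the value from the level's first row.

```r
# Stand-in for dplyr::desc(): turn a key into one that sorts in reverse
desc_key <- function(x) -xtfrm(x)

# Hypothetical helper: reorder the levels only, leaving the data in place
reorder_levels_by <- function(f, ...) {
  f <- as.factor(f)
  keys <- list(...)
  # Row index of the first occurrence of each level
  first <- match(levels(f), f)
  level_keys <- lapply(keys, function(k) k[first])
  new_order <- do.call(order, unname(level_keys))
  factor(f, levels = levels(f)[new_order])
}

mydata <- data.frame(
  A = c(3, 3, 2, 1),
  B = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

g <- reorder_levels_by(c("A", "B", "C", "D"), mydata$A, desc_key(mydata$B))
levels(g)
as.character(g)
```

The levels come out as D, C, B, A, matching the `.desc=c(FALSE, TRUE)` call above, while the values of `g` keep their original order.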

@hadley
Member

hadley commented Nov 18, 2019

But at this point, the overall approach seems reasonable to me, so it's probably easier to move to a PR.

billdenney added a commit to billdenney/forcats that referenced this issue Nov 18, 2019
@billdenney billdenney linked a pull request Nov 18, 2019 that will close this issue

8 participants