English 中文(简体)
Remove columns from dataframe where ALL values are NA
原标题:

I have a data frame where some of the columns contain NA values.

How can I remove columns where all rows contain NA values?

最佳回答

Try this:

df <- df[,colSums(is.na(df))<nrow(df)]
问题回答

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory and time efficient

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]

examples using large data (30 columns, 1e6 rows)

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0( X ,seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[,colSums(is.na(bd) < nrow(bd))]})
# error -- can t allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can t allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user  system elapsed 
## 0.14    0.03    0.18 

Update

You can now use select with the where selection helper. select_if is superceded, but still functional as of dplyr 1.0.2. (thanks to @mcstrother for bringing this to attention).

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5

Late to the game but you can also use the janitor package. This function will remove columns which are all NA, and can be changed to remove rows that are all NA as well.

df <- janitor::remove_empty(df, which = "cols")

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame (var1 = c(1:7,NA),
                  var2 = c(1,2,1,3,4,NA,NA,9),
                  var3 = c(NA)
                  )

then you can use apply() to see which columns fulfill your condition and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply (is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9

Another options with purrr package:

library(dplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))
df[sapply(df, function(x) all(is.na(x)))] <- NULL

An old question, but I think we can update @mnel s nice answer with a simpler data.table solution:

DT[, .SD, .SDcols = (x) !all(is.na(x))]

(I m using the new (x) lambda function syntax available in R>=4.1, but really the key thing is to pass the logical subsetting through .SDcols.

Speed is equivalent.

microbenchmark::microbenchmark(
  which_unlist = DT[, which(unlist(lapply(DT, (x) !all(is.na(x))))), with=FALSE],
  sdcols       = DT[, .SD, .SDcols = (x) !all(is.na(x))],
  times = 2
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval cld
#>  which_unlist 51.32227 51.32227 56.78501 56.78501 62.24776 62.24776     2   a
#>        sdcols 43.14361 43.14361 49.33491 49.33491 55.52621 55.52621     2   a

You can use Janitor package remove_empty

library(janitor)

df %>%
  remove_empty(c("rows", "cols")) #select either row or cols or both

Also, Another dplyr approach

 library(dplyr) 
 df %>% select_if(~all(!is.na(.)))

OR

df %>% select_if(colSums(!is.na(.)) == nrow(df))

this is also useful if you want to only exclude / keep column with certain number of missing values e.g.

 df %>% select_if(colSums(!is.na(.))>500)

A handy base R option could be colMeans():

df[, colMeans(is.na(df)) != 1]

I hope this may also help. It could be made into a single command, but I found it easier for me to read by dividing it in two commands. I made a function with the following instruction and worked lightning fast.

naColsRemoval = function (DataTable) {
     na.cols = DataTable [ , .( which ( apply ( is.na ( .SD ) , 2 , all ) ) )]
     DataTable [ , unlist (na.cols) := NULL , with = F]
     }

.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as

janitor::remove_constant() 

does this very nicely.

From my experience of having trouble applying previous answers, I have found that I needed to modify their approach in order to achieve what the question here is:

How to get rid of columns where for ALL rows the value is NA?

First note that my solution will only work if you do not have duplicate columns (that issue is dealt with here (on stack overflow)

Second, it uses dplyr.

Instead of

df <- df %>% select_if(~all(!is.na(.)))

I find that what works is

df <- df %>% select_if(~!all(is.na(.)))

The point is that the "not" symbol "!" needs to be on the outside of the universal quantifier. I.e. the select_if operator acts on columns. In this case, it selects only those that do not satisfy the criterion

every element is equal to "NA"

library(dplyr)

# create a sample data frame
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(NA, NA, NA, NA),
                 z = c(6, 7, NA, 9))

# remove columns with all NAs
df <- df %>%
  select_if(~!all(is.na(.)))




相关问题
How to plot fitted model over observed time series

This is a really really simple question to which I seem to be entirely unable to get a solution. I would like to do a scatter plot of an observed time series in R, and over this I want to plot the ...

REvolution for R

since the latest Ubuntu release (karmic koala), I noticed that the internal R package advertises on start-up the REvolution package. It seems to be a library collection for high-performance matrix ...

R - capturing elements of R output into text files

I am trying to run an analysis by invoking R through the command line as follows: R --no-save < SampleProgram.R > SampleProgram.opt For example, consider the simple R program below: mydata =...

R statistical package: wrapping GOFrame objects

I m trying to generate GOFrame objects to generate a gene ontology mapping in R for unsupported organisms (see http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/...

Changing the order of dodged bars in ggplot2 barplot

I have a dataframe df.all and I m plotting it in a bar plot with ggplot2 using the code below. I d like to make it so that the order of the dodged bars is flipped. That is, so that the bars labeled "...

Strange error when using sparse matrices and glmnet

I m getting a weird error when training a glmnet regression. invalid class "dgCMatrix" object: length(Dimnames[[2]]) must match Dim[2] It only happens occasionally, and perhaps only under larger ...

Generating non-duplicate combination pairs in R

Sorry for the non-descriptive title but I don t know whether there s a word for what I m trying to achieve. Let s assume that I have a list of names of different classes like c( 1 , 2 , 3 , 4 ) ...

Per panel smoothing in ggplot2

I m plotting a group of curves, using facet in ggplot2. I d like to have a smoother applied to plots where there are enough points to smooth, but not on plots with very few points. In particular I d ...

热门标签