How can I simplify the code for selecting specific diagnoses from a diagnosis variable using ICD-10 codes in R?
  • Time: 2023-05-25 02:37:20
  • Tags: r, dataframe

I am trying to select a number of ICD-10 medical codes from a diagnosis variable to retrieve a set of specific diagnoses within a dataset, but I am struggling to do so succinctly.

For context, there are over 50 different diagnostic instances for every participant ID (PID), each coded as Disease_code_1, Disease_code_2, Disease_code_3, etc., with corresponding dates, as shown below.

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
                Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
                Disease_code_3 = c("H250", NA, NA, NA, NA),
                Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
                Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
                Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

As can be seen, there are several I802 codes in Disease_code_1, but also in Disease_code_2, and I want to include all of them in my analysis.

However, when I try to combine the codes I want to select using a simple ifelse call, I cannot get the or operator to work and end up with 0 participants:

data$diagnosis<- with(data, ifelse((data$disease_code_1 == "I802||G200||H356"),1,0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

I am left with a sum of 0. The only way I have obtained a count so far is by writing every comparison out in full. Although this example has only 3 instances and 3 disease codes, the real data has over 50 instances and approximately 12 diagnoses to check in each, so this line becomes exceptionally long and I would like to make it more succinct.

data$diagnosis <- with(data, ifelse(((data$disease_code_1 == "I802")|(data$disease_code_1 == "G200")|(data$disease_code_1 == "H356")|(data$disease_code_2 == "I802")|(data$disease_code_2 == "G200")|(data$disease_code_2 == "H356")|(data$disease_code_3 == "I802")|(data$disease_code_3 == "G200")|(data$disease_code_3 == "H356")), 1, 0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

This method will provide an answer, however once I get to a certain length, the line will also cease to run, resulting in a + symbol within the command line.

Is there a way I can make this much more succinct?

Answers

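A few notes regarding your code:

  1. When you use with, you no longer need the data$ prefix; with(data, ifelse(disease_code_1 == "I802||G200||H356", 1, 0)) is enough as far as syntax goes (the comparison itself is still wrong, see note 2)
  2. We cannot use == to compare against a set of strings: "I802||G200||H356" is a single literal string, and no code in the data is equal to it, so every comparison returns FALSE (or NA)
  3. In your code you have lower-case disease_code but in your data you have upper-case Disease_code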
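
To illustrate note 2, here is a minimal sketch on a made-up vector (the names x, target, eq_result and in_result are hypothetical and used only for this illustration). It compares what == does with the pasted-together string against what %in% does with a vector of codes.

```r
# Hypothetical example vector: one code of interest, one other code, one missing value
x <- c("I802", "A071", NA)
target <- c("I802", "G200", "H356")

# == compares each element with the single literal string "I802||G200||H356"
eq_result <- x == "I802||G200||H356"
print(eq_result)   # FALSE FALSE NA

# %in% tests membership in the set of codes and never returns NA
in_result <- x %in% target
print(in_result)   # TRUE FALSE FALSE
```

Because %in% returns FALSE rather than NA for missing values, the separate step that sets NA results to 0 should no longer be needed.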

Solution

To check the existence of a set of strings, we can use the %in% operator, and we can use apply to iterate over the rows of the data frame.

data$diagnosis <- apply(data[startsWith(colnames(data), "Disease_code_")], 1, \(x) as.integer(any(x %in% c("I802", "G200", "H356"))))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010         1
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>         1
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>         1
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>         0
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>         1
sum(data$diagnosis)
[1] 4
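
As a variation on the row-wise apply above, the same flag can be built column by column, which may be worth considering with 50+ code columns. This is only a sketch under assumptions: the data frame is a cut-down copy of the example data (code columns only), and target_codes, code_cols and hits are names introduced here for illustration.

```r
# Cut-down copy of the example data (date columns omitted for brevity)
data <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA)
)

# Keep the codes of interest in one vector so the list is easy to extend
target_codes <- c("I802", "G200", "H356")

# Select every code column by name pattern instead of listing them by hand
code_cols <- grep("^Disease_code_", names(data), value = TRUE)

# For each column, test membership; then combine the columns with an element-wise OR
hits <- Reduce(`|`, lapply(data[code_cols], function(col) col %in% target_codes))

data$diagnosis <- as.integer(hits)
print(sum(data$diagnosis))   # 4
```

Adding further diagnoses then only means extending target_codes, and extra Disease_code_ columns are picked up by the name pattern without changing the rest of the code.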

Using dplyr::if_any() (with an assist from tidyr::replace_na()):

library(dplyr)
library(tidyr)

data %>%
  mutate(diagnosis = replace_na(
    if_else(
      if_any(starts_with("disease_code"), \(x) x %in% c("I802", "G200", "H356")),
      1,
      0
    ),
    0
  ))

The way I would approach this is to convert your original data frame into a long format and then select the rows from there:

library(tidyr)
library(dplyr)

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
                Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
                Disease_code_3 = c("H250", NA, NA, NA, NA),
                Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
                Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
                Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA))

#convert to long format, all Disease codes in one column
dflong <- pivot_longer(df, -ID, names_sep = "_(?=\\d)", names_to = c(".value", "count"), values_to = "value")

#find the rows that matches the code(s) of interest
hascondition <- which(dflong$Disease_code %in% c("I802", "G200", "H356"))


#get a list of patient IDs
patients <- unique(dflong$ID[hascondition])

#filter/subset the original data frame
subset <- df %>% filter(ID %in% patients)



