Question

I am trying to select a number of ICD-10 codes from a set of diagnosis variables to identify participants with specific diagnoses, but I am struggling to do so succinctly.

For context, each participant ID (PID) has over 50 diagnosis instances, coded as Disease_code_1, Disease_code_2, Disease_code_3, etc., with corresponding dates, as shown below.

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
                Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
                Disease_code_3 = c("H250", NA, NA, NA, NA),
                Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
                Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
                Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

As can be seen, I802 appears in Disease_code_1 and also in Disease_code_2, and I want to include both in my analysis.

However, when I try to combine the codes I want to select in a single ifelse call, the "or" operator does not work and I get 0 participants:

data$diagnosis<- with(data, ifelse((data$disease_code_1 == "I802||G200||H356"),1,0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

I am left with a sum of 0. The only way I have obtained a count so far is with a long-form expression. Although this example has only 3 instances and 3 disease codes, the real data have over 50 instances and approximately 12 diagnoses to check in each, so this line of code becomes exceptionally long and I would like to make it more succinct.

data$diagnosis <- with(data, ifelse(((data$disease_code_1 == "I802")|(data$disease_code_1 == "G200")|(data$disease_code_1 == "H356")|(data$disease_code_2 == "I802")|(data$disease_code_2 == "G200")|(data$disease_code_2 == "H356")|(data$disease_code_3 == "I802")|(data$disease_code_3 == "G200")|(data$disease_code_3 == "H356")), 1, 0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

This method gives an answer, but once the line reaches a certain length it stops running and the console shows a + prompt instead.

Is there a way I can make this much more succinct?

Answer 1

A few notes regarding your code:

When you use with(), you no longer need the data$ prefix; with(data, ifelse(disease_code_1 == "I802||G200||H356", 1, 0)) is enough
== cannot test membership in a set of strings: "I802||G200||H356" is compared as one literal string, so nothing matches
Your code uses lower-case disease_code, but the columns in your data are named Disease_code with an upper-case D
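
To illustrate the second note, here is a minimal sketch on a small made-up vector (x and codes are hypothetical names used only for this example). It compares what == and %in% return for the same values, including a missing one.

```r
codes <- c("I802", "G200", "H356")
x <- c("I802", "G200", "A071", NA)

# == compares each element with the single literal string,
# so no code matches and the NA stays NA
eq_result <- x == "I802||G200||H356"
print(eq_result)   # FALSE FALSE FALSE NA

# %in% tests membership in the set and returns FALSE (not NA) for NA
in_result <- x %in% codes
print(in_result)   # TRUE TRUE FALSE FALSE
```

Because %in% never returns NA here, the separate step that replaces NA with 0 is no longer needed.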

Solution

To check the existence of a set of strings, we can use the %in% operator, and we can use apply to iterate over the rows of the data frame.

# \(x) is shorthand for function(x) (R >= 4.1)
data$diagnosis <- apply(data[startsWith(colnames(data), "Disease_code_")], 1, \(x) as.integer(any(x %in% c("I802", "G200", "H356"))))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010         1
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>         1
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>         1
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>         0
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>         1

sum(data$diagnosis)
[1] 4
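
With 50+ code columns and many rows, a column-wise variant may be worth trying instead of a row-wise apply. The following is only a sketch under the assumption that the code columns are character vectors; hits, codes and code_cols are names introduced here for illustration, and the date columns are left out to keep the example short.

```r
data <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
codes <- c("I802", "G200", "H356")
code_cols <- startsWith(colnames(data), "Disease_code_")

# One logical column per code column: TRUE where the code is in the set
hits <- sapply(data[code_cols], function(col) col %in% codes)

# A participant is flagged if any of their code columns matched
data$diagnosis <- as.integer(rowSums(hits) > 0)
sum(data$diagnosis)
```

Note that sapply() only returns a matrix here because the data frame has more than one row; for a single-row data frame the result would need to be reshaped before calling rowSums().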

Answer 2

Using dplyr::if_any() (with an assist from tidyr::replace_na()):

library(dplyr)
library(tidyr)

data %>%
  mutate(diagnosis = replace_na(
    if_else(
      if_any(starts_with("Disease_code"), \(x) x %in% c("I802", "G200", "H356")),
      1,
      0
    ),
    0
  ))

Answer 3

The way I would approach this is to convert your original data frame into a long format and then select the rows from there:

library(tidyr)
library(dplyr)

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
                Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
                Disease_code_3 = c("H250", NA, NA, NA, NA),
                Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
                Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
                Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA))

# convert to long format: all disease codes in one column, all dates in another
# (names are split at the underscore before the trailing digit)
dflong <- pivot_longer(df, -ID, names_sep = "_(?=\\d)", names_to = c(".value", "count"), values_to = "value")

# find the rows that match the code(s) of interest
hascondition <- which(dflong$Disease_code %in% c("I802", "G200", "H356"))


#get a list of patient IDs
patients <- unique(dflong$ID[hascondition])

#filter/subset the original data frame
subset <- df %>% filter(ID %in% patients)
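
If the diagnosis date of each matching code is also of interest, the same long-format idea can be sketched in base R. This is an illustration rather than part of the approach above: it assumes every Disease_code_N column has a Date_of_diagnosis_N partner, and long, matches, code_cols and date_cols are names made up for the example.

```r
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA),
  Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
  Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
  Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
codes <- c("I802", "G200", "H356")

# Pair each code column with its date column by the shared numeric suffix
code_cols <- grep("^Disease_code_", names(df), value = TRUE)
date_cols <- sub("^Disease_code_", "Date_of_diagnosis_", code_cols)

# Stack the columns: unlist() concatenates column by column,
# so the IDs are repeated once per code column
long <- data.frame(
  ID = rep(df$ID, times = length(code_cols)),
  Disease_code = unlist(df[code_cols], use.names = FALSE),
  Date_of_diagnosis = unlist(df[date_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# One row per matching diagnosis, with its date
matches <- long[long$Disease_code %in% codes, ]
print(matches)
```

unique(matches$ID) then gives the same patient list as above, while matches itself keeps one row per qualifying diagnosis.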
