I am trying to select a number of ICD-10 medical codes from a set of diagnosis variables in order to identify participants with specific diagnoses, but I am struggling to do so succinctly.
For context, there are over 50 diagnostic instances for every participant ID (PID), coded as Disease_code_1, Disease_code_2, Disease_code_3, etc., each with a corresponding date, as shown below.
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                 Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
                 Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
                 Disease_code_3 = c("H250", NA, NA, NA, NA),
                 Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
                 Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
                 Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA))
    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>
As can be seen, I have several I802 codes within Disease_code_1, but also within Disease_code_2, and I want to include all of them in my analysis.
However, when I try to combine the codes I want to select using a simple ifelse function with an "or" operator, I get 0 participants:
df$diagnosis <- with(df, ifelse(Disease_code_1 == "I802||G200||H356", 1, 0))
df$diagnosis[is.na(df$diagnosis)] <- 0
sum(df$diagnosis)
I am left with a sum of 0. The only way I have obtained a non-zero count so far is to write out every comparison in full. Although this example has only 3 instances and 3 disease codes, my real data has over 50 instances and approximately 12 diagnoses to check in each, so this line of code becomes exceptionally long and I would like to make it more succinct.
df$diagnosis <- with(df, ifelse(
  (Disease_code_1 == "I802") | (Disease_code_1 == "G200") | (Disease_code_1 == "H356") |
  (Disease_code_2 == "I802") | (Disease_code_2 == "G200") | (Disease_code_2 == "H356") |
  (Disease_code_3 == "I802") | (Disease_code_3 == "G200") | (Disease_code_3 == "H356"),
  1, 0))
df$diagnosis[is.na(df$diagnosis)] <- 0
sum(df$diagnosis)
This method does give an answer, but once the line reaches a certain length it stops running and the console shows only a + symbol at the prompt.
Is there a way I can make this much more succinct?
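For reference, here is a minimal sketch of one possible direction, not a definitive solution. It relies on the fact that `==` compares each value against the single literal string "I802||G200||H356", which no code equals, whereas `%in%` tests membership in a vector of codes. The names `target_codes`, `code_cols` and `hits` are hypothetical, and the sketch assumes every diagnosis column starts with `Disease_code_`.

```r
# Toy data mirroring the example above (only the code columns are needed here).
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Codes of interest, kept in one vector so the list is easy to extend.
target_codes <- c("I802", "G200", "H356")

# Pick up every code column by name pattern instead of typing each one.
# Assumption: all diagnosis columns share the "Disease_code_" prefix.
code_cols <- grep("^Disease_code_", names(df), value = TRUE)

# %in% returns FALSE (rather than NA) for missing values,
# so no separate is.na() clean-up step is needed.
hits <- lapply(df[code_cols], function(x) x %in% target_codes)

# Combine the per-column results with an element-wise OR:
# a participant is flagged if any code column matches.
df$diagnosis <- as.integer(Reduce(`|`, hits))

sum(df$diagnosis)  # 4 for the toy data above
```

Because the columns are selected by pattern and the codes live in a vector, the same few lines should extend to 50+ instances and 12 diagnoses without a long chain of comparisons. Note that "G20" in row 3 is not matched here, since `%in%` requires an exact match.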