Original title: Best way to replace a variety of write-in answers with numbers using R

I am cleaning a dataset in R. Part of my dataset looks like this:

record_id | organization | other_work_loc
1               12            CCC
2               12            AMG
3               12            TAO
4                1
5                2
6                7

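other_work_loc is a free-response column whose entries vary widely. It only contains data when organization == 12. I want to recode the organization and other_work_loc data into a single column (org_cat) with three categories (1, 2, 3). Most of the other_work_loc data will be recoded as 3.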

library(dplyr)

dataset <- dataset %>% mutate(org_cat = case_when(organization == 1 | organization == 2 ~ 1,
                                                  organization >= 3 & organization < 12 ~ 2,
                                                  other_work_loc == "CCC" | other_work_loc == "AMG" ~ 3))



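Use Excel or a similar tool to create a data frame with the columns shown below, where the last two hold your free-response answers and their corresponding numeric replacement values; it is basically a lookup table (LUT). I saved it as lut.csv, and it looks like this: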

organization    other_work_loc  newvar
12              CCC             3
12              AMG             3
12              TAO             2
1                               1
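If you would rather not go through Excel, the same lookup table could be typed directly in R. This is only a minimal sketch: it assumes the tibble package is available, and it uses NA_character_ to stand in for the blank other_work_loc cell.

```r
library(tibble)

# In-code version of the lookup table shown above (same values as the post).
# NA_character_ stands in for the blank other_work_loc cell of organization 1.
lut <- tribble(
  ~organization, ~other_work_loc, ~newvar,
  12,            "CCC",           3,
  12,            "AMG",           3,
  12,            "TAO",           2,
  1,             NA_character_,   1
)

lut
```

A CSV file is still easier to hand to someone who does not use R, so treat this as an alternative for small tables rather than a replacement.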


library(readr)
library(dplyr)

df <- read_csv("df.csv") %>% print()
lut <- read_csv("lut.csv") %>% print()

left_join(df, lut)

Joining with `by = join_by(organization, other_work_loc)`
# A tibble: 6 x 4
  record_id organization other_work_loc newvar
      <dbl>        <dbl> <chr>           <dbl>
1         1           12 CCC                 3
2         2           12 AMG                 3
3         3           12 TAO                 2
4         4            1 NA                  1
5         5            2 NA                 NA
6         6            7 NA                 NA


  • Even though I left other_work_loc blank in the LUT for organization #1, the join still matched that row of your original file: the blank is read in as NA, and the NA in your data matches the NA in the LUT.
  • I didn't fill out the entire LUT, so organizations #2 and #7 ended up with NA for newvar.
  • For organization #12, you can edit the LUT file to add further free responses and their corresponding newvar entries much more easily than you can write additional lines of case_when code.
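The points above can be checked on a small self-contained example. The sketch below rebuilds the sample data and the LUT as tibbles (values copied from this post), names the join keys explicitly, and then lists the combinations that still need a row in the LUT. The object names joined and missing_from_lut are my own, chosen for illustration.

```r
library(dplyr)
library(tibble)

# Toy copies of the data and the lookup table (values taken from the post).
df <- tibble(
  record_id      = 1:6,
  organization   = c(12, 12, 12, 1, 2, 7),
  other_work_loc = c("CCC", "AMG", "TAO", NA, NA, NA)
)
lut <- tibble(
  organization   = c(12, 12, 12, 1),
  other_work_loc = c("CCC", "AMG", "TAO", NA),
  newvar         = c(3, 3, 2, 1)
)

# Naming the keys silences the "Joining with" message.
# na_matches = "na" is dplyr's default: an NA key in df matches an NA key in lut.
joined <- left_join(df, lut,
                    by = c("organization", "other_work_loc"),
                    na_matches = "na")

# Combinations that occur in the data but have no row in the lookup table yet.
missing_from_lut <- df %>%
  anti_join(lut, by = c("organization", "other_work_loc")) %>%
  distinct(organization, other_work_loc)

print(joined)
print(missing_from_lut)
```

Running the anti_join after each new batch of responses gives you a short list of write-in answers to add to the LUT, which is usually quicker than scanning the joined result for NA values by eye.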

