Best way to replace a variety of write-in answers with numbers using R

I am cleaning a dataset in R. Part of my dataset looks like this:

record_id | organization | other_work_loc
1               12            CCC
2               12            AMG
3               12            TAO
4                1
5                2
6                7

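other_work_loc is a free-response column whose entries vary widely. It only contains data when organization == 12. I want to recode organization and other_work_loc into a single column (org_cat) with three categories (1, 2, 3). Most of the other_work_loc responses will be recoded as 3.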

dataset <- dataset %>% mutate(org_cat = case_when(organization == 1 | organization == 2 ~ 1,
                                                  organization >= 3 & organization < 12 ~ 2,
                                                  other_work_loc == "CCC" | other_work_loc == "AMG" ~ 3))

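This code works, but there are 100 free responses in other_work_loc. Most will be recoded as 3. However, 22 need to be classified as 1 or 2, and I am wondering whether there is a more elegant way to do this than writing a recode for every individual response?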

Answer

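Using Excel or a similar tool, create a data frame with the columns organization, other_work_loc, and newvar, where the last two are your free-response answers and their corresponding numeric replacement values. This is basically a lookup table (LUT). I named it lut.csv, and it looks like this: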

organization    other_work_loc  newvar
12              CCC             3
12              AMG             3
12              TAO             2
1                               1

I saved your data as df.csv and, after loading the tidyverse, used left_join() to do the replacement:

library(tidyverse)

df <- read_csv("df.csv") %>% print()
lut <- read_csv("lut.csv") %>% print()

left_join(df, lut)

Joining with `by = join_by(organization, other_work_loc)`
# A tibble: 6 x 4
  record_id organization other_work_loc newvar
      <dbl>        <dbl> <chr>           <dbl>
1         1           12 CCC                 3
2         2           12 AMG                 3
3         3           12 TAO                 2
4         4            1 NA                  1
5         5            2 NA                 NA
6         6            7 NA                 NA

Key points:

  • Even though I left other_work_loc blank in the LUT for organization #1, it still matched that line of your original file. read_csv() reads the blank as NA in both files, and dplyr joins match NA to NA by default.
  • I didn't fill out the entire LUT, so organizations #2 and #7 ended up with NA for newvar.
  • For organization #12, you can edit the LUT file to add more free responses and their corresponding newvar entries much more easily than you can write additional lines of case_when code.
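
Since most free responses should end up as 3, one possible variation is to list only the exceptions in the lookup table and let everything else under organization 12 default to 3. The following is a minimal sketch under that assumption, not the method from the answer above. The data are built inline instead of read from CSV files, the lookup table holds a single illustrative exception (TAO, taken from the sample LUT), and org_cat follows the recoding rules from the question.

```r
library(dplyr)

# Illustrative data mirroring the sample in the question
df <- data.frame(
  record_id      = 1:6,
  organization   = c(12, 12, 12, 1, 2, 7),
  other_work_loc = c("CCC", "AMG", "TAO", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Lookup table listing only the exceptions (assumed: TAO should be 2)
lut <- data.frame(
  other_work_loc = "TAO",
  newvar         = 2,
  stringsAsFactors = FALSE
)

result <- df %>%
  left_join(lut, by = "other_work_loc") %>%
  mutate(org_cat = case_when(
    organization %in% c(1, 2)             ~ 1,
    organization >= 3 & organization < 12 ~ 2,
    # Free responses not found in the lookup table fall back to 3
    organization == 12                    ~ coalesce(newvar, 3)
  )) %>%
  select(-newvar)

print(result)
```

With this layout the lookup table only needs the 22 responses that map to 1 or 2. The match is exact, so responses that differ in case or surrounding whitespace would need to be normalized before the join.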



