Question

I am a social scientist who works with survey data a lot. Many of the variables are four-point agree-disagree Likert scales with response options "Strongly agree", "Somewhat agree", "Somewhat disagree", "Strongly disagree", but some are six-point scales. A consistent part of the data cleaning process is to convert these variables into dichotomous factors (meaning they have two response options, "Agree" and "Disagree"). Here is an example where data is the data frame, x is the original variable with all four response options, and new_x is the dichotomized variable:

pacman::p_load(tidyverse, labelled, rlang)

data %>% 
  mutate(
    new_x = case_match(
      x,
      c(1:2) ~ "Agree",
      c(3:4) ~ "Disagree"
    )
  )

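The problem is that I often have 30+ variables that I need to do this to. I know I can use across() to apply the same transformation to all 30 variables, but we have to repeat this every two weeks when we get new survey data. Instead, I want a function called make_dicho() that I can use inside mutate() and across(), so that I don't have to write out the whole case_match() call every time. Here is a successful attempt at a basic version: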

# create sample data
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
)

data

# create the function where values of 1-2 are "Agree" and 3-4 are "Disagree"
make_dicho <- function(var) {
  dplyr::case_match(
    var,
    c(1:2) ~ "Agree",
    c(3:4) ~ "Disagree"
  )
}

# check to see if it worked
data %>% mutate(new_x = make_dicho(x))

# success!

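This function works, but it is fragile because it relies on the survey designer and the survey provider coding the values in a very specific way with four response options. One way to avoid this is to use the metadata containing the value labels, which indicate what each value means. Since most of my data includes this metadata, I would like to use it to determine automatically which values should be recoded as "Agree" and which as "Disagree". This complicates things considerably, because I now need to add a new argument for the data frame. This is what I have come up with so far: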

# add value labels to the data

data <- tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>% 
  # add value labels
  labelled::set_value_labels(
    x = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    y = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    z = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4)
  )

# write the new function
make_dicho <- function(df = NULL, var) {

  ## if var is a symbol convert it to a string
  # "Returns a naked expression of the variable"
  var <- rlang::enexpr(var)
  
  if (!is.character(var)) {
    # convert to a sym() object and then use as_name to make it a string
   var <- rlang::as_name(rlang::ensym(var))
  }

  # Since this is taking advantage of labelled data, it should be of class haven_labelled 
  if (class(df[[var]])[1] == "haven_labelled") {

    ### Set up vectors based on the underlying attribute
    
    # get the named vector
    labs <- attributes(df[[var]])$labels
    
    
    # flip the names
    labs <- setNames(names(labs), labs)
    
    
    # get the agree vector by removing the strings containing "disagree" or "Disagree"
    agree_vec <- labs[!str_detect(labs, pattern = "disagree|Disagree")]
    
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    agree_vec <- enframe(agree_vec) %>%
      # put the "value" column at the beginning of the df
      relocate(value) %>%
      # convert "name" to numeric
      mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      deframe()
    
    
    # get the disagree vector by keeping the strings containing "disagree" or "Disagree"
    disagree_vec <- labs[str_detect(labs, pattern = "disagree|Disagree")]
    
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    disagree_vec <- enframe(disagree_vec) %>%
      # put the "value" column at the beginning of the df
      relocate(value) %>%
      # convert "name" to numeric
      mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      deframe()
    
    
    
    ### now create the case_match function,
    # Adding in df[[var]] so that it know which vector to use
    dplyr::case_match(
      df[[var]],
      agree_vec ~ "Agree",
      disagree_vec ~ "Disagree"
    ) 
    
    }

}

# test function
data %>% mutate(new_x = make_dicho(x))

This fails and gives an error that says argument "var" is missing, with no default. However, if I add . inside make_dicho() it works. Like this:

data %>% mutate(new_x = make_dicho(., x))

My first question is: how do I update my function so that it no longer needs the . at the start? Second, how do I get it to work inside across()? Here is the code I am using:

# make all three variables dichotomous factors with "new_" prefix
data %>% mutate(
  across(
    c(x:z),
    ~make_dicho(., .x),
    .names = "new_{col}"
  )
)

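When I try to use across() I get an error message. My guess is that it has something to do with the dot inside make_dicho() and the call to df[[var]] inside case_match(). But I am honestly out of ideas, and although I feel very close, I know this function could be way off.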

Hopefully the request, while a bit complicated, is easy to understand. Thank you for any and all help!

Answer 1

Setting aside the methodology of survey evaluation and looking only at the intended recoding, this could be one approach. You would still need to know how each variable is coded and list it in .cols with the corresponding make_dicho() arguments in mutate(across(...)).

library(tidyverse)

## create sample data
## let's say `up` means higher values indicate higher agreement and `down`
## means lower values indicate higher agreement. let's also introduce
## some "errors" you may be faced with (e.g., `NA`, unrealistic values).
dat <- tibble(
  sc6_up_x = c(1,4,5,2),
  sc6_down_x = c(2,5,1000,4),
  sc6_up_y = c(NA,6,1,6),
  sc4_down_x = c(3,4,2,1),
  sc4_down_y = c(2,4,3,1),
  sc4_up_x = c(3,2,1,4)
)

dat

# create function
make_dicho <- function(x_var, agr_l, agr_u, dis_l, dis_u) {
  dplyr::case_match(
    x_var,
    c(agr_l:agr_u) ~ "Agree",
    c(dis_l:dis_u) ~ "Disagree"
  )
}

## check to see if it works when using one variable
dat %>% 
  mutate(new_x = make_dicho(sc6_down_x, 1,3,4,6))

## apply function to several variables of interest and with appropriate
## arguments
dat %>%
  mutate(
    across(
      .cols = c(sc6_up_x, sc6_up_y),
      .fns = list(
        make_dicho = ~ make_dicho(.x,4,6,1,3)
      ),
      .names = "new_{col}"
    ),
    across(
      .cols = c(sc6_down_x),
      .fns = list(
        make_dicho = ~ make_dicho(.x,1,3,4,6)
      ),
      .names = "new_{col}"
    ),
    across(
      .cols = c(sc4_up_x),
      .fns = list(
        make_dicho = ~ make_dicho(.x,3,4,1,2)
      ),
      .names = "new_{col}"
    ),
    across(
      .cols = c(sc4_down_x, sc4_down_y),
      .fns = list(
        make_dicho = ~ make_dicho(.x,1,2,3,4)
      ),
      .names = "new_{col}"
    )
  )
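
To see how this version of make_dicho() treats values that fall outside both ranges (such as the 1000 in sc6_down_x) and missing values, a quick check on a plain vector may help. This is only an illustrative sketch that repeats the definition above; the input vector is made up for the check.

```r
library(dplyr)

# Same definition as above, repeated so the check runs on its own
make_dicho <- function(x_var, agr_l, agr_u, dis_l, dis_u) {
  dplyr::case_match(
    x_var,
    c(agr_l:agr_u) ~ "Agree",
    c(dis_l:dis_u) ~ "Disagree"
  )
}

# 2 is in the agree range, 5 in the disagree range,
# 1000 matches neither range, and NA is not listed on any left-hand side
out <- make_dicho(c(2, 5, 1000, NA), 1, 3, 4, 6)
out
#> [1] "Agree"    "Disagree" NA         NA
```

Unmatched values come back as NA because case_match() falls back to a missing value when no .default is supplied, so miscoded entries do not end up in either category.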

Answer 2

You can do this easily in base R.

Assuming that your data is as you've given it, let's first convert it to base R structures, with factor variables in a plain data frame:

library(magrittr)  # provides the %<>% assignment pipe
library(labelled)  # provides to_factor()

for (nn in names(data)) {
  data[,nn] %<>% to_factor
}
data %<>% as.data.frame
data
#>                   x                 y                 z
#> 1 Somewhat disagree    Somewhat agree Somewhat disagree
#> 2 Strongly disagree Strongly disagree    Somewhat agree
#> 3    Somewhat agree Somewhat disagree    Strongly agree
#> 4    Strongly agree    Strongly agree Strongly disagree

Now that your data is in base R format, all you need to do is this:

nr = nrow(data)
for (nn in names(data)) {
  # Note the space before "agree" -- without it, it would match "disagree" as well!
  levels(data[,nn])[which(levels(data[,nn]) %>% endsWith(" agree"))] = "Agree"
  levels(data[,nn])[which(levels(data[,nn]) %>% endsWith("disagree"))] = "Disagree"
  if ( !all(levels(data[,nn]) %in% c("Agree", "Disagree")) ) {
    print(paste0(nn, ": not all levels converted"))
  }
}
data
#>          x        y        z
#> 1 Disagree    Agree Disagree
#> 2 Disagree Disagree    Agree
#> 3    Agree Disagree    Agree
#> 4    Agree    Agree Disagree
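
If you would rather keep the labelled columns and stay inside mutate(across()), the same label-matching idea can be sketched as a function that takes only the column vector, so no data frame argument (and no .) is needed. This is a rough sketch, not a tested general solution: make_dicho_labelled and likert are hypothetical names, and it assumes numeric codes whose value labels contain the word "disagree" for every disagree option.

```r
library(dplyr)
library(labelled)

# Value labels shared by all three columns (same coding as in the question)
likert <- c(`Strongly agree` = 1, `Somewhat agree` = 2,
            `Somewhat disagree` = 3, `Strongly disagree` = 4)

data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  labelled::set_value_labels(x = likert, y = likert, z = likert)

# Hypothetical helper: works on the labelled vector itself
make_dicho_labelled <- function(x) {
  labs <- labelled::val_labels(x)  # named vector: label text -> code
  if (is.null(labs)) stop("`x` has no value labels")

  # Split the codes by whether their label mentions "disagree"
  is_dis <- grepl("disagree", names(labs), ignore.case = TRUE)
  dis_codes <- unname(labs[is_dis])
  agr_codes <- unname(labs[!is_dis])

  codes <- as.numeric(x)  # drop the labelled class and attributes
  out <- rep(NA_character_, length(codes))
  out[codes %in% agr_codes] <- "Agree"
  out[codes %in% dis_codes] <- "Disagree"
  out
}

res <- data %>%
  mutate(across(x:z, make_dicho_labelled, .names = "new_{col}"))

res$new_x
#> [1] "Disagree" "Disagree" "Agree"    "Agree"
```

Because across() hands each column to the function as a vector, the labels travel with it. Note that a neutral label such as "Neither agree nor disagree" would also match "disagree" here, so scales with a midpoint would need extra handling.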
