I am a social scientist who works with survey data a lot. Many of the variables are four-point agree-disagree Likert scales with the response options "Strongly agree", "Somewhat agree", "Somewhat disagree", and "Strongly disagree", though some are six-point scales. A consistent part of the data cleaning process is converting these variables into dichotomous factors (meaning they have two response options, "Agree" and "Disagree"). Here is an example where data is the data frame, x is the original variable with all four response options, and new_x is the dichotomized variable:
pacman::p_load(tidyverse, labelled, rlang)
data %>%
  mutate(
    new_x = case_match(
      x,
      c(1:2) ~ "Agree",
      c(3:4) ~ "Disagree"
    )
  )
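The problem is that I often have 30 or more variables that need this treatment. I know I can use across() to apply the same transformation to all 30 variables, but we have to repeat this every two weeks when we receive new survey data. Instead, I would like a function called make_dicho() that I can use inside mutate() and across(), so I don't have to write out the whole case_match() call every time. Here is a successful attempt at a basic version: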
# create sample data
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
)
data
# create the function where values of 1-2 are "Agree" and 3-4 are "Disagree"
make_dicho <- function(var) {
  dplyr::case_match(
    var,
    c(1:2) ~ "Agree",
    c(3:4) ~ "Disagree"
  )
}
# check to see if it worked
data %>% mutate(new_x = make_dicho(x))
# success!
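As a quick check that this basic version also works inside across(), here is a minimal sketch. It assumes dplyr 1.1.0 or later (for case_match()) and rebuilds the same sample data so it can be run on its own; the result name is only for illustration.

```r
library(dplyr)

# Sample data mirroring the example above
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
)

# Basic version: values 1-2 become "Agree", values 3-4 become "Disagree"
make_dicho <- function(var) {
  dplyr::case_match(
    var,
    c(1, 2) ~ "Agree",
    c(3, 4) ~ "Disagree"
  )
}

# Apply the function to every column and store the results as new_x, new_y, new_z
result <- data %>%
  mutate(across(x:z, make_dicho, .names = "new_{.col}"))
```

Because the function only takes a vector, across() can pass each column to it directly without a lambda.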
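This function works, but it is brittle because it relies on the survey designers and survey vendors coding the values in one very specific way, with four response options. One way around this is to take advantage of the metadata containing value labels, which indicate what each value means. Since most of my data includes this metadata, I would like to use it to determine automatically which values should be recoded as "Agree" and which should be recoded as "Disagree". This complicates things considerably, because I now need to add a new argument for the data frame. Here is what I have come up with so far: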
# add value labels to the data
data <- tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  # add value labels
  labelled::set_value_labels(
    x = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    y = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    z = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4)
  )
# write the new function
make_dicho <- function(df = NULL, var) {
  ## if var is a symbol, convert it to a string
  # "Returns a naked expression of the variable"
  var <- rlang::enexpr(var)
  if (!is.character(var)) {
    # convert to a sym() object and then use as_name() to make it a string
    var <- rlang::as_name(rlang::ensym(var))
  }
  # Since this takes advantage of labelled data, the column should be of class haven_labelled
  if (class(df[[var]])[1] == "haven_labelled") {
    ### Set up vectors based on the underlying attribute
    # get the named vector of value labels
    labs <- attributes(df[[var]])$labels
    # flip it so the values become the names and the labels become the elements
    labs <- setNames(names(labs), labs)
    # get the agree vector by removing the strings containing "disagree" or "Disagree"
    agree_vec <- labs[!str_detect(labs, pattern = "disagree|Disagree")]
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames
    agree_vec <- enframe(agree_vec) %>%
      # put the "value" column at the beginning of the df
      relocate(value) %>%
      # convert "name" to numeric
      mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      deframe()
    # get the disagree vector by keeping the strings containing "disagree" or "Disagree"
    disagree_vec <- labs[str_detect(labs, pattern = "disagree|Disagree")]
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames
    disagree_vec <- enframe(disagree_vec) %>%
      # put the "value" column at the beginning of the df
      relocate(value) %>%
      # convert "name" to numeric
      mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      deframe()
    ### now create the case_match() call,
    # adding in df[[var]] so that it knows which vector to use
    dplyr::case_match(
      df[[var]],
      agree_vec ~ "Agree",
      disagree_vec ~ "Disagree"
    )
  }
}
# test function
data %>% mutate(new_x = make_dicho(x))
This fails and gives an error that says: argument "var" is missing, with no default. However, if I add . inside make_dicho(), it works. Like this:
data %>% mutate(new_x = make_dicho(., x))
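My first question is: how do I update my function so that it no longer requires the . as its first argument? Second, how do I get it to work inside across()? Here is the code I am using: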
# make all three variables dichotomous factors with "new_" prefix
data %>% mutate(
  across(
    c(x:z),
    ~make_dicho(., .x),
    .names = "new_{.col}"
  )
)
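When I try this with across(), I get an error. My guess is that it has something to do with the . inside make_dicho() and the call to df[[var]] in case_match(). But I am honestly out of ideas. Although I feel I am very close, I also recognize that this function may need to be scrapped and rethought.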
Hopefully the request, while a bit complicated, is easy to understand. Thank you for any and all help!
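For reference, here is one direction that might work, offered as a sketch rather than a finished solution. It assumes the value labels travel with each column (as they do for haven_labelled vectors), so the function can take the column itself and never needs the data frame. The name make_dicho_vec() is hypothetical, and the rule that any label not containing "disagree" counts as "Agree" simply mirrors the logic of my attempt above.

```r
library(dplyr)
library(labelled)
library(stringr)

# Shared value labels for the four-point scale
likert_labels <- c(
  `Strongly agree` = 1, `Somewhat agree` = 2,
  `Somewhat disagree` = 3, `Strongly disagree` = 4
)

# Labelled sample data mirroring the example above
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  labelled::set_value_labels(x = likert_labels, y = likert_labels, z = likert_labels)

# Hypothetical helper: takes a labelled vector, not a data frame plus a column name
make_dicho_vec <- function(var) {
  # named vector: names are the label text, elements are the underlying codes
  labs <- labelled::val_labels(var)
  if (is.null(labs)) {
    stop("`var` has no value labels to dichotomize on.")
  }
  # assumption: any label that does not contain "disagree" counts as agreement
  is_disagree <- stringr::str_detect(
    names(labs),
    stringr::regex("disagree", ignore_case = TRUE)
  )
  agree_vals <- unname(labs[!is_disagree])
  disagree_vals <- unname(labs[is_disagree])
  # drop the labelled class and attributes so the codes are plain numbers
  codes <- as.numeric(unclass(var))
  dplyr::case_when(
    codes %in% agree_vals ~ "Agree",
    codes %in% disagree_vals ~ "Disagree"
  )
}

# Works on a single column...
single <- data %>% mutate(new_x = make_dicho_vec(x))

# ...and across several columns, with no `.` argument needed
result <- data %>%
  mutate(across(x:z, make_dicho_vec, .names = "new_{.col}"))
```

Because the labels are read from each column's own attributes, the same call should also handle a six-point scale, provided its labels follow the same agree/disagree wording.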