R grep for multiple string matches in a dataframe (quickly)
  • Time: 2024-05-21 23:24:29
  • Tags:
  • r
  • dataframe

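I have a data frame, currently around 4200 rows, built from data extracted from thousands of pdf files. One of its columns lists the document path of the pdf each row came from. Some data points are duplicated because earlier iterations of the data have seeped into the data frame. I have a function that finds the most recent iteration of each case when there are multiple iterations, and generates a list of file paths for the pdf files that contain the latest iteration of each entry. I need to filter my original data so that it only includes these latest file paths. Currently, I do it like this: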

data<-data[grep(paste0(latest_version, collapse="|"), data$path),]

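Here data is my original data frame and latest_version is the list of correct file paths. This works, but it is very slow. Since I need to update my data often, I need a fast way (seconds rather than minutes) to do this. I tried using fixed=T in my grep call, but for some reason that returned no results. Any ideas for a more efficient way to handle this?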
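A likely explanation for the empty result with fixed=T, sketched on hypothetical toy paths (not the real data): with fixed = TRUE, grep treats the whole pattern as a literal string, so the "|" separators added by paste0 are no longer alternation operators and the pattern only matches a path that literally contains the entire joined string.

```r
# Hypothetical toy paths for illustration only
paths  <- c("docs/a_v1.pdf", "docs/a_v2.pdf", "docs/b_v1.pdf")
latest <- c("docs/a_v2.pdf", "docs/b_v1.pdf")

# Same construction as in the question: one big alternation pattern
pattern <- paste0(latest, collapse = "|")

# Regex mode: "|" means "or", so the two latest paths are found
regex_hits <- grep(pattern, paths)

# fixed = TRUE: the pattern, including "|", is searched for literally,
# and no single path contains that whole string
fixed_hits <- grep(pattern, paths, fixed = TRUE)
```

Indexing a data frame with an empty result such as fixed_hits returns zero rows, which would be consistent with the behaviour described above.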

Edit:

The problem can be simulated by generating the following fake data:

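library(stringi)  # provides stri_rand_strings()

# 50000 random paths: 100 letters, then 50 digits, then 7 letters
test_df <- data.frame(a = 1:50000, b = 1:50000,
                      path = do.call(paste0, Map(stri_rand_strings, n = 50000, length = c(100, 50, 7),
                                                 pattern = c("[A-Z]", "[0-9]", "[A-Z]"))))
# Pick 2000 of the paths to play the role of the "latest version" list
test_latest_ver <- test_df$path[sample(50000, 2000)]

final_data <- test_df[grep(paste0(test_latest_ver, collapse = "|"), test_df$path), ]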

Answers

Using the join functions from the dplyr package should provide an improvement.

#use code from question here to define test_df and test_latest_ver

library(dplyr)
#create a new df with the new paths 
answer <- right_join(test_df, data.frame(latest=test_latest_ver), by=join_by(path==latest), keep=TRUE)

#remove the old paths
answer <-answer %>% select(-path)
#rename the columns
names(answer)[names(answer)=="latest"] <- "path"


#perform the comparison
row.names(answer) <- NULL
row.names(final_data) <- NULL
identical(answer, final_data)

#identical(answer, final_data)
#[1] TRUE

On my machine, this is roughly a 700x speed improvement.
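If the entries in the latest-version list are complete paths that appear verbatim in the path column (as they do in the simulated data, though this is an assumption about the real data), a base R exact-match filter with %in% may be a simpler alternative to the join. The sketch below uses a small hypothetical data frame rather than the 50000-row example. Note that, unlike grep, %in% does not match substrings or regular expressions, so it only applies when whole-string equality is what is wanted.

```r
# Small hypothetical stand-in for test_df and test_latest_ver
test_df <- data.frame(a = 1:6, b = 6:1,
                      path = paste0("dir/file_", 1:6, ".pdf"))
test_latest_ver <- test_df$path[c(2, 5)]

# Keep only rows whose path is exactly one of the latest paths.
# %in% uses hashing-based matching instead of scanning one large regex.
filtered <- test_df[test_df$path %in% test_latest_ver, ]
```

The rows keep their original order and row names, so resetting row.names to NULL, as in the comparison above, may be needed before comparing against another result with identical().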




