Best and most efficient way to count tokens / words
  • Date: 2010-12-10 20:58:12
  • Tags: r

I have a CSV with three columns: cumulative cost, cost, and keyword. My R script works fine on a small file, but it dies completely when I run it on the actual file (about a million rows). Can you help me make this script more efficient? The Token.Count computation is what causes the trouble. Thanks!

# Token Histogram

# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)

# Limits the dataframe according to the HTT segment
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# De-comment to ONLY measure TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# De-comment to ONLY measure TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# De-comment to ONLY measure Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)

# Creates a column with HTT segmentation labels
# Creates a dataframe
HTT <- data.frame()
# Populates dataframe according to conditions
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)

# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]

# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize

# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun = function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"
Best Answer

Pre-allocate your storage and fill it in. Never grow an object inside a loop the way you are doing with rbind(). R has to copy the object and allocate more storage at every iteration, and that overhead is what is killing your code.

Create Token.Count with enough rows and columns up front, then fill it in. Something like:

Token.Count <- matrix(ncol = ?, nrow = nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
    Token.Count[i, ] <- Myfun(i)
}
Token.Count <- data.frame(Token.Count)

Sorry I can't be more specific, but I don't know how many columns Myfun returns.
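Assuming Myfun() returns a single count per row (so one column suffices), a self-contained sketch of the preallocate-then-fill pattern might look like the following; countRow() is a hypothetical stand-in for Myfun() that uses strsplit() instead of the tau package, purely to keep the example runnable:

```r
# Hypothetical sample data standing in for the real Mydf
Mydf <- data.frame(Keyword.text = c("north+face+outlet", "kinect sensor"),
                   stringsAsFactors = FALSE)

# Hypothetical stand-in for Myfun(): count tokens in row n via strsplit()
countRow <- function(n) {
  length(strsplit(Mydf$Keyword.text[n], "[[:space:][:punct:]]+")[[1]])
}

# Allocate the result vector ONCE, then fill it in place
Token.Count <- numeric(nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
  Token.Count[i] <- countRow(i)
}
Token.Count  # 3 2
```

Because the vector is allocated once, each iteration only writes into existing memory instead of copying the whole object, which is what rbind() in a loop forces R to do.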


Update 1: After looking at textcnt(), I think you can avoid the loop entirely. Say you have a data frame like this:

DF <- data.frame(CumuCost = c(0.00439, 0.0067), Cost = c(1678, 880),
                 Keyword.text = c("north+face+outlet", "kinect sensor"),
                 stringsAsFactors = FALSE)

If we extract the keywords and convert them to a list:

keywrds <- with(DF, as.list(Keyword.text))
head(keywrds)

then we can call textcnt() on it to count the words in each list component:

countKeys <- textcnt(keywrds, split = "[[:space:][:punct:]]+", method = "string",
                     n = 1L, recursive = TRUE)
head(countKeys)

The above is almost what you had, but I added recursive = TRUE so that each element of the input vector is handled separately. The last step is to sapply() over countKeys and compute the sum:

> sapply(countKeys, sum)
[1] 3 2

That seems to be what you were trying to achieve with the loop and your function. Have I got that right?


Update 2: OK, if having fixed the preallocation issue and used textcnt in a vectorized way still isn't quite as quick as you would like, we can investigate other ways of counting words. It could well be possible that you don't need all the functionality of textcnt to do what you want. [I can't check whether the solution below will work for all your data, but it is a lot quicker.]

One possible solution is to split the Keyword.text strings into word vectors with the strsplit() function. For example, using keywrds generated above and just its first element:

> length(unlist(strsplit(keywrds[[1]], split = "[[:space:][:punct:]]+")))
[1] 3

It is probably easier to wrap this idea in a small user function:

fooFun <- function(x) {
    length(unlist(strsplit(x, split = "[[:space:][:punct:]]+"),
                  use.names = FALSE, recursive = FALSE))
}

which we can apply to the keywrds list:

> sapply(keywrds, fooFun)
[1] 3 2

For this simple example data set, we get the same result. What about timing? First, the solution using textcnt(), combining the two steps from Update 1:

> system.time(replicate(10000, sapply(textcnt(keywrds, 
+                                     split = "[[:space:][:punct:]]+", 
+                                     method = "string", n = 1L, 
+                                     recursive = TRUE), sum)))
   user  system elapsed 
  4.165   0.026   4.285

Then the solution based on strsplit(), wrapped in fooFun():

> system.time(replicate(10000, sapply(keywrds, fooFun)))
   user  system elapsed 
  0.883   0.001   0.889

So even for this small sample there is considerable overhead in using textcnt(); whether that difference holds up when both approaches are applied to the full data set remains to be seen.

Finally, note that strsplit() can be used directly, in vectorized fashion, on the Keyword.text vector in DF:

> sapply(strsplit(DF$Keyword.text, split = "[[:space:][:punct:]]+"), length)
[1] 3 2

This gives the same result as the other two methods and is slightly faster than the non-vectorized use of strsplit():

> system.time(replicate(10000, sapply(strsplit(DF$Keyword.text, 
+                              split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  0.732   0.001   0.734

Is either of those fast enough on your full data set?

Minor Update: replicating DF to give 130 rows of data and timing the three approaches suggests that the last (vectorized strsplit()) scales better:

> DF2 <- rbind(DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF)
> dim(DF2)
[1] 130   3
> system.time(replicate(10000, sapply(textcnt(keywrds2, split = "[[:space:][:punct:]]+", method = "string", n = 1L, recursive = TRUE), sum)))
   user  system elapsed 
238.266   1.790 241.404
> system.time(replicate(10000, sapply(keywrds2, fooFun)))
   user  system elapsed 
 28.405   0.007  28.511
> system.time(replicate(10000, sapply(strsplit(DF2$Keyword.text,split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  7.497   0.011   7.528
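Given those timings, the vectorized strsplit() approach can replace the whole rbind() loop from the question with a single call. A sketch (the small Mydf below is hypothetical sample data standing in for the real file):

```r
# Hypothetical sample frame standing in for the real million-row Mydf
Mydf <- data.frame(Keyword.text = c("north+face+outlet", "kinect sensor",
                                    "running+shoes"),
                   stringsAsFactors = FALSE)

# One vectorized pass: split every keyword on spaces/punctuation,
# then count the resulting pieces per keyword
Mydf$Token.Count <- sapply(strsplit(Mydf$Keyword.text,
                                    split = "[[:space:][:punct:]]+"),
                           length)
Mydf$Token.Count  # 3 2 2
```

This does the whole column in one call, so no per-row loop, no rbind(), and no repeated copying of the growing result object.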
Other Answers

No other answers yet.



