Best and most efficient way to count tokens / words
  • Date: 2010-12-10 20:58:12
  • Tags: r

I have a CSV with three columns: cumulative cost, cost, and keyword. My R script works fine on a small file, but it dies completely when I run it on the actual file (about a million rows). Can you help me make this script more efficient? The Token.Count computation is what causes the trouble. Thanks!

# Token Histogram

# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)

# Limits the dataframe according to the HTT segment
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# De-comment to ONLY measure TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# De-comment to ONLY measure TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# De-comment to ONLY measure Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)

# Creates a column with HTT segmentation labels
# Creates a dataframe
HTT <- data.frame()
# Populates dataframe according to conditions
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)

# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]

# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize

# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun = function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"
Best Answer

Pre-allocate your storage and fill it in. Never grow an object inside a loop the way you are doing with rbind(). R has to copy the object and allocate more storage at every iteration, and that overhead is what is killing your code.

Create Token.Count with enough rows and columns up front, then fill it in. Something like:

Token.Count <- matrix(ncol = ?, nrow = nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
    Token.Count[i, ] <- Myfun(i)
}
Token.Count <- data.frame(Token.Count)

Sorry I can't be more specific, but I don't know how many columns Myfun returns.
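Assuming Myfun() returns a single count per row (so one column suffices), a self-contained sketch of the preallocate-then-fill pattern might look like the following; countRow() is a hypothetical stand-in for Myfun() that uses strsplit() instead of the tau package, purely to keep the example runnable:

```r
# Hypothetical sample data standing in for the real Mydf
Mydf <- data.frame(Keyword.text = c("north+face+outlet", "kinect sensor"),
                   stringsAsFactors = FALSE)

# Hypothetical stand-in for Myfun(): count tokens in row n via strsplit()
countRow <- function(n) {
  length(strsplit(Mydf$Keyword.text[n], "[[:space:][:punct:]]+")[[1]])
}

# Allocate the result vector ONCE, then fill it in place
Token.Count <- numeric(nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
  Token.Count[i] <- countRow(i)
}
Token.Count  # 3 2
```

Because the vector is allocated once, each iteration only writes into existing memory instead of copying the whole object, which is what rbind() in a loop forces R to do.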


Update 1: After looking at textcnt(), I think you can avoid the loop entirely. Say you have a data frame like this:

DF <- data.frame(CumuCost = c(0.00439, 0.0067), Cost = c(1678, 880),
                 Keyword.text = c("north+face+outlet", "kinect sensor"),
                 stringsAsFactors = FALSE)

If we extract the keywords and convert them to a list:

keywrds <- with(DF, as.list(Keyword.text))
head(keywrds)

then we can call textcnt() on it to count the words in each list component:

countKeys <- textcnt(keywrds, split = "[[:space:][:punct:]]+", method = "string",
                     n = 1L, recursive = TRUE)
head(countKeys)

The above is almost what you had, but I added recursive = TRUE so that each element of the input vector is handled separately. The last step is to sapply() over countKeys and compute the sum:

> sapply(countKeys, sum)
[1] 3 2

That seems to be what you were trying to achieve with the loop and your function. Have I got that right?


Update 2: OK, if having fixed the preallocation issue and used textcnt in a vectorized way still isn't quite as quick as you would like, we can investigate other ways of counting words. It could well be possible that you don't need all the functionality of textcnt to do what you want. [I can't check whether the solution below will work for all your data, but it is a lot quicker.]

One possible solution is to split the Keyword.text strings into word vectors with the strsplit() function. For example, using keywrds generated above and just its first element:

> length(unlist(strsplit(keywrds[[1]], split = "[[:space:][:punct:]]+")))
[1] 3

It is probably easier to wrap this idea in a small user function:

fooFun <- function(x) {
    length(unlist(strsplit(x, split = "[[:space:][:punct:]]+"),
                  use.names = FALSE, recursive = FALSE))
}

which we can apply to the keywrds list:

> sapply(keywrds, fooFun)
[1] 3 2

For this simple example data set, we get the same result. What about timing? First, the solution using textcnt(), combining the two steps from Update 1:

> system.time(replicate(10000, sapply(textcnt(keywrds, 
+                                     split = "[[:space:][:punct:]]+", 
+                                     method = "string", n = 1L, 
+                                     recursive = TRUE), sum)))
   user  system elapsed 
  4.165   0.026   4.285

Then the solution based on strsplit(), wrapped in fooFun():

> system.time(replicate(10000, sapply(keywrds, fooFun)))
   user  system elapsed 
  0.883   0.001   0.889

So even for this small sample there is considerable overhead in using textcnt(); whether that difference holds up when both approaches are applied to the full data set remains to be seen.

Finally, note that strsplit() can be used directly, in vectorized fashion, on the Keyword.text vector in DF:

> sapply(strsplit(DF$Keyword.text, split = "[[:space:][:punct:]]+"), length)
[1] 3 2

This gives the same result as the other two methods and is slightly faster than the non-vectorized use of strsplit():

> system.time(replicate(10000, sapply(strsplit(DF$Keyword.text, 
+                              split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  0.732   0.001   0.734

Is either of those fast enough on your full data set?

Minor Update: replicating DF to give 130 rows of data and timing the three approaches suggests that the last (vectorized strsplit()) scales better:

> DF2 <- rbind(DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF)
> dim(DF2)
[1] 130   3
> system.time(replicate(10000, sapply(textcnt(keywrds2, split = "[[:space:][:punct:]]+", method = "string", n = 1L, recursive = TRUE), sum)))
   user  system elapsed 
238.266   1.790 241.404
> system.time(replicate(10000, sapply(keywrds2, fooFun)))
   user  system elapsed 
 28.405   0.007  28.511
> system.time(replicate(10000, sapply(strsplit(DF2$Keyword.text,split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  7.497   0.011   7.528
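Given those timings, the vectorized strsplit() approach can replace the whole rbind() loop from the question with a single call. A sketch (the small Mydf below is hypothetical sample data standing in for the real file):

```r
# Hypothetical sample frame standing in for the real million-row Mydf
Mydf <- data.frame(Keyword.text = c("north+face+outlet", "kinect sensor",
                                    "running+shoes"),
                   stringsAsFactors = FALSE)

# One vectorized pass: split every keyword on spaces/punctuation,
# then count the resulting pieces per keyword
Mydf$Token.Count <- sapply(strsplit(Mydf$Keyword.text,
                                    split = "[[:space:][:punct:]]+"),
                           length)
Mydf$Token.Count  # 3 2 2
```

This does the whole column in one call, so no per-row loop, no rbind(), and no repeated copying of the growing result object.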
Other Answers

No other answers yet.



