English 中文(简体)
将“豁免”性质项目从R卷中删除?
原标题:Removing an "empty" character item from a corpus of documents in R?

我正在使用<代码>tm和lda在R中编订一套新闻文章范本。 然而,我正在遇到一个“非果园”问题,作为<条码>“<>>”,即阐述我的题目。 我的工作流程如下:

text <- Corpus(VectorSource(d$text))
newtext <- lapply(text, tolower)
sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
newtext <- lapply(newtext, function(x) removePunctuation(x))
newtext <- lapply(newtext, function(x) removeWords(x, sw))
newtext <- lapply(newtext, function(x) removeNumbers(x))
newtext <- lapply(newtext, function(x) stripWhitespace(x))
d$processed <- unlist(newtext)
corpus <- lexicalize(d$processed)
k <- 40
result <-lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab, 500, .02, .05,
compute.log.likelihood = TRUE, trace = 2L)

不幸的是,当我训练Lda模型时,除了最经常出现的字眼外,一切都看着“”。 我试图通过将其从下文所述原封节中删除来加以纠正,并重新更新上述模式:

newtext <- lapply(newtext, function(x) removeWords(x, ""))

但是,情况仍然如此。

str_split(newtext[[1]], " ")

[[1]]
 [1] ""              "body"          "mohamed"       "hassan"       
 [5] "cook"          "found"         "turkish"       "search"       
 [9] "rescue"        "teams"         "rescued"       "hospital"     
[13] "rescue"        "teams"         "continued"     "search"       
[17] "missing"       "body"          "cook"          "crew"         
[21] "wereegyptians" "sudanese"      "syrians"       "hassan"       
[25] "cook"          "cargo"         "ship"          "sea"          
[29] "bright"        "crashed"       "thursday"      "port"         
[33] "antalya"       "southern"      "turkey"        "vessel"       
[37] "collided"      "rocks"         "port"          "thursday"     
[41] "night"         "result"        "heavy"         "winds"        
[45] "waves"         "crew"          ""             

就如何消除这一问题提出任何建议? 在我的中词清单中添加<条码>也无帮助。

最佳回答

我讨论的是案文,但并非一帆风顺,因此,这是摆脱“你”的两种办法。 额外“”性质很可能是因为判决之间有双重空间。 在你把案文变成一个言辞之前或之后,你可以处理这一状况。 你们可以把所有“x2”改为“x1”,然后才能这样做(你必须在插图之后除名)。

x <- "I like to ride my bicycle.  Do you like to ride too?"

#TREAT BEFORE(OPTION):
a <- gsub(" +", " ", x)
strsplit(a,  " ")

#TREAT AFTER OPTION:
y <- unlist(strsplit(x, " "))
y[!y%in%""]

也可以尝试:

newtext <- lapply(newtext, function(x) gsub(" +", " ", x))

Again I don t use tm so this may not be of help but this post hadn t seen any action so I figured I d share possibilities.

问题回答

如果你已经建立了这套书状,将文件长度作为过滤器,将其附在meta()上,然后制作新的文件。

dtm <- DocumentTermMatrix(corpus)

## terms per document
doc.length = rowSums(as.matrix(dtm))

## add length as description term
meta(corpus.clean.noTL,tag="Length") <- doc.length

## create new corpus
corpus.noEmptyDocs <- tm_filter(corpus, FUN = sFilter, "Length > 0")

## remove Length as meta tag
meta(corpus.clean.noTL,tag="Length") <- NULL

采用上述方法,你可以计算有效地劫持tm现有的矩阵操纵支持,但只有5行代码。





相关问题
How to plot fitted model over observed time series

This is a really really simple question to which I seem to be entirely unable to get a solution. I would like to do a scatter plot of an observed time series in R, and over this I want to plot the ...

REvolution for R

since the latest Ubuntu release (karmic koala), I noticed that the internal R package advertises on start-up the REvolution package. It seems to be a library collection for high-performance matrix ...

R - capturing elements of R output into text files

I am trying to run an analysis by invoking R through the command line as follows: R --no-save < SampleProgram.R > SampleProgram.opt For example, consider the simple R program below: mydata =...

R statistical package: wrapping GOFrame objects

I m trying to generate GOFrame objects to generate a gene ontology mapping in R for unsupported organisms (see http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/...

Changing the order of dodged bars in ggplot2 barplot

I have a dataframe df.all and I m plotting it in a bar plot with ggplot2 using the code below. I d like to make it so that the order of the dodged bars is flipped. That is, so that the bars labeled "...

Strange error when using sparse matrices and glmnet

I m getting a weird error when training a glmnet regression. invalid class "dgCMatrix" object: length(Dimnames[[2]]) must match Dim[2] It only happens occasionally, and perhaps only under larger ...

Generating non-duplicate combination pairs in R

Sorry for the non-descriptive title but I don t know whether there s a word for what I m trying to achieve. Let s assume that I have a list of names of different classes like c( 1 , 2 , 3 , 4 ) ...

Per panel smoothing in ggplot2

I m plotting a group of curves, using facet in ggplot2. I d like to have a smoother applied to plots where there are enough points to smooth, but not on plots with very few points. In particular I d ...

热门标签