Question

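First off, I should say that I am fairly new to machine learning, k-means, and R. This project is a way to learn more about them and also to present this data to our CIO so that I can use it in developing a new help desk system.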

I have a 60K-line text file. It contains the titles of help desk tickets opened by teachers over a three-year period.

I would like to create an R program that takes these titles and creates a set of categories: for instance, terms related to printing issues, or a group of terms related to projector bulbs. I have used R to open the text document, clean up the data, and remove stop words and other words that I felt were not necessary. I've gotten a list of all the terms with a frequency >= 400 and saved those to a text file.

But now I would like to apply k-means clustering (if possible or appropriate) to this same data set and see if I can come up with categories.

The code below includes code that will write out the list of terms used >= 400 times. It is at the end and is commented out.

library(tm) #load text mining library
library(SnowballC)
options(max.print=5.5E5) 
setwd("c:/temp/") #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("c:/temp/"),readerControl=list(reader=readPlain))
summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
ae.corpus <- tm_map(ae.corpus, stemDocument, language = "english")  
myStopwords <- c(stopwords("english"), <a very long list of other words>)
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) 

ae.corpus <- tm_map(ae.corpus, PlainTextDocument)

ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 5))


dtm.weight <- weightTfIdf(ae.tdm)

m <- as.matrix(dtm.weight)
rownames(m) <- 1:nrow(m)

#Euclidean normalization: scale each row to unit length
norm_eucl <- function(m) {
  m/apply(m,1,function(x) sum(x^2)^.5)
}
m_norm <- norm_eucl(m)

results <- kmeans(m_norm,25)

#list clusters

clusters <- 1:25
for (i in clusters){
  cat("Cluster ",i,":",findFreqTerms(dtm.weight[results$cluster==i, ],400),"\n\n")
}


#inspect(ae.tdm)
#fft <- findFreqTerms(ae.tdm, lowfreq=400)

#write(fft, file = "dataTitles.txt",
#      ncolumns = 1,
#      append = FALSE, sep = " ")

#str(fft)

#inspect(fft)

When I run this, I get:

> results <- kmeans(m_norm,25)

Error in sample.int(m, k) : cannot take a sample larger than the population when replace = FALSE

I'm not really sure what this means, and I haven't found much information about it online. Any ideas?

TIA

Answer 1

You are reading a single file with multiple lines, rather than multiple files in a directory. Instead of

ae.corpus<-Corpus(DirSource("c:/temp/"),readerControl=list(reader=readPlain))

you need to use

text <- readLines("c:/temp/your_file_name", n = -1)
ae.corpus<-Corpus(VectorSource(text),readerControl=list(reader=readPlain))

Then you will have 60K documents instead of 1 document with 60K lines.
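
The error itself can be reproduced without tm. As a minimal sketch, assuming made-up numeric matrices in place of the real document-term matrix (the names one_doc and many_docs are hypothetical), asking kmeans for more centers than there are rows fails, while the same kind of call on a matrix with many rows goes through:

```r
set.seed(42)  # make the random example data reproducible

# Hypothetical stand-in for a corpus read with DirSource: one row (one document).
one_doc <- matrix(runif(10), nrow = 1)

# kmeans has to pick 25 starting centers from a single row, so it stops with an error.
err <- tryCatch(kmeans(one_doc, 25), error = function(e) conditionMessage(e))
print(err)

# Hypothetical stand-in for a corpus read line by line: 100 rows (100 documents).
many_docs <- matrix(runif(1000), nrow = 100)
fit <- kmeans(many_docs, 5)
print(table(fit$cluster))  # number of documents assigned to each cluster
```

On the real data, checking nrow(m) before calling kmeans should show right away whether the corpus was read as one document or as many.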

Answer 2

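I ran into the same problem and eventually found that the target number of clusters was larger than the number of distinct rows in the data. The data, as you have supplied it, may contain fewer distinct rows than the target number of clusters.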
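
One way to check for this before clustering is to count the distinct rows. This is only a sketch on a made-up matrix (m here is a hypothetical stand-in for the real normalized matrix, and k = 25 is taken from the question):

```r
# Hypothetical 5-row matrix that contains only 3 distinct rows.
m <- rbind(c(1, 0), c(1, 0), c(0, 1), c(0, 1), c(1, 1))
k <- 25  # requested number of clusters, as in the question

n_distinct <- nrow(unique(m))  # unique() on a matrix drops duplicated rows
if (k > n_distinct) {
  message("Requested ", k, " clusters but only ", n_distinct, " distinct rows")
}
```

If the message appears, either lower k or look at why so many rows are identical, for example because aggressive stop-word removal left many titles with the same few terms.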
