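First off, I will say that I am quite new to machine learning, k-means, and R. This project is a means to learn more about them, and also to present this data to our CIO so that I can use it in developing a new help desk system.
I have a 60K-row text file. The file contains the titles of help desk tickets entered by teachers over a three-year period.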
I would like to create an R program that takes these titles and creates a set of categories: for instance, a group of terms related to printing issues, or a group of terms related to projector bulbs. I have used R to open the text document, clean up the data, and remove stop words and other words that I felt were not necessary. I've gotten a list of all the terms with a frequency >= 400 and saved those to a text file.
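To illustrate what I mean by a list of frequent terms, here is a minimal base-R sketch. The four titles are made up (not my real data), and the threshold is lowered from 400 to 2 so that something shows up on such a tiny sample; my actual code uses the tm package instead.

```r
# Made-up example titles (hypothetical, not from the real ticket file)
titles <- c("printer not printing",
            "projector bulb out",
            "printer jam",
            "projector bulb dim")

# Split each title into lowercase words and drop empty strings
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nzchar(words)]

# Count how often each word occurs across all titles
freq <- table(words)

# Keep only the terms that meet the (illustrative) threshold
min_count <- 2
frequent <- names(freq)[freq >= min_count]
print(frequent)
```

With the real data the threshold would be 400, and the surviving terms are what I would like to group into categories.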
But now I would like to apply k-means (if possible or appropriate) to this same data set and see if I can come up with categories.
The code below includes code that will write out the list of terms used >= 400 times. It is at the end, and is commented out.
library(tm) #load text mining library
library(SnowballC)
options(max.print=5.5E5)
setwd("c:/temp/") #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("c:/temp/"),readerControl=list(reader=readPlain))
summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
ae.corpus <- tm_map(ae.corpus, stemDocument, language = "english")
myStopwords <- c(stopwords("english"), <a very long list of other words>)
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords)
ae.corpus <- tm_map(ae.corpus, PlainTextDocument)
ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 5))
dtm.weight <- weightTfIdf(ae.tdm)
m <- as.matrix(dtm.weight)
rownames(m) <- 1:nrow(m)
#euclidian
norm_eucl <- function(m) {
m/apply(m,1,function(x) sum(x^2)^.5)
}
m_norm <- norm_eucl(m)
results <- kmeans(m_norm,25)
#list clusters
clusters <- 1:25
for (i in clusters){
cat("Cluster ",i,":",findFreqTerms(dtm.weight[results$cluster==i, ],400),"\n")
}
#inspect(ae.tdm)
#fft <- findFreqTerms(ae.tdm, lowfreq=400)
#write(fft, file = "dataTitles.txt",
# ncolumns = 1,
# append = FALSE, sep = " ")
#str(fft)
#inspect(fft)
When I run this, I get:
> results <- kmeans(m_norm,25)
Error in sample.int(m, k) : cannot take a sample larger than the population when replace = FALSE
I'm not really sure what this means, and I haven't found much information about it online. Any thoughts?
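For what it's worth, here is a small sketch that seems to trigger the same kind of error on a toy matrix. It assumes the problem is related to the matrix having fewer rows than the number of clusters requested, which I have not confirmed for my real data. The matrix `toy` and the helper `enough_rows` are made up for illustration.

```r
# Toy matrix (made-up data): only 3 rows, but 25 clusters are requested below
set.seed(1)
toy <- matrix(rnorm(12), nrow = 3)

# Capture the error message instead of stopping the script
msg <- tryCatch({
  kmeans(toy, centers = 25)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Hypothetical helper: TRUE if x has at least k distinct rows to start from
enough_rows <- function(x, k) {
  nrow(unique(x)) >= k
}

print(enough_rows(toy, 25))  # 3 distinct rows are not enough for 25 clusters
print(enough_rows(toy, 2))   # but they are enough for 2 clusters
```

If that assumption holds, running `nrow(m_norm)` on my own matrix should show whether it has at least 25 rows. As far as I understand, `DirSource` creates one document per file in the directory, so a single 60K-row file might end up as one row rather than 60K rows.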
TIA