Error in sample.int(m, k) : cannot take a sample larger than the population

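First off, let me say that I am fairly new to machine learning, kmeans, and R. This project is a means to learn more about them, and also a way to present this data to our CIO so that I can use it in developing a new help desk system.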

I have a 60K-line text file. It contains the titles of help desk tickets opened by teachers over a three-year period.

I would like to create an R program that takes these titles and creates a set of categories, for instance a group of terms related to printing issues, or a group of terms related to projector bulbs. I have used R to open the text document, clean up the data, and remove stop words and other words that I felt were not necessary. I've gotten a list of all the terms with a frequency >= 400 and saved those to a text file.

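But now I would like to apply kmeans clustering (if that is possible or appropriate) to the same data set and see if I can come up with categories.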

The code below also includes code that writes out the list of terms used >= 400 times. It is at the end, and is commented out.

library(tm) #load text mining library
library(SnowballC)
options(max.print=5.5E5) 
setwd("c:/temp/") # sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("c:/temp/"),readerControl=list(reader=readPlain))
summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
ae.corpus <- tm_map(ae.corpus, stemDocument, language = "english")  
myStopwords <- c(stopwords("english"), <a very long list of other words>)
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) 

ae.corpus <- tm_map(ae.corpus, PlainTextDocument)

ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 5))


dtm.weight <- weightTfIdf(ae.tdm)

m <- as.matrix(dtm.weight)
rownames(m) <- 1:nrow(m)

# Euclidean normalization: scale each row to unit length
norm_eucl <- function(m) {
  m/apply(m,1,function(x) sum(x^2)^.5)
}
m_norm <- norm_eucl(m)

results <- kmeans(m_norm,25)

#list clusters

clusters <- 1:25
for (i in clusters){
  # frequent terms among the documents assigned to cluster i
  cat("Cluster ", i, ":", findFreqTerms(dtm.weight[results$cluster == i, ], 400), "\n\n")
}


#inspect(ae.tdm)
#fft <- findFreqTerms(ae.tdm, lowfreq=400)

#write(fft, file = "dataTitles.txt",
#      ncolumns = 1,
#      append = FALSE, sep = " ")

#str(fft)

#inspect(fft)

When I run this, I get:

> results <- kmeans(m_norm,25)

Error in sample.int(m, k) : cannot take a sample larger than the population when replace = FALSE

I'm not really sure what this means, and I haven't found much information about it online. Any thoughts?

TIA

Answers

You are reading a single file with multiple lines, rather than multiple files in a directory. Instead of

ae.corpus<-Corpus(DirSource("c:/temp/"),readerControl=list(reader=readPlain))

you need to use

text <- readLines("c:/temp/your_file_name", n = -1)  # forward slashes: "\t" in an R string is a tab
ae.corpus<-Corpus(VectorSource(text),readerControl=list(reader=readPlain)) 

Then you will have 60K documents, rather than 1 document with 60K lines.
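To see the difference in document counts, here is a minimal sketch. It assumes the tm package is installed; the temporary directory, the file name titles.txt and the three sample titles are made up for illustration and stand in for your real file.

```r
library(tm)

# Hypothetical stand-in for the real ticket file: three titles, one per line.
tmp_dir <- tempfile("tickets_")
dir.create(tmp_dir)
ticket_file <- file.path(tmp_dir, "titles.txt")
writeLines(c("printer out of toner",
             "projector bulb burned out",
             "cannot log in"),
           ticket_file)

# DirSource treats each file in the directory as one document.
by_file <- VCorpus(DirSource(tmp_dir))

# VectorSource treats each element of the character vector as one document.
by_line <- VCorpus(VectorSource(readLines(ticket_file)))

n_by_file <- length(by_file)  # number of documents when reading the directory
n_by_line <- length(by_line)  # number of documents when reading line by line
print(c(by_file = n_by_file, by_line = n_by_line))
```

The document-term matrix gets one row per document, so the first corpus gives kmeans a single row to cluster, while the second gives it one row per ticket title.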

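I ran into the same problem, and eventually found that the target number of clusters was larger than the number of rows of data for some types. With the way you are supplying the data, the number of rows for each type may be smaller than the target number of clusters.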
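The error can be reproduced with base R alone. The sketch below uses random matrices as made-up stand-ins for the term matrix (the names one_doc and many_docs are only illustrative): kmeans picks its k starting centers from the rows without replacement, so it fails when there are fewer rows than clusters.

```r
set.seed(42)  # only to make the sketch reproducible

# A matrix with a single row, like a term matrix built from a one-document corpus.
one_doc <- matrix(runif(10), nrow = 1)

# Asking for 25 centers from 1 row fails inside sample.int().
err_msg <- tryCatch(
  {
    kmeans(one_doc, centers = 25)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)

# With many more rows than clusters, the same call goes through.
many_docs <- matrix(runif(2000), nrow = 500)  # 500 rows, 4 columns
k <- 25
stopifnot(k < nrow(unique(many_docs)))        # check the row count before clustering
fit <- kmeans(many_docs, centers = k)
print(length(fit$size))                       # number of clusters found
```

Checking nrow() of the matrix before calling kmeans is a quick way to confirm whether this is what is happening with your data.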




