read.csv is extremely slow in reading csv files with large numbers of columns
  • Date: 2011-09-07 01:03:41
  • Tags:
  • r
  • csv

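I have .csv files: for example, example.csv with 8,000 columns × 400,000 rows. The csv file has a string header for each column. All fields contain integer values between 0 and 10. When I try to load this file with read.csv, it turns out to be extremely slow. It is also very slow when I add the parameter nrows=100. I wonder if there is a way to speed up read.csv, or to use some other function instead of read.csv, to load the file into memory as a matrix or data.frame?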

Thanks in advance.

Answers

If your CSV contains only integers, you should use scan instead of read.csv, since ?read.csv says:

 ‘read.table’ is not the right tool for reading large matrices,
 especially those with many columns: it is designed to read _data
 frames_ which may have columns of very different classes.  Use
 ‘scan’ instead for matrices.

Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed or memory consumption is a concern, setting the colClasses argument is a huge help.
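As a rough illustration of the two suggestions above, the following sketch reads a small all-integer CSV with scan and, separately, with read.csv plus colClasses. The temporary file and the names path, col_names, m and df are made up for this example; reading the header line in a second scan call is an addition here, used only to keep the column names.

```r
# Small stand-in for the real file: integer cells 0..10 with a one-line header
path <- tempfile(fileext = ".csv")
d <- as.data.frame(matrix(sample(0:10, 50 * 20, replace = TRUE), ncol = 20))
write.csv(d, path, row.names = FALSE)

# Read the header line on its own so the column names are not lost
col_names <- scan(path, what = character(), sep = ",", nlines = 1, quiet = TRUE)

# Read the body as integers; byrow = TRUE because scan() returns the
# values in file order, i.e. one row after another
m <- matrix(scan(path, what = integer(), sep = ",", skip = 1, quiet = TRUE),
            ncol = length(col_names), byrow = TRUE,
            dimnames = list(NULL, col_names))

# If a data frame is required, declare the column types up front;
# a single class is recycled across all columns
df <- read.csv(path, colClasses = "integer")
```

On a file this small the timing difference is not visible; wrapping each call in system.time() on the real file is the way to check whether it helps.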

Try using data.table::fread(). It is by far one of the fastest ways to read .csv files into R. There is a good benchmark here: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r

library(data.table)

data <- fread("c:/data.csv")

If you want to make it even faster, you can also read only the subset of columns you want to use:

data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))

Also try Hadley Wickham's readr package:

library(readr) 
data <- read_csv("file.csv")

If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.

...You can then load it in with the (surprise!) load function.

d <- as.data.frame(matrix(1:1e6,ncol=1000))
write.csv(d, "c:/foo.csv", row.names=FALSE)

# Load file with read.csv
system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec

# Load file using scan
system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=","), 
                         ncol=1000, byrow=TRUE) ) # 0.55 sec

# Load (binary) file using load
save(d, file="c:/foo.bin", compress=FALSE)
system.time( load("c:/foo.bin") ) # 0.09 sec

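It is worth trying the new vroom package.

vroom is a new approach to reading delimited and fixed-width data into R.

It stems from the observation that, when parsing files, reading the data from disk is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly characters) takes the bulk of the time.

Therefore you can get very fast input by first performing a quick indexing step and then using the ALTREP (ALTernative REPresentations) framework, available in R versions 3.5+, to access the values lazily.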

This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.
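One way to avoid materializing everything at once is to ask only for the columns and rows actually needed. The sketch below assumes the vroom package is installed; the temporary file and the column names col1, col2 and col3 are invented for the example.

```r
library(vroom)

# Small illustrative file; a stand-in for the real, much larger CSV
path <- tempfile(fileext = ".csv")
write.csv(data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15),
          path, row.names = FALSE)

# Keep only two of the three columns
df <- vroom(path, delim = ",", col_select = c(col1, col3))

# Work with just the first rows rather than the whole table
first_rows <- head(df, 2)
```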

#install.packages("vroom", 
#                 dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)

df <- vroom("example.csv")

Benchmark: vroom vs data.table




