Question

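I have a .csv file, example.csv, with 8,000 columns × 400,000 rows. The file has a string header for each column, and every field contains an integer value between 0 and 10. When I try to load this file with read.csv, it is extremely slow. It is still very slow when I add the parameter nrow=100. Is there a way to speed up read.csv, or some other function I can use instead of read.csv, to load the file into memory as a matrix or data.frame?

Thanks in advance.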

Answer 1

If your CSV contains only integers, you should use scan instead of read.csv, because ?read.csv says:



 ‘read.table’ is not the right tool for reading large matrices,
 especially those with many columns: it is designed to read _data
 frames_ which may have columns of very different classes.  Use
 ‘scan’ instead for matrices.

Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed or memory consumption is a concern, setting the colClasses argument is a huge help.
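As a rough illustration of the two suggestions above, the sketch below reads an all-integer CSV with scan and, separately, with read.csv plus colClasses. The temporary file, its size, and the column names col1..col4 are stand-ins invented for the example, not the asker's data.

```r
# Build a small all-integer CSV with a header row (stand-in for the real file).
path <- tempfile(fileext = ".csv")
m <- matrix(sample(0:10, 5 * 4, replace = TRUE), nrow = 5,
            dimnames = list(NULL, paste0("col", 1:4)))
write.csv(m, path, row.names = FALSE)

# Read the header line on its own, then the body as integers with scan().
header <- scan(path, what = character(), sep = ",", nlines = 1, quiet = TRUE)
values <- scan(path, what = integer(), sep = ",", skip = 1, quiet = TRUE)

# scan() returns one long vector in row order, so reshape it by row.
mat <- matrix(values, ncol = length(header), byrow = TRUE,
              dimnames = list(NULL, header))

# If a data frame is required, declaring colClasses avoids type guessing.
df <- read.csv(path, colClasses = "integer")
```

Reading the header separately is one way to keep the column names, since skip=1 discards them; whether that matters depends on how the matrix is used afterwards.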

Answer 2

Try using data.table::fread(). It is by far one of the fastest ways to read .csv files into R; see http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r

library(data.table)

data <- fread("c:/data.csv")

If you want to make it even faster, you can also read only the subset of columns you actually need:

data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))

Answer 3

Also try Hadley Wickham's readr package:

library(readr) 
data <- read_csv("file.csv")

Answer 4

If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.

Then you load it with the (surprise!) load function.

d <- as.data.frame(matrix(1:1e6,ncol=1000))
write.csv(d, "c:/foo.csv", row.names=FALSE)

# Load file with read.csv
system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec

# Load file using scan
system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=','), 
                         ncol=1000, byrow=TRUE) ) # 0.55 sec

# Load (binary) file using load
save(d, file="c:/foo.bin", compress=FALSE)
system.time( load("c:/foo.bin") ) # 0.09 sec
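One detail that is easy to miss with this approach: load does not return the data, it recreates the saved objects under the names they had when save was called. A minimal sketch of the round trip, using a small stand-in data frame and a temporary file chosen for the example:

```r
# Small stand-in data frame; the real one would come from the one-time CSV import.
d <- as.data.frame(matrix(1:20, ncol = 4))

# Save once in uncompressed binary form (quicker to load, larger on disk).
bin_path <- tempfile(fileext = ".RData")
save(d, file = bin_path, compress = FALSE)

# Simulate a later session: keep a copy for comparison, then drop `d`.
original <- d
rm(d)

# load() recreates `d` under its saved name and invisibly returns
# the names of the objects it restored.
restored_names <- load(bin_path)
```

After the call, d is available again in the calling environment, so code that follows can use it as if read.csv had just run.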

Answer 5

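The new vroom package is worth a try.

vroom is a new approach to reading delimited and fixed-width data into R.

It stems from the observation that, when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.

Therefore you can obtain very rapid input by first performing a fast indexing step and then using the ALTREP (ALTernative REPresentations) framework, available in R versions 3.5+, to access the values lazily.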

This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.

#install.packages("vroom", 
#                 dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)

df <- vroom("example.csv")

Benchmark: vroom vs data.table

Time: 2019-12-19 15:41:09
