What are the most efficient idioms for streaming data from disk with constant space usage?

Problem Description

I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and that the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much RAM will be used.

In summary:

  • Needs to be constant space
  • Fast as possible
  • Assume very large files
  • Result fits in memory

Question

What are the fastest ways to read/stream data from a file using constant space?

Ideas I've had

If the file were small enough to fit in memory, then mmap on POSIX systems would be very fast; unfortunately that's not the case here. Is there any performance advantage to using mmap with a small window to map successive chunks of the file? Would the system call overhead of moving the mmap window down the file dominate any advantage? Or should I use a fixed buffer that I read into with fread?
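
For concreteness, here is a minimal sketch of the sliding mmap window idea on a POSIX system: map a fixed-size, page-aligned window, process it, unmap it, and advance. Error handling is abbreviated, and a simple byte sum stands in for the md5 update.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define WINDOW (1 << 20)   /* 1 MiB window; a multiple of the page size */

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        uint64_t sum = 0;                        /* small, fixed-size result */
        for (off_t off = 0; off < st.st_size; off += WINDOW) {
            size_t len = (st.st_size - off < WINDOW) ? (size_t)(st.st_size - off)
                                                     : WINDOW;
            unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }
            posix_madvise(p, len, POSIX_MADV_SEQUENTIAL);  /* hint: sequential scan */

            for (size_t i = 0; i < len; i++)
                sum += p[i];                     /* stand-in for the real digest update */

            munmap(p, len);                      /* only one window mapped at a time */
        }
        close(fd);
        printf("checksum: %llu\n", (unsigned long long)sum);
        return 0;
    }

Whether this beats a plain fread() loop is exactly the open question: the per-window mmap/munmap calls and the page faults they trigger are the overhead being asked about.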

Answers

I wouldn't be so sure that mmap would be very fast (where "very fast" is defined as significantly faster than fread).

Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.

You can compare the performance on your system by passing the --mmap option to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!

In short, I'd use fread with a fixed-size buffer. It's simpler to code, easier to handle errors, and will almost certainly be fast enough.

Depending on the language you are using, a C-like fread() loop reading into a buffer of a given size will use exactly that much memory, no more, no less.

We typically choose a buffer size of 4 to 128 KiB; there is little gain, if any, with bigger buffers.
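
As an illustration, a minimal sketch of such a fixed-buffer fread() loop follows. The additive checksum is just a stand-in for the real md5 update, and memory use is bounded by the 64 KiB buffer regardless of file size.

    #include <stdint.h>
    #include <stdio.h>

    #define BUFSIZE (64 * 1024)        /* 64 KiB, inside the 4-128 KiB range above */

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

        FILE *fp = fopen(argv[1], "rb");
        if (!fp) { perror("fopen"); return 1; }

        static unsigned char buf[BUFSIZE];
        uint64_t sum = 0;              /* small, fixed-size result */
        size_t n;

        while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];         /* replace with the real digest update */

        if (ferror(fp)) { perror("fread"); fclose(fp); return 1; }

        fclose(fp);
        printf("checksum: %llu\n", (unsigned long long)sum);
        return 0;
    }

Note that stdio keeps its own buffer underneath this one; setvbuf() or a raw read() loop are the usual knobs if that extra layer matters to you.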

If performance were extremely important, for relatively little gain (and at the risk of re-inventing something), one could consider a two-thread implementation, whereby one thread reads the file into a pair of buffers while the other thread performs the calculations sequentially, one buffer at a time. This way the disk access delays can be hidden behind the computation.
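
For illustration only, a sketch of that two-thread, double-buffer scheme using POSIX threads; error handling is abbreviated, the byte sum again stands in for the real computation, and it compiles with -pthread.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NBUF 2
    #define BUFSIZE (128 * 1024)

    struct slot {
        unsigned char data[BUFSIZE];
        size_t len;                    /* bytes valid in data; 0 marks end of stream */
        int full;                      /* 1 = ready for the consumer */
        pthread_mutex_t mu;
        pthread_cond_t cv;
    };

    static struct slot slots[NBUF];
    static uint64_t sum;               /* small, fixed-size result */

    static void *reader(void *arg)     /* fills the buffers from the file */
    {
        FILE *fp = arg;
        for (int i = 0; ; i = (i + 1) % NBUF) {
            struct slot *s = &slots[i];
            pthread_mutex_lock(&s->mu);
            while (s->full) pthread_cond_wait(&s->cv, &s->mu);
            size_t n = s->len = fread(s->data, 1, BUFSIZE, fp);
            s->full = 1;
            pthread_cond_signal(&s->cv);
            pthread_mutex_unlock(&s->mu);
            if (n == 0) break;         /* EOF (or error): hand over an empty slot */
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        FILE *fp = fopen(argv[1], "rb");
        if (!fp) { perror("fopen"); return 1; }

        for (int i = 0; i < NBUF; i++) {
            pthread_mutex_init(&slots[i].mu, NULL);
            pthread_cond_init(&slots[i].cv, NULL);
        }

        pthread_t tid;
        pthread_create(&tid, NULL, reader, fp);

        for (int i = 0; ; i = (i + 1) % NBUF) {   /* consumer: crunch each buffer */
            struct slot *s = &slots[i];
            pthread_mutex_lock(&s->mu);
            while (!s->full) pthread_cond_wait(&s->cv, &s->mu);
            size_t n = s->len;
            for (size_t j = 0; j < n; j++)
                sum += s->data[j];     /* stand-in for the real computation */
            s->full = 0;
            pthread_cond_signal(&s->cv);
            pthread_mutex_unlock(&s->mu);
            if (n == 0) break;         /* empty slot = end of stream */
        }

        pthread_join(tid, NULL);
        fclose(fp);
        printf("checksum: %llu\n", (unsigned long long)sum);
        return 0;
    }

With two fixed buffers and one small accumulator, space stays constant, and the reader keeps the disk busy whenever the computation is the slower side.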

mjv is right. You can use double buffers and overlapped I/O so that your crunching and the disk reading happen at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you end up running the I/O at top speed without pause; at that point things like file fragmentation come into the picture.




