Problem Description
I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much RAM will be used.
In summary:
- Needs to be constant space
- Fast as possible
- Assume very large files
- Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
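To make "constant space" concrete, here is a minimal sketch of the baseline I have in mind: one fixed-size buffer filled repeatedly with fread. The names stream_fread, fold_chunk, and BUF_SIZE are made up for illustration, the 64 KiB buffer size is an arbitrary starting point rather than a tuned value, and the additive checksum is only a stand-in for a real calculation such as MD5.

```c
#include <stdint.h>
#include <stdio.h>

/* Arbitrary buffer size (64 KiB); an assumption to tune, not a recommendation. */
enum { BUF_SIZE = 1 << 16 };

/* Hypothetical stand-in for the real calculation (e.g. an MD5 update):
   a simple additive checksum, so the result is one small integer. */
static uint64_t fold_chunk(uint64_t acc, const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc += buf[i];
    return acc;
}

/* Stream the file at `path` through a single fixed-size buffer.
   Memory use is bounded by BUF_SIZE regardless of the file size.
   Returns 0 on success and -1 on error. */
int stream_fread(const char *path, uint64_t *result)
{
    static unsigned char buf[BUF_SIZE];
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    uint64_t acc = 0;
    size_t n;
    /* fread returns the number of bytes read; 0 means EOF or error. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        acc = fold_chunk(acc, buf, n);

    int failed = ferror(fp);
    fclose(fp);
    if (failed)
        return -1;

    *result = acc;
    return 0;
}
```

Calling stream_fread("big.bin", &sum) touches only the one buffer plus whatever stdio allocates internally for the stream, so the footprint does not grow with the file.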
Ideas I've had
If the file were small enough to fit in memory, then mmap on POSIX systems would be very fast; unfortunately, that's not the case here. Is there any performance advantage to using mmap with a small window size to map successive chunks of the file? Would the system call overhead of moving the mmap window down the file dominate any advantages? Or should I use a fixed buffer that I read into with fread?
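For reference, this is roughly what I mean by moving an mmap window down the file. It is only a sketch under some assumptions: a POSIX system, a window size that is a multiple of the page size (mmap requires page-aligned file offsets), and an additive checksum standing in for the real calculation. The name stream_mmap_window is made up for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the file at `path` one window at a time and fold each window into
   a checksum. At most `window` bytes are mapped at any moment.
   Returns 0 on success and -1 on error. */
int stream_mmap_window(const char *path, size_t window, uint64_t *result)
{
    long page = sysconf(_SC_PAGESIZE);
    /* File offsets passed to mmap must be page-aligned, so require the
       window to be a nonzero multiple of the page size. */
    if (page <= 0 || window == 0 || window % (size_t)page != 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    uint64_t acc = 0;
    off_t off = 0;
    while (off < st.st_size) {
        off_t remaining = st.st_size - off;
        /* The last window may be shorter than the others. */
        size_t len = remaining < (off_t)window ? (size_t)remaining : window;

        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }

        /* Stand-in for the real calculation (e.g. an MD5 update). */
        for (size_t i = 0; i < len; i++)
            acc += p[i];

        /* One mmap/munmap pair per window: this is the system call
           overhead I am asking about. */
        munmap(p, len);
        off += (off_t)len;
    }

    close(fd);
    *result = acc;
    return 0;
}
```

A call might look like stream_mmap_window("big.bin", 256 * (size_t)sysconf(_SC_PAGESIZE), &sum). A larger window means fewer mmap/munmap calls but more address space mapped at once, which is exactly the trade-off I am unsure about.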