Read from one large file and write to many (tens, hundreds, or thousands) files in Java?

I have a large-ish file (4-5 GB compressed) of small messages that I wish to parse into approximately 6,000 files by message type. Messages are small; anywhere from 5 to 50 bytes depending on the type.

Each message starts with a fixed-size type field (a 6-byte key). If I read a message of type 000001, I want to append its payload to 000001.dat, and so on. The input file contains a mixture of messages; I want N homogeneous output files, where each output file contains only the messages of a given type.

What's a fast, efficient way of writing these messages to so many individual files? I'd like to use as much memory and processing power as needed to get it done as fast as possible. I can write compressed or uncompressed files to disk.

I'm thinking of using a HashMap with a message type key and an OutputStream value, but I'm sure there's a better way to do it.

Thanks!

Best answer

A Unix-like system will typically have a limit on the number of file handles open at any given time; on my Linux box, for example, it's currently 1024, though I could change it within reason. But there are good reasons for these limits, as open files are a burden to the system.

You haven't yet responded to my question on whether there are multiple occurrences of the same key in your input, meaning that several separate batches of data may need to be concatenated into each file. If this isn't the case, Pace's answer would be handily the best you can do, as all that work needs to be done anyway and there's no sense in setting up a huge administration around such a simple sequence of events.

But if there are multiple messages in your input for the same key, it would be efficient to keep a large number of files open. I'd advise against trying to keep all 6000 open at once, though. Instead, I'd go for something like 500, opened on a first-come-first-served basis; i.e. you open up files for the first 500 (or so) distinct message keys and then chew through your entire input file looking for stuff to add into those 500, then close them all upon hitting EOF on input. You will also need to keep a HashSet of keys already processed, because you then proceed to re-read your input file again, processing the next batch of 500 keys you didn't catch on the first round.

Rationale: Opening and closing a file is (usually) a costly operation; you do NOT want to open and close thousands of files more than once each if you can help it. So you keep as many handles open as possible, all of which end up filled on a single pass through your input. On the other hand, streaming sequentially through a single input file is quite efficient, and even if you have to make 12 passes through your input file, the time to do so will be almost negligible compared to the time needed to open/close 6000 other files.

Pseudocode:

processedSet = [ ]
keysWaiting = true
MAXFILE = 500
handlesMap = [ ]
while (keysWaiting) {
  keysWaiting = false
  open/rewind input file
  while (not EOF(input file)) {
    read message
    if (handlesMap.containsKey(messageKey)) {
      write data to handlesMap.get(messageKey)
    } else if (processedSet.contains(messageKey)) {
      // already handled in an earlier pass; skip
    } else if (handlesMap.size < MAXFILE) {
      handlesMap.put(messageKey, new FileOutputStream(messageKey + ".dat"))
      processedSet.add(messageKey)
      write data to handlesMap.get(messageKey)
    } else {
      keysWaiting = true   // no free handle; leave for a later pass
    }
  }
  for each handle in handlesMap.values() {
    close handle
  }
  handlesMap.clear()
}
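Rendered as real Java, the pseudocode above might look like the following sketch. It assumes newline-terminated text records whose first 6 characters are the type key (the question's messages are binary, so the record-reading code would need to match the actual format); `maxOpen` plays the role of MAXFILE:

```java
import java.io.*;
import java.util.*;

public class MultiPassSplitter {
    // Multi-pass split: keep at most maxOpen output files per pass, and
    // re-read the input for keys that didn't get a handle this time round.
    public static void split(File input, File outDir, int maxOpen) throws IOException {
        Set<String> processed = new HashSet<String>(); // keys finished in earlier passes
        boolean keysWaiting = true;
        while (keysWaiting) {
            keysWaiting = false;
            Map<String, OutputStream> handles = new HashMap<String, OutputStream>();
            BufferedReader in = new BufferedReader(new FileReader(input));
            String line;
            while ((line = in.readLine()) != null) {
                String key = line.substring(0, 6);   // fixed-size type field
                OutputStream out = handles.get(key);
                if (out == null) {
                    if (processed.contains(key)) continue;   // done in an earlier pass
                    if (handles.size() >= maxOpen) { keysWaiting = true; continue; }
                    out = new BufferedOutputStream(
                            new FileOutputStream(new File(outDir, key + ".dat"), true)); // append
                    handles.put(key, out);
                    processed.add(key);
                }
                out.write(line.getBytes());
                out.write('\n');
            }
            in.close();
            for (OutputStream out : handles.values()) out.close();
        }
    }
}
```

With maxOpen at 500, a 6,000-key input needs 12 sequential passes, which is cheap next to thousands of extra open/close cycles.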
Other answers

You might not need a hash map. You could just...

  1. Read a message
  2. Open the new file in append mode
  3. Write the message to the new file
  4. Close the new file

Not sure if this would be faster though, because you'd be doing a lot of opens and closes.

I'd recommend some kind of intelligent pooling: keep the largest/most frequently used files open to improve performance and close the rest to conserve resources.

If the main file is made up mostly of record types 1-5, keep those files open as long as they're needed. The others can be opened and closed as required so that you don't starve the system of resources.
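One way to sketch that kind of pooling is an access-ordered `LinkedHashMap` used as an LRU cache of output streams, closing the least recently used handle when the pool is full. The class name, pool size, and `.dat` naming are illustrative:

```java
import java.io.*;
import java.util.*;

public class StreamPool {
    private final File dir;
    private final LinkedHashMap<String, OutputStream> pool;

    public StreamPool(File dir, final int maxOpen) {
        this.dir = dir;
        // access-order LinkedHashMap: iteration order is least- to most-recently used
        this.pool = new LinkedHashMap<String, OutputStream>(maxOpen, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, OutputStream> eldest) {
                if (size() > maxOpen) {
                    try { eldest.getValue().close(); } catch (IOException ignored) {}
                    return true;   // evict (and close) the coldest stream
                }
                return false;
            }
        };
    }

    public void write(String key, byte[] data) throws IOException {
        OutputStream out = pool.get(key);
        if (out == null) {
            // reopen in append mode so earlier evictions aren't overwritten
            out = new BufferedOutputStream(
                    new FileOutputStream(new File(dir, key + ".dat"), true));
            pool.put(key, out);
        }
        out.write(data);
    }

    public void closeAll() throws IOException {
        for (OutputStream out : pool.values()) out.close();
        pool.clear();
    }
}
```

Frequently used types then stay resident automatically, while rare types pay the open/close cost.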

I'm going to make some assumptions about your question:

  • Each message starts with the message type, as a fixed-size field
  • You have a heterogeneous input file, containing a mixture of messages; you want N homogeneous output files, where each output file contains only the messages of a given type.

The approach that jumps to mind is functor based: you create a mapping of message types to objects that handle that particular message. Your main() is a dispatch loop that reads the fixed message header, finds the appropriate functor from the map, then calls it.

You probably won't be able to hold 6,000 files (one per message type) open at once; most operating systems have a limit of around 1,024 simultaneous open files (although with Linux you can change the kernel parameters that control this). So this implies that you'll be opening and closing files repeatedly.

Probably the best approach is to give every functor a fixed-size buffer, so that it opens, writes, and closes after, say, 10 messages. If your messages are at most 50 bytes, that's 500 bytes (10 x 50) per functor, or about 3 MB across 6,000 types, held in memory at any given time.

I'd probably write my functors to hold fixed-size byte arrays, and create a generic functor class that reads N bytes at a time into that array:

public class MessageProcessor
{
    int _msgSize;                   // the number of bytes to read per message
    byte[] _buf = new byte[1024];   // bigger than I said, but it's only ~6 MB total
    int _curSize;                   // when this approaches _buf.length, write

    // ... a process() method would append each message's bytes to _buf and,
    // when _curSize nears _buf.length, open the type's file, write, and close ...
}
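The dispatch loop for this functor approach might look like the sketch below. The 6-byte key comes from the question; the `MessageHandler` interface and the one-byte length prefix are assumptions for illustration, and would need to match the real wire format:

```java
import java.io.*;
import java.util.*;

// One functor per message type; the dispatcher looks it up and calls it.
interface MessageHandler {
    void handle(byte[] payload) throws IOException;
}

public class Dispatcher {
    public static void dispatch(InputStream in, Map<String, MessageHandler> handlers)
            throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        byte[] key = new byte[6];
        while (true) {
            try {
                data.readFully(key);           // fixed-size type field
            } catch (EOFException eof) {
                break;                         // clean end of input
            }
            int len = data.readUnsignedByte(); // assumed: 1-byte payload length prefix
            byte[] payload = new byte[len];
            data.readFully(payload);
            MessageHandler h = handlers.get(new String(key, "US-ASCII"));
            if (h != null) h.handle(payload);  // unknown types are skipped
        }
    }
}
```

Each handler can then own its buffer and file, as described above.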

There are usually limits on the number of open files in a system, and in any case accessing thousands of little files in a more or less random order is going to bog your system down very badly.

Consider breaking the large file up into a file (or some sort of in-memory table, if you've got the memory) of individual messages, and sorting that by message type. Once that is done, write the messages out to their appropriate files.
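If the decompressed messages (or a chunk of them) fit in memory, the sort-then-write idea can be sketched as below, with one open/write/close per run of equal keys; the line-based message representation here is an assumption:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class SortSplitter {
    // Sort messages by their 6-character type key, then write each run of
    // equal keys with a single open/write/close of that type's file.
    public static void splitSorted(List<String> messages, File outDir) throws IOException {
        messages.sort(Comparator.comparing(m -> m.substring(0, 6)));
        int i = 0;
        while (i < messages.size()) {
            String key = messages.get(i).substring(0, 6);
            StringBuilder run = new StringBuilder();
            while (i < messages.size() && messages.get(i).startsWith(key)) {
                run.append(messages.get(i)).append('\n');
                i++;
            }
            Files.write(new File(outDir, key + ".dat").toPath(),
                        run.toString().getBytes(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

For data that doesn't fit in memory, the same idea works with an external merge sort over spill files.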

Since you're doing many small writes to many files, you want to minimize the number of writes, especially given that the simplest design would pretty much guarantee that each new write would involve a new file open/close.

Instead, why not map each key to a buffer? At the end, write each buffer to disk. Or, if you're concerned that you'll be holding too much memory, you could structure your buffers to flush every 1K, or 5K, or however many lines. e.g.

public class HashLogger {

    private HashMap<String, MessageBuffer> logs = new HashMap<String, MessageBuffer>();

    public void write(String messageKey, String message)
    {
        if (!logs.containsKey(messageKey)) { logs.put(messageKey, new MessageBuffer(messageKey)); }
        logs.get(messageKey).write(message);
    }

    public void flush()
    {
        // ...flush all the buffers when you're done...
        for (MessageBuffer buffer : logs.values())
        {
            buffer.flush();
        }
    }

    private class MessageBuffer {
        private MessageBuffer(String name) { ... }
        void flush() { ... something here to write to a file specified by name ... }
        void write(String message) {
            // ... something here to add to an internal buffer, or StringBuilder, or whatever...
            // ... you could also flush here if the internal buffer gets larger than N lines ...
        }
    }
}
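A self-contained variant of the same buffer-per-key idea might look like this, flushing a buffer to its file (in append mode) only when it crosses a size threshold; the class name and threshold are illustrative:

```java
import java.io.*;

// Buffer for one message type: bytes accumulate in memory, and the output
// file is only opened, written, and closed when the buffer reaches flushAt
// bytes or flush() is called explicitly.
public class BufferedTypeWriter {
    private final File file;
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private final int flushAt;

    public BufferedTypeWriter(File file, int flushAt) {
        this.file = file;
        this.flushAt = flushAt;
    }

    public void write(byte[] message) throws IOException {
        buf.write(message);
        if (buf.size() >= flushAt) flush();
    }

    public void flush() throws IOException {
        if (buf.size() == 0) return;
        OutputStream out = new FileOutputStream(file, true); // append mode
        try {
            buf.writeTo(out);
        } finally {
            out.close();
        }
        buf.reset();
    }
}
```

With 6,000 such buffers at a few KB each, memory stays in the tens of MB while each file sees far fewer, larger writes.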

You could even create separate Log4j loggers, which can be configured to use buffered logging; I'd be surprised if more modern logging frameworks like slf4j didn't support this as well.
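For instance, with Log4j 1.x a FileAppender can be switched to buffered I/O in the properties file; the appender name and file path here are placeholders:

```properties
log4j.appender.MSG=org.apache.log4j.FileAppender
log4j.appender.MSG.File=messages/000001.dat
log4j.appender.MSG.Append=true
log4j.appender.MSG.BufferedIO=true
log4j.appender.MSG.BufferSize=8192
log4j.appender.MSG.layout=org.apache.log4j.PatternLayout
log4j.appender.MSG.layout.ConversionPattern=%m%n
```

You would still need one appender per message type, so a logging framework mainly buys you the buffering and file management for free rather than solving the handle-count problem.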




