Asynchronous unpacking and synchronous processing

Question

I am working on a project that processes a very large amount of data. I have a lot (thousands) of zip files, each containing ONE simple txt file with thousands of lines (about 80k lines). What I am currently doing is the following:

for (File zipFile : dir.listFiles()) {
    ZipFile zf = new ZipFile(zipFile);
    ZipEntry ze = (ZipEntry) zf.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
    ...

This way I can read the file line by line, but it is definitely too slow. Given the large number of files and lines that need to be read, I need to read them in a more efficient way.

I have looked for a different approach, but I haven't been able to find anything. I think I should use the Java NIO APIs, which are intended for intensive I/O operations, but I don't know how to use them with zip files.

Any help would be really appreciated.

Thanks,

Marco

Answer 1

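"I have a lot (thousands) of zip files. The zip files are about 30MB each, while the txt file inside each zip file is about 60/70MB. Reading and processing the files with this code takes many hours, around 15, but it depends."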

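Let's do some back-of-the-envelope calculations.

Let's say you have 5000 files. If it takes 15 hours to process them, that works out to ~10 seconds per file. The files are about 30MB each, so the throughput is ~3MB/s.

This is between one and two orders of magnitude slower than the rate at which the data can be decompressed.

Either there is a problem with the disks (are they local, or on a network share?), or the actual processing is what takes most of the time.

The best way to find out for sure is to use a profiler.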
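Short of a full profiler run, one rough way to check the estimate above is to time decompression alone, with no per-line processing, and compare the result with the ~3MB/s figure. The sketch below is only an illustration: the class name UnzipTimer and the method drain are hypothetical, and the 64 KB buffer size is an arbitrary choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: times pure decompression, without any line processing. */
final class UnzipTimer {

    /** Reads every entry to the end and returns the number of uncompressed bytes. */
    static long drain(String zipPath) throws IOException {
        long total = 0;
        byte[] buf = new byte[64 * 1024]; // buffer size is an arbitrary choice
        try (ZipFile zf = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                try (InputStream in = zf.getInputStream(entries.nextElement())) {
                    int n;
                    while ((n = in.read(buf)) >= 0) {
                        total += n; // discard the data, only count it
                    }
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java UnzipTimer <archive.zip>");
            return;
        }
        long start = System.nanoTime();
        long bytes = drain(args[0]);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.2f s (%.1f MB/s)%n",
                bytes, seconds, bytes / 1e6 / seconds);
    }
}
```

If this reports a rate far above 3MB/s on one of the real archives, decompression and disk access are probably not the bottleneck and the per-line processing deserves the attention.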

Answer 2

The right way to iterate over a zip file:

final ZipFile file = new ZipFile( FILE_NAME );
try
{
    final Enumeration<? extends ZipEntry> entries = file.entries();
    while ( entries.hasMoreElements() )
    {
        final ZipEntry entry = entries.nextElement();
        System.out.println( entry.getName() );
        //use entry input stream:
        readInputStream( file.getInputStream( entry ) );
    }
}
finally
{
    file.close();
}

private static int readInputStream( final InputStream is ) throws IOException {
    final byte[] buf = new byte[ 8192 ];
    int read = 0;
    int cntRead;
    while ( ( cntRead = is.read( buf, 0, buf.length ) ) >=0  )
    {
        read += cntRead;
    }
    return read;
}

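Zip files consist of several entries, each of which has a field containing the number of bytes in that entry. So it is easy to iterate over all zip file entries without actually decompressing the data. java.util.zip.ZipFile accepts a file or file name and uses random access to jump between file positions. java.util.zip.ZipInputStream, on the other hand, works with streams, so it cannot jump freely. That is why it has to read and decompress all of the zip data in order to reach EOF for each entry and read the next entry header.

What does this mean? If you already have a zip file in your file system, use ZipFile to process it regardless of your task. As a bonus, you can access the zip entries either sequentially or randomly (with a rather small performance penalty). On the other hand, if you are processing a stream, you will need to process all entries sequentially using ZipInputStream.

Here is an example. A zip archive (total file size = 1.6GB) containing three 0.6GB entries was iterated in 0.05 seconds using ZipFile and in 18 seconds using ZipInputStream.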
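To make the comparison concrete, here is a minimal sketch that lists the entry names of the same archive both ways and times each pass. The class name ZipListing and its method names are hypothetical, and the timings it prints will depend on the archive and the machine, so it is not a reproduction of the measurement above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

/** Hypothetical helper comparing the two ways of walking a zip archive. */
final class ZipListing {

    /** Random access: entry metadata is enumerated without reading entry data. */
    static List<String> namesViaZipFile(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    /** Streaming: getNextEntry() must consume each entry's data to reach the next header. */
    static List<String> namesViaStream(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (InputStream raw = Files.newInputStream(zip);
             ZipInputStream zin = new ZipInputStream(raw)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipListing <archive.zip>");
            return;
        }
        Path zip = Paths.get(args[0]);
        long t0 = System.nanoTime();
        int viaFile = namesViaZipFile(zip).size();
        long t1 = System.nanoTime();
        int viaStream = namesViaStream(zip).size();
        long t2 = System.nanoTime();
        System.out.printf("ZipFile: %d entries in %d ms%n", viaFile, (t1 - t0) / 1_000_000);
        System.out.printf("ZipInputStream: %d entries in %d ms%n", viaStream, (t2 - t1) / 1_000_000);
    }
}
```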

Answer 3

You can use the new file API like this:

Path jarPath = Paths.get(...);
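// The cast is needed on Java 13+, where a bare null is ambiguous between the ClassLoader and Map overloads
try (FileSystem jarFS = FileSystems.newFileSystem(jarPath, (ClassLoader) null)) {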
    Path someFileInJarPath = jarFS.getPath("/...");
    try (ReadableByteChannel rbc = Files.newByteChannel(someFileInJarPath, EnumSet.of(StandardOpenOption.READ))) {
        // read file
    }
}

The code is meant for handling jar files, but I think it should work for zip files as well.
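For the line-by-line use case in the question, the same idea can be sketched with a BufferedReader opened on a path inside the zip file system. This is only an illustration: the class name ZipFsLines and the method countLines are hypothetical, UTF-8 is an assumed encoding, and counting lines stands in for whatever per-line work is really needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: reads one text entry of a zip through the NIO zip file system. */
final class ZipFsLines {

    /** Counts the lines of the entry at entryName (for example "/data.txt"). */
    static long countLines(Path zip, String entryName) throws IOException {
        // The cast selects the (Path, ClassLoader) overload explicitly.
        try (FileSystem zipFs = FileSystems.newFileSystem(zip, (ClassLoader) null)) {
            Path entry = zipFs.getPath(entryName);
            // UTF-8 is an assumption about the text files' encoding.
            try (BufferedReader reader = Files.newBufferedReader(entry, StandardCharsets.UTF_8)) {
                long lines = 0;
                while (reader.readLine() != null) {
                    lines++; // replace with the real per-line processing
                }
                return lines;
            }
        }
    }
}
```

A call such as ZipFsLines.countLines(Paths.get("archive.zip"), "/data.txt") would then return the number of lines in that entry; both names in the call are placeholders.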

Answer 4

You can try this code:

// Needs: import java.io.*; import java.util.Enumeration; import java.util.zip.*;
try (ZipFile zf = new ZipFile("C:/Documents and Settings/satheesh/Desktop/POTL.Zip"))
{
    final Enumeration<? extends ZipEntry> entries = zf.entries();

    while (entries.hasMoreElements())
    {
        final ZipEntry zipEntry = entries.nextElement();
        if (zipEntry.isDirectory())
        {
            continue;
        }
        // f2 was never declared in the original snippet; writing each entry to
        // "<entry file name>.out" in the working directory is an assumption.
        final File f2 = new File(new File(zipEntry.getName()).getName() + ".out");

        // try-with-resources closes both streams even if an exception is thrown
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(zf.getInputStream(zipEntry), "UTF-8"));
             BufferedWriter wr = new BufferedWriter(new FileWriter(f2)))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                // no per-line println or flush: both slow the loop down considerably
                wr.write(line);
                wr.newLine();
            }
        }
    }
    // printed only when every entry was written without an exception
    System.out.println("The files have been extracted successfully");
}
catch (IOException e)
{
    System.out.print(e);
}

This code works well.

Answer 5

Intel has made an improved version of zlib, which Java uses internally to perform zip/unzip. It requires you to patch the zlib sources with Intel's IPP patches. I made a benchmark showing 1.4x to 3x gains in throughput.

Answer 6

Asynchronous unpacking and synchronous processing

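Using the advice provided, which is much like the answer from Wasim Wani and the one from Satheesh Kumar (iterating over the ZIP entries to get an InputStream for each of them and do something with it), I built my own solution.

In my case the processing is the bottleneck, so I launch the extraction massively in parallel at the start, while looping on entries.hasMoreElements(), and place each result in a ConcurrentLinkedQueue that I consume from the processing thread. My ZIP contains a set of XML files representing serialized Java objects, so my "extraction" includes deserializing the objects, and those deserialized objects are the ones placed in the queue.

For me, this has a few advantages over my previous approach of getting each file from the ZIP sequentially and processing it: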

the more compelling one: 10% reduction in total time
the release of the file occurs earlier
the whole amount of RAM is allocated quicker, so if there is not enough RAM it will fail faster (in a matter of tens of minutes instead of over one hour); please note that the amount of memory I keep allocated after processing is quite similar to that occupied by the unzipped files, otherwise, it would be better to unzip and discard sequentially to keep the memory footprint lower
unzipping and deserializing seem to have a high CPU usage, so the sooner they finish, the sooner the CPU is free for the processing, which is what really matters

There is one downside: flow control is a little more complex once parallelism is involved.
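The overall shape of this approach can be sketched as follows. This is not the answer's actual code. As assumptions of the sketch: an ExecutorCompletionService stands in for the hand-managed ConcurrentLinkedQueue so that the consumer can block and extraction failures surface as exceptions, each task opens its own ZipFile handle rather than sharing one, counting newline bytes stands in for the real deserialization and processing, and readAllBytes requires Java 9 or later. The names ParallelUnzip, extract and countLines are hypothetical.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical sketch: extract entries in parallel, process them on one thread. */
final class ParallelUnzip {

    /** Decompresses one entry fully into memory, using its own ZipFile handle. */
    static byte[] extract(Path zip, String entryName) throws Exception {
        try (ZipFile zf = new ZipFile(zip.toFile());
             InputStream in = zf.getInputStream(zf.getEntry(entryName))) {
            return in.readAllBytes();
        }
    }

    /** Extracts on a pool of the given size and counts newline bytes on the caller's thread. */
    static long countLines(Path zip, int threads) throws Exception {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            for (ZipEntry e : Collections.list(zf.entries())) {
                if (!e.isDirectory()) {
                    names.add(e.getName());
                }
            }
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<byte[]> extracted = new ExecutorCompletionService<>(pool);
            for (String name : names) {
                extracted.submit(() -> extract(zip, name)); // asynchronous unpacking
            }
            long lines = 0;
            for (int i = 0; i < names.size(); i++) {
                // take() blocks until some extraction finishes; get() rethrows its failure
                byte[] data = extracted.take().get();
                for (byte b : data) {       // synchronous "processing"
                    if (b == '\n') {
                        lines++;
                    }
                }
            }
            return lines;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because every entry is submitted up front, all decompressed data may sit in memory at once, which matches the memory trade-off described in the list above.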
