Dealing with a very large number of files

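I am currently working on a research project that involves indexing a large number of files (240k); they are mostly html, xml, doc, xls, zip, rar, pdf, and text files, ranging in size from a few KB to more than 100 MB.

With all the zip and rar files extracted, I end up with a total of one million files.

I am using Visual Studio 2010, C# and .NET 4.0, with support for TPL Dataflow and the Async CTP V3. To extract the text from these files I use Apache Tika (converted with ikvm), and I use Lucene.net 2.9.4 as the indexer. I would like to use the new TPL Dataflow library and asynchronous programming.

I have a few questions: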

  1. Would I get performance benefits if I use the TPL? It is mainly an I/O process and, from what I understand, the TPL doesn't offer much benefit when the work is heavily I/O-bound.

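  2. Would a producer/consumer approach be the best way to deal with this type of file processing, or are there other models that are better? I was thinking of creating one producer with multiple consumers using blocking collections.

  3. Would the TPL Dataflow library be of any use for this type of process? It seems TPL Dataflow is best suited to some sort of messaging system.

  4. Should I use asynchronous programming or stick to synchronous in this case?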
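
The design described in question 2 (one producer, several consumers, a blocking collection in between) can be sketched roughly as follows. This is only an illustration under assumptions: `ProducerConsumerSketch`, `Run`, the capacity of 100, and the `processFile` callback (a stand-in for "extract text and index it") are hypothetical names and values, not part of the question. Error handling is left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    // Runs one producer and `consumerCount` consumers over a bounded queue.
    // `processFile` is a hypothetical stand-in for "extract text and index it".
    // Returns the number of paths that were processed.
    public static int Run(IEnumerable<string> paths, int consumerCount, Action<string> processFile)
    {
        int processed = 0;

        // The bounded capacity makes the producer block when consumers fall
        // behind, so the full list of paths is never queued in memory at once.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
        {
            var producer = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (var path in paths)
                        queue.Add(path);
                }
                finally
                {
                    // Tells the consumers that no more items are coming.
                    queue.CompleteAdding();
                }
            }, TaskCreationOptions.LongRunning);

            var consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    // Blocks until an item arrives; the loop ends once the
                    // queue has been marked complete and fully drained.
                    foreach (var path in queue.GetConsumingEnumerable())
                    {
                        processFile(path);
                        Interlocked.Increment(ref processed);
                    }
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            Task.WaitAll(consumers);
            producer.Wait();
        }

        return processed;
    }
}
```

Each consumer occupies a thread for its whole lifetime here, which is the main trade-off against an asynchronous approach.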

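Answers

async/await definitely helps when dealing with external resources (typically web requests, the file system, or db operations). The interesting problem here is that you need to fulfill multiple requirements at the same time: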

  • consume as little CPU as possible (this is where async/await will help)
  • perform multiple operations at the same time, in parallel
  • control the number of tasks that are started (!) - if you do not take this into account, you will likely run out of threads when dealing with many files.
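
One common way to meet all three requirements at once is to gate asynchronous operations with a `SemaphoreSlim`. The sketch below is only an illustration, not the approach of any particular library: `ThrottledProcessing`, `RunAsync`, and the `processAsync` callback (a stand-in for "extract text and index") are hypothetical names, and it assumes .NET 4.5 or later, where `SemaphoreSlim.WaitAsync` and built-in async/await are available.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledProcessing
{
    // Processes every path asynchronously, with at most `maxParallel`
    // operations in flight at any moment.
    // `processAsync` is a hypothetical stand-in for "extract text and index".
    public static async Task RunAsync(
        IEnumerable<string> paths,
        int maxParallel,
        Func<string, Task> processAsync)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var running = new List<Task>();
            foreach (var path in paths)
            {
                // Waits without blocking a thread until a slot is free,
                // so the number of started operations stays bounded.
                await gate.WaitAsync();
                running.Add(ProcessOneAsync(path, gate, processAsync));
            }

            // Surfaces any failure once all operations have finished.
            await Task.WhenAll(running);
        }
    }

    private static async Task ProcessOneAsync(
        string path, SemaphoreSlim gate, Func<string, Task> processAsync)
    {
        try
        {
            await processAsync(path);
        }
        finally
        {
            // Frees the slot even if processing this file failed.
            gate.Release();
        }
    }
}
```

For a million files the `running` list keeps one `Task` reference per file; a production version would probably prune completed tasks or process the paths in batches.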

You can take a look at a small project I published:

Parallel tree walker

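It can enumerate any number of files in a directory structure efficiently. You can define the async operation to perform on every file (in your case, indexing it) while still controlling the maximum number of files that are processed at the same time.

For example: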

await TreeWalker.WalkAsync(root, new TreeWalkerOptions
{
    MaxDegreeOfParallelism = 10,
    ProcessElementAsync = async (element) =>
    {
        var el = element as FileSystemElement;
        var path = el.Path;
        var isDirectory = el.IsDirectory;

        await DoStuffAsync(el);
    }
});

(If you cannot use the tool directly, you may still find some useful examples in the source code.)

You could use Everything Search. The SDK is open source and has a C# example. It's the fastest way to index files on Windows that I've seen.

From the FAQ:

1.2.2 How long will it take to index my files?

"Everything" only uses file and folder names and generally takes a few seconds to build its database. A fresh install of Windows XP SP2 (about 20,000 files) will take about 1 second to index. 1,000,000 files will take about 1 minute.

I'm not sure whether you can use the TPL with it, though.




