Question

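I'm currently working on a research project that involves indexing a large number of files (240k). They are mostly html, xml, doc, xls, zip, rar and pdf files, ranging in size from a few KB to more than 100 MB.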

With all the zip and rar files extracted, I end up with a total of about one million files.

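I am using Visual Studio 2010, C# and .NET 4.0, with support for TPL Dataflow and Async CTP V3. To extract the text from these files I use Apache Tika (converted with ikvm), and I use Lucene.net 2.9.4 as the indexer. I would like to use the new TPL Dataflow library and asynchronous programming.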

I have a few questions:

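Would I get performance benefits if I use TPL? It is mainly an I/O-bound process, and from what I understand, TPL doesn't offer much benefit when the work is heavily I/O-bound.
Is a producer/consumer approach the best way to deal with this type of file processing, or are there other models that are better? I was thinking of creating one producer with multiple consumers using BlockingCollection.
Would the TPL Dataflow library be of any use for this type of process? It seems TPL Dataflow is best suited to some sort of messaging system.
Should I use asynchronous programming in this case, or stick to synchronous?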

Answer 1

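async/await definitely helps when dealing with external resources, typically web requests, the file system or database operations. The interesting problem here is that you need to fulfil multiple requirements at the same time: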

consume as little CPU as possible (this is where async/await will help)
perform multiple operations at the same time, in parallel
control the number of tasks that are started (!) - if you do not take this into account, you will likely run out of threads when dealing with many files.
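
To make the three requirements above concrete, here is a minimal sketch of one common way to combine them: a SemaphoreSlim acts as a gate so that only a fixed number of async operations are in flight at once. The names ThrottledProcessor, ForEachAsync and RunOneAsync are hypothetical and are not taken from any library mentioned in this thread. The sketch also assumes .NET 4.5 or later (for SemaphoreSlim.WaitAsync and Task.WhenAll), not the .NET 4.0 + Async CTP setup in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ThrottledProcessor
{
    // Runs processAsync over every item while never allowing more than
    // maxConcurrency operations to be in flight at the same time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items,
        int maxConcurrency,
        Func<T, Task> processAsync)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var running = new List<Task>();
            foreach (var item in items)
            {
                // Wait asynchronously for a free slot; no thread is blocked.
                await gate.WaitAsync();
                running.Add(RunOneAsync(item, processAsync, gate));
            }

            // Wait for the operations that are still in flight.
            await Task.WhenAll(running);
        }
    }

    private static async Task RunOneAsync<T>(
        T item, Func<T, Task> processAsync, SemaphoreSlim gate)
    {
        try
        {
            await processAsync(item);
        }
        finally
        {
            // Free the slot even if processing this item failed.
            gate.Release();
        }
    }
}
```

You would call it with something like ForEachAsync(paths, 10, IndexFileAsync), where IndexFileAsync stands in for your own extract-and-index step. Note that this simple version keeps one Task object per item in a list, so for a million files you may want to process in batches or prune completed tasks.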

You may take a look at a small project I published:

Parallel tree walker

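It can efficiently enumerate any number of files in a directory structure. You can define the async operation to perform on every file (in your case, indexing it) while still controlling the maximum number of files that are processed at the same time.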

For example:

await TreeWalker.WalkAsync(root, new TreeWalkerOptions
{
    MaxDegreeOfParallelism = 10,
    ProcessElementAsync = async (element) =>
    {
        var el = element as FileSystemElement;
        var path = el.Path;
        var isDirectory = el.IsDirectory;

        await DoStuffAsync(el);
    }
});

(If you cannot use the tool directly, you may still find some useful examples in its source code.)

Answer 2

You could use Everything Search. The SDK is open source and has a C# example. It's the fastest way to index files on Windows that I've seen.

From the FAQ:

1.2.2. How long will it take to index my files?

"Everything" only uses file and folder names and generally takes a few seconds to build its database. A fresh install of Windows XP SP2 (about 20,000 files) will take about 1 second to index. 1,000,000 files will take about 1 minute.

I'm not sure whether you can use TPL with it, though.
