Python - Search for items in hundreds of large, gzipped files

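Unfortunately, I'm working with an extremely large corpus that is spread across hundreds of .gz files: 24 GB worth (compressed), in fact. Python is really my native language (hah), but I was wondering whether I've run up against a problem that will require learning a "faster" language?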

Each .gz file contains a single plain-text document and is about 56 MB gzipped, or about 210 MB uncompressed.

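Each line holds one n-gram (bigram, trigram, 4-gram, etc.) with a frequency count to its right. I basically need to create a file that stores, for each 4-gram, the frequencies of its substrings alongside the frequency of the whole string (i.e., 4 unigram frequencies, 3 bigram frequencies, and 2 trigram frequencies, which together with the 4-gram's own count make 10 data points). Each type of n-gram has its own directory (e.g., all bigrams appear in their own set of 33 .gz files).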
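To make the ten data points concrete, here is a minimal sketch that lists the nine shorter n-grams contained in a 4-gram. The function name `sub_ngrams` and the assumption that tokens are separated by single spaces are illustrative, not taken from the corpus description.

```python
def sub_ngrams(four_gram):
    """Return the 9 shorter n-grams inside a 4-gram, in a fixed order:
    4 unigrams, then 3 bigrams, then 2 trigrams.

    Assumes tokens are separated by whitespace (an assumption about the data).
    """
    tokens = four_gram.split()
    if len(tokens) != 4:
        raise ValueError(f"expected 4 tokens, got {len(tokens)}: {four_gram!r}")
    result = []
    for n in (1, 2, 3):
        # Slide a window of width n across the four tokens.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


print(sub_ngrams("a b c d"))
# ['a', 'b', 'c', 'd', 'a b', 'b c', 'c d', 'a b c', 'b c d']
```

Adding the 4-gram's own count to these nine lookups gives the ten values per output line.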

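I know a simple brute-force solution, and which module to import to work with gzipped files in Python, but I was wondering whether there is an approach that wouldn't take weeks of CPU time? Any advice on speeding this process up, however slight, would be much appreciated!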
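For reference, the standard-library `gzip` module can stream a compressed file line by line, so the 210 MB of text never has to be decompressed to disk first. The sketch below assumes each line is an n-gram followed by whitespace and an integer count, and that the files are UTF-8; the helper names `parse_line` and `iter_ngram_counts` are hypothetical.

```python
import gzip


def parse_line(line):
    """Split 'w1 w2 ... wn <count>' into (ngram, count).

    Assumes the count is the last whitespace-separated field on the line.
    """
    ngram, count = line.strip().rsplit(None, 1)
    return ngram, int(count)


def iter_ngram_counts(path):
    """Yield (ngram, count) pairs from a gzipped text file, one line at a time."""
    # "rt" opens the file in text mode, so each iteration decodes one line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_line(line)
```

Because this is a generator, memory use stays flat regardless of file size; whether to keep the pairs in a dict is then up to the caller.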

Best answer

It would help to have an example of a few lines and the expected output. But from what I understand, here are some ideas.

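You certainly don't want to process all the files every time you process a single file or, worse, a single 4-gram. Ideally, you'd go through each file once. So my first suggestion is to maintain an intermediate list of frequencies (these sets of 10 data points) that at first takes only one file into account. Then, when you process the second file, you update the frequencies for the items you encounter (and presumably add new items). You keep going like this, increasing frequencies as you find more matching n-grams. At the end, write everything out.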

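More specifically, at each iteration I would read one new input file into memory as a map from string to number, where the string is, say, a space-separated n-gram and the number is its frequency. I would then process the intermediate file from the previous iteration, which contains your expected output (with incomplete values), e.g. "a b c d: 10 20 30 40 5 4 3 2 1 1" (I'm guessing somewhat at the output you're looking for here). For each line, I'd look up all of its sub-grams in the map, update the counts, and write the updated line to a new output file. That file is then used as the intermediate file in the next iteration, until all input files have been processed.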
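One iteration of this scheme might look like the sketch below. It is only an illustration under several assumptions: the intermediate file is plain text in the guessed "a b c d: 10 20 30 40 5 4 3 2 1 1" layout, the ten slots are ordered as 4 unigrams, 3 bigrams, 2 trigrams, then the 4-gram's own count, unfilled slots start at 0, and input lines end with a whitespace-separated integer count. All function names are hypothetical.

```python
import gzip


def load_counts(path):
    """Read one gzipped 'ngram <count>' file into a dict (the in-memory map)."""
    counts = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                ngram, count = line.rsplit(None, 1)
                counts[ngram] = int(count)
    return counts


def sub_ngrams(four_gram):
    """4 unigrams, 3 bigrams, 2 trigrams, in slot order."""
    tokens = four_gram.split()
    return [" ".join(tokens[i:i + n])
            for n in (1, 2, 3)
            for i in range(len(tokens) - n + 1)]


def update_line(line, counts):
    """Add the counts found in `counts` to the matching slots of one line."""
    # Split at the last colon, since tokens themselves might contain colons.
    four_gram, _, values = line.rstrip("\n").rpartition(":")
    four_gram = four_gram.strip()
    fields = values.split()
    for slot, gram in enumerate(sub_ngrams(four_gram)):
        if gram in counts:
            fields[slot] = str(int(fields[slot]) + counts[gram])
    return f"{four_gram}: {' '.join(fields)}\n"


def update_pass(intermediate_in, intermediate_out, input_gz):
    """Stream the previous intermediate file and write the updated one."""
    counts = load_counts(input_gz)
    with open(intermediate_in, encoding="utf-8") as src, \
            open(intermediate_out, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(update_line(line, counts))
```

Only one input file's map is held in memory at a time, while the intermediate file is streamed from disk. Note that a Python dict built from a 210 MB text file can occupy considerably more memory than the file itself, and that running the same input file through `update_pass` twice would double-count it, since the slots are incremented rather than overwritten.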
