Word frequency tally script is too slow

Background

I wrote a script to calculate the frequency of words in a plain-text file. The script performs the following steps:

  1. Count the frequency of words from a corpus.
  2. Retain each word in the corpus found in a dictionary.
  3. Create a comma-separated file of the frequencies.

Script: http://pastebin.com/VAZdeKXs

#!/bin/bash

# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt | 
  tr '[:upper:]' '[:lower:]' | 
  sort | 
  uniq -c | 
  sort -rn > frequency.txt

echo Creating corpus lexicon...
rm -f corpus-lexicon.txt

for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i$ dictionary.txt >> corpus-lexicon.txt;
done

echo Creating lexicon...
rm -f lexicon.txt

for i in $(cat corpus-lexicon.txt); do
  egrep -m 1 "^[0-9 ]* $i$" frequency.txt | 
    awk '{print $2, $1}' | 
    tr ' ' ',' >> lexicon.txt;
done

Problem

The following lines scan the dictionary again and again, once for every word in the frequency list:

for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i$ dictionary.txt >> corpus-lexicon.txt;
done

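It works, but it is slow because it checks every word it found in order to drop those that are not in the dictionary, and it does so by scanning the dictionary once per word. (The -m 1 parameter stops the scan when the match is found.)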

Question

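How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of words will not be in the dictionary.

Thank you!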

Best answer

You can use grep -f to look up all of the words in a single pass over frequency.txt:

awk '{print $2}' frequency.txt | grep -Fxf dictionary.txt > corpus-lexicon.txt
  • -F to search for fixed strings.
  • -x to match whole lines only.
  • -f to read the search patterns from dictionary.txt

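In fact, you could even combine this with the second loop and eliminate the intermediate corpus-lexicon.txt file. The two for loops can be replaced by a single grep: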

grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'

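Notice that I changed -x to -w: each line of frequency.txt holds a count followed by the word, so a whole-line match would never succeed, while a whole-word match does.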
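The difference between the two flags can be modelled in a few lines of Python. This is only an illustrative sketch of the matching rules, not how grep is implemented. The helper names matches_x and matches_w and the sample data are made up for the example, and the whole-word test is simplified to whitespace splitting, which should be adequate for uniq -c output that contains only counts and letters.

```python
# Rough model of grep -F with -x versus -w (hypothetical helpers, sample data).
dictionary = {"the", "cat"}

def matches_x(line, patterns):
    # -x: the entire line must equal one of the patterns.
    return line in patterns

def matches_w(line, patterns):
    # -w (simplified): some whitespace-separated token equals a pattern.
    return any(token in patterns for token in line.split())

# Lines shaped like the output of `uniq -c`: padded count, then the word.
frequency_lines = ["     12 the", "      3 cat", "      1 zzyzx"]

kept_x = [l for l in frequency_lines if matches_x(l, dictionary)]
kept_w = [l for l in frequency_lines if matches_w(l, dictionary)]

print(kept_x)  # no line equals a bare word, so nothing is kept
print(kept_w)  # the lines for "the" and "cat" are kept
```

With -x the count prefix prevents any line from matching, which is why the first command strips the counts with awk before piping into grep.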

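Other answers

This is typically one of those scripts you would write in Perl. But if, like me, you hate write-only programming languages, you can do it all in awk: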

awk '
    BEGIN {
        while ((getline < "dictionary.txt") > 0)
            dict[$1] = 1
    }
    ($2 && $2 in dict) { print $2 }
' < frequency.txt > corpus-lexicon.txt

This version does not need the rm -f corpus-lexicon.txt.

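Use a real programming language. All of the process start-ups and file scans are killing you. For instance, here is an example I just whipped up in Python (minimizing lines of code):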

import sys, re
# Read the whole corpus and pull out every run of word characters.
words = re.findall(r'(\w+)', open(sys.argv[1]).read())
counts = {}
for word in words:
  counts[word] = counts.setdefault(word, 0) + 1
# Write one "word,count" pair per line.
open(sys.argv[2], 'w').write("\n".join([w + ',' + str(c) for (w, c) in counts.items()]))

Testing this against a large text file I had sitting around (1.4MB, 80,000 words according to wc), it completes in under a second (18k unique words) on a 5-year-old PowerMac.
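The snippet above only covers step 1 of the question (counting). A hedged sketch of all three steps in one Python script might look like the following. The function name build_lexicon, the command-line usage, the one-word-per-line dictionary format, and the tokenisation (letters only, lower-cased, roughly what the sed and tr stages do) are assumptions made for illustration.

```python
import re
import sys
from collections import Counter

def build_lexicon(corpus_text, dictionary_words):
    """Return (word, count) pairs for corpus words found in the dictionary,
    most frequent first."""
    counts = Counter()
    for token in corpus_text.split():
        # Keep letters only and fold case, roughly like the sed and tr stages.
        word = re.sub(r"[^a-zA-Z]", "", token).lower()
        if word:
            counts[word] += 1
    # A set gives constant-time membership tests, so the dictionary is
    # read once instead of being rescanned for every word.
    known = set(dictionary_words)
    return [(w, c) for w, c in counts.most_common() if w in known]

if __name__ == "__main__" and len(sys.argv) == 4:
    # Assumed usage: python lexicon.py corpus.txt dictionary.txt lexicon.txt
    with open(sys.argv[1]) as f:
        corpus = f.read()
    with open(sys.argv[2]) as f:
        dictionary = [line.strip() for line in f]
    with open(sys.argv[3], "w") as out:
        for word, count in build_lexicon(corpus, dictionary):
            out.write("%s,%d\n" % (word, count))
```

Because the dictionary lives in a set, the cost is one pass over the corpus plus one pass over the dictionary, rather than one dictionary scan per distinct word.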




