Background
I wrote a script to count the frequency of words in a plain-text file. The script performs the following steps:
- Count the frequency of words from a corpus.
- Retain each word in the corpus found in a dictionary.
- Create a comma-separated file of the frequencies.
The script is at http://pastebin.com/VAZdeKXs:
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt |
  tr '[:upper:]' '[:lower:]' |
  sort |
  uniq -c |
  sort -rn > frequency.txt
echo Creating corpus lexicon...
rm -f corpus-lexicon.txt
for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 "^$i$" dictionary.txt >> corpus-lexicon.txt;
done
echo Creating lexicon...
rm -f lexicon.txt
for i in $(cat corpus-lexicon.txt); do
  egrep -m 1 "^[0-9 ]* $i$" frequency.txt |
    awk '{print $2, $1}' |
    tr ' ' ',' >> lexicon.txt;
done
Problem
The following lines repeatedly scan the dictionary to match words:
for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 "^$i$" dictionary.txt >> corpus-lexicon.txt;
done
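It works, but it is slow because it checks every word found in the corpus in order to remove any that are not in the dictionary. The code does this by scanning the dictionary once for every single word. (The -m 1 parameter stops the scan when the match is found.)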
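To illustrate why -m 1 only helps for words that are present, here is a small sketch. The file name sample-dict.txt and its contents are made up for the example. On a hit, grep stops at the first matching line; on a miss, it has to read the whole file before it can report that nothing matched.

```shell
#!/bin/bash
# Made-up three-line dictionary, for illustration only.
printf '%s\n' apple apple banana > sample-dict.txt

# Hit: grep prints the first matching line and stops reading.
hit=$(grep -m 1 '^apple$' sample-dict.txt)
echo "hit: $hit"

# Miss: grep reads to the end of the file before giving up,
# and reports "no match" through exit status 1.
grep -m 1 '^zebra$' sample-dict.txt
echo "miss status: $?"
```

Since most corpus words are misses, most iterations of the loop pay for a full pass over dictionary.txt.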
Question
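How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of words will not be in the dictionary.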
Thank you!
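One possible direction, offered as a sketch under assumptions rather than a definitive answer: read the dictionary once into an awk associative array, then filter the tally against it in a single pass. This assumes the dictionary holds one lowercase word per line and that the tally has the "count word" layout produced by uniq -c. The sample-*.txt file names and their contents are hypothetical, chosen so the example does not touch the real files.

```shell
#!/bin/bash
# Hypothetical dictionary: one lowercase word per line.
printf '%s\n' apple banana cherry > sample-dictionary.txt

# Hypothetical tally in the "count word" layout produced by uniq -c.
printf '%7d %s\n' 5 the 3 apple 2 zzz 1 cherry > sample-frequency.txt

# First file (NR == FNR): store every dictionary word as an array key.
# Second file: print "word,count" for tally lines whose word is a key.
awk 'NR == FNR { dict[$1]; next }
     $2 != "" && ($2 in dict) { print $2 "," $1 }' \
    sample-dictionary.txt sample-frequency.txt > sample-lexicon.txt

cat sample-lexicon.txt
```

With this approach each file is read once, and each membership test is an array lookup instead of a grep over the dictionary. It would also stand in for both loops, since the output is already in word,count form.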