This question discusses how to code a spell corrector and is not a duplicate of Delphi Spell Checker component.
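Two years ago I found and used the spelling corrector that Peter Norvig describes on his website, but its performance did not seem very high. Interestingly, his page has recently added more languages that perform the same task to its list.
Some of the lines on Peter's page include: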
[a + c + b for a, b in splits for c in alphabet]
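How can this be converted into Delphi?
I am interested to know how the Delphi experts on SO would apply the same theory in a reasonable number of lines with moderate or better performance. This is not meant to downgrade any language, but to learn by comparing how each one implements the task in a different way.
Thanks so much in advance.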
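As a rough sketch of one possible reading of that line: the comprehension walks over every split point of the word and every letter of the alphabet, so in Delphi 7 it can be written as two nested loops. The name `AddInserts` is hypothetical and only used for illustration, and the alphabet is assumed to be the lowercase letters 'a' to 'z'.

```pascal
program InsertsDemo;
{$APPTYPE CONSOLE}
uses
  Classes;

// Hypothetical helper: appends to Dest every string obtained by inserting
// one letter 'a'..'z' at each position of S (start and end included).
procedure AddInserts(const S: string; Dest: TStrings);
var
  i: Integer;
  c: Char;
begin
  // i is the split point: a = Copy(S, 1, i), b = the rest of S
  for i := 0 to Length(S) do
    for c := 'a' to 'z' do
      Dest.Add(Copy(S, 1, i) + c + Copy(S, i + 1, MaxInt));
end;

var
  L: TStringList;
begin
  L := TStringList.Create;
  try
    AddInserts('cat', L);
    WriteLn(L.Count);  // 4 split points * 26 letters = 104 (duplicates kept)
  finally
    L.Free;
  end;
end.
```

The outer loop runs from 0 to `Length(S)` so that letters are also inserted before the first and after the last character.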
[Edit]
I will quote Marcelo Toledo, who contributed the C version: "...While the purpose of this article [C version] was to show the algorithms, not to highlight Python...". Although his C version has the second-highest line count, his article says it performs well when the dictionary file is huge. So this question is not meant to highlight any language but to ask for a Delphi solution, and it is not at all intended as a competition, even though Peter is influential in directing Google Research.
<<>>>
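David's suggestion inspired me, and I studied the theory and routines on Peter's page. The routine I wrote is very rough and inefficient, and it differs slightly from the other languages in that mine has a GUI. I am a beginner still learning Delphi, so I dare not post my complete code (it is poorly written). Instead I will outline how I did it. Your comments are welcome so that the routine can be improved.
My hardware and software are old, but they are enough for my work (my profession is not in computing or a related field).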
AMD Athlon Dual Core Processor
2.01 GHz, 480 MB memory
Windows XP SP2
IDE Delphi 7.0
This is the snapshot and record of the processing time for correcting a word. I tried GetTickCount, TDateTime, and QueryPerformanceCounter to measure the correction time per word, but GetTickCount and TDateTime report 0 ms for each check, so I had to use QueryPerformanceCounter. Maybe there are other ways to measure it more precisely.
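For reference, a minimal sketch of the QueryPerformanceCounter measurement, assuming the Windows unit that ships with Delphi 7. GetTickCount typically advances in steps of roughly 10 to 16 ms, which would explain the 0 ms readings for a single word. `ElapsedMs` is a hypothetical helper, and the loop is only a placeholder for the real call to `correct`.

```pascal
program TimingDemo;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

// Hypothetical helper: converts two counter readings to milliseconds.
function ElapsedMs(StartCount, StopCount, Freq: Int64): Double;
begin
  Result := (StopCount - StartCount) * 1000.0 / Freq;
end;

var
  Freq, T0, T1: Int64;
  i, Dummy: Integer;
begin
  // Freq is the number of counter ticks per second on this machine
  if not QueryPerformanceFrequency(Freq) then
  begin
    WriteLn('High-resolution counter not available');
    Exit;
  end;
  QueryPerformanceCounter(T0);
  Dummy := 0;
  for i := 1 to 100000 do      // placeholder for correct(str)
    Inc(Dummy, i and 1);
  QueryPerformanceCounter(T1);
  WriteLn(Format('%.3f ms', [ElapsedMs(T0, T1, Freq)]));
end.
```

Timing many corrections in one run and dividing by the count would smooth out the noise of a single measurement.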
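The total is 72 lines, not counting the function that records the checking time. Marcelo pointed out that line count may not be the right measure. This post is about how to accomplish the task in different ways. Of course, the Delphi experts on SO would use the fewest lines to get the best performance.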
procedure Tmajorform.FormCreate(Sender: TObject);
begin
  loaddict;
end;

procedure Tmajorform.loaddict;
var
  fs: TFilestream;
  templist: TStringlist;
  p1: tperlregex;
  w1: string;
begin
  // Load big.txt (6.3 MB, includes The Adventures of Sherlock Holmes)
  // templist.loadfromstream
  // Use TPerlRegEx to tokenize (I used the regular expression component by [Jan Goyvaerts][5])
  // Loading and tokenizing take about 7-8 seconds on my machine. Maybe there are
  // other ways to speed up loading and tokenizing.
end;

procedure Tmajorform.edits1(str: string);
var
  i: integer;
  ch: char;
begin
  // This simulates the routine on Peter's page to quickly generate all possible combinations.
  // I do not know how to use a set in Delphi, so I used an array.
  // Peter said his edits1 routine generates 494 elements for the word "something". Mine
  // generates 469, and I do not know why. Before ignoring duplicates, mine has over 500.
  // After setting duplicates to be ignored, there are 469 unique elements for "something".
end;

procedure Tmajorform.correct(str: string);
var
  i, j: integer;
begin
  // Loop over the generated strings and binary-search the dictionary,
  // adding each candidate word that is found to the list.
end;

procedure Tmajorform.Button2Click(Sender: TObject);
var
  str: string;
begin
  // Triggers correct(str: string);
end;
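To make the edits1 count easier to compare, here is a minimal sketch of the four edit types (delete, transpose, replace, insert) as described on Peter's page. It is not the code outlined above: it assumes a lowercase 'a' to 'z' alphabet and uses a sorted `TStringList` with `Duplicates := dupIgnore` as a stand-in for a set. `Edits1` and `CountEdits1` are hypothetical names.

```pascal
program Edits1Demo;
{$APPTYPE CONSOLE}
uses
  Classes;

// Adds every string at edit distance 1 from S to Dest. Dest is expected
// to be a sorted TStringList with Duplicates = dupIgnore (set-like).
procedure Edits1(const S: string; Dest: TStrings);
var
  i: Integer;
  c: Char;
  Head, Tail: string;
begin
  for i := 0 to Length(S) do
  begin
    Head := Copy(S, 1, i);            // "a" of the (a, b) split
    Tail := Copy(S, i + 1, MaxInt);   // "b" of the (a, b) split
    if Tail <> '' then
      Dest.Add(Head + Copy(Tail, 2, MaxInt));                      // delete
    if Length(Tail) > 1 then
      Dest.Add(Head + Tail[2] + Tail[1] + Copy(Tail, 3, MaxInt));  // transpose
    for c := 'a' to 'z' do
    begin
      if Tail <> '' then
        Dest.Add(Head + c + Copy(Tail, 2, MaxInt));                // replace
      Dest.Add(Head + c + Tail);                                   // insert
    end;
  end;
end;

// Returns the number of unique edit-distance-1 strings for S.
function CountEdits1(const S: string): Integer;
var
  L: TStringList;
begin
  L := TStringList.Create;
  try
    L.CaseSensitive := True;
    L.Duplicates := dupIgnore;   // only honoured when Sorted is True
    L.Sorted := True;
    Edits1(S, L);
    Result := L.Count;
  finally
    L.Free;
  end;
end;

begin
  WriteLn(CountEdits1('something'));  // 494
end.
```

For "something" (9 letters) the raw counts are 9 deletes, 8 transposes, 234 replaces and 260 inserts, 511 in total. After duplicates are ignored, 494 unique strings remain: 9 + 8 + 226 + 251. Comparing each category separately against these numbers may help locate where a different implementation loses elements.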
Using TFileStream seems to speed up loading by 1-2 seconds. I tried the CreateFileMapping method but failed, and it seemed a little complicated. Maybe there are other ways to load a huge file quickly. Since big.txt is not large compared with the corpora that are available, there should be a more efficient way to load larger and larger files.
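One simple option, sketched here under the assumption that the file fits in memory and holds single-byte text, is to read the whole file into one string with a single `ReadBuffer` call and scan that string afterwards. `LoadFileToString` is a hypothetical helper, and this has not been measured against the approaches above.

```pascal
program LoadDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils, Classes;

// Hypothetical helper: reads the whole file into an AnsiString in one call.
function LoadFileToString(const FileName: string): AnsiString;
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if fs.Size > 0 then
      fs.ReadBuffer(Result[1], fs.Size);  // one read for the entire file
  finally
    fs.Free;
  end;
end;

begin
  // Usage: LoadDemo big.txt
  if ParamCount >= 1 then
    WriteLn(Length(LoadFileToString(ParamStr(1))), ' bytes read');
end.
```

Reading into a single buffer avoids building one string per line, which is what loading through a string list does.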
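Another point is that Delphi 7.0 has no built-in regular expressions. I looked at the other languages on Peter's page, and they basically call their built-in regular expressions directly. Of course, real experts do not need any built-in classes or libraries and can build them themselves. For beginners, some classes or libraries are convenient.
Your comments are welcome.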
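For this particular task a regular expression may not be needed. As a hedged sketch, the tokenizer can be a plain character scan that collects runs of 'a' to 'z' (which also shows the `in` operator on a set of characters) and counts them in a sorted `TStringList`. `CountWords` is a hypothetical name, and storing the count in `Objects[]` through an `Integer` cast assumes a 32-bit compiler such as Delphi 7.

```pascal
program TokenizeDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils, Classes;

// Splits Text into lowercase runs of 'a'..'z' and counts each word.
// Counts must have Sorted = True so that Find can binary-search it.
procedure CountWords(const Text: string; Counts: TStringList);
var
  i, Start, Idx: Integer;
  Lower, W: string;
begin
  Lower := LowerCase(Text);
  i := 1;
  while i <= Length(Lower) do
  begin
    if Lower[i] in ['a'..'z'] then
    begin
      Start := i;
      while (i <= Length(Lower)) and (Lower[i] in ['a'..'z']) do
        Inc(i);
      W := Copy(Lower, Start, i - Start);
      if Counts.Find(W, Idx) then
        // Assumption: 32-bit compiler, so the count fits in the pointer slot
        Counts.Objects[Idx] := TObject(Integer(Counts.Objects[Idx]) + 1)
      else
        Counts.AddObject(W, TObject(1));
    end
    else
      Inc(i);
  end;
end;

var
  Counts: TStringList;
  Idx: Integer;
begin
  Counts := TStringList.Create;
  try
    Counts.Sorted := True;
    CountWords('The cat and the hat.', Counts);
    if Counts.Find('the', Idx) then
      WriteLn('the: ', Integer(Counts.Objects[Idx]));  // the: 2
  finally
    Counts.Free;
  end;
end.
```

The same sorted list can later serve as the dictionary for the binary-search lookup, since `Find` is already a binary search.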
<<>>>
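I continued my study and further added the edits2 function (edit distance 2). This adds about 12 lines of code. Peter said that edit distance 2 covers almost all cases. There are 114,324 possibilities. My function yields 102,727 unique possibilities. Of course, the number of suggested words also increases.
With edits2, the correction time is noticeably longer, because the amount of data grows about 200 times. However, I found that some of the suggested corrections are clearly implausible, because a typist would not mistype a word into something that far from the intended one. So edit distance 1 would be the better choice, provided the big.txt file is large enough to include more correct words.
Below is the snapshot tracking the correction time with edits2.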