Question

I have to check if I have duplicate paths in a FileListBox (FileListBox has the role of some kind of job list or play list). Using Delphi s SameText, CompareStr, CompareText, takes 6 seconds. So I came with my own compare function which is (just) a bit faster but not fast enough. Any ideas how to improve it?

function SameFile(CONST Path1, Path2: string): Boolean;
VAR i: Integer;
begin
 Result:= Length(Path1)= Length(Path2);                                         { if they have different lenghts then obviously are not the same file }
 if Result then
  for i:= Length(Path1) downto 1 DO                                             { start from the end because it is more likely to find the difference there }
   if Path1[i]<> Path2[i] then
    begin
     Result:= FALSE;
     Break;
    end;
end;

我这样说:

 for x:= JList.Count-1 downto 1 DO
  begin
   sMaster:= JList.Items[x];
   for y:= x-1 downto 0 DO
    if SameFile(sMaster, JList.Items[y]) then
     begin
      JList.Items.Delete (x); { REMOVE DUPLICATES }
      Break;
     end;
  end;

注:重复的可能性很小,因此不经常要求删除。此外,不能对清单进行分类,因为项目由用户添加,有时订单可能很重要。

Update:
The thing is that I lose the asvantage of my code because it is Pascal. It would be nice if the comparison loop ( Path1[i]<> Path2[i] ) would be optimized to use Borland s ASM code.

Delphi 7, Win XP 32 bit,在清单中测试了577项物品。删除项目清单>,因为很少发生。

http://www.ohchr.org。

As Svein Bringsli pointed, my code is slow not because of the comparing algorithm but because of TListBox. The BEST solution was provided by Marcelo Cantos. Thanks a lot Marcelo.
I accepted Svein s answer because it answers directly my question "how to make my comparison function faster" with "there is no point to make it faster".
For the moment I implemented the dirty and quick to implement solution: when I have under 200 files, I use my slow code to check the duplicates. If there are more than 200 files I use dwrbudr s solution (which is damn fast) considering that if the user has so many files, the order is irrelevant anyway (human brain cannot track so many items).

我要感谢大家的想法,特别是Svein透露真相:(祖国)视觉控制缓慢!

Answer 1

我将你的代码作为起点,修改了,在寻找复制件之前要拿一份清单。时间从5 5秒到大约0 5秒。

vSL := TStringList.Create;
try
  vSL.Assign(jList.Items);
  vSL.Sorted := true;
  for x:= vSL.Count-1 downto 1 DO
  begin
   sMaster:= vSL[x];
   for y:= x-1 downto 0 DO
    if SameFile(sMaster, vSL[y]) then
     begin
      vSL.Delete (x); { REMOVE DUPLICATES }
      jList.Items.Delete (x);
      Break;
     end;
  end;
finally
  vSL.Free;
end;

显然,这不是这样做的好办法,但它表明,紫外线本身相当缓慢。我不相信,通过优化你的比较功能,你能够取得很大收益。

为了证明这一点,我以以下方式取代你的SameFile功能,但保留了你的法典的其余部分:

function SameFile(CONST Path1, Path2: string): Boolean;
VAR i: Integer;
begin
  Result := false; //Pretty darn fast code!!!
end;

时间为5.6秒至5.5秒。我不想再想到的是:

Answer 2

选择组装的废物时间。你可以从O(n²)到O(nlog(n)——通过分类清单,然后对重复进行线性扫描。

在你回头的时候,你忘记了SameFile的功能。算法的改进将缓解你在那里能够取得的任何成果。

Edit:根据评论中的反馈......

2. 您可按下列方式办理订单设计书(nlog(n)):

Sort a copy of the list.
Identify and copy duplicated entries to a third list along with their duplication count minus one.
Walk the original list backwards as per your original version.
In the inner (for y := ...) loop, traverse the duplication list instead. If an outer item matches, delete it, decrement the duplication count, and delete the duplication entry if the count reaches zero.

这显然更为复杂,但即便是你做过诸如将重复计为示物的可怕的 d脏物品,如:<代码>C:path1file1=2,并且使用诸如:

y := dupes.IndexOfName(sMaster);
if y <> -1 then
begin
    JList.Items.Delete(x);
    c := StrToInt(dupes.ValueFromIndex(y));
    if c > 1 then
        dupes.Values[sMaster] = IntToStr(c - 1);
    else
        dupes.Delete(y);
end;

说明: 双轨制比for y := loop,但鉴于重复很少,差异应当微不足道。

Answer 3

编制另一类清单,列出顺序名单:


vSL := TStringList.Create;
try
  vSL.Sorted := true;
  vSL.Duplicates := dupIgnore;
  for x:= 0 to jList.Count - 1 do
    vSL.Add(jList[x]);

  jList.Clear;
  for x:= 0 to vSL.Count - 1 do
    jList.Add(vSL[x]);
finally
  vSL.Free;
end;

Answer 4

绝对最快的办法是,不使用(以前曾提到过)一种例行做法,为扼杀而形成一种独特的第64/128/256 bith代码(我使用C#中的SHA256 管理类别)。滚动清单,编造护卫的散射法,在分类的散射代码清单中加以核对,如果发现,则体形是重复的。否则,在分类的散列代码清单中添加散列代码。

这将为扼杀、档案名称、图像(青年能够获得独特的散射法)等工作。我保证,这将比任何其他减速更快。

PS 你可以通过代表散列代码作为示意图,为散列代码使用一个示意图清单。过去,我曾使用过头盔代表(256比特;64大字),但理论上,你可以这样做。

Answer 5

多少次电话? 如果你称之为十亿次,则会取得巨大成绩。

无论如何,Length(Path1)是否每次都通过休息室得到评估? 如果是的话,在排位之前将这种变数储存在Integer。

点击器可能产生一些速度。

Try in-lining the function with: function SameFile(blah blah): Boolean; Inline;

这将节省一些时间,如果说这是每秒数千次。首先,我想看一下它是否拯救了任何东西。

EDIT:我没有认识到你的名单是按顺序排列的。显然,你应该这样做。然后,你不必对名单上的每一个其他项目进行比较——即之前或之后的项目。

Answer 6

我使用经过修改的Ternary搜索树清单。你只是把整块地块作为钥匙装入树,在每一件地上,如果钥匙已经存在,你可以重新表明(并删除你的可见条目)。然后,你 throw树。我们的技转中心负荷功能通常能够在第二大条件下装满100 000个逐项物品。更不用说重开你的名单,适当利用开始和结束日期。《技术手册》是记忆,但并非这样,如果你只拥有500件物品,你就根本会注意到。而比分类、比较和组装要简单得多(当然,如果你执行适当的技术指示)。

Answer 7

无需使用洗衣桌,单一分类清单给我带来10毫秒的结果,即0.01秒钟,大约是500倍。这里是我使用TListBox的试验守则:

procedure TForm1.Button1Click(Sender: TObject);
var
  lIndex1: Integer;
  lString: string;
  lIndex2: Integer;
  lStrings: TStringList;
  lCount: Integer;
  lItems: TStrings;
begin
  ListBox1.Clear;
  for lIndex1 := 1 to 577 do begin
    lString :=   ;
    for lIndex2 := 1 to 100 do
      if (lIndex2 mod 6) = 0 then
       lString := lString + Chr(Ord( a ) + Random(2))
      else
        lString := lString +  a ;
    ListBox1.Items.Add(lString);
  end;

  CsiGlobals.AddLogMsg( Start ,  Test , llBrief);

  lStrings := TStringList.Create;
  try
    lStrings.Sorted := True;
    lCount := 0;
    lItems := ListBox1.Items;
    with lItems do begin
      BeginUpdate;
      try
        for lIndex1 := Count - 1 downto 0 do begin
          lStrings.Add(Strings[lIndex1]);
          if lStrings.Count = lCount then
            Delete(lIndex1)
          else
            Inc(lCount);
        end;
      finally
        EndUpdate;
      end;
    end;
  finally
    lStrings.Free;
  end;

  CsiGlobals.AddLogMsg( Stop ,  Test , llBrief);
end;

Answer 8

我还愿指出,如果对一个庞大的清单(如包含100 000 000 000件或更多的物品)适用,你的解决办法将耗费大量时间。即便制定一份清单,也需要太多时间。

在你可以尝试另一种做法的情况下: 每一成员都哈希,但并非全方位的地名录,而是创立了一个范围(范围足够大,足以包含与投入项目一样的接近点数),并且只是将每一比值设定为由散列函数所抵消。如果借方为0,则改为1。如果已经是1,则在一份单独的清单中注意到犯罪扼杀指数,并继续。由此产生了在散射中发生碰撞的星号清单,因此,为了找到这些碰撞的第一个原因,你不得不第二次操作。之后,你应进行分类和编制;将插图索引列入本清单(因为除第一个指数外,所有指数将两次出现)。一旦这样做,你就应当重新确定名单,但这种时间将清单归入试剂中,以便很容易在以下单一扫描中发现重复。

给予这一期限可能是极端的,但至少是一大批量可行的解决办法! (Oh,如果重复件数非常高,散射机故障传播,或选用薄板块中的排位数太小,这将造成许多实际上重复的碰撞,那么这种工作仍会赢。)

友情链接