Question

在辅助存储器使用必须最小的限制下，最有效地从数组中删除重复项的方法是什么，最好小到甚至不需要任何堆分配？排序似乎是显而易见的选择，但这显然不是渐近有效的。是否有更好的算法可以就地或接近就地完成？如果排序是最好的选择，那么这样的排序最好用什么类型？

Answer 1

我会回答我的问题，因为在发帖后，我想出了一个非常聪明的算法来做这件事。它使用哈希，在原地构建类似哈希集合的东西。它保证在夹带空间中为O(1)（递归是尾调用），通常为O(N)时间复杂度。算法如下：

Take the first element of the array, this will be the sentinel.
Reorder the rest of the array, as much as possible, such that each element is in the position corresponding to its hash. As this step is completed, duplicates will be discovered. Set them equal to sentinel.
Move all elements for which the index is equal to the hash to the beginning of the array.
Move all elements that are equal to sentinel, except the first element of the array, to the end of the array.
What s left between the properly hashed elements and the duplicate elements will be the elements that couldn t be placed in the index corresponding to their hash because of a collision. Recurse to deal with these elements.

This can be shown to be O(N) provided no pathological scenario in the hashing: Even if there are no duplicates, approximately 2/3 of the elements will be eliminated at each recursion. Each level of recursion is O(n) where small n is the amount of elements left. The only problem is that, in practice, it s slower than a quick sort when there are few duplicates, i.e. lots of collisions. However, when there are huge amounts of duplicates, it s amazingly fast.

编辑：在当前的 D 实现中，hash_t 是32位的。该算法的所有内容都假定在完整的32位空间中，很少会发生哈希碰撞。但是，在模数空间中可能会频繁发生碰撞。但是，对于任何合理大小的数据集来说，这一假设很可能是正确的。如果键小于或等于32位，则可以是自己的哈希，这意味着在完整的32位空间中发生哈希碰撞是不可能的。如果键更大，将无法将足够数量的键放入32位内存地址空间中，因此不会出现问题。我假设在D的64位实现中，hash_t将增加到64位，其中数据集可以更大。此外，如果这确实成为问题，可以在每个递归级别更改哈希函数。

这里有一个用D编程语言实现的示例：

void uniqueInPlace(T)(ref T[] dataIn) {
    uniqueInPlaceImpl(dataIn, 0);
}

void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
    if(dataIn.length - start < 2)
        return;

    invariant T sentinel = dataIn[start];
    T[] data = dataIn[start + 1..$];

    static hash_t getHash(T elem) {
        static if(is(T == uint) || is(T == int)) {
            return cast(hash_t) elem;
        } else static if(__traits(compiles, elem.toHash)) {
            return elem.toHash;
        } else {
            static auto ti = typeid(typeof(elem));
            return ti.getHash(&elem);
        }
    }

    for(size_t index = 0; index < data.length;) {
        if(data[index] == sentinel) {
            index++;
            continue;
        }

        auto hash = getHash(data[index]) % data.length;
        if(index == hash) {
            index++;
            continue;
        }

        if(data[index] == data[hash]) {
            data[index] = sentinel;
            index++;
            continue;
        }

        if(data[hash] == sentinel) {
            swap(data[hash], data[index]);
            index++;
            continue;
        }

        auto hashHash = getHash(data[hash]) % data.length;
        if(hashHash != hash) {
            swap(data[index], data[hash]);
            if(hash < index)
                index++;
        } else {
            index++;
        }
    }


    size_t swapPos = 0;
    foreach(i; 0..data.length) {
        if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
            swap(data[i], data[swapPos++]);
        }
    }

    size_t sentinelPos = data.length;
    for(size_t i = swapPos; i < sentinelPos;) {
        if(data[i] == sentinel) {
            swap(data[i], data[--sentinelPos]);
        } else {
            i++;
        }
    }

    dataIn = dataIn[0..sentinelPos + start + 1];
    uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}

Answer 2

保持辅助存储器的使用最小化，最好的方法是进行高效的排序，以便将它们排序，然后使用 "从" 和 "到" 索引进行一次数组单通。

每次循环时都会提升FROM索引。当关键字与上一个不同时，您仅将元素从FROM复制到TO（并将TO递增）。

通过快速排序，平均情况下时间复杂度为O(n-log-n)，最后一遍时间复杂度为O(n)。

Answer 3

如果你对数组进行排序，仍需要另一次遍历来移除重复项，因此最坏情况下的复杂度为O(NN)（假设使用快速排序），或者使用希尔排序为O(Nsqrt(N))。

您可以通过简单地扫描数组，同时去除重复项，每个元素即可实现O（N*N）。

这是Lua的一个示例：

function removedups (t) local result = {} local count = 0 local found for i,v in ipairs(t) do found = false if count > 0 then for j = 1,count do if v == result[j] then found = true; break end end end if not found then count = count + 1 result[count] = v end end return result, count end

Answer 4

我看不到任何不使用类似冒泡排序的方式来完成这个任务的可能性。当您发现重复项时，需要缩减数组的长度。快速排序并不适用于数组大小的变化。

这个算法始终是O（n ^ 2），但几乎不使用额外的内存 - 堆栈或堆。

// returns the new size
int bubblesqueeze(int* a, int size) {
   for (int j = 0; j < size - 1; ++j) {
      for (int i = j + 1; i < size; ++i) {
          // when a dupe is found, move the end value to index j
          // and shrink the size of the array
          while (i < size && a[i] == a[j]) {
             a[i] = a[--size];
          }
          if (i < size && a[i] < a[j]) {
             int tmp = a[j];
             a[j] = a[i];
             a[i] = tmp;
          }
      }
   }
   return size;
}

Answer 5

如果你有两个不同的遍历数据集的变量而不是只有一个，那么你可以通过排除当前数据集中已经存在的所有重复项来限制输出。

显然，这个在C语言中的例子不是一个高效的排序算法，但它只是一个看待问题的方式的例子。

你也可以先盲目地对数据进行排序，然后重新定位数据以消除重复，但我不确定这样是否会更快。

#define ARRAY_LENGTH 15
int stop = 1;
int scan_sort[ARRAY_LENGTH] = {5,2,3,5,1,2,5,4,3,5,4,8,6,4,1};

void step_relocate(char tmp,char s,int *dataset)
{
    for(;tmp<s;s--)
        dataset[s] = dataset[s-1];
}
int exists(int var,int *dataset)
{
    int tmp=0;
    for(;tmp < stop; tmp++)
    {
        if( dataset[tmp] == var)
            return 1;/* value exsist */
        if( dataset[tmp] > var)
            tmp=stop;/* Value not in array*/
    }
    return 0;/* Value not in array*/
}
void main(void)
{
    int tmp1=0;
    int tmp2=0;
    int index = 1;
    while(index < ARRAY_LENGTH)
    {
        if(exists(scan_sort[index],scan_sort))
            ;/* Dismiss all values currently in the final dataset */
        else if(scan_sort[stop-1] < scan_sort[index])
        {
            scan_sort[stop] = scan_sort[index];/* Insert the value as the highest one */
            stop++;/* One more value adde to the final dataset */
        }
        else
        {
            for(tmp1=0;tmp1<stop;tmp1++)/* find where the data shall be inserted */
            {
                if(scan_sort[index] < scan_sort[tmp1])
                {
                    index = index;
                    break;
                }
            }
            tmp2 = scan_sort[index]; /* Store in case this value is the next after stop*/
            step_relocate(tmp1,stop,scan_sort);/* Relocated data already in the dataset*/
            scan_sort[tmp1] = tmp2;/* insert the new value */
            stop++;/* One more value adde to the final dataset */
        }
        index++;
    }
    printf("Result: ");
    for(tmp1 = 0; tmp1 < stop; tmp1++)
        printf( "%d ",scan_sort[tmp1]);
    printf("
");
    system( "pause" );
}

我喜欢这个问题，所以我写了一个简单的C测试程序，就如你上面所看到的。如果你有任何问题或者看到错误，可以留言给我。

友情链接