English 中文(简体)
.NET 泛型字典应该初始化一个容量,该容量等于它将包含的项数吗?
原标题:
  • 时间:2009-01-05 18:57:56
  •  标签:

如果我有100个待存储在字典中的项目,我应该这样初始化它吗?

var myDictionary = new Dictionary<Key, Value>(100);

我的理解是,当.NET字典达到给定的负载时,它会在内部重新调整大小,而负载阈值的定义是容量的比率。

这将表明如果上述字典中添加了100个条目,则在添加其中一个条目时它将自动调整大小。调整字典的大小是我想避免的事情,因为这会对性能造成影响,浪费内存。

哈希碰撞的概率与字典的负载成正比。因此,即使字典不会调整大小(并使用其所有插槽),由于这些碰撞,性能也必须下降。

假设您知道字典中将有多少项,最好如何决定初始化字典的容量?

最佳回答

What you should initialize the dictionary capacity to depends on two factors: (1) The distribution of the gethashcode function, and (2) How many items you have to insert.

你的哈希函数应该是随机分布的,或者它应该是针对你的输入集特别制定的。让我们假设第一个,但如果你对第二个感兴趣,请查找完美哈希函数。

如果你有100个项目要插入字典、一个随机分布的哈希函数,并将容量设置为100,则在将第i个项目插入哈希表时,有(i-1)/100的概率第i个项目在插入时会与另一个项目发生冲突。如果你想降低冲突的概率,可以增加容量。将预期容量加倍可以将冲突的几率减半。

此外,如果您知道每个字典中的项目访问频率,您可能希望按照降序顺序插入项目,因为您首先插入的项目平均访问速度将更快。

问题回答

Improved benchmark:

  • Hardware: Intel Core i7-10700K x64, .NET 5, Optimized build. LINQPad 6 for .NET 5 run and LINQPad 5 for .NET Fx 4.8 run.
  • Times are in fractional milliseconds to 3 decimal places.
    • 0.001ms is 1 microsecond.
    • I am unsure of the actual resolution of Stopwatch as it s system-dependent, so don t stress over differences at the microsecond level.
  • Benchmark was re-run dozens of times with consistent results. Times shown are averages of all runs.
  • Conclusion: Consistent 10-20% overall speedup by setting capacity in the Dictionary<String,String> constructor.

.NET: .NET Framework 4.8 .NET 5
With initial capacity of 1,000,000
Constructor 1.170ms 0.003ms
Fill in loop 353.420ms 181.846ms
Total time 354.590ms 181.880ms
Without initial capacity
Constructor 0.001ms 0.001ms
Fill in loop 400.158ms 228.687ms
Total time 400.159ms 228.688ms
Speedup from setting initial capacity
Time 45.569ms 46.8ms
Speedup % 11% 20%
  • I did repeat the benchmark for smaller initial sizes (10, 100, 1000, 10000, and 100000) and the 10-20% speedup was also observed at those sizes, but in absolute terms a 20% speedup on an operation that takes a fraction of a millisecond
  • While I saw consistent results (the numbers shown are averages), but there are some caveats:
    • This benchmark was performed with a rather extreme size of 1,000,000 items but with tight-loops (i.e. not much else going on inside the loop body) which is not a realistic scenario. So always profile and benchmark your own code to know for sure rather than trusting a random benchmark you found on the Internet (just like this one).
    • The benchmark doesn t isolate the time spent generating the million or so String instances (caused by i.ToString().
    • A reference-type (String) was used for both keys and values, which uses the same size as a native pointer size (8 bytes on x64), so results will be different when re-run if the keys and/or values use a larger value-type (such as a ValueTuple). There are other type-size factors to consider as well.
    • As things improved drastically from .NET Framework 4.8 to .NET 5 it means that you shouldn t trust these numbers if you re running on .NET 6 or later.
      • Also, don t assume that newer .NET releases will _always) make things faster: there have been times when performance actually worsened with both .NET updates and OS security patches.
// Warmup:
{
    var foo1 = new Dictionary<string, string>();
    var foo2 = new Dictionary<string, string>( capacity: 10_000 );
    foo1.Add( "foo", "bar" );
    foo2.Add( "foo", "bar" );
}


Stopwatch sw = Stopwatch.StartNew();

// Pre-set capacity:
TimeSpan pp_initTime;
TimeSpan pp_populateTime;
{
    var dict1 = new Dictionary<string, string>(1000000);

    pp_initTime = sw.GetElapsedAndRestart();

    for (int i = 0; i < 1000000; i++)
    {
        dict1.Add(i.ToString(), i.ToString());
    }
}
pp_populateTime = sw.GetElapsedAndRestart();

//
TimeSpan empty_initTime;
TimeSpan empty_populateTime;
{
    var dict2 = new Dictionary<string, string>();

    empty_initTime = sw.GetElapsedAndRestart();

    for (int i = 0; i < 1000000; i++)
    {
        dict2.Add(i.ToString(), i.ToString());
    }
}
empty_populateTime = sw.GetElapsedAndRestart();

//

Console.WriteLine("Pre-set capacity. Init time: {0:N3}ms, Fill time: {1:N3}ms, Total time: {2:N3}ms.", pp_initTime.TotalMilliseconds, pp_populateTime.TotalMilliseconds, ( pp_initTime + pp_populateTime ).TotalMilliseconds );
Console.WriteLine("Empty capacity. Init time: {0:N3}ms, Fill time: {1:N3}ms, Total time: {2:N3}ms.", empty_initTime.TotalMilliseconds, empty_populateTime.TotalMilliseconds, ( empty_initTime + empty_populateTime ).TotalMilliseconds );

// Extension methods:

[MethodImpl( MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization )]
public static TimeSpan GetElapsedAndRestart( this Stopwatch stopwatch )
{
    TimeSpan elapsed = stopwatch.Elapsed;
    stopwatch.Restart();
    return elapsed;
}

Original benchmark:

原始基准,没有冷启动预热阶段和低精度的 DateTime 计时。

  • With capacity (dict1) total time is 1220.778ms (for construction and population).
  • Without capacity (dict2) total time is 1502.490ms (for construction and population).
  • So a capacity saved 320ms (~20%) compared to not setting a capacity.
static void Main(string[] args)
{
    const int ONE_MILLION = 1000000;

    DateTime start1 = DateTime.Now;
    
    {
        var dict1 = new Dictionary<string, string>( capacity: ONE_MILLION  );

        for (int i = 0; i < ONE_MILLION; i++)
        {
            dict1.Add(i.ToString(), i.ToString());
        }
    }
        
    DateTime stop1 = DateTime.Now;
        
    DateTime start2 = DateTime.Now;

    {
        var dict2 = new Dictionary<string, string>();

        for (int i = 0; i < ONE_MILLION; i++)
        {
            dict2.Add(i.ToString(), i.ToString());
        }
    }
        
    DateTime stop2 = DateTime.Now;
        
    Console.WriteLine("Time with size initialized: " + (stop1.Subtract(start1)) + "
Time without size initialized: " + (stop2.Subtract(start2)));
    Console.ReadLine();
}

我认为你正在过度复杂化问题。如果你知道字典中会有多少项,那么请在构建时明确指定。这将帮助字典在其内部数据结构中分配必要的空间,以避免重新分配和重排数据。

Dictionary构造函数中指定初始容量可以提高性能,因为在ADD操作期间存储字典值的内部结构的重新分配次数较少。

考虑到您在`Dictionary`构造函数中指定了初始容量k, 那么:

  1. The Dictionary will reserve the amount of memory necessary to store k elements;
  2. QUERY performance against the dictionary is not affected and it will not be faster or slower;
  3. ADD operations will not require more memory allocations (perhaps expensive) and thus will be faster.

来自MSDN:

The capacity of a Dictionary(TKey, TValue) is the number of elements that can be added to the Dictionary(TKey, TValue) before resizing is necessary. As elements are added to a Dictionary(TKey, TValue), the capacity is automatically increased as required by reallocating the internal array.

If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the Dictionary(TKey, TValue).

是的,与使用重新散列作为解决冲突的方法的哈希表相反,字典将使用链接。所以,是的,使用计数是一个好方法。对于哈希表,您可能想使用计数*(1 / 填充因子)。

最初的大小只是一个建议。例如,大多数哈希表都喜欢使用质数或2的幂作为大小。





相关问题
热门标签