Arraylist vs List vs Dictionary
  • Posted: 2012-05-05 17:28:06
  • Tags: c#, c#-4.0

Can anyone help me? I have the following code:

private List<string> GenerateTerms(string[] docs)
{
    List <string> uniques = new List<string>();

    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');

        List<string> toktolist = new List<string>(tokens.ToList());

        var query = toktolist.GroupBy(word => word)
             .OrderByDescending(g => g.Count())
             .Select(g => g.Key)
             .Take(20000);              

        foreach (string k in query)
        {
            if (!uniques.Contains(k)) 
                uniques.Add(k);
        }
    }            

    return uniques;            
}

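It generates terms from a number of documents, ranked by highest frequency. I tried the same procedure, and in both cases it took 440 milliseconds. The surprise came when I used the procedure with an ArrayList, as in the following code: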

private ArrayList GenerateTerms(string[] docs)
{
    Dictionary<string, int> yy = new Dictionary<string, int>();
    ArrayList uniques = new ArrayList();

    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');
        yy.Clear();
        for (int j = 0; j < tokens.Length; j++)
        {
            if (!yy.ContainsKey(tokens[j].ToString()))
                yy.Add(tokens[j].ToString(), 1);
            else
                yy[tokens[j].ToString()]++;
        }

        var sortedDict = (from entry in yy
                          orderby entry.Value descending
                          select entry).Take(20000).ToDictionary
                         (pair => pair.Key, pair => pair.Value);

        foreach (string k in sortedDict.Keys)
        {
            if (!uniques.Contains(k))
                uniques.Add(k);
        }
    }

    return uniques;
}

It took 350 milliseconds. Shouldn't ArrayList be slower than List and Dictionary? Please help me clear up this confusion.

Best answer

Your code does a lot of unnecessary work and uses inefficient data structures.

Try this instead:

private List<string> GenerateTerms(string[] docs)
{
     var result = docs
         .SelectMany(doc => doc.Split(' ')
                               .GroupBy(word => word)
                               .OrderByDescending(g => g.Count())
                               .Select(g => g.Key)
                               .Take(20000))
         .Distinct()
         .ToList();   
     return result;
}

Or, with the per-document step factored into a helper method:

private List<string> GenerateTerms(string[] docs)
{
    return docs.SelectMany(doc => ProcessDocument(doc)).Distinct().ToList();
}

private IEnumerable<string> ProcessDocument(string doc)
{
    return doc.Split(' ')
              .GroupBy(word => word)
              .OrderByDescending(g => g.Count())
              .Select(g => g.Key)
              .Take(20000);
}
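
One likely source of the inefficiency mentioned above is the uniques.Contains(k) check in the question's code. On both List<string> and ArrayList that call is a linear scan, so de-duplication gets slower as the list of unique terms grows, whichever of the two collections is used. Below is a minimal sketch, assuming first-seen order should be preserved, that tracks membership in a HashSet<string> instead. The names UniqueTermsSketch, GenerateTermsWithSet and perDocLimit are illustrative and do not come from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UniqueTermsSketch
{
    // Hypothetical helper (not from the original post): same per-document
    // ranking as the question, but membership is tracked in a HashSet.
    public static List<string> GenerateTermsWithSet(string[] docs, int perDocLimit = 20000)
    {
        var seen = new HashSet<string>();   // average O(1) membership test
        var uniques = new List<string>();   // keeps terms in first-seen order

        foreach (string doc in docs)
        {
            var topTerms = doc.Split(' ')
                              .GroupBy(word => word)
                              .OrderByDescending(g => g.Count())
                              .Select(g => g.Key)
                              .Take(perDocLimit);

            foreach (string term in topTerms)
            {
                // HashSet<T>.Add returns false when the item is already present.
                if (seen.Add(term))
                    uniques.Add(term);
            }
        }

        return uniques;
    }
}
```

Whether this beats the Distinct() version on real data would need measuring; Distinct() also relies on a hash-based set internally, so the two should be in the same ballpark.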

Other answers

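I like Mark's solution. However, I think that if you leverage a Dictionary properly you can squeeze out even more performance. Sure enough, this is a lot faster...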

private static List<string> GenerateTerms(string[] docs)
{
    var termsDictionary = new Dictionary<string, int>();

    foreach (var doc in docs)
    {
        var terms = doc.Split(' ');
        int uniqueTermsCount = 0;

        foreach (string term in terms)
        {
            if (termsDictionary.ContainsKey(term))
                termsDictionary[term]++;
            else
            {
                uniqueTermsCount++;
                termsDictionary[term] = 1;
            }
        }

        if (uniqueTermsCount >= 20000)
            break;
    }

    return (from entry in termsDictionary
                    orderby entry.Value descending
                    select entry.Key).ToList();
}

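To explain briefly, termsDictionary holds a Dictionary of terms and the number of times each term is repeated. The Linq query at the end then returns the terms in descending order by number of occurrences.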
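
The count-then-rank idea described above can be sketched in isolation. This is an illustrative variant, not the answerer's code: it uses TryGetValue so each term needs one lookup before the write instead of ContainsKey followed by the indexer. The names TermCountSketch, CountTerms and RankTerms are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TermCountSketch
{
    // Hypothetical helper: counts how often each term occurs across all docs.
    public static Dictionary<string, int> CountTerms(IEnumerable<string> docs)
    {
        var counts = new Dictionary<string, int>();

        foreach (string doc in docs)
        {
            foreach (string term in doc.Split(' '))
            {
                // TryGetValue sets count to 0 when the key is missing,
                // so a new term ends up stored with a count of 1.
                int count;
                counts.TryGetValue(term, out count);
                counts[term] = count + 1;
            }
        }

        return counts;
    }

    // Orders the terms by descending count.
    public static List<string> RankTerms(Dictionary<string, int> counts)
    {
        return counts.OrderByDescending(entry => entry.Value)
                     .Select(entry => entry.Key)
                     .ToList();
    }
}
```

For the single document "a man a plan a canal panama", CountTerms yields five entries with "a" counted three times, so RankTerms puts "a" first.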

Update

I added code to limit the number of unique terms to 20,000 per document.

Here are the benchmarking results...

  • 322 ms (Original)
  • 284 ms (Mark Byers solution)
  • 113 ms (Leveraging the Dictionary as above)

Below is the code I used to generate the docs array and run the tests...

static void Main(string[] args)
{
    string[] docs = new string[50000];

    for (int i = 0; i < docs.Length; i++)
    {
        docs[i] = "a man a plan a canal panama";
    }

    // warm up (don't time this)
    GenerateTermsOriginal(docs);

    Stopwatch sw = new Stopwatch();
    sw.Restart();
    var t1 = GenerateTermsOriginal(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    var t2 = GenerateTermsLinq(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    var t3 = GenerateTermsDictionary(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");
}
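
The harness above warms up only the first method and times each candidate once, so the figures may include JIT compilation for the other two. A minimal sketch of a fairer timer, assuming a best-of-several-runs measurement is acceptable, could look like this. TimingSketch and BestOfMilliseconds are hypothetical names, not part of the original post.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: runs the action once untimed, then returns the
    // fastest of several timed runs, in milliseconds.
    public static double BestOfMilliseconds(Action action, int runs = 5)
    {
        action(); // warm-up run so JIT compilation is not measured

        double best = double.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            best = Math.Min(best, sw.Elapsed.TotalMilliseconds);
        }

        return best;
    }
}
```

It would be called as TimingSketch.BestOfMilliseconds(() => GenerateTermsLinq(docs)) for each candidate. Note also that every test document here contains the same five distinct words, so the results say little about inputs with thousands of distinct terms per document.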



