English 中文(简体)
Something like HashMap but sorted?
原标题:
  • 时间:2009-12-01 19:53:32
  •  标签:
  • java
  • hashmap

I m writing a Java program that parses all the words from a text file and then adds them to a HashMap. I need to count how many distinct words are contained in the file. I also need to figure out the highest counted words. The HashMap is comprised of each word mapped to an integer which represents how many times the word occurs.

Is there something like HashMap that will help me sort this?

问题回答

The Manual way to do it is as follows:

  • Create a composite WordCount class with word and count fields.
  • Create a Comparator for that class that sorts by count.
  • When you re done filling your HashMap, create a new List of WordCount objects created from values in the HashMap.
  • Sort the List using your comparator.

You could use a HashMultiset from google-collections:

import com.google.common.collect.*;
import com.google.common.collect.Multiset.Entry;

...

  final Multiset<String> words = HashMultiset.create();
  words.addAll(...);

  Ordering<Entry<String>> byIncreasingCount = new Ordering<Entry<String>>() {
    @Override public int compare(Entry<String> a, Entry<String> b) {
      // safe because count is never negative
      return left.getCount() - right.getCount();
    }
  });

  Entry<String> maxEntry = byIncreasingCount.max(words.entrySet())
  return maxEntry.getElement();

EDIT: oops, I thought you wanted only the single most common word. But it sounds like you want the several most common -- so, you could replace max with sortedCopy and now you have a list of all the entries in order.

To find the number of distinct words: words.elementSet().size()

If you want to sort the Map by word then TreeMap is the Java built-in answer. You can either make sure your Word objects are Comparable or supply a custom Comparator.

SortedMap<Word,Integer> map = new TreeMap<Word,Integer>();
...
for all words {
    Integer count = map.get(word);
    if (count == null ) count = 0;
    map.put(word, count+1);
}

If you want to sort by frequency then you will be better off doing this after all of the words have been counted. Sorted collections don t take kindly to having their ordering messed up through external changes. Sorting by frequency requires a composite word + count object as others have posted.

Here is a Groovy version of the most popular answer to this question:

List leastCommon(Multiset myMultiset, Integer quantity)
{

    Ordering<Multiset.Entry<String>> byIncreasingCount = new Ordering<Multiset.Entry<String>>() {
      @Override public int compare(Multiset.Entry<String> a, Multiset.Entry<String> b) {
          return a.getCount() - b.getCount() }
    }

    maxIndex = Math.min(quantity, myMultiset.entrySet().size() - 1)
    return byIncreasingCount.sortedCopy(myMultiset.entrySet()).subList(0, maxIndex)

}

List mostCommon(Multiset myMultiset, Integer quantity)
{

    Ordering<Multiset.Entry<String>> byDecreasingCount = new Ordering<Multiset.Entry<String>>() {
      @Override public int compare(Multiset.Entry<String> a, Multiset.Entry<String> b) {
          return b.getCount() - a.getCount() }
    }

    maxIndex = Math.min(quantity, myMultiset.entrySet().size() - 1)
    return byDecreasingCount.sortedCopy(myMultiset.entrySet()).subList(0, maxIndex)

}

It looks like the TreeBag class from the commons collections library might do what you want. It keeps track of how many copies of an object are added to the bag, and sorts them in ascending order of count. To get the highest count item just call the last() method. One thing to be aware of is that the commons collections stuff hasn t been updated to using generics yet, so you might get a ton of compiler warnings using it.

For the count, stuff the words in a Set and count the size when done.

For the highest, iterate over all Entries and hold on to the key with the highest value.

Have you checked out java.util.PriorityQueue? A PriorityQueue is basically a list with a priority mapped to each element (implemented by a non-synchronized priority heap). Every time you read in a new string you could add it in or increase it s priority by 1 if it is already present (logarithmic time). The is present check is in linear time, and in the end this will be really easy to use. To get the numbers that appear with the most frequency just poll() for each one when you are done!

edit The standard PriorityQueue doesn t allow you to edit priority directly since it requires a comparator. You d be better off with a simple Hash implementation or something like this

  • YourBean implements Comparable<YourBean>
  • method compareTo : order by number of word
  • TreeMap instead of hashmap




相关问题
Spring Properties File

Hi have this j2ee web application developed using spring framework. I have a problem with rendering mnessages in nihongo characters from the properties file. I tried converting the file to ascii using ...

Logging a global ID in multiple components

I have a system which contains multiple applications connected together using JMS and Spring Integration. Messages get sent along a chain of applications. [App A] -> [App B] -> [App C] We set a ...

Java Library Size

If I m given two Java Libraries in Jar format, 1 having no bells and whistles, and the other having lots of them that will mostly go unused.... my question is: How will the larger, mostly unused ...

How to get the Array Class for a given Class in Java?

I have a Class variable that holds a certain type and I need to get a variable that holds the corresponding array class. The best I could come up with is this: Class arrayOfFooClass = java.lang....

SQLite , Derby vs file system

I m working on a Java desktop application that reads and writes from/to different files. I think a better solution would be to replace the file system by a SQLite database. How hard is it to migrate ...

热门标签