English 中文(简体)
Java large datastructure for storing a matrix
原标题:

I need to store a 2d matrix containing zip codes and the distance in km between each one of them. My client has an application that calculates the distances which are then stored in an Excel file. Currently, there are 952 places. So the matrix would have 952x952 = 906304 entries.

I tried to map this into a HashMap[Integer, Float]. The Integer is the hash code of the two Strings for two places, e.g. "A" and "B". The float value is the distance in km between them.

While filling in the data I run into OutOfMemoryExceptions after 205k entries. Do you have a tip how I can store this in a clever way? I even don t know if it s clever to have the whole bunch in memory. My options are SQL and MS Access...

The problem is that I need to access the data very quickly and possibly very often which is why I chose the HashMap because it runs in O(1) for the look up.

Thansk for your replies and suggestions!

Marco

最佳回答

Can you simply boost the memory available to the JVM ?

java -Xmx512m ...

By default the maximum memory configuration is 64Mb. Some more tuning tips here. If you can do this then you can keep the data in-process and maximise the performance (i.e. you don t need to calculate on the fly).

问题回答

A 2d array would be more memory efficient. You can use a small hashmap to map the 952 places into a number between 0 and 951 . Then, just do:

float[][] distances= new float[952][952];

To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array.

By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap.

However, 906304 really isn t that many entries, you may just need to increase the Xmx maximum heap size

I would have thought that you could calculate the distances on the fly. Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; e.g. longitude/latitude of the notional centres of each ZIP code.

EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs.

  • The Vicenty formula is based on an ellipsoid approximation. It is more accurate, but more complicated to implement.

  • The Haversine formula is based on a spherical approximation. It is less accurate (0.3%), but simpler to implement.

I upvoted Chi s and Benjamin s answers, because they re telling you what you need to do, but while I m here, I d like to stress that using the hashcode of the two strings directly will get you into trouble. You re likely to run into the problem of hash collisions.

This would not be a problem if you were concatenating the two strings (being careful to use a delimiter which cannot appear in the place designators), and letting HashMap do its magic, but the method you suggested, using the hashcodes for the two strings as a key, that s going to get you into trouble.

You will simply need more memory. When starting your Java process, kick it off like so:

java -Xmx256M MyClass

The -Xmx defines the max heap size, so this says the process can use up to 256 MB of memory for the heap. If you still run out, keep bumping that number up until you hit the physical limit.

Lately I ve managed similar requisites for my master thesis.

I ended with a Matrix class that uses a double[], not a double[][], in order to alleviate double deref costs (data[i] that is an array, then array[i][j] that is a double) while allowing the VM to allocate a big, contiguous chunk of memory:

public class Matrix {

    private final double data[];
    private final int rows;
    private final int columns;

    public Matrix(int rows, int columns, double[][] initializer) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];

        int k = 0;

        for (int i = 0; i < initializer.length; i++) {
            System.arraycopy(initializer[i], 0, data, k, initializer[i].length);
            k += initializer[i].length;
        }
    }

    public Matrix set(int i, int j, double value) {
        data[j + i * columns] = value;
        return this;
    }

    public double get(int i, int j) {
        return data[j + i * columns];
    }
}

this class should use less memory than an HashMap since it uses a primitive array (no boxing needed): it needs only 906304 * 8 ~ 8 Mb (for doubles) or 906304 * 4 ~ 4 Mb (for floats). My 2 cents.

NB I ve omitted some sanity checks for simplicity s sake

Stephen C. has a good point: if the distances are as-the-crow-flies, then you could probably save memory by doing some calculations on the fly. All you d need is space for the longitude and latitude for 952 zip codes and then you could use the vicenty formula to do your calculation when you need to. This would make your memory usage O(n) in zipcodes.

Of course, that solution makes some assumptions that may turn out to be false in your particular case, i.e. that you have longitude and latitude data for your zipcodes and that you re concerned with as-the-crow-flies distances and not something more complicated like driving directions.

If those assumptions are true though, trading a few computes for a whole bunch of memory might help you scale in the future if you ever need to handle a bigger dataset.

The above suggestions regarding heap size will be helpful. However, I am not sure if you gave an accurate description of the size of your matrix.

Suppose you have 4 locations. Then you need to assess the distances between A->B, A->C, A->D, B->C, B->D, C->D. This suggests six entries in your HashMap (4 choose 2).

That would lead me to believe the actual optimal size of your HashMap is (952 choose 2)=452,676; NOT 952x952=906,304.

This is all assuming, of course, that you only store one-way relationships (i.e. from A->B, but not from B->A, since that is redundant), which I would recommend since you are already experiencing problems with memory space.

Edit: Should have said that the size of your matrix is not optimal, rather than saying the description was not accurate.

Create a new class with 2 slots for location names. Have it always put the alphabetically first name in the first slot. Give it a proper equals and hashcode method. Give it a compareTo (e.g. order alphabetically by names). Throw them all in an array. Sort it.

Also, hash1 = hash2 does not imply object1 = object2. Don t ever do this. It s a hack.





相关问题
Spring Properties File

Hi have this j2ee web application developed using spring framework. I have a problem with rendering mnessages in nihongo characters from the properties file. I tried converting the file to ascii using ...

Logging a global ID in multiple components

I have a system which contains multiple applications connected together using JMS and Spring Integration. Messages get sent along a chain of applications. [App A] -> [App B] -> [App C] We set a ...

Java Library Size

If I m given two Java Libraries in Jar format, 1 having no bells and whistles, and the other having lots of them that will mostly go unused.... my question is: How will the larger, mostly unused ...

How to get the Array Class for a given Class in Java?

I have a Class variable that holds a certain type and I need to get a variable that holds the corresponding array class. The best I could come up with is this: Class arrayOfFooClass = java.lang....

SQLite , Derby vs file system

I m working on a Java desktop application that reads and writes from/to different files. I think a better solution would be to replace the file system by a SQLite database. How hard is it to migrate ...

热门标签