English 中文(简体)
Sorting a very large text file in Java
原标题:

I have a large text file I need to sort in Java. The format is:

word [tab] frequency [new line]

The algorithm for sorting is:

  • Read some of the file, filtering for purlely alphabetic words.
  • Once you have X number of alphabetic words, call Collections.sort and write the result to a file.
  • Repeat until you have finished reading the file.
  • Start reading two sorted files, comparing line by line for the word with higher frequency, and writing at the same time to a new file as to not load much into your memory
  • Repeat until all files are merged into one large file

Right now I ve divided the large file into smaller ones (sorted by descending frequency) with 10,000 lines each. I know I need to somehow merge these files back together, but I m not sure how to go about this.

I ve created a LinkedList to keep track of all the files created. The algorithm says to compare each line in the two files, except I ve tried a case where , say file1 = 8,6,5,3,1 and file2 = 9,8,8,8,8. Then if I compare them line by line I would get file3 = 9,8,8,6,8,5,8,3,8,1 which is incorrectly sorted (they should be in decreasing order).

I think I m misunderstanding some part of the algorithm. If someone could point out what I should do instead, I d greatly appreciate it. Thanks.

edit: Yes this is an assignment. We aren t allowed to increase memory unfortunately :(

最佳回答

You have the right idea, but with a small error. As you read the lines from the 2 files, you shouldn t be outputting both lines, because the next line in the file with the greater number might still be greater than the first line in the file with the smaller number (as it is in your test case).

So, it s quite simply this:

Read a line from each file to start.
Then repeat this:
.The line with the highest value is written to a new file
.Read another line from that file only

This is the basic algorithm, but of course you have to allow for what happens when one of the files runs out (in which case you just read lines and output from the remaining file - whether this is a separate loop or part of the same loop is up to you - I would look at what the code looks like before making that decision).

问题回答

If the file is too large to fit into memory, use a database. Something like MySQL might be too heavyweight, but there are embeddable databases you can use in java.

One of them is berkely DB which is a Key/value database system.

Apache Derby is a relational database system that lets you use SQL.

If you already know SQL, derby might be the easiest way to go. I haven t used it myself.





相关问题
Spring Properties File

Hi have this j2ee web application developed using spring framework. I have a problem with rendering mnessages in nihongo characters from the properties file. I tried converting the file to ascii using ...

Logging a global ID in multiple components

I have a system which contains multiple applications connected together using JMS and Spring Integration. Messages get sent along a chain of applications. [App A] -> [App B] -> [App C] We set a ...

Java Library Size

If I m given two Java Libraries in Jar format, 1 having no bells and whistles, and the other having lots of them that will mostly go unused.... my question is: How will the larger, mostly unused ...

How to get the Array Class for a given Class in Java?

I have a Class variable that holds a certain type and I need to get a variable that holds the corresponding array class. The best I could come up with is this: Class arrayOfFooClass = java.lang....

SQLite , Derby vs file system

I m working on a Java desktop application that reads and writes from/to different files. I think a better solution would be to replace the file system by a SQLite database. How hard is it to migrate ...

热门标签