Storing and accessing large amounts of data
  • Date: 2009-11-15 22:43:02
  • Tags: mongodb

My application creates pieces of data that, in xml, would look like this:

<resource url="someurl">
   <term>
      <name>somename</name>
      <frequency>somenumber</frequency>
   </term>    
   ...
   ...
   ...
</resource>

This is how I'm storing these "resources" now: one resource per XML file, with as many "term" elements per resource as needed. The problem is that I'll need to generate about 2 million of these resources. I've generated almost 500,000 so far, and my Mac isn't very happy about it. So my question is: how should I store this data?

  • A database? That would be hard, because the structure of the data isn't fixed...
  • Maybe merge some resources into larger XML files?
  • ...?

I don't need to change the data once it's created. Right now I'm accessing a specific resource by the name of that resource's file.

Any suggestions are greatly appreciated!

Best answer

Not all databases are relational. Have a look at MongoDB, for example. It stores your data as JSON-like documents, similar to your resources.

An example using the shell:

$ mongo
> db.resources.save({url: "someurl", 
                     terms: [{name: "name1", frequency: 17.0},
                             {name: "name2", frequency: 42.0}]})
> db.resources.find()
{"_id" :  ObjectId( "4b00884b3a77b8b2fa3a8f77"), 
 "url" : "someurl" , 
 "terms" : [{"name" : "name1" , "frequency" : 17},
            {"name" : "name2" , "frequency" : 42}]}
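
To get from the existing XML files to documents of that shape, something like the following might work. It is only a sketch using Python's standard library: the sample XML, the function name `resource_to_document` and the choice to store frequencies as floats are my assumptions, and actually inserting the result into MongoDB would need a driver, which is not shown here.

```python
import json
import xml.etree.ElementTree as ET

# Sample input in the question's format; the names and numbers are made up.
SAMPLE_XML = """
<resource url="someurl">
   <term>
      <name>name1</name>
      <frequency>17</frequency>
   </term>
   <term>
      <name>name2</name>
      <frequency>42</frequency>
   </term>
</resource>
"""


def resource_to_document(xml_text):
    """Turn one <resource> element into a dict shaped like the document above."""
    root = ET.fromstring(xml_text.strip())
    terms = [
        {
            "name": term.findtext("name"),
            # Assumption: frequencies are numeric, so store them as floats.
            "frequency": float(term.findtext("frequency")),
        }
        for term in root.findall("term")
    ]
    return {"url": root.get("url"), "terms": terms}


document = resource_to_document(SAMPLE_XML)
print(json.dumps(document, indent=2))
```

Because the list of terms can have any length, the varying number of terms per resource does not need a fixed schema.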

Other answers

If you can't predict how your data is going to be organized, CouchDB (http://couchdb.apache.org/) may be interesting for you. It is a schema-less database.

Anyway, XML may not be the best choice for a large amount of data.

Maybe JSON or YAML would work out better? They need less space and are easier to parse. (I have no experience using those formats at a larger scale, though, so I may be wrong.)
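
One way to check the space claim on your own data is to serialize the same resource both ways and compare the byte counts. The sketch below uses made-up terms and Python's standard library; the JSON field names (`url`, `terms`, `name`, `frequency`) are an assumption modelled on the XML in the question.

```python
import json
import xml.etree.ElementTree as ET

# Made-up terms for one resource; real data would come from the application.
terms = [("name%d" % i, i) for i in range(1, 101)]

# Build the XML form used in the question (no pretty-printing whitespace).
resource = ET.Element("resource", url="someurl")
for name, frequency in terms:
    term = ET.SubElement(resource, "term")
    ET.SubElement(term, "name").text = name
    ET.SubElement(term, "frequency").text = str(frequency)
xml_bytes = ET.tostring(resource, encoding="unicode").encode("utf-8")

# Build the equivalent JSON form, also without extra whitespace.
document = {
    "url": "someurl",
    "terms": [{"name": n, "frequency": f} for n, f in terms],
}
json_bytes = json.dumps(document, separators=(",", ":")).encode("utf-8")

print("XML bytes: ", len(xml_bytes))
print("JSON bytes:", len(json_bytes))
```

The difference comes mostly from XML repeating each tag name in its closing tag; how much it matters depends on how long the term names are compared to the markup.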

You should definitely put several resources in each XML file, but only if you expect to need all of those resources together at the same time. If you only need to send a handful of resources to anybody, then keep generating the individual XML files.

Even in that situation, you could keep the large XML file and generate the smaller ones on demand from the original dataset.
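
Generating a small file on demand from a merged one could look roughly like this. It is a sketch only: the `<resources>` wrapper element, the sample URLs and the function name `extract_resource` are assumptions, since the question only defines `<resource>`. It streams through the file with `iterparse` so the whole file never has to be loaded at once.

```python
import io
import xml.etree.ElementTree as ET

# A small stand-in for the merged file. The <resources> wrapper element is an
# assumption; the question only defines <resource>.
MERGED_XML = b"""<resources>
  <resource url="url-a"><term><name>x</name><frequency>1</frequency></term></resource>
  <resource url="url-b"><term><name>y</name><frequency>2</frequency></term></resource>
</resources>"""


def extract_resource(source, url):
    """Scan a merged file and return the XML text of one resource, or None."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "resource":
            continue
        if elem.get("url") == url:
            return ET.tostring(elem, encoding="unicode").strip()
        # Drop the contents of resources we skip so memory use stays low.
        elem.clear()
    return None


single = extract_resource(io.BytesIO(MERGED_XML), "url-b")
print(single)
```

Note that this is a linear scan of the file, so each lookup gets slower as the merged file grows. With millions of resources you would probably want several medium-sized files or some kind of index.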

Using a database like SQLite3 would allow you to have faster seek times and easier manipulation of the data, using SQL syntax.
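
As a rough illustration, here is how the data might be laid out using Python's built-in `sqlite3` module. The table and column names, and the helper functions `add_resource` and `get_terms`, are made up for this sketch. Putting the terms in their own table is one way to cope with a different number of terms per resource.

```python
import sqlite3

# In-memory database for the sketch; a real run would pass a file path.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resources (
        id  INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE
    );
    CREATE TABLE terms (
        resource_id INTEGER NOT NULL REFERENCES resources(id),
        name        TEXT NOT NULL,
        frequency   REAL NOT NULL
    );
    CREATE INDEX terms_by_resource ON terms(resource_id);
""")


def add_resource(conn, url, terms):
    """Insert one resource and its (name, frequency) pairs."""
    cur = conn.execute("INSERT INTO resources (url) VALUES (?)", (url,))
    conn.executemany(
        "INSERT INTO terms (resource_id, name, frequency) VALUES (?, ?, ?)",
        [(cur.lastrowid, name, freq) for name, freq in terms],
    )


def get_terms(conn, url):
    """Return the terms of one resource, looked up by URL."""
    rows = conn.execute(
        "SELECT t.name, t.frequency FROM terms t "
        "JOIN resources r ON r.id = t.resource_id "
        "WHERE r.url = ? ORDER BY t.rowid",
        (url,),
    )
    return rows.fetchall()


add_resource(conn, "someurl", [("name1", 17), ("name2", 42)])
conn.commit()
print(get_terms(conn, "someurl"))
```

The `UNIQUE` constraint on `url` gives SQLite an index to use when looking a resource up by URL, which replaces the current lookup by file name.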




