We're thinking about setting up a data warehouse system loaded with the web access logs that our web servers generate. The idea is to load the data in real time.
We want to present a line graph of the data to the user and let them drill down using the dimensions.
The question is how to balance and design the system so that:
(1) the data can be fetched and presented to the user in real time (< 2 seconds),
(2) the data can be aggregated on a per-hour and per-day basis, and
(3) as much data as possible can still be stored in the warehouse.
Our current data rate is roughly 10 accesses per second, which gives us around 800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds once we have more than 8 million rows.
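For reference, the kind of per-hour query behind the line graph looks roughly like this (a minimal sketch; the table and column names are illustrative and match the schema sketch in the update below, not our exact production schema):

    -- Rough shape of a drill-down query: hits and bytes per hour for one server.
    SELECT
        f.timestamp - MOD(f.timestamp, 3600) AS hour_bucket,  -- start of hour (unix time)
        COUNT(*)                             AS hits,
        SUM(f.bytes_transmitted)             AS bytes
    FROM access_fact f
    JOIN dim_server s ON s.server_id = f.server_id
    WHERE s.name = 'www1'
      AND f.timestamp >= UNIX_TIMESTAMP('2010-06-01')
      AND f.timestamp <  UNIX_TIMESTAMP('2010-06-08')
    GROUP BY hour_bucket
    ORDER BY hour_bucket;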
Is it possible to get real-time query performance from a "simple" data warehouse like this and still have it store a lot of data (ideally never throwing any data away)?
Are there ways to aggregate the data into coarser-resolution (per-hour / per-day) summary tables?
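One approach I've been considering is a periodic rollup along these lines (a minimal sketch; the summary table, its names, and the @last_rollup / @this_rollup window variables are my own assumptions):

    -- Hourly summary keyed on hour plus the dimensions we still want to drill into.
    -- (The client dimension is dropped here to keep the summary small.)
    CREATE TABLE access_fact_hourly (
        hour_ts   INT UNSIGNED    NOT NULL,  -- start of hour, unix timestamp
        server_id INT UNSIGNED    NOT NULL,
        url_id    INT UNSIGNED    NOT NULL,
        hits      INT UNSIGNED    NOT NULL,
        bytes     BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (hour_ts, server_id, url_id)
    );

    -- Run e.g. once an hour: fold the new raw rows into the summary.
    -- @last_rollup / @this_rollup mark the time window processed in this pass.
    INSERT INTO access_fact_hourly (hour_ts, server_id, url_id, hits, bytes)
    SELECT
        timestamp - MOD(timestamp, 3600) AS hour_ts,
        server_id,
        url_id,
        COUNT(*),
        SUM(bytes_transmitted)
    FROM access_fact
    WHERE timestamp >= @last_rollup AND timestamp < @this_rollup
    GROUP BY hour_ts, server_id, url_id
    ON DUPLICATE KEY UPDATE
        hits  = hits  + VALUES(hits),
        bytes = bytes + VALUES(bytes);

The obvious trade-off is that a summary like this only supports drill-down on the dimensions it keeps, so the raw fact table would still be needed for per-client questions.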
I have a feeling this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm asking for too much.
UPDATE
My schema looks like this:
dimensions:
- client (ip-address)
- server
- url
facts:
- timestamp (in seconds)
- bytes transmitted
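In MySQL terms that would be something like the following (a rough sketch; the names, types, and indexes are assumptions for illustration, not the tested design):

    CREATE TABLE dim_client (
        client_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ip_address VARCHAR(45)  NOT NULL,              -- fits IPv4 and IPv6 text form
        UNIQUE KEY (ip_address)
    );

    CREATE TABLE dim_server (
        server_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(255) NOT NULL,
        UNIQUE KEY (name)
    );

    CREATE TABLE dim_url (
        url_id INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url    VARCHAR(2048) NOT NULL,
        KEY (url(255))                                 -- prefix index, URLs can be long
    );

    CREATE TABLE access_fact (
        timestamp         INT UNSIGNED NOT NULL,       -- unix time, second resolution
        client_id         INT UNSIGNED NOT NULL,
        server_id         INT UNSIGNED NOT NULL,
        url_id            INT UNSIGNED NOT NULL,
        bytes_transmitted INT UNSIGNED NOT NULL,
        KEY (timestamp),
        KEY (server_id, timestamp)
    );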