English 中文(简体)
MySQL query puzzle - finding what WOULD have been the most recent date
原标题:

I ve looked all over and haven t yet found an intelligent way to handle this, though I feel sure one is possible:

One table of historical data has quarterly information:

CREATE TABLE Quarterly (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
PRIMARY KEY (unique_ID));

Another table of historical data (which is very large) contains daily information:

CREATE TABLE Daily (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
qtr_ID INT UNSIGNED,
PRIMARY KEY (unique_ID));

The qtr_ID field is not part of the feed of daily data that populated the database - instead, I need to retroactively populate the qtr_ID field in the Daily table with the Quarterly.unique_ID row ID, using what would have been the most recent quarterly data on that Daily.date_posted for that data source.

For example, if the quarterly data is

101 2009-03-31 1 4.5
102 2009-06-30 1 4.4
103 2009-03-31 2 7.6
104 2009-06-30 2 7.7
105 2009-09-30 1 4.7

and the daily data is

1001 2009-07-14 1 3.5 ??
1002 2009-07-15 1 3.4 &&
1003 2009-07-14 2 2.3 ^^

then we would want the ?? qtr_ID field to be assigned 102 as the most recent quarter for that data source on that date, and && would also be 102 , and ^^ would be 104 .

The challenges include that both tables (particularly the daily table) are actually very large, they can t be normalized to get rid of the repetitive dates or otherwise optimized, and for certain daily entries there is no preceding quarterly entry.

I have tried a variety of joins, using datediff (where the challenge is finding the minimum value of datediff greater than zero), and other attempts but nothing is working for me - usually my syntax is breaking somewhere. Any ideas welcome - I ll execute any basic ideas or concepts and report back.

最佳回答

Just subquery for the quarter id using something like:

(
 SELECT unique_ID 
 FROM Quarterly 
 WHERE 
     datasource = ? 
     AND date_posted >= ? 
 ORDER BY
     unique_ID ASC
 LIMIT 1
)

Of course, this probably won t give you the best performance, and it assumes that dates are added to Quarterly sequentially (otherwise order by date_posted). However, it should solve your problem.

You would use this subquery on your INSERT or UPDATE statements as the value of your qtr_ID field for your Daily table.

问题回答

The following appears to work exactly as intended but it surely is ugly (with three calls to the same DATEDIFF!!), perhaps by seeing a working query someone might be able to further reduce it or improve it:

UPDATE Daily SET qtr_ID = (select unique_ID from Quarterly
WHERE Quarterly.datasource = Daily.datasource AND
DATEDIFF(Daily.date_posted, Quarterly.date_posted) = 
(SELECT MIN(DATEDIFF(Daily.date_posted, Quarterly.date_posted)) from Quarterly
WHERE Quarterly.datasource = Daily.datasource AND
DATEDIFF(Daily.date_posted, Quarterly.date_posted) > 0));

After more work on this query, I ended up with enormous performance improvements over the original concept. The most important improvement was to create indices in both the Daily and Quarterly tables - in Daily I created indices on (datasource, date_posted) and (date_posted, datasource) USING BTREE and on (datasource) USING HASH, and in Quarterly I did the same thing. This is overkill but it made sure I had an option that the query engine could use. That reduced the query time to less than 1% of what it had been. (!!)

Then, I learned that given my particular circumstances I could use MAX() instead of ORDER BY and LIMIT so I use a call to MAX() to get the appropriate unique_ID. That reduced the query time by about 20%.

Finally, I learned that with the InnoDB storage engine I could segment the chunk of the Daily table that I was updating with any one query, which allowed me to multi-thread the queries with a little elbow-grease and scripting. The parallel processing worked well and every thread reduced the query time linearly.

So, the basic query that is performing literally 1000 times better than my own first attempt is:

UPDATE Daily
SET qtr_ID =
(
  SELECT MAX(unique_ID)
  FROM Quarterly
  WHERE Daily.datasource = Quarterly.datasource AND
        Daily.date_posted > Quarterly.dateposted
)
WHERE unique_ID > ScriptVarLowerBound AND
      unique_ID <= ScriptVarHigherBound
;




相关问题
SQL SubQuery getting particular column

I noticed that there were some threads with similar questions, and I did look through them but did not really get a convincing answer. Here s my question: The subquery below returns a Table with 3 ...

please can anyone check this while loop and if condition

<?php $con=mysql_connect("localhost","mts","mts"); if(!con) { die( unable to connect . mysql_error()); } mysql_select_db("mts",$con); /* date_default_timezone_set ("Asia/Calcutta"); $date = ...

php return a specific row from query

Is it possible in php to return a specific row of data from a mysql query? None of the fetch statements that I ve found return a 2 dimensional array to access specific rows. I want to be able to ...

Character Encodings in PHP and MySQL

Our website was developed with a meta tag set to... <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> This works fine for M-dashes and special quotes, etc. However, I ...

Pagination Strategies for Complex (slow) Datasets

What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don t want to hit the DB for every page view. Currently there are ~...

Averaging a total in mySQL

My table looks like person_id | car_id | miles ------------------------------ 1 | 1 | 100 1 | 2 | 200 2 | 3 | 1000 2 | 4 | 500 I need to ...

热门标签