English 中文(简体)
Aggregating SQL rows with precedence
原标题:

I have a table full of items from different sources. Some of the sources might have the same location (in my example, different BBC news feeds would be different sources, but they all come from the BBC). Each item has a "unique" ID which can be used to identify it among others from the same location. This means that items relating to the same news story on a site but published under different feeds will have the same "unique ID", but that s not necessarily globally unique.

The problem is that I want to eliminate duplicates at display time, so that (depending which feeds you re seeing) you only get at most one version of each story, even though two or three of your feeds might contain links to it.

I have a sources table with information about each source, and location_id and location_precedence fields. I then have an items table that contains each item, its unique_id, source_id, and content. Items with the same unique_id and source location_id should appear at most once, with the highest source location_precedence winning.

I would have thought that something like:

SELECT `sources`.`name` AS `source`,
       `items`.`content`,
       `items`.`published`
FROM `items` INNER JOIN `sources`
  ON `items`.`source_id` = `sources`.`id` AND `sources`.`active` = 1
GROUP BY `items`.`unique_id`, `sources`.`location_id`
ORDER BY `sources`.`location_priority` DESC

would do the trick, but that seems to ignore the location priority field. What have I missed?


Example data:

CREATE TABLE IF NOT EXISTS `sources` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `location_id` int(10) unsigned NOT NULL,
  `location_priority` int(11) NOT NULL,
  `active` tinyint(1) unsigned NOT NULL default  1 ,
  `name` varchar(150) NOT NULL,
  `url` text NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `active` (`active`)
);

INSERT INTO `sources` (`id`, `location_id`, `location_priority`, `active`, `name`, `url`) VALUES
(1, 1, 25, 1,  BBC News Front Page ,  http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml ),
(2, 1, 10, 1,  BBC News England ,  http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml ),
(3, 1, 15, 1,  BBC Technology News ,  http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/technology/rss.xml ),
(4, 2, 0, 1,  Slashdot ,  http://rss.slashdot.org/Slashdot/slashdot ),
(5, 3, 0, 1,  The Daily WTF ,  http://syndication.thedailywtf.com/TheDailyWtf );

CREATE TABLE IF NOT EXISTS `items` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `source_id` int(10) unsigned NOT NULL,
  `published` datetime NOT NULL,
  `content` text NOT NULL,
  `unique_id` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `unique_id` (`unique_id`,`source_id`),
  KEY `published` (`published`),
  KEY `source_id` (`source_id`)
);

INSERT INTO `items` (`id`, `source_id`, `published`, `content`, `unique_id`) VALUES
(1,  1,  2009-12-01 16:25:53 ,  Story about Subject One ,                      abc ),
(2,  2,  2009-12-01 16:21:31 ,  Subject One in story ,                         abc ),
(3,  3,  2009-12-01 16:17:20 ,  Techy goodness ,                               def ),
(4,  2,  2009-12-01 16:05:57 ,  Further updates on Foo case ,                  ghi ),
(5,  3,  2009-12-01 15:53:39 ,  Foo, Bar and Quux in court battle ,            ghi ),
(6,  2,  2009-12-01 15:52:02 ,  Anti-Fubar protests cause disquiet ,           mno ),
(7,  4,  2009-12-01 15:39:00 ,  Microsoft Bleh meets lukewarm reception ,      pqr ),
(8,  5,  2009-12-01 15:13:45 ,  Ever thought about doing it in VB? ,           pqr ),
(9,  1,  2009-12-01 15:13:15 ,  Celebrity has 'new friend' ,         pqr ),
(10, 1,  2009-12-01 15:09:57 ,  Microsoft launches Bleh worldwide ,            stu ),
(11, 2,  2009-12-01 14:57:22 ,  Microsoft launches Bleh in UK ,                stu ),
(12, 3,  2009-12-01 14:57:22 ,  Microsoft launches Bleh ,                      stu ),
(13, 3,  2009-12-01 14:42:15 ,  Tech round-up ,                                vwx ),
(14, 2,  2009-12-01 14:36:26 ,  Estates 'old news' say government ,  yza ),
(15, 1,  2009-12-01 14:15:21 ,  Iranian doctor 'was poisoned' ,      bcd ),
(16, 4,  2009-12-01 14:14:02 ,  Apple fans overjoyed by iBlah ,                axf );

Expected content after query:

  • Story about Subject One
  • Techy goodness
  • Foo, Bar and Quux in court battle
  • Anti-Fubar protests cause disquiet
  • Microsoft Bleh meets lukewarm reception
  • Ever thought about doing it in VB?
  • Celebrity has new friend
  • Microsoft launches Bleh worldwide
  • Tech round-up
  • Estates old news say government
  • Iranian doctor was poisoned
  • Apple fans overjoyed by iBlah

I ve tried a variation on the solution by Andomar, with some success:

SELECT      s.`name` AS `source`,
            i.`content`,
            i.`published`
FROM        `items` i
INNER JOIN  `sources` s
ON          i.`source_id` = s.`id`
AND         s.`active` = 1
INNER JOIN (
  SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
  FROM `items` i
  INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
  GROUP BY `location_id`, `unique_id`
) `filter`
ON          i.`unique_id` = `filter`.`unique_id`
AND         s.`location_priority` = `filter`.`prio`
ORDER BY    i.`published` DESC
LIMIT 50

With AND s.location_priority = filter.prio things almost work as I want. Because an item can come from multiple sources with the same priority, items can be repeated. In this case, an extra GROUP BY i.unique_id on the outer query does the job, and I suppose it doesn t matter which source "wins" if priorities are equal.

I had tried with AND i.source_id = filter.source_id instead, which almost works (i.e. eliminates the extra GROUP BY) but doesn t give results from the right sources. In the example above, it gives me "Further updates on Foo case" (source "BBC News England") rather than "Foo, Bar and Quux in court battle" (source "BBC Technology News". Looking at the results of the inner query, I get:

unique_id:  ghi 
source_id: 2
prio: 15

Note that the source ID is not correct (expected: 3).

最佳回答

Order by merely orders the rows, it doesn t pick among them.

One of the ways to filter out rows with a lower location_priority is to use an inner join as filter:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
INNER JOIN (
    SELECT unique_id, max(location_priority) as prio
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id) filter
ON         i.unique_id = filter.unique_id
AND        s.location_priority = filter.prio;

An alternative is a where ... in <subquery> clause, for example:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
WHERE      (i.unique_id, s.location_priority) IN (
    SELECT unique_id, max(location_priority)
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id
);

This problem is also known as "Selecting records holding a group-wide maximum." Quassnoi has written a nice article on it.

EDIT: One way to break ties with multiple sources at the same priority is a WHERE clause with a subquery. This example breaks ties on i.id DESC:

SELECT     s.name, i.unique_id, i.content, i.published
FROM       (
           SELECT unique_id, min(location_priority) as prio
           FROM items i
           INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
           GROUP BY unique_id
           ) filter
JOIN       items i
JOIN       sources s
ON         s.id = i.source_id 
           AND s.active = 1
WHERE      i.id =
           (
           SELECT   i.id
           FROM     items i
           JOIN     sources s 
           ON       s.id = i.source_id 
                    AND s.active = 1
           WHERE    i.unique_id = filter.unique_id
           AND      s.location_priority = filter.prio
           ORDER BY i.id DESC
           LIMIT 1
           )

Quassnoi also has an article on selecting records holding group-wise maximum (resolving ties) :)

问题回答

do a self join to a derived table like

select max(location_priority) from table where ...

What have I missed?

The ORDER BY happens after the GROUP BY has already reduced each group to a single row. Paul gives one resolution.

As for the problem with the query:

SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
FROM `items` i
INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
GROUP BY `location_id`, `unique_id`

source_id is neither an aggregate nor grouped. As a result, which value you get is indeterminate.





相关问题
SQL SubQuery getting particular column

I noticed that there were some threads with similar questions, and I did look through them but did not really get a convincing answer. Here s my question: The subquery below returns a Table with 3 ...

please can anyone check this while loop and if condition

<?php $con=mysql_connect("localhost","mts","mts"); if(!con) { die( unable to connect . mysql_error()); } mysql_select_db("mts",$con); /* date_default_timezone_set ("Asia/Calcutta"); $date = ...

php return a specific row from query

Is it possible in php to return a specific row of data from a mysql query? None of the fetch statements that I ve found return a 2 dimensional array to access specific rows. I want to be able to ...

Character Encodings in PHP and MySQL

Our website was developed with a meta tag set to... <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> This works fine for M-dashes and special quotes, etc. However, I ...

Pagination Strategies for Complex (slow) Datasets

What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don t want to hit the DB for every page view. Currently there are ~...

Averaging a total in mySQL

My table looks like person_id | car_id | miles ------------------------------ 1 | 1 | 100 1 | 2 | 200 2 | 3 | 1000 2 | 4 | 500 I need to ...

热门标签