I have a table full of items from different sources. Some of the sources might have the same location (in my example, different BBC news feeds would be different sources, but they all come from the BBC). Each item has a "unique" ID which can be used to identify it among others from the same location. This means that items relating to the same news story on a site but published under different feeds will have the same "unique ID", but that s not necessarily globally unique.
The problem is that I want to eliminate duplicates at display time, so that (depending which feeds you re seeing) you only get at most one version of each story, even though two or three of your feeds might contain links to it.
I have a sources
table with information about each source, and location_id
and location_precedence
fields. I then have an items
table that contains each item, its unique_id
, source_id
, and content
. Items with the same unique_id
and source location_id
should appear at most once, with the highest source location_precedence
winning.
I would have thought that something like:
SELECT `sources`.`name` AS `source`,
`items`.`content`,
`items`.`published`
FROM `items` INNER JOIN `sources`
ON `items`.`source_id` = `sources`.`id` AND `sources`.`active` = 1
GROUP BY `items`.`unique_id`, `sources`.`location_id`
ORDER BY `sources`.`location_priority` DESC
would do the trick, but that seems to ignore the location priority field. What have I missed?
Example data:
CREATE TABLE IF NOT EXISTS `sources` (
`id` int(10) unsigned NOT NULL auto_increment,
`location_id` int(10) unsigned NOT NULL,
`location_priority` int(11) NOT NULL,
`active` tinyint(1) unsigned NOT NULL default 1 ,
`name` varchar(150) NOT NULL,
`url` text NOT NULL,
PRIMARY KEY (`id`),
KEY `active` (`active`)
);
INSERT INTO `sources` (`id`, `location_id`, `location_priority`, `active`, `name`, `url`) VALUES
(1, 1, 25, 1, BBC News Front Page , http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml ),
(2, 1, 10, 1, BBC News England , http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml ),
(3, 1, 15, 1, BBC Technology News , http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/technology/rss.xml ),
(4, 2, 0, 1, Slashdot , http://rss.slashdot.org/Slashdot/slashdot ),
(5, 3, 0, 1, The Daily WTF , http://syndication.thedailywtf.com/TheDailyWtf );
CREATE TABLE IF NOT EXISTS `items` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`source_id` int(10) unsigned NOT NULL,
`published` datetime NOT NULL,
`content` text NOT NULL,
`unique_id` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_id` (`unique_id`,`source_id`),
KEY `published` (`published`),
KEY `source_id` (`source_id`)
);
INSERT INTO `items` (`id`, `source_id`, `published`, `content`, `unique_id`) VALUES
(1, 1, 2009-12-01 16:25:53 , Story about Subject One , abc ),
(2, 2, 2009-12-01 16:21:31 , Subject One in story , abc ),
(3, 3, 2009-12-01 16:17:20 , Techy goodness , def ),
(4, 2, 2009-12-01 16:05:57 , Further updates on Foo case , ghi ),
(5, 3, 2009-12-01 15:53:39 , Foo, Bar and Quux in court battle , ghi ),
(6, 2, 2009-12-01 15:52:02 , Anti-Fubar protests cause disquiet , mno ),
(7, 4, 2009-12-01 15:39:00 , Microsoft Bleh meets lukewarm reception , pqr ),
(8, 5, 2009-12-01 15:13:45 , Ever thought about doing it in VB? , pqr ),
(9, 1, 2009-12-01 15:13:15 , Celebrity has 'new friend' , pqr ),
(10, 1, 2009-12-01 15:09:57 , Microsoft launches Bleh worldwide , stu ),
(11, 2, 2009-12-01 14:57:22 , Microsoft launches Bleh in UK , stu ),
(12, 3, 2009-12-01 14:57:22 , Microsoft launches Bleh , stu ),
(13, 3, 2009-12-01 14:42:15 , Tech round-up , vwx ),
(14, 2, 2009-12-01 14:36:26 , Estates 'old news' say government , yza ),
(15, 1, 2009-12-01 14:15:21 , Iranian doctor 'was poisoned' , bcd ),
(16, 4, 2009-12-01 14:14:02 , Apple fans overjoyed by iBlah , axf );
Expected content after query:
- Story about Subject One
- Techy goodness
- Foo, Bar and Quux in court battle
- Anti-Fubar protests cause disquiet
- Microsoft Bleh meets lukewarm reception
- Ever thought about doing it in VB?
- Celebrity has new friend
- Microsoft launches Bleh worldwide
- Tech round-up
- Estates old news say government
- Iranian doctor was poisoned
- Apple fans overjoyed by iBlah
I ve tried a variation on the solution by Andomar, with some success:
SELECT s.`name` AS `source`,
i.`content`,
i.`published`
FROM `items` i
INNER JOIN `sources` s
ON i.`source_id` = s.`id`
AND s.`active` = 1
INNER JOIN (
SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio`
FROM `items` i
INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
GROUP BY `location_id`, `unique_id`
) `filter`
ON i.`unique_id` = `filter`.`unique_id`
AND s.`location_priority` = `filter`.`prio`
ORDER BY i.`published` DESC
LIMIT 50
With AND s.location_priority = filter.prio
things almost work as I want. Because an item can come from multiple sources with the same priority, items can be repeated. In this case, an extra GROUP BY i.unique_id
on the outer query does the job, and I suppose it doesn t matter which source "wins" if priorities are equal.
I had tried with AND i.source_id = filter.source_id
instead, which almost works (i.e. eliminates the extra GROUP BY
) but doesn t give results from the right sources. In the example above, it gives me "Further updates on Foo case" (source "BBC News England") rather than "Foo, Bar and Quux in court battle" (source "BBC Technology News". Looking at the results of the inner query, I get:
unique_id: ghi
source_id: 2
prio: 15
Note that the source ID is not correct (expected: 3).