Removing duplicates from MySQL on a column that's too big to INDEX

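I'm trying to remove duplicate rows from a table with a million rows. The field I want to check for duplicates is too long (it stores URLs) to put a UNIQUE index on. Is there a way to remove the duplicates quickly?

The suggested method for removing duplicates: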

DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.id>t2.id AND t1.name=t2.name;

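never seems to finish, though I suppose it may simply need a lot of time.

One idea I've heard here is to create an MD5 hash column for indexing and comparison. Is this the recommended route? If so, should I shorten that column for space/speed reasons?

Answers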

The hash would give you a column you could put an index on, so the expensive t1.Name = t2.Name comparison becomes a far cheaper t1.Hash = t2.Hash. Adding the hash to 1,000,000 records would take a while, though.

If that is not an option, another approach is to run the clean-up in batches, for example something like

WHERE t1.id >= 0 AND t1.id < 10000
WHERE t1.id >= 10000 AND t1.id < 20000

I may not have thought this through fully, but it's worth a try.
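A minimal sketch of the batched clean-up, assuming table1 has an integer primary key id and an already indexed Hash column as described above (both column names are taken from the statements in this thread, and the batch size of 10000 is only the figure used in the example ranges):

```sql
-- Sketch only: delete duplicates one id window at a time.
-- Only t1 is restricted to the window; t2 may be any earlier row,
-- so the copy with the lowest id always survives.
DELETE t1
FROM table1 AS t1
     JOIN table1 AS t2
          ON t1.Hash = t2.Hash
             AND t1.name = t2.name
             AND t1.id > t2.id
WHERE t1.id >= 0 AND t1.id < 10000;

-- Next batch: the window starts where the previous one ended,
-- so no id is skipped or processed twice.
DELETE t1
FROM table1 AS t1
     JOIN table1 AS t2
          ON t1.Hash = t2.Hash
             AND t1.name = t2.name
             AND t1.id > t2.id
WHERE t1.id >= 10000 AND t1.id < 20000;
```

Each statement touches a bounded number of rows, so it should finish and release its locks sooner than one delete over the whole table.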

  • Create a column, md5url, and fill it with the MD5 of the URL: UPDATE table1 SET md5url = MD5(url)
  • Create a (non-unique) index on md5url; an MD5 hash is short enough to index
  • Change the statement to:

    DELETE t1 
    FROM table1 AS t1 
         JOIN table1 AS t2 
              ON t1.md5url = t2.md5url 
                 AND t1.name=t2.name 
                 AND t1.id>t2.id;
    

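This way, the JOIN condition works mainly on the index. Only when the indexed md5url values match do we actually compare the URLs themselves, because depending on how many URLs you have, at some point two different URLs may share the same MD5. The third condition is straightforward: it ensures that only one of two identical rows is deleted.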
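The first two steps above could be sketched as follows. The column type and the index name idx_table1_md5url are assumptions for illustration, not part of the original answer; MD5() returns a 32-character hexadecimal string, which is why CHAR(32) is used here.

```sql
-- Sketch only: add the hash column, fill it, then index it.
-- CHAR(32) fits the 32 hex characters returned by MD5().
ALTER TABLE table1 ADD COLUMN md5url CHAR(32);

-- Fill the new column from the existing url column.
UPDATE table1 SET md5url = MD5(url);

-- Non-unique index: duplicates still exist at this point,
-- so a UNIQUE index would fail to build.
CREATE INDEX idx_table1_md5url ON table1 (md5url);
```

Creating the index after the UPDATE rather than before is a design choice in this sketch: the index is built once instead of being maintained row by row while the column is filled.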

I'd love to hear whether this works; in my head it makes perfect sense right now ;-)
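One possible way to check the result afterwards, assuming the md5url column from the steps above exists and is filled:

```sql
-- Sketch only: list hash values that still occur more than once.
SELECT md5url, COUNT(*) AS copies
FROM table1
GROUP BY md5url
HAVING COUNT(*) > 1;
```

If the DELETE did its job, any rows this returns share an MD5 but differ in the column compared by the DELETE, so they are not true duplicates.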




