I'm facing a problem while developing my web app. Here it is:
The web app (articles are listed alphabetically) is based on user-generated content — mostly short articles, though they can get fairly long, around a quarter of a screen — and each user submits at least 10 of them, so the volume should grow quickly. By its nature, roughly 10% of the articles will be duplicates, so I need an algorithm to detect them.
I came up with the following steps:
- On submission, get the length of the text and store it in a separate table (`article_id`, `length`). The problem is that the articles are encoded with PHP's htmlspecialchars() function, and users post content with slight modifications (someone will miss a comma or an accent, or even skip some words).
- Then retrieve all entries from the database whose length is in the range `new_post_length` ± 5% (should I use another threshold, keeping in mind the human factor in article submission?).
- Fetch the first 3 keywords and compare them against the articles fetched in step 2.
- With a final array of the most probable matches, compare the new entry using PHP's levenshtein() function.
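The steps above could be sketched roughly as follows. This is a minimal, hypothetical sketch, not a drop-in implementation: the helper names (`normalizeArticle()`, `withinLengthRange()`, `isProbableDuplicate()`) and the 10% distance threshold are my own assumptions, and the database layer is omitted. Note that before PHP 8.0, levenshtein() rejected arguments longer than 255 characters, so longer articles would need to be truncated or compared with similar_text() instead.

```php
<?php
// Undo the entity encoding and smooth over the "human factor":
// decode entities, lowercase, strip punctuation, collapse whitespace.
function normalizeArticle(string $text): string {
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = mb_strtolower($text, 'UTF-8');
    $text = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text); // drop punctuation
    return trim(preg_replace('/\s+/u', ' ', $text));       // collapse spaces
}

// Step 2: keep only candidates whose stored length is within ±5%
// of the new post's length (tolerance is adjustable).
function withinLengthRange(int $candidateLen, int $newLen, float $tolerance = 0.05): bool {
    return abs($candidateLen - $newLen) <= $newLen * $tolerance;
}

// Step 4: final check with levenshtein(). The strings are truncated to
// 255 characters for compatibility with pre-8.0 PHP; $maxRatio is an
// assumed cutoff (edit distance up to 10% of the compared length).
function isProbableDuplicate(string $new, string $candidate, float $maxRatio = 0.1): bool {
    $a = substr(normalizeArticle($new), 0, 255);
    $b = substr(normalizeArticle($candidate), 0, 255);
    $dist = levenshtein($a, $b);
    return $dist <= $maxRatio * max(strlen($a), strlen($b), 1);
}
```

Normalizing both sides first means a missing comma or a stray accent no longer counts against the match, which should let the length threshold and the distance cutoff stay tight.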
This has to run at article-submission time, not as a cron job. However, I suspect it will put a heavy load on the server.
Can you suggest any ideas?
Thank you! Mike