Matching common strings between two data sets

I'm working on a website conversion. I have a dump of the database backend as an .sql file. I also have a scraped copy of the website.

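What I want to do is map database tables and columns to the directories, pages, and page sections in the scrape. I'd like to automate this.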

Is there some tool or script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results that say something like:

string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.

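I don't want to do a line-by-line comparison, because lines from the database dump (INSERT INTO table VALUES (......)) won't match the lines of the page where the data actually ends up (<div id="left_column"><div id="left_content">...</div>).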

I realize this is a computationally intensive task, but even letting it run over the weekend would be fine.

I've found similar questions, but I don't have enough CS background to know whether they are identical to my problem. SO kindly suggested this question, but it appears to deal with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack and see the matching straws of hay.

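Is there a command-line tool or script out there, or is this something I need to build? If I do build it, should I use the Aho-Corasick algorithm, as the other question suggests?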

Answers

So you have two questions: 1) is there already a solution that does what you want, and 2) should you use the Aho-Corasick algorithm.

The first answer is that I doubt you'll find a ready-built tool that will meet your needs. The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.

I'll go one step further and propose an architecture.

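First, you need to be able to parse the .sql files in a meaningful way, that is, go through them line by line and return the table name, column name, and value. That is probably the most suitable approach.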
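As a rough illustration of this first step, here is a minimal Python sketch. It assumes a mysqldump-style file with one single-row `INSERT INTO table VALUES (...);` statement per line, and it uses regular expressions, which is this sketch's own choice. The names `iter_sql_strings`, `INSERT_RE`, and `STRING_RE` are hypothetical. Because this form of INSERT carries no column names, the sketch reports only the position of each string literal among the literals on that line; mapping positions to column names would need the CREATE TABLE statements as well.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: one single-row statement per line, e.g.
#   INSERT INTO pages VALUES (1, 'About us');
INSERT_RE = re.compile(
    r"INSERT\s+INTO\s+`?(\w+)`?\s+VALUES\s*\((.*)\)\s*;", re.IGNORECASE
)
# A single-quoted SQL string literal; '' and backslash escapes may appear inside.
STRING_RE = re.compile(r"'((?:[^'\\]|\\.|'')*)'")


def iter_sql_strings(lines: Iterable[str]) -> Iterator[Tuple[int, str, int, str]]:
    """Yield (line_number, table, literal_index, text) for each string literal.

    literal_index counts string literals only, so it is not a column position.
    """
    for line_no, line in enumerate(lines, start=1):
        m = INSERT_RE.search(line)
        if not m:
            continue  # not an INSERT line (CREATE TABLE, comments, ...)
        table, values = m.group(1), m.group(2)
        for idx, lit in enumerate(STRING_RE.finditer(values)):
            # Undo the two most common quote escapes.
            text = lit.group(1).replace("''", "'").replace("\\'", "'")
            yield line_no, table, idx, text
```

A real dump may use multi-row INSERTs or statements that span several lines, in which case the line-by-line assumption breaks and a proper SQL parser would be a safer choice.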

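Second, you need a parser for the website that goes through the pages element by element and returns each text node together with the name of every enclosing tag, all the way up to the html element, plus the name of the file it came from. An XmlTextReader or a similar XML parser, such as SAXON, is probably best, as long as it will operate on non-valid XML.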
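For comparison, the same idea can be sketched with Python's standard `html.parser`, which tolerates markup that is not valid XML. This is a swapped-in tool, not the XmlTextReader or SAXON approach named above. The class `TextNodeCollector`, the helper `text_nodes`, and the `tag#id` path format are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code, not page content.
SKIPPED_TAGS = {"script", "style"}


class TextNodeCollector(HTMLParser):
    """Collect (filename, line, element_path, text) for every text node."""

    def __init__(self, filename):
        super().__init__(convert_charrefs=True)
        self.filename = filename
        self.stack = []   # open elements, outermost first
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        ident = dict(attrs).get("id")
        self.stack.append(f"{tag}#{ident}" if ident else tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: pop back to the nearest matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1].split("#")[0] in SKIPPED_TAGS:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            line, _ = self.getpos()    # line where this run of text begins
            self.nodes.append((self.filename, line, "/".join(self.stack), text))


def text_nodes(filename, html):
    parser = TextNodeCollector(filename)
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Keeping the `id` in the path makes it easier to see that a match landed in, say, `div#left_column` rather than in just some `div`.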

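You then need to tie these two parsers together with some kind of mutual search algorithm. You will have to customize it to suit your needs. Aho-Corasick will apparently give you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how: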

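Assuming you have two parsers, one that loops through each field and one that loops through each text node, pick one of the two and, for each string in its data source, ask the other parser to search the other data source for all possible matches, and log what it finds.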
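The naive approach described above might look like the following Python sketch. It assumes both sides have already been reduced to `(source, line_number, text)` tuples; the function names `find_matches` and `format_match` are hypothetical. Matching is plain substring containment, so the cost grows with the number of database strings times the number of text nodes, which is acceptable if a weekend-long run is fine.

```python
def find_matches(db_strings, page_nodes):
    """Return every (text, db_source, db_line, page_source, page_line) where a
    database string occurs verbatim inside a page text node.

    Both arguments are iterables of (source, line_number, text) tuples.
    """
    page_nodes = list(page_nodes)  # scanned once per database string
    matches = []
    for db_source, db_line, db_text in db_strings:
        if not db_text:
            continue  # the empty string is contained in everything
        for page_source, page_line, page_text in page_nodes:
            if db_text in page_text:
                matches.append((db_text, db_source, db_line, page_source, page_line))
    return matches


def format_match(match):
    """Render one match in the style asked for in the question."""
    text, db_source, db_line, page_source, page_line = match
    return (f'string "{text}" on line {db_line} in {db_source} '
            f"matches string in {page_source} on line {page_line}.")
```

Swapping the inner loop for an Aho-Corasick automaton built over the database strings would change only `find_matches`; the inputs and the report format could stay the same.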

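This can't work, at least not reliably. Best case: you match every piece of data to its counterpart in your HTML files, but you will have many false positives, for example user names that are actual words.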

Furthermore, text is often manipulated before it is displayed. Sites often capitalize titles or truncate text for previews, etc.
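These two problems can be softened, though not eliminated, with some normalization before comparing. The Python sketch below is one possible heuristic and not a complete solution: it ignores case and whitespace differences, skips short strings that are likely to be ordinary words, and treats a page text as a match when it looks like a truncated prefix of the database text. The names `normalize` and `likely_match` and the thresholds `min_length` and `min_prefix` are assumptions made for this example and would need tuning on real data.

```python
import unicodedata


def normalize(text):
    """Fold case, unify compatibility characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


def likely_match(db_text, page_text, min_length=20, min_prefix=40):
    """Heuristic comparison of one database string with one page text node."""
    a, b = normalize(db_text), normalize(page_text)
    if len(a) < min_length:
        return False  # short strings (user names, single words) match too easily
    if a in b:
        return True   # same text apart from case and spacing
    # Truncated preview: the page shows only the start of the database text,
    # possibly followed by an ellipsis (NFKC turns "…" into "...").
    stem = b.rstrip(". ")
    return len(stem) >= min_prefix and a.startswith(stem)
```

Other transformations, such as markup stripped from stored HTML or dates reformatted for display, would still slip through, which is the point the answer is making.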

AFAIK there is no such tool, and I don't think a tool can exist that solves your problem adequately.

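Your best option is to get hold of the source code the site uses/used and analyze it. If that is not possible, you will have to analyze the database manually. Get as much content as possible from the URLs and try to fit the pieces of the puzzle together.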




