English 中文(简体)
How can I index HTML documents?
原标题:

I am using Lucene .NEt to do full-text searching. Till now I have been indexing PDF docs, but now I have a few webpages that I need to index. What s the best/easiest way to index HTML documents to add to my Lucene index? I am using .NET/C#

问题回答

I am currently working on this problem, the best answer I have found to date is using the HTML Agility Pack to get the plain text content out of the HTML.





相关问题
Manually implementing high performance algorithms in .NET

As a learning experience I recently tried implementing Quicksort with 3 way partitioning in C#. Apart from needing to add an extra range check on the left/right variables before the recursive call, ...

Anyone feel like passing it forward?

I m the only developer in my company, and am getting along well as an autodidact, but I know I m missing out on the education one gets from working with and having code reviewed by more senior devs. ...

How do I compare two decimals to 10 decimal places?

I m using decimal type (.net), and I want to see if two numbers are equal. But I only want to be accurate to 10 decimal places. For example take these three numbers. I want them all to be equal. 0....

Exception practices when creating a SynchronizationContext?

I m creating an STA version of the SynchronizationContext for use in Windows Workflow 4.0. I m wondering what to do about exceptions when Post-ing callbacks. The SynchronizationContext can be used ...

Show running instance in single instance application

I am building an application with C#. I managed to turn this into a single instance application by checking if the same process is already running. Process[] pname = Process.GetProcessesByName("...

How to combine DataTrigger and EventTrigger?

NOTE I have asked the related question (with an accepted answer): How to combine DataTrigger and Trigger? I think I need to combine an EventTrigger and a DataTrigger to achieve what I m after: when ...

热门标签