Question

I want to crawl only HTML pages. I changed the regular expression in the code below, but it is still crawling some XML pages as well. Any suggestions as to why this is happening?

public class MyCrawler extends WebCrawler {


    Pattern filters = Pattern.compile("(.(html))");

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://www.somehost.com/")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();

        String url = page.getWebURL().getURL();         
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }   
}

Answer 1

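File extensions are meaningless on the web, especially with the newer "SEO-friendly" URL patterns. You have to analyze the content type instead.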

You can do this by requesting each URL (with the HTTP GET or possibly HEAD method) and analyzing its response headers. If the Content-Type response header is not what you want, you don't have to download the body; otherwise, it's a page you want to look at.

Edit: HTML should have text/html as its Content-Type; XHTML is application/xhtml+xml (but note that the latter may be subject to content negotiation, which usually depends on the Accept header and the user agent in the request).
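
A minimal sketch of that header check, using only the JDK's HttpURLConnection. The class and method names (ContentTypeCheck, isHtml, fetchContentType) are made up for illustration and are not part of the crawler library in the question. The Accept value is only an assumption about what you would want to request.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper for illustration; not part of any crawler library.
class ContentTypeCheck {

    // Returns true if a Content-Type header value denotes HTML or XHTML.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType).trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml");
    }

    // Sends a HEAD request and returns the Content-Type header (may be null).
    static String fetchContentType(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no response body
        // Assumed Accept value; content negotiation may depend on it.
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
        try {
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling ContentTypeCheck.isHtml(ContentTypeCheck.fetchContentType(url)) before processing a URL would skip XML feeds regardless of how the URL ends. This costs one extra request per URL, and some servers answer HEAD differently from GET, so treat it as a starting point.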
