YangTao
Dislike dabbling; prefer seeing things through to the end.
E.t's Blog
Getting Started with the WebMagic Crawler

I recently came across the WebMagic crawler and am writing this note for future reference. Without further ado, here is the code. First, the Maven dependencies:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>


Step 1: configure the crawl for the target site, including the encoding, crawl interval, and retry count (the official defaults are used here).

Step 2: process() is the core interface for customizing crawler logic; the extraction logic is written there.

Step 3: discover follow-up URLs on the page and queue them for crawling.
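Step 3 is usually done by filtering a page's links with a regular expression. As a quick illustration of how such a pattern selects URLs (the pattern is the one used in the full example; the two sample URLs are made up purely for demonstration):

```java
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterDemo {
    public static void main(String[] args) {
        // Same pattern as in the processor below: only article pages
        // under /china/20181220/ should be followed.
        Pattern p = Pattern.compile("http://www.cankaoxiaoxi.com/china/20181220/.*");

        // Hypothetical sample links, as a link extractor might return them.
        List<String> links = List.of(
                "http://www.cankaoxiaoxi.com/china/20181220/2346893.shtml",
                "http://www.cankaoxiaoxi.com/world/20181220/2346900.shtml");

        for (String link : links) {
            // Only the first URL matches; the /world/ one is filtered out.
            System.out.println(link + " -> " + p.matcher(link).matches());
        }
    }
}
```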

The full code:

package com.test;

import java.util.ArrayList;
import java.util.List;

import cn.test.entity.DataEntity;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Collected results; static so main() can read them after the crawl finishes.
    static List<DataEntity> list = new ArrayList<DataEntity>();

    // Step 1: site-level crawl configuration - 3 retries, 100 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public Site getSite() {
        return site;
    }

    // Step 2: extraction logic, invoked once per downloaded page.
    @Override
    public void process(Page page) {
        // Pull the article title and abstract with XPath; toString() may return
        // null on pages that don't match (e.g. the front page).
        String title = page.getHtml().xpath("//h1[@class='articleHead']/text()").toString();
        String abs = page.getHtml().xpath("//div[@class='articleAbs']/span/text()").toString();
        list.add(new DataEntity(title, abs));
        // Step 3: queue follow-up article URLs that match the regex.
        page.addTargetRequests(page.getHtml().links()
                .regex("http://www.cankaoxiaoxi.com/china/20181220/.*").all());
    }

    public static void main(String[] args) {
        // run() blocks until the crawl completes, so the loop below sees all results.
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("http://www.cankaoxiaoxi.com")
                .thread(5)
                .run();
        for (DataEntity l : list) {
            System.out.println(l);
        }
    }
}
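DataEntity is imported from cn.test.entity but never shown in the post. Assuming it is just a two-field holder for the extracted title and abstract, a minimal sketch could be:

```java
// Hypothetical sketch of the DataEntity referenced above:
// a simple holder for the extracted title and abstract.
public class DataEntity {
    private final String title;
    private final String summary;

    public DataEntity(String title, String summary) {
        this.title = title;
        this.summary = summary;
    }

    public String getTitle() { return title; }

    public String getSummary() { return summary; }

    // Used by System.out.println(l) in the main() loop above.
    @Override
    public String toString() {
        return "DataEntity{title=" + title + ", summary=" + summary + "}";
    }
}
```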
Published: 2018-12-20