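Requirements
A hands-on Java crawler that collects China's national administrative division codes. It parses HTML with XPath and uses WebMagic to recursively crawl the links at each level of the site.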
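The idea of "extract rows with XPath, then follow each row's link to the next level" can be sketched without any dependencies. The snippet below is only an illustration under stated assumptions: it uses the JDK's built-in `javax.xml.xpath` instead of WebMagic, and the class name `RegionCrawlSketch`, the in-memory `PAGES` map, the `row` CSS class, and the sample pages are all hypothetical stand-ins for the real site.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RegionCrawlSketch {
    // Hypothetical in-memory site (URL -> well-formed XHTML); stands in for real HTTP fetches.
    static final Map<String, String> PAGES = new LinkedHashMap<>();
    static {
        PAGES.put("index.html",
            "<html><body><table>"
            + "<tr class='row'><td><a href='11.html'>110000</a></td><td>Beijing</td></tr>"
            + "<tr class='row'><td>120000</td><td>Tianjin</td></tr>"
            + "</table></body></html>");
        PAGES.put("11.html",
            "<html><body><table>"
            + "<tr class='row'><td>110101</td><td>Dongcheng District</td></tr>"
            + "<tr class='row'><td>110102</td><td>Xicheng District</td></tr>"
            + "</table></body></html>");
    }

    // Entry point: returns code -> name in the order the pages were visited.
    static Map<String, String> crawlAll(String startUrl) {
        Map<String, String> regions = new LinkedHashMap<>();
        try {
            crawl(startUrl, new HashSet<>(), regions);
        } catch (Exception e) {
            throw new IllegalStateException("crawl failed for " + startUrl, e);
        }
        return regions;
    }

    private static void crawl(String url, Set<String> visited, Map<String, String> regions)
            throws Exception {
        String html = PAGES.get(url);
        // Skip unknown pages and pages already processed (guards against link cycles).
        if (html == null || !visited.add(url)) {
            return;
        }
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate("//tr[@class='row']", doc, XPathConstants.NODESET);
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            String code = xpath.evaluate("td[1]", row).trim();  // first cell: region code
            String name = xpath.evaluate("td[2]", row).trim();  // second cell: region name
            String href = xpath.evaluate("td[1]/a/@href", row); // empty string when there is no link
            regions.put(code, name);
            if (!href.isEmpty()) {
                crawl(href, visited, regions);                  // descend to the next level
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(crawlAll("index.html"));
    }
}
```

With WebMagic the same two steps would typically live in a `PageProcessor`, using `page.getHtml().xpath(...)` for extraction and `page.addTargetRequests(...)` to queue the next-level links. Real pages are rarely well-formed XML, which is one reason to rely on the library's HTML-tolerant selectors rather than the strict JDK parser used in this sketch.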
Implementation
Add the WebMagic dependency
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.6</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.6</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
ConsolePipeline
import us.codecraft.