Crawler Tutorial Part 1 --- On Web Crawlers
Crawler Tutorial Part 2 --- The WebMagic Crawler Framework
Crawler Tutorial Part 3 --- Crawler in Practice
3 Crawler in Practice
3.1 Requirements
Every day, during a set time window, crawl articles from the **** blog and store them in the article database.
3.2 Data Model Preparation
Below are the channel URLs of ****:
First, prepare two tables:
Channel table:
Article table:
Insert records into the tb_channel table (at minimum a record whose ID is ai, for the AI channel, which the scheduled task below references):
3.3 Writing the Code
3.3.1 Module Setup
(1) Create a Spring Boot project in IDEA (not covered in detail here), create a module named article_crawler, and add the following dependencies:
<dependencies>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
    </dependency>
</dependencies>
(2) Create the configuration file application.yml:
server:
  port: 9015
spring:
  application:
    name: article-crawler # service name
  datasource:
    driverClassName: com.mysql.jdbc.Driver
    url: jdbc:mysql://****:3306/test_article?characterEncoding=UTF8
    username: ****
    password: ****
  jpa:
    database: MySQL
    show-sql: true
  redis:
    host: ****
    password: ****
(3) Create the bootstrap class:
@SpringBootApplication
@EnableScheduling
public class ArticleCrawlerApplication {

    public static void main(String[] args) {
        SpringApplication.run(ArticleCrawlerApplication.class);
    }

    @Value("${spring.redis.host}")
    private String redis_host;

    @Value("${spring.redis.password}")
    private String redis_password;

    @Bean
    public IdWorker idWorker() {
        // snowflake-style ID generator (worker ID 1, datacenter ID 1)
        return new IdWorker(1, 1);
    }

    @Bean
    public RedisScheduler redisScheduler() {
        JedisPoolConfig config = new JedisPoolConfig(); // connection pool configuration
        config.setMaxTotal(100);                        // maximum number of connections
        config.setMaxIdle(10);                          // maximum number of idle connections
        JedisPool jedisPool = new JedisPool(config, redis_host, 6379, 20000, redis_password);
        // RedisScheduler keeps the URL queue in Redis and deduplicates visited URLs
        return new RedisScheduler(jedisPool);
    }
}
(4) Entity classes and data access interfaces (not covered in detail here)
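For reference, a minimal sketch of what the Article entity and its repository might look like, matching the setters called in ArticleDbPipeline below; the table name tb_article and the column layout are assumptions, not part of the original tutorial:

// Article.java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "tb_article") // assumed table name
public class Article {
    @Id
    private String id;        // primary key, generated by IdWorker
    private String channelid; // channel ID, references tb_channel
    private String title;     // article title
    private String content;   // article body (HTML)

    public void setId(String id) { this.id = id; }
    public void setChannelid(String channelid) { this.channelid = channelid; }
    public void setTitle(String title) { this.title = title; }
    public void setContent(String content) { this.content = content; }
    // getters omitted for brevity
}

// ArticleDao.java (separate file)
import org.springframework.data.jpa.repository.JpaRepository;

public interface ArticleDao extends JpaRepository<Article, String> {
}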
3.3.2 The Crawling Class
Create the article crawling class ArticleProcessor:
/**
 * Article page processor
 */
@Component
public class ArticleProcessor implements PageProcessor {

    @Override
    public void process(Page page) {
        // add every article-detail link found on the page to the crawl queue
        page.addTargetRequests(page.getHtml().links()
                .regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        // article title
        String title = page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1").get();
        // article body
        String content = page.getHtml().xpath("//*[@id=\"article_content\"]/div[2]").get();
        if (title != null && content != null) {
            page.putField("title", title);
            page.putField("content", content);
        } else {
            page.setSkip(true); // not an article detail page, skip it
        }
    }

    @Override
    public Site getSite() {
        return Site.me().setRetryTimes(3).setSleepTime(100);
    }
}
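Before wiring in the database, you can smoke-test the processor on its own with WebMagic's built-in ConsolePipeline, which simply prints the extracted fields. A quick sketch (this test class is not part of the tutorial's module):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

public class ProcessorSmokeTest {
    public static void main(String[] args) {
        Spider.create(new ArticleProcessor())       // no Spring context needed here
              .addUrl("https://blog.csdn.net/nav/ai")
              .addPipeline(new ConsolePipeline())   // print extracted fields to stdout
              .run();                               // run() blocks until the crawl finishes
    }
}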
3.3.3 The Persistence Class
Create the persistence class ArticleDbPipeline, which stores the crawled data in the database:
@Component
public class ArticleDbPipeline implements Pipeline {

    @Autowired
    private ArticleDao articleDao;

    @Autowired
    private IdWorker idWorker;

    private String channelId; // channel ID

    public void setChannelId(String channelId) {
        this.channelId = channelId;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");     // extracted title
        String content = resultItems.get("content"); // extracted body
        Article article = new Article();
        article.setId(idWorker.nextId() + "");
        article.setChannelid(channelId);
        article.setTitle(title);
        article.setContent(content);
        articleDao.save(article);
    }
}
3.3.4 The Task Class
Create the task class; the @Scheduled annotation sets when crawls are triggered:
/**
 * Scheduled crawl tasks
 */
@Component
public class ArticleTask {

    @Autowired
    private ArticleProcessor articleProcessor;

    @Autowired
    private ArticleDbPipeline articleDbPipeline;

    @Autowired
    private RedisScheduler redisScheduler;

    /**
     * Crawl AI articles, every day at 15:15
     */
    @Scheduled(cron = "0 15 15 * * ?")
    public void aiTask() {
        System.out.println("Starting to crawl CSDN articles");
        Spider spider = Spider.create(articleProcessor);
        spider.addUrl("https://blog.csdn.net/nav/ai");
        articleDbPipeline.setChannelId("ai");
        spider.addPipeline(articleDbPipeline);
        spider.setScheduler(redisScheduler);
        spider.start(); // start() runs asynchronously; run() would block
    }
}
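Additional channels follow the same pattern: another method in ArticleTask with its own cron expression, start URL, and channel ID. A hypothetical second task (the db channel record and its URL are assumptions):

/**
 * Crawl database articles, every day at 15:30 (hypothetical second channel)
 */
@Scheduled(cron = "0 30 15 * * ?")
public void dbTask() {
    Spider spider = Spider.create(articleProcessor);
    spider.addUrl("https://blog.csdn.net/nav/db"); // assumed channel URL
    articleDbPipeline.setChannelId("db");          // assumes a matching tb_channel record
    spider.addPipeline(articleDbPipeline);
    spider.setScheduler(redisScheduler);
    spider.start();
}

Because ArticleDbPipeline is a shared singleton and setChannelId mutates its state, two tasks running at the same time could tag articles with the wrong channel; staggering the cron schedules, as above, avoids this.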
Run the Spring Boot application and query the database; you will see the crawled articles stored there.
Of course, this is still just a simple introductory crawler project. A real production crawler involves more complex work such as rotating proxy IPs and getting past CAPTCHAs, which is beyond the scope of this article; interested readers can dig into it on their own.
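As one pointer in that direction, WebMagic 0.7.x supports proxies through HttpClientDownloader and SimpleProxyProvider. A minimal sketch, to be applied to the spider before start(); the proxy host and port are placeholders:

import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

// route requests through one or more proxies
HttpClientDownloader downloader = new HttpClientDownloader();
downloader.setProxyProvider(SimpleProxyProvider.from(
        new Proxy("127.0.0.1", 8888))); // placeholder proxy host and port
spider.setDownloader(downloader);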