A while ago I needed to scrape information from web pages. I knew nothing about crawlers, so I read up on webmagic and wrote a simple one.
1. First, an introduction to webmagic:
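webmagic has a fully modular design that covers the whole crawler lifecycle (link extraction, page download, content extraction, persistence). It supports multi-threaded and distributed crawling, as well as automatic retries and custom User-Agent/cookie settings.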
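To make the four lifecycle stages concrete, here is a minimal sketch in plain Java. It does not use webmagic's own classes: `MiniCrawler`, `demoSite`, and the regex-based extraction are all hypothetical names and choices made up for illustration, and the "download" step reads from an in-memory map instead of the network so the example runs anywhere.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for a crawler lifecycle; these are NOT webmagic classes.
class MiniCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    private final Function<String, String> downloader;      // URL -> HTML, or null if not found
    private final Queue<String> queue = new ArrayDeque<>(); // URLs waiting to be fetched
    private final Set<String> seen = new HashSet<>();       // avoids fetching a URL twice
    private final List<String> store = new ArrayList<>();   // stand-in for persistence

    MiniCrawler(Function<String, String> downloader) {
        this.downloader = downloader;
    }

    List<String> crawl(String startUrl) {
        enqueue(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = downloader.apply(url);            // page download
            if (html == null) {
                continue;
            }
            Matcher links = LINK.matcher(html);             // link extraction
            while (links.find()) {
                enqueue(links.group(1));
            }
            Matcher title = TITLE.matcher(html);            // content extraction
            if (title.find()) {
                store.add(url + " -> " + title.group(1));   // persistence
            }
        }
        return store;
    }

    private void enqueue(String url) {
        if (seen.add(url)) {
            queue.add(url);
        }
    }

    // Hypothetical in-memory "site" so the sketch needs no network access.
    static Map<String, String> demoSite() {
        return Map.of(
            "/a", "<title>Page A</title><a href=\"/b\">b</a><a href=\"/a\">self</a>",
            "/b", "<title>Page B</title><a href=\"/a\">back</a>");
    }

    public static void main(String[] args) {
        List<String> saved = new MiniCrawler(demoSite()::get).crawl("/a");
        saved.forEach(System.out::println);
    }
}
```

In a real crawler each of these stages would be a separate, replaceable component, which is roughly what "modular" means in the description above. The sketch keeps them in one class only to stay short.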
Design philosophy:
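Maven dependencies:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
webmagic uses slf4j-log4j12 as its slf4j binding. If your project already has its own slf4j binding, exclude it from the dependency:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>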
JDBC approach:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class CsdnBlogDao {
    private Connection conn = null;

    public CsdnBlogDao() {
        try {
            // Load the MySQL driver (Connector/J 8+ names it com.mysql.cj.jdbc.Driver)
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/test?"
                    + "user=***&password=***3&useUnicode=true&characterEncoding=UTF8";
            conn = DriverManager.getConnection(url);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    // Inserts one blog record; returns the number of rows affected, or -1 on failure.
    public int add(CsdnBlog csdnBlog) {
        String sql = "INSERT INTO `test`.`csdnblog` (`keyes`, `titles`, `content` , `dates`, `tags`, `category`, `views`, `comments`, `copyright`) VALUES (?, ?, ?, ?, ?, ?, ?, ?,?);";
        // try-with-resources closes the statement even if the insert fails
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, csdnBlog.getKey());
            ps.setString(2, csdnBlog.getTitle());
            ps.setString(3, csdnBlog.getContent());
            ps.setString(4, csdnBlog.getDates());
            ps.setString(5, csdnBlog.getTags());
            ps.setString(6, csdnBlog.getCategory());
            ps.setInt(7, csdnBlog.getView());
            ps.setInt(8, csdnBlog.getComments());
            ps.setInt(9, csdnBlog.getCopyright());
            return ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return -1;
    }
}
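The INSERT statement in `add` has nine columns and nine hand-typed `?` placeholders, which are easy to get out of sync when a column is added later. One possible way to avoid counting by hand is to derive the statement from the column list. `InsertSqlBuilder` below is a hypothetical helper, not part of the original DAO, and it assumes the table and column names are trusted constants (it does no escaping, so user input must never be passed as a name).

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original DAO: builds a parameterized
// INSERT so the number of placeholders always matches the number of columns.
class InsertSqlBuilder {
    static String build(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        // Backtick-quote each column name, MySQL style, as in the DAO above.
        String cols = "`" + String.join("`, `", columns) + "`";
        // One "?" per column.
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO `" + table + "` (" + cols + ") VALUES (" + marks + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("csdnblog", List.of("keyes", "titles", "content",
                "dates", "tags", "category", "views", "comments", "copyright")));
    }
}
```

The resulting string can be passed to `conn.prepareStatement(...)` exactly like the hand-written one. The values themselves are still bound with `setInt`/`setString`, so the protection against SQL injection that `PreparedStatement` provides is unchanged.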
Entity class:
public class CsdnBlog {
    private int key; // ID
    private String title; // title
    private String dates; // date
    private String tags; // tags
    private String category; // category