A while back I was lucky enough to follow an expert's Python crawler series (from beginner to advanced) on 程序人生 and tried out the Python version. On a whim, I decided to implement a crawler in Java and do a simple scrape of the job listings on 智联招聘.
Jar packages used: htmlunit-2.23.jar, jsoup-1.8.3.jar (plus poi-3.17.jar and log4j-1.2.16.jar, as listed in the pom below).
pom.xml dependency configuration:
<dependencies>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.23</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.16</version>
    </dependency>
</dependencies>
The overall approach is fairly simple:
① Send the search request with its parameters, the way a browser would, and obtain a Document (parameters: place: work location, keyword: search keyword, i: page number); a minimal sketch of this step follows the list.
② From the Document object, get the Elements; an Elements collection wraps all the matching DOM nodes.
③ Process the data: regex matching and character filtering.
④ Organize the data: remove duplicate and extraneous entries.
⑤ Export to Excel.
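Here is a minimal sketch of steps ① and ② (the search URL and the CSS class names are taken from the full source further down; treat them as assumptions, since the site's markup may change, and FetchSketch is just an illustrative name):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class FetchSketch {
    public static void main(String[] args) throws Exception {
        // Step ①: request one result page, carrying the query parameters (10s timeout)
        Document doc = Jsoup
                .connect("http://sou.zhaopin.com/jobs/searchresult.ashx?jl=成都&kw=java工程师&p=1")
                .timeout(10000)
                .get();
        // Step ②: an Elements collection wraps every DOM node with the given class
        Elements companies = doc.getElementsByClass("gsmc"); // company name cells
        Elements salaries = doc.getElementsByClass("zwyx");  // salary cells
        System.out.println(companies.size() + " companies, " + salaries.size() + " salaries");
    }
}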
And with that, a simple crawler is done. The full source:
package cn.spider;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Spider {
    public static void main(String[] args) throws Exception {
        // Search criteria
        String workplace = "成都";
        String keywords = "java工程师";
        Integer pageCount = 20; // number of result pages to scrape
        // Scrape the data
        List<String> data = getData(workplace, keywords, pageCount);
        // Fill the Excel workbook
        HSSFWorkbook wb = createExcel(data);
        // Write it out; HSSF produces the binary .xls format, so name the file accordingly
        wb.write(new java.io.File("C:/result" + workplace + keywords + System.currentTimeMillis() + ".xls"));
        wb.close();
    }
    // Fill the Excel workbook
    private static HSSFWorkbook createExcel(List<String> data) throws Exception {
        HSSFWorkbook wb = new HSSFWorkbook();
        HSSFSheet sheet = wb.createSheet("result");
        HSSFRow row = sheet.createRow(0);
        HSSFCell cell = null;
        // Column headers: location, salary, link, company
        String[] title = new String[]{"地点", "薪资", "链接", "公司"};
        // Create the header row
        for (int i = 0; i < 4; i++) {
            cell = row.createCell(i);
            cell.setCellValue(title[i]);
        }
        // Each scraped record occupies four consecutive entries in the flat list
        for (int i = 1; i <= data.size() / 4; i++) {
            row = sheet.createRow(i);
            // Location
            row.createCell(0).setCellValue(data.get(4 * (i - 1)));
            // Salary
            row.createCell(1).setCellValue(data.get(4 * (i - 1) + 1));
            // Link
            row.createCell(2).setCellValue(data.get(4 * (i - 1) + 2));
            // Company
            row.createCell(3).setCellValue(data.get(4 * (i - 1) + 3));
        }
        return wb;
    }
    // Scrape the data
    public static List<String> getData(String place, String keyword, Integer pageNum) throws IOException {
        // Holds the scraped data, flattened as [location, salary, link, company, ...]
        List<String> data = new ArrayList<>();
        for (int i = 1; i <= pageNum; i++) {
            // Request one result page; execute() fetches it, parse() builds the Document (10s timeout)
            Document parse = Jsoup
                    .connect("http://sou.zhaopin.com/jobs/searchresult.ashx?jl="
                            + place + "&kw=" + keyword + "&sm=0&sg=f9d94acccc0843d78b71e6099d39a048&p=" + i)
                    .timeout(10000).execute().parse();
            // Grab the relevant DOM nodes with jsoup
            Elements cless = parse.getElementsByClass("gsmc");   // company name + link cells
            Elements cless2 = parse.getElementsByClass("zwyx");  // salary cells
            Elements cless3 = parse.getElementsByClass("gzdd");  // work location cells
            String string = cless.toString();
            // Regex-strip the markup down to "url,company" pairs
            String companyandsite = string.replaceAll("<td class=\"gsmc\"><a href=\"", "")
                    .replaceAll("\" target=\"_blank\">", ",")
                    .replaceAll("</a>(.*)</td>", ",")
                    .replaceAll("<th class=\"gsmc\">公司名称</th>", "")
                    .replaceAll("\n", "");
            String[] companyandsites = companyandsite.split(",");
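            // companyandsites alternates: even index = job detail URL, odd index = company name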
            String salary = cless2.html().replaceAll("职位月薪", "").trim();
            String[] salarys = salary.split("\n");
            String location = cless3.html().replaceAll("工作地点", "").trim();
            String[] locations = location.split("\n");
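            // locations, salarys and companyandsites are parallel: index j describes the j-th job row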
            for (int j = 0; j < locations.length; j++) {
                // Skip rows whose URL has already been collected (dedup)
                if (data.contains(companyandsites[2 * j])) {
                    continue;
                }
                // Skip malformed rows: a valid job link ends with .htm
                if (!companyandsites[2 * j].endsWith(".htm")) {
                    continue;
                }
                // Work location
                data.add(locations[j]);
                System.out.println(locations[j]);
                // Salary
                data.add(salarys[j]);
                System.out.println(salarys[j]);
                // Link
                data.add(companyandsites[2 * j]);
                System.out.println(companyandsites[2 * j]);
                // Company name
                data.add(companyandsites[2 * j + 1]);
                System.out.println(companyandsites[2 * j + 1]);
            }
        }
        return data;
    }
}
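One refinement worth mentioning (my own sketch, not part of the original code): instead of running replaceAll() over cless.toString(), jsoup can walk the parsed elements directly, which is less brittle if the site's markup changes. extractCompanies below is a hypothetical helper, built on the same "gsmc" class assumption as the code above:

import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: pull (url, company) pairs straight from the DOM
private static List<String[]> extractCompanies(Document page) {
    List<String[]> result = new ArrayList<>();
    for (Element cell : page.getElementsByClass("gsmc")) {
        Element link = cell.select("a").first(); // first anchor in the cell
        if (link == null) {
            continue; // the header cell ("公司名称") contains no link
        }
        result.add(new String[]{link.attr("href"), link.text()});
    }
    return result;
}

Each returned pair could then replace the companyandsites bookkeeping, and the .htm suffix check would apply to pair[0].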
Off topic: this is my first blog post, so it's a bit messy and I'm still a beginner... please bear with me.