I had wanted to learn web crawling for a while but never had the time. Over the past few days I put together a beginner-level crawler. Below is a walkthrough of the steps, the classes involved, and the overall workflow (the project is built following the flow below).
The crawler project is built in five layers:
1. User interface layer
2. Task scheduling layer
3. Network crawling layer
4. Data parsing layer
5. Data persistence layer
Building the project
First, create a new Maven project.
Then lay out the packages the crawler will need:
download package: utility classes that download the page at a URL and detect its encoding
paser package: parses the downloaded pages (the data parsing layer)
persistence package: persists the parsed data (the data persistence layer)
pojos package: holds the bean classes
schedule package: receives URL tasks from outside and, via a dispatch strategy, distributes them to the crawl tasks
ui package: design and implementation of the crawler's external-facing interface
utils package: common utility classes
Also create a seeds.txt file at the project root.
Configure pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.tl.spider</groupId>
<artifactId>SimpleYouthNewsSpider4Job002</artifactId>
<version>0.0.1-SNAPSHOT</version>
<!-- First configure the repository location; Aliyun is preferred. A mirror can be configured instead, to the same effect. -->
<repositories>
<repository>
<id>nexus-aliyun</id>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</repository>
</repositories>
<dependencies>
</dependencies>
<build>
<finalName>SimpleYouthNewsSpider4Job002</finalName>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
</project>
1) In the utils package, write a file-reading class that returns the file's contents as a list of lines split on newlines (used to load the system's seed URLs). It reads the contents of seeds.txt:
public static List<String> getFileLineList(String filePath, String charset) throws IOException {
    File fileObj = new File(filePath);
    FileInputStream fis = new FileInputStream(fileObj);
    // Pass the charset through; otherwise the platform default encoding is used
    InputStreamReader isr = new InputStreamReader(fis, charset);
    BufferedReader br = new BufferedReader(isr);
    List<String> lineList = new ArrayList<String>();
    String temp = null;
    while ((temp = br.readLine()) != null) {
        temp = temp.trim();
        if (temp.length() > 0) {
            lineList.add(temp);
        }
    }
    br.close();
    return lineList;
}
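The trim-and-skip-blank behaviour of getFileLineList can be exercised without touching the file system by running the same loop over any Reader. The class and method names below are made up for this demo:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SeedsDemo {
    // Same logic as getFileLineList above, but over any Reader,
    // so it can be tested with an in-memory string.
    static List<String> readSeedLines(Reader reader) throws IOException {
        BufferedReader br = new BufferedReader(reader);
        List<String> lineList = new ArrayList<String>();
        String temp;
        while ((temp = br.readLine()) != null) {
            temp = temp.trim();
            if (temp.length() > 0) {
                lineList.add(temp);
            }
        }
        br.close();
        return lineList;
    }

    public static void main(String[] args) throws IOException {
        // Blank lines and surrounding whitespace are dropped, seeds kept in order
        List<String> seeds = readSeedLines(new StringReader(
                "http://news.youth.cn/\n\n  http://example.com/  \n"));
        System.out.println(seeds); // [http://news.youth.cn/, http://example.com/]
    }
}
```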
2) In the utils package, write a class to hold constants:
public class StaticValue {
    public final static String NEXT_LINE = "\n";
    public final static String ENCODING_DEFAULT = "utf-8";
}
3) In the download package, write a class that crawls a page: it takes a URL and a charset and returns the page content.
(The drawback: the encoding is fixed by the caller and cannot adapt to the actual page.)
public static String getHtmlSourceBySocket(String url, String charset) throws Exception {
    URL urlObj = new URL(url);
    InputStream is = urlObj.openStream();
    InputStreamReader isr = new InputStreamReader(is, charset);
    BufferedReader br = new BufferedReader(isr);
    StringBuilder stringBuilder = new StringBuilder();
    String temp = null;
    int lineCounter = 0;
    while ((temp = br.readLine()) != null) {
        // Prepend the newline from the second line on,
        // so the result carries no trailing newline
        if (lineCounter > 0) {
            stringBuilder.append(StaticValue.NEXT_LINE);
        }
        lineCounter++;
        stringBuilder.append(temp);
    }
    br.close();
    return stringBuilder.toString();
}
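The lineCounter trick above appends the newline before every line except the first, so the returned source has no trailing newline. That joining logic can be sketched over an in-memory Reader (demo names are made up):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class JoinLinesDemo {
    // Same joining logic as getHtmlSourceBySocket above:
    // "\n" is appended before every line except the first
    static String joinLines(Reader reader) throws IOException {
        BufferedReader br = new BufferedReader(reader);
        StringBuilder sb = new StringBuilder();
        String temp;
        int lineCounter = 0;
        while ((temp = br.readLine()) != null) {
            if (lineCounter > 0) {
                sb.append("\n");
            }
            lineCounter++;
            sb.append(temp);
        }
        br.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Four input lines come back joined with three newlines, none trailing
        System.out.println(joinLines(new StringReader("<html>\n<body>\n</body>\n</html>\n")));
    }
}
```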
4) Adapt to each page's actual encoding. There are two approaches.
First: read the charset from the Content-Type header of the HTTP response (the most reliable way):
Implementation:
public static String getCharset(String url) throws Exception {
    String finalCharset = null;
    URL urlObj = new URL(url);
    URLConnection urlConn = urlObj.openConnection();
    // Read the charset from the response headers
    Map<String, List<String>> allHeaderMap = urlConn.getHeaderFields();
    List<String> kvList = allHeaderMap.get("Content-Type");
    if (kvList != null && !kvList.isEmpty()) {
        String line = kvList.get(0);
        String[] kvArray = line.split(";");
        for (String kv : kvArray) {
            String[] eleArray = kv.split("=");
            // trim() before comparing: splitting "text/html; charset=UTF-8" on ";"
            // leaves a leading space in front of "charset"
            if (eleArray.length == 2 && eleArray[0].trim().equals("charset")) {
                finalCharset = eleArray[1].trim();
            }
        }
    }
    return finalCharset; // null when the header carries no charset
}
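The header parsing can be checked in isolation by running the same split logic over a raw header value string such as "text/html; charset=UTF-8". The class and method names here are made up for the demo:

```java
public class HeaderCharsetDemo {
    // The same split-on-";" then split-on-"=" parsing used in getCharset above,
    // factored over a raw Content-Type header value
    static String charsetFromContentType(String headerValue) {
        String finalCharset = null;
        for (String kv : headerValue.split(";")) {
            String[] eleArray = kv.trim().split("=");
            if (eleArray.length == 2 && eleArray[0].trim().equals("charset")) {
                finalCharset = eleArray[1].trim();
            }
        }
        return finalCharset;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=UTF-8")); // UTF-8
        System.out.println(charsetFromContentType("text/html"));                // null
    }
}
```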
Second: read the charset attribute of the meta tag inside the page's <head>:
Implementation:
public static String getCharset(String url) throws Exception {
    String finalCharset = null;
    // getBR is the helper in the download package that opens a BufferedReader
    // over the page using the default encoding
    BufferedReader br = WebPageDownLoadUtil.getBR(url, StaticValue.ENCODING_DEFAULT);
    String temp = null;
    while ((temp = br.readLine()) != null) {
        temp = temp.toLowerCase(); // lower-case the source so matching is case-insensitive
        String charset = getCharsetVaue4Line(temp);
        if (charset != null) {
            finalCharset = charset;
            break;
        }
        // The meta tag lives in <head>; stop scanning once it ends
        if (temp.contains("</head>")) {
            break;
        }
    }
    br.close();
    return finalCharset;
}
For the second approach, build a regular expression that matches the meta tag; tailor it to the pages you are actually crawling, e.g.:
public static String getCharsetVaue4Line(String line) {
    String regex = "charset=\"?(.+?)\"?\\s?/?>";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(line);
    String charsetValue = null;
    if (matcher.find()) {
        charsetValue = matcher.group(1);
    }
    return charsetValue;
}
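The regex can be sanity-checked against a few typical meta lines. This uses the exact pattern from getCharsetVaue4Line; only the demo class name is made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetDemo {
    // Same regex as getCharsetVaue4Line above; input lines are expected to be
    // lower-cased first, as the crawl loop does
    static String getCharsetVaue4Line(String line) {
        Matcher matcher = Pattern.compile("charset=\"?(.+?)\"?\\s?/?>").matcher(line);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // HTML5-style meta tag
        System.out.println(getCharsetVaue4Line("<meta charset=\"utf-8\">"));
        // HTML4-style http-equiv meta tag
        System.out.println(getCharsetVaue4Line(
                "<meta http-equiv=\"content-type\" content=\"text/html; charset=gb2312\" />"));
        // A line with no charset at all yields null
        System.out.println(getCharsetVaue4Line("<title>news</title>"));
    }
}
```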
Then combine the two: when the first approach finds nothing, fall back to the second:
Combined code:
public static String getCharset(String url) throws Exception {
    String finalCharset = null;
    URL urlObj = new URL(url);
    URLConnection urlConn = urlObj.openConnection();
    // First try the Content-Type response header
    Map<String, List<String>> allHeaderMap = urlConn.getHeaderFields();
    List<String> kvList = allHeaderMap.get("Content-Type");
    if (kvList != null && !kvList.isEmpty()) {
        String line = kvList.get(0);
        String[] kvArray = line.split(";");
        for (String kv : kvArray) {
            String[] eleArray = kv.split("=");
            if (eleArray.length == 2 && eleArray[0].trim().equals("charset")) {
                finalCharset = eleArray[1].trim();
            }
        }
    }
    if (finalCharset == null) {
        // Fall back to the meta tag in the page source
        BufferedReader br = WebPageDownLoadUtil.getBR(url, StaticValue.ENCODING_DEFAULT);
        String temp = null;
        while ((temp = br.readLine()) != null) {
            temp = temp.toLowerCase();
            String charset = getCharsetVaue4Line(temp);
            if (charset != null) {
                finalCharset = charset;
                break;
            }
            if (temp.contains("</head>")) {
                break;
            }
        }
        br.close();
    }
    return finalCharset;
}
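Since the network call itself cannot be run here, the header-first, meta-fallback order can be sketched over plain string inputs. All names in this demo are hypothetical; the parsing mirrors the methods above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFallbackDemo {
    // Header parsing as in the Content-Type section above
    static String charsetFromContentType(String headerValue) {
        if (headerValue == null) return null;
        for (String kv : headerValue.split(";")) {
            String[] ele = kv.trim().split("=");
            if (ele.length == 2 && ele[0].trim().equals("charset")) {
                return ele[1].trim();
            }
        }
        return null;
    }

    // Meta-tag regex as in getCharsetVaue4Line above
    static String charsetFromMetaLine(String line) {
        Matcher m = Pattern.compile("charset=\"?(.+?)\"?\\s?/?>").matcher(line.toLowerCase());
        return m.find() ? m.group(1) : null;
    }

    // Same order as the combined getCharset: header first, meta tag as fallback
    static String detectCharset(String contentTypeHeader, String metaLine) {
        String charset = charsetFromContentType(contentTypeHeader);
        return charset != null ? charset : charsetFromMetaLine(metaLine);
    }

    public static void main(String[] args) {
        // The header wins when it carries a charset...
        System.out.println(detectCharset("text/html; charset=UTF-8", "<meta charset=\"gbk\">"));
        // ...otherwise the meta tag is consulted
        System.out.println(detectCharset("text/html", "<meta charset=\"gbk\">"));
    }
}
```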