I recently became interested in web crawlers, but everything online says Python is the best language for the job. Since I only know Java, I wondered whether a crawler could be built in Java instead. A quick search shows that Java actually has plenty of excellent open-source crawler frameworks, including Gecco, WebMagic, and Jsoup, which let you scrape data even without being deeply familiar with regular expressions.
This example uses Jsoup to parse the pages. Jsoup lets you select HTML elements with a jQuery-like selector syntax, which makes extracting data very convenient.
A short introduction to Jsoup is available here: http://www.open-open.com/jsoup/
The crawler targets the photo sections of https://www.4493.com.
Results first: the image quality is quite good. Due to limited disk space I only crawled ten pages, and that already came to more than a gigabyte.
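To illustrate the selector syntax before the full listing, here is a minimal sketch that parses a static HTML fragment (the fragment and class name are mine, not from the target site) and pulls out text with a jQuery-like selector:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSelectorDemo {
    public static void main(String[] args) {
        // Parse an in-memory HTML fragment instead of fetching a live page.
        String html = "<ul class=\"clearfix\">"
                + "<li><a href=\"/a/1-1.htm\"><span>First gallery</span></a></li>"
                + "<li><a href=\"/a/2-1.htm\"><span>Second gallery</span></a></li>"
                + "</ul>";
        Document doc = Jsoup.parse(html);
        // jQuery-like selector: every <span> inside an <li> under ul.clearfix
        String firstTitle = doc.select("ul.clearfix li span").first().text();
        System.out.println(firstTitle); // First gallery
    }
}
```

The same `select()` call is what the crawler below uses against the real list pages.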
The code:
package crawl.mote;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
/**
 * Crawl photo images from 4493: one example using Jsoup and one using
 * regular expressions (only the Jsoup version is shown here).
 * @author jiuwei
 */
public class main4493 {
    public static final String URL = "https://www.4493.com";
    /**
     * Sexy models (性感美女)
     */
    public static String XGMN = "https://www.4493.com/xingganmote/";
    public static int xgmnPageCount = 10;
    public static final String XGMN_DIR = "性感美女";
    /**
     * Stockings and legs (丝袜美腿)
     */
    public static String SWMT = "https://www.4493.com/siwameitui/";
    public static int swmtPageCount = 0;
    public static final String SWMT_DIR = "丝袜美腿";
    /**
     * Aesthetic photo shoots (唯美写真)
     */
    public static String WMXZ = "https://www.4493.com/weimeixiezhen/";
    public static int wmxzPageCount = 0;
    public static final String WMXZ_DIR = "唯美写真";
    /**
     * Internet beauties (网络美女)
     */
    public static String WLMN = "https://www.4493.com/wangluomeinv/";
    public static int wlmnPageCount = 0;
    public static final String WLMN_DIR = "网络美女";
    /**
     * HD beauties (高清美女)
     */
    public static String GQMN = "https://www.4493.com/gaoqingmeinv/";
    public static int gqmnPageCount = 0;
    public static final String GQMN_DIR = "高清美女";
    /**
     * Models (模特美女)
     */
    public static String MTMN = "https://www.4493.com/motemeinv/";
    public static int mtmnPageCount = 0;
    public static final String MTMN_DIR = "模特美女";
    /**
     * Sports beauties (体育美女)
     */
    public static String TYMN = "https://www.4493.com/tiyumeinv/";
    public static int tymnPageCount = 0;
    public static final String TYMN_DIR = "体育美女";
    /**
     * Anime beauties (动漫美女)
     */
    public static String DMMN = "https://www.4493.com/dongmanmeinv/";
    public static int dmmnPageCount = 0;
    public static final String DMMN_DIR = "动漫美女";
    // download root directory
    public static File DIR = new File("d:\\4493\\");
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < xgmnPageCount; i++) {
            String url = XGMN;
            if (i > 0) {
                // pages after the first follow the pattern index-2.htm, index-3.htm, ...
                url = XGMN + "index-" + (i + 1) + ".htm";
            }
            List<Model> list = getPage(url); // all gallery objects on this list page
            downloadJpg(list, XGMN_DIR);
        }
    }
    /**
     * Parse one list page with Jsoup and return a Model for every gallery on it.
     * @param url list-page URL
     * @return the galleries on this page, with their image URLs already resolved
     */
    public static List<Model> getPage(String url) {
        Document document = getDocument(url); // list page containing links to every gallery
        List<Model> pageUrl = new ArrayList<Model>(); // holds every gallery on this page
        Elements ulElement = document.select("ul.clearfix"); // the gallery list <ul>, by class
        Elements liElement = ulElement.select("li"); // one <li> per gallery
        // build a Model for each gallery on the current page
        for (int i = 0; i < liElement.size(); i++) {
            List<String> imgUrlList = new ArrayList<String>();
            Element e = liElement.get(i);
            Model model = new Model();
            model.setTitle(e.select("span").text());
            String aurl = e.select("a").attr("href");
            aurl = aurl.substring(0, aurl.length() - 6); // drop the last 6 chars (e.g. "-1.htm")
            model.setUrl(URL + aurl + ".htm");
            model.setGxsj(e.select("b.b1").text()); // update date
            Document document1 = getDocument(model.getUrl());
            Elements divElement = document1.select("div.picsbox"); // the <div> holding the images
            Elements imgElement = divElement.select("img"); // every <img> inside that div
            for (int j = 0; j < imgElement.size(); j++) {
                imgUrlList.add(imgElement.get(j).attr("src"));
            }
            model.setImgUrl(imgUrlList);
            model.setZsl(imgUrlList.size()); // total = number of image URLs on the gallery page
            pageUrl.add(model);
        }
        return pageUrl;
    }
    public static void downloadJpg(List<Model> list, String dir2) {
        // download the images to disk, one sub-folder per gallery title
        File file = new File(DIR, dir2);
        if (!file.exists() && file.mkdirs()) { // create the category folder (and parents)
            System.out.println(file + ": created");
        }
        for (int i = 0; i < list.size(); i++) {
            File file1 = new File(file, list.get(i).getTitle());
            if (!file1.exists() && file1.mkdirs()) { // one folder per gallery
                System.out.println(file1 + ": created");
            }
            List<String> srcList = list.get(i).getImgUrl();
            for (int j = 0; j < srcList.size(); j++) {
                String src = srcList.get(j);
                File file2 = new File(file1, (j + 1) + ".jpg");
                if (file2.exists()) {
                    System.out.println(file2 + " already exists, skipping");
                    continue;
                }
                // try-with-resources closes both streams even if the copy fails
                try (BufferedInputStream biStream = new BufferedInputStream(new URL(src).openStream());
                     BufferedOutputStream ouStream = new BufferedOutputStream(new FileOutputStream(file2))) {
                    System.out.println(list.get(i).getTitle() + ": " + src + " downloading...");
                    byte[] buf = new byte[1024];
                    int len;
                    while ((len = biStream.read(buf)) != -1) {
                        ouStream.write(buf, 0, len);
                    }
                    System.out.println(list.get(i).getTitle() + " download complete!");
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
    /**
     * Fetch and parse a page, returning null if the connection fails.
     */
    public static Document getDocument(String url) {
        try {
            // generous 100-second timeout
            return Jsoup.connect(url).timeout(100000).get();
        } catch (IOException e) {
            System.out.println("Connection failed or timed out: " + url);
            e.printStackTrace();
        }
        return null;
    }
}
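The listing above uses a Model class that is not included in the post. A minimal sketch of what it likely looks like, reconstructed from the setter calls (the field names gxsj for the update date and zsl for the total image count follow the methods used above; the exact original class may differ):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the Model POJO used by the crawler above.
public class Model {
    private String title;                            // gallery title
    private String url;                              // gallery page URL
    private String gxsj;                             // update date (更新时间)
    private int zsl;                                 // total image count (总数量)
    private List<String> imgUrl = new ArrayList<>(); // image URLs in the gallery

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getGxsj() { return gxsj; }
    public void setGxsj(String gxsj) { this.gxsj = gxsj; }
    public int getZsl() { return zsl; }
    public void setZsl(int zsl) { this.zsl = zsl; }
    public List<String> getImgUrl() { return imgUrl; }
    public void setImgUrl(List<String> imgUrl) { this.imgUrl = imgUrl; }
}
```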
The full code is packaged at the link below. This was just an exercise in Jsoup, so the functionality is far from complete.
http://download.csdn.net/download/wangqq335/10106691 — corrections from more experienced developers are welcome!
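One design note on downloadJpg: since Java 7, the manual buffer loop can be replaced with java.nio.file.Files.copy, which handles buffering and partial writes for you. A minimal sketch (the download method and class name here are mine, not from the original code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DownloadSketch {
    // Copy the resource at src to target; the stream is closed automatically.
    static void download(String src, Path target) throws IOException {
        try (InputStream in = new URL(src).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```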