A simple Java crawler that runs once you adjust the paths (HttpClient + Jsoup)

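I have been looking into Java crawlers for the past few days and hope to turn my notes into a tutorial. Crawlers like this are usually built with HttpClient and Jsoup.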

 

HttpClient download: http://mirrors.hust.edu.cn/apache//httpcomponents/httpclient/binary/httpcomponents-client-4.3.5-bin.zip

Jsoup download: http://jsoup.org/download

 

Import both libraries into MyEclipse and you are ready to go.

Let's start with an example:

 

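The example code below was written with reference to several sources. The structure of the target site may change, so there is no guarantee that the program will keep working indefinitely. If it stops working, adjust the selector strings passed to select. Also remember to change machine-specific paths, such as the directory where the images are stored.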

package com.hxw.spider;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.CoreConnectionPNames;
import org.apache.http.util.EntityUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class httpGetPics {

	
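	/**
	 * Directory where downloaded images are stored
	 */
	private static final String PIC_DIR = "E:/电脑桌面/captcha";   // the folder must be created beforehand
	
	private static final int TIME_OUT = 500;  // connection timeout in ms; only referenced by the commented-out lines in getByte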
	static void getPics(String url) throws Exception {
	    Connection conn= Jsoup.connect(url);
	    Document doc = conn.get();
	    Elements links = doc.select("div.cc a[href]");
	    for(int i=0;i<links.size();i++){
	        Element element = links.get(i);
	        final String dirUrl = "http://www.3lian.com"+element.attr("href");
	        System.out.println("First-level page URL: "+dirUrl);
	        Thread.sleep(500);
            new Thread(new Runnable() {  // one thread per first-level page to download its images
                public void run() {
                    try {
                        Connection conn= Jsoup.connect(dirUrl);
                        Document doc = conn.get();
                        Elements images = doc.select("ul.ulBigPic li img[src]");
                        for(int j=0;j<images.size();j++){
                            Element img = images.get(j);
                            String imgUrl = img.attr("src");
                            save(imgUrl);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
	    }
    }
	
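	/**
	 * Saves an image into PIC_DIR.
	 * @param url URL of the image
	 * @throws Exception if the file cannot be written
	 */
	static void save(String url) throws Exception {
		String fileName = url.substring(url.lastIndexOf("/") + 1);  // +1 skips the slash, avoiding "//" in filePath
		String filePath = PIC_DIR + "/" + fileName;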
		BufferedOutputStream out = null;
		byte[] bit = getByte(url);
		if (bit.length > 0) {
			try {
				out = new BufferedOutputStream(new FileOutputStream(filePath));
				out.write(bit);
				out.flush();
				System.out.println("Image downloaded successfully!");
			} finally {
				if (out != null)
					out.close();
			}
		}
	}
	
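	/**
	 * Fetches an image as a byte array.
	 * @param uri URL of the image
	 * @return the image bytes, or an empty array if the download fails
	 * @throws Exception
	 */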
	static byte[] getByte(String uri) throws Exception {
		HttpClient client = new DefaultHttpClient();
//		client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, TIME_OUT);
		HttpGet get = new HttpGet(uri);
//		get.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, TIME_OUT);
		try {
			HttpResponse response = client.execute(get);
			if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
				HttpEntity entity = response.getEntity();
				if (entity != null) {
					return EntityUtils.toByteArray(entity);
				}
			}
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			client.getConnectionManager().shutdown();
		}
		System.out.println("Download failed!");
		return new byte[0];
	}

	public static void main(String[] args) throws Exception {
		// Start crawling images
	    getPics("http://www.3lian.com/gif/more/03/0301.html");
	}
}
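
The save method above expects the image folder to exist already and takes the file name from everything after the last slash, so a URL with a query string would give an awkward file name. Here is a minimal sketch of one way to make that step more forgiving, using only the JDK. The class ImagePaths and its methods fileNameOf and targetFor are hypothetical names for illustration and are not part of the original program. It assumes the URL contains no characters that java.net.URI rejects, such as spaces.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

public class ImagePaths {

    /** Returns the last path segment of the URL, ignoring any query string. */
    static String fileNameOf(String url) {
        String path = URI.create(url).getPath();           // path only, without "?query" or "#fragment"
        return path.substring(path.lastIndexOf('/') + 1);  // +1 skips the slash itself
    }

    /** Creates the directory if it is missing and returns the target file path for the URL. */
    static Path targetFor(Path dir, String url) throws IOException {
        Files.createDirectories(dir);                      // does nothing if the directory already exists
        return dir.resolve(fileNameOf(url));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pics");      // stand-in for PIC_DIR
        System.out.println(targetFor(dir, "http://example.com/a/b/cat.gif?v=2"));
    }
}
```

In the crawler, targetFor(Paths.get(PIC_DIR), url) could supply the path that save passes to FileOutputStream, which would remove the need to create the folder by hand.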

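The getPics method above starts a new thread for every link it finds, so the number of threads grows with the number of links on the page. One common alternative is a fixed-size thread pool. The sketch below uses only java.util.concurrent. The class PooledDownloads, the method processAll, and the pool size are assumptions made for illustration, and the task body is a placeholder where the Jsoup fetch and the save calls would go.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PooledDownloads {

    /** Runs one task per URL on at most `threads` workers and returns how many tasks completed. */
    static int processAll(List<String> pageUrls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final AtomicInteger done = new AtomicInteger();
        for (final String pageUrl : pageUrls) {
            pool.submit(new Runnable() {
                public void run() {
                    // Placeholder: the real task would fetch pageUrl and save its images.
                    System.out.println(Thread.currentThread().getName() + " -> " + pageUrl);
                    done.incrementAndGet();
                }
            });
        }
        pool.shutdown();                                   // accept no new tasks; queued ones still run
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);    // wait for the queued tasks to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();            // preserve the interrupt flag for the caller
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("page1", "page2", "page3"), 2));
    }
}
```

With a pool, the number of simultaneous connections to the site stays bounded by the pool size no matter how many links the first page contains.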
 
