这几天在研究 Java 爬虫,争取整理出一个教程。Java 爬虫一般都是用 HttpClient 和 Jsoup 来做的。
httpclient 下载地址:http://mirrors.hust.edu.cn/apache//httpcomponents/httpclient/binary/httpcomponents-client-4.3.5-bin.zip
jsoup 下载地址: http://jsoup.org/download
导入到myeclipse 就可以了
先来个例子:
下面是参照部分资料写的实例代码。由于目标网站的页面结构可能会变化,不保证这个程序永久都能运行;如果运行不了,修改一下 select 语句里的选择器即可。注意部分路径(比如图片存放目录)需要改成自己机器上的路径:
package com.hxw.spider;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.CoreConnectionPNames;
import org.apache.http.util.EntityUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class httpGetPics {

    /** Directory where downloaded images are saved (the folder must already exist). */
    private static final String PIC_DIR = "E:/电脑桌面/captcha";

    /** Connection timeout in milliseconds for the HttpClient image download. */
    private static final int TIME_OUT = 500;

    /**
     * Fetches the listing page at {@code url}, extracts every link inside
     * {@code div.cc}, and spawns one worker thread per link; each worker
     * downloads all {@code img[src]} images found on its page.
     *
     * @param url the listing page to crawl (e.g. a 3lian.com gallery index)
     * @throws Exception if the listing page cannot be fetched or the
     *                   throttling sleep is interrupted
     */
    static void getPics(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("div.cc a[href]");
        for (int i = 0; i < links.size(); i++) {
            Element element = links.get(i);
            // hrefs on this site are relative; prepend the site root.
            final String dirUrl = "http://www.3lian.com" + element.attr("href");
            System.out.println("首页一级图片地址: " + dirUrl);
            Thread.sleep(500); // throttle thread creation so we don't hammer the server
            new Thread(new Runnable() { // one worker thread per album page
                public void run() {
                    try {
                        Document page = Jsoup.connect(dirUrl).get();
                        Elements images = page.select("ul.ulBigPic li img[src]");
                        for (int j = 0; j < images.size(); j++) {
                            save(images.get(j).attr("src"));
                        }
                    } catch (Exception e) {
                        // Sample crawler: report and let the other workers continue.
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }

    /**
     * Downloads the image at {@code url} and writes it into {@link #PIC_DIR},
     * named after the last path segment of the URL. Nothing is written when
     * the download returns no bytes.
     *
     * @param url absolute URL of the image to download
     * @throws Exception if writing the file fails
     */
    static void save(String url) throws Exception {
        // BUGFIX: +1 skips the '/' itself — the original kept the leading
        // slash (producing "PIC_DIR//name") and threw
        // StringIndexOutOfBoundsException for URLs containing no '/'
        // (lastIndexOf returns -1). With +1, -1 + 1 == 0 is safe.
        String fileName = url.substring(url.lastIndexOf("/") + 1);
        String filePath = PIC_DIR + "/" + fileName;
        byte[] bytes = getByte(url);
        if (bytes.length > 0) {
            BufferedOutputStream out = null;
            try {
                out = new BufferedOutputStream(new FileOutputStream(filePath));
                out.write(bytes);
                out.flush();
                System.out.println("图片下载成功!");
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
    }

    /**
     * Fetches the raw bytes at {@code uri} with HttpClient.
     *
     * @param uri absolute URL to fetch
     * @return the response body, or an empty array on any failure
     *         (non-200 status, null entity, or network error)
     * @throws Exception declared for caller compatibility; network errors
     *                   are caught and reported internally
     */
    static byte[] getByte(String uri) throws Exception {
        HttpClient client = new DefaultHttpClient();
        // BUGFIX: TIME_OUT was declared but never applied (both setter lines
        // were commented out), so a dead host could hang a worker thread
        // forever. Apply the connect timeout as originally intended.
        client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, TIME_OUT);
        HttpGet get = new HttpGet(uri);
        try {
            HttpResponse response = client.execute(get);
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    return EntityUtils.toByteArray(entity);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // One client per request in this sample; release its connections.
            client.getConnectionManager().shutdown();
        }
        System.out.println("获取失败!");
        return new byte[0];
    }

    /**
     * Entry point: crawls the sample 3lian.com gallery listing page.
     *
     * @param args unused
     * @throws Exception if the crawl fails
     */
    public static void main(String[] args) throws Exception {
        // 开始抓取图片
        getPics("http://www.3lian.com/gif/more/03/0301.html");
    }
}