Getting Started with Java Web Crawlers (Part 1): A Beginner's Notes


小负子

Article link: https://blog.csdn.net/qq_41083742/article/details/79255368

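Today I looked into web crawling. For Java, the main options are WebMagic, jsoup, and HttpClient. All of them require downloading JAR files, which is a hassle: one dependency or another always turns out to be missing, and complete bundles are hard to find online.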

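So, once I understood the basic idea behind crawling, I wrote a small program that uses only Java's standard library to download images from the web (it only works on static pages).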

1. Analyze the page source

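I chose pictures of my favorite actress, Haruka Ayase (绫濑遥). Press F12 to open the page source in the browser's developer tools and locate the image container.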

[Screenshot: the image container in the page source]

Then find the image link.

[Screenshot: the image link in the page source]

2. Download the entire page

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Pac1 {
    // Baidu image search results page used as the page to crawl
    static String url1="https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%BB%AB%E6%BF%91%E9%81%A5%E5%89%A7%E7%85%A7";

    public static void main(String[] args) throws FileNotFoundException {
        // Local file that receives the downloaded page source
        File file = new File("C:\\Users\\小负子\\Pictures\\test\\123.txt");
        try {
            URL url2 = new URL(url1);
            URLConnection con = url2.openConnection();
            BufferedReader bu = new BufferedReader(new InputStreamReader(con.getInputStream()));
            FileOutputStream fi = new FileOutputStream(file);
            BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(fi));

            // Read each line exactly once. Calling readLine() in both the loop
            // condition and the loop body would skip every other line and could
            // pass null to write().
            String str;
            while ((str = bu.readLine()) != null) {
                bf.write(str);
                bf.newLine();
            }
            bf.flush();
            bu.close();
            bf.close();
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
        // main() continues below
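
The download step above can also be sketched as a small reusable helper. This is only an illustration, not the author's code: the class name `PageFetcher` and its methods `readAll` and `fetch` are hypothetical, and the sketch assumes the page is UTF-8 encoded. It reads each line exactly once and uses try-with-resources so the reader is closed even when reading fails.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program
class PageFetcher {

    // Reads every line from the source exactly once and joins them with '\n'
    static String readAll(Reader source) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Downloads the page at the given address (assumption: the page is UTF-8)
    static String fetch(String address) throws IOException {
        URLConnection con = URI.create(address).toURL().openConnection();
        return readAll(new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8));
    }
}
```

With this helper, `PageFetcher.fetch(url1)` would return the page as a string, so the intermediate text file becomes optional.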

3. Extract the image links

Use the regular expression "https://ss\\d+\\.bdstatic\\.com\\S+\\.jpg" to match the image links, and store the matches in a Set so that duplicates are dropped.

Regular expression tutorial: http://www.runoob.com/regexp/regexp-tutorial.html
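
To see how this expression behaves, here is a minimal sketch that runs it against a made-up HTML fragment. The class name `ImageLinkDemo`, the method `extract`, and the sample URLs are assumptions for illustration only. It uses a `LinkedHashSet` instead of a plain `HashSet` purely so the printed order is predictable.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original program
class ImageLinkDemo {
    // Same expression as in the article
    static final Pattern IMAGE_URL =
            Pattern.compile("https://ss\\d+\\.bdstatic\\.com\\S+\\.jpg");

    // Collects every distinct match, in order of first appearance
    static Set<String> extract(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = IMAGE_URL.matcher(html);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup; the first and third links are identical
        String sample = "<img src=\"https://ss1.bdstatic.com/a/1.jpg\"> "
                + "<img src=\"https://ss2.bdstatic.com/b/2.jpg\"> "
                + "<img src=\"https://ss1.bdstatic.com/a/1.jpg\">";
        for (String link : extract(sample)) {
            System.out.println(link);
        }
    }
}
```

One caveat: `\S+` is greedy, so if two links appear with no whitespace between them (as can happen inside JSON data), a single match can run from the start of the first link to the end of the last one. A reluctant quantifier (`\S+?`) is one possible way to avoid that.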

        // Continues inside main(): match image URLs hosted on ssN.bdstatic.com
        String patter = "https://ss\\d+\\.bdstatic\\.com\\S+\\.jpg";
        BufferedReader bu1 = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        Pattern p = Pattern.compile(patter);
        int i = 0;  // number of matches found, duplicates included
        String str1 = null;
        StringBuilder str2 = new StringBuilder();
        HashSet<String> set = new HashSet<String>();  // a set drops duplicate URLs
        try {
            // Load the saved page into memory, keeping the line breaks so that
            // \S+ cannot run across two lines
            while ((str1 = bu1.readLine()) != null) {
                str2.append(str1).append('\n');
            }
            Matcher m = p.matcher(str2);
            while (m.find()) {
                set.add(m.group());
                i++;
            }
            bu1.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

4. Download the images

Read the image URLs from the set and download each one to the local disk.

First write a download class that handles the I/O. Then, for every URL taken from the set, create a new instance of that class.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class Matchtest {

    // Downloads the image at str1 and saves it as the i-th picture
    public Matchtest(String str1, int i) throws IOException {
        // Create the target directory if it does not exist yet
        File test = new File("C:\\Users\\小负子\\Pictures\\test\\绫濑遥图片");
        if (!test.exists()) {
            test.mkdirs();
        }
        File file = new File(test, "第" + i + "张" + ".jpg");

        // Failures propagate to the caller as IOException (MalformedURLException
        // and FileNotFoundException are subclasses), so the code never goes on
        // to use a null URL, connection, or stream
        URL url = new URL(str1);
        URLConnection con = url.openConnection();

        // try-with-resources closes both streams even if the copy fails
        try (InputStream io = con.getInputStream();
             FileOutputStream fi = new FileOutputStream(file)) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = io.read(buf)) != -1) {
                fi.write(buf, 0, len);
            }
        }
    }

}
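
As an alternative to the hand-written copy loop, the standard library's `java.nio.file.Files.copy` can write a stream straight to a file. The sketch below is an assumption-laden illustration rather than the author's method: the class name `StreamSaver` and the method `save` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original program
class StreamSaver {

    // Copies the whole stream into target and returns the number of bytes written.
    // Missing parent directories are created and an existing file is replaced.
    static long save(InputStream in, Path target) throws IOException {
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Close the stream once the copy finishes or fails
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

For a download, the stream would come from the connection, for example `StreamSaver.save(con.getInputStream(), file.toPath())`.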

Iterating over the set

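        // Continues inside main(): download every URL collected in the set
        Iterator<String> it = set.iterator();
        int cout = 0;  // running index used in the saved file name
        while (it.hasNext()) {
            cout++;
            try {
                String string = it.next();
                new Matchtest(string, cout);
            } catch (IOException e) {
                // A failed download is reported and the loop moves on
                e.printStackTrace();
            }
        }
    }  // end of main
}  // end of class Pac1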
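
The same traversal can be written with an enhanced for loop. The sketch below is illustrative only: the class `DownloadLoop` and the `Downloader` interface are hypothetical names, and the callback stands in for `new Matchtest(url, index)` so the loop can be tried without network access.

```java
import java.io.IOException;
import java.util.Set;

// Hypothetical helper, not part of the original program
class DownloadLoop {

    // Stand-in for the download step; may fail with an IOException
    interface Downloader {
        void download(String url, int index) throws IOException;
    }

    // Tries every URL once and returns how many downloads succeeded
    static int downloadAll(Set<String> links, Downloader downloader) {
        int index = 0;
        int succeeded = 0;
        for (String link : links) {
            index++;
            try {
                downloader.download(link, index);
                succeeded++;
            } catch (IOException e) {
                // One broken link should not stop the remaining downloads
                System.err.println("Skipping " + link + ": " + e.getMessage());
            }
        }
        return succeeded;
    }
}
```

In the program above, this could be called as `DownloadLoop.downloadAll(set, (url, index) -> new Matchtest(url, index));`.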

Result:

[Screenshot of the downloaded images]

Done.
