A Simple Web Crawler Based on HTTP

早上吃啥中午吃啥晚上吃啥

Last modified: 2022-08-06 15:06:45

Tags: crawler, eclipse, java, network, http

First published: 2022-08-06 15:06:38

Article link: https://blog.csdn.net/qq_50587186/article/details/125901361

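HTTP Overview

HTTP is the most widely used foundation protocol for Web applications. A browser visiting a website and a mobile app talking to its backend server both do so over HTTP.

HTTP stands for HyperText Transfer Protocol. It is a request-response protocol layered on top of TCP.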

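The format of an HTTP request is fixed: it consists of two parts, the HTTP Header and the HTTP Body. The first line always has the form "method path HTTP-version". For example, GET / HTTP/1.1 means the request uses the GET method, the path is /, and the version is HTTP/1.1.
Every following line has the fixed form Header: Value, and these lines are called HTTP Headers. The server relies on certain headers to identify the client's request, for example:
Host: the domain name being requested. One server may host several websites, so Host is needed to tell which site the request is for;
User-Agent: the client's own identification. Different browsers send different values, and the server uses User-Agent to tell whether the client is IE or Chrome, Firefox or a Python crawler;
Accept: the response formats the client can handle. */* means any format, text/* means any text, and image/png means a PNG image;
Accept-Language: the languages the client accepts, listed in order of priority. The server uses this field to return the page in a suitable language.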
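
To make the request layout concrete, here is a minimal sketch that assembles the raw text of a GET request from a path and a set of headers. The class name GetRequestSketch, the helper buildGetRequest, and the host www.example.com are assumptions made for illustration; a real client would normally let a library produce this text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GetRequestSketch {
    // Builds the raw text of a GET request: request line, one line per header,
    // then an empty line that ends the header section.
    static String buildGetRequest(String path, Map<String, String> headers) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        for (Map.Entry<String, String> h : headers.entrySet()) {
            sb.append(h.getKey()).append(": ").append(h.getValue()).append("\r\n");
        }
        sb.append("\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the headers in insertion order.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Host", "www.example.com"); // hypothetical host
        headers.put("User-Agent", "Mozilla/5.0");
        headers.put("Accept", "*/*");
        headers.put("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8");
        System.out.print(buildGetRequest("/", headers));
    }
}
```

Lines in the request are separated by CRLF (\r\n), which is why the sketch does not use the platform line separator.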

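A GET request normally consists of headers only and carries no HTTP Body.

A POST request carries a Body, separated from the headers by an empty line.

A POST request usually sets Content-Type to state the type of the Body and Content-Length to state its length, so that the server can produce the right response from the request's headers and body.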
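
As a rough illustration of how Content-Type and Content-Length relate to the body, the sketch below builds the raw text of a POST request. The class PostRequestSketch, the helper buildPostRequest, the path /login, and the host are hypothetical. Note that Content-Length counts bytes of the encoded body, not characters.

```java
import java.nio.charset.StandardCharsets;

class PostRequestSketch {
    // Builds the raw text of a POST request whose body is encoded as UTF-8.
    static String buildPostRequest(String path, String host, String contentType, String body) {
        // Content-Length is the size of the body in bytes after encoding.
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Content-Type: " + contentType + "\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n" // empty line separating headers from body
                + body;
    }

    public static void main(String[] args) {
        String request = buildPostRequest(
                "/login",                              // hypothetical path
                "www.example.com",                     // hypothetical host
                "application/x-www-form-urlencoded",
                "username=hello&password=123456");
        System.out.println(request);
    }
}
```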

An HTTP response also consists of a Header and a Body. A typical HTTP response looks like this:

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 133251

<!DOCTYPE html>
<html><body>
<h1>Hello</h1>
...

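The first line of a response always has the form "HTTP-version status-code reason-phrase".

For example, HTTP/1.1 200 OK means the version is HTTP/1.1, the status code is 200, and the reason phrase is OK. The client relies only on the status code to decide whether the response succeeded. HTTP defines fixed classes of status codes:
1xx: an informational response. For example, 101 means the protocol will be switched, which is common for WebSocket connections;
2xx: a successful response. For example, 200 means success and 206 means only part of the content was sent;
3xx: a redirection. For example, 301 means a permanent redirect and 303 means the client should send a new request to the specified location;
4xx: an error caused by the client. For example, 400 means an invalid request, caused by a bad Content-Type or many other reasons, and 404 means the requested path does not exist;
5xx: an error caused by the server. For example, 500 means an internal server failure and 503 means the server is temporarily unable to respond.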
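
The status line and the five classes above can be sketched in a few lines. The class StatusLineSketch and its two helpers are illustrative names, not part of any library.

```java
class StatusLineSketch {
    // Extracts the numeric status code from a line such as "HTTP/1.1 200 OK".
    static int parseStatusCode(String statusLine) {
        // Split into at most three parts: version, code, reason phrase.
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }

    // Maps a status code to its class using the first digit.
    static String category(int code) {
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        int code = parseStatusCode("HTTP/1.1 200 OK");
        System.out.println(code + " -> " + category(code));
    }
}
```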

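HTTP Programming

URL: Uniform Resource Locator

The steps for downloading a single image are as follows:
Find the image's address on the website and create a URL instance from it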

URL imageur1=new URL("https://img2.doubanio.com/view/photo/m/public/p2875247682.webp");

Open a connection from the URL instance

HttpURLConnection connect = (HttpURLConnection)imageur1.openConnection();

Set the request method to GET

connect.setRequestMethod("GET");

Set the request header properties

connect.setRequestProperty("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");

Finally, read the image through an input stream and write it out through an output stream

The complete code is as follows:

package com.gjh.demo01;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class Text02 {
	public static void main(String[] args) {
		try {
			// URL (uniform resource locator) of one movie poster image
			URL imageur1 = new URL("https://img2.doubanio.com/view/photo/m/public/p2875247682.webp");
			// Open the connection
			HttpURLConnection connect = (HttpURLConnection) imageur1.openConnection();
			// Set the request method to GET
			connect.setRequestMethod("GET");
			// Set the User-Agent request header so the request identifies itself as a browser
			connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");

			try (// Read the image bytes from the response
					BufferedInputStream bis = new BufferedInputStream(connect.getInputStream());
					// Write the bytes to a local file; the source is a WebP image, so keep that extension
					BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream("D:\\Text\\douban file\\" + System.currentTimeMillis() + ".webp"))) {
				// Read and write in 1 KB chunks until the stream ends
				byte[] buff = new byte[1024];
				int len;
				while ((len = bis.read(buff)) != -1) {
					bos.write(buff, 0, len);
				}
			}
		} catch (IOException e) {
			// MalformedURLException and ProtocolException are both subclasses of IOException
			e.printStackTrace();
		}
	}
}

The result of running the program (screenshot in the original post):

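Downloading all poster images on the site's home page

This time the URL instance is created with the address of the site's home page. The HTML returned by the site then has to be searched for the image information (the original post showed a screenshot of the markup here). Inside the read loop, each matching line can either be cut apart with string operations or parsed with jsoup.

1. Extracting with string operations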

package com.gjh.demo01;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Text03 {
	public static void main(String[] args) {
		// Download the poster images on the Douban movie home page into a local directory
		try {
			URL imageur1 = new URL("https://movie.douban.com/");
			HttpURLConnection connect = (HttpURLConnection) imageur1.openConnection();
			connect.setRequestMethod("GET");
			connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");

			// Read the home page HTML line by line as UTF-8 text
			try (BufferedReader reader = new BufferedReader(new InputStreamReader(connect.getInputStream(), StandardCharsets.UTF_8))) {
				String line;
				while ((line = reader.readLine()) != null) {
					line = line.trim();
					// Only handle <img> lines that point at a .jpg poster and carry an alt attribute
					if (line.startsWith("<img") && line.contains("https://img") && line.contains(".jpg") && line.contains("alt=")) {
						// Cut the image URL out of the line: from "https:" through ".jpg"
						int startPath = line.indexOf("https:");
						int endPath = line.indexOf(".jpg", startPath) + 4;
						String path = line.substring(startPath, endPath);

						// Cut the movie title out of alt="..." (+5 skips the characters alt=")
						int startName = line.indexOf("alt=") + 5;
						int endName = line.indexOf("\"", startName);
						String name = line.substring(startName, endName);

						// Download the image and save it under the movie title
						URL imageUr1 = new URL(path);
						HttpURLConnection imageUr1connect = (HttpURLConnection) imageUr1.openConnection();
						try (BufferedInputStream in = new BufferedInputStream(imageUr1connect.getInputStream());
								BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\Text\\douban file\\" + name + ".jpg"))) {
							byte[] buff = new byte[1024];
							int len;
							while ((len = in.read(buff)) != -1) {
								out.write(buff, 0, len);
							}
						} catch (Exception e) {
							// A failed download should not stop the remaining images
							e.printStackTrace();
						}
					}
				}
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}
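
The substring logic can be tried offline on a single line of markup, without touching the network. The sketch below assumes a made-up sample line; the real markup on the Douban home page may differ, and the class ImgTagSketch and its helpers are hypothetical names.

```java
class ImgTagSketch {
    // Returns the text from "https:" through the first ".jpg" that follows it.
    static String extractSrc(String line) {
        int start = line.indexOf("https:");
        int end = line.indexOf(".jpg", start) + 4; // +4 keeps ".jpg" itself
        return line.substring(start, end);
    }

    // Returns the value inside alt="...".
    static String extractAlt(String line) {
        int start = line.indexOf("alt=") + 5; // +5 skips the characters alt="
        int end = line.indexOf("\"", start);  // closing quote of the attribute
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not copied from the real page.
        String line = "<img src=\"https://img1.example.com/poster/p123.jpg\" alt=\"Sample Movie\" class=\"\" />";
        System.out.println(extractSrc(line));
        System.out.println(extractAlt(line));
    }
}
```

This approach depends on the exact attribute layout of each line, which is the main reason an HTML parser is usually the more robust choice.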

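2. Parsing with jsoup

Jsoup class: the entry point that parses the raw HTML
Document class: the web page document (contains all the tags that were parsed)
Elements class: a collection of Element objects (extends ArrayList)
Element class: a single HTML element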

package com.gjh.demo01;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Text04 {
	public static void main(String[] args) {
		// Inside the loop, parse each matching line with jsoup and read its attributes
		try {
			URL imageur1 = new URL("https://movie.douban.com/");
			HttpURLConnection connect = (HttpURLConnection) imageur1.openConnection();
			connect.setRequestMethod("GET");
			connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");

			// Read the home page HTML line by line as UTF-8 text
			try (BufferedReader reader = new BufferedReader(new InputStreamReader(connect.getInputStream(), StandardCharsets.UTF_8))) {
				String line;
				while ((line = reader.readLine()) != null) {
					line = line.trim();
					if (line.startsWith("<img") && line.contains("https://img") && line.contains(".jpg")) {
						// Parse the line into a Document object
						Document doc = Jsoup.parse(line);
						// Get all img elements from the Document and take the first one
						Element imagelement = doc.getElementsByTag("img").first();
						if (imagelement == null) {
							continue; // no img element was parsed from this line
						}

						// Read the src and alt attributes of the img element
						String src = imagelement.attr("src"); // image URL
						String alt = imagelement.attr("alt"); // movie title

						// Download the image and save it under the movie title
						URL imageUr1 = new URL(src);
						HttpURLConnection imageUr1connect = (HttpURLConnection) imageUr1.openConnection();
						try (BufferedInputStream in = new BufferedInputStream(imageUr1connect.getInputStream());
								BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\Text\\douban file\\" + alt + ".jpg"))) {
							byte[] buff = new byte[1024];
							int len;
							while ((len = in.read(buff)) != -1) {
								out.write(buff, 0, len);
							}
						} catch (Exception e) {
							// A failed download should not stop the remaining images
							e.printStackTrace();
						}
					}
				}
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}
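
Both listings use the movie title as the file name, and a title may contain characters that Windows does not allow in file names. One possible safeguard, offered only as a sketch, is to replace those characters before building the path. The class FileNameSketch and the helper safeFileName are hypothetical and do not appear in the original code.

```java
class FileNameSketch {
    // Replaces the characters \ / : * ? " < > | with an underscore,
    // since Windows rejects them in file names.
    static String safeFileName(String title) {
        return title.replaceAll("[\\\\/:*?\"<>|]", "_");
    }

    public static void main(String[] args) {
        // Hypothetical title containing a slash, a colon and a question mark.
        System.out.println(safeFileName("A/B: C?") + ".jpg");
    }
}
```

With such a helper, the FileOutputStream path could be built from safeFileName(alt) instead of alt directly.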

Both ways of obtaining the image paths produce the same result (screenshot in the original post).
