Reading Data with a Web Crawler over the HTTP Protocol

Latest recommended article published 2023-05-31 15:10:31

记得爱蓝色

Article tags: java eclipse

Article link: https://blog.csdn.net/qq_59616295/article/details/125879259

I. What Is the HTTP Protocol

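HTTP is the foundational protocol used by the vast majority of web applications today. For example, a browser visiting a website and a mobile app calling its backend server both communicate over HTTP. HTTP stands for HyperText Transfer Protocol; it is a request-response protocol built on top of TCP.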

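When a browser wants to visit a website, it first establishes a TCP connection with the web server. By default the server listens on port 80 for plain HTTP and on port 443 for encrypted HTTPS. The browser then sends an HTTP request; the server receives it and returns an HTTP response that contains the HTML of the page, which the browser parses and renders for the user. A complete HTTP request and response look like this: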
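Ports 80 and 443 are defaults tied to the URL scheme, and a URL may name a different port explicitly. A minimal sketch of how a client could work out which port to connect to, using only java.net; the class name DefaultPortDemo and the helper effectivePort are illustrative names, not part of the original article:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class DefaultPortDemo {

    // Returns the port a client would connect to for the given address:
    // the explicit port if the URL contains one, otherwise the scheme's default.
    static int effectivePort(String address) throws MalformedURLException {
        URL url = URI.create(address).toURL();
        return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(effectivePort("http://www.sina.com.cn/"));
        System.out.println(effectivePort("https://www.sina.com.cn/"));
        System.out.println(effectivePort("http://localhost:8080/"));
    }
}
```

No network access is needed here; the URL is only parsed, never opened.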

            GET / HTTP/1.1
            Host: www.sina.com.cn
            User-Agent: Mozilla/5 MSIE
            Accept: */*                ┌────────┐
┌─────────┐ Accept-Language: zh-CN,en  │░░░░░░░░│
│O ░░░░░░░│───────────────────────────>├────────┤
├─────────┤<───────────────────────────│░░░░░░░░│
│         │ HTTP/1.1 200 OK            ├────────┤
│         │ Content-Type: text/html    │░░░░░░░░│
└─────────┘ Content-Length: 133251     └────────┘
  Browser   <!DOCTYPE html>              Server
            <html><body>
            <h1>Hello</h1>
            ...

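The format of an HTTP request is fixed: it consists of two parts, the HTTP Header and the HTTP Body. The first line always has the form "method path HTTP-version". For example, GET / HTTP/1.1 means a GET request for the path / using HTTP/1.1.
Every following line has the fixed form Header: Value; these lines are called HTTP Headers. The server relies on certain headers to identify the client's request, for example:
Host: the domain name being requested. A single server may host several websites, so the Host header is needed to tell which site the request is addressed to.
User-Agent: identifies the client itself. Different browsers send different identifiers, and the server uses User-Agent to tell whether the client is IE or Chrome, Firefox or a Python crawler;
Accept: the response formats the client can handle. */* means any format, text/* means any text format, and image/png means a PNG image;
Accept-Language: the languages the client accepts, listed in order of preference. The server uses this field to return the page in a suitable language.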
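The request format described above can be sketched as a small parser that splits a raw request into its request line and headers. This is a simplified illustration, not a complete HTTP parser: the class name HttpRequestParser is hypothetical, header names are treated as case-sensitive for brevity (real HTTP header names are case-insensitive), and any body is ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HttpRequestParser {

    final String method;
    final String path;
    final String version;
    final Map<String, String> headers = new LinkedHashMap<>();

    HttpRequestParser(String rawRequest) {
        // Header lines end with CRLF; a bare LF is tolerated here for convenience.
        String[] lines = rawRequest.split("\r?\n");

        // First line: method, path and HTTP version separated by single spaces.
        String[] requestLine = lines[0].split(" ");
        if (requestLine.length != 3) {
            throw new IllegalArgumentException("Malformed request line: " + lines[0]);
        }
        method = requestLine[0];
        path = requestLine[1];
        version = requestLine[2];

        // Following lines: "Header: Value" until the first empty line.
        for (int i = 1; i < lines.length && !lines[i].isEmpty(); i++) {
            int colon = lines[i].indexOf(':');
            if (colon <= 0) {
                continue; // skip lines that are not in Header: Value form
            }
            headers.put(lines[i].substring(0, colon).trim(),
                        lines[i].substring(colon + 1).trim());
        }
    }

    public static void main(String[] args) {
        HttpRequestParser request = new HttpRequestParser(
                "GET / HTTP/1.1\r\n"
                + "Host: www.sina.com.cn\r\n"
                + "User-Agent: Mozilla/5 MSIE\r\n"
                + "Accept: */*\r\n"
                + "Accept-Language: zh-CN,en\r\n"
                + "\r\n");
        System.out.println(request.method + " " + request.path + " " + request.version);
        System.out.println(request.headers);
    }
}
```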

Note:
A GET request normally has only HTTP Headers and no HTTP Body. A POST request usually carries an HTTP Body, which follows the headers after a blank line.
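To make the difference concrete, here is a rough sketch of what a POST request with a body looks like on the wire. The class name PostRequestBuilder, the host example.com, the path /login and the form fields are all made up for illustration; the point is the Content-Length header and the blank line before the body.

```java
import java.nio.charset.StandardCharsets;

class PostRequestBuilder {

    // Builds the raw text of a POST request. Content-Length counts the bytes
    // of the body, and an empty line separates the headers from the body.
    static String build(String host, String path, String body) {
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n"
                + body;
    }

    public static void main(String[] args) {
        // Hypothetical host, path and form data, used only for demonstration.
        System.out.println(build("example.com", "/login", "username=hello&password=123456"));
    }
}
```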

An HTTP response also consists of a Header and a Body, separated by a blank line:

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 133251

<!DOCTYPE html>
<html><body>
<h1>Hello</h1>
...
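As a rough sketch of how a client might take such a response apart, the helper below reads the status code from the first line and treats everything after the first blank line as the body. The class name HttpResponseSplitter is an illustrative assumption, and the code ignores details such as chunked encoding.

```java
class HttpResponseSplitter {

    // Status line format: HTTP-version, status code, reason phrase.
    static int statusCode(String rawResponse) {
        String statusLine = rawResponse.split("\r?\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    // The body is whatever follows the first empty line.
    static String body(String rawResponse) {
        int crlf = rawResponse.indexOf("\r\n\r\n");
        if (crlf >= 0) {
            return rawResponse.substring(crlf + 4);
        }
        int lf = rawResponse.indexOf("\n\n");
        return lf >= 0 ? rawResponse.substring(lf + 2) : "";
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html\r\n"
                + "Content-Length: 14\r\n"
                + "\r\n"
                + "<h1>Hello</h1>";
        System.out.println(statusCode(raw));
        System.out.println(body(raw));
    }
}
```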

The following Java code simulates a server that receives a browser's request and sends back a response:


		
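import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class SimpleHttpServer {

	public static void main(String[] args) {
		// Simulate a server that handles client HTTP requests over TCP connections
		try (ServerSocket server = new ServerSocket(8080)) {
			while (true) {
				// Accept a browser connection; the socket is closed once the response is sent
				try (Socket browserClient = server.accept();
						BufferedReader reader = new BufferedReader(new InputStreamReader(browserClient.getInputStream(), StandardCharsets.UTF_8));
						BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(browserClient.getOutputStream(), StandardCharsets.UTF_8))) {

					// Read the request line and headers; stop at the empty line that ends them.
					// Waiting for readLine() to return null would block until the browser disconnects.
					String line;
					while ((line = reader.readLine()) != null && !line.isEmpty()) {
						System.out.println(line);
					}

					// Simulated response: status line, headers, empty line, body (HTTP lines end with CRLF)
					String body = UUID.randomUUID().toString();
					writer.write("HTTP/1.1 200 OK\r\n");
					writer.write("Content-Type: text/plain\r\n");
					writer.write("Content-Length: " + body.length() + "\r\n");
					writer.write("Connection: close\r\n");
					writer.write("\r\n");
					writer.write(body);
					writer.flush();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}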

II. HTTP Programming: Reading Data with a Web Crawler

1. Downloading a Single Image with a Crawler

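To read data with a crawler, we create a URL (Uniform Resource Locator) object and call openConnection() to open a connection. Some servers reject requests that do not look like they come from a browser, so we set request headers such as User-Agent to mimic an ordinary browser visit. We then read the image through a stream and save it locally.

Because we are reading a single image, which is binary data, the read and write operations use byte streams directly.

Example code: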

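import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class SingleImageDownloader {

	public static void main(String[] args) {
		try {
			// Uniform Resource Locator of the image
			URL url = new URL("https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2874262709.webp");

			// Open the connection
			HttpsURLConnection connection = (HttpsURLConnection) url.openConnection();

			// Set the request method (GET)
			connection.setRequestMethod("GET");

			// Set the User-Agent request header so the request looks like a browser visit
			connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62");

			// Make sure the target directory exists; FileOutputStream does not create it
			File dir = new File("d:\\test\\img");
			dir.mkdirs();

			try (
				// Read the image
				BufferedInputStream bis = new BufferedInputStream(connection.getInputStream());
				// Store the image (write its bytes to the output file)
				BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(new File(dir, System.currentTimeMillis() + ".webp")))) {

				// Read and write chunk by chunk
				byte[] buff = new byte[1024];
				int len;
				while ((len = bis.read(buff)) != -1) {
					bos.write(buff, 0, len);
				}
			}
		} catch (IOException e) {
			// MalformedURLException, ProtocolException and FileNotFoundException are all subclasses of IOException
			e.printStackTrace();
		}
	}
}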
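The read-and-write loop in the example does not depend on the network, so it can be pulled out into a helper and tried with in-memory streams. A minimal sketch, where the class name StreamCopy and the method copy are illustrative names chosen here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamCopy {

    // Copies all bytes from in to out in 1024-byte chunks and returns the byte count.
    // read() returns -1 once the end of the stream has been reached.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buff = new byte[1024];
        long total = 0;
        int len;
        while ((len = in.read(buff)) != -1) {
            // Only the first len bytes of buff are valid for this pass
            out.write(buff, 0, len);
            total += len;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[3000]; // stands in for the bytes of an image
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out);
        System.out.println(copied + " bytes copied");
    }
}
```

Writing buff with offset 0 and length len matters because the last chunk is usually shorter than the buffer; writing the whole buffer would append stale bytes to the file.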

2. Downloading All Images on a Page with a Crawler

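The steps are largely the same as for a single image. The difference is that the URL is now the address of the page itself: we read the page's HTML source line by line with a character stream, use plain string operations to filter out everything except the image-related HTML tags, and then download each image and save it locally.

Example code: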

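import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.HttpsURLConnection;

public class PageImageDownloader {

	public static void main(String[] args) {
		try {
			// Uniform Resource Locator of the page
			URL movieHomeURL = new URL("https://movie.douban.com/");

			// Open the connection
			HttpsURLConnection connection = (HttpsURLConnection) movieHomeURL.openConnection();

			// Set the request method (GET)
			connection.setRequestMethod("GET");

			// Set the User-Agent request header
			connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62");

			// Make sure the target directory exists; FileOutputStream does not create it
			File dir = new File("d:\\test\\img");
			dir.mkdirs();

			// Read the page source; try-with-resources closes the reader when done
			try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
				String line;
				while ((line = reader.readLine()) != null) {
					// Strip leading and trailing whitespace
					line = line.trim();
					// Check whether this line is an <img> tag holding a poster image path and an alt attribute
					if (line.startsWith("<img") && line.contains("https://img") && line.contains(".jpg") && line.contains("alt=\"")) {

						// Extract the image path (src) and the movie title (alt)
						int beginIndex = line.indexOf("https://");
						int endIndex = line.indexOf(".jpg") + 4;
						String src = line.substring(beginIndex, endIndex);

						beginIndex = line.indexOf("alt=\"") + 5;
						endIndex = line.indexOf("\"", beginIndex);
						// Replace characters that are not allowed in Windows file names
						String alt = line.substring(beginIndex, endIndex).replaceAll("[\\\\/:*?\"<>|]", "_");

						// Download the image
						URL imageUrl = new URL(src);
						HttpsURLConnection imageConnection = (HttpsURLConnection) imageUrl.openConnection();

						try (BufferedInputStream in = new BufferedInputStream(imageConnection.getInputStream());
								BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(new File(dir, alt + ".jpg")))) {

							// Read and write chunk by chunk
							byte[] buff = new byte[1024];
							int len;
							while ((len = in.read(buff)) != -1) {
								out.write(buff, 0, len);
							}
						}
					}
				}
			}
		} catch (IOException e) {
			// MalformedURLException and ProtocolException are both subclasses of IOException
			e.printStackTrace();
		}
	}
}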
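The string-based extraction step can be isolated and tried on a sample line without touching the network. The sketch below is one possible way to write it. The class name ImgTagExtractor, its methods and the sample URL in main are hypothetical, and the attribute lookup is deliberately naive (for instance, it would also match an attribute such as data-src), which is part of why plain string handling gets awkward for HTML.

```java
class ImgTagExtractor {

    // Returns {src, alt} for a line holding an <img> tag, or null when the
    // line is not an <img> tag or lacks one of the two attributes.
    static String[] extract(String line) {
        line = line.trim();
        if (!line.startsWith("<img")) {
            return null;
        }
        String src = attribute(line, "src");
        String alt = attribute(line, "alt");
        return (src == null || alt == null) ? null : new String[] { src, alt };
    }

    // Finds name="value" in the tag text and returns value, or null if absent.
    static String attribute(String tag, String name) {
        String key = name + "=\"";
        int begin = tag.indexOf(key);
        if (begin < 0) {
            return null;
        }
        begin += key.length();
        int end = tag.indexOf('"', begin);
        return end < 0 ? null : tag.substring(begin, end);
    }

    // Replaces characters that Windows does not allow in file names.
    static String safeFileName(String name) {
        return name.replaceAll("[\\\\/:*?\"<>|]", "_");
    }

    public static void main(String[] args) {
        // Made-up sample line; a real page may format its tags differently.
        String line = "  <img src=\"https://img1.example.com/a/p123.jpg\" alt=\"Some: Movie?\" class=\"x\">";
        String[] result = extract(line);
        System.out.println(result[0]);
        System.out.println(safeFileName(result[1]) + ".jpg");
    }
}
```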

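Extracting image tags with plain string operations is inconvenient, so at this point we can use jsoup to parse the HTML instead. jsoup is a third-party library, so its jar has to be added to the project manually.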

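// Parse the HTML with jsoup
// (requires: import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;)
// Jsoup class: entry point that performs the parsing
// Document class: the parsed document (contains all parsed tags)
// Elements class: a collection of Element objects (extends ArrayList)
// Element class: a single HTML element
// This fragment replaces the indexOf/substring extraction inside the loop of the previous example.
					String src = "", alt = "";
					// Parse the line into a Document object
					Document doc = Jsoup.parse(line);
					// Get all elements named img from the Document (an Elements collection)
					// and take the first one; the line is already known to start with <img
					Element imgElement = doc.getElementsByTag("img").get(0);

					// Read the src and alt attributes of the img element
					src = imgElement.attr("src");
					alt = imgElement.attr("alt");

					URL imageUrL = new URL(src);
					HttpsURLConnection imageConnection = (HttpsURLConnection)imageUrL.openConnection();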