Java Homework: Extracting Douban Movie Information with Regular Expressions

天天开心&&天天向上

Category: Java. Tags: regular expressions

Article link: https://blog.csdn.net/m0_46288176/article/details/116841593

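This post shows how to use Java and regular expressions to scrape movie titles, directors, cast, release dates, and ratings from Douban. It first uses a non-greedy regular expression to extract site names and URLs from a string. It then shows how to set an HTTP request header so that the request resembles one sent by a browser and is not blocked by the site. The code samples include a method for fetching page content and a regular expression, written for the Douban movie chart page, that matches the key movie fields. The author notes that in practice the regular expressions have to be adjusted to the page source.

(Summary generated automatically by CSDN.)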

Java: Extracting Douban Movie Information with Regular Expressions

Kept for my own records; use as a reference with caution.

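(1) Use a regular expression to extract site names and URLs from a web page. For example:

The input string is:

"<a href="http://www.265g.com/">265G游戏</a><a href="http://www.07073.com/">07073游戏</a><a href="http://zt.ztgame.com/url/hao.html">征途</a>"

The extracted result is: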

265G游戏：http://www.265g.com

07073游戏：http://www.07073.com

征途：http://zt.ztgame.com/url/hao.html

This only requires changing the regular expression in the instructor's code to non-greedy mode.

Regular expression:

String regex = "<a href=\"(.+?)\">(.+?)</a>"; // non-greedy mode
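As a minimal sketch of how this pattern behaves, the class below applies it to the assignment's input string. The class name `LinkExtractor` and the method `extract` are made up for illustration and are not part of the instructor's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LinkExtractor {
    // Same pattern as above: group 1 is the URL, group 2 is the link text.
    private static final Pattern LINK = Pattern.compile("<a href=\"(.+?)\">(.+?)</a>");

    // Returns one "name: url" line per <a> element found in the input.
    static List<String> extract(String html) {
        List<String> results = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            results.add(m.group(2) + ": " + m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://www.265g.com/\">265G游戏</a>"
                + "<a href=\"http://www.07073.com/\">07073游戏</a>"
                + "<a href=\"http://zt.ztgame.com/url/hao.html\">征途</a>";
        for (String line : extract(input)) {
            System.out.println(line);
        }
    }
}
```

With the greedy form `.+` instead of `.+?`, the first match would run on to the last `</a>` in the string, so the three links would come back as a single match. Note also that the captured URL keeps any trailing slash that is present in the `href`.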

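(2) Use regular expressions to scrape movie information from Douban. Only the movie title, director, cast, release date, and rating need to be scraped.

Reference code for reading a web page:

https://blog.csdn.net/dufufd/article/details/72781248

This reference code cannot be used as-is; a User-Agent request header has to be set first.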

The steps are:

Open Douban and press F12 to open the browser's developer tools.

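Click any request in the list. At the bottom of its Headers tab there is a User-Agent entry; copy its value. Then, in the reference code, after connection.setRequestMethod("GET");

add

connection.setRequestProperty("User-Agent", "(the value you just copied)");

This sets the User-Agent request header so that the request looks like it comes from a browser.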
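The sketch below isolates just this step. The class name `UserAgentDemo`, the method `open`, and the placeholder User-Agent value are assumptions for illustration. `openConnection()` only creates the connection object, so the example can be run without sending anything to Douban.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class, for illustration only.
class UserAgentDemo {
    // Prepares (but does not send) a GET request carrying the given User-Agent header.
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod("GET");
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder value (assumption); paste the string copied from the browser instead.
        String userAgent = "Mozilla/5.0 (example placeholder)";
        HttpURLConnection connection = open("https://movie.douban.com/chart", userAgent);
        // Nothing has been sent yet; the header is only stored on the connection object.
        System.out.println(connection.getRequestMethod());
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

The request is only sent once a method such as `getResponseCode()` or `getInputStream()` is called on the returned connection.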

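To scrape the comments or reviews of a specific movie, open that movie's page and repeat the same steps.

As a shortcut I set up two classes: one holds the User-Agent for the Douban movie chart page, the other the User-Agent for an individual movie page.

Code: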

package regular_expression;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Fetches the page at the given URL and returns its content as a single string.
// Uses the User-Agent copied from the Douban movie chart page.
public class GetString {
	private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36";

	public static String httpRequest(String url) {
		StringBuilder content = new StringBuilder();
		HttpURLConnection connection = null;
		try {
			connection = (HttpURLConnection) new URL(url).openConnection();
			connection.setRequestMethod("GET");
			connection.setRequestProperty("User-Agent", USER_AGENT);
			// Read the body only when the server answers 200 OK
			if (connection.getResponseCode() == HttpURLConnection.HTTP_OK) {
				try (BufferedReader reader = new BufferedReader(
						new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
					String line;
					// Lines are joined without separators, so the result contains no line breaks
					while ((line = reader.readLine()) != null) {
						content.append(line);
					}
				}
			}
		} catch (IOException e) {
			// Also covers MalformedURLException, which is a subclass of IOException
			e.printStackTrace();
		} finally {
			if (connection != null) {
				connection.disconnect();
			}
		}
		return content.toString();
	}
}

package regular_expression;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Fetches the page at the given URL and returns its content as a single string.
// Uses the User-Agent copied from an individual movie page.
public class GetString2 {
	private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36";

	public static String httpRequest(String url) {
		StringBuilder content = new StringBuilder();
		HttpURLConnection connection = null;
		try {
			connection = (HttpURLConnection) new URL(url).openConnection();
			connection.setRequestMethod("GET");
			connection.setRequestProperty("User-Agent", USER_AGENT);
			// Read the body only when the server answers 200 OK
			if (connection.getResponseCode() == HttpURLConnection.HTTP_OK) {
				try (BufferedReader reader = new BufferedReader(
						new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
					String line;
					// Lines are joined without separators, so the result contains no line breaks
					while ((line = reader.readLine()) != null) {
						content.append(line);
					}
				}
			}
		} catch (IOException e) {
			// Also covers MalformedURLException, which is a subclass of IOException
			e.printStackTrace();
		} finally {
			if (connection != null) {
				connection.disconnect();
			}
		}
		return content.toString();
	}
}

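These two classes differ only in the User-Agent, so the logic could also be written entirely in the Test class.
The regular expressions were written by looking for patterns in the page source, which you can view with Ctrl+U.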
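One detail worth keeping in mind when writing patterns against page source: in Java, `.` does not match line terminators by default. The fetch classes above avoid the problem by joining all lines into one string; compiling the pattern with `Pattern.DOTALL` is an alternative. The sketch below compares the options on a made-up fragment; `DotAllDemo` and `firstGroup` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class DotAllDemo {
    // Returns the first captured group, or null when the pattern does not match.
    static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up fragment with a line break inside the element.
        String html = "<p class=\"pl\">line one\nline two</p>";
        String regex = "<p class=.+?>(.+?)</p>";

        // Default mode: "." stops at the line break, so nothing matches.
        System.out.println(firstGroup(Pattern.compile(regex), html));
        // DOTALL mode: "." also matches line terminators.
        System.out.println(firstGroup(Pattern.compile(regex, Pattern.DOTALL), html));
        // Removing the line breaks first, as the fetch classes effectively do, also works.
        System.out.println(firstGroup(Pattern.compile(regex), html.replace("\n", "")));
    }
}
```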

package regular_expression;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

	public static void main(String[] args) {
		// Fetch the Douban movie chart page as a single string
		String content = GetString.httpRequest("https://movie.douban.com/chart");

		// Earlier attempt, kept for reference:
		//String regex = "<a href=\"(.+?)\"  class=\".+?\">(.+?)<span style=\"font-size:13px;\">(.+?)</span>.+?<p class=.+?>(.+?)</p>.+?<span class=\"rating_nums\">(.+?)</span>";
		// Groups: 1 = movie URL, 2 = title, 3 = alternate title, 4 = info line, 5 = rating
		String regex = "<a class=.+? href=\"(.+?)\"  title=\"(.+?)\">.+? <span style=\"font-size:13px;\">(.+?)</span>.+?<p class=.+?>(.+?)</p>.+?<span class=\"rating_nums\">(.+?)</span>";
		Pattern p = Pattern.compile(regex);

		//String regex1 = "<div class=\"short-content\">(.+?)</div>"; // matches full reviews, but the output is messy
		String regex1 = " <span class=\"short\">(.+?)</span>"; // matches short comments
		Pattern q = Pattern.compile(regex1);

		Matcher m = p.matcher(content);
		while (m.find()) {
			System.out.println("Title: " + m.group(2) + "//" + m.group(3) + " \nCast list: " + m.group(4) + "\nRating: " + m.group(5));

			// Only follow the link when the URL group is short enough to be a plausible movie URL
			if (m.group(1).length() < 80) {
				System.out.println("Movie link: " + m.group(1));

				// Try to scrape the comments from the movie's own page
				String tmp = GetString2.httpRequest(m.group(1));
				Matcher n = q.matcher(tmp);
				while (n.find()) {
					System.out.println("Comment: " + n.group(1));
				}

				System.out.println("\n=================== divider ==================\n");
			}
		}
	}
}
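To check the chart pattern without hitting the network, it can be run against a small hand-written fragment. The HTML in `SAMPLE` below is hypothetical: it only mimics the structure the pattern expects (including the two spaces between `href` and `title`), and the real Douban markup may differ or change. `ChartRegexDemo` and `firstEntry` are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ChartRegexDemo {
    // The chart pattern used in the Test class above.
    static final Pattern CHART = Pattern.compile(
            "<a class=.+? href=\"(.+?)\"  title=\"(.+?)\">.+? <span style=\"font-size:13px;\">(.+?)</span>"
            + ".+?<p class=.+?>(.+?)</p>.+?<span class=\"rating_nums\">(.+?)</span>");

    // Made-up fragment shaped like one chart entry; not copied from Douban.
    static final String SAMPLE =
            "<a class=\"nbg\" href=\"https://movie.douban.com/subject/0000000/\"  title=\"Example Movie\">"
            + "<img src=\"x.jpg\"/></a><div class=\"pl2\">"
            + "<a href=\"https://movie.douban.com/subject/0000000/\">Example Movie / "
            + "<span style=\"font-size:13px;\">Alt Title</span></a>"
            + "<p class=\"pl\">2021-01-01 / Actor A / Actor B</p>"
            + "<div class=\"star clearfix\"><span class=\"rating_nums\">8.5</span></div></div>";

    // Returns groups 1..5 of the first match, or null if nothing matches.
    static String[] firstEntry(String html) {
        Matcher m = CHART.matcher(html);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[5];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] entry = firstEntry(SAMPLE);
        System.out.println("URL:    " + entry[0]);
        System.out.println("Title:  " + entry[1] + " // " + entry[2]);
        System.out.println("Info:   " + entry[3]);
        System.out.println("Rating: " + entry[4]);
    }
}
```

Saving a copy of the real page source and feeding it to `firstEntry` in the same way makes it easier to adjust the pattern when it stops matching.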

[Screenshot of the program output]
