Java: Extracting Douban Movie Information with Regular Expressions
Archived for my own use; refer to it with caution.
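(1) Use regular expressions to extract site names and URLs from a web page. For example:
The input string is:
<a href="http://www.265g.com/">265G游戏</a><a href="http://www.07073.com/">07073游戏</a><a href="http://zt.ztgame.com/url/hao.html">征途</a>
The extracted result is: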
265G游戏:http://www.265g.com
07073游戏:http://www.07073.com
征途:http://zt.ztgame.com/url/hao.html
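For this one, the regular expression in the teacher's code only needs to be switched to non-greedy mode:
Regular expression:
String regex = "<a href=\"(.+?)\">(.+?)</a>"; // non-greedy mode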
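To make the difference concrete, here is a minimal sketch that runs a greedy and a non-greedy version of the pattern over the sample input. The class name `LinkExtractor` and the ASCII site names (`265G`, `07073`, `Zhengtu`) are stand-ins of mine, not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Returns "name:url" for every link the given regex finds.
    // The regex is expected to capture the URL in group 1 and the link text in group 2.
    public static List<String> extract(String regex, String html) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            results.add(m.group(2) + ":" + m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        // Same shape as the sample input, with ASCII stand-ins for the site names.
        String html = "<a href=\"http://www.265g.com/\">265G</a>"
                + "<a href=\"http://www.07073.com/\">07073</a>"
                + "<a href=\"http://zt.ztgame.com/url/hao.html\">Zhengtu</a>";

        String greedy = "<a href=\"(.+)\">(.+)</a>";
        String nonGreedy = "<a href=\"(.+?)\">(.+?)</a>";

        // Greedy: the first group swallows everything up to the last "\">",
        // so all three links collapse into a single match.
        System.out.println("greedy matches: " + extract(greedy, html).size());

        // Non-greedy: each group stops at the first possible delimiter,
        // so every link is matched separately.
        for (String link : extract(nonGreedy, html)) {
            System.out.println(link);
        }
    }
}
```

Note that the captured URL keeps any trailing slash that appears in the `href`, so if the expected output must not have one, it would need to be stripped afterwards.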
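(2) Use regular expressions to scrape movie information from Douban. Only the movie title, director, cast, release date, and rating are required.
Reference code for reading a web page: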
https://blog.csdn.net/dufufd/article/details/72781248
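This reference code cannot be used as-is: the request also needs a User-Agent header.
The steps are:
Open Douban and press F12.
Click any request in the list and copy the User-Agent entry at the bottom of its Headers section. Then, in the reference code, after connection.setRequestMethod("GET");
add
connection.setRequestProperty("User-Agent", "(the value you just copied)");
This step sets the User-Agent request header, so the request identifies itself the way the browser does. I don't really understand the underlying mechanism.
To scrape a specific movie's comments, reviews, and so on, open that movie's page and repeat the same steps.
Out of laziness I made two classes: one holds the User-Agent for the Douban movie chart, the other the User-Agent for an individual movie page.
Code: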
package regular_expression;
import java.net.URL;
import java.net.MalformedURLException;
import java.net.HttpURLConnection;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
// Fetches the page at the given URL and returns its content as one string.
public class GetString {
    public static String httpRequest(String url) {
        StringBuilder content = new StringBuilder();
        // Local variable instead of a shared static field, so calls do not interfere.
        HttpURLConnection connection = null;
        try {
            URL u = new URL(url);
            connection = (HttpURLConnection) u.openConnection();
            connection.setRequestMethod("GET");
            // User-Agent copied from the browser's developer tools (movie chart page).
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36");
            int code = connection.getResponseCode();
            if (code == 200) {
                InputStream in = connection.getInputStream();
                InputStreamReader isr = new InputStreamReader(in, "UTF-8");
                // try-with-resources closes the reader and the underlying stream.
                try (BufferedReader reader = new BufferedReader(isr)) {
                    String line;
                    // Lines are joined without separators, so "." in the regexes
                    // never has to match a line break.
                    while ((line = reader.readLine()) != null) {
                        content.append(line);
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
        return content.toString();
    }
}
package regular_expression;
import java.net.URL;
import java.net.MalformedURLException;
import java.net.HttpURLConnection;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
// Fetches the page at the given URL and returns its content as one string.
public class GetString2 {
    public static String httpRequest(String url) {
        StringBuilder content = new StringBuilder();
        // Local variable instead of a shared static field, so calls do not interfere.
        HttpURLConnection connection = null;
        try {
            URL u = new URL(url);
            connection = (HttpURLConnection) u.openConnection();
            connection.setRequestMethod("GET");
            // User-Agent copied from the browser's developer tools (individual movie page).
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36");
            int code = connection.getResponseCode();
            if (code == 200) {
                InputStream in = connection.getInputStream();
                InputStreamReader isr = new InputStreamReader(in, "UTF-8");
                // try-with-resources closes the reader and the underlying stream.
                try (BufferedReader reader = new BufferedReader(isr)) {
                    String line;
                    // Lines are joined without separators, so "." in the regexes
                    // never has to match a line break.
                    while ((line = reader.readLine()) != null) {
                        content.append(line);
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
        return content.toString();
    }
}
These two classes differ only in the User-Agent string, so everything could just as well be written in the Test class.
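Since only the User-Agent differs, one way to avoid the duplication is to pass it in as a parameter. The following is a rough sketch, not the code used in this post: the class name `PageRequest`, the method names `open` and `fetch`, and the constants `CHART_UA` and `MOVIE_UA` are all names I made up for illustration. The two User-Agent values are the ones from the classes above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageRequest {
    // The two User-Agent strings from GetString and GetString2.
    public static final String CHART_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36";
    public static final String MOVIE_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36";

    // Prepares a GET request carrying the given User-Agent.
    // Nothing is sent here; the request goes out when the response is first read.
    public static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("GET");
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends the request and returns the body with all lines joined together,
    // or an empty string if the response code is not 200 or an I/O error occurs.
    public static String fetch(String url, String userAgent) {
        StringBuilder content = new StringBuilder();
        HttpURLConnection connection = open(url, userAgent);
        try {
            if (connection.getResponseCode() == 200) {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        connection.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        content.append(line);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            connection.disconnect();
        }
        return content.toString();
    }
}
```

With this, the chart page would be fetched with `PageRequest.fetch("https://movie.douban.com/chart", PageRequest.CHART_UA)` and a movie page with `PageRequest.MOVIE_UA` as the second argument.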
The regular expressions were written by looking for patterns in the page source, which you can view with Ctrl+U.
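One thing to keep in mind when comparing against the page source: the source has line breaks, while the fetch classes join all lines into one string. That matters because in Java regexes `.` does not match a line terminator unless `Pattern.DOTALL` is set. A small sketch of the difference, using a made-up `<p>` fragment and a hypothetical class name `DotallDemo`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    // Returns the text captured between <p> and </p>, or null if the pattern does not match.
    public static String firstParagraph(String html, int flags) {
        Matcher m = Pattern.compile("<p>(.+?)</p>", flags).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up fragment with a line break inside the paragraph.
        String html = "<p>2021-01-01 /\nActor A / Actor B</p>";

        // Default flags: "." stops at the line break, so nothing matches.
        System.out.println(firstParagraph(html, 0));

        // DOTALL: "." also matches the line break.
        System.out.println(firstParagraph(html, Pattern.DOTALL));

        // What the fetch classes effectively do: drop the line breaks first.
        System.out.println(firstParagraph(html.replace("\n", ""), 0));
    }
}
```

Joining the lines, as the fetch classes do, and compiling with `Pattern.DOTALL` are two ways to get the same effect; the post uses the first.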
package regular_expression;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
    public static void main(String[] args) {
        // httpRequest is static, so it is called on the class rather than on an instance.
        String content = GetString.httpRequest("https://movie.douban.com/chart");
        //String regex = "<a href=\"(.+?)\" class=\".+?\">(.+?)<span style=\"font-size:13px;\">(.+?)</span>.+?<p class=.+?>(.+?)</p>.+?<span class=\"rating_nums\">(.+?)</span>";
        // Groups: 1 = movie link, 2 and 3 = the two parts of the title, 4 = cast line, 5 = rating
        String regex = "<a class=.+? href=\"(.+?)\" title=\"(.+?)\">.+? <span style=\"font-size:13px;\">(.+?)</span>.+?<p class=.+?>(.+?)</p>.+?<span class=\"rating_nums\">(.+?)</span>";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        while (m.find()) {
            System.out.println("Title: " + m.group(2) + "//" + m.group(3) + " \nCast: " + m.group(4) + "\nRating: " + m.group(5));
            if (m.group(1).length() < 80) {
                System.out.println("Movie link: " + m.group(1));
                // Try to scrape the comments from the movie's own page
                String tmp = GetString2.httpRequest(m.group(1));
                // String regex1 = "<div class=\"short-content\">(.+?)</div>"; // this one grabs the reviews, but the output is a bit ugly
                String regex1 = " <span class=\"short\">(.+?)</span>"; // grabs the short comments
                Pattern q = Pattern.compile(regex1);
                Matcher n = q.matcher(tmp);
                while (n.find()) {
                    System.out.println("Comment: " + n.group(1));
                }
                System.out.println("\n===================separator==================\n");
            }
        }
    }
}
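To try the chart regex without hitting the network, it can be run against a hand-written fragment. The sketch below does that. The HTML built by `entry` is made up by me to fit the pattern and is not a copy of Douban's actual markup. The class name `ChartRegexDemo`, the `example.com` URLs, and the movie data are likewise placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChartRegexDemo {
    // The chart regex from the Test class, unchanged.
    static final String REGEX = "<a class=.+? href=\"(.+?)\" title=\"(.+?)\">.+? <span style=\"font-size:13px;\">(.+?)</span>.+?<p class=.+?>(.+?)</p>.+?<span class=\"rating_nums\">(.+?)</span>";

    // Hypothetical markup for one chart entry, on a single line
    // (the fetch classes join all lines before matching).
    static String entry(String url, String title, String alias, String info, String rating) {
        return "<a class=\"nbg\" href=\"" + url + "\" title=\"" + title + "\"><img src=\"poster.jpg\"/></a>"
                + "<div class=\"pl2\"><a href=\"" + url + "\">" + title + " / "
                + "<span style=\"font-size:13px;\">" + alias + "</span></a>"
                + "<p class=\"pl\">" + info + "</p>"
                + "<div class=\"star\"><span class=\"rating_nums\">" + rating + "</span></div></div>";
    }

    // Two made-up entries back to back.
    static final String SAMPLE =
            entry("https://example.com/subject/1/", "Sample Movie", "Alias One", "2021-01-01 / Actor A / Actor B", "8.5")
            + entry("https://example.com/subject/2/", "Another Movie", "Alias Two", "2020-05-05 / Actor C", "7.2");

    // Returns one array per match: {link, title, alias, info line, rating}.
    public static List<String[]> parse(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX).matcher(html);
        while (m.find()) {
            rows.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)});
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : parse(SAMPLE)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Because the pattern is tied so closely to the surrounding tags and attributes, any change in the real page layout would likely require adjusting it, which is why checking it against the current page source is worthwhile.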
Screenshot of a run: