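Hi everyone! Last time we used regular expressions to scrape the Maoyan TOP100 board. If that left you wanting more, this time we will match the titles with XPath instead. To keep things simple, we only extract the TOP100 titles.
Approach:
1. First, request the Maoyan TOP100 page. As usual, set the headers so the request looks like it comes from a browser. Maoyan's anti-scraping measures are not strong; evidently they are fine with being scraped, otherwise the page would not be this easy to fetch. Wrap the request in a function: if the request succeeds, status_code is 200 and we return the page text. If it fails (not that it will, he whispers), we return None.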
import requests

def get_one_page(url):
    # Pretend to be a browser. The header name is 'User-Agent' (hyphen, not underscore).
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    # 200 means the request succeeded
    if response.status_code == 200:
        return response.text
    return None
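2. Looking at the first page, its URL is:
https://maoyan.com/board/4?offset=0
Click "next page" and you can see that the offset value in the URL has increased by 10. This is the key observation.
3. So we start with a page index of 0 (offset 0). After each page the index goes up by one, which raises offset by 10; we pause for one second and then move on to the next page.
After the first step, offset should be 10.
So we can build the URL like this: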
url = "https://maoyan.com/board/4?offset=" + str(offset)
4. When calling the main function, use range to change the value of offset; each value gives the corresponding page of the ranking.
The code is:
for i in range(10):
    main(offset=i * 10)
    time.sleep(1)
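The paging logic above can be checked without touching the network. The sketch below only builds the ten URLs; the helper name build_page_urls and its parameters are made up for illustration, and it assumes nothing beyond the offset pattern described above (ten movies per page, ten pages).

```python
BASE_URL = "https://maoyan.com/board/4?offset="


def build_page_urls(pages=10, page_size=10):
    """Return the URL of every board page, in order (hypothetical helper)."""
    return [BASE_URL + str(i * page_size) for i in range(pages)]


urls = build_page_urls()
print(urls[0])   # first page, offset 0
print(urls[-1])  # last page, offset 90
```

Printing the list before sending any request is a cheap way to confirm the offsets are right.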
5. Next comes the XPath matching. The HTML we get back for the first page (ranks 1 to 10) is:
<!DOCTYPE html>
<!--[if IE 8]><html class="ie8"><![endif]-->
<!--[if IE 9]><html class="ie9"><![endif]-->
<!--[if gt IE 9]><!--><html><!--<![endif]-->
<head>
<title>TOP100榜 - 猫眼电影 - 一网打尽好电影</title>
<link rel="dns-prefetch" href="//p0.meituan.net" />
<link rel="dns-prefetch" href="//p1.meituan.net" />
<link rel="dns-prefetch" href="//ms0.meituan.net" />
<link rel="dns-prefetch" href="//s0.meituan.net" />
<link rel="dns-prefetch" href="//ms1.meituan.net" />
<link rel="dns-prefetch" href="//analytics.meituan.com" />
<link rel="dns-prefetch" href="//report.meituan.com" />
<link rel="dns-prefetch" href="//frep.meituan.com" />
<meta charset="utf-8">
<meta name="keywords" content="猫眼电影,电影排行榜,热映口碑榜,最受期待榜,国内票房榜,北美票房榜,猫眼TOP100">
<meta name="description" content="猫眼电影热门榜单,包括热映口碑榜,最受期待榜,国内票房榜,北美票房榜,猫眼TOP100,多维度为用户进行选片决策">
<meta http-equiv="cleartype" content="yes" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="renderer" content="webkit" />
<meta name="HandheldFriendly" content="true" />
<meta name="format-detection" content="email=no" />
<meta name="format-detection" content="telephone=no" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<script>"use strict";!function(){var i=0<arguments.length&&void 0!==arguments[0]?arguments[0]:"_Owl_",n=window;n[i]||(n[i]={isRunning:!1,isReady:!1,preTasks:[],dataSet:[],use:function(i,t){this.isReady&&n.Owl&&n.Owl[i](t),this.preTasks.push({api:i,data:[t]})},add:function(i){this.dataSet.push(i)},run:function(){var t=this;if(!this.isRunning){this.isRunning=!0;var i=n.onerror;n.onerror=function(){this.isReady||this.add({type:"jsError",data:arguments}),i&&i.apply(n,arguments)}.bind(this),(n.addEventListener||n.attachEvent)("error",function(i){t.isReady||t.add({type:"resError",data:[i]})},!0)}}},n[i].run())}();</script>
<script>
cid = "c_wx6zb55";
ci = 361;
val = {"subnavId":4}; window.system = {};
window.openPlatform = '';
window.openPlatformSub = '';
window.$mtsiFlag = '0';
</script>
<link rel="stylesheet" href="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/common.d0f96cc2.css"/>
<link rel="stylesheet" href="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/board-index.92a06072.css"/>
<script crossorigin="anonymous" src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/stat.88d57c80.js"></script>
<script>if(window.devicePixelRatio >= 2) { document.write('<link rel="stylesheet" href="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image-2x.8ba7074d.css"/>') }</script>
<style>
@font-face {
font-family: stonefont;
src: url('//vfile.meituan.net/colorstone/d79b34d2b506c39cc8044b44b551caff3168.eot');
src: url('//vfile.meituan.net/colorstone/d79b34d2b506c39cc8044b44b551caff3168.eot?#iefix') format('embedded-opentype'),
url('//vfile.meituan.net/colorstone/b40bee9fdba2d56ec44668f4335b2bb52080.woff') format('woff');
}
.stonefont {
font-family: stonefont;
}
</style>
</head>
<body>
<div class="header">
<div class="header-inner">
<a href="/" class="logo" data-act="icon-click"></a>
<div class="city-container" data-val="{currentcityid:361 }">
<div class="city-selected">
<div class="city-name">
兰州
<span class="caret"></span>
</div>
</div>
<div class="city-list" data-val="{ localcityid: 361 }">
<div class="city-list-header">定位城市:<a class="js-geo-city">兰州</a></div>
</div>
</div>
<div class="nav">
<ul class="navbar">
<li><a href="/" data-act="home-click" >首页</a></li>
<li><a href="/films" data-act="movies-click" >电影</a></li>
<li><a href="/cinemas" data-act="cinemas-click" >影院</a></li>
<li><a href="http://www.gewara.com">演出</a></li>
<li><a href="/board" data-act="board-click" class="active" >榜单</a></li>
<li><a href="/news" data-act="hotNews-click" >热点</a></li>
<li><a href="/edimall" >商城</a></li>
</ul>
</div>
<div class="user-info">
<div class="user-avatar J-login">
<img src="https://p0.meituan.net/movie/7dd82a16316ab32c8359debdb04396ef2897.png">
<span class="caret"></span>
<ul class="user-menu">
<li><a href="javascript:void 0">登录</a></li>
</ul>
</div>
</div>
<form action="/query" target="_blank" class="search-form" data-actform="search-click">
<input name="kw" class="search" type="search" maxlength="32" placeholder="找影视剧、影人、影院" autocomplete="off">
<input class="submit" type="submit" value="">
</form>
<div class="app-download">
<a href="/app" target="_blank">
<span class="iphone-icon"></span>
<span class="apptext">APP下载</span>
<span class="caret"></span>
<div class="download-icon">
<p class="down-title">扫码下载APP</p>
<p class='down-content'>选座更优惠</p>
</div>
</a>
</div>
</div>
</div>
<div class="header-placeholder"></div>
<div class="subnav">
<ul class="navbar">
<li>
<a data-act="subnav-click" data-val="{subnavClick:7}"
href="/board/7"
>热映口碑榜</a>
</li>
<li>
<a data-act="subnav-click" data-val="{subnavClick:6}"
href="/board/6"
>最受期待榜</a>
</li>
<li>
<a data-act="subnav-click" data-val="{subnavClick:1}"
href="/board/1"
>国内票房榜</a>
</li>
<li>
<a data-act="subnav-click" data-val="{subnavClick:2}"
href="/board/2"
>北美票房榜</a>
</li>
<li>
<a data-act="subnav-click" data-val="{subnavClick:4}"
data-state-val="{subnavId:4}"
class="active" href="javascript:void(0);"
>TOP100榜</a>
</li>
</ul>
</div>
<div class="container" id="app" class="page-board/index" >
<div class="content">
<div class="wrapper">
<div class="main">
<p class="update-time">2019-06-09<span class="has-fresh-text">已更新</span></p>
<p class="board-content">榜单规则:将猫眼电影库中的经典影片,按照评分和评分人数从高到低综合排序取前100名,每天上午10点更新。相关数据来源于“猫眼电影库”。</p>
<dl class="board-wrapper">
<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c" alt="霸王别姬" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p>
<p class="star">
主演:张国荣,张丰毅,巩俐
</p>
<p class="releasetime">上映时间:1993-01-01</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-2">2</i>
<a href="/films/1297" title="肖申克的救赎" class="image-link" data-act="boarditem-click" data-val="{movieId:1297}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@160w_220h_1e_1c" alt="肖申克的救赎" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1297" title="肖申克的救赎" data-act="boarditem-click" data-val="{movieId:1297}">肖申克的救赎</a></p>
<p class="star">
主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿
</p>
<p class="releasetime">上映时间:1994-09-10(加拿大)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-3">3</i>
<a href="/films/2641" title="罗马假日" class="image-link" data-act="boarditem-click" data-val="{movieId:2641}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/289f98ceaa8a0ae737d3dc01cd05ab052213631.jpg@160w_220h_1e_1c" alt="罗马假日" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/2641" title="罗马假日" data-act="boarditem-click" data-val="{movieId:2641}">罗马假日</a></p>
<p class="star">
主演:格利高里·派克,奥黛丽·赫本,埃迪·艾伯特
</p>
<p class="releasetime">上映时间:1953-09-02(美国)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">1</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-4">4</i>
<a href="/films/4055" title="这个杀手不太冷" class="image-link" data-act="boarditem-click" data-val="{movieId:4055}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@160w_220h_1e_1c" alt="这个杀手不太冷" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/4055" title="这个杀手不太冷" data-act="boarditem-click" data-val="{movieId:4055}">这个杀手不太冷</a></p>
<p class="star">
主演:让·雷诺,加里·奥德曼,娜塔莉·波特曼
</p>
<p class="releasetime">上映时间:1994-09-14(法国)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-5">5</i>
<a href="/films/267" title="泰坦尼克号" class="image-link" data-act="boarditem-click" data-val="{movieId:267}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@160w_220h_1e_1c" alt="泰坦尼克号" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/267" title="泰坦尼克号" data-act="boarditem-click" data-val="{movieId:267}">泰坦尼克号</a></p>
<p class="star">
主演:莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩
</p>
<p class="releasetime">上映时间:1998-04-03</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-6">6</i>
<a href="/films/837" title="唐伯虎点秋香" class="image-link" data-act="boarditem-click" data-val="{movieId:837}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@160w_220h_1e_1c" alt="唐伯虎点秋香" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/837" title="唐伯虎点秋香" data-act="boarditem-click" data-val="{movieId:837}">唐伯虎点秋香</a></p>
<p class="star">
主演:周星驰,巩俐,郑佩佩
</p>
<p class="releasetime">上映时间:1993-07-01(中国香港)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">1</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-7">7</i>
<a href="/films/2760" title="魂断蓝桥" class="image-link" data-act="boarditem-click" data-val="{movieId:2760}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/46c29a8b8d8424bdda7715e6fd779c66235684.jpg@160w_220h_1e_1c" alt="魂断蓝桥" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/2760" title="魂断蓝桥" data-act="boarditem-click" data-val="{movieId:2760}">魂断蓝桥</a></p>
<p class="star">
主演:费雯·丽,罗伯特·泰勒,露塞尔·沃特森
</p>
<p class="releasetime">上映时间:1940-05-17(美国)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">2</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-8">8</i>
<a href="/films/7431" title="乱世佳人" class="image-link" data-act="boarditem-click" data-val="{movieId:7431}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/223c3e186db3ab4ea3bb14508c709400427933.jpg@160w_220h_1e_1c" alt="乱世佳人" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/7431" title="乱世佳人" data-act="boarditem-click" data-val="{movieId:7431}">乱世佳人</a></p>
<p class="star">
主演:费雯·丽,克拉克·盖博,奥利维娅·德哈维兰
</p>
<p class="releasetime">上映时间:1939-12-15(美国)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">1</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-9">9</i>
<a href="/films/1228" title="天空之城" class="image-link" data-act="boarditem-click" data-val="{movieId:1228}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p1.meituan.net/movie/ba1ed511668402605ed369350ab779d6319397.jpg@160w_220h_1e_1c" alt="天空之城" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1228" title="天空之城" data-act="boarditem-click" data-val="{movieId:1228}">天空之城</a></p>
<p class="star">
主演:寺田农,鹫尾真知子,龟山助清
</p>
<p class="releasetime">上映时间:1992</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">1</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-10">10</i>
<a href="/films/3667" title="辛德勒的名单" class="image-link" data-act="boarditem-click" data-val="{movieId:3667}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/b0d986a8bf89278afbb19f6abaef70f31206570.jpg@160w_220h_1e_1c" alt="辛德勒的名单" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/3667" title="辛德勒的名单" data-act="boarditem-click" data-val="{movieId:3667}">辛德勒的名单</a></p>
<p class="star">
主演:连姆·尼森,拉尔夫·费因斯,本·金斯利
</p>
<p class="releasetime">上映时间:1993-12-15(美国)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">2</i></p>
</div>
</div>
</div>
</dd>
</dl>
</div>
<div class="pager-main">
<ul class="list-pager">
<li class="active">
<a class="page_1"
href="javascript:void(0);" style="cursor: default"
>1</a>
</li>
<li >
<a class="page_2"
href="?offset=10"
>2</a>
</li>
<li >
<a class="page_3"
href="?offset=20"
>3</a>
</li>
<li >
<a class="page_4"
href="?offset=30"
>4</a>
</li>
<li >
<a class="page_5"
href="?offset=40"
>5</a>
</li>
<li class="sep">...</li>
<li >
<a class="page_10"
href="?offset=90"
>10</a>
</li>
<li> <a class="page_2"
href="?offset=10"
>下一页</a>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="footer">
<p class="friendly-links">
关于猫眼 :
<a href="http://ir.maoyan.com/s/index.php#pageScroll0" target="_blank">关于我们</a>
<span></span>
<a href="http://ir.maoyan.com/s/index.php#pageScroll1" target="_blank">管理团队</a>
<span></span>
<a href="http://ir.maoyan.com/s/index.php#pageScroll2" target="_blank">投资者关系</a>
友情链接 :
<a href="http://www.meituan.com" data-query="utm_source=wwwmaoyan" target="_blank">美团网</a>
<span></span>
<a href="http://www.gewara.com" data-query="utm_source=wwwmaoyan">格瓦拉</a>
<span></span>
<a href="http://i.meituan.com/client" data-query="utm_source=wwwmaoyan" target="_blank">美团下载</a>
<span></span>
<a href="https://www.huanxi.com" data-query="utm_source=maoyan_pc" target="_blank">欢喜首映</a>
</p>
<p class="friendly-links">
商务合作邮箱:v@maoyan.com
客服电话:10105335
违法和不良信息举报电话:4006018900
<br/>
投诉举报邮箱:tousujubao@meituan.com
舞弊线索举报邮箱:wubijubao@maoyan.com
</p>
<p>
©2016
猫眼电影 maoyan.com
<a href="https://tsm.miit.gov.cn/pages/EnterpriseSearchList_Portal.aspx?type=0&keyword=京ICP证160733号&pageNo=1" target="_blank">京ICP证160733号</a>
<a href="http://www.miibeian.gov.cn" target="_blank">京ICP备16022489号-1</a>
<a href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010102003232" target="_blank">京公网安备 11010102003232号</a>
<a href="/about/licence" target="_blank">网络文化经营许可证</a>
<a href="http://www.meituan.com/about/rules" target="_blank">电子公告服务规则</a>
</p>
<p>北京猫眼文化传媒有限公司</p>
</div>
<script crossorigin="anonymous" src="//www.dpfile.com/app/owl/static/owl_1.7.11.js"></script>
<script>
Owl.start({
project: 'com.sankuai.movie.fe.mywww',
pageUrl: location.href.split('?')[0].replace(/\/\d+/g, '/:id'),
devMode: false
})
</script>
<!--[if IE 8]><script crossorigin="anonymous" src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/es5-shim.bbad933f.js"></script><![endif]-->
<!--[if IE 8]><script crossorigin="anonymous" src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/es5-sham.d6ea26f4.js"></script><![endif]-->
<script crossorigin="anonymous" src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/common.96634b92.js"></script>
<script crossorigin="anonymous" src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/board-index.0cdf8e36.js"></script>
</body>
</html>
We start looking from the number one entry, 霸王别姬. The source around it is:
<div class="container" id="app" class="page-board/index" >
<div class="content">
<div class="wrapper">
<div class="main">
<p class="update-time">2019-06-09<span class="has-fresh-text">已更新</span></p>
<p class="board-content">榜单规则:将猫眼电影库中的经典影片,按照评分和评分人数从高到低综合排序取前100名,每天上午10点更新。相关数据来源于“猫眼电影库”。</p>
<dl class="board-wrapper">
<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c" alt="霸王别姬" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p>
<p class="star">
主演:张国荣,张丰毅,巩俐
</p>
<p class="releasetime">上映时间:1993-01-01</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
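We can see that, in the markup around the number one entry 霸王别姬, the outermost div under body is the only tag that carries an id attribute. In HTML an id value is supposed to be unique within a page, so that is where we anchor the match. (In the lines below, re is the tree returned by etree.HTML(html), as in the complete source further down, not the regular-expression module.) The first expression is: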
result = re.xpath('//*[@id="app"]/text()')
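This expression searches the whole HTML page and stops at the tag whose id is app. Looking further down, the content we want is wrapped in three nested div tags, so we extend the path: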
result = re.xpath('//*[@id="app"]/div/div/div/text()')
Under those div tags, a dl tag and a dd tag are nested in the same way, so next comes:
result = re.xpath('//*[@id="app"]/div/div/div/dl/dd/text()')
Continuing in the same way,
result = re.xpath('//*[@id="app"]/div/div/div/dl/dd/div/div/div/text()')
until we reach:
<p class="name"><a href="/films/1297" title="肖申克的救赎" data-act="boarditem-click" data-val="{movieId:1297}">肖申克的救赎</a></p>
The p tag here has sibling p tags, so strictly speaking we should write:
result = re.xpath('//*[@id="app"]/div/div/div/dl/dd/div/div/div/p[1]/a/text()')
However, only one of those p tags has an a tag under it, so the index can be dropped:
result = re.xpath('//*[@id="app"]/div/div/div/dl/dd/div/div/div/p/a/text()')
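To see that the two expressions really select the same thing, here is a minimal sketch that runs both against a hand-trimmed copy of the structure shown above (one dd only). The SNIPPET string and the variable names are illustrative assumptions; only etree.HTML and xpath come from the code in this post.

```python
from lxml import etree

# Hand-written, trimmed copy of the page structure (assumption: one <dd> only).
SNIPPET = """
<div id="app">
  <div class="content"><div class="wrapper"><div class="main">
    <dl class="board-wrapper">
      <dd>
        <div class="board-item-main"><div class="board-item-content">
          <div class="movie-item-info">
            <p class="name"><a href="/films/1203">霸王别姬</a></p>
            <p class="star">主演:张国荣,张丰毅,巩俐</p>
            <p class="releasetime">上映时间:1993-01-01</p>
          </div>
        </div></div>
      </dd>
    </dl>
  </div></div></div>
</div>
"""

tree = etree.HTML(SNIPPET)
prefix = '//*[@id="app"]/div/div/div/dl/dd/div/div/div'

# p[1] picks the first p explicitly; plain p visits every p,
# but only the first one has an <a> child, so the result is the same.
with_index = tree.xpath(prefix + '/p[1]/a/text()')
without_index = tree.xpath(prefix + '/p/a/text()')
print(with_index, without_index)
```

If a later page ever put a link inside the star or releasetime paragraph, the two expressions would diverge, so the shorter form relies on the markup staying as shown.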
6. Once the XPath expression does the job that the regular expression did last time, we are done.
The complete source code is:
import time

import requests
from lxml import etree


def get_one_page(url):
    # The header name is 'User-Agent' (hyphen, not underscore)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


def main(offset):
    url = "https://maoyan.com/board/4?offset=" + str(offset)
    html = get_one_page(url)
    if html is None:
        # Request failed: skip this page instead of passing None to etree.HTML
        return
    # Note: 're' here is the parsed lxml tree, not the regular-expression module
    re = etree.HTML(html)
    result = re.xpath('//*[@id="app"]/div/div/div/dl/dd/div/div/div/p/a/text()')
    print(result)


for i in range(10):
    main(offset=i * 10)
    time.sleep(1)
Running it prints:
['霸王别姬', '肖申克的救赎', '罗马假日', '这个杀手不太冷', '泰坦尼克号', '唐伯虎点秋香', '魂断蓝桥', '乱世佳人', '天空之城', '辛德勒的名单']
['喜剧之王', '音乐之声', '大闹天宫', '春光乍泄', '剪刀手爱德华', '美丽人生', '海上钢琴师', '黑客帝国', '指环王3:王者无敌', '哈利·波特与魔法石']
['加勒比海盗', '楚门的世界', '射雕英雄传之东成西就', '无间道', '教父2', '蝙蝠侠:黑暗骑士', '指环王1:护戒使者', '指环王2:双塔奇兵', '活着', '天堂电影院']
['狮子王', '机器人总动员', '拯救大兵瑞恩', '忠犬八公的故事', '哈尔的移动城堡', '疯狂原始人', '阿凡达', '盗梦空间', '幽灵公主', '东邪西毒']
['搏击俱乐部', '风之谷', 'V字仇杀队', '十二怒汉', '当幸福来敲门', '驯龙高手', '速度与激情5', '放牛班的春天', '勇敢的心', '闻香识女人']
['三傻大闹宝莱坞', '黑客帝国3:矩阵革命', '断背山', '神偷奶爸', '少年派的奇幻漂流', '飞屋环游记', '鬼子来了', '大话西游之月光宝盒', '怦然心动', '末代皇帝']
['致命魔术', '美丽心灵', '无敌破坏王', '倩女幽魂', '夜访吸血鬼', '蝙蝠侠:黑暗骑士崛起', '哈利·波特与死亡圣器(下)', '钢琴家', '本杰明·巴顿奇事', '甜蜜蜜']
['初恋这件小事', '触不可及', '新龙门客栈', '熔炉', '大话西游之大圣娶亲', '小鞋子', '教父', '素媛', '萤火之森', '穿条纹睡衣的男孩']
['窃听风暴', '时空恋旅人', '7号房的礼物', '恐怖直播', '海豚湾', '忠犬八公物语', '辩护人', '上帝之城', '美国往事', '七武士']
['完美的世界', '一一', '英雄本色', '爱·回家', '海洋', '我爱你', '黄金三镖客', '迁徙的鸟', '阿飞正传', '龙猫']
Process finished with exit code 0
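7. Closing thoughts: when I first wrote this code I made it far too complicated. I planned to use contains to match the relevant tags and find what they had in common, but that turned out to be too much trouble. I had not taken advantage of the id attribute, so I wasted quite a bit of time. Looking back, the most important thing is to think through the overall approach first rather than start typing blindly. That should save a good deal of time.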
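For comparison, the contains idea mentioned above might look something like the sketch below. It is not the approach used in this post. It assumes the title link always sits inside a p whose class contains "name", as in the HTML shown earlier, and the SAMPLE markup is hand-written for illustration.

```python
from lxml import etree

# Hand-written sample (assumption): two entries, each with a name and a star paragraph.
SAMPLE = """
<dl class="board-wrapper">
  <dd><p class="name"><a href="/films/1203">霸王别姬</a></p><p class="star">主演:张国荣</p></dd>
  <dd><p class="name"><a href="/films/1297">肖申克的救赎</a></p><p class="star">主演:蒂姆·罗宾斯</p></dd>
</dl>
"""

tree = etree.HTML(SAMPLE)
# Match on the class attribute instead of walking down from the id.
titles = tree.xpath('//p[contains(@class, "name")]/a/text()')
print(titles)
```

This version is shorter, but it depends on no other p on the page having "name" in its class, which is something to verify against the real markup before relying on it.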