[Data Collection] XPath by Example (1)

2401_84412689

Published 2024-04-20 02:08:25


Article link: https://blog.csdn.net/2401_84412689/article/details/137983476


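First, we use an XPath expression to get the URLs of all pages of the latest-movies listing.

Right-click the page and choose Inspect. After analyzing the markup, the XPath expression returns the list ['244']: a list that holds the last page number as a string.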


first_url = base_url.format("1")
first_html_content = download_html(first_url)
# etree.HTML turns the HTML document into an object that can evaluate XPath expressions
first_content = etree.HTML(first_html_content)

# Get the largest page number
page_number_str = first_content.xpath("//select[@name='sldd']/option[last()]/text()")
# The result is the list ['244']: the page number as a string inside a list

# Build the URL of every page
urls = []
for page_number in range(1, int(page_number_str[0]) + 1):
    url = base_url.format(page_number)
    urls.append(url)
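The page-count step can be tried offline. The following is a minimal sketch rather than the article's code: it swaps lxml for the standard library's xml.etree.ElementTree (which supports a small subset of XPath, including last()) so that it runs without third-party packages. The markup, the example.com URL template, and the helper name build_page_urls are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the pager markup on the list page.
SAMPLE = """
<div>
  <select name="sldd">
    <option value="list_23_1.html">1</option>
    <option value="list_23_2.html">2</option>
    <option value="list_23_3.html">3</option>
  </select>
</div>
"""


def build_page_urls(base_url, last_page):
    """Return one URL per page, numbered from 1 to last_page inclusive."""
    return [base_url.format(n) for n in range(1, last_page + 1)]


root = ET.fromstring(SAMPLE)
# [last()] keeps only the final <option> inside the matching <select>.
last_option = root.find(".//select[@name='sldd']/option[last()]")
# Guard against a missing pager instead of indexing blindly.
last_page = int(last_option.text) if last_option is not None else 1

urls = build_page_urls("https://example.com/list_23_{0}.html", last_page)
print(urls)
```

ElementTree has no text() step, so the sketch reads the element's .text attribute; with lxml, the article's expression ending in /text() returns the string directly inside a list.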

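With the URLs of all pages in hand, we can parse the movie hyperlinks on each page. Here we only process the first page.

Next, define a function that takes the URL of a movie-list page and returns the list of movie hyperlinks on it. An XPath expression locates every movie hyperlink on the current page.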

def get_movie_href_page_url(url):
    html_str = download_html(url)
    html_content = etree.HTML(html_str)
    # This XPath expression locates every movie hyperlink on the current page
    movie_second_hrefs = html_content.xpath("//table[@class='tbspan']//a/@href")
    return movie_second_hrefs
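To see which links that expression keeps, here is a small offline sketch. It is an illustration under assumptions, not the article's code: it uses the standard library's xml.etree.ElementTree in place of lxml, and the markup and href values are invented. ElementTree cannot select attributes with /@href, so the sketch selects the a elements and reads the attribute with .get().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a movie-list page.
SAMPLE = """
<div>
  <a href="/index.html">Home</a>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/1.html">Movie A</a></td></tr></table>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/2.html">Movie B</a></td></tr></table>
</div>
"""

root = ET.fromstring(SAMPLE)
# Only <a> elements nested inside a table with class="tbspan" match;
# the navigation link outside the tables is skipped.
links = root.findall(".//table[@class='tbspan']//a")
hrefs = [a.get("href") for a in links]
print(hrefs)
```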


Prepend the domain to get complete URLs:

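for url in urls[0:1]:  # slice: fetch only the first page
    # Get the list of movie hrefs on the current page from its URL
    movie_second_hrefs = get_movie_href_page_url(url)
    # Each href must be prefixed with the site's domain
    movie_second_hrefs = ["https://dytt8.net" + i for i in movie_second_hrefs]
    # print(movie_second_hrefs)  # print the movie hyperlinks on the current page
    for movie_href in movie_second_hrefs[0:1]:  # slice: take only the first movie hyperlink on the page
        # This function parses the movie detail page
        get_movie_detail(movie_href)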
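String concatenation works when every href is a site-relative path. As an alternative sketch (not what the article does), the standard library's urllib.parse.urljoin resolves an href against the domain and leaves already-absolute URLs untouched. The helper name to_absolute and the sample hrefs are assumptions for illustration.

```python
from urllib.parse import urljoin

DOMAIN = "https://dytt8.net"  # domain taken from the article


def to_absolute(hrefs, domain=DOMAIN):
    """Resolve each href against the domain; absolute URLs pass through unchanged."""
    return [urljoin(domain, href) for href in hrefs]


# The hrefs below are made-up examples of what a list page might return.
absolute = to_absolute(["/html/gndy/dyzz/1.html", "https://example.com/x.html"])
print(absolute)
```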

Finally, define a function that parses the detail page of the movie 《甘草披萨》 (Licorice Pizza).

def get_movie_detail(href):
    html_str = download_html(href)
    html_content = etree.HTML(html_str)
    # Each movie's info is kept in a dict (later it can be saved to a database or a CSV file)
    movie = {}
    title = html_content.xpath("//div[@class='title_all']//font/text()")  # movie title
    movie['title:'] = title[0] if title else ""
    image_src = html_content.xpath("//div[@id='Zoom']//img/@src")  # promotional image
    movie['image_src:'] = image_src[0] if image_src else ""
    movie_text = html_content.xpath("//div[@id='Zoom']//text()")  # detail text
    print(movie_text)
    # Pick the wanted fields out of the detail text with a chain of conditions
    for info in movie_text:
        # Note: each branch only captures a value fully when it sits on a single line
        if info.startswith("◎译　　名"):
            s = info.replace("◎译　　名", "").strip()
            movie['transfer_name:'] = s
        elif info.startswith("◎片　　名"):
            s = info.replace("◎片　　名", "").strip()
            movie['title_name:'] = s
        elif info.startswith("◎年　　代"):
            s = info.replace("◎年　　代", "").strip()
            movie['age:'] = s
        elif info.startswith("◎产　　地"):
            s = info.replace("◎产　　地", "").strip()
            movie['producing_area:'] = s
        elif info.startswith("◎类　　别"):
            s = info.replace("◎类　　别", "").strip()
            movie['sort:'] = s
        elif info.startswith("◎语　　言"):
            s = info.replace("◎语　　言", "").strip()
            movie['language:'] = s
        elif info.startswith("◎上映日期"):
            s = info.replace("◎上映日期", "").strip()
            movie['release_date:'] = s
        elif info.startswith("◎豆瓣评分"):
            s = info.replace("◎豆瓣评分", "").strip()
            movie['rating:'] = s
        elif info.startswith("◎片　　长"):
            s = info.replace("◎片　　长", "").strip()
            movie['film_length:'] = s
        elif info.startswith("◎导　　演"):
            s = info.replace("◎导　　演", "").strip()
            movie['director:'] = s
        elif info.startswith("◎编　　剧"):
            s = info.replace("◎编　　剧", "").strip()
            movie['scriptwriter:'] = s
        elif info.startswith("◎主　　演"):
            s = info.replace("◎主　　演", "").strip()
            movie['actor:'] = s
        elif info.startswith("　　这部影片"):
            s = info.replace("　　", "").strip()
            movie['introductory:'] = s
    for key, value in movie.items():
        print(key, value)
    return movie
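The note in the loop says each branch only captures a value that fits on one line, so a cast list spread over several text nodes would lose everything after the first name. One possible way around that, sketched here under assumptions, is a prefix table plus a "current field" marker, so unlabelled lines extend the previous field. The helper name parse_fields, the reduced prefix table, and the sample lines are hypothetical, and the real page layout may differ.

```python
# Label prefixes mapped to the dict keys used in the article.
# \u3000 is the full-width space that pads the two-character labels.
FIELD_PREFIXES = {
    "◎译\u3000\u3000名": "transfer_name:",
    "◎片\u3000\u3000名": "title_name:",
    "◎导\u3000\u3000演": "director:",
    "◎主\u3000\u3000演": "actor:",
}


def parse_fields(lines):
    """Collect labelled values; unlabelled lines extend the previous known field."""
    movie = {}
    current_key = None
    for raw in lines:
        text = raw.strip()  # str.strip() also removes full-width spaces
        if not text:
            continue
        if text.startswith("◎"):
            # A new label ends the previous field, even if the label is unknown.
            current_key = None
            for prefix, key in FIELD_PREFIXES.items():
                if text.startswith(prefix):
                    movie[key] = [text[len(prefix):].strip()]
                    current_key = key
                    break
        elif current_key is not None:
            movie[current_key].append(text)
    return movie


# Made-up text nodes shaped like the detail page's output.
sample = [
    "◎片\u3000\u3000名\u3000Example Title",
    "◎主\u3000\u3000演\u3000Actor A",
    "\u3000\u3000\u3000\u3000\u3000\u3000Actor B",
    "◎简\u3000\u3000介",
    "\u3000\u3000Synopsis text that is skipped in this sketch.",
]
movie = parse_fields(sample)
print(movie)
```

Each value is a list here, so multi-line fields keep every entry. Unlabelled text that follows the last known field, such as download links, would still be appended to it, so a real parser would need a stop condition.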

Source code

from lxml import etree
import requests

base_url = "https://dytt8.net/html/gndy/dyzz/list_23_{0}.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
}


# Functions
def download_html(url):
    response = requests.get(url, headers=headers)
    # Decode the response body as GBK
    html_str = response.content.decode("gbk")
    return html_str

# Get the list of movie hyperlinks on a movie-list page from its URL
def get_movie_href_page_url(url):
    html_str = download_html(url)
    html_content = etree.HTML(html_str)
    movie_second_hrefs = html_content.xpath("//table[@class='tbspan']//a/@href")
    return movie_second_hrefs


# Extract the detailed information from each movie's page URL
def get_movie_detail(href):
    html_str = download_html(href)
    html_content = etree.HTML(html_str)
    # Each movie's info is kept in a dict (later it can be saved to a database or a CSV file)
    movie = {}
    title = html_content.xpath("//div[@class='title_all']//font/text()")
    movie['title:'] = title[0] if title else ""
    image_src = html_content.xpath("//div[@id='Zoom']//img/@src")
    movie['image_src:'] = image_src[0] if image_src else ""
    movie_text = html_content.xpath("//div[@id='Zoom']//text()")
    print(movie_text)
    for info in movie_text:
        if info.startswith("◎译　　名"):
            s = info.replace("◎译　　名", "").strip()
            movie['transfer_name:'] = s
        elif info.startswith("◎片　　名"):
            s = info.replace("◎片　　名", "").strip()
            movie['title_name:'] = s
        elif info.startswith("◎年　　代"):
            s = info.replace("◎年　　代", "").strip()
            movie['age:'] = s
        elif info.startswith("◎产　　地"):
            s = info.replace("◎产　　地", "").strip()
            movie['producing_area:'] = s
        elif info.startswith("◎类　　别"):
            s = info.replace("◎类　　别", "").strip()
            movie['sort:'] = s
        elif info.startswith("◎语　　言"):
            s = info.replace("◎语　　言", "").strip()
            movie['language:'] = s
        elif info.startswith("◎上映日期"):
            s = info.replace("◎上映日期", "").strip()
            movie['release_date:'] = s
        elif info.startswith("◎豆瓣评分"):
            s = info.replace("◎豆瓣评分", "").strip()
            movie['rating:'] = s
        elif info.startswith("◎片　　长"):
            s = info.replace("◎片　　长", "").strip()
            movie['film_length:'] = s
        elif info.startswith("◎导　　演"):
            s = info.replace("◎导　　演", "").strip()
            movie['director:'] = s
        elif info.startswith("◎编　　剧"):
            s = info.replace("◎编　　剧", "").strip()
            movie['scriptwriter:'] = s
        elif info.startswith("◎主　　演"):
            s = info.replace("◎主　　演", "").strip()
            movie['actor:'] = s
        elif info.startswith("　　这部影片"):
            s = info.replace("　　", "").strip()
            movie['introductory:'] = s
    # Print the collected fields, as in the walkthrough above
    for key, value in movie.items():
        print(key, value)
    return movie
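download_html decodes the response bytes as GBK explicitly. The sketch below illustrates why the codec matters and one way to make decoding tolerant of stray bytes. The helper name decode_html and the choice of errors="replace" are assumptions, not part of the article's code, and the bytes are produced locally instead of being fetched.

```python
def decode_html(raw, encoding="gbk"):
    """Decode response bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


raw = "最新电影".encode("gbk")  # stand-in for response.content
text = decode_html(raw)
print(text)

# The same bytes are not valid UTF-8, so the wrong codec raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 failed:", exc.reason)
```

With errors="replace", a single bad byte shows up as a replacement character in the text rather than aborting the whole download, which may or may not be acceptable depending on how the data is used.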
