[Data Collection] Learning XPath by Example

A1867350

Published 2024-04-18 14:26:07


Article link: https://blog.csdn.net/A1867350/article/details/137920333


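This article shows how to write a Python crawler with the requests and lxml libraries and XPath expressions: it collects the movie links from a movie list page, then parses the detail page behind each link and extracts key information such as the title, the image, and the detailed description.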


urls.append(url)

Once we have the URLs of all the list pages, we can parse the movie links on each page. Here we only take the first page.
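The page URLs are produced by formatting each page number into the list-page URL template. A minimal sketch of that step, assuming the XPath query for the last page number returned a list such as `['244']` (the value quoted in this article's source); the helper name `build_page_urls` and its empty-result fallback are hypothetical and not part of the original code:

```python
BASE_URL = "https://dytt8.net/html/gndy/dyzz/list_23_{0}.html"


def build_page_urls(page_number_strs, base_url=BASE_URL):
    """Build every list-page URL from an XPath result such as ['244']."""
    if not page_number_strs:
        # Assumption: if the pager was not found, fall back to page 1 only.
        return [base_url.format(1)]
    last_page = int(page_number_strs[0].strip())
    return [base_url.format(n) for n in range(1, last_page + 1)]


urls = build_page_urls(["244"])
print(len(urls))  # 244
print(urls[0])
```

Guarding against an empty result avoids an `IndexError` if the site's pager markup ever changes.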

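Next, define a function that takes the URL of a movie list page and returns the list of movie links on that page. An XPath expression locates all the movie links in the page's HTML.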

def get_movie_href_page_url(url):
    html_str = download_html(url)
    html_content = etree.HTML(html_str)
    # This XPath expression selects the links of all movies on the current page
    movie_second_hrefs = html_content.xpath("//table[@class='tbspan']//a/@href")
    return movie_second_hrefs
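To see what the XPath expression selects without hitting the network, it can be run against a small inline document. This is only a sketch: the HTML below is a made-up imitation of the list-page layout, not a copy of the real page.

```python
from lxml import etree

# Hypothetical HTML that imitates the list-page structure.
SAMPLE_HTML = """
<html><body>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/example_1.html">Movie A</a></td></tr></table>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/example_2.html">Movie B</a></td></tr></table>
  <table class="other"><tr><td><a href="/ads.html">Ad</a></td></tr></table>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)
# Every <a> nested anywhere inside a <table class="tbspan">, then its href attribute.
hrefs = tree.xpath("//table[@class='tbspan']//a/@href")
print(hrefs)
```

The link inside the table with `class="other"` is not selected, because the `[@class='tbspan']` predicate filters the tables before `//a/@href` is applied.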


Prepend the domain to get the full URLs:

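for url in urls[0:1]:  # slice: only fetch the first page
    # Get the list of movie hrefs on the current page from its URL
    movie_second_hrefs = get_movie_href_page_url(url)
    # The hrefs are relative, so prepend the site's domain
    movie_second_hrefs = ["https://dytt8.net" + i for i in movie_second_hrefs]
    # print(movie_second_hrefs)  # print the movie links on the current page
    for movie_href in movie_second_hrefs[0:1]:  # slice: only the first movie link on the page
        # This function parses the movie detail page
        get_movie_detail(movie_href)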
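String concatenation works when every href is a site-relative path. As an alternative (not what the original code does), the standard library's `urllib.parse.urljoin` also copes with hrefs that are already absolute. A small sketch with made-up paths:

```python
from urllib.parse import urljoin

BASE = "https://dytt8.net"

# Hypothetical hrefs: one site-relative, one already absolute.
raw_hrefs = [
    "/html/gndy/dyzz/example_1.html",
    "https://dytt8.net/html/gndy/dyzz/example_2.html",
]

full_urls = [urljoin(BASE, href) for href in raw_hrefs]
for full_url in full_urls:
    print(full_url)
```

With plain concatenation the second entry would become `https://dytt8.nethttps://dytt8.net/...`, while `urljoin` leaves it unchanged.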

Finally, define a function that parses the information on the detail page of the movie Licorice Pizza (《甘草披萨》).

def get_movie_detail(href):
    html_str = download_html(href)
    html_content = etree.HTML(html_str)
    movie = {}  # each movie's information is stored in a dict (to be saved to a database or CSV later)
    title = html_content.xpath("//div[@class='title_all']//font/text()")  # movie title
    movie['title:'] = title[0] if title else ""
    image_src = html_content.xpath("//div[@id='Zoom']//img/@src")  # promotional image
    movie['image_src:'] = image_src[0] if image_src else ""
    movie_text = html_content.xpath("//div[@id='Zoom']//text()")  # movie details
    print(movie_text)
    # Pick out the required fields from the detail text
    for info in movie_text:
        # Note: each branch captures a field in full only when its value sits on a single line
        if info.startswith("◎译　　名"):
            s = info.replace("◎译　　名", "").strip()
            movie['transfer_name:'] = s
        elif info.startswith("◎片　　名"):
            s = info.replace("◎片　　名", "").strip()
            movie['title_name:'] = s
        elif info.startswith("◎年　　代"):
            s = info.replace("◎年　　代", "").strip()
            movie['age:'] = s
        elif info.startswith("◎产　　地"):
            s = info.replace("◎产　　地", "").strip()
            movie['producing_area:'] = s
        elif info.startswith("◎类　　别"):
            s = info.replace("◎类　　别", "").strip()
            movie['sort:'] = s
        elif info.startswith("◎语　　言"):
            s = info.replace("◎语　　言", "").strip()
            movie['language:'] = s
        elif info.startswith("◎上映日期"):
            s = info.replace("◎上映日期", "").strip()
            movie['release_date:'] = s
        elif info.startswith("◎豆瓣评分"):
            s = info.replace("◎豆瓣评分", "").strip()
            movie['rating:'] = s
        elif info.startswith("◎片　　长"):
            s = info.replace("◎片　　长", "").strip()
            movie['film_length:'] = s
        elif info.startswith("◎导　　演"):
            s = info.replace("◎导　　演", "").strip()
            movie['director:'] = s
        elif info.startswith("◎编　　剧"):
            s = info.replace("◎编　　剧", "").strip()
            movie['scriptwriter:'] = s
        elif info.startswith("◎主　　演"):
            s = info.replace("◎主　　演", "").strip()
            movie['actor:'] = s
        elif info.startswith("　　这部影片"):
            s = info.replace("　　", "").strip()
            movie['introductory:'] = s
    for key, value in movie.items():
        print(key, value)
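The long `if`/`elif` chain repeats the same three steps for every label, so the same idea can be expressed as a lookup table. The sketch below is an alternative formulation, not the article's code: `FIELD_LABELS`, `parse_fields`, the colon-free keys, and the sample lines are all hypothetical, and it assumes the labels contain two full-width spaces (U+3000) as they appear in this article.

```python
# Label text on the page -> key in the result dict.
# \u3000 is the full-width (ideographic) space used inside the labels.
FIELD_LABELS = {
    "◎译\u3000\u3000名": "transfer_name",
    "◎片\u3000\u3000名": "title_name",
    "◎年\u3000\u3000代": "age",
    "◎导\u3000\u3000演": "director",
}


def parse_fields(lines, labels=FIELD_LABELS):
    """Return a dict of the labelled fields found in the given text lines."""
    movie = {}
    for line in lines:
        for label, key in labels.items():
            if line.startswith(label):
                # Slice off the label only; str.strip() also removes U+3000.
                movie[key] = line[len(label):].strip()
                break
    return movie


# Made-up sample lines in the same shape as the detail-page text.
sample_lines = [
    "◎译\u3000\u3000名\u3000Example Movie",
    "◎年\u3000\u3000代\u30002021",
    "some unrelated text",
]
result = parse_fields(sample_lines)
print(result)
```

Adding a new field then means adding one dictionary entry instead of another branch. Like the original, this only captures values that sit on the same line as their label.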

Source code

from lxml import etree
import requests

base_url = "https://dytt8.net/html/gndy/dyzz/list_23_{0}.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
}


# Download a page and return its HTML as a string
def download_html(url):
    response = requests.get(url, headers=headers)
    # Decode the response body as GBK
    html_str = response.content.decode("gbk")
    return html_str

# Get the list of movie links from the URL of a movie list page
def get_movie_href_page_url(url):
    html_str = download_html(url)
    html_content = etree.HTML(html_str)
    movie_second_hrefs = html_content.xpath("//table[@class='tbspan']//a/@href")
    return movie_second_hrefs

# Extract the detailed information from the URL of a movie's page
def get_movie_detail(href):
    html_str = download_html(href)
    html_content = etree.HTML(html_str)
    movie = {}  # each movie's information is stored in a dict (to be saved to a database or CSV later)
    title = html_content.xpath("//div[@class='title_all']//font/text()")
    movie['title:'] = title[0] if title else ""
    image_src = html_content.xpath("//div[@id='Zoom']//img/@src")
    movie['image_src:'] = image_src[0] if image_src else ""
    movie_text = html_content.xpath("//div[@id='Zoom']//text()")
    print(movie_text)
    for info in movie_text:
        if info.startswith("◎译　　名"):
            s = info.replace("◎译　　名", "").strip()
            movie['transfer_name:'] = s
        elif info.startswith("◎片　　名"):
            s = info.replace("◎片　　名", "").strip()
            movie['title_name:'] = s
        elif info.startswith("◎年　　代"):
            s = info.replace("◎年　　代", "").strip()
            movie['age:'] = s
        elif info.startswith("◎产　　地"):
            s = info.replace("◎产　　地", "").strip()
            movie['producing_area:'] = s
        elif info.startswith("◎类　　别"):
            s = info.replace("◎类　　别", "").strip()
            movie['sort:'] = s
        elif info.startswith("◎语　　言"):
            s = info.replace("◎语　　言", "").strip()
            movie['language:'] = s
        elif info.startswith("◎上映日期"):
            s = info.replace("◎上映日期", "").strip()
            movie['release_date:'] = s
        elif info.startswith("◎豆瓣评分"):
            s = info.replace("◎豆瓣评分", "").strip()
            movie['rating:'] = s
        elif info.startswith("◎片　　长"):
            s = info.replace("◎片　　长", "").strip()
            movie['film_length:'] = s
        elif info.startswith("◎导　　演"):
            s = info.replace("◎导　　演", "").strip()
            movie['director:'] = s
        elif info.startswith("◎编　　剧"):
            s = info.replace("◎编　　剧", "").strip()
            movie['scriptwriter:'] = s
        elif info.startswith("◎主　　演"):
            s = info.replace("◎主　　演", "").strip()
            movie['actor:'] = s
        elif info.startswith("　　这部影片"):
            s = info.replace("　　", "").strip()
            movie['introductory:'] = s
    for key, value in movie.items():
        print(key, value)
    # print(movie)

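# Start reading the program from main()
def main():
    first_url = base_url.format("1")
    first_html_content = download_html(first_url)
    # etree.HTML converts the HTML document into an object that can evaluate XPath expressions
    first_content = etree.HTML(first_html_content)
    # Get the largest page number
    page_number_str = first_content.xpath("//select[@name='sldd']/option[last()]/text()")
    # The result is a list such as ['244'], holding the page number as a string
    # Build the URLs of all list pages
    urls = []
    for page_number in range(1, int(page_number_str[0]) + 1):
        url = base_url.format(page_number)
        urls.append(url)
    # With all page URLs collected, parse the movie links on each page
    for url in urls[0:1]:  # slice: only fetch the first page
        # Loop body as in the walkthrough earlier in this article
        movie_second_hrefs = get_movie_href_page_url(url)
        movie_second_hrefs = ["https://dytt8.net" + i for i in movie_second_hrefs]
        for movie_href in movie_second_hrefs[0:1]:  # slice: only the first movie link on the page
            get_movie_detail(movie_href)


if __name__ == "__main__":
    main()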
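One fragile spot in the listing is `response.content.decode("gbk")`: a strict decode raises `UnicodeDecodeError` as soon as a page contains a byte sequence that is not valid GBK. A minimal sketch of a more tolerant decode, using only built-in `bytes.decode` options; the helper name `decode_html` and the sample bytes are hypothetical:

```python
def decode_html(raw_bytes, encoding="gbk"):
    """Decode page bytes, substituting U+FFFD for anything undecodable."""
    return raw_bytes.decode(encoding, errors="replace")


good = "电影天堂".encode("gbk")
bad = good + b"\xff"  # 0xFF is not a valid GBK lead byte

try:
    bad.decode("gbk")  # strict decoding, as in the listing
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_html(bad)
print(strict_failed)
print(text.startswith("电影天堂"))
```

Whether replacing bad bytes is acceptable depends on the use case. For scraping display text it usually loses very little, but it can hide a genuinely wrong encoding guess.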