Learning Python: Scraping the Douban Top 250 Movies with XPath and Saving Them to a Local Excel File

Latest recommended article published on 2023-03-09 13:00:30

bimbamboun

Tags: python xpath

Article link: https://blog.csdn.net/bimbamboun/article/details/105644924

Libraries

requests
lxml
csv (part of the Python standard library)

Installing the libraries

1. pip install **** (replace **** with the library name)
2. In PyCharm, open File -> Settings and click the + button to add the library.

Scraping web page data

1. The requests library

Requests is a simple, easy-to-use HTTP library written in Python.

1.1 Basic operations

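    import requests

    r = requests.get("http://www.baidu.com/")  # request the page
    print(r.status_code)  # HTTP status code: 200 means success, 404 means the page was not found
    print(type(r))  # <class 'requests.models.Response'>
    print(r.headers)  # response headers
    print(r.encoding)  # encoding guessed from the response headers
    print(r.apparent_encoding)  # encoding guessed from the response content
    r.encoding = r.apparent_encoding  # decode the body using the content-based guess
    print(r.text)  # response body as a string, i.e. the content of the page at the URL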

	requests
Fetch a URL	request = requests.get("http://www…com")
Get the status code	request.status_code
Get the HTML as text	request.text
Get the response headers	request.headers
Get the requested URL	request.url

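requests.get("url") # GET request: retrieve the resource at url
requests.post("url") # POST request: append new data to the resource at url
requests.put("url") # PUT request: store a resource at url, overwriting the existing one
requests.delete("url") # DELETE request: delete the resource stored at url
requests.patch("url") # PATCH request: partially update the resource at url, changing only part of its content; this saves network bandwidth
requests.head("url") # HEAD request: retrieve only the response headers of the resource at url
requests.options("url") # OPTIONS request: ask which methods and options the resource at url supports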
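
As a minimal sketch of how these helpers map onto HTTP methods, the snippet below builds each request with `requests.Request(...).prepare()` instead of sending it, so it runs without network access. The URL `http://example.com/resource`, the form payload, and the helper name `prepared_methods` are placeholders chosen for illustration, not taken from the original post.

```python
import requests

# Placeholder URL for illustration only (an assumption, not from the article).
URL = "http://example.com/resource"


def prepared_methods(url):
    """Build (but do not send) one request per HTTP verb and return the verbs."""
    verbs = ["GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS"]
    prepared = [requests.Request(verb, url).prepare() for verb in verbs]
    return [p.method for p in prepared]


# A POST carrying form data: the dict is URL-encoded while the request is prepared.
post = requests.Request("POST", URL, data={"key": "value"}).prepare()

print(prepared_methods(URL))
print(post.method, post.url, post.body)
```

Preparing a request this way is handy for checking the method, URL, headers, and body before anything goes over the wire; `requests.get`, `requests.post`, and the other helpers build and send the same kind of request in one step.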

1.2 A general framework for fetching web pages

import requests

def getHTMLText(url):
    try:
        # For sites that reject the default client, send a browser-like User-Agent header:
        # kv = {'User-Agent': 'Mozilla/5.0'}
        # r = requests.get(url, headers=kv)
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status code is 4xx or 5xx
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
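
To see what `raise_for_status()` contributes to this framework without touching the network, one possible sketch is to build a `requests.Response` by hand and set its status code. The helper name `describe_status` is hypothetical, and constructing a `Response` manually is only a testing convenience, not how responses are normally obtained.

```python
import requests


def describe_status(status_code):
    """Return 'ok', or the message of the HTTPError raised for this status code."""
    # Build a Response by hand so that no network access is needed.
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
        return "ok"
    except requests.HTTPError as exc:
        return f"HTTPError: {exc}"


print(describe_status(200))
print(describe_status(404))
print(describe_status(500))
```

Because `requests.HTTPError` is a subclass of `requests.RequestException`, catching the latter in a fetch function also covers timeouts and connection errors.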

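2. XPath

XPath is a language for finding information in XML documents.
An XML document is treated as a tree of nodes; the root of the tree is called the root node or the document node.
The relationships between nodes are: parent, children, siblings, ancestors (a node's parent, the parent's parent, and so on), and descendants (a node's children, the children's children, and so on).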
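
The node relationships listed above can be explored with lxml's element API. This is a small sketch on a made-up bookstore document (the XML and the variable names are assumptions for illustration):

```python
from lxml import etree

# A tiny, made-up XML document used only for illustration.
XML = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = etree.fromstring(XML)   # root element: <bookstore>
first_book = root[0]           # first child of the root
title = first_book[0]          # <title> inside the first <book>

parent = title.getparent().tag                         # parent of <title>
children = [child.tag for child in first_book]         # children of <book>
sibling = title.getnext().tag                          # next sibling of <title>
ancestors = [a.tag for a in title.iterancestors()]     # nearest ancestor first
descendants = [d.tag for d in root.iterdescendants()]  # document order

print(parent, children, sibling, ancestors, descendants)
```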

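2.1 Syntax

a. Selecting nodes

Expression	Description
/	Selects from the root node (absolute path)
//	Selects matching nodes anywhere in the document, regardless of their position
@	Selects attributes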
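
A short sketch of the three selectors using lxml's `xpath()` method. The sample XML is invented for this example and is not the Douban page:

```python
from lxml import etree

# Made-up sample document for illustration.
XML = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="zh">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

# "/" walks an absolute path starting from the document root.
absolute = root.xpath("/bookstore/book/title/text()")

# "//" matches nodes at any depth in the document.
anywhere = root.xpath("//price/text()")

# "@" selects an attribute instead of an element.
langs = root.xpath("//title/@lang")

print(absolute)
print(anywhere)
print(langs)
```

Appending `text()` returns the text content of the matched elements as strings; without it, `xpath()` returns the element objects themselves.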

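b. Predicates

A predicate is used to find a specific node or a node that contains a specific value. Predicates are written inside square brackets.

Expression	Description
/bookstore/book[1]	Selects the first book element that is a child of the bookstore element.
/bookstore/book[last()]	Selects the last book element that is a child of the bookstore element.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of the bookstore element.
/bookstore/book[position()<3]	Selects the first two book elements that are children of the bookstore element.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of the bookstore element whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects all title elements of the book elements of the bookstore element whose price element has a value greater than 35.00.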
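
The predicates in the table can be tried out with lxml. The sketch below runs them against a made-up bookstore with four books; the titles, prices, and the helper name `titles` are assumptions for illustration.

```python
from lxml import etree

# Made-up sample data: four books, one of them without a lang attribute.
XML = """<bookstore>
  <book><title lang="eng">Book A</title><price>29.99</price></book>
  <book><title lang="eng">Book B</title><price>39.95</price></book>
  <book><title>Book C</title><price>49.99</price></book>
  <book><title lang="fr">Book D</title><price>19.50</price></book>
</bookstore>"""

root = etree.fromstring(XML)


def titles(expr):
    """Evaluate expr and return the title text of every matched <book>."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")                # XPath positions start at 1
last = titles("/bookstore/book[last()]")
second_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
with_lang = root.xpath("//title[@lang]/text()")
eng = root.xpath("//title[@lang='eng']/text()")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")

for name, value in [("first", first), ("last", last), ("second_last", second_last),
                    ("first_two", first_two), ("with_lang", with_lang),
                    ("eng", eng), ("expensive", expensive)]:
    print(name, value)
```

Note that XPath positions are 1-based, unlike Python list indices, so `book[1]` is the first book rather than the second.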