Python Douban Crawler Basics, Part 1

W_AM_I

Tags: python, crawler

Article link: https://blog.csdn.net/weixin_45926926/article/details/116941311

1. Finding the request header (head)

head = {  # Mimics a browser's request headers; the User-Agent string lets the request pass as coming from a browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}

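This header makes the request look as if it came from Google Chrome. You can copy the value from the Network panel of Chrome's developer tools, but recording of the network log must be enabled first.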

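import urllib.error
import urllib.request


def ask_url(url):  # placeholder name; the original post shows only the function body
    request = urllib.request.Request(url, headers=head)  # build a request object that carries the custom headers
    html = ""
    try:  # try/except guards against request failures such as an HTTP 404 or 500 response
        response = urllib.request.urlopen(request)  # send the request and receive the response
        html = response.read().decode("utf-8")  # decode the response bytes as UTF-8
        print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # hasattr() returns True if the object has the named attribute, otherwise False
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html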

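Note the different roles of the two functions urllib.request.Request() and urllib.request.urlopen(): Request() only builds a request object (URL plus headers), while urlopen() sends it and returns the response.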
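The difference between building a request and sending it can be sketched as follows. This is a minimal illustration, not the post's own code: the URL is an assumed example (the original never shows the value of `url`), and `fetch` is a hypothetical helper name. Nothing is sent over the network until `fetch` is called.

```python
import urllib.request

# Assumed example URL; the original post does not show the value of `url`.
url = "https://movie.douban.com/top250"
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}

# Request() only packages the URL and headers; no network traffic happens here.
request = urllib.request.Request(url, headers=head)
full_url = request.full_url
method = request.get_method()  # "GET" because no request body was supplied
user_agent = request.get_header("User-agent")  # urllib stores header names capitalized this way


def fetch(req):
    """Hypothetical helper: urlopen() is the call that actually sends the request."""
    with urllib.request.urlopen(req) as response:
        return response.read().decode("utf-8")
```

Inspecting `full_url`, `method`, and `user_agent` before calling `fetch(request)` is a cheap way to check that the disguised header is really attached.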

if __name__ == "__main__":

This marks where the program starts when the file is run directly as a script.
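A minimal sketch of how this guard is typically used. The function name `main` and its return value are placeholders for illustration; a real crawler would fetch and parse pages inside it.

```python
def main():
    # Hypothetical entry function: a real crawler would fetch and parse pages here.
    return "crawl finished"


if __name__ == "__main__":
    # This branch runs only when the file is executed directly,
    # not when the file is imported as a module by another script.
    print(main())
```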

2. BeautifulSoup

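BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: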

Tag
NavigableString
BeautifulSoup
Comment

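from bs4 import BeautifulSoup  # used to extract specific content from the page
with open("./demo6.html", "rb") as file:  # "rb": read, bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # BeautifulSoup parses HTML and XML (not JSON); this file is HTML, so use the built-in html.parser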

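# 1. Tag: bs.<name> returns the first tag with that name together with its contents (for example, head is a tag name)
# print(bs.title) # <title>Document</title>
# print(bs.ul.li)
# print(type(bs.head))

# 2. NavigableString: the content inside a tag; think of it as a string
# print(bs.li.string)
# print(type(bs.li.string)) # <class 'bs4.element.NavigableString'>

# 3. attrs: quickly get all attributes of a tag (here div) as a dictionary of key-value pairs
# print(bs.div.attrs)  # {'class': ['container']}

# 4. BeautifulSoup: represents the whole document
# print(type(bs)) # <class 'bs4.BeautifulSoup'>

# 5. Comment: what .string returns when the tag's content is an HTML comment
# print(bs.li.string) # 小米手机
# print(type(bs.li.string)) # <class 'bs4.element.Comment'>
# Note that the comment markers are stripped and the type becomes Comment, a special kind of NavigableString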

In short, four types are used: Tag, NavigableString, BeautifulSoup, and Comment.
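The four types can be checked without the demo6.html file by parsing a small inline document. The HTML string below is made up for illustration and is not the author's demo file.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up HTML for illustration; it stands in for the post's demo6.html.
html = """
<html><head><title>Document</title></head>
<body>
<div class="container" id="head">
<ul><li><!--hidden item--></li><li>second item</li></ul>
</div>
</body></html>
"""

bs = BeautifulSoup(html, "html.parser")

title_tag = bs.title          # Tag: the first <title> element
title_text = bs.title.string  # NavigableString: the text inside the tag
first_li_text = bs.li.string  # Comment: the first <li> holds only an HTML comment
div_attrs = bs.div.attrs      # dict with all attributes of the first <div>

print(type(bs).__name__, type(title_tag).__name__,
      type(title_text).__name__, type(first_li_text).__name__)
```

Because Comment is a subclass of NavigableString, code that only wants visible text should test for Comment explicitly before using `.string`.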

1. Traversing the document

print(bs.head.contents)# ['\n', <meta charset="utf-8"/>, '\n', <meta content="width=device-width,
# initial-scale=1.0" name="viewport"/>, '\n' .........

Besides contents, there are many other traversal attributes, such as descendants and strings.
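A short sketch of how these traversal attributes differ, using a made-up inline document rather than the author's file. `stripped_strings` is a variant of `strings` that skips whitespace-only text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML for illustration.
html = "<head><title>Document</title><meta charset='utf-8'/></head>"
bs = BeautifulSoup(html, "html.parser")

# contents: a list of the direct children only.
children = bs.head.contents

# descendants: a generator over children, grandchildren, and so on (tags and text nodes).
descendant_names = [node.name for node in bs.head.descendants if isinstance(node, Tag)]

# stripped_strings: every text node under the tag, with surrounding whitespace removed.
texts = list(bs.head.stripped_strings)

print(children)
print(descendant_names)
print(texts)
```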

2. Searching the document

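1. String filtering with find_all()

A string filter finds the tags whose name exactly matches the given string.

find_all() can be used with the following three kinds of filters: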

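# (1) find_all
# String filter: finds tags whose name exactly matches the string
# t_list = bs.find_all("li")
# print(t_list)

# Regular expression filter: tag names are matched with the pattern's search() method
# t_list = bs.find_all(re.compile("a"))  # requires "import re"; a compiled pattern works with many functions, here find_all, which returns a list

# Function filter: pass in a function and search according to its criteria
# def name_is_exists(tag):
#     return tag.has_attr("name")
#
#
# t_list = bs.find_all(name_is_exists)
# print(t_list)
# for item in t_list:
#     print(item)  # print the list one item at a time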
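The three filters can be tried side by side on a small inline document. The HTML here is made up for illustration; the function name `name_is_exists` is taken from the commented code above.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div id="head"><a name="top" href="#">Top</a>
<ul><li>one</li><li>two</li></ul>
<span>text</span></div>
"""
bs = BeautifulSoup(html, "html.parser")

# String filter: only tags named exactly "li".
li_tags = bs.find_all("li")

# Regular expression filter: any tag whose name contains the letter "a".
a_like_tags = bs.find_all(re.compile("a"))


# Function filter: keep the tags for which the function returns True.
def name_is_exists(tag):
    return tag.has_attr("name")


named_tags = bs.find_all(name_is_exists)

print([t.name for t in li_tags])
print([t.name for t in a_like_tags])
print([t.name for t in named_tags])
```

Note that the regular expression matches `span` as well as `a`, because search() looks for the pattern anywhere in the tag name.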

2. Keyword arguments (kwargs)

1. Keyword arguments can select the tag whose id is "head", together with the child content it contains
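A minimal sketch of keyword-argument searching on a made-up inline document. The `id="head"` search is the case described above; the `class_` search is an extra illustration that is not in the original post (bs4 spells it with a trailing underscore because `class` is a reserved word in Python).

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div id="head" class="container"><a href="#">Home</a></div>
<div id="foot"><a href="#">About</a></div>
"""
bs = BeautifulSoup(html, "html.parser")

# Select the tag whose id attribute equals "head"; its children come along with it.
by_id = bs.find_all(id="head")

# Extra illustration (not in the original post): search by CSS class.
by_class = bs.find_all(class_="container")

for item in by_id:
    print(item)
```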
