Python Web Scraping Notes (1): the requests, re, and BeautifulSoup Libraries

ForestWorld

Article link: https://blog.csdn.net/yangliming4567/article/details/102798245

1. The requests library

requests is used to send HTTP requests. [1]

1.1 Importing requests

import requests

1.2 Fetching a web page

response = requests.get('http://www.baidu.com')  # request a URL

1.3 Printing response details and saving content

response = requests.get('http://www.baidu.com')  # request a URL
print(response.status_code)  # print the HTTP status code
print(response.url)  # print the URL of the response

response = requests.get('https://www.baidu.com/img/superlogo_c4d7df0a003d3db9b65e9ef0fe6da1ec.png?where=super')  # request an image
with open('C:/2.png', 'wb') as i:  # create 2.png on drive C and open it for binary writing
    i.write(response.content)  # write the image bytes to 2.png
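The calls above assume the request succeeds. A slightly more defensive version might look like the sketch below. The helper names fetch and save_body are hypothetical (they are not part of requests), and the 10 second timeout is an arbitrary choice; requests.get, raise_for_status() and response.content are the library's own.

```python
import requests


def fetch(url, timeout=10):
    """Request a URL and fail loudly on HTTP errors (hypothetical helper)."""
    # Without a timeout, requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response


def save_body(response, path):
    """Write the raw bytes of a response to a file (hypothetical helper)."""
    with open(path, "wb") as f:
        # write() returns the number of bytes written.
        return f.write(response.content)


# Example usage (needs network access, so it is left commented out):
# response = fetch("http://www.baidu.com")
# print(response.status_code, response.url)
# save_body(response, "baidu.html")
```

Splitting the download from the save keeps the file-writing part usable with any object that has a content attribute.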

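2. with open(...) as ...

2.1 Opening a file with open()

To open a file object in read mode, call Python's built-in open() function with the file name and a mode flag:

f = open(r'E:\python\python\test.txt', 'r')  # raw string, so that \t is not read as a tab

The mode flag 'r' means read. If the call returns without raising an exception, the file is open.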
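Windows paths deserve some care, because a backslash starts an escape sequence in an ordinary Python string literal. The sketch below uses a made-up path (E:\test.txt) to show the problem and two common ways around it.

```python
# In a normal string literal, '\t' is an escape sequence for a tab character.
plain = 'E:\test.txt'

# A raw string (r prefix) keeps every backslash as a literal backslash.
raw = r'E:\test.txt'

# Forward slashes are also accepted by open() on Windows.
forward = 'E:/test.txt'

print(repr(plain))    # shows the embedded tab as \t
print(repr(raw))      # shows the backslash doubled, meaning a literal backslash
print(repr(forward))
```

Either the raw string or the forward-slash form avoids the accidental tab.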

2.2 Reading a file with read()

Calling read() returns the entire contents of the file in a single call:

f = open('C:/test.txt')
print(f.read())
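Reading everything at once is fine for small files. For larger ones, it can help to read a limited number of characters or to go line by line. The following sketch creates a throwaway file first so that it runs anywhere; the file name and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
f = open(path, "w", encoding="utf-8")
f.write("first line\nsecond line\nthird line\n")
f.close()

f = open(path, "r", encoding="utf-8")
head = f.read(5)              # read at most 5 characters
rest_of_line = f.readline()   # read up to and including the next newline
remaining = [line.rstrip("\n") for line in f]  # iterate over the lines that are left
f.close()

print(repr(head))
print(repr(rest_of_line))
print(remaining)
```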

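2.3 Closing a file with close()

The last step is to call close(). A file should always be closed once you are done with it, because an open file object holds operating system resources, and the operating system limits how many files can be open at the same time.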

f.close()

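2.4 try ... finally

Reading or writing a file can raise IOError (an alias of OSError in Python 3), and if that happens the f.close() call after it never runs. To make sure the file is closed whether or not an error occurs, use try ... finally:

f = None  # lets the finally block tell whether open() succeeded
try:
    f = open('C:/test.txt', 'r')
    print(f.read())
finally:
    if f:
        f.close()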

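2.5 The with statement

Writing this out every time is tedious, so Python provides the with statement, which calls close() for us automatically when the block exits: [2]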

with open('/path/to/file', 'r') as f:
    print(f.read())
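To see that with really does close the file, even when an exception is raised inside the block, a small experiment along these lines can help. The temporary file and the simulated ValueError are only there to make the sketch self-contained.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# Normal case: the file is closed as soon as the with block ends.
with open(path, "w", encoding="utf-8") as f:
    f.write("hello\n")
closed_after_write = f.closed

# Error case: the file is still closed, and the exception propagates as usual.
try:
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
        raise ValueError("simulated failure while processing the file")
except ValueError:
    pass
closed_after_error = f.closed

print(closed_after_write, closed_after_error)
```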

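3. re

re is a module in Python's standard library, used mainly for matching strings against regular expressions.

3.1 Importing re

import re

3.2 re.findall()

Searches a string and returns every non-overlapping match as a list of strings:

import re
a = re.findall("玛丽", "玛丽有只小羊羔")
print(a)

Output: ['玛丽']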
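What findall() returns depends on how many capture groups the pattern has, which matters once patterns are used to pull values out of HTML. The sketch below uses a made-up one-line HTML string. The patterns are deliberately simple and would not cope with arbitrary real-world markup.

```python
import re

HTML = '<a href="http://news.baidu.com">News</a> <a href="http://map.baidu.com">Maps</a>'

# No group: each item is the whole matched text.
hosts = re.findall(r"\w+\.baidu\.com", HTML)

# One group: each item is the text captured by that group.
hrefs = re.findall(r'href="([^"]+)"', HTML)

# Two groups: each item is a tuple with one element per group.
pairs = re.findall(r'href="([^"]+)">([^<]+)<', HTML)

print(hosts)
print(hrefs)
print(pairs)
```

For anything beyond simple, predictable snippets, an HTML parser is usually a safer choice than regular expressions.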

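4. BeautifulSoup

BeautifulSoup 4 is a core tool for web scraping. Its main job is to extract data from web pages: it parses HTML that has already been downloaded and does not fetch pages itself. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. It supports the HTML parser in Python's standard library (html.parser) as well as several third-party parsers such as lxml. If no third-party parser is installed, Python's built-in parser is used. The lxml parser is more capable and faster, so it is the recommended choice when it is available.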
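One way to act on the parser advice above is to try lxml first and fall back to the built-in parser. This is only a sketch: make_soup is a hypothetical helper name, while BeautifulSoup and the FeatureNotFound exception come from bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.string)
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build slightly different trees from malformed HTML.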

Suppose the following HTML is saved as aa.html:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
    <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
        <div class="head_wrapper">
          <div id="u1">
            <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
            <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
            <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
            <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
            <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
            <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

Creating a BeautifulSoup 4 object: [3]

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:  # read the HTML file as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")

print(bs.prettify())  # the document, re-indented
print(bs.title)  # the whole <title> tag
print(bs.title.name)  # the name of the tag, 'title'
print(bs.title.string)  # the text inside <title>
print(bs.head)  # the whole <head> tag
print(bs.div)  # the first <div> tag and everything inside it
print(bs.div["id"])  # the id attribute of the first <div>
print(bs.a)  # the first <a> tag
print(bs.find_all("a"))  # a list of all <a> tags
print(bs.find(id="u1"))  # the first tag with id="u1"

for item in bs.find_all("a"):  # print the href of every <a> tag
    print(item.get("href"))
for item in bs.find_all("a"):  # print the text of every <a> tag
    print(item.get_text())
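The link texts in the sample page carry trailing spaces, and one href is protocol-relative (it starts with //). A possible way to clean both up is sketched below. The HTML string is a trimmed, made-up stand-in for the page, and extract_links is a hypothetical helper; select(), get_text(strip=True) and urljoin() are real bs4 and standard library calls.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

HTML = """
<div id="u1">
  <a class="mnav" href="http://news.baidu.com">News </a>
  <a class="bri" href="//www.baidu.com/more/">More </a>
</div>
"""


def extract_links(markup, base_url):
    """Return (text, absolute_url) pairs for the links inside #u1."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    # CSS selector: every <a> with an href attribute under the element with id="u1".
    for a in soup.select("#u1 a[href]"):
        text = a.get_text(strip=True)          # drop surrounding whitespace
        url = urljoin(base_url, a["href"])     # resolve relative and // URLs
        links.append((text, url))
    return links


print(extract_links(HTML, "https://www.baidu.com/"))
```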

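5. Anti-scraping measures and frequent requests

Many sites have anti-scraping measures. A common one is to treat any request that lacks browser-like headers (the User-Agent header in particular) as a crawler and reject it, so it is usually worth sending headers with every request.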
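With requests, headers can be passed per call through the headers argument or set once on a Session. The sketch below takes the Session route. The User-Agent string is only an example value, and make_session is a hypothetical helper name.

```python
import requests

# Example header values (assumptions); copy real ones from your own browser if needed.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}


def make_session(headers=HEADERS):
    """Return a Session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


session = make_session()
print(session.headers["User-Agent"])

# One-off alternative (needs network access, so it is left commented out):
# response = requests.get("http://www.baidu.com", headers=HEADERS, timeout=10)
```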

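Clients that send requests too frequently may have their IP address blocked. Routing requests through proxies makes them appear to come from different IP addresses, provided the proxy addresses actually work. [4]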
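A rough sketch of both ideas, pausing between requests and sending them through a proxy, is shown below. The proxy address is a placeholder, fetch_all is a hypothetical helper, and the one second delay is an arbitrary choice. The proxies and timeout arguments are part of the requests API.

```python
import time

import requests

# Placeholder proxy address (assumption); replace it with a proxy you are allowed to use.
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}


def fetch_all(urls, session=None, proxies=None, delay=1.0, timeout=10):
    """Fetch URLs one by one, sleeping between requests to limit the request rate."""
    session = session or requests.Session()
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # wait before every request except the first
        try:
            response = session.get(url, proxies=proxies, timeout=timeout)
            results[url] = response.status_code
        except requests.RequestException as exc:
            # A dead proxy or a blocked request ends up here; record the error and move on.
            results[url] = exc
    return results


# Example usage (needs network access and a working proxy, so it is left commented out):
# print(fetch_all(["http://www.baidu.com"], proxies=PROXIES))
```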

References

[1] Requests quickstart guide: https://requests.kennethreitz.org//zh_CN/latest/user/quickstart.html

[2] "Reading and writing files in Python with with open() as": https://blog.csdn.net/xrinosvip/article/details/82019844

[3] "BeautifulSoup beginner tutorial": http://www.jsphp.net/python/show-24-214-1.html

[4] "Learning web scraping together: learning the requests library by scraping the Douban Movie Top 250": https://www.cnblogs.com/airnew/p/9981599.html
