A Simple Python Web Scraping Tutorial: Notes

Rainbow-cocktail

Published 2023-03-13 12:14:46



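I have been learning web scraping recently. This article uses Genji's tutorial series on bilibili as its skeleton and combines it with more detailed CSDN tutorials on individual scraping topics, extending the breadth and depth of the material into a beginner-level scraping tutorial. It contains my own notes and understanding, to make things easier to remember and apply. Note: the article is easier to follow alongside the GenJi video series, because I do not explain the most basic parts covered in the videos.

The course comes from the GenJi series: 【04-理论课】如何用Python Requests发送请求？_哔哩哔哩_bilibili ("Theory lesson 04: How to send requests with Python Requests?" on bilibili)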

1. Sending requests with requests

2. Understanding HTML structure

3. Using Beautiful Soup

4. Example: scraping the Douban Top 250 movie titles into a list

1. Sending requests with requests

import requests

response = requests.get("http://books.toscrape.com/")
if response.ok: # response.ok returns True if status_code is less than 400, otherwise False.
    print(type(response))
    print(response.text)
else:
    print("false")
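To see how response.ok behaves for different status codes without touching the network, here is a minimal sketch. It builds empty requests.Response objects by hand and sets status_code manually, which is only for illustration (real responses come from requests.get); the helper name is_ok is my own.

```python
import requests


def is_ok(status_code):
    """Return response.ok for a hand-built response with the given status code."""
    response = requests.Response()  # empty response object, no request is sent
    response.status_code = status_code
    return response.ok


# 2xx and 3xx count as ok; 4xx and 5xx do not
results = {code: is_ok(code) for code in (200, 301, 404, 503)}
for code, ok in results.items():
    print(code, ok)
```

If you prefer an exception over an if/else, response.raise_for_status() raises requests.HTTPError for the same 4xx and 5xx codes.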

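A status_code below 400 means the request succeeded. Requests often fail because of anti-scraping measures; in that case we can change the User-Agent request header. (Reference: the CSDN blog post "Python反爬虫措施之User-Agent" by 程序猿编码.)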

import requests

response = requests.get("http://books.toscrape.com/")
print(response.request.headers)
print(response.status_code)

# {'User-Agent': 'python-requests/2.28.1', 
#  'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}
# 200

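A request sent directly from Python has the User-Agent 'python-requests/2.28.1'. Here status_code is 200, so the request succeeded, but requests sent this way are often rejected by anti-scraping measures. In that case we can make the request look like it came from a browser: enter about:version in the browser's address bar.

Set the User-Agent header to the user agent string shown on that page (boxed in red in the original screenshot) to disguise the request as coming from a browser. In most cases this solves the problem; if it does not, see the article referenced above and use the fake_useragent library.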

import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.55'}
response = requests.get("http://zhihu.com/",headers=headers)
print(response.request.headers)
print(response.status_code)
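The effect of the headers argument can also be checked offline. The sketch below prepares a request through a requests.Session without sending it and compares the default User-Agent with the overridden one. The constant BROWSER_UA is just an example browser string, not a required value.

```python
import requests

# Example browser User-Agent string (an assumption; copy your own from about:version)
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
)

session = requests.Session()
default_ua = session.headers["User-Agent"]  # e.g. 'python-requests/2.28.1'

# Build the request and merge in the session defaults, but do not send it
request = requests.Request(
    "GET", "http://books.toscrape.com/", headers={"User-Agent": BROWSER_UA}
)
prepared = session.prepare_request(request)

print(default_ua)
print(prepared.headers["User-Agent"])  # the browser string replaces the default
```

Headers passed per request override the session defaults, while the other defaults such as Accept-Encoding are kept.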

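2. Understanding HTML structure

(Reference: the CSDN blog post "超级简单的Python爬虫教程" by 快乐老男孩！)

Open Notepad, type the following, rename the file to file.html, and open it in a browser to see the result.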

<!DOCTYPE HTML>
<html>
    <body>
        <h1>I am a header</h1>
        <p>I am a paragraph</p>
    </body>
</html>

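<!DOCTYPE HTML> tells the browser that this is an HTML document. The page content sits between <html> and </html>, and the content visible to the user sits between <body> and </body>. <h1></h1> is a first-level heading and <p></p> is a paragraph.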
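The nesting of these tags can be printed from Python. This is a small sketch using the standard-library html.parser module (not Beautiful Soup, which comes later); the class name OutlinePrinter and the PAGE string are my own, and the sketch assumes every opened tag is also closed.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE HTML>
<html>
    <body>
        <h1>I am a header</h1>
        <p>I am a paragraph</p>
    </body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(PAGE)
print("\n".join(printer.outline))
```

The output lists html, then body one level in, then h1 and p side by side inside body, which is the tree a scraper walks.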

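The next example introduces some common HTML structural tags:

div and span are used as containers, comparable to code blocks

ol and ul are lists, table is a table, a is a hyperlink, and img is an image whose src attribute holds the image address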

<html>
    <head>
      <meta charset="UTF-8">
      <title>This is the page title shown on the browser tab</title>
    </head>
	<body>
	   <div>
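		<h1>I am the largest heading in the first container, a div</h1>
		<p>Basic tag operations such as line breaks are shown below</p>
		<p>Give the years</p>
		<p>a civilization</p>
		<p>Give the years<br>a civilization</p>
		<p>Give <i>the years (the i tag makes text italic)</i> a civilization</p>
		<p>Give <b>the years (the b tag makes text bold)</b> a civilization</p>
		<p>Give <u>the years (the u tag underlines text)</u> a civilization</p>
		<h2>I am a second-level heading</h2>
		<h3>I am a third-level heading</h3>
		<img src="https://up.sc.enterdesk.com/edpic/f6/8f/1a/f68f1afd69ca12cb04980ef05af7815e.jpg" width="500px" height="500px">
		<a href="https://www.bilibili.com/" target="_self">The a (anchor) tag adds a link through href (hypertext reference); _self opens it in the current page</a>
		<a href="https://www.bilibili.com/" target="_blank">The a (anchor) tag adds a link through href (hypertext reference); _blank opens it in a new page</a>
		<p>Lists are shown below</p>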
	      <ol>
		  <li>I am list item 1 of the ordered list (ol)</li>
		  <li>I am list item 2 of the ordered list (ol)</li>
		  <li>I am list item 3 of the ordered list (ol)</li>
	      </ol>
	      <ul>
		  <li>I am list item 1 of the unordered list (ul)</li>
		  <li>I am list item 2 of the unordered list (ul)</li>
		  <li>I am list item 3 of the unordered list (ul)</li>
	      </ul>
	   </div>
	   <span style="color:blue">
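		<h1>I am the largest heading in the second container, a span, and I am blue</h1>
		<table border="1">
			<thead>
				<tr>
					<td>I am header cell 1: table data (td) in a table row (tr) of the table head (thead)</td>
					<td>I am header cell 2</td>
				</tr>
			</thead>
			<tbody>
				<tr>
					<td>I am table body row 1, cell 1 (1,1)</td>
					<td>I am table body row 1, cell 2 (1,2)</td>
				</tr>
				<tr>
					<td>I am table body row 2, cell 1 (2,1)</td>
					<td>I am table body row 2, cell 2 (2,2)</td>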
				</tr>
			</tbody>
		</table>
	   </span>

	</body>
</html>

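There is also the class attribute, which distinguishes elements such as text paragraphs, for example:

<p class="content">I belong to the content class: Give the years a civilization</p>
<p class="review">I belong to the review class: five-star review</p>

A small tip: PyCharm is convenient for writing HTML. The code below was written in PyCharm, and it also shows the difference between div and span.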

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
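    <title>This is the page title shown on the browser tab</title>
</head>
<body>
    <div style="background-color:red;">
        <h1>I am a level-1 heading</h1>
        <h2>I am a level-2 heading</h2>
        <h6>I am the smallest heading, h6; there is no h7</h6>
        <p>This is a text paragraph<br>I used br to break the line <i>haha</i></p>
    </div>
    <p>Here I want to show the difference between div and span: <span style="background-color:aqua">this span is aqua</span>, <span style="background-color:violet">this span is violet</span>, and both stay inline, while the div above is a red block across the full width</p>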
</body>
</html>

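3. Using Beautiful Soup

Beautiful Soup is used to parse HTML.

Create soup as an instance of the BeautifulSoup class; its constructor takes the HTML and the name of an HTML parser (html.parser). After that we mostly call the find and findAll methods of the instance (findAll is spelled find_all in the current Beautiful Soup documentation).

See also: the CSDN blog post "python爬虫beautifulsoup findall函数详解" by 白速龙王的回眸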
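Before hitting a real site, the difference between the two methods can be tried on a small inline snippet. This is a minimal sketch; the HTML string is made up for illustration and find_all is used as the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hand-written sample HTML (an assumption, not taken from a real page)
html = """
<ul>
  <li class="book">A Light in the Attic</li>
  <li class="book">Tipping the Velvet</li>
  <li class="note">out of stock</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

first_book = soup.find("li", {"class": "book"})     # first matching tag only
all_books = soup.find_all("li", {"class": "book"})  # every matching tag
missing = soup.find("table")                        # no match gives None

print(first_book.string)
print([li.string for li in all_books])
print(missing)
```

find returns a single tag or None, so check for None before reading .string; find_all always returns a list-like result, which is empty when nothing matches.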

import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com/")
response.encoding = "utf-8"  # decode the page as UTF-8; otherwise "£" may come out as "Â£"
soup = BeautifulSoup(response.text, "html.parser")  # create the BeautifulSoup instance

# Extract the prices on the page
all_price = soup.findAll("p", {"class": "price_color"})  # findAll returns an iterable of tags
print(all_price)
# [<p class="price_color">£51.77</p>, <p class="price_color">£53.74</p>, ...]

for i in all_price:
    print(i)  # <p class="price_color">£51.77</p>
    print(i.string)  # .string returns the text inside the tag: £51.77
    print(i.string[1:])  # drop the currency sign: 51.77
    print(str(i.string))  # convert the NavigableString to a plain str
    print(float(str(i.string)[1:]))  # the price as a number
    # break

# Extract the titles on the page
all_title = soup.findAll("h3")  # the titles sit under h3, so find every h3
for item in all_title:
    print(item)  # <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
    links = item.findAll("a")  # the title is in the a tag, so extract a from each h3
    for i in links:
        print(i.string)  # A Light in the ...
    # break
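Prices from this site can show up as Â£51.77 instead of £51.77. My understanding is that this happens when the page's UTF-8 bytes are decoded as ISO-8859-1, which requests falls back to when the server sends a text content type without a charset. The sketch below reproduces and reverses the garbling with plain strings, no network needed.

```python
price = "£51.77"

raw_bytes = price.encode("utf-8")         # what the server sends: b'\xc2\xa351.77'
garbled = raw_bytes.decode("iso-8859-1")  # wrong decoding turns 2 bytes into 2 characters
repaired = garbled.encode("iso-8859-1").decode("utf-8")  # undo the wrong decoding

print(garbled)
print(repaired)

# Strip the currency sign (garbled or not) before converting to a number
amount = float(garbled.lstrip("Â£"))
print(amount)
```

Slicing with [2:] only works on the garbled text and [1:] only on the correct text, so stripping the known prefix characters is the less fragile choice.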

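4. Example: scraping the Douban Top 250 movie titles into a list

Open the Douban Movie Top 250 page, right-click and choose Inspect, click the element picker (the mouse-arrow icon in the top-left corner of the developer tools), and hover over the title of 肖申克的救赎 (The Shawshank Redemption). The panel on the right shows where the title sits in the source. Note the feature <span class="title">: this is what we scrape by.

The code below only scrapes the list on the current page, that is, the first 25 movies.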

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")  # BeautifulSoup is the class, html.parser is the HTML parser
toplist = []
for span in soup.findAll('span', attrs={'class': "title"}):  # findAll returns an iterable of tags
    title = span.string  # we only need .string, the text inside the tag
    print(title)
    # 肖申克的救赎
    # /The Shawshank Redemption
    # The English title is extracted as well.
    # English titles contain '/', so we use that to skip them.
    if '/' not in title:
        toplist.append(title)

print(toplist)
print(len(toplist))
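The '/' filter can be tested offline on a small snippet. The HTML below is hand-written to mimic the structure described above (Chinese and English titles both in span tags with class title); it is an assumption for illustration, not a copy of Douban's markup.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the described structure (an assumption)
SAMPLE = """
<div class="hd">
  <span class="title">肖申克的救赎</span>
  <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
</div>
<div class="hd">
  <span class="title">霸王别姬</span>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
titles = [span.get_text() for span in soup.find_all("span", class_="title")]

# Alternate-language titles start with a slash separator, so drop them
chinese_only = [title for title in titles if "/" not in title]

print(titles)
print(chinese_only)
```

One caveat of this filter: a Chinese title that itself contained '/' would be dropped too, so it is worth checking that the final list has the expected length.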

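Next we want to scrape all 250 movies.

Notice the pattern in the URL of each following page:

movie.douban.com/top250?start=0&filter= movies 1-25

movie.douban.com/top250?start=25&filter= movies 26-50

movie.douban.com/top250?start=50&filter= movies 51-75

So we need to keep changing the value of start= to visit the different pages and scrape each one.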
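The page URLs can be generated up front and checked before any request is sent. This is a minimal sketch using the standard library's urlencode; the names BASE_URL, PAGE_SIZE, TOTAL and page_urls are my own.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # movies per page, from the URL pattern above
TOTAL = 250     # size of the whole list


def page_urls():
    """Return the URL of every page: start=0, 25, 50, ..., 225."""
    return [
        f"{BASE_URL}?{urlencode({'start': start, 'filter': ''})}"
        for start in range(0, TOTAL, PAGE_SIZE)
    ]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

range(0, TOTAL, PAGE_SIZE) stops before 250, so the last page starts at 225 and the list has exactly ten URLs.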

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}

toplist = []  # the scraped titles go here
for num in range(0, 226, 25):
    print(num)  # 0 25 50 75 100 ..... 225

    response = requests.get("https://movie.douban.com/top250?start={}".format(num), headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")  # BeautifulSoup is the class, html.parser is the HTML parser
    for span in soup.findAll('span', attrs={'class': "title"}):  # findAll returns an iterable of tags
        title = span.string
        if '/' not in title:  # skip the English titles
            toplist.append(title)

print(toplist)  # ['肖申克的救赎', '霸王别姬', '阿甘正传' ....'我爱你', '地球上的星星']
print(len(toplist))  # 250