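How Can a Beginner Get Started with Web Scraping? From Complete Novice to Capable Scraper: This One Article Is Enough (With Plenty of Scraper Examples)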

Latest recommended article published on 2024-08-02 17:00:17

sheep.ice

Tags: web scraping

Article link: https://blog.csdn.net/qq_60556896/article/details/135895611

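Bilibili tutorials on web scraping

Beginner web scraping tutorial

More advanced web scraping tutorial

About the HTTP and HTTPS protocols

(A first look at this material is enough for now; you can study it in depth when you take a computer networking course.)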

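The HTTP protocol

Concept: a protocol through which a server and a client exchange data
Common request headers:
1. User-Agent: identifies the client making the request (for example, the Google Chrome browser)
2. Connection: whether to close the connection or keep it alive once the request completes
Common response headers:
1. Content-Type: the type of the data the server sends back to the client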
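
As a small illustration of these headers, the sketch below builds a request object with the standard library's urllib.request (swapped in for requests so that it runs without third-party packages or network access) and then parses a sample Content-Type value. The URL and header values are made up for the example.

```python
from email.message import Message
from urllib.request import Request

# Hypothetical URL and header values, chosen only for illustration.
req = Request(
    "https://example.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (demo)",
        "Connection": "keep-alive",
    },
)

# urllib stores header names capitalized, e.g. "User-agent".
request_headers = dict(req.header_items())
print(request_headers)

# A response's Content-Type tells the client how to interpret the body.
response_header = Message()
response_header["Content-Type"] = "application/json; charset=utf-8"
media_type = response_header.get_content_type()
charset = response_header.get_content_charset()
print(media_type, charset)
```

Nothing is sent over the network here; the request object is only built and inspected.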

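The HTTPS protocol

Concept: the 's' stands for 'secure'; HTTPS is the secure version of the Hypertext Transfer Protocol
Encryption methods
1. Symmetric-key encryption: both sides use the same secret key. The client chooses the key (that is, how to encrypt and decrypt) and sends it to the server, and the server uses that key to decrypt the encrypted message it receives. If the key is stolen or intercepted in transit, the scheme is no longer secure.
2. Asymmetric-key encryption: uses a private key and a public key (call the server A and the client B). The server sends its public key to the client, the client encrypts with it, and the server decrypts with its own private key, so the ciphertext and the decryption key are never sent together. However, the public key can be intercepted and replaced by a party in the middle.
The following explanation comes from the video's bullet comments (doge):
1. Certificate-based encryption (the approach HTTPS uses): the server's public key arrives inside a certificate signed by a certificate authority. The client checks the certificate, encrypts with that public key, and the server decrypts with its private key.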
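
For scraping, the practical consequence is that an HTTPS client checks the server's certificate before it trusts the connection. A minimal sketch, using Python's standard ssl module rather than anything from the tutorials above, shows that the default client context has this verification switched on:

```python
import ssl

# Default client-side TLS settings, as used for HTTPS connections.
context = ssl.create_default_context()

# The server must present a certificate signed by a trusted authority...
certificate_required = context.verify_mode == ssl.CERT_REQUIRED
# ...and the certificate must match the host name being requested.
hostname_checked = context.check_hostname

print(certificate_required, hostname_checked)
```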

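About requests

The urllib module
The requests module
Purpose: simulate a browser sending requests

Steps: 1. specify the URL, 2. send the request, 3. get the response data, 4. persist the data

Example code 1

This example is mainly for getting familiar with three arguments of requests.get() (the URL, headers, and params). Changing params changes the query string (the part of the URL after '?'), which lets you request different pages.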

from bs4 import BeautifulSoup
import requests
import re

headers = {
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
params = {
    'wd':'hello'
}
response = requests.get("https://www.baidu.com", headers=headers, params=params)
html = response.text

with open('./a.html', 'w', encoding='utf-8') as fp:
    fp.write(html)
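
To see what params does to the URL, here is a rough equivalent built with the standard library's urllib.parse. It is an illustration only: requests does this internally and may format the URL slightly differently, and the second keyword is just an example value.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://www.baidu.com/"


def build_url(base, params):
    """Append params to base as a query string (the part after '?')."""
    return base + "?" + urlencode(params)


url_hello = build_url(base_url, {"wd": "hello"})
url_python = build_url(base_url, {"wd": "python 爬虫"})
print(url_hello)
print(url_python)  # spaces and non-ASCII characters are encoded

# Parsing the query string gives back the original parameters.
recovered = parse_qs(urlsplit(url_python).query)
print(recovered)
```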

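Example code 2

This example calls the interface behind Baidu Translate, mainly to get familiar with the arguments of requests.post().

It is a POST request (it carries parameters)
The response data is JSON

You can use the browser's Inspect tool to check the request parameters, including whether the response is JSON.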

from bs4 import BeautifulSoup
import requests
import re
import json

headers = {
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
params = {
    'kw':'dog'
}
response = requests.post("https://fanyi.baidu.com/sug", headers=headers, data=params)
html = response.json()

# The response contains Chinese; without ensure_ascii=False it would be written as \uXXXX escapes
with open('./a.json', 'w', encoding='utf-8') as fp:
    json.dump(html, fp=fp, ensure_ascii=False)
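
Two details of this example can be checked without touching the network: what data= sends, and what ensure_ascii=False changes. The sketch uses only the standard library; the sample dict is made up and is not the real shape of the Baidu response.

```python
import json
from urllib.parse import urlencode

# Passing a dict as data= to requests.post() sends a form-encoded body like this.
form_body = urlencode({"kw": "dog"})
print(form_body)

# Made-up sample standing in for a JSON response that contains Chinese text.
sample = {"k": "dog", "v": "狗"}

escaped = json.dumps(sample)                       # default: \uXXXX escapes
readable = json.dumps(sample, ensure_ascii=False)  # keeps the characters
print(escaped)
print(readable)
```

Both strings decode to the same data; the setting only affects how readable the saved file is.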

Example code 3

Scrape the names of KFC restaurants in a given location from the KFC website and save them as JSON

from bs4 import BeautifulSoup
import requests
import re
import json
import jsonlines

headers = {
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
restrants = []
for pageIndex in range(1, 11):
    print(pageIndex)
    params = {
        'cname': '',
        'pid': '',
        'keyword': '北京',  # search keyword: Beijing
        'pageIndex': pageIndex,
        'pageSize': 10,
    }
    response = requests.post("http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword", headers=headers,data=params)
    html = response.text
    json_html = json.loads(html)
    for all_restrant in json_html['Table1']:
        storeName = all_restrant['storeName']
        addressDetail = all_restrant['addressDetail']
        restrants.append(
            {
                'storeName':storeName,
                'addressDetail':addressDetail,
            }
        )

file_name = './a.json'
with open(file_name, 'w', encoding='utf-8') as json_file:
    json.dump(restrants, json_file, ensure_ascii=False, indent=2)

# with jsonlines.open(file_name, 'w') as jsonl_file:
#     jsonl_file.write_all(restrants)

# Read the JSON file back
with open(file_name, 'r', encoding='utf-8') as json_file:
    # Parse the contents of the JSON file
    data = json.load(json_file)
# Print the parsed Python object
print(data)
# fp = open('./a.json', 'w', encoding='utf-8')
# json.dump(html, fp=fp, ensure_ascii=False)
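
The commented-out jsonlines lines above point at an alternative layout, JSON Lines, with one JSON object per line. A minimal sketch of that format using only the json module (the two records are invented sample data, and an in-memory buffer stands in for the file):

```python
import io
import json

# Invented sample records with the same keys as the scraper's output.
records = [
    {"storeName": "Store A", "addressDetail": "Address 1"},
    {"storeName": "Store B", "addressDetail": "Address 2"},
]

buffer = io.StringIO()  # stands in for open('a.jsonl', 'w', encoding='utf-8')
for record in records:
    # One compact JSON object per line.
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading back: parse each non-empty line on its own.
loaded = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
print(loaded)
```

This layout lets you append records page by page instead of holding the whole list in memory before writing.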

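Some resources are loaded dynamically: when you follow the <a> links on a page, the target URLs share the same prefix and differ only in their parameters. In that case you can use the approach above to fetch the JSON, parse out the one or more parameters that differ, and then send further requests with those parameters. This means analyzing the structure of the page carefully before you start scraping.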
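
As a sketch of that workflow, suppose a list endpoint returns JSON in which each item carries an ID, and the detail endpoint takes that ID as a parameter. The endpoint URL, the field names, and the sample response below are all hypothetical.

```python
import json
from urllib.parse import urlencode

# Hypothetical JSON returned by a list endpoint.
list_response = '{"list": [{"ID": "a1", "name": "x"}, {"ID": "b2", "name": "y"}]}'

# Hypothetical detail endpoint: same prefix, only the id parameter differs.
DETAIL_URL = "https://example.com/detail"


def detail_url(item_id):
    """Build the detail URL for one item."""
    return DETAIL_URL + "?" + urlencode({"id": item_id})


# Step 1: parse the JSON and collect the parameter that differs per item.
ids = [item["ID"] for item in json.loads(list_response)["list"]]
# Step 2: build one follow-up request URL per parameter value.
detail_urls = [detail_url(item_id) for item_id in ids]
print(detail_urls)
```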

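Overview of data parsing

After fetching a page, we need to parse it to extract just the part of its content that we want.

Types of data parsing

Regular expressions
bs4
xpath

How data parsing works

Locate the target tags
Extract the data stored in the tags or in their attributes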
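
Both steps can be seen in a small sketch. It uses the standard library's html.parser instead of the three tools listed above, purely so that it runs with no extra packages; the HTML snippet is invented and deliberately flat (no nested div tags).

```python
from html.parser import HTMLParser

SAMPLE = """
<div class="hd"><a href="https://example.com/movie/1">Movie One</a></div>
<div class="ft"><a href="https://example.com/other">Other</a></div>
<div class="hd"><a href="https://example.com/movie/2">Movie Two</a></div>
"""


class HdLinkParser(HTMLParser):
    """Collect (href, text) for <a> tags inside <div class="hd">."""

    def __init__(self):
        super().__init__()
        self.in_hd = False  # currently inside a matching <div>?
        self.href = None    # href of the <a> whose text we are waiting for
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Step 1: locate the tag by name and attribute.
            self.in_hd = attrs.get("class") == "hd"
        elif tag == "a" and self.in_hd:
            # Step 2a: extract data stored in an attribute.
            self.href = attrs.get("href")

    def handle_data(self, data):
        if self.href is not None:
            # Step 2b: extract data stored as the tag's text.
            self.links.append((self.href, data.strip()))
            self.href = None

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_hd = False


parser = HdLinkParser()
parser.feed(SAMPLE)
print(parser.links)
```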

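About bs4

bs4 documentation

bs4 is mainly used for parsing HTML; with it you can quickly extract much of a page's content.

The example below can be used as practice.

Example code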

from bs4 import BeautifulSoup
import requests
import re
# Optionally set the request headers
headers = {
    # Makes the request look like it comes from a browser
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
movie_href = soup.find_all("div", attrs={"class":"hd"})

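About regular-expression matching with re

re documentation

Regular expressions let us pull out the substrings we need, which is often easier than ordinary string handling.

A typical use of re is extracting URLs, for example the URLs of images we then want to download.

Downloading image data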

from bs4 import BeautifulSoup
import requests
import re
import json
import jsonlines

headers = {
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}

response = requests.get('https://lmg.jj20.com/up/allimg/tp10/22022312542M617-0-lp.jpg', headers=headers)
# Get the raw bytes of the image
image_wb = response.content

with open('./a.jpg', 'wb') as fp:
    fp.write(image_wb)

A regular expression written for this kind of task can look long, but most of it usually comes down to the non-greedy pattern .*?
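
The difference that .*? makes is easiest to see next to the greedy .* form. A short sketch on an invented tag string:

```python
import re

tag = '<a href="https://example.com/p/1" class="title">Item</a>'

# Greedy: .* runs on to the LAST double quote in the string.
greedy = re.search(r'href="(.*)"', tag).group(1)
# Non-greedy: .*? stops at the FIRST double quote that lets the match succeed.
lazy = re.search(r'href="(.*?)"', tag).group(1)

print(greedy)
print(lazy)
```

The greedy version swallows the class attribute as well, which is why scraping patterns lean so heavily on the non-greedy form.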

Example code

Suppose we want to extract the href link from inside an <a></a> tag

from bs4 import BeautifulSoup
import requests
import re

headers = {
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
movie_href = soup.find_all("div", attrs={"class":"hd"})

for href in movie_href:
    s = str(href.a)
    match = re.search(r'href="(.*?)"', s)
    if match:
        href_value = match.group(1)
        print(href_value)
    else:
        print("no href found")

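About xpath

The most commonly used parsing approach; it is convenient, efficient, and widely applicable

How it works:

Instantiate an etree object and load the page source to be parsed into it
Call the object's xpath method with an xpath expression to locate tags and capture content

How to instantiate an etree object

From a local HTML file: etree.parse(filePath) (for HTML that is not well-formed XML, pass an HTML parser: etree.parse(filePath, etree.HTMLParser()))
From the internet: etree.HTML(page_text), where page_text is the downloaded page source
Then call xpath('xpath expression')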

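xpath expressions

They work much like file paths...

/ means a single level; // means any number of levels

(single level) /html/body/div and (multiple levels) /html//div both reach the div under body; the second form matches a div at any depth below html

Comparison

In soup.select('...'), the greater-than sign > plays the role of / and a space plays the role of //

Attribute selection: //div[@class="name"]
Index selection: //div[@class="name"]/p[3] (indices start at 1)
Getting text: //div[@class="name"]/p[3]/text() for the tag's own text, or //text() for all the text beneath it
Getting an attribute: //div[@class="name"]/img/@src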
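
These rules can be tried out on a tiny invented document. The sketch below uses the standard library's xml.etree.ElementTree so that it runs without lxml; note the assumptions this brings. ElementTree supports only a subset of xpath, paths are written relative to the root element (so /html/body/div becomes ./body/div), and text and attributes are read with .text and .get() rather than text() and @src.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
  <body>
    <div class="song">
      <p>first</p>
      <p>second</p>
      <p>third</p>
      <img src="/static/cover.jpg"/>
    </div>
    <div class="other"><div class="inner"/></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Single level at a time: only the div tags directly under <body>.
direct_divs = root.findall("./body/div")
# Any number of levels: every div anywhere below <html>.
all_divs = root.findall(".//div")

# Attribute selection plus index selection (indices start at 1).
third_p = root.find(".//div[@class='song']/p[3]")
third_text = third_p.text

# Attribute value of a located tag.
img_src = root.find(".//div[@class='song']/img").get("src")

print(len(direct_divs), len(all_divs), third_text, img_src)
```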

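(The following comes from the video. Some editors flag from lxml import etree as unresolved, but lxml still provides etree.)
If that happens, you can import it this way instead:

from lxml.html import etree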

Example code

from bs4 import BeautifulSoup
import requests
import re
import json
from lxml import html
import jsonlines

# Optionally set the request headers
headers = {
    # Makes the request look like it comes from a browser
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
hh = response.content

hh_text = html.fromstring(hh)
a = hh_text.xpath('//*[@id="content"]/div/div[1]/ol/li[7]/div/div[2]/div[2]/p[1]/text()')
print(a)

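About CAPTCHA recognition

Some websites require you to log in before you can access certain data, and the login form includes a CAPTCHA.

We have to enter the CAPTCHA to log in. It can be recognized in the following ways:

Manually, by eye (not recommended; inefficient)
Automatically, through a third-party service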

from bs4 import BeautifulSoup
import requests
import re
import json
from lxml import html
import jsonlines
import base64

_custom_url = "http://api.jfbym.com/api/YmServer/customApi"
_token = "uJgigF8CS5NR-t8ALI8-LRY2OUjC6UHY294tjnoyIfw"
_headers = {
    'Content-Type': 'application/json'
}
headers = {
    # Makes the request look like it comes from a browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
}
def common_verify(image, verify_type="50100"):
    payload = {
        "image": base64.b64encode(image).decode(),
        "token": _token,
        "type": verify_type
    }
    resp = requests.post(_custom_url, headers=_headers, data=json.dumps(payload))
    return resp.json()['data']['data']
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        # Read the raw bytes of the image file (the base64 encoding itself happens in common_verify)
        image_content = image_file.read()
    return image_content

session = requests.Session()
login_url = "https://cas.bjtu.edu.cn/auth/login/?next=/o/authorize/%3Fresponse_type%3Dcode%26client_id%3DaGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo%26state%3D1705809064%26redirect_uri%3Dhttps%3A//mis.bjtu.edu.cn/auth/callback/%3Fredirect_to%3D/home/"
# Fetch the login page through the session so its cookies are kept for the later POST
login_page = session.get(login_url, headers=headers).content
login_html = html.fromstring(login_page)
img_url = 'https://cas.bjtu.edu.cn/' + login_html.xpath('//*[@id="login"]/dl/dd[2]/div/div[3]/span/img/@src')[0]

# Download the CAPTCHA image through the same session
img_page = session.get(img_url, headers=headers).content
with open('./1.jpg', 'wb') as fp:
    fp.write(img_page)
print('Image downloaded!')
img_result = common_verify(image=image_to_base64('./1.jpg'))
print('Image recognized!')
print(img_result)
after_login_page_url = 'https://cas.bjtu.edu.cn/auth/login/?next=/o/authorize/%3Fresponse_type%3Dcode%26client_id%3DaGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo%26state%3D1705809064%26redirect_uri%3Dhttps%3A//mis.bjtu.edu.cn/auth/callback/%3Fredirect_to%3D/home/'
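# Note: csrfmiddlewaretoken and captcha_0 below were copied from one browser session;
# such values normally change per page load, so they may need to be read from the login page instead
data = {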
  'next':'/o/authorize/?response_type=code&client_id=aGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo&state=1705809064&redirect_uri=https://mis.bjtu.edu.cn/auth/callback/?redirect_to=/home/',
  'csrfmiddlewaretoken': 'dNjvND4fz99P99Qc2FhYxoFy8hnJGoAgcIWZ2M4Pw7dcMPYO655VGpJlUPez9OlZ',
  'loginname': '*********',
  'password': '***********',
  'captcha_0': '373515fc2ad2c8a9d25c8c938d6285c5c6737296',
  'captcha_1': img_result
}
after_page = session.post(after_login_page_url, data=data,headers=headers)
print(after_page.status_code)
final_page_url = 'https://mis.bjtu.edu.cn/home/'
final_page = session.get(url=final_page_url, headers=headers).text

with open('./a.html', 'w', encoding='utf-8') as fp:
    fp.write(final_page)
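
The form data above hard-codes csrfmiddlewaretoken and captcha_0, values that a server normally issues fresh with each login page. One way to avoid copying them by hand is to read the hidden <input> fields out of the login page HTML. The sketch below does this with the standard library on an invented HTML fragment; only the field names come from the code above, and the real page's markup may differ, so treat the layout as an assumption.

```python
from html.parser import HTMLParser

# Invented fragment; only the field names are taken from the form data above.
LOGIN_HTML = """
<form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="token-from-server">
  <input type="hidden" name="captcha_0" value="captcha-key-from-server">
  <input type="text" name="loginname">
</form>
"""


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all hidden <input> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.fields[attrs.get("name")] = attrs.get("value", "")


parser = HiddenInputParser()
parser.feed(LOGIN_HTML)
hidden_fields = parser.fields
print(hidden_fields)
```

The resulting dict could then be merged into the form data before posting, alongside the login name, password, and recognized CAPTCHA text.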

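About selenium

How selenium relates to web scraping

It makes it easy to get data that a site loads dynamically
It makes it easy to simulate logging in

Example code 1: headless mode and evading detection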

from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from lxml.html import etree
import requests
from PIL import Image
import base64
import json
from selenium.webdriver.chrome.options import Options

# Run without a visible browser window
chrom_options = Options()
chrom_options.add_argument('--headless')
chrom_options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=chrom_options)

driver.get('http://www.baidu.com')

print(driver.page_source)
time.sleep(3)
driver.quit()  # Close the browser when done

Example code 2: simulated login to the Beijing Jiaotong University MIS system

from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from lxml.html import etree
import requests
from PIL import Image
import base64
import json

_custom_url = "http://api.jfbym.com/api/YmServer/customApi"
_token = "uJgigF8CS5NR-t8ALI8-LRY2OUjC6UHY294tjnoyIfw"
_headers = {
    'Content-Type': 'application/json'
}
headers = {
    # Makes the request look like it comes from a browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
}
def common_verify(image, verify_type="50100"):
    payload = {
        "image": base64.b64encode(image).decode(),
        "token": _token,
        "type": verify_type
    }
    resp = requests.post(_custom_url, headers=_headers, data=json.dumps(payload))
    return resp.json()['data']['data']
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        # Read the raw bytes of the image file (the base64 encoding itself happens in common_verify)
        image_content = image_file.read()
    return image_content

driver = webdriver.Chrome()  # Create the Chrome driver
driver.get('https://mis.bjtu.edu.cn/home/')
driver.save_screenshot('./a.png')
img_ele = driver.find_element(By.XPATH, '//*[@id="login"]/dl/dd[2]/div/div[3]/span/img')
location = img_ele.location
size = img_ele.size
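# Crop box for the CAPTCHA in screenshot pixels; the factor of 2 presumably matches
# a display whose screenshots have twice the resolution of the page coordinates
rangle = (
    int(location['x']) * 2,
    int(location['y']) * 2,
    (int(location['x']) + size['width']) * 2,
    (int(location['y']) + size['height']) * 2,
)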

i = Image.open('./a.png')
fram = i.crop(rangle)
fram.save('./aa.png')
img_result = common_verify(image=image_to_base64('./aa.png'))
print('Image recognized!')
print(img_result)

time.sleep(3)

username = driver.find_element(By.ID, 'id_loginname')
passward = driver.find_element(By.ID, 'id_password')
yzm = driver.find_element(By.ID, 'id_captcha_1')
login_bt = driver.find_element(By.CSS_SELECTOR, '.btn-lg')

username.send_keys('********')
time.sleep(3)
passward.send_keys('********')
time.sleep(3)
yzm.send_keys(img_result)
time.sleep(3)
login_bt.click()
time.sleep(4)

jwxt = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[3]/div/dl/dd[1]/div/ul/li[1]/div/div[2]/h3/a')
jwxt.click()
time.sleep(100)  # Keep the browser open for 100 seconds before closing
driver.quit()  # Close the browser when done
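
The factor of 2 in the crop box above presumably compensates for a display whose screenshots have twice as many pixels as the page's coordinates (a high-DPI screen, for example); on other displays the factor would differ. A small helper, hypothetical and not part of the original script, makes that scale explicit:

```python
def crop_box(location, size, scale=2):
    """Convert an element's page location and size to screenshot pixels.

    location and size are dicts shaped like Selenium's WebElement.location
    and WebElement.size; scale is the assumed device pixel ratio.
    """
    left = int(location["x"]) * scale
    top = int(location["y"]) * scale
    right = (int(location["x"]) + size["width"]) * scale
    bottom = (int(location["y"]) + size["height"]) * scale
    return (left, top, right, bottom)


# Invented numbers for illustration.
box = crop_box({"x": 100, "y": 50}, {"width": 80, "height": 30})
box_unscaled = crop_box({"x": 100, "y": 50}, {"width": 80, "height": 30}, scale=1)
print(box)
print(box_unscaled)
```

Selenium elements also have a screenshot() method that saves an image of just that element, which may remove the need for cropping altogether.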

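Closing notes

I have previously scraped movie reviews and other content from the Douban site; if you are interested, take a look at this more complete project.

Project 1: scraping the reviews of a movie on Douban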

import requests
from bs4 import BeautifulSoup
from lxml import html
import re
import json
import os

# Work out the rating from the star class
def starCnt(x):
    match = re.search(r'allstar(\d+) rating', x)

    if match:
        # Extract the number and divide it by 5, so allstar50 becomes 10.0
        extracted_number = float(match.group(1)) / 5.0
        result = round(extracted_number, 1)
        return result
    else:
        return 0

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

# Create the dataset folder
movie_data_dir = './movie_data'
try:
    os.makedirs(movie_data_dir)
    print(f'Folder "{movie_data_dir}" created')
except FileExistsError:
    print(f'Folder "{movie_data_dir}" already exists')
except Exception as e:
    print(f'Error while creating the folder: {e}')

movie_id = 35725869
params = {
    'percent_type':'h',
    'limit':1,
    'status':'P',
    'sort':'new_score',
}

# Specify the URL
response = requests.get(f'https://movie.douban.com/subject/{movie_id}/comments', headers=headers, params=params)
h = response.text
hh = html.fromstring(h)
# xpath expressions for the movie's details
# Movie title
movie_name = hh.xpath('//*[@id="content"]/h1/text()')[0].split(' ')[0]
# Director
movie_derector = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[1]/a/text()')[0]
# Lead actors
movie_actor = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[2]/a/text()')
# Genre
movie_type = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[3]/text()')[1]
# Region
movie_field = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[4]/text()')[1]
# Running time
movie_time = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[5]/text()')[1]
# Release date
movie_date = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[6]/text()')[1]
# Parameter sets to loop over
loop_info = [{'percent_type':'h', 'limit':5},{'percent_type':'m', 'limit':3},{'percent_type':'l', 'limit':2}]

pa_data = {
    # Movie title
    'movie_name':movie_name,
    # Director
    'movie_derector':movie_derector,
    # Lead actors
    'movie_actor':movie_actor,
    # Genre
    'movie_type':movie_type,
    # Region
    'movie_field':movie_field,
    # Running time
    'movie_time':movie_time,
    # Release date
    'movie_date':movie_date,
    # Review entries for the movie
    'coments_all':[{
        # Review text
        'content':'',
        # Review score
        'starScore':0,
        # Number of "useful" votes on the review
        'usefulCnt':0,
    }],
}

# Scrape the review text
for loop in loop_info:
    param = {
        # h, m, or l
        'percent_type':loop['percent_type'],
        'limit': loop['limit'],
        'status':'P',
        'sort':'new_score',
    }
    print(param)
    response = requests.get(f'https://movie.douban.com/subject/{movie_id}/comments', headers=headers, params=param)
    h = response.text
    hh = html.fromstring(h)
    # xpath for the items in the comment section
    comments_body = hh.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div[@class="comment-item "]')
    for comment in comments_body:
        # Score
        starScore = starCnt(str(comment.xpath('./div[2]/h3/span[2]/span[2]/@class')))
        # Number of "useful" votes
        usefulCnt = int(comment.xpath('./div[2]/h3/span[1]/span/text()')[0])
        # The user's review text
        content = comment.xpath('./div[2]/p/span/text()')[0]
        dict_data = {
            # Review text
            'content': content,
            # Review score
            'starScore': starScore,
            # Number of "useful" votes on the review
            'usefulCnt': usefulCnt,
        }
        pa_data['coments_all'].append(dict_data)

    # Create the folder for the JSON dataset
    movie_datajson_dir = './movie_data_json'
    try:
        os.makedirs(movie_datajson_dir)
        print(f'Folder "{movie_datajson_dir}" created')
    except FileExistsError:
        print(f'Folder "{movie_datajson_dir}" already exists')
    except Exception as e:
        print(f'Error while creating the folder: {e}')
    # Path of the JSON file to save
    new_json_name = str(movie_id) + '_' +  loop['percent_type'] + '.json'
    json_file_path = os.path.join(movie_datajson_dir, new_json_name)
    # Use json.dumps to convert the dict into a JSON-formatted string
    json_data = json.dumps(pa_data, indent=2,ensure_ascii=False)
    # Write the JSON string to the file
    with open(json_file_path, 'w', encoding='utf-8') as json_file:
        json_file.write(json_data)
    print(f'Data has been written to {json_file_path}')
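
To check what starCnt does with different inputs, here is a standalone sketch of the same idea. The class strings are examples of the allstarNN pattern that the regular expression expects, and the function name star_score is hypothetical.

```python
import re


def star_score(class_attr):
    """Turn a class string such as 'allstar40 rating' into a 10-point score.

    Returns 0 when no rating class is present.
    """
    match = re.search(r"allstar(\d+) rating", class_attr)
    if not match:
        return 0
    return round(float(match.group(1)) / 5.0, 1)


four_star = star_score("allstar40 rating")
# The scraper passes str() of an xpath result list, which still matches.
five_star = star_score("['allstar50 rating']")
no_rating = star_score("comment-time")
print(four_star, five_star, no_rating)
```

Dividing by 5 maps the class number onto a 10-point scale; dividing by 10 instead would give the number of stars.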