Day 1: Learning requests

Rorschach379

Category: python. Tags: python, regular expressions

Article link: https://blog.csdn.net/weixin_63123211/article/details/122415626

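Web crawler: a program that fetches data from web pages.

Crawling process: 1) fetch the page data (requests, selenium); 2) parse the data (regular expressions with re, CSS selectors with bs4, XPath with lxml);
3) save the data (database, CSV file, Excel file).
Countering anti-crawling measures: user-agent (identity disguise, making the crawler look like a browser), login-based anti-crawling (set a cookie), font-based anti-crawling.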
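The three steps above can be sketched end to end without touching the network. This is only a minimal sketch: the sample HTML, the `parse_titles` and `to_csv` helper names, and the in-memory CSV target are assumptions made for illustration. In a real crawler the HTML would come from `requests.get(...).text`.

```python
import csv
import io
import re

# Step 1 (fetch) is simulated: in a real crawler this string would come
# from requests.get(url).text. The markup below is made up for the example.
SAMPLE_HTML = """
<ul>
  <li><span class="title">Movie A</span><span class="score">9.7</span></li>
  <li><span class="title">Movie B</span><span class="score">9.6</span></li>
</ul>
"""


def parse_titles(html):
    """Step 2: pull (title, score) pairs out of the page with a regex."""
    pattern = r'<span class="title">(.+?)</span><span class="score">(.+?)</span>'
    return re.findall(pattern, html)


def to_csv(rows):
    """Step 3: serialise the rows as CSV text (a file works the same way)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "score"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_titles(SAMPLE_HTML)
print(to_csv(rows))
```

Keeping the three steps in separate functions makes it easy to swap the parser (regex, bs4, XPath) or the storage target later.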

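1. Fetching page data

requests.get(page URL) - fetches the page data and returns a response object
Parameter headers: the request headers, given as a dictionary. Two keys are commonly set:
1) user-agent: client information. Setting it to a browser's value makes the crawler look like a browser.
2) cookie: account login information. Setting it to the cookie value obtained after a successful login skips the login page and returns the data shown to logged-in users.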
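As a rough illustration of the two header keys, the sketch below builds the dictionary and inspects the request locally without sending it. The `build_headers` helper, the browser string, the cookie value, and the example.com URL are all placeholders, not values from a real site.

```python
import requests


def build_headers(user_agent, cookie=None):
    """Return a headers dict; the cookie entry is only added when given."""
    headers = {"user-agent": user_agent}
    if cookie:
        headers["cookie"] = cookie
    return headers


# Placeholder values: copy the real ones from the browser's developer tools.
headers = build_headers(
    user_agent="Mozilla/5.0 (example browser string)",
    cookie="sessionid=PLACEHOLDER",
)

# Preparing a request builds it locally without sending anything.
prepared = requests.Request("GET", "https://example.com/", headers=headers).prepare()
print(prepared.headers["user-agent"])
print(prepared.headers["cookie"])
```

Header names are case-insensitive, so `'user-agent'` and `'User-Agent'` refer to the same header.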

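import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}
response = requests.get('https://movie.douban.com/top250', headers=headers)

# 2. Get the request result from the response object
# <Response [418]> means the status code is 418; the request succeeded only if the status code is 200
print(response)

# 1) Get the response headers
print(response.headers)

# a. Set the page encoding (needed when the target is a web page and its text prints as garbled characters)
response.encoding = 'utf-8'


# 2) Get the response body (the request result)
# a. response.text  -  the result as a string (the page source when the target is a web page URL)
print(response.text)

# b. response.content  -  the result as bytes (for images, video, audio and similar targets)
# print(response.content)

# c. response.json()  -  the result after JSON conversion (for targets that are JSON APIs)
# print(response.json())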
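The relationship between `response.content`, `response.encoding`, `response.text` and `response.json()` can be imitated with plain bytes and strings. The sketch below uses a made-up payload and only the standard library. It is an approximation of what requests does, not its actual implementation.

```python
import json

# Pretend these bytes are response.content from a JSON endpoint (made-up payload).
raw = '{"title": "你好"}'.encode("utf-8")

# Decoding with the right encoding gives readable text ...
text_ok = raw.decode("utf-8")
# ... while the wrong encoding gives garbled characters instead of an error.
text_bad = raw.decode("latin-1")

# json.loads on the decoded text is roughly what response.json() returns.
data = json.loads(text_ok)

print(text_ok)
print(text_bad)
print(data["title"])
```

This is why setting `response.encoding = 'utf-8'` before reading `response.text` fixes garbled output: the bytes stay the same, only the decoding changes.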

import requests

# 1. Web page
# https://cd.zu.ke.com/zufang   -  the target is a web page URL
# headers = {
#     'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
# }
# response = requests.get('https://cd.zu.ke.com/zufang', headers=headers)
#
# print(response.text)

# 2. Image download
# https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png  - the target is an image URL
# headers = {
#     'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
# }
# url = 'https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png'
# response = requests.get(url, headers=headers)
# with open('files/a.png', 'wb') as f:
#     f.write(response.content)

# 3. Parsing JSON data
# http://api.tianapi.com/auto/index?key=c9d408fefd8ed4081a9079d0d6165d43&num=10  - a JSON data API
# response = requests.get('http://api.tianapi.com/auto/index?key=c9d408fefd8ed4081a9079d0d6165d43&num=10')
# result = response.json()
# for x in result['newslist']:
#     print(x['title'])
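The image-download step above writes into a `files/` folder, which fails if the folder does not exist yet. The sketch below is one possible way to handle that. The `save_bytes` helper is hypothetical, and the bytes stand in for `response.content`.

```python
from pathlib import Path
import tempfile


def save_bytes(data, path):
    """Write binary data to path, creating missing parent folders first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The with-block closes the file even if the write fails.
    with open(target, "wb") as f:
        f.write(data)
    return target


# Stand-in for response.content; a real image would be much larger.
fake_image = b"\x89PNG\r\n\x1a\n"
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(fake_image, Path(tmp) / "files" / "a.png")
    print(saved.name, saved.stat().st_size)
```

With a real response the call would look like `save_bytes(response.content, 'files/a.png')`.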

Homework: parsing data with regular expressions
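Before the homework code, it may help to see how `findall` behaves with capture groups and with the `(?s)` flag on a small input. The markup below is made up and only loosely resembles a movie list item.

```python
from re import findall

# Made-up markup in the same shape as a list item that spans several lines.
html = """
<li>
  <img width="100" alt="Film One" src="https://example.com/1.jpg">
  <span class="rating_num">9.1</span>
</li>
<li>
  <img width="100" alt="Film Two" src="https://example.com/2.jpg">
  <span class="rating_num">8.8</span>
</li>
"""

# One group: findall returns a list of strings.
names = findall(r'alt="(.+?)"', html)

# Several groups: findall returns a list of tuples, one tuple per match.
# (?s) lets "." cross line breaks; ".+?" stops at the first possible match.
pairs = findall(r'(?s)alt="(.+?)".+?rating_num">(.+?)</span>', html)

# Without (?s) the same pattern finds nothing, because "." stops at newlines.
no_flag = findall(r'alt="(.+?)".+?rating_num">(.+?)</span>', html)

print(names)
print(pairs)
print(no_flag)
```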

import requests
from re import findall
import csv
import json


# Task 1: get the names of all movies on the first page of the top250 list (print the movie names)
def get_film_name():
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    response = requests.get('https://movie.douban.com/top250', headers=headers)
    result = findall(r'(?s)<img width="100" alt="(.+?)"', response.text)
    print(result)


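# Task 2: get all data for every movie on the first page of the top250 list (name, director, lead actors, rating, number of ratings, description, image URL) and save it to a CSV file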
def get_one_page(start=0):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    response = requests.get(f'https://movie.douban.com/top250?start={start}&filter=', headers=headers)

    # Parse the data
    result = findall(r'(?s)<li>.+?<img width="100" alt="(.+?)"\s*src="(.+?)".+?<p class="">(.+?)</p>.+?<span class="rating_num" property="v:average">(.+?)</span>.+?<span>(.+?)人评价</span>.+?<span class="inq">(.+?)</span>.+?</li>', response.text)

    data = []
    for x in result:
        line = list(x)
        message = line.pop(2).strip()
        director = findall(r'导演: (.+?)&', message)
        director = director[0] if director else ''

        leading_role = findall(r'主演: (.+?)<br>', message)
        leading_role = leading_role[0] if leading_role else ''

        time = findall(r'\s+(\d{4})&', message)
        time = time[0] if time else ''

        line.extend([director, leading_role, time])
        data.append(line)

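    # Write to CSV; the with-block closes the file, and utf-8 keeps the Chinese text readable.
    # Header columns: name, cover, rating, number of ratings, description, director, lead actors, release year
    with open('files/电影.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if start == 0:
            writer.writerow(['电影名', '封面', '评分', '评论人数', '描述', '导演', '主演', '上映时间'])
        writer.writerows(data)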

def get_51job():
    response = requests.get('https://search.51job.com/list/000000,000000,0000,00,9,99,数据分析,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
    # print(response.text)
    json_str = findall(r'window.__SEARCH_RESULT__ = (\{.+?\})</script>', response.text)[0]
    data = json.loads(json_str)
    for job in data['engine_jds']:
        print(job['job_name'])


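# Task 3: get all data for every movie in the top250 list (name, director, lead actors, rating, number of ratings, description, image URL) and save it to a CSV file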
if __name__ == '__main__':
    # for x in range(0, 250, 25):
    #     get_one_page(x)
    get_51job()
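The `get_51job` function relies on a common trick: some pages embed their data as a JSON object assigned to a JavaScript variable inside a `<script>` tag, so a regex can cut the JSON out and `json.loads` can parse it. The sketch below shows the idea on a made-up page. The `extract_embedded_json` helper and the sample data are assumptions, and the real site's markup may have changed.

```python
import json
import re

# Made-up page: the variable name matches the one used above, but the data
# does not come from any real site.
html = (
    '<html><body><script>'
    'window.__SEARCH_RESULT__ = {"engine_jds": [{"job_name": "Data Analyst"}, '
    '{"job_name": "BI Engineer"}]}'
    '</script></body></html>'
)


def extract_embedded_json(page, variable):
    """Return the object assigned to `variable` in a <script>, or None."""
    # re.escape makes the dots in the variable name match literally.
    pattern = r'(?s)' + re.escape(variable) + r'\s*=\s*(\{.+?\})\s*</script>'
    found = re.findall(pattern, page)
    return json.loads(found[0]) if found else None


data = extract_embedded_json(html, 'window.__SEARCH_RESULT__')
for job in data['engine_jds']:
    print(job['job_name'])
```

Returning `None` when nothing matches avoids the `IndexError` that indexing an empty `findall` result with `[0]` would raise.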

Parsing web page data

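1. beautifulsoup4

beautifulsoup4 is a third-party library that parses web page data using CSS selectors.
How it works: first select tags with a CSS selector, then read the tag's content or attributes.

2. Usage

Prepare the page data (get the page source)
data = open('data.html', encoding='utf-8').read()
Parse the page with bs4
Import the third-party library (note: the package to install is beautifulsoup4, but the module to import is bs4)
from bs4 import BeautifulSoup

a. Create a BeautifulSoup object

BeautifulSoup(page content, 'lxml') - returns a BeautifulSoup object that represents the whole page
soup = BeautifulSoup(data, 'lxml')

b. Get tags

BeautifulSoup object.select(CSS selector) - returns a list of all tags matched by the CSS selector (searching the whole page)
BeautifulSoup object.select_one(CSS selector) - returns the first tag matched by the CSS selector (searching the whole page)
tag object.select(CSS selector) - selects tags by CSS selector inside the given tag, returning all matches
tag object.select_one(CSS selector) - selects tags by CSS selector inside the given tag, returning the first match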

result = soup.select('#all-goods .title')
print(result)

result = soup.select_one('#all-goods .title')
print(result)

result = soup.select('p')       # select every p tag in the whole page
print(result)

div = soup.select_one('#all-staff')
result = div.select('p')        # select every p tag inside the tag whose id is all-staff
print(result)

# c. Get tag content and tag attributes
# Tag content: tag object.text
all_goods = soup.select('.goods>.title')
for x in all_goods:
    print(x.text)

goods = soup.select_one('.goods>.title')
print(goods.text)

# Tag attribute: tag object.attrs[attribute name]
all_img = soup.select('.goods>img')
for x in all_img:
    print(x.attrs['src'])
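The snippets above depend on a `data.html` file that is not shown. The following self-contained sketch uses a made-up page with the same ids and classes so the selectors can be tried directly. It uses the built-in `'html.parser'` instead of `'lxml'` so that no extra parser needs to be installed; that swap is a choice made for this example only.

```python
from bs4 import BeautifulSoup

# Made-up page using the same ids and classes as the selectors above.
html = """
<div id="all-goods">
  <div class="goods"><p class="title">Apple</p><img src="a.png"></div>
  <div class="goods"><p class="title">Pear</p><img src="b.png"></div>
</div>
<div id="all-staff"><p>Alice</p><p>Bob</p></div>
"""

# 'html.parser' ships with Python; 'lxml' can be used instead if installed.
soup = BeautifulSoup(html, "html.parser")

titles = [tag.text for tag in soup.select("#all-goods .title")]
first_title = soup.select_one("#all-goods .title").text
images = [tag.attrs["src"] for tag in soup.select(".goods>img")]
staff = [p.text for p in soup.select_one("#all-staff").select("p")]
missing = soup.select_one(".no-such-class")  # None when nothing matches

print(titles, first_title, images, staff, missing)
```

Because `select_one` returns `None` when nothing matches, calling `.text` on its result raises `AttributeError` for a missing tag. Checking for `None` first is safer on real pages.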

Practice exercise

import requests
from bs4 import BeautifulSoup
from re import sub

# 1. Fetch the page data
response = requests.get('https://cd.zu.ke.com/zufang')
# print(response.text)

# 2. Parse the page data
soup = BeautifulSoup(response.text, 'lxml')
all_house = soup.select('.content__list>.content__list--item')
data = []
for house in all_house:
    # Name
    name = house.select_one('.twoline').text.strip()
    # Price
    price = house.select_one('.content__list--item-price').text

    # Address, area, layout
    p = house.select_one('.content__list--item--des')
    message = sub(r'\s+', '', p.text)
    data.append([name, price, message])

print(data)
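The `sub(r'\s+', '', ...)` call above deletes every run of whitespace, which suits Chinese text but glues words together in space-separated text. The sketch below compares it with collapsing each run to a single space. The sample string is made up.

```python
from re import sub

# Made-up text shaped like a listing description with line breaks and padding.
raw = "\n   Gaoxin - Tianfu Street\n   /\n   45.00㎡\n   /\n   2 rooms  \n"

# Remove every whitespace run entirely.
compact = sub(r'\s+', '', raw)
# Collapse each whitespace run to one space, then trim both ends.
single_spaced = sub(r'\s+', ' ', raw).strip()

print(compact)
print(single_spaced)
```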

import time

time.sleep(1)

Homework

import requests
from bs4 import BeautifulSoup
from re import sub
import csv
import time
import random

data = []
for x in range(1, 101):
    # Random pause of up to 2 seconds between requests
    a = 2 * random.random()
    # Page x of the listing (assumes pages are addressed as /zufang/pg{x}/)
    response = requests.get(f'https://cd.zu.ke.com/zufang/pg{x}/#contentList')
    time.sleep(a)
    soup = BeautifulSoup(response.text, 'lxml')
    all_house = soup.select('.content__list>.content__list--item')
    for house in all_house:
        name = house.select_one('.twoline').text.strip()
        price = house.select_one('.content__list--item-price').text
        location_message = house.select_one('.content__list--item--des').text
        locations = sub(r'\s+', '', location_message)
        data.append([name, price, locations])
# newline='' avoids blank rows on Windows; header columns: listing name, rent, location
with open('files/贝壳租房数据.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['房名', '租金', '位置'])
    writer.writerows(data)
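The loop above mixes paging, waiting, and fetching. One way to separate them is sketched below. The `page_urls` and `crawl` helpers, the `/pg{n}/` URL pattern, and the example.com address are assumptions for illustration. The dry run passes in a fake fetch function and a fake sleep so nothing is downloaded and nothing actually waits.

```python
import random
import time


def page_urls(base, first, last):
    """Yield one URL per page; the /pg{n}/ pattern is an assumption."""
    for n in range(first, last + 1):
        yield f"{base}/pg{n}/"


def crawl(urls, fetch, min_delay=0.5, max_delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the very first request
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Dry run with a fake fetch function and no real sleeping.
delays = []
pages = crawl(
    page_urls("https://example.com/zufang", 1, 3),
    fetch=lambda url: url.upper(),
    sleep=delays.append,
)
print(pages)
print(len(delays))
```

In real use, `fetch` could be something like `lambda url: requests.get(url, headers=headers).text`, with the default `time.sleep` left in place.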
