Buddy still single? Let Python pull him out of his misery - CSDN Blog

Article link: https://blog.csdn.net/m0_72282564/article/details/132006477

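Preface

Hi everyone~ This is Qianqian, who loves looking at pretty girls

Give someone a rose and the fragrance lingers on your hand

If you're going to help, help all the way: let's give a friend a taste of what dating feels like~

Today we'll scrape profile data from a matchmaking site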

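Development environment

Python 3.8 interpreter
PyCharm editor

Modules used

requests: third-party HTTP request module, install with pip install requests
parsel: third-party parsing module, install with pip install parsel
csv: built-in module, no installation needed

How to install third-party modules:

Press Win + R, type cmd, then run pip install <module name>

(If installation is slow, you can switch to a mirror in mainland China.)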


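Code walkthrough

Import modules

# HTTP request module ---> third-party, run pip install requests in cmd first
import requests
# HTML parsing module ---> third-party, run pip install parsel in cmd first
import parsel
# csv module ---> built-in, no installation needed
import csv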

Create the file

f = open('对象_1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '标题',
    '幸运号',
    '性别',
    '年龄',
    '星座',
    '年薪',
    '学历',
    '身高',
    '爱情宣言',
    '照片',
    '详情页',
])

Write the header row

csv_writer.writeheader()
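
Because the file is opened with mode='a', calling writeheader() on every run appends a fresh header row each time the script is executed. A minimal sketch of one way to avoid that, assuming a hypothetical helper named open_writer and a shortened field list used only for illustration:

```python
import csv
import os

# Shortened field list, for illustration only
FIELDNAMES = ['标题', '幸运号']


def open_writer(path, fieldnames):
    """Open path in append mode; write the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    f = open(path, mode='a', encoding='utf-8', newline='')
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if need_header:
        writer.writeheader()
    return f, writer
```

The caller keeps the returned file object so it can close it once all rows are written.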

URL of the list page

link = 'https://www.**'  # the real URL was redacted in the original post

Headers that imitate a browser

headers = {
    'Cookie': '_',
    'Host': '****',  # same redacted host as in link above
    'Referer': '*****/r/1/19lnsxq-4.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
}

Send the request

response_1 = requests.get(url=link, headers=headers)
# To inspect the returned data: print(response_1.text)

Parse the data

selector_1 = parsel.Selector(response_1.text)

Extract content with CSS selectors

title_list = selector_1.css('.item-hd h3::text').getall()  # get the titles

Get the links

href = selector_1.css('.item-bd .cont a::attr(href)').getall()

for loop

for title, index in zip(title_list, href):
    # Replace http with https
    url = index.replace('http:', 'https:')
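
A plain str.replace('http:', 'https:') also rewrites any 'http:' that happens to appear later in the URL, for example inside a query string. As a hedged alternative, the sketch below changes only the scheme using the standard library; force_https is a hypothetical helper name, not part of the original script.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return url with its scheme switched from http to https; other URLs are unchanged."""
    parts = urlsplit(url)
    if parts.scheme == 'http':
        parts = parts._replace(scheme='https')
    return urlunsplit(parts)
```

Inside the loop this would be used as url = force_https(index).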

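"""
1. Send a request: have Python imitate a browser requesting the url
    - How Python code imitates a browser request
        The request headers are a dict, so build them as complete key-value pairs
    - How to bulk-convert copied headers
        Ctrl + R opens PyCharm's replace box; enable regex and enter
        (.*?): (.*)
        '$1': '$2',
    - <Response [200]> means the request succeeded
        but it does not guarantee that you got the data you wanted...
    - response = requests.get(url=url, headers=headers)
        response is a variable name of our own choosing
        requests.get() calls the get function of the requests module
        url=url: the left url is the parameter name of get(), the right url is the value we pass in

"""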
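
The same find-and-replace idea can be done in Python instead of the editor. This is only a sketch: RAW_HEADERS holds made-up example values, and headers_from_text is a hypothetical helper that applies a pattern similar to (.*?): (.*) to each line copied from the developer tools.

```python
import re

# Made-up header text, standing in for lines copied from the browser's developer tools
RAW_HEADERS = """\
Host: www.example.com
Referer: https://www.example.com/list.html
User-Agent: Mozilla/5.0
"""


def headers_from_text(raw):
    """Turn 'Key: value' lines into a dict suitable for requests' headers argument."""
    # [ \t]* (not \s*) so an empty value cannot swallow the next line
    pattern = re.compile(r'^(.*?):[ \t]*(.*)$', re.MULTILINE)
    return {key.strip(): value.strip() for key, value in pattern.findall(raw)}
```

The lazy (.*?) stops at the first colon, so values that contain colons, such as URLs, stay intact.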

Determine the request URL

    # url = 'https://www.**'  (the real URL was redacted in the original post)

Imitate a browser when sending the request: the headers dict

    headers = {
        'Cookie': '',
        'Host': 'www.19lou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    }

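Send the request --> <Response [200]> means the request succeeded

The get function of the requests module sends a request to the url, carrying the headers as a browser disguise, and the returned data is stored in our own variable response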

    response = requests.get(url=url, headers=headers)

Get the data, that is, the response data returned by the server

----> this corresponds to the Response tab in the developer tools: print(response.text)

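   """
   3. Parse the data and extract the basic information we want
   bs4, lxml, parsel, ... are parsing modules
   - Parsing methods: worth learning all of them; none is best, only the best fit
       re: extracts directly from string data

       css: extracts data by tag attributes
       xpath: extracts data by tag nodes
   Today we use CSS selectors:
       extract data by tag attributes

   css and xpath both need a type conversion into a parseable object first,
       because response.text is a plain string
   """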
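
To make the contrast concrete, here is a small sketch of the re approach, which works on the raw string with no conversion step. SAMPLE_HTML is invented markup that only mimics the .item-hd h3 structure used earlier, and titles_with_re is a hypothetical helper; the tutorial itself uses CSS selectors.

```python
import re

# Invented markup that mimics the '.item-hd h3' structure of the list page
SAMPLE_HTML = (
    '<div class="item-hd"><h3>First post</h3></div>'
    '<div class="item-hd"><h3>Second post</h3></div>'
)


def titles_with_re(html):
    """Pull the text of every <h3> element straight out of the HTML string."""
    return re.findall(r'<h3>(.*?)</h3>', html)
```

Regular expressions like this break easily when tags gain attributes or nest, which is one reason a selector-based parser is the more comfortable fit here.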

Convert the data type: <Selector xpath=None data='<html>\n<head>\n <meta charset="gb23...'>

    selector = parsel.Selector(response.text)

Extract the data with CSS selectors

replace() performs string replacement

    love_num = selector.css('.love-blind-female .love-blind-info p::text').get()
    if love_num:
        # Strip the '爱情幸运号：' ("love lucky number:") label, keeping only the number
        love_num = love_num.replace('爱情幸运号：', '')
        # split() breaks the string on the full-width comma
        info_list = selector.css('.love-blind-female .love-blind-info .mt10::text').get().split('，')
        # Pick values by list index
        sex = info_list[0]  # gender
        age = info_list[1]  # age
        constellation = info_list[2]  # zodiac sign
        money = info_list[3]  # annual salary
        edu = info_list[4]  # education
        height = info_list[5]  # height
        love_txt = selector.css('.love-blind-female .love-blind-info .love-blind-txt::text').get()
        img_url = selector.css('.view-cont .thread-cont img::attr(src)').get().replace('http:', 'https:')
        # Tip: Ctrl + D duplicates the current line in PyCharm
        # Keys must match the fieldnames passed to csv.DictWriter
        dit = {
            '标题': title,
            '幸运号': love_num,
            '性别': sex,
            '年龄': age,
            '星座': constellation,
            '年薪': money,
            '学历': edu,
            '身高': height,
            '爱情宣言': love_txt,
            '照片': img_url,
            '详情页': url,
        }
        csv_writer.writerow(dit)
        print(img_url)
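
Indexing info_list[0] through info_list[5] raises IndexError whenever a profile lists fewer than six items, and .get() returns None when the element is missing. A minimal sketch of a more forgiving split, assuming a hypothetical helper parse_profile and invented sample values in the usage note:

```python
# Column names reused from the csv fieldnames, in the order the page lists them
PROFILE_KEYS = ['性别', '年龄', '星座', '年薪', '学历', '身高']


def parse_profile(text):
    """Split a '，'-separated profile line into a dict, padding missing fields with ''."""
    parts = [p.strip() for p in (text or '').split('，')]
    # Pad short lists so every key gets a value; extra items are dropped by zip
    parts += [''] * (len(PROFILE_KEYS) - len(parts))
    return dict(zip(PROFILE_KEYS, parts))
```

For example, parse_profile('女，25岁，天秤座') fills the first three keys and leaves the salary, education, and height fields as empty strings.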

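Download the image

        # The img folder must already exist next to the script
        img_content = requests.get(url=img_url).content
        # Use a separate name so the csv file handle f is not overwritten
        with open('img/' + title + '.jpg', mode='wb') as img_file:
            img_file.write(img_content)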
        print(dit)
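
Post titles are used directly as file names, so a title containing characters such as / or ? would make open() fail on Windows. The sketch below is one possible way to clean the name first; safe_filename is a hypothetical helper and the 'untitled' fallback is an assumption, not something the original script does.

```python
import re


def safe_filename(title, ext='.jpg'):
    """Replace characters that Windows forbids in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', '_', title).strip()
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or 'untitled') + ext
```

The image path would then be built as 'img/' + safe_filename(title).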