Scraping Jokes from Qiushibaike

qq_43784519

Article link: https://blog.csdn.net/qq_43784519/article/details/107406221

Scraping Jokes from Qiushibaike (Regular Expressions)

import re        # regular expressions (standard library)
import requests  # third-party HTTP client

def judge_sex(sex):
    # Map the CSS class captured from the gender block to a readable label.
    if sex == "womenIcon":
        return 'female'
    else:
        return 'male'

# Request headers: send a browser-like User-Agent with every request.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
def Get_text(url):
    res = requests.get(url, headers=headers, timeout=10)
    info_lists = []
    # re.S lets "." match newlines, so strip the whitespace around each name.
    ids = [name.strip() for name in re.findall(r'<h2>(.*?)</h2>', res.text, re.S)]  # user names
    # Capture the gender class and the age together; \d+ accepts any age.
    sex_ages = re.findall(r'<div class="articleGender (.*?)">(\d+)</div>', res.text)
    contents = re.findall(r'<span>(.*?)</span>', res.text, re.S)  # joke text
    # zip() stops at the shortest list and pairs items purely by position.
    for user_id, content, (sex, age) in zip(ids, contents, sex_ages):
        info = {
            'id': user_id,
            'sex': judge_sex(sex),
            'age': age,
            'content': content
        }
        info_lists.append(info)

    for info in info_lists:
        print(info['id'])
        print(info['sex'])
        print(info['age'])
        # Line breaks inside a joke arrive as <br/> tags; print one line each.
        for line in info['content'].split("<br/>"):
            print(line)

# Listing pages 1 through 24.
urls = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1, 25)]
for url in urls:
    Get_text(url)
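
The script above collects names, gender blocks, and joke text in separate lists and then pairs them by position, so a post that lacks one of the fields can shift every later record. The following is a minimal offline sketch of one way to keep the fields of a post together, not the author's method. The sample markup, the `POST_PATTERN` constant, and the `parse_posts` helper are all hypothetical; the markup is modeled only on the tag and class names used in the regular expressions above, and the real page layout may differ. It also assumes user names contain no nested tags.

```python
import re

# Hypothetical sample markup (an assumption, not copied from the real site).
# The second post has no gender block, like an anonymous author might.
SAMPLE_HTML = """
<h2>
user_a
</h2>
<div class="articleGender womenIcon">21</div>
<span>first line<br/>second line</span>
<h2>
anonymous
</h2>
<span>no gender block here</span>
<h2>
user_b
</h2>
<div class="articleGender manIcon">35</div>
<span>only line</span>
"""

# Matches any run of characters that does not cross into the next <h2>,
# so one match can never borrow fields from the following post.
GAP = r'(?:(?!<h2>).)*?'

# One pattern per post: name, gender class, age, and text are captured together.
POST_PATTERN = re.compile(
    r'<h2>([^<]*)</h2>' + GAP +
    r'<div class="articleGender (\w+)">(\d+)</div>' + GAP +
    r'<span>(.*?)</span>',
    re.S,
)


def parse_posts(html):
    """Return one dict per post that has a name, a gender block, and text."""
    posts = []
    for name, gender_class, age, content in POST_PATTERN.findall(html):
        posts.append({
            'id': name.strip(),
            'sex': 'female' if gender_class == 'womenIcon' else 'male',
            'age': int(age),
            'content': [line.strip() for line in content.split('<br/>')],
        })
    return posts


if __name__ == '__main__':
    for post in parse_posts(SAMPLE_HTML):
        print(post)
```

In this sketch a post without a gender block is skipped rather than paired with a neighbor's data. To try it on a live page, `res.text` from the request above could be passed to `parse_posts` in place of `SAMPLE_HTML`, provided the page still uses this markup.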
