Learning Web Scraping

Latest recommended article published on 2024-05-02 14:08:21

穹镜

Article link: https://blog.csdn.net/weixin_42890793/article/details/91355490

5.17

Learning regular expressions

The main code for scraping the novel 斗破苍穹 (Doupo Cangqiong) is as follows:

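import re
import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
# Append mode with an explicit encoding so the Chinese text is written correctly
f = open('C:\\Users\\456\\Desktop\\doupo.txt', 'a+', encoding='utf-8')

def get_info(url):
    res = requests.get(url, headers=headers)
    # Only parse pages that were fetched successfully
    if res.status_code == 200:
        # Non-greedy match of every <p>...</p>; re.S lets '.' span newlines
        contents = re.findall('<p>(.*?)</p>', res.content.decode('utf-8'), re.S)
        for content in contents:
            f.write(content + '\n')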
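
To see what the pattern in get_info is doing, here is a minimal sketch run against a made-up HTML string (the sample is an assumption for illustration, not taken from the novel site). It compares the non-greedy pattern with a greedy one, and shows what happens when re.S is left out.

```python
import re

# Hypothetical sample HTML, for illustration only
sample_html = "<p>first\nparagraph</p><div>skip</div><p>second</p>"

# Non-greedy with re.S: each <p>...</p> pair is captured separately
non_greedy = re.findall('<p>(.*?)</p>', sample_html, re.S)

# Greedy with re.S: a single capture from the first <p> to the last </p>
greedy = re.findall('<p>(.*)</p>', sample_html, re.S)

# Without re.S, '.' does not match '\n', so the first paragraph is missed
no_dotall = re.findall('<p>(.*?)</p>', sample_html)

print(non_greedy)
print(greedy)
print(no_dotall)
```

The non-greedy form with re.S is the one that returns one entry per paragraph, which is why the scraper uses it.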

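Problem encountered

headers = {'User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
Here the key and the value sit inside a single pair of quotes, so Python builds a set containing one string instead of a dict. It should be written as
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}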
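
A quick way to check which of the two forms you have is to look at its type. This is a small sketch; the User-Agent value is shortened here purely for readability.

```python
# Mistaken form: one quoted string inside braces builds a set, not a dict
wrong_headers = {'User-Agent:Mozilla/5.0'}

# Corrected form: key and value are quoted separately
right_headers = {'User-Agent': 'Mozilla/5.0'}

print(type(wrong_headers).__name__)
print(type(right_headers).__name__)
print(right_headers['User-Agent'])
```

Only the dict form maps a header name to a value, which is what the headers argument of requests.get expects.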

The main code for scraping jokes from 糗事百科 (Qiushibaike) is as follows:

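def judgment_sex(class_name):
    # The CSS class encodes the author's gender: 'womenIcon' or 'manIcon'
    if class_name == 'womenIcon':
        return '女'  # female
    else:
        return '男'  # male

# Records collected across all pages; get_info appends to this list
info_lists = []

def get_info(url):
    res = requests.get(url, headers=headers)
    ids = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    print(ids)
    # Match both manIcon and womenIcon so that levels stays aligned with sexs
    levels = re.findall(r'<div class="articleGender \w+">(.*?)</div>', res.text, re.S)
    print(levels)
    sexs = re.findall('<div class="articleGender (.*?)">', res.text, re.S)
    print(sexs)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    print(contents)
    # zip stops at the shortest list, so one empty list yields no records
    for user_id, level, sex, content in zip(ids, levels, sexs, contents):
        info = {
            'id': user_id.strip(),
            'level': level.strip(),
            'sex': judgment_sex(sex),
            'content': content.strip()
        }
        info_lists.append(info)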



if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(num) for num in range(1, 36)]
    for single_url in urls:
        get_info(single_url)
    # Write once after all pages are scraped; writing inside the loop above
    # would append the accumulated records again for every page
    with open('C:\\Users\\456\\Desktop\\duanzi.txt', 'a+', encoding='utf-8') as f:
        for info in info_lists:
            f.write(info['id'] + '\n')
            f.write(info['level'] + '\n')
            f.write(info['sex'] + '\n')
            f.write(info['content'] + '\n')

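Errors encountered and how they were fixed

1. When using zip, the info built in the for loop was always empty. I was not very familiar with zip, so I printed ids, levels, sexs and contents one after another and found that contents was empty. Then it clicked: the contents = re.findall() statement was written incorrectly, so it matched nothing, and because zip stops at its shortest input, the loop body never ran. After fixing the pattern, the results were written to the txt file on the desktop.

2. f = open('C:\\Users\\456\\Desktop\\duanzi.txt', 'a+', encoding='utf-8'): I had pasted the path directly when writing this line. It needs double backslashes \\ rather than single \. I had also left out utf-8, which produced garbled text.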
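
Both fixes above can be reproduced in a few lines. The sketch below uses made-up list values (assumptions for illustration) to show how a single empty list silences zip, adds a small length check that would have exposed the problem sooner, and compares two ways of writing the Windows path.

```python
ids = ['user_a', 'user_b']
levels = ['12', '30']
contents = []  # simulates a pattern that matched nothing

# zip stops at the shortest input, so one empty list means zero iterations
rows = list(zip(ids, levels, contents))
print(rows)

# A guard that makes the mismatch visible instead of silent
lengths = {'ids': len(ids), 'levels': len(levels), 'contents': len(contents)}
if len(set(lengths.values())) > 1:
    print('length mismatch:', lengths)

# Two equivalent ways to write a Windows path without stray escapes
escaped = 'C:\\Users\\456\\Desktop\\duanzi.txt'
raw = r'C:\Users\456\Desktop\duanzi.txt'
print(escaped == raw)
```

The raw-string form lets a copied path be pasted unchanged, as long as it does not end in a backslash.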

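Remaining problems

Cannot tell whether the content of a <p> tag is the novel's text, an advertisement, or some other link.
The <br> tags in the jokes file cannot be removed yet.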
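
One possible direction for both remaining problems, offered as a rough sketch rather than a tested fix for these sites: strip <br> variants with re.sub, and only collect <p> tags that sit inside the element holding the chapter text. The helper names strip_br and paragraphs_in_container, the container id "content", and the sample HTML are all assumptions made for illustration; the real page structure would need to be checked in the browser first.

```python
import re

def strip_br(text):
    """Replace <br>, <br/> and <br /> with newlines."""
    return re.sub(r'<br\s*/?>', '\n', text, flags=re.I)

def paragraphs_in_container(html, container_id):
    """Return the <p> contents found inside <div id="container_id">...</div>.

    Assumes the container holds no nested <div>; otherwise an HTML parser
    would be a safer choice than a regular expression.
    """
    pattern = r'<div id="{}">(.*?)</div>'.format(re.escape(container_id))
    block = re.search(pattern, html, re.S)
    if block is None:
        return []
    return re.findall(r'<p>(.*?)</p>', block.group(1), re.S)

# Hypothetical page: the id "content" is an assumption for illustration
sample = ('<p>ad text</p>'
          '<div id="content"><p>chapter line 1</p><p>chapter line 2</p></div>'
          '<p><a href="#">next</a></p>')

chapter = paragraphs_in_container(sample, 'content')
cleaned = strip_br('line one<br/>line two<br>line three')
print(chapter)
print(cleaned)
```

In the jokes scraper, strip_br could be applied to content before it is stored, for example 'content': strip_br(content).strip().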
