啥也不说了,直接贴代码:
#coding=utf-8
import re
import time

import requests
from bs4 import BeautifulSoup
# Pattern matching astral-plane code points (U+10000-U+10FFFF), which is
# where emoji live. Compiled ONCE at import time instead of on every call.
# On old "narrow" Python 2 builds the wide range is invalid, so fall back
# to matching UTF-16 surrogate pairs instead.
try:
    _EMOJI_RE = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    _EMOJI_RE = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def filter_emoji(desstr, restr=''):
    """Strip emoji (astral-plane characters) from a string.

    Args:
        desstr: input string that may contain emoji.
        restr: replacement text for each emoji match (default: remove).

    Returns:
        The filtered string.
    """
    return _EMOJI_RE.sub(restr, desstr)
# Total number of listing pages on the discussion board.
pageTotal = 100
# "Connection: close" avoids stale keep-alive sockets on long crawls.
headers = {"Connection": "close"}

# Reuse one Session for the whole crawl: it pools connections and shares
# headers, and is far less likely to exhaust local sockets (the usual
# source of "Max retries exceeded" errors) than a fresh connection per
# request.
session = requests.Session()
session.headers.update(headers)

niuke_discuss_url = 'https://www.nowcoder.com/discuss'

# Open the output file once for the whole crawl (the `with` block closes
# it automatically; an explicit close() is unnecessary).
with open('niuke.txt', 'a', encoding='utf-8') as file:
    # Request each listing page in turn.
    for page in range(1, pageTotal + 1):
        payload = {"type": 0, "order": 0, "pageSize": 30, "expTag": 0, "page": page}
        response = session.get(niuke_discuss_url, params=payload, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        # Renamed from `list`, which shadowed the builtin.
        posts = soup.find_all('div', class_='discuss-detail')
        # Visit every post found on the current listing page.
        for div in posts:
            # Trim surrounding newlines from the title.
            title = div.div.a.text.strip('\n')
            # Skip pinned ("置顶") posts.
            if '置顶' in title:
                continue
            # Keep only the first line of the title (drops trailing tags
            # such as "烫" that follow a newline).
            index = title.find('\n')
            if index != -1:
                title = title[:index]
            # Remove emoji from the title.
            title = filter_emoji(title)
            url = "https://www.nowcoder.com" + div.div.a['href']
            try:
                res = session.get(url, timeout=30)
                div_soup = BeautifulSoup(res.text, 'html.parser')
                content = div_soup.find_all('div', class_='post-topic-main')[0].div.text
            except (requests.RequestException, IndexError):
                # Best effort: skip posts that fail to download or parse
                # instead of aborting the whole crawl.
                continue
            # Remove emoji, then normalize whitespace in the body.
            content = filter_emoji(content)
            content = content.strip('\n').strip(' ').replace(' ', '').replace('\t', '')
            content = re.sub('\n+', '\n', content)
            # 帖子标题 / 链接 / 内容
            file.write("帖子标题:" + title + '\n')
            file.write("帖子链接:" + url + "\n")
            file.write("帖子内容:" + '\n')
            file.write(content + '\n\n\n')
            # Throttle: hammering the server with back-to-back requests is
            # the usual cause of "Max retries exceeded" / temporary IP bans.
            time.sleep(0.5)

print("爬取" + str(pageTotal) + "页完成!")
我遭遇的问题就是爬了一会就会出现 “Max retries exceeded with url ...” 之类的错误,是访问得太快了吗,还是被封IP了?各位有没有什么好办法,菜鸡求教,哭了。
希望无聊的时候能用上,感谢!