Scraping a Weibo Topic

There's been a fairly hot topic on Weibo lately called "how important are headphones to people today" (#耳机对当代人有多重要#), and I got curious.


But scrolling Weibo eats into study time too much. So what do you do when you want to browse Weibo *and* study?
How about this: write a crawler that prints every comment. That way I can keep coding and glance at the printed output now and then. Perfect.
The data cleaning in this one is genuinely nasty. I'm storing results in MySQL, and the comments are full of special characters. The special characters themselves are manageable with MySQL's utf8mb4 charset, but the emoji really stumped me, so I fell back on the dumbest approach: chained replace() calls to strip the emoji that show up most often.
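Instead of one replace() per emoji, a single regex over the common emoji code-point blocks can strip most of them in one pass. This is a sketch of my own devising, not the code used below; the `EMOJI_PATTERN` name and the exact Unicode ranges are assumptions that cover most but not all emoji.

```python
import re

# Hedged alternative to per-emoji replace() calls: one character class
# spanning the common emoji blocks. Ranges are an approximation.
EMOJI_PATTERN = re.compile(
    '['
    '\U0001F300-\U0001F5FF'  # misc symbols & pictographs (covers the wolf)
    '\U0001F600-\U0001F64F'  # emoticons
    '\U0001F680-\U0001F6FF'  # transport & map symbols
    '\U0001F900-\U0001F9FF'  # supplemental symbols & pictographs
    '\u2600-\u27BF'          # misc symbols & dingbats
    '\u200D\uFE0F\u200B'     # joiner / variation selector / zero-width space
    ']+'
)

def strip_emoji(text):
    """Remove emoji so the row fits even a stricter MySQL charset."""
    return EMOJI_PATTERN.sub('', text)
```

With this in place the three hard-coded replace() calls below could collapse into one `strip_emoji(comment)`.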

Let's get to it.

import time
import re

import MySQLdb
from lxml import etree
from urllib import request
from urllib_day02 import myurllib  # local helper module from an earlier exercise

conn = MySQLdb.connect(
    user='root',
    port=3306,
    password='123456',
    db='spider',
    charset='utf8mb4',
    host='localhost',
)
cursor = conn.cursor()

datas = '''Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: SINAGLOBAL=329643001885.6844.1573002148871; login_sid_t=4396613b7b8d3841a00c4a00f2500236; cross_origin_proto=SSL; _s_tentry=www.baidu.com; Apache=5427773053657.6455.1578554625853; ULV=1578554626858:2:1:1:5427773053657.6455.1578554625853:1573002148894; WBtopGlobal_register_version=307744aa77dd5677; webim_unReadCount=%7B%22time%22%3A1578556209612%2C%22dm_pub_total%22%3A8%2C%22chat_group_client%22%3A0%2C%22allcountNum%22%3A19%2C%22msgbox%22%3A0%7D; UOR=,,login.sina.com.cn; appkey=; WBStorage=42212210b087ca50|undefined; SCF=Aj-iTSvsUNnwWoRS3kObxNvX9WkeGeRSr0D4KFIJls0TaAPw29BxH8f2ApSGwnwHbwIobZR3bXx0tBuK9PrkT7c.; SUB=_2A25zEq9zDeRhGeBP4lMT8i_PyDmIHXVQaYe7rDV8PUJbmtANLWvfkW9NRQq3_0YRQLcrIqGX1CGWYTAYXYJ05EGf; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWxxLrHCKQeaCVeIwANKWnx5JpX5K2hUgL.Foqp1K2Eeo20e0-2dJLoIceLxK-L1-eLBKnLxKqLB-qL12qLxKBLBonL1h5LxKqL1-BLBK2LxKBLB.eLBKBLxK-LB.qL1heLxKML1-2L1hBLxKqL1-zL1K.LxK-L1h-L1h.LxK-LBKBLBKMLxKnLBK2L1KMt; SUHB=0d4OXckQXeM2Bf; SSOLoginState=1578557219; un=13503301458
Host: s.weibo.com
Referer: https://weibo.com/
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-site
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'''
myrequest = myurllib.MyRequest()
headers = myrequest.str_to_dict(datas)
page_index = 1
time_count = 0
with open('weibolog.txt', 'w') as w:
    while True:
        try:
            if page_index == 40:  # last page as of a few days ago; check the topic page for the current count
                break
            url = 'https://s.weibo.com/weibo?q=%23%E8%80%B3%E6%9C%BA%E5%AF%B9%E5%BD%93%E4%BB%A3%E4%BA%BA%E6%9C%89%E5%A4%9A%E9%87%8D%E8%A6%81%23&from=default&page=' + str(page_index)
            req = request.Request(url, headers=headers)
            resp = request.urlopen(req, timeout=3).read()
            html = etree.HTML(resp)
            rows = html.xpath('//div[@class="content"]')
            praise_list = html.xpath('//a[@title="赞"]/em')
            for index, i in enumerate(rows):  # walk every poster's name and comment on this page
                qqname = i.xpath('./div[@class="info"]')[0].xpath('./div')[1].xpath('./a[@class="name"]/text()')[0]
                praise = praise_list[index].xpath('.//text()')
                praise = int(praise[0]) if praise else 0  # empty node means zero likes
                print(praise)
                content = i.xpath('./p')[0].xpath('.//text()')
                if '展开全文' in content:  # "expand full text": the full version sits in the next <p>
                    content = i.xpath('./p')[1].xpath('.//text()')
                comment = ''
                for fragment in content:
                    comment += fragment.replace(' ', '').strip()
                # Strip the boilerplate Weibo appends to post text: video-card
                # links ("...微博视频" / "...秒拍视频"), the "collapse full text"
                # marker, hashtag pairs, and a crude '2..' leftover pattern.
                for pattern in ('L.*微博视频', 'L.*秒拍视频', '收起全文d', '#.*#', '2..'):
                    comment = re.sub(pattern, '', comment)
                # These emoji still slipped through utf8mb4 on my setup,
                # so brute-force replace them:
                comment = comment.replace('🐺', '')
                comment = comment.replace('🎧​', '')
                comment = comment.replace('👌​', '')
                try:
                    sql = 'insert into how_pod_import(qqname, comment, praise) values(%s, %s, %s)'
                    print(qqname, comment, praise)
                    cursor.execute(sql, [qqname, comment, praise])
                    conn.commit()
                except Exception:
                    pass  # skip rows that still contain characters MySQL rejects
                # Throttle: short pause after the first insert, longer one every five
                time_count += 1
                if time_count == 1:
                    time.sleep(1)
                if time_count == 5:
                    time.sleep(3)
                    time_count = 0
            print('next page')
            page_index += 1
        except Exception:
            pass  # network hiccup: the same page is retried on the next pass
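A note on `myurllib`: it's a local module from an earlier exercise (`urllib_day02`), not something you can pip-install. Its `str_to_dict` method evidently turns a header block copied from the browser's developer tools into a dict usable with `urllib.request`. A minimal stand-in, written as a plain function of my own devising, might look like this:

```python
def str_to_dict(headers_str):
    """Parse 'Name: value' header lines (as copied from devtools)
    into a dict for urllib.request.Request(headers=...)."""
    headers = {}
    for line in headers_str.splitlines():
        if ':' not in line:
            continue
        # Split on the first colon only, so URLs in values survive intact.
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers
```

If you don't have the author's module, swapping `myrequest.str_to_dict(datas)` for this function should behave the same way. Also note the Cookie header pasted above is the author's own session and will have expired; copy a fresh one from your browser.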