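Scraping Weibo hot search in real time with Python
This post is a brief record together with the complete code; I am happy to discuss the details.
Step 1: fetch the content with lxml
The approach comes from another post on this site. xpath.py is as follows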
import requests
from lxml import etree
import mysql


def run():
    # URL to scrape
    url = "https://s.weibo.com/top/summary"
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/73.0.3683.103 Safari/537.36'}
    html = etree.HTML(requests.get(url, headers=header).text)
    # Rank numbers of the hot-search entries
    rank = html.xpath('//td[@class="td-01 ranktop"]/text()')
    # Hot-search titles
    affair = html.xpath('//td[@class="td-02"]/a/text()')
    # View counts
    view = html.xpath('//td[@class="td-02"]/span/text()')
    # Category tag cells
    tag = html.xpath('//td[@class="td-03"]')
    # Tag of the pinned entry
    top_tag = tag[0].xpath("string(.)")
    # Title of the pinned entry, taken out of the list
    top = affair[0]
    affair = affair[1:]
    # Print the pinned entry
    # debug output: print('{0:<10}\t{1:<40}\t{2:>20}'.format("top", top, top_tag))
    # Print the remaining entries in a loop
    # debug output: for i in range(0, len(affair)):
    # debug output:     print("{0:<10}\t{1:{3}<30}\t{2:{3}>20}".format(rank[i], affair[i], view[i], chr(12288)))
    # Number of items deleted so far
    del_index = 0
    for a in range(0, len(tag)):
        if tag[a].xpath("string(.)") == "荐":
            # print(a)
            # print("item is an ad")
            # Delete the ad entry.
            # tag has one more element than affair: if tag[3] is an ad,
            # the element to delete is affair[2].
            del rank[a - 1 - del_index]
            del view[a - 1 - del_index]
            del affair[a - 1 - del_index]
            # Every deletion shortens the lists and shifts the later indices
            del_index += 1
    # Convert the list of strings to a list of ints
    view = list(map(int, view))
    # Store the result in the database through mysql.py
    mysql.insert_db(rank, affair, view, tag)


if __name__ == '__main__':
    # Run once
    run()
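Note 1: I first assumed the ad in the Weibo hot-search list always sat in fourth place, but it turned out to move around within the top few positions. Ad entries can be recognized by the 【荐】 tag and removed.
Note 2: view = list(map(int, view)) converts the types before the MySQL insert. It does not seem strictly necessary, but I kept it so that the data types are explicit.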
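The index bookkeeping in the deletion loop can also be avoided by filtering the aligned lists in a single pass. The following is only a sketch, assuming the tag texts have already been extracted as plain strings and that, as in xpath.py, the tag list carries one extra leading entry for the pinned item. The names `drop_ads` and `AD_TAGS` are illustrative and are not part of the original script.

```python
# Tag texts that mark an ad row; kept as a set so it is easy to adjust.
AD_TAGS = {"荐"}


def drop_ads(tags, rank, affair, view, ad_tags=AD_TAGS):
    """Return (rank, affair, view) without the rows whose tag marks an ad.

    tags[0] belongs to the pinned entry, so tags[1:] lines up with the
    other three lists (the same assumption the deletion loop makes).
    """
    kept = [
        (r, a, v)
        for t, r, a, v in zip(tags[1:], rank, affair, view)
        if t.strip() not in ad_tags
    ]
    if not kept:
        return [], [], []
    new_rank, new_affair, new_view = (list(col) for col in zip(*kept))
    return new_rank, new_affair, new_view


# Made-up sample data: the second ranked row is an ad.
sample = drop_ads(
    ["", "热", "荐", "新", ""],
    ["1", "2", "3", "4"],
    ["a", "b", "c", "d"],
    ["100", "90", "80", "70"],
)
# Same int conversion as in xpath.py
sample_view = [int(v) for v in sample[2]]
```

Because nothing is deleted in place, no offset such as del_index is needed, and supporting another ad marker only means passing a different `ad_tags` set.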
Step 2: store the data in MySQL
The DDL is as follows
CREATE TABLE `db_weibo_all` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `note` varchar(45) DEFAULT '',
  `note_id` bigint(18) DEFAULT '0',
  `onboard_time` timestamp NULL DEFAULT NULL,
  `flag` tinyint(1) DEFAULT '0',
  `real_pos` tinyint(4) DEFAULT '1',
  `num` int(11) DEFAULT '1',
  `topic_flag` tinyint(1) DEFAULT '0',
  `word` varchar(45) DEFAULT '',
  `emoji` varchar(8) DEFAULT '',
  `is_fei` tinyint(1) DEFAULT '0',
  `is_new` tinyint(1) DEFAULT '0',
  `is_hot` tinyint(1) DEFAULT '0',
  `is_ad` tinyint(1) DEFAULT '0',
  `content_key` varchar(45) DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `index_note` (`note`)
) ENGINE=InnoDB AUTO_INCREMENT=100 DEFAULT CHARSET=utf8
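Note: I created two tables with the same structure, db_weibo_all and db_weibo_content. One keeps the full history, the other holds only the current real-time (50-1) rows. The primary key auto-increments and there is an index on note.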
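The two-table scheme can be tried out without a MySQL server. The sketch below swaps in the standard-library sqlite3 module with an in-memory database, so it uses `?` placeholders instead of PyMySQL's `%s` and only a reduced set of columns. `make_demo_db` and `store_snapshot` are hypothetical helper names chosen for this illustration.

```python
import sqlite3


def make_demo_db():
    """Create both tables in an in-memory SQLite database (reduced columns)."""
    conn = sqlite3.connect(":memory:")
    for table in ("db_weibo_all", "db_weibo_content"):
        conn.execute(
            f"CREATE TABLE {table} ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "note TEXT, num INTEGER, real_pos INTEGER, onboard_time TEXT)"
        )
    conn.execute("CREATE INDEX index_note ON db_weibo_all(note)")
    return conn


def store_snapshot(conn, rows, now):
    """rows is a list of (note, num, real_pos); returns new history rows added."""
    cur = conn.cursor()
    # Live table: throw away the previous snapshot and write the new one.
    cur.execute("DELETE FROM db_weibo_content")
    cur.executemany(
        "INSERT INTO db_weibo_content(note, num, real_pos, onboard_time) "
        "VALUES (?, ?, ?, ?)",
        [(note, num, pos, now) for note, num, pos in rows],
    )
    # History table: insert a note only the first time it is seen.
    inserted = 0
    for note, num, pos in rows:
        cur.execute("SELECT 1 FROM db_weibo_all WHERE note = ? LIMIT 1", (note,))
        if cur.fetchone() is None:
            cur.execute(
                "INSERT INTO db_weibo_all(note, num, real_pos, onboard_time) "
                "VALUES (?, ?, ?, ?)",
                (note, num, pos, now),
            )
            inserted += 1
    conn.commit()
    return inserted


conn = make_demo_db()
first = store_snapshot(conn, [("a", 10, 1), ("b", 5, 2)], "2021-04-29 10:00:20")
second = store_snapshot(conn, [("b", 7, 1), ("c", 3, 2)], "2021-04-29 10:00:50")
```

After the two calls, the live table holds only the latest two rows, while the history table has accumulated three distinct notes.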
mysql.py is as follows
import datetime
import pymysql


def db_connect():
    try:
        db = pymysql.connect(
            host='127.0.0.1',
            user='root',
            passwd='root',
            db='echartdemoone'
        )
    except Exception as e:
        print("Can't connect to database:", e)
        # Re-raise: returning here would fail because db was never assigned
        raise
    return db


def insert_db(rank, affair, view, tag):
    # Current time, formatted for the timestamp column
    now_time = datetime.datetime.now()
    time1 = datetime.datetime.strftime(now_time, '%Y-%m-%d %H:%M:%S')
    try:
        # SQL that empties the content table
        sql_delete = "DELETE FROM db_weibo_content"
        # SQL that inserts into the all table
        # sql_table_all = "INSERT INTO db_weibo_all SET note=(%s), num=(%s), real_pos=(%s),onboard_time='" + time1 + "'"
        sql_table_all = "INSERT INTO db_weibo_all(note,num,real_pos,onboard_time) VALUES(%s,%s,%s,'" + time1 + "')"
        # SQL that inserts into the content table
        # sql_table_content = "INSERT INTO db_weibo_content SET note=(%s), num=(%s), real_pos=(%s),onboard_time='" + time1 + "'"
        sql_table_content = "INSERT INTO db_weibo_content(note,num,real_pos,onboard_time) VALUES(%s,%s,%s,'" + time1 + "')"
        # SQL that checks whether an entry already exists in the all table
        sql_select = "SELECT * FROM db_weibo_all WHERE note=(%s)"
        db = db_connect()
        cursor = db.cursor()
        # Combine the lists into rows ready for insertion
        values = list(zip(affair, view, rank))
        # debug output: print(values)
        # Run the delete first
        cursor.execute(sql_delete)
        db.commit()
        # With the content table emptied, insert the new rows directly
        cursor.executemany(sql_table_content, values)
        db.commit()
        # Check one by one whether each entry already exists
        for i in range(0, len(affair)):
            cursor.execute(sql_select, (affair[i],))
            data = cursor.fetchall()
            if not data:
                # debug output: print(affair[i])
                print("Item " + str(i) + " not found, inserting it")
                cursor.execute(sql_table_all, values[i])
                db.commit()
            else:
                # debug output: print(data)
                print("Item " + str(i) + " already exists")
        cursor.close()
        db.close()
    except Exception as e:
        print(e)
Revised 2021-04-29: the SQL is now more standard and more efficient
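A possible further step in the same direction is to bind the timestamp as a fourth parameter instead of concatenating it into the SQL text. This is a sketch only, assuming the same column list as in mysql.py; `build_insert` and `INSERT_TEMPLATE` are hypothetical names, and the table name is checked against the two known tables because table names cannot be bound as parameters.

```python
import datetime

INSERT_TEMPLATE = (
    "INSERT INTO {table}(note,num,real_pos,onboard_time) VALUES(%s,%s,%s,%s)"
)
KNOWN_TABLES = ("db_weibo_all", "db_weibo_content")


def build_insert(table, affair, view, rank, now=None):
    """Return (sql, params) where every value, including the time, is bound."""
    if table not in KNOWN_TABLES:
        raise ValueError("unexpected table name: " + table)
    if now is None:
        now = datetime.datetime.now()
    stamp = now.strftime('%Y-%m-%d %H:%M:%S')
    sql = INSERT_TEMPLATE.format(table=table)
    # One tuple per row; the timestamp is simply the fourth bound value.
    params = [(a, v, r, stamp) for a, v, r in zip(affair, view, rank)]
    return sql, params


demo_sql, demo_params = build_insert(
    "db_weibo_content", ["a", "b"], [10, 5], ["1", "2"],
    now=datetime.datetime(2021, 4, 29, 10, 0, 20),
)
```

The pair can then be handed to the cursor as `cursor.executemany(demo_sql, demo_params)`.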
Step 3: write the timer program
timer.py is as follows
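import time
import xpath


def timer():
    while True:
        # Current second of the minute
        time_now = time.strftime("%S", time.localtime())
        # Fire when the second equals one of the values below
        if (time_now == "50") or (time_now == "20"):
            subject = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + " ran the job once"
            print(subject)
            # Run the scraper in xpath.py
            xpath.run()
            # The check is per second, so pause 2 seconds to avoid
            # running more than once within the same second
            time.sleep(2)


if __name__ == '__main__':
    # Start the whole program
    timer()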
Note: to deter malicious crawlers, Weibo shows a CAPTCHA when requests arrive too frequently. Set the timer to a reasonable request rate.
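The loop in timer.py polls the clock continuously between runs. One alternative, sketched here under the assumption that the job should still fire at seconds 20 and 50 of each minute, is to compute the wait until the next mark and sleep for that long. `seconds_until_next_mark` and `run_forever` are illustrative names, not part of the original program.

```python
import time


def seconds_until_next_mark(second_of_minute, marks=(20, 50)):
    """Seconds to wait until the next mark, always strictly in the future."""
    waits = []
    for mark in marks:
        wait = (mark - second_of_minute) % 60
        # A wait of 0 means we are on the mark right now; aim for the next minute.
        waits.append(wait if wait > 0 else 60)
    return min(waits)


def run_forever(job):
    """Sleep until each mark and call job(); never returns."""
    while True:
        time.sleep(seconds_until_next_mark(time.localtime().tm_sec))
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), "ran the job once")
        job()
```

With this shape the 2-second pause is no longer needed, since the next wait is always computed from the current second. Changing `marks` is also the place to lower the request rate.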
[Screenshot: run output]
[Screenshot: run output after the 2021-04-29 revision]
Bug reference
If you run into the parameter bug involving %s and %d in the SQL, the following article may help
https://blog.csdn.net/u011878172/article/details/72599120
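Assuming the problem meant here is the common one, where a %d placeholder for a numeric column raises a TypeError: PyMySQL turns every argument into an SQL literal string before interpolating it into the query, so %s is the placeholder for all types. The sketch below imitates that behaviour with plain string formatting. `to_literal` and `render` are simplified stand-ins written for this illustration, not the driver's real implementation.

```python
def to_literal(value):
    """Very simplified stand-in for the driver's escaping step."""
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    # Numbers are also turned into text before interpolation.
    return str(value)


def render(query, args):
    """Interpolate already-escaped literals, as a DB driver would."""
    return query % tuple(to_literal(a) for a in args)


# %s works for both the string and the integer argument.
ok_query = render(
    "INSERT INTO db_weibo_all(note,num) VALUES(%s,%s)", ("topic", 12345)
)

# %d fails: by the time of interpolation the integer is already a string.
try:
    render("INSERT INTO db_weibo_all(note,num) VALUES(%s,%d)", ("topic", 12345))
    percent_d_failed = False
except TypeError:
    percent_d_failed = True
```

In short, use %s for every bound value and let the driver handle quoting.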
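Revised 2021-07-16
The ad marker in the Weibo hot-search list changed from '荐' to '商'
Adjust the tag comparison in xpath.py accordingly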