In the blink of an eye we've reached Lesson 6 of our web-scraping basics course. Today we'll fetch Weibo posts to study!
PS: the earlier lessons in this series are in the column here; feel free to dig through the archives: click here
Step one: log in to Weibo: click here
Click the search box in the top-left corner and find the user you want to scrape:
You can see there are two ways to search here:
1. Search by keyword
2. Search by time
We'll cover the code for both today!!
First, let's search by [time]: pick the time range, press [F12] (or right-click and choose Inspect), then click search.
We can see this is a [GET request], so the parameters also show up in the URL. Let's go through them:
[uid]: the user id
'starttime': '1690214400' (the range start, as a Unix timestamp)
'endtime': '1690473600' (the range end, as a Unix timestamp)
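A quick sanity check on what those two timestamps mean (a minimal sketch; it assumes the captured request used Beijing time, UTC+8):

from datetime import datetime, timezone, timedelta

# Interpret the example timestamps in Beijing time (UTC+8).
cst = timezone(timedelta(hours=8))
print(datetime.fromtimestamp(1690214400, tz=cst))  # 2023-07-25 00:00:00+08:00
print(datetime.fromtimestamp(1690473600, tz=cst))  # 2023-07-28 00:00:00+08:00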
Code 1: fetching the JSON (the full version is attached at the end)
Note: fill in your own cookie.
import json
import time
import requests

cookie = {'cookie': 'paste your own cookie here'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
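With the cookie and headers in place, fetching the JSON is a single requests.get against the endpoint we found in the URL. A minimal sketch using the example uid and timestamps from above (the feature and has* switches from the captured request are left out here for brevity; the full script below includes them):

params2 = {
    'uid': '2656274875',        # the user id from the URL
    'page': '1',                # first page of results
    'starttime': '1690214400',  # range start, Unix timestamp
    'endtime': '1690473600',    # range end, Unix timestamp
}
url2 = "https://weibo.com/ajax/statuses/searchProfile?"
con = requests.get(url=url2, headers=headers, params=params2, cookies=cookie)
con_json = json.loads(con.text)
print(len(con_json['data']['list']))  # number of posts on this page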
Done:
Code 2 [expanding the content]: if you don't click [Expand], you get only part of the content, not the full text.
Same trick as before: click Expand, grab the id of that particular post, then make one more request to get the full content (as sketched just below)!!
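In code, "expanding" is just one more GET against the longtext endpoint that the full script below relies on (a sketch; the id shown is a hypothetical placeholder, in practice you take mblogid from the search response):

mblogid = 'ABCDEFG'  # hypothetical placeholder; use the mblogid field from the search JSON
url = f'https://weibo.com/ajax/statuses/longtext?id={mblogid}'
resp = requests.get(url, headers=headers, cookies=cookie)
data = json.loads(resp.text)
full_text = data.get('data', {}).get('longTextContent')  # None if the post was never truncated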
Code 2: data cleaning
date = con_json['data']['list'][i]['created_at']  # date
con = con_json['data']['list'][i]['text_raw']  # content
reposts_count = con_json['data']['list'][i]['reposts_count']  # reposts
comments_count = con_json['data']['list'][i]['comments_count']  # comments
attitudes_count = con_json['data']['list'][i]['attitudes_count']  # likes
mblogid = con_json['data']['list'][i]['mblogid']  # post id
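A purely stylistic note: the same extraction reads more cleanly if you bind each list item to a name once instead of re-indexing the list for every field:

for item in con_json['data']['list']:
    date = item['created_at']                 # date
    con = item['text_raw']                    # content
    reposts_count = item['reposts_count']     # reposts
    comments_count = item['comments_count']   # comments
    attitudes_count = item['attitudes_count'] # likes
    mblogid = item['mblogid']                 # post id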
I didn't know how many posts were published in this time range, so I simply loop over up to 999 pages (the loop breaks as soon as a page comes back empty):
import json
import time
import requests

cookie = {'cookie': 'paste your own cookie here'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

for i1 in range(1, 999):
    params2 = {
        'uid': '2656274875',
        'page': f'{i1}',
        'feature': '0',
        'starttime': '1690214400',
        'endtime': '1690473600',
        'hasori': 1,
        'hasret': 1,
        'hastext': 1,
        'haspic': 1,
        'hasvideo': 1,
        'hasmusic': 1,
    }
    url2 = "https://weibo.com/ajax/statuses/searchProfile?"
    print(f"********** requesting page {i1} **********")
    con = requests.get(url=url2, headers=headers, params=params2, cookies=cookie)
    time.sleep(1.5)  # pause between pages so we don't hammer the server
    con_json = json.loads(con.text)
    if con_json['data']['list'] == []:  # empty page: no more posts in the range
        break
    time.sleep(0.2)
    for i in range(len(con_json['data']['list'])):
        date = con_json['data']['list'][i]['created_at']
        con = con_json['data']['list'][i]['text_raw']
        reposts_count = con_json['data']['list'][i]['reposts_count']
        comments_count = con_json['data']['list'][i]['comments_count']
        attitudes_count = con_json['data']['list'][i]['attitudes_count']
        mblogid = con_json['data']['list'][i]['mblogid']
        # fetch the full text; fall back to text_raw if there is no expanded version
        url2 = f'https://weibo.com/ajax/statuses/longtext?id={mblogid}'
        resp = requests.get(url2, headers=headers, cookies=cookie)
        con2 = json.loads(resp.text)
        try:
            con2 = con2['data']['longTextContent']
        except (KeyError, TypeError):
            con2 = con
        print("post id (expanded):", mblogid)
        print("date:", date)
        print("content:", con2)
        print("reposts:", reposts_count, end=' ')
        print("comments:", comments_count, end=' ')
        print("likes:", attitudes_count)
        print("-------------------------------------------------------------")
We're almost at the end; finally, let's polish the code:
1. Accept a date and convert it to a timestamp automatically
2. Write the content to an Excel file
Done:
Below is the full version of the code.
import json
import time
import requests
import openpyxl

# workbook that will hold the results
wb2 = openpyxl.Workbook()
ws2 = wb2.active
ws2.append(['date', 'content', 'reposts', 'comments', 'likes'])

cookie = {'cookie': 'paste your own cookie here'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
count = 0

def date_to_timestamp(date_str):
    # convert a 'YYYY-M-D' date string into a Unix timestamp
    try:
        date_time = time.strptime(date_str, '%Y-%m-%d')
        timestamp = int(time.mktime(date_time))
        return timestamp
    except ValueError:
        print('invalid date format')
        return None

try:
    for i1 in range(1, 999):
        params2 = {
            'uid': '2656274875',
            'page': f'{i1}',
            'feature': '0',
            'starttime': date_to_timestamp('2023-7-26'),
            'endtime': date_to_timestamp('2023-7-28'),
            'hasori': 1,
            'hasret': 1,
            'hastext': 1,
            'haspic': 1,
            'hasvideo': 1,
            'hasmusic': 1,
        }
        url2 = "https://weibo.com/ajax/statuses/searchProfile?"
        print(f"********** requesting page {i1} **********")
        con = requests.get(url=url2, headers=headers, params=params2, cookies=cookie)
        time.sleep(1.5)  # pause between pages so we don't hammer the server
        con_json = json.loads(con.text)
        if con_json['data']['list'] == []:  # empty page: no more posts in the range
            break
        time.sleep(0.2)
        for i in range(len(con_json['data']['list'])):
            date = con_json['data']['list'][i]['created_at']
            con = con_json['data']['list'][i]['text_raw']
            reposts_count = con_json['data']['list'][i]['reposts_count']
            comments_count = con_json['data']['list'][i]['comments_count']
            attitudes_count = con_json['data']['list'][i]['attitudes_count']
            mblogid = con_json['data']['list'][i]['mblogid']
            # fetch the full text; fall back to text_raw if there is no expanded version
            url2 = f'https://weibo.com/ajax/statuses/longtext?id={mblogid}'
            resp = requests.get(url2, headers=headers, cookies=cookie)
            con2 = json.loads(resp.text)
            try:
                con2 = con2['data']['longTextContent']
            except (KeyError, TypeError):
                con2 = con
            print("post id (expanded):", mblogid)
            print("date:", date)
            print("content:", con2)
            print("reposts:", reposts_count, end=' ')
            print("comments:", comments_count, end=' ')
            print("likes:", attitudes_count)
            ws2.append([date, con2, reposts_count, comments_count, attitudes_count])
            count += 1
            print(f"#{count} written successfully", "---------------------------------")
finally:
    # save whatever was collected, even if the loop errors out or is interrupted
    wb2.save("./查询结果.xlsx")
Hope this helps everyone.
I've been working hard on learning web scraping lately; let's improve together!!
Since you've read this far: follow + like + save = never lose your way!!