Extracting Information from a Weibo Profile Page: Li Yifeng's Profile as an Example
Goal: extract each post's id, like count, repost count, comment count, and text content, and save them to a text file.
1. Import libraries and initialize
import requests
from urllib.parse import urlencode
from pyquery import PyQuery as pq
import time
base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/1291477752',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'X-Requested-With': 'XMLHttpRequest',
}
max_page = 10
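As a quick standalone check, urlencode turns a params dict like the one used in the next step into the query string appended to base_url (the values here mirror those used below):

```python
from urllib.parse import urlencode

base_url = 'https://m.weibo.cn/api/container/getIndex?'
params = {
    'type': 'uid',
    'value': '1291477752',
    'containerid': '1076031291477752',
    'page': 2,
}
# urlencode preserves dict insertion order and converts the int page to a string
url = base_url + urlencode(params)
print(url)
# https://m.weibo.cn/api/container/getIndex?type=uid&value=1291477752&containerid=1076031291477752&page=2
```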
2. Fetch a page of data
def get_page(page):
    params = {
        'type': 'uid',
        'value': '1291477752',
        'containerid': '1076031291477752',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json(), page
    except requests.ConnectionError as e:
        print('Error', e.args)
    return None, page  # always return a 2-tuple so the caller can unpack it safely
3. Parse the page and extract information
To extract the plain text of each post, the text() method of the pyquery.PyQuery class is convenient: it strips the HTML tags and returns only the text content (the exact whitespace handling may vary between pyquery versions). For example:
doc = PyQuery('<div><span>toto</span><span>tata</span></div>')
print(doc.text())
# Output: tototata
doc = PyQuery('''<div><span>toto</span>
<span>tata</span></div>''')
print(doc.text())
# Output: toto tata
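pyquery is a third-party package; if it is unavailable, the same tag-stripping can be sketched with the standard-library html.parser (TextExtractor and strip_tags are illustrative names, not part of this tutorial's code):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    return ''.join(parser.parts)

print(strip_tags('<div><span>toto</span><span>tata</span></div>'))
# tototata
```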
The parsing code for this step is as follows:
def parse_page(json, page: int):
    weibo = []  # one record per post; must be initialized outside the loop
    if json:
        items = json.get('data').get('cards')
        for index, item in enumerate(items):
            # On the first page, the second card is a pinned post, so skip it
            if page == 1 and index == 1:
                continue
            item = item.get('mblog', {})
            if not item:
                continue  # some cards carry no post body
            weibo_id = item.get('id')
            text = pq(item.get('text')).text()
            attitudes = item.get('attitudes_count')
            comments = item.get('comments_count')
            reposts = item.get('reposts_count')
            # each record: [id, text, attitudes, comments, reposts]
            weibo.append([weibo_id, text, str(attitudes), str(comments), str(reposts)])
    return weibo
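The extraction logic can be exercised offline against a mocked payload shaped like the API's JSON (the field values here are made up, and a simple regex stands in for pq(...).text()):

```python
import re

def strip_tags(html):
    # Simplified stand-in for pq(html).text(): drop anything between < and >
    return re.sub(r'<[^>]+>', '', html)

# Mocked payload mirroring only the fields the parser reads (values invented)
payload = {'data': {'cards': [
    {'mblog': {'id': '4321', 'text': '<span>hello</span>',
               'attitudes_count': 10, 'comments_count': 2, 'reposts_count': 1}},
    {'itemid': 'no-post-here'},  # a card without an mblog body
]}}

records = []
for card in payload['data']['cards']:
    mblog = card.get('mblog', {})
    if not mblog:
        continue  # skip cards that carry no post
    records.append([mblog.get('id'), strip_tags(mblog.get('text')),
                    str(mblog.get('attitudes_count')),
                    str(mblog.get('comments_count')),
                    str(mblog.get('reposts_count'))])
print(records)
# [['4321', 'hello', '10', '2', '1']]
```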
4. Save the extracted information to a text file
def save_to_txt(result):
    with open('result.txt', 'a', encoding='utf-8', errors='ignore') as file:
        result = ",".join(result)
        file.write(result + '\n')
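Note that a post's text can itself contain commas, which makes a plain ",".join ambiguous to parse back. The standard-library csv module quotes such fields automatically; a minimal sketch using an in-memory buffer (the sample row is invented):

```python
import csv
import io

rows = [['4321', 'hello, world', '10', '2', '1']]

# csv.writer quotes any field that contains the delimiter
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)

# Reading back recovers the original fields, commas and all
back = list(csv.reader(io.StringIO(buf.getvalue())))
print(back)
# [['4321', 'hello, world', '10', '2', '1']]
```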
5. Main function
if __name__ == '__main__':
    for page in range(1, max_page + 1):
        json, this_page = get_page(page)
        results = parse_page(json, this_page)
        for result in results:
            print(result)
            save_to_txt(result)
        time.sleep(1)  # pause between pages to avoid hammering the server
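The fixed time.sleep(1) between pages can be made slightly less predictable by adding random jitter, which makes the request pattern look less mechanical; polite_sleep below is a hypothetical helper, not part of the original script:

```python
import random
import time

def polite_sleep(base=1.0, jitter=0.5):
    # Sleep for base seconds plus a random extra delay in [0, jitter]
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

In the main loop, `time.sleep(1)` would then become `polite_sleep(1.0, 0.5)`.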
Reference: 《Python3网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice)