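Requirement:
Fetch the hot posts from Guba, the Eastmoney stock forum: read count, comment count, title, author, and update time.
Import requests and re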
import requests,re
Define the request headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
Send the request and receive the response
response = requests.get(url='https://guba.eastmoney.com/default,99_1.html',headers=headers)
print(response.text)
Define the pattern that matches the ul holding the li items
ul_pattern = re.compile(r'<ul class="newlist" tracker-eventcode="gb_xgbsy_ lbqy_rmlbdj">(.*?)</ul>',re.S)
ul = ul_pattern.findall(response.text)[0]
# print(ul) # []
# print(response.text)
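Indexing `findall(...)[0]` raises `IndexError` when the pattern matches nothing, which the commented-out `print(ul) # []` above suggests can happen, for example if the page markup differs from what the pattern expects. A minimal sketch of a guarded lookup follows. The helper name `extract_first`, the simplified pattern, and the sample HTML are all assumptions made for illustration, not the real page source.

```python
import re

def extract_first(pattern, text):
    """Return the first captured group, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

# Hypothetical sample markup, not the real page source.
sample_html = '<div><ul class="newlist">\n<li>a</li>\n</ul></div>'
ul_demo_pattern = re.compile(r'<ul class="newlist">(.*?)</ul>', re.S)

ul_demo = extract_first(ul_demo_pattern, sample_html)                     # the list body
missing = extract_first(ul_demo_pattern, '<div>no list here</div>')      # None, no exception
```

With this shape the caller can check for `None` and report a clear error instead of crashing on an empty list.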
Define the pattern for extracting each li
li_pattern = re.compile(r'<li.*?>(.*?)</li>',re.S)
li_list = li_pattern.findall(ul)
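The `re.S` flag matters here because each li usually spans several lines, and without it `.` stops at a newline. The sketch below runs the same li pattern on a made-up two-item snippet (the markup is an assumption, not copied from the site) with and without the flag.

```python
import re

# Hypothetical markup: two list items, each spread over several lines.
sample_ul = (
    '<li class="odd">\n<cite>1</cite>\n</li>\n'
    '<li>\n<cite>2</cite>\n</li>'
)

with_dotall = re.compile(r'<li.*?>(.*?)</li>', re.S)
without_dotall = re.compile(r'<li.*?>(.*?)</li>')

# With re.S, '.' also matches newlines, so each multi-line item is captured.
items = [item.strip() for item in with_dotall.findall(sample_ul)]

# Without re.S, no item can be matched across the line breaks.
items_no_flag = without_dotall.findall(sample_ul)
```

The non-greedy `.*?` is what keeps each match inside a single li; a greedy `.*` under `re.S` would run from the first `<li` to the last `</li>`.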
Loop over the list and extract the required fields from each item
Pattern for the read count and comment count:
read_comment_pattern = re.compile(r'<cite>(.*?)</cite>',re.S)
Pattern for the title:
title_pattern = re.compile(r'<a .*? class="note">(.*?)</a>',re.S)
Pattern for the author:
author_pattern = re.compile(r'<font>(.*?)</font>',re.S)
Pattern for the update time:
time_pattern = re.compile(r'<cite class="last">(.*?)</cite>')
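To see how the four patterns divide the work, it helps to run them on a single item. The li below is invented for illustration (the attribute names other than `class="note"` and `class="last"` are assumptions), so treat this as a sketch of the idea rather than a copy of the real markup. The plain `<cite>` pattern only matches cite tags without attributes, which is why it yields exactly the read count and the comment count, in that order.

```python
import re

read_comment_pattern = re.compile(r'<cite>(.*?)</cite>', re.S)
title_pattern = re.compile(r'<a .*? class="note">(.*?)</a>', re.S)
author_pattern = re.compile(r'<font>(.*?)</font>', re.S)
time_pattern = re.compile(r'<cite class="last">(.*?)</cite>')

# Hypothetical list item, not the real page source.
sample_li = (
    '<cite> 1024 </cite>'
    '<cite> 56 </cite>'
    '<span class="sub"><a href="/news,1.html" title="demo" class="note">Demo title</a></span>'
    '<cite class="aut"><font>demo_user</font></cite>'
    '<cite class="last">07-15 10:30</cite>'
)

# One findall call returns both counts: read first, comment second.
counts = [c.strip() for c in read_comment_pattern.findall(sample_li)]
read, comment = counts[0], counts[1]
title = title_pattern.findall(sample_li)[0]
author = author_pattern.findall(sample_li)[0]
updated = time_pattern.findall(sample_li)[0]
```

Calling `findall` once and indexing the result twice avoids scanning the same item two times for the two counts.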
for li in li_list:
    dic = {}
    # Read count: the first plain <cite> in the item
    read = read_comment_pattern.findall(li)[0].strip()
    # print(read)
    # Comment count: the second plain <cite> in the item
    comment = read_comment_pattern.findall(li)[1].strip()
    # Title
    title = title_pattern.findall(li)[0]
    # Author
    author = author_pattern.findall(li)[0]
    # Update time
    time = time_pattern.findall(li)[0]
    # Keys: read count, comment count, title, author, update time
    dic['阅读数'] = read
    dic['评论数'] = comment
    dic['标题'] = title
    dic['作者'] = author
    dic['更新时间'] = time
    # Append the record to a txt file, one record per line
    with open('guba.txt','a',encoding='utf-8') as fp:
        fp.write(str(dic)+'\n')
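The loop stores each record with `str(dic)`, which writes a Python dict repr that is awkward to load back later. One alternative, shown here as a sketch and not as the method used above, is to write one JSON object per line. The helper names `append_record` and `load_records`, the file name `guba.jsonl`, and the sample record are all hypothetical.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one record as a single JSON line, keeping non-ASCII text readable."""
    with open(path, 'a', encoding='utf-8') as fp:
        fp.write(json.dumps(record, ensure_ascii=False) + '\n')

def load_records(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as fp:
        return [json.loads(line) for line in fp if line.strip()]

# Made-up record written to a temporary directory so the demo leaves no clutter.
demo_path = os.path.join(tempfile.mkdtemp(), 'guba.jsonl')
demo_record = {'阅读数': '1024', '评论数': '56', '标题': '示例标题'}

append_record(demo_path, demo_record)
loaded = load_records(demo_path)
```

`ensure_ascii=False` keeps the Chinese keys and values readable in the file instead of escaping them as `\uXXXX` sequences.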
Complete code:
import requests,re
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
response = requests.get(url='https://guba.eastmoney.com/default,99_1.html',headers=headers)
ul_pattern = re.compile(r'<ul class="newlist" tracker-eventcode="gb_xgbsy_ lbqy_rmlbdj">(.*?)</ul>',re.S)
ul = ul_pattern.findall(response.text)[0]
li_pattern = re.compile(r'<li.*?>(.*?)</li>',re.S)
li_list = li_pattern.findall(ul)
read_comment_pattern = re.compile(r'<cite>(.*?)</cite>',re.S)
title_pattern = re.compile(r'<a .*? class="note">(.*?)</a>',re.S)
author_pattern = re.compile(r'<font>(.*?)</font>',re.S)
time_pattern = re.compile(r'<cite class="last">(.*?)</cite>')
for li in li_list:
    dic = {}
    read = read_comment_pattern.findall(li)[0].strip()
    comment = read_comment_pattern.findall(li)[1].strip()
    title = title_pattern.findall(li)[0]
    author = author_pattern.findall(li)[0]
    time = time_pattern.findall(li)[0]
    dic['阅读数'] = read
    dic['评论数'] = comment
    dic['标题'] = title
    dic['作者'] = author
    dic['更新时间'] = time
    with open('guba.txt','a',encoding='utf-8') as fp:
        fp.write(str(dic)+'\n')