Focused crawler case 1: Guba (Eastmoney stock forum)

Requirement:
Fetch the hot posts on Guba and extract, for each post, the read count, comment count, title, author, and update time.

Import requests and re

import requests,re

Define the request headers

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

Send the request and receive the response

response = requests.get(url='https://guba.eastmoney.com/default,99_1.html',headers=headers)
print(response.text)

Define the pattern that matches the ul containing the li items

ul_pattern = re.compile(r'<ul class="newlist" tracker-eventcode="gb_xgbsy_ lbqy_rmlbdj">(.*?)</ul>',re.S)
ul = ul_pattern.findall(response.text)[0]
# print(ul)  # []
# print(response.text)
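
Indexing `findall(...)[0]` raises an `IndexError` whenever the pattern finds nothing, for example if the page markup changes. A minimal sketch of a more forgiving lookup follows, assuming a simplified page; `sample_html` and the helper name `extract_ul` are hypothetical and not part of the original code.

```python
import re

# Hypothetical, simplified stand-in for the real page source.
sample_html = '''
<div>
<ul class="newlist" tracker-eventcode="gb_xgbsy_ lbqy_rmlbdj">
<li><cite>100</cite></li>
</ul>
</div>
'''

ul_pattern = re.compile(
    r'<ul class="newlist" tracker-eventcode="gb_xgbsy_ lbqy_rmlbdj">(.*?)</ul>',
    re.S,
)


def extract_ul(html):
    """Return the inner HTML of the first matching <ul>, or None if absent."""
    match = ul_pattern.search(html)
    return match.group(1) if match else None


ul = extract_ul(sample_html)
missing = extract_ul('<html><body>unexpected page</body></html>')
```

Here `missing` is `None` rather than an exception, so the caller can decide whether to retry, log the response, or stop.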

Define the pattern that matches each li

li_pattern = re.compile(r'<li.*?>(.*?)</li>',re.S)
li_list = li_pattern.findall(ul)
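
The li pattern depends on two details: the `re.S` flag, which lets `.` match newlines, and the non-greedy `.*?`, which stops at the nearest `</li>`. The sketch below illustrates both on a hypothetical two-item fragment (the `ul` string here is made up for the example).

```python
import re

# Hypothetical fragment standing in for the extracted <ul> body.
ul = '''
<li class="odd">
  <cite>1</cite>
</li>
<li>
  <cite>2</cite>
</li>
'''

# Non-greedy: each <li>...</li> pair becomes its own match.
li_pattern = re.compile(r'<li.*?>(.*?)</li>', re.S)
li_list = li_pattern.findall(ul)

# Greedy variant for comparison: the quantifiers swallow as much as possible,
# so the two items are no longer separated.
greedy_list = re.compile(r'<li.*>(.*)</li>', re.S).findall(ul)

# Without re.S, "." cannot cross line breaks, so multi-line items do not match.
no_dotall_list = re.compile(r'<li.*?>(.*?)</li>').findall(ul)
```

With the non-greedy pattern `li_list` holds one entry per item, while the greedy and the flag-less variants fail to split the fragment into its two items.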

Loop over the list and extract the required fields from each item

Define the pattern for the read count and comment count:

read_comment_pattern = re.compile(r'<cite>(.*?)</cite>',re.S)
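
Because the pattern matches only a bare `<cite>` tag, a `<cite>` that carries attributes (such as the `class="last"` one used for the update time) is skipped, and `findall` returns the read count first and the comment count second. A small sketch on a hypothetical item body, whose exact markup is an assumption modelled on the patterns in this post:

```python
import re

# Hypothetical <li> body; the real markup may differ.
li = '''
<cite>
  1234
</cite>
<cite>
  56
</cite>
<cite class="last">07-15 10:30</cite>
'''

read_comment_pattern = re.compile(r'<cite>(.*?)</cite>', re.S)

# Run the search once and strip the surrounding whitespace of each hit.
counts = [value.strip() for value in read_comment_pattern.findall(li)]
read, comment = counts
```

Calling `findall` once and unpacking the result avoids scanning the same item twice.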

Define the pattern for the title:

title_pattern = re.compile(r'<a .*? class="note">(.*?)</a>',re.S)

Define the pattern for the author:

author_pattern = re.compile(r'<font>(.*?)</font>',re.S)
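
If an item lacks one of the fields, `findall(...)[0]` raises an `IndexError` and the whole loop stops. One possible guard is a small helper that falls back to a default value; the name `first_match` and the sample strings are hypothetical, shown here with the author pattern.

```python
import re

author_pattern = re.compile(r'<font>(.*?)</font>', re.S)


def first_match(pattern, text, default=''):
    """Return the first captured group, stripped, or `default` if no match."""
    match = pattern.search(text)
    return match.group(1).strip() if match else default


# Hypothetical inputs: one with an author tag, one without.
with_author = first_match(author_pattern, '<a><font> sample_user </font></a>')
without_author = first_match(author_pattern, '<a>no font tag here</a>')
```

The same helper works with any of the compiled patterns in this post, since each of them has exactly one capturing group.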

Define the pattern for the update time:

time_pattern = re.compile(r'<cite class="last">(.*?)</cite>')

for li in li_list:
    dic = {}
    # Read count: first bare <cite> in the item
    read = read_comment_pattern.findall(li)[0].strip()
    # Comment count: second bare <cite> in the item
    comment = read_comment_pattern.findall(li)[1].strip()
    # Title
    title = title_pattern.findall(li)[0]
    # Author
    author = author_pattern.findall(li)[0]
    # Update time
    time = time_pattern.findall(li)[0]
    dic['阅读数'] = read      # read count
    dic['评论数'] = comment   # comment count
    dic['标题'] = title       # title
    dic['作者'] = author      # author
    dic['更新时间'] = time    # update time
    # Append the record to a txt file, one dict per line
    with open('guba.txt', 'a', encoding='utf-8') as fp:
        fp.write(str(dic) + '\n')
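
Writing `str(dic)` stores the Python repr of each dict, which uses single quotes and is therefore not valid JSON. If the file needs to be read back later, one alternative is to write one JSON object per line. The sketch below assumes a hypothetical record and a temporary output path; `append_record` and `load_records` are illustrative names, not part of the original code.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record to `path` as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as fp:
        # ensure_ascii=False keeps Chinese keys and values readable in the file.
        fp.write(json.dumps(record, ensure_ascii=False) + '\n')


def load_records(path):
    """Read back every non-empty JSON line as a dict."""
    with open(path, encoding='utf-8') as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Hypothetical record using the same keys as the loop above.
sample = {
    '阅读数': '1234',
    '评论数': '56',
    '标题': 'Sample title',
    '作者': 'sample_user',
    '更新时间': '07-15 10:30',
}

path = os.path.join(tempfile.mkdtemp(), 'guba.jsonl')
append_record(path, sample)
loaded = load_records(path)
```

Each line can then be parsed independently, which suits a crawler that appends records as it goes.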

Complete code:

import requests,re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
response = requests.get(url='https://guba.eastmoney.com/default,99_1.html',headers=headers)

ul_pattern = re.compile(r'<ul class="newlist" tracker-eventcode="gb_xgbsy_ lbqy_rmlbdj">(.*?)</ul>',re.S)
ul = ul_pattern.findall(response.text)[0]


li_pattern = re.compile(r'<li.*?>(.*?)</li>',re.S)
li_list = li_pattern.findall(ul)


read_comment_pattern = re.compile(r'<cite>(.*?)</cite>',re.S)

title_pattern = re.compile(r'<a .*? class="note">(.*?)</a>',re.S)

author_pattern = re.compile(r'<font>(.*?)</font>',re.S)

time_pattern = re.compile(r'<cite class="last">(.*?)</cite>')

for li in li_list:
    dic = {}

    read = read_comment_pattern.findall(li)[0].strip()

    comment = read_comment_pattern.findall(li)[1].strip()

    title = title_pattern.findall(li)[0]

    author = author_pattern.findall(li)[0]

    time = time_pattern.findall(li)[0]

    dic['阅读数'] = read
    dic['评论数'] = comment
    dic['标题'] = title
    dic['作者'] = author
    dic['更新时间'] = time

    with open('guba.txt','a',encoding='utf-8') as fp:
        fp.write(str(dic)+'\n')