A Simple Text Crawler

  • Install the requests and BeautifulSoup libraries:

In PyCharm, open the File drop-down menu and choose Settings to install both packages.
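If you prefer the command line, the same two packages can be installed with pip (beautifulsoup4 is the PyPI package that provides bs4):

pip install requests beautifulsoup4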

  • Get the header and cookie the crawler needs:

Taking Baidu's hot-search page as the example: right-click the page and choose Inspect, open the Network tab, refresh with Ctrl+R, then right-click the first request and copy it as cURL.

Paste it into Convert curl commands to code (curlconverter.com) to get the header and cookie, ready to copy directly, as sketched below.
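The Python that curlconverter generates has roughly the following shape; every value here is a placeholder standing in for whatever your own capture produces, not a real token:

import requests

cookies = {
    'SOME_COOKIE': 'value-from-your-browser',  # placeholder, not a real token
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',  # placeholder
}
response = requests.get('https://top.baidu.com/board', cookies=cookies, headers=headers)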

  • Fetch the page with a requests GET request

Click the element-picker icon in the top-left corner of DevTools, click the information you want to scrape, then right-click the matching node in the source and choose Copy selector. Copy a few of these and compare them for a pattern, so you can infer a selector that covers all the target data (see the sketch after the selectors below). For example:

#sanRoot > main > div.container.right-container_2EFJr > div > div:nth-child(2) > div:nth-child(1) > div.content_1YWBm > a > div.c-single-text-ellipsis

#sanRoot > main > div.container.right-container_2EFJr > div > div:nth-child(2) > div:nth-child(5) > div.content_1YWBm > a > div.c-single-text-ellipsis

#sanRoot > main > div.container.right-container_2EFJr > div > div > div > div.content_1YWBm > div.hot-desc_1m_jR.small_Uvkd3.ellipsis_DupbZ

The first two copied selectors differ only in their :nth-child() indexes; from that pattern, the generalized selector for the target data (the third line, with the indexes dropped) can be inferred.
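To see why dropping the :nth-child() index generalizes a copied selector, here is a minimal, self-contained sketch; the HTML is a made-up stand-in for the hot-search markup (bs4 4.7+ is needed for :nth-child support):

from bs4 import BeautifulSoup

html = """
<main>
  <div class="content"><a><div class="title">item 1</div></a></div>
  <div class="content"><a><div class="title">item 2</div></a></div>
  <div class="content"><a><div class="title">item 3</div></a></div>
</main>
"""
soup = BeautifulSoup(html, "html.parser")

# A copied selector pins down one specific item via :nth-child():
print(soup.select("main > div:nth-child(1) > a > div.title"))  # item 1 only
# Dropping the index matches every item with the same structure:
print(soup.select("main > div > a > div.title"))               # all three items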

The complete crawler code is as follows:

import requests
from bs4 import BeautifulSoup
cookies = {
    'BIDUPSID': '6D4ECAE5184B0FD2223CE4EB639ED21B',
    'PSTM': '1666252145',
    'BAIDUID': '6D4ECAE5184B0FD23D9110EF1AE2968E:FG=1',
    'BDUSS': 'nk0UU9DdjVYc1NDOHNKbzVCRXczTzFXUFhzSTVxTHY3VW94bUFVeTlRd3RTSXhqRVFBQUFBJCQAAAAAAAAAAAEAAAAP-3GQNDY2MTO~qAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC27ZGMtu2RjQ',
    'BDUSS_BFESS': 'nk0UU9DdjVYc1NDOHNKbzVCRXczTzFXUFhzSTVxTHY3VW94bUFVeTlRd3RTSXhqRVFBQUFBJCQAAAAAAAAAAAEAAAAP-3GQNDY2MTO~qAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC27ZGMtu2RjQ',
    'ZFY': 'KGFcQdUfzjPydX91k2qm4YZ2uQBTrxGR73uVGPwhhVI:C',
    'BAIDUID_BFESS': '6D4ECAE5184B0FD23D9110EF1AE2968E:FG=1',
    '__bid_n': '1844ba6a32c3f032914207',
    'RT': '"z=1&dm=baidu.com&si=vo6qc7obakq&ss=labvgtd8&sl=e&tt=7l4&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=ndj4&ul=1kt98&hd=1ktet"',
    'BDORZ': 'B490B5EBF6F3CD402E515D22BCDA1598',
    'BDRCVFR[-BxzrOzUsTb]': 'mk3SLVN4HKm',
    'H_PS_PSSID': '26350',
    'BA_HECTOR': '0la52la10l0lala001018fjq1hn6c741f',
}

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    # Requests sorts cookies= alphabetically
    # 'Cookie': 'BIDUPSID=6D4ECAE5184B0FD2223CE4EB639ED21B; PSTM=1666252145; BAIDUID=6D4ECAE5184B0FD23D9110EF1AE2968E:FG=1; BDUSS=nk0UU9DdjVYc1NDOHNKbzVCRXczTzFXUFhzSTVxTHY3VW94bUFVeTlRd3RTSXhqRVFBQUFBJCQAAAAAAAAAAAEAAAAP-3GQNDY2MTO~qAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC27ZGMtu2RjQ; BDUSS_BFESS=nk0UU9DdjVYc1NDOHNKbzVCRXczTzFXUFhzSTVxTHY3VW94bUFVeTlRd3RTSXhqRVFBQUFBJCQAAAAAAAAAAAEAAAAP-3GQNDY2MTO~qAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC27ZGMtu2RjQ; ZFY=KGFcQdUfzjPydX91k2qm4YZ2uQBTrxGR73uVGPwhhVI:C; BAIDUID_BFESS=6D4ECAE5184B0FD23D9110EF1AE2968E:FG=1; __bid_n=1844ba6a32c3f032914207; RT="z=1&dm=baidu.com&si=vo6qc7obakq&ss=labvgtd8&sl=e&tt=7l4&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=ndj4&ul=1kt98&hd=1ktet"; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[-BxzrOzUsTb]=mk3SLVN4HKm; H_PS_PSSID=26350; BA_HECTOR=0la52la10l0lala001018fjq1hn6c741f',
    'Referer': 'https://top.baidu.com/board?platform=pc&sa=pcindex_entry',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.42',
    'sec-ch-ua': '"Microsoft Edge";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

params = {
    'tab': 'realtime',
}

# Data storage: open the output file in append mode
fo = open("./热搜1.txt", 'a', encoding="utf-8")
# Fetch the page
response = requests.get('https://top.baidu.com/board', params=params, cookies=cookies, headers=headers)
# Parse the page
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# CSS selector for the content to scrape (inferred above)
content = "#sanRoot > main > div.container.right-container_2EFJr > div > div > div > div.content_1YWBm > div.hot-desc_1m_jR.small_Uvkd3.ellipsis_DupbZ"
# Extract the matching elements
a = soup.select(content)
'''BeautifulSoup.select finds all elements matching a CSS selector and returns a list.
Use select('#title') to find elements whose id is title (prefix an id with #).
Use select('.link') to find elements whose class is link (prefix a class with .).
Use select('a[href]') to find the href links of all a tags.'''
for item in a:
    fo.write(item.text + '\n')  # write each matched element's text on its own line
fo.close()
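As a standalone illustration of the select() patterns described in the comment block above (made-up HTML, not the Baidu page):

from bs4 import BeautifulSoup

html = '<div id="title">Headline</div><a class="link" href="https://example.com">go</a>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('#title'))  # elements whose id is title
print(soup.select('.link'))   # elements whose class is link
print([a['href'] for a in soup.select('a[href]')])  # href of every a tag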
