Introductory Python Web Scraping Tutorial (with Source Code)

青空之蓝qk

Published 2024-08-18 02:32:46

Tags: python

Article link: https://blog.csdn.net/2401_82990204/article/details/141289939

Scraping the Baidu Hot Search Top 50 Ranking + Visualization

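1. Import all modules

Install the following modules from the terminal (openpyxl is needed for the Excel steps, and lxml only for the source listing at the end):

pip install requests
pip install beautifulsoup4
pip install openpyxl
pip install lxml

Next, write the following code: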

import requests

from bs4 import BeautifulSoup

import openpyxl

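How the code works:

The requests library sends HTTP requests to fetch the page content.

The BeautifulSoup library parses the HTML content of the page.

The openpyxl library creates and manipulates Excel files.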

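2. Send an HTTP request to fetch the Baidu hot search page

Next, write the following code:

url = 'https://top.baidu.com/board?tab=realtime'
response = requests.get(url)
html = response.content

Here, requests.get() sends a GET request, and the raw bytes of the response body are assigned to the variable html.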
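
In practice a bare GET can hang or silently return an error page, so it may help to add a timeout, a browser-like User-Agent, and a status check. The following is a minimal sketch of that idea, not the author's code: the helper name fetch_html, the HEADERS value, and the optional session parameter are all assumptions made for illustration.

```python
import requests

# Assumption: a generic desktop browser User-Agent; other realistic values may work too.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_html(url, session=None, timeout=10):
    """Return the response body as bytes (hypothetical helper)."""
    # Accepting a session lets callers reuse connections or pass a stub in tests.
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.content
```

Calling fetch_html(url) with the board URL from this step would then return the same kind of bytes object that response.content holds.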

3. Parse the page content with BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

This creates a BeautifulSoup object, passing in the HTML content to parse and the parser type.
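
Because response.content is a bytes object, BeautifulSoup has to work out the encoding itself. A small sketch of how that can be made explicit with the from_encoding argument is shown here; the sample markup is made up for illustration and stands in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content: UTF-8 encoded bytes.
sample_bytes = "<html><head><title>百度热搜</title></head><body></body></html>".encode("utf-8")

# from_encoding tells BeautifulSoup how to decode the bytes instead of guessing.
soup = BeautifulSoup(sample_bytes, "html.parser", from_encoding="utf-8")
page_title = soup.title.string  # text inside the <title> tag
print(page_title)
```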

4. Extract the hot search data

Next, write the following code:

hot_searches = []
for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'}):
    hot_searches.append(item.text)

This code calls soup.find_all() to find every <div> element whose class attribute includes 'c-single-text-ellipsis', then appends the text content of each element to the hot_searches list.
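
To see the extraction step in isolation, it can be tried on a small inline snippet rather than the live page. The markup below is a hypothetical sample that only mimics the class name used in this step; the real page structure may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name comes from the tutorial.
sample_html = """
<div class="c-single-text-ellipsis"> Topic A </div>
<div class="other">not a hot search</div>
<div class="c-single-text-ellipsis"> Topic B </div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# class_ is the keyword form of the {'class': ...} filter; strip=True trims whitespace.
hot_searches = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="c-single-text-ellipsis")
]
print(hot_searches)
```

Stripping whitespace here is a design choice: text taken directly from a page often carries leading and trailing spaces that would otherwise end up in the spreadsheet.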

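5. Save the hot search data to Excel

Next, write the following code:

workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.title = 'Baidu Hot Searches'

openpyxl.Workbook() creates a new workbook object. Its active property returns the currently active worksheet, which is assigned to the variable sheet. Setting the title property names the worksheet 'Baidu Hot Searches'.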
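
As a quick check of what these three lines do, the sketch below creates a workbook in memory and looks at the sheet name before and after renaming. Nothing is written to disk; the variable default_title is only there for illustration.

```python
import openpyxl

workbook = openpyxl.Workbook()   # a new workbook starts with one worksheet
sheet = workbook.active
default_title = sheet.title      # name openpyxl gives the first sheet
sheet.title = "Baidu Hot Searches"
print(default_title, "->", workbook.sheetnames)
```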

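6. Set the title

Next, write the following code:

sheet.cell(row=1, column=1, value='百度热搜排行榜-博主：Yan-英杰')

The cell() method selects the cell to operate on; its row and column parameters are the 1-based row and column indexes. Here the title string '百度热搜排行榜-博主：Yan-英杰' (Baidu Hot Search Ranking, blogger: Yan-英杰) is written into cell A1.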
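
The relationship between row/column indexes and spreadsheet coordinates can be checked with a short sketch like this one. The title text is a placeholder, not the tutorial's string.

```python
import openpyxl

workbook = openpyxl.Workbook()
sheet = workbook.active

# cell() uses 1-based indexes and returns the Cell object it wrote to.
title_cell = sheet.cell(row=1, column=1, value="Demo title")  # placeholder text
print(title_cell.coordinate, sheet["A1"].value)
```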

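7. Write the hot search data

Next, write the following code:

for i in range(len(hot_searches)):
    sheet.cell(row=i+2, column=1, value=hot_searches[i])

range() generates the indexes used to loop over the hot_searches list. For each index i, cell() writes the corresponding hot search term into the worksheet. The row is i+2 so that the data starts on row 2 and does not overwrite the title in row 1.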
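
An alternative way to write the same loop is with enumerate(), which yields the row number and the text together. The sketch below assumes the title occupies row 1; the helper name write_hot_searches and the sample topics are hypothetical.

```python
import openpyxl


def write_hot_searches(sheet, items, first_row=2):
    """Write items into column 1, starting at first_row (hypothetical helper)."""
    for row, text in enumerate(items, start=first_row):
        sheet.cell(row=row, column=1, value=text)


workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.cell(row=1, column=1, value="Demo title")      # title stays in row 1
write_hot_searches(sheet, ["Topic A", "Topic B"])    # made-up sample data
print([cell.value for cell in sheet["A"]])
```

Starting the count at 2 keeps the title row intact, which is the point of the row offset in this step.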

8. Save the Excel file

Next, write the following code:

workbook.save('百度热搜.xlsx')
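
One way to confirm that a save worked is to load the file back and read a cell. The sketch below does this in a temporary directory so it leaves nothing behind; the file name demo.xlsx and the sample value are assumptions for illustration.

```python
import os
import tempfile

import openpyxl

workbook = openpyxl.Workbook()
workbook.active.cell(row=1, column=1, value="Topic A")  # made-up sample value

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.xlsx")  # hypothetical file name
    workbook.save(path)                        # write the workbook to disk
    reloaded = openpyxl.load_workbook(path)    # read it back
    saved_value = reloaded.active["A1"].value
    reloaded.close()

print(saved_value)
```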

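9. Print a confirmation message

Next, write the following code:

print('热搜数据已保存到百度热搜.xlsx')

This prints a message to the console confirming that the hot search data was saved to 百度热搜.xlsx.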

Source code

import requests
from lxml import etree

url = 'https://top.baidu.com/board?tab=realtime'
# A browser-like User-Agent so the request resembles one from a regular browser
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0"}
response = requests.get(url, headers=headers, timeout=10)
html = response.text
print(html)

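# Inspect the request and response objects
res = requests.get(url)
print(res.request)   # the PreparedRequest that was sent
print(res)           # the Response object, e.g. <Response [200]>
print(res.cookies)   # cookies set by the server
print(res.headers)   # response headers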

# Fetch the page again and parse it with lxml
url = 'https://top.baidu.com/board?tab=realtime'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/87.0.4280.88 Safari/537.36"
}

rep = requests.get(url, headers=headers, timeout=10)
rep.encoding = 'utf-8'
html = etree.HTML(rep.text)
# xpath() takes an XPath expression, not a URL; this one uses the class name from step 4
list_url = html.xpath('//div[contains(@class, "c-single-text-ellipsis")]/text()')
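
Since xpath() expects an XPath expression, it may be easier to see how it behaves on a small inline document. The sketch below is an illustration under assumptions: the markup is made up, and the expression simply matches the class name used in step 4, which may not match the live page.

```python
from lxml import etree

# Hypothetical markup; only the class name comes from the tutorial.
sample_html = """
<html><body>
<div class="c-single-text-ellipsis"> Topic A </div>
<div class="other">ignored</div>
<div class="c-single-text-ellipsis"> Topic B </div>
</body></html>
"""

tree = etree.HTML(sample_html)
# contains(@class, ...) also matches elements that carry additional classes.
raw_texts = tree.xpath('//div[contains(@class, "c-single-text-ellipsis")]/text()')
titles = [text.strip() for text in raw_texts]
print(titles)
```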