A Half-Hour Introduction to Python Web Crawlers

Latest recommended article published on 2023-08-08 14:48:44

努力推石头的西西弗斯

Article link: https://blog.csdn.net/qq1620657419/article/details/121493792

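Python Web Crawler Beginner's Guide

1. Web Crawlers

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

A crawler program typically has three parts: fetching the page, parsing the page, and storing the data.

1.1. Prerequisites

A general, popular-science level of familiarity with the following is enough.

The HTTP protocol
HTML
The JSON data format
CSV/Excel files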
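
As a quick taste of the last two items, here is a minimal sketch that turns a JSON string into CSV text using only the standard library. The payload and its field names (`title`, `views`) are made up for illustration; they do not come from any real site.

```python
import csv
import io
import json

# A hypothetical JSON payload, such as an API might return.
payload = '[{"title": "Post A", "views": 10}, {"title": "Post B", "views": 25}]'

# JSON text -> Python list of dicts.
records = json.loads(payload)

# Write the records as CSV into an in-memory buffer.
# A real file opened with open(..., newline="") works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "views"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

Each dict becomes one CSV row, and the keys listed in `fieldnames` decide the column order.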

1.2. Your First Crawler Program

import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"

# Send a browser-like User-Agent header with the request.
request_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}

# Step 1: fetch the page.
response = requests.get(link, headers=request_headers, timeout=10)

# Step 2: parse the HTML.
soup = BeautifulSoup(response.text, "html.parser")

# Every post title is an <h1 class="post-title"> that wraps a link.
nodeList = soup.find_all("h1", class_="post-title")

# Step 3: store the data, appending one title per line.
with open('title_test.txt', "a+", encoding="utf-8") as f:
    for node in nodeList:
        title = node.a.text.strip()
        print(title)
        f.write(title + "\n")

2. The Requests Library

2.1. Installing Requests

Install the requests library with pip (this assumes Python and pip are already on your PATH).

pip install requests

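2.2. Core Methods

For details on how to use the API, see the official Requests documentation.
https://docs.python-requests.org/en/latest/

Method	Description
requests.request()	Constructs and sends a request; the base method that all the methods below build on
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches only the headers of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to a page; corresponds to HTTP POST
requests.put()	Submits a PUT request to a page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to a page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to a page; corresponds to HTTP DELETE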
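
To see what these helpers build, without touching the network, the sketch below prepares a GET request and inspects it instead of sending it. `requests.get(url, ...)` is shorthand for `requests.request("GET", url, ...)`. The host `example.com` and the `q` parameter are placeholders, not part of the original article.

```python
import requests

# Build (but do not send) a GET request.
# example.com and the "q" parameter are placeholders for illustration.
prepared = requests.Request(
    "GET", "https://example.com/search", params={"q": "python"}
).prepare()

# The prepared request shows the final method and the URL with the
# query string already encoded.
print(prepared.method)
print(prepared.url)
```

Passing query parameters through `params` rather than concatenating them into the URL lets the library handle the encoding.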

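2.3. The Response Object

2.3.1. response.text

The HTTP response body as a string, that is, the content of the page at the URL.

2.3.2. response.status_code

The status code returned for the HTTP request. 200 means the request succeeded; see the HTTP specification for the other codes.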
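
For a feel of what the other codes mean, the standard library's `http.HTTPStatus` maps each code to its reason phrase. The helper below, `describe_status`, is a hypothetical name used only for this sketch; it is not part of Requests.

```python
from http import HTTPStatus

# The first digit of a status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'.

    HTTPStatus(code) raises ValueError for codes it does not know.
    """
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({STATUS_CLASSES[code // 100]})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In a crawler, the same integer would come from `response.status_code`.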

2.3.3. response.encoding

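The encoding of the response body, as guessed from the HTTP headers.

2.3.4. response.content

The HTTP response body as bytes. Responses compressed with gzip or deflate are decoded automatically.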
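
The three attributes are related: `response.text` is roughly `response.content` decoded with `response.encoding`. The sketch below imitates that relationship with plain bytes and no Requests call, so the variable names are stand-ins only. It also shows why a wrong guess produces garbled text. Requests falls back to ISO-8859-1 for text responses whose headers name no charset, and you can assign `response.encoding` yourself before reading `response.text`.

```python
# Raw bytes as they might arrive over the wire (the role of response.content).
content = "网络爬虫 web crawler".encode("utf-8")

# Decoding with the correct encoding recovers the text (the role of response.text).
text_ok = content.decode("utf-8")

# Decoding with a wrong guess does not raise an error here;
# it silently yields garbled characters instead.
text_bad = content.decode("iso-8859-1")

print(text_ok)
print(text_ok == text_bad)
```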

2.3.5. response.json()

Uses the JSON decoder built into Requests and returns the object parsed from the JSON response body.
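
A body that is not valid JSON (for example, an HTML error page) cannot be decoded. The sketch below shows the same idea with the standard `json` module on plain strings. `parse_json_body` is a hypothetical helper and the sample bodies are invented.

```python
import json

def parse_json_body(body: str):
    """Return the decoded object, or None if the body is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

good = parse_json_body('{"page": 1, "items": [{"id": 7, "name": "example"}]}')
bad = parse_json_body("<html>Not JSON</html>")

print(good["items"][0]["name"])
print(bad)
```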

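3. The BeautifulSoup Library

BeautifulSoup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

3.1. Installing BeautifulSoup

Install BeautifulSoup with pip; this assumes Python and pip are already on your PATH.

pip install beautifulsoup4

3.2. HTML Parsers

BeautifulSoup supports the HTML parser in the Python standard library, but the built-in parser is relatively slow. BeautifulSoup therefore also supports third-party HTML parsers; lxml is recommended because it is faster.

3.2.1. Parser Comparison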

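Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` `BeautifulSoup(markup, "xml")`	Fast; the only XML parser supported	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package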

3.2.2. Installing the lxml Parser

pip install lxml

3.3. CSS Selectors

BeautifulSoup supports CSS selector syntax for extracting specific tag objects.

import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"

request_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}

response = requests.get(link, headers=request_headers, timeout=10)
soup = BeautifulSoup(response.text, "lxml")

# select_one() returns the first element matching the CSS selector, or None.
h1 = soup.select_one("h1.post-title")

print(h1)
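
To experiment with selectors without depending on a live site, you can parse a literal HTML string. The fragment below is invented for this sketch and only mimics the `h1.post-title` structure used above. It uses the built-in `html.parser`, so lxml is not required.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment standing in for a downloaded page.
html = """
<div id="content">
  <h1 class="post-title"><a href="/post/1">First post</a></h1>
  <h1 class="post-title"><a href="/post/2">Second post</a></h1>
  <h1 class="site-title">My Blog</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every match; select_one() returns the first match or None.
links = soup.select("h1.post-title > a")
titles = [a.get_text(strip=True) for a in links]
hrefs = [a["href"] for a in links]
missing = soup.select_one("h2.post-title")

print(titles)
print(hrefs)
print(missing)
```

Because `select_one()` can return None, check the result before reading `.text` from it.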

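4. The xlwings Library

xlwings is a third-party library for working with Excel from Python.

Official documentation: https://www.xlwings.org/

4.1. Installing xlwings

pip install xlwings

4.2. Creating a Workbook

The visible parameter controls whether the Excel application window is shown: True shows the window, False hides it.

The add_book parameter controls whether a new workbook is created when Excel starts: True creates one, False does not.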

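import xlwings as xw

# Start Excel in the background without opening an empty workbook.
app = xw.App(visible=False, add_book=False)

# Create a workbook and save it under the given file name.
workbook = app.books.add()
workbook.save("foo.xlsx")

# Quit the Excel application.
app.quit()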

4.3. Creating a Sheet

import xlwings as xw

app = xw.App(visible=False, add_book=False)

workbook = app.books.open('foo.xlsx')

# Add a new sheet.
workbook.sheets.add('New Sheet')

# Iterate over all sheets in the workbook.
for sheet in workbook.sheets:
    print(sheet)

# Write a header into cell A1 of the new sheet.
workbook.sheets['New Sheet'].range('A1').value = 'ID'

workbook.save()
app.quit()

4.4. Writing Data in Bulk

Store a 2x2 table, that is, a two-dimensional list, in A1:B2, with 1 and 2 in the first row and 3 and 4 in the second.

sheet.range('A1').options(expand='table').value=[[1,2],[3,4]]
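
Scraped data usually arrives as a list of dicts rather than a ready-made two-dimensional list. The sketch below builds the nested list first and does not call xlwings, so it runs without Excel. `to_rows` and the sample records are hypothetical names introduced for illustration.

```python
def to_rows(records, columns):
    """Turn a list of dicts into a header row followed by one row per record."""
    rows = [list(columns)]
    for record in records:
        # Missing keys become empty cells instead of raising KeyError.
        rows.append([record.get(col, "") for col in columns])
    return rows

records = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
rows = to_rows(records, ["id", "name"])
print(rows)

# With xlwings and an open sheet, this nested list could then be written
# in a single assignment, e.g. sheet.range('A1').value = rows
```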

5. Classic Practice Project: Douban Top 250

import requests
from bs4 import BeautifulSoup
import xlwings as xw

response = requests.get("https://movie.douban.com/top250", headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}, timeout=10)
soup = BeautifulSoup(response.text, "lxml")

# Each movie on the page is a div.item inside the ol.grid_view list.
itemList = soup.select("ol.grid_view > li > div.item")
dataList = []

for item in itemList:
    movieName = item.select_one("div.hd > a > span").text
    poster = item.select_one("img")['src']
    doubanScore = item.select_one("span.rating_num").text

    # The rating count sits in a span whose text ends with "人评价"
    # ("people rated"); drop that 3-character suffix to keep the number.
    doubanScoresNumSpanList = [span.text for span in item.select(
        "span") if "人评价" in span.text]
    doubanScoresNum = doubanScoresNumSpanList[0][:-3] if doubanScoresNumSpanList else ""

    # A movie may lack a one-line quote, so guard against a missing node.
    quoteNode = item.select_one("p.quote > span.inq")
    doubanQuote = quoteNode.text if quoteNode else ""

    data = {
        "movieName": movieName,  # movie title
        "poster": poster,  # poster image URL
        "doubanScore": doubanScore,  # Douban rating
        "doubanScoresNum": doubanScoresNum,  # number of ratings
        "doubanQuote": doubanQuote  # one-line quote
    }

    dataList.append(data)

# Header row, followed by one row per movie.
sheetDataList = [['Title', 'Poster', 'Douban score', 'Number of ratings', 'Quote']]

for data in dataList:
    sheetDataList.append([data['movieName'], data['poster'],
                         data['doubanScore'], data['doubanScoresNum'], data['doubanQuote']])

app = xw.App(visible=False, add_book=False)
workbook = app.books.add()
sheet = workbook.sheets[0]

# Write the whole table in one assignment, starting at A1.
sheet.range('A1').options(expand='table').value = sheetDataList

workbook.save("豆瓣Top250.xlsx")
app.quit()
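
The script above requests only the first page of the list. As a sketch of how the remaining pages might be covered, the snippet below generates one URL per page. It assumes, without confirmation from the original article, that pages are addressed with a `start` offset query parameter in steps of 25. `page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # assumption: 25 movies per page
TOTAL = 250

def page_urls(base=BASE_URL, page_size=PAGE_SIZE, total=TOTAL):
    """Yield one URL per page, using an assumed `start` offset parameter."""
    for start in range(0, total, page_size):
        yield f"{base}?{urlencode({'start': start})}"

urls = list(page_urls())
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL could then go through the same fetch-and-parse loop, ideally with a short pause between requests to avoid hammering the site.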
