Python Web Crawler Study Notes, Part 1: Crawler Basics

盛桃云

Category: python. Tags: python, crawler

Article link: https://blog.csdn.net/bowei026/article/details/90147540

A crawler works in three basic steps: fetching the page, parsing its content, and storing the data.

Preparation

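First install the third-party libraries used below: requests for fetching pages, bs4 (Beautiful Soup) for parsing them, and lxml, the parser that the parsing code passes to BeautifulSoup:

pip install requests

pip install bs4

pip install lxml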

Fetching the page

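# coding: UTF-8
import requests

link = "http://www.santostang.com/"
# Send a browser-like User-Agent so the request looks like it comes from a regular browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
# Fetch the page; the timeout keeps the script from hanging if the server never answers
r = requests.get(link, headers=headers, timeout=10)
# r.text is the response body decoded to a string
print(r.text)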

When run, the program prints the HTML source of the page.
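The script above assumes the request succeeds. As a minimal sketch of a slightly more defensive version, the fetch can be wrapped in a function that turns HTTP error responses into exceptions. The name `fetch_html`, the shortened User-Agent string, and the 10-second default are illustrative choices, not part of the original notes.

```python
import requests

# Browser-like User-Agent, shortened here; the exact string is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"}


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or raise on network/HTTP errors.

    fetch_html is a hypothetical helper name used for illustration.
    """
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Turn 4xx/5xx responses into requests.HTTPError instead of
    # silently returning the body of an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("http://www.santostang.com/")` inside a `try` / `except requests.RequestException` block lets the caller handle timeouts, connection failures, and HTTP errors in one place.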

Parsing the page content

# coding: UTF-8
import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
r = requests.get(link, headers=headers, timeout=10)

# Parse the HTML with the lxml parser (requires the lxml package)
soup = BeautifulSoup(r.text, "lxml")
# Find the first <h1 class="post-title">, then take the text of the <a> inside it
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

This extracts the title of the first article on the page. The output is:

第四章 – 4.3 通过selenium 模拟浏览器抓取
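The parsing step can also be tried without any network access. The sketch below runs the same `find` lookup on a small inline HTML string and returns `None` when the element is missing instead of raising `AttributeError`. The sample HTML and the helper names `first_title` and `all_titles` are assumptions made for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the structure the lookup above expects:
# <h1 class="post-title"><a>...</a></h1>.
SAMPLE_HTML = """
<html><body>
  <h1 class="post-title"><a href="/post/1">  First post  </a></h1>
  <h1 class="post-title"><a href="/post/2">Second post</a></h1>
</body></html>
"""


def first_title(html):
    """Return the first post title, or None if the element is missing."""
    # "html.parser" ships with Python; "lxml" works too if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", class_="post-title")
    if heading is None or heading.a is None:
        return None
    return heading.a.text.strip()


def all_titles(html):
    """Return every post title on the page, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: every <a> inside an <h1 class="post-title">
    return [a.text.strip() for a in soup.select("h1.post-title a")]


print(first_title(SAMPLE_HTML))
print(all_titles(SAMPLE_HTML))
print(first_title("<p>no titles here</p>"))
```

`find` returns only the first match, which is why the script above prints a single title; `select` (or `find_all`) is one way to collect all of them.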

Storing the data

# coding: UTF-8
import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
r = requests.get(link, headers=headers, timeout=10)

soup = BeautifulSoup(r.text, "lxml")
title = soup.find("h1", class_="post-title").a.text.strip()

# Write the title as UTF-8 so non-ASCII text does not depend on the system default encoding
with open('d:/title.txt', 'w', encoding='utf-8') as f:
    f.write(title)

After running the program, open d:/title.txt. Its content is the title of the first article on the page, i.e. "第四章 – 4.3 通过selenium 模拟浏览器抓取".
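If more than one title is collected, one simple option is a text file with one title per line. The following is a minimal sketch using only the standard library; `save_titles` and `load_titles` are hypothetical helper names, and a temporary directory stands in for `d:/title.txt` so the example runs on any platform.

```python
import tempfile
from pathlib import Path


def save_titles(titles, path):
    """Write one title per line to path, encoded as UTF-8."""
    # An explicit encoding avoids depending on the platform default,
    # which matters for non-ASCII titles.
    with open(path, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(title + "\n")


def load_titles(path):
    """Read the titles back as a list, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "titles.txt"
    save_titles(["第四章 – 4.3 通过selenium 模拟浏览器抓取", "Second post"], target)
    print(load_titles(target))
```

Reading the file back with the same encoding is a quick way to confirm that the titles were stored intact.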

That covers the three basic steps of a Python crawler and the code for each.
