A Simple Introduction to Web Scraping with requests and bs4

Latest recommended article published on 2022-03-19 23:40:22

chenweida1

Category: python. Tags: python web scraping

Article link: https://blog.csdn.net/chenweida1/article/details/92732997

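Web Scraping Basics (based on Song Tian's web scraping course)

This post explains how to use a few basic libraries to fetch and analyze the content of HTML pages. It covers the following topics:

The basic workflow of a web crawler
Basic usage of the requests library
Parsing pages with BeautifulSoup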

1. The basic workflow of a web crawler

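The Website is the API: we should treat a web page as an interface for obtaining information, and a Python crawler lets us pull the information we need out of it.
General steps:
(1) Fetch the content of the HTML page with the requests library
(2) Parse the fetched HTML page with the BeautifulSoup library
(3) Use BeautifulSoup together with regular expressions to extract the key information we want
(4) Format the information and output it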
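
The four steps can be sketched roughly as follows. This is only a minimal illustration, not the course's own code: the helper names `fetch_html` and `extract_links`, the inline `sample_html`, and the `example.com` URLs are all assumptions made for the example, and the inline HTML stands in for a downloaded page so the sketch runs without network access.

```python
import re

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 1: download a page and return its HTML text (hypothetical helper)."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()                 # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding     # decode using the encoding guessed from the content
    return r.text


def extract_links(html):
    """Steps 2-3: parse the HTML, then keep <a> tags whose id matches a regex."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a", id=re.compile(r"^link\d+$"))
    return [(a.get_text(), a["href"]) for a in tags]


# Inline sample standing in for a downloaded page (an assumption for illustration).
sample_html = (
    '<p><a href="http://example.com/a" id="link1">First</a> '
    '<a href="http://example.com/b" id="link2">Second</a> '
    '<a href="http://example.com/c" id="other">Third</a></p>'
)

links = extract_links(sample_html)

# Step 4: format the extracted information and print it.
for text, href in links:
    print(f"{text:<10}{href}")
```

With a real site, `sample_html` would be replaced by the result of `fetch_html(url)`.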

2. Basic usage of the requests library

The requests library provides several methods; here we cover the two most important ones, get and post.
The simplest request method, get:

import requests

r = requests.get("http://python123.io/ws/demo.html")
print(r)
# <Response [200]>   status code 200 means the request succeeded
r.encoding = r.apparent_encoding   # decode r.text with the encoding detected from the content
r.text   # in an interactive session, this displays the HTML content of the page:
# '<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
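
The post method is used in much the same way as get. Here is a minimal sketch, assuming a form submission: the endpoint `http://example.com/submit` and the form fields are hypothetical. The request is only built, not sent, so the result can be inspected without network access.

```python
import requests

# Hypothetical endpoint and form fields, used only for illustration.
url = "http://example.com/submit"
form_data = {"key1": "value1", "key2": "value2"}

# Build the POST request without sending it, so it can be inspected offline.
prepared = requests.Request("POST", url, data=form_data).prepare()
print(prepared.method)                   # HTTP method of the prepared request
print(prepared.headers["Content-Type"])  # a dict passed as data= is sent form-encoded
print(prepared.body)                     # the encoded form body

# Sending it for real would look like:
# r = requests.post(url, data=form_data, timeout=10)
```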

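The response shows that the GET request succeeded. Some sites, however, cannot be fetched this way even though they open fine when visited manually in a browser. The most likely cause is our request headers, so let's take a look at them: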

r.request.he
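
As a minimal sketch of the header issue described above, assuming the site rejects the default `User-Agent` that requests sends: the `Mozilla/5.0` value is a stand-in for a browser-like string, not something taken from the original post. The request is prepared but not sent, so the headers can be checked offline.

```python
import requests

# Headers that requests attaches by default; the User-Agent identifies the library.
default_headers = requests.utils.default_headers()
print(default_headers["User-Agent"])   # something like python-requests/<version>

# Browser-like User-Agent (an assumed value for illustration).
custom_headers = {"User-Agent": "Mozilla/5.0"}

# Prepare the request through a Session so default and custom headers are merged.
session = requests.Session()
request = requests.Request(
    "GET", "http://python123.io/ws/demo.html", headers=custom_headers
)
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])  # the custom value overrides the default

# Sending it for real would look like:
# r = requests.get("http://python123.io/ws/demo.html", headers=custom_headers, timeout=10)
```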
