Learning Python Web Crawlers

Latest recommended article published on 2024-11-03 17:39:57

qq_39305263

Article tags: python, crawler study

Article link: https://blog.csdn.net/qq_39305263/article/details/107329429

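The core idea behind writing a crawler is that
the web is a lawless place.

Every website is an API, so write your crawler as if you were calling one.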

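Step 1: Requests (check robots.txt first, so the crawler stays away from content it should not touch)
Fetch the page (simulating a person clicking through the site)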
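
A minimal sketch of the robots.txt check mentioned in step 1, using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the example.com URLs are made up for illustration and are inlined so the sketch runs without network access; a real crawler would load the target site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# A path outside /private/ is allowed, a path inside it is not.
print(parser.can_fetch("*", "http://www.example.com/index.html"))
print(parser.can_fetch("*", "http://www.example.com/private/data.html"))
```

With these sample rules, the first call prints `True` and the second prints `False`.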

Step 2
Beautiful Soup
Parse the page

Step 3
RE
Use regular expressions to extract the key information
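
The parsing and extraction steps can be sketched on a small, hard-coded page. This assumes the `bs4` package is installed; `SAMPLE_HTML` and the helper name `extract_photo_ids` are hypothetical, and the HTML stands in for the text that step 1 would return.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Photo of the Day</h1>
  <a href="/photography/photo_of_the_day/3921.html">2017-02-11</a>
  <a href="/about.html">About</a>
</body></html>
"""


def extract_photo_ids(html):
    """Return the numeric ids found in photo_of_the_day links."""
    soup = BeautifulSoup(html, "html.parser")  # step 2: parse the page
    ids = []
    for link in soup.find_all("a", href=True):
        # step 3: a regular expression pulls out the key information
        match = re.search(r"photo_of_the_day/(\d+)\.html$", link["href"])
        if match:
            ids.append(match.group(1))
    return ids


print(extract_photo_ids(SAMPLE_HTML))
```

For the sample page this prints `['3921']`; the second link is skipped because its `href` does not match the pattern.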

requests:

Working with it is mostly a matter of dealing with exceptions.

# Home page of Beijing Institute of Technology

    >>> import requests
    >>> r = requests.head('http://httpbin.org/get')
    >>> r.headers
    {'Date': 'Mon, 13 Jul 2020 19:23:57 GMT', 'Content-Type': 'application/json', 'Content-Length': '307', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
    >>> r.text
    ''

Wrapping the request in try and except ensures that any exception gets handled.
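
One way to apply the try/except idea is a small fetch helper. This is a sketch rather than a definitive framework: the function name `get_html_text`, the 30-second timeout, and the choice to return `None` on failure are all assumptions made for illustration.

```python
import requests


def get_html_text(url, timeout=30):
    """Return the page text, or None if the request fails for any reason."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raises HTTPError on a 4xx/5xx status
        r.encoding = r.apparent_encoding  # use the encoding guessed from the body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, bad URLs, and HTTP error statuses.
        return None
```

Calling `get_html_text("http://httpbin.org/get")` would return the response body on success, while a malformed URL or an unreachable host yields `None` instead of a traceback.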

Search engines are crawlers too.

Regular expressions:

![Image](https://img-blog.csdnimg.cn/20200718014738468.png)
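
As a small illustration of pulling key fields out of page text with the standard `re` module, the sketch below runs two patterns over a made-up string. The `text` fragment, its field names, and its values are hypothetical; real pages will need their own patterns.

```python
import re

# Hypothetical fragment of page text (an assumption for illustration only).
text = (
    '"title":"Python Crawler Book","price":"59.80",'
    '"title":"Regex Pocket Guide","price":"32.00"'
)

# [\d.]+ matches a run of digits and dots; the parentheses capture just the number.
prices = re.findall(r'"price":"([\d.]+)"', text)
# .*? is a non-greedy match, so each title stops at its own closing quote.
titles = re.findall(r'"title":"(.*?)"', text)

print(prices)
print(titles)
```

This prints `['59.80', '32.00']` and then `['Python Crawler Book', 'Regex Pocket Guide']`. Without the `?`, the title pattern would run greedily to the last quote in the string.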

JD.com product page:
https://item.jd.com/2967929.html

    import requests

    # Search keyword passed to Baidu as the "wd" query parameter
    kv = {'wd': 'Python'}
    r = requests.get("http://www.baidu.com/s", params=kv)
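
To see what `params` does to the URL without sending anything, the request can be built and prepared locally. This is a sketch using `requests.Request(...).prepare()`; the variable name `prepared` is just for illustration.

```python
import requests

kv = {"wd": "Python"}

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request("GET", "http://www.baidu.com/s", params=kv).prepare()
print(prepared.url)
```

This prints `http://www.baidu.com/s?wd=Python`: the dictionary is encoded into the query string for you, so there is no need to concatenate the keyword by hand.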

Chinese universities
Downloading images from the web
Format of a web image link:
http://www.example.com/picture.jpg
National Geographic:
http://www.nationalgeographic.com.cn/
Pick a web page that contains a picture:
http://www.nationalgeographic.com.cn/photography/photo_of_the_day/3921.html

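    import os
    import requests
    url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
    root = "D://pics//"
    path = root + url.split('/')[-1]  # reuse the file name from the URL
    try:
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url)
            r.raise_for_status()  # do not save an error page as an image
            with open(path, 'wb') as f:
                f.write(r.content)  # the with block closes the file
            print("File saved successfully")
        else:
            print("File already exists")
    except (requests.RequestException, OSError):
        print("Download failed")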
