Learning Python: A Brief Look at Web Crawlers (1)


summersince1985



Article link: https://blog.csdn.net/summersince1985/article/details/79201036


Getting to Know Crawlers

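A web crawler, also called a Web Spider, is aptly named. If the internet is a spider's web, then the spider is the program crawling back and forth across it. A spider finds new pages by following the link addresses in the pages it has already fetched. Put simply, a crawler is a script that requests a web page and extracts data from its source code.
If we treat the whole internet as a single website, a spider could in principle use this approach to fetch every page that can be reached through links. Seen this way, a web crawler is simply a program that fetches web pages, and fetching a page is its basic operation.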
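
To make "finding pages through links" concrete, here is a minimal sketch that pulls the link addresses out of a page's source using only the standard library. The `LinkCollector` class, the `extract_links` helper, and the sample HTML are assumptions made for illustration; a real crawler would feed in source code it had downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page source, used here instead of a real download.
sample_html = '<a href="/about">About</a> <a href="http://example.com/news">News</a>'
links = extract_links(sample_html, "http://example.com/index.html")
print(links)
```

Each extracted link is a candidate for the next request, which is how a spider moves from page to page.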

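How a crawler works

1. Send a request, receive the server's response, and obtain the page source.

2. Process the result (deduplicate, filter, clean, and store the data).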
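
The second step can be sketched in a few lines. This is only an illustration under stated assumptions: the `process` and `store` helpers, the sample records, and the keyword are all hypothetical, and the records stand in for text that step 1 has already downloaded. Cleaning runs first here so that records differing only in whitespace count as duplicates.

```python
import io
import json


def process(records, keyword):
    """Clean, deduplicate, and filter raw text records."""
    seen = set()
    result = []
    for raw in records:
        cleaned = " ".join(raw.split())  # cleaning: trim and collapse whitespace
        if cleaned in seen:              # deduplication: skip repeats
            continue
        seen.add(cleaned)
        if keyword in cleaned:           # filtering: keep only matching records
            result.append(cleaned)
    return result


def store(items, fp):
    """Storage: write the items as JSON to an open text file object."""
    json.dump(items, fp, ensure_ascii=False)


# Hypothetical records standing in for downloaded text.
records = ["python  crawler", "python crawler", "java  parser", " python requests "]
items = process(records, "python")

buffer = io.StringIO()  # in-memory stand-in for a real file
store(items, buffer)
print(buffer.getvalue())
```

Passing a file opened with `open(path, "w", encoding="utf-8")` instead of the `StringIO` buffer would write the same JSON to disk.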

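How a crawler obtains web page data

First, we need to recognize three characteristics that web pages share:

1. Every page has a URL;

2. Every page is retrieved through an HTTP request;

3. Every page uses HTML to present its content.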
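
The first two characteristics can be seen side by side in a small sketch: a URL splits into parts, and those parts are what go into the text of an HTTP request. The `build_get_request` helper is hypothetical and builds only a bare-bones HTTP/1.1 GET request; real clients add more headers.

```python
from urllib.parse import urlsplit

url = "http://httpbin.org/get?name=joe"
parts = urlsplit(url)
print(parts.scheme)  # protocol
print(parts.netloc)  # host
print(parts.path)    # resource path on the host
print(parts.query)   # query string


def build_get_request(url):
    """Return the text of a minimal HTTP/1.1 GET request for url."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return (
        f"GET {target} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


print(build_get_request(url))
```

The server answers a request like this with a status line, headers, and a body, and for a web page that body is the HTML source.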

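Next, we need to understand the difference between general-purpose crawlers and focused crawlers:

General-purpose crawler: the kind used by search engines. Its workflow is to crawl pages, store the data, process the content, and then provide search and ranking services. It follows the Robots protocol. Its drawback is limited functionality: it cannot crawl files, music data, and similar resources.
Focused crawler: a crawler written for the needs of a specific topic. It differs from a general search engine crawler in that it processes and filters content while crawling, trying to fetch only pages relevant to the requirement, and it offers more functionality.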
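
Both ideas can be sketched with the standard library: checking a URL against Robots rules, and keeping only on-topic text the way a focused crawler would. The robots rules, the `demo-crawler` agent name, and the `allowed` and `is_relevant` helpers are assumptions for illustration; a real crawler would load the rules from the site's own `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of downloaded.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
rules = RobotFileParser()
rules.parse(robots_lines)


def allowed(url, agent="demo-crawler"):
    """Return True if the robots rules permit this agent to fetch url."""
    return rules.can_fetch(agent, url)


def is_relevant(text, keywords):
    """Focused-crawler filter: keep text that mentions any topic keyword."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
print(is_relevant("Learning Python crawlers", ["python"]))
```

Keyword matching is the simplest possible relevance test; it is used here only to show where the filtering step sits.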

Basic requests (GET/POST)

Viewing basic response information

# -*- coding: utf-8 -*-
import requests

# Basic requests: one function per HTTP method
requests.get('http://httpbin.org/get')        # GET request
requests.post('http://httpbin.org/post')      # POST request
requests.put('http://httpbin.org/put')        # PUT request
requests.delete('http://httpbin.org/delete')  # DELETE request

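response = requests.get('http://www.baidu.com')  # send a request to the URL and fetch the page
print(response)
print(type(response))        # type of the returned object
print(response.status_code)  # HTTP status code
print(response.encoding)     # encoding used to decode response.text
print(response.cookies)      # cookies returned by the server
print(response.text)         # page source decoded to a string
print(response.content)      # page source as raw, undecoded bytes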

# GET request with parameters
params = {'name': 'joe', 'age': 18}
r = requests.get('http://httpbin.org/get', params=params)  # sent as the query string
print(r.url)   # the requested URL, with the parameters appended as ?name=joe&age=18
print(r.text)
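
To show how a parameter dictionary ends up in the URL, here is a small standard-library sketch. It approximates what happens to `params` and is not the `requests` implementation; the `build_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append params to base as a URL-encoded query string."""
    query = urlencode(params)
    return f"{base}?{query}" if query else base


print(build_url("http://httpbin.org/get", {"name": "joe", "age": 18}))
# prints http://httpbin.org/get?name=joe&age=18
```

Values are converted to strings and percent-encoded where needed, so a space in a value becomes `+`.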

# GET request disguised as a browser by overriding the request headers
params = {'name': 'joe', 'age': 18}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'x-1': 'a',  # arbitrary custom header; httpbin echoes it back
}
r = requests.get('http://httpbin.org/get', params=params, headers=headers)
print(r.text)

# Visit Zhihu while disguised as a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
}
r = requests.get('http://www.zhihu.com', headers=headers)
print(r.text)

# JSON responses
r = requests.get('http://github.com/timeline.json')
print(r.text)
print(r.json())  # parse the body as JSON; raises an error if the body is not valid JSON
r = requests.get('http://www.mywebs.com/re_json')
print(r.text)
print(r.json())

# Download music with GET
r = requests.get('http://zhangmenshiting.qianqian.com/data2/music/7eabaab0083ec01885939fd370b0d7b2/540234984/540234984.mp3?xcode=ead875177ecf26be9614b4474386abae', stream=True)
# With stream=True the body is fetched lazily; reading r.content first would load the whole file into memory
with open('1.mp3', 'wb') as file:
    for chunk in r.iter_content(1024 * 10):  # write 10 KB at a time
        file.write(chunk)

# Download an image with GET
r = requests.get('https://bbs-fd.zoimg.com.cn/t_s1200x5000/g4/M09/00/0E/Cg4WlQJDSGIYuTxAAMGyp1H4wIAARKpgPNygIAAwbi579.jpg', stream=True)
# As with the music download, iterate over the body instead of reading r.content first
with open('1.jpg', 'wb') as file:
    for chunk in r.iter_content(1024 * 10):  # write 10 KB at a time
        file.write(chunk)
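
The chunked writing used in the two downloads can be tried without a network connection. The sketch below copies an in-memory stream in 10 KB pieces; `copy_in_chunks` and the sample payload are hypothetical, and the code illustrates the idea rather than how `requests` streams a response.

```python
import io


def copy_in_chunks(source, dest, chunk_size=10 * 1024):
    """Copy a binary stream piece by piece and return the bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        dest.write(chunk)
        total += len(chunk)
    return total


payload = b"x" * 25_000  # stand-in for a downloaded file
out = io.BytesIO()
written = copy_in_chunks(io.BytesIO(payload), out)
print(written)
```

Only one chunk is held in memory at a time, which is the reason to stream large files instead of reading them whole.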

# Cookies
r = requests.get('http://www.ibeifeng.com')
print(r.cookies)  # cookies set by this response

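# Setting a cookie
# Ask the server's setcookie endpoint for a cookie; the Session keeps it for later requests
s = requests.Session()
r = s.get('http://www.mywebs.com/setcookie')
print(r.text)
print(r.cookies)

# Check whether the cookie is present: the same Session sends it back automatically
r = s.get('http://www.mywebs.com/cookie')
print(r.text)
print(r.cookies)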

# Send a cookie explicitly to the test site
r = requests.get('http://httpbin.org/cookies', cookies={'name': 'joe'})
print(r.text)

r = requests.get('http://www.mywebs.com/cookie', cookies={'name': 'joe'})
print(r.text)
print(r.cookies)
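
Cookies travel in HTTP headers: the server sets one with a `Set-Cookie` response header, and the client returns it in a `Cookie` request header. The sketch below shows both directions with the standard library; the header values and the two helper functions are assumptions for illustration.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Extract name/value pairs from a Set-Cookie header value."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}


def build_cookie_header(cookies):
    """Build the Cookie request header value from a dict of cookies."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Hypothetical header as a server might send it.
received = parse_set_cookie("name=joe; Path=/")
print(received)
print(build_cookie_header({"name": "joe", "lang": "py"}))
```

Attributes such as `Path` describe when the cookie applies and are not sent back to the server, which is why only the name and value remain.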