1. Introduction to Web Crawlers
In recent years, as web applications have grown in scale and depth, efficiently harvesting data from the web has become a goal for countless companies and individuals. In the big-data era, whoever holds more data can extract more value, and the web crawler is the most common means of collecting that data from the web.
A web crawler, or Web Spider, is a vivid name: if the Internet is a spider web, then the Spider is the spider crawling around on it. A web spider locates pages through their links. Starting from one page of a site (usually the home page), it reads the page content, finds the other links in the page, follows those links to the next pages, and keeps looping until every page of the site has been fetched. If you treat the entire Internet as one website, a web spider can, by the same principle, fetch every page on the Internet.
- The value of crawlers
The most valuable thing on the Internet is data: product listings on Tmall, rental listings on Lianjia, securities data on Xueqiu, and so on. Such data is the hard currency of each industry, and whoever holds an industry's first-hand data effectively leads that industry. If the data of the entire Internet is a buried treasure, this crawler course teaches you how to mine it efficiently. Master crawling and, in effect, every Internet information company is supplying you with valuable data for free.
- The basic crawler workflow: fetch the page, parse out the data, persist the data (see the sketch after this list)
- The HTTP protocol
https://www.cnblogs.com/yuanchenqi/articles/8875623.html
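Every example later in this section follows the same three-step workflow: fetch the page, parse out the data, persist the data. A minimal sketch of that skeleton (the target URL and the regex are placeholders, not part of the original examples):

import re
import requests

def get_html(url):
    # Step 1: fetch the raw page
    return requests.get(url).text

def parse(html):
    # Step 2: parse out the pieces you care about (placeholder pattern)
    return re.findall(r'<h1>(.*?)</h1>', html, re.S)

def store(items):
    # Step 3: persist the results locally
    with open("result.txt", "a", encoding="utf8") as f:
        for item in items:
            f.write(item + "\n")

store(parse(get_html("http://httpbin.org/html")))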
2. The requests module
GET requests
- Basic syntax
Request methods supported by the requests module:
import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
- Requests with parameters
import requests

response = requests.get('https://s.taobao.com/search?q=手机')
response = requests.get('https://s.taobao.com/search', params={"q": "美女"})
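The two calls above are equivalent: requests URL-encodes the params dict into the query string for you. One way to confirm this, using httpbin.org (which simply echoes the request back):

import requests

res = requests.get('http://httpbin.org/get', params={'q': '手机'})
print(res.url)             # the query string is percent-encoded: ...?q=%E6%89%8B%E6%9C%BA
print(res.json()['args'])  # {'q': '手机'} -- the server saw the decoded value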
- Requests with headers
import requests

response = requests.get('https://dig.chouti.com/',
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
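Sites like dig.chouti.com refuse requests whose User-Agent does not look like a browser. To see exactly which headers the server received, httpbin.org/headers echoes them back; a quick sketch:

import requests

res = requests.get('http://httpbin.org/headers',
                   headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'})
print(res.json()['headers'])  # includes the User-Agent we just sent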
- Requests with cookies
import uuid
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))
res = requests.get(url, cookies=cookies)
print(res.json())
- requests.session
import requests

# res = requests.get("https://www.zhihu.com/explore")
# print(res.cookies.get_dict())
session = requests.session()
res1 = session.get("https://www.zhihu.com/explore")
print(session.cookies.get_dict())
res2 = session.get("https://www.zhihu.com/question/30565354/answer/463324517",
                   cookies={"abs": "123"})  # extra cookies can still be merged in per request
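The point of a session is that cookies set by one response are sent automatically on the next request. httpbin.org's /cookies/set endpoint makes this easy to observe; a small sketch:

import requests

s = requests.session()
s.get('http://httpbin.org/cookies/set?track=1')  # the server sets a cookie; the session stores it
res = s.get('http://httpbin.org/cookies')        # the cookie is sent back automatically
print(res.json())                                # {'cookies': {'track': '1'}}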
POST requests
- The data parameter
requests.post() is used exactly like requests.get(); the one difference is that requests.post() takes an extra data parameter, which holds the request-body data:

response = requests.post("http://httpbin.org/post",
                         params={"a": "10"},
                         data={"name": "yuan"})
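httpbin echoes query-string parameters under "args" and request-body data under "form", which makes the difference between params and data easy to see:

import requests

res = requests.post("http://httpbin.org/post", params={"a": "10"}, data={"name": "yuan"})
print(res.json()["args"])  # {'a': '10'}      -- from the query string
print(res.json()["form"])  # {'name': 'yuan'} -- from the request body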
- Sending JSON data
import requests

# With data= and no explicit request header, the default
# Content-Type is application/x-www-form-urlencoded
res1 = requests.post(url='http://httpbin.org/post', data={'name': 'yuan'})
print(res1.json())

# With json=, the default Content-Type is application/json
res2 = requests.post(url='http://httpbin.org/post', json={'age': "22"})
print(res2.json())
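To confirm the Content-Type difference, the same httpbin endpoint also echoes the request headers:

import requests

res1 = requests.post('http://httpbin.org/post', data={'name': 'yuan'})
res2 = requests.post('http://httpbin.org/post', json={'age': "22"})
print(res1.json()['headers']['Content-Type'])  # application/x-www-form-urlencoded
print(res2.json()['headers']['Content-Type'])  # application/json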
The response object
1. Common attributes
import requests

response = requests.get('https://sh.lianjia.com/ershoufang/')

# Common response attributes
print(response.text)                 # body decoded to str
print(response.content)              # raw body as bytes
print(response.status_code)          # e.g. 200
print(response.headers)              # response headers
print(response.cookies)              # cookie jar
print(response.cookies.get_dict())   # cookies as a plain dict
print(response.cookies.items())      # cookies as (name, value) pairs
print(response.url)                  # final URL, after any redirects
print(response.history)              # redirect chain: a list of Response objects
print(response.encoding)             # encoding used to decode .text
2. Encoding issues
import requests

response = requests.get('http://www.autohome.com/news')
# The autohome.com page is encoded as gb2312, while requests defaults to
# ISO-8859-1 here; without the next line the Chinese text comes out garbled.
# response.encoding = 'gbk'
with open("res.html", "w") as f:
    f.write(response.text)
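If you would rather not hard-code the codec, requests can guess it from the page body via response.apparent_encoding (character-set detection); a sketch of that variant:

import requests

response = requests.get('http://www.autohome.com/news')
response.encoding = response.apparent_encoding  # detected from the body, e.g. 'GB2312'
with open("res.html", "w", encoding="utf8") as f:
    f.write(response.text)  # now decoded correctly, re-saved as UTF-8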
3. Downloading binary files (images, audio, video)
import requests

response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg',
                        stream=True)  # stream=True keeps the body out of memory until we iterate
with open("res.png", "wb") as f:
    # f.write(response.content)
    # For, say, a 100 GB video, reading response.content and writing it out
    # in one go is unreasonable; iterate over the body in chunks instead.
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
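With stream=True the body is not fetched until you iterate over it, so you can inspect the size up front via the Content-Length response header (when the server supplies one); a sketch:

import requests

response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg',
                        stream=True)
size = response.headers.get('Content-Length')  # may be absent, e.g. on chunked responses
print("expecting about %s bytes" % size)
with open("res.jpg", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)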
4. Parsing JSON data
import requests
import json

response = requests.get('http://httpbin.org/get')
res1 = json.loads(response.text)  # the tedious way
res2 = response.json()            # get the JSON data directly
print(res1 == res2)               # True
5. Redirection and History
By default, Requests handles all redirects automatically for every method except HEAD. You can use the response object's history attribute to trace them: Response.history is a list of the Response objects that were created in order to complete the request, sorted from oldest to most recent.
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
Redirect handling can also be disabled with the allow_redirects parameter:
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
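HEAD is the one exception noted above: requests.head() does not follow redirects unless you ask it to:

>>> r = requests.head('http://github.com')
>>> r.status_code
301
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r.url
'https://github.com/'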
3. Application Examples
1. Simulating a GitHub login and fetching the logged-in page
import requests
import re

# Request 1: fetch the login page
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # grab the initial (not yet authorized) cookie
# Pull the CSRF token out of the page
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]
print("authenticity_token", authenticity_token)

# Request 2: POST to the login endpoint with the initial cookie,
# the token, and the account credentials
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'yuanchenqi0316@163.com',
    'password': 'yuanchenqi0316',
}
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie,
                   # allow_redirects=False
                   )
print(r2.status_code)      # 200
print(r2.url)              # the page after the redirect: https://github.com/
print(r2.history)          # the response before the redirect: [<Response [302]>]
print(r2.history[0].text)  # the body of the pre-redirect response

with open("result.html", "wb") as f:
    f.write(r2.content)
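The same login flow is slightly cleaner with requests.session(), since the cookie from request 1 is carried into request 2 automatically; a sketch under the same form-field assumptions as above:

import re
import requests

session = requests.session()
r1 = session.get('https://github.com/login')  # the initial cookie is stored on the session
token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': token,
    'login': 'yuanchenqi0316@163.com',
    'password': 'yuanchenqi0316',
}
r2 = session.post('https://github.com/session', data=data)  # no explicit cookies= needed
print(r2.status_code)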
2. Crawling movie data from Douban
import requests
import re
import json
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(50)

def getPage(url):
    response = requests.get(url)
    return response.text

def parsePage(res):
    # re.S lets '.' match newlines, so the pattern can span a whole item block
    com = re.compile(r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+)'
                     r'.*?<span class="title">(?P<title>.*?)</span>'
                     r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
                     r'.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
    iter_result = com.finditer(res)
    return iter_result

def gen_movie_info(iter_result):
    for i in iter_result:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }

def stored(gen):
    with open("move_info.txt", "a", encoding="utf8") as f:
        for line in gen:
            data = json.dumps(line, ensure_ascii=False)
            f.write(data + "\n")

def spider_movie_info(url):
    res = getPage(url)
    iter_result = parsePage(res)
    gen = gen_movie_info(iter_result)
    stored(gen)

def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    pool.submit(spider_movie_info, url)
    # spider_movie_info(url)

if __name__ == '__main__':
    before = time.time()
    count = 0
    for i in range(10):
        main(count)
        count += 25
    pool.shutdown(wait=True)  # wait for all submitted pages to finish before timing
    after = time.time()
    print("total time elapsed:", after - before)
Personal practice:
# Using the requests module

# A plain request
import requests

# res = requests.get("https://www.jd.com/")
# # print(res)
# print(type(res))  # <class 'requests.models.Response'> -- a Response object is returned
# with open("jd.html", 'w', encoding="utf8") as f:
#     f.write(res.text)
# print(res.text)

# GET request parameters: Taobao returns static product pages matching the params
# res = requests.get("https://s.taobao.com/search", params={"q": "美女"})
# # print(res.text)
# with open("taobao.html", 'w', encoding="utf8") as f:
#     f.write(res.text)

# With a header parameter: chouti.com requires a browser request header
# to prove the request comes from a browser
# res = requests.get("https://dig.chouti.com/",
#                    headers={
#                        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
#                    })
# with open("touti.html", 'w', encoding="utf8") as f:
#     f.write(res.text)

# A request with cookies
# http://httpbin.org/cookies echoes back the cookie parameters you set
# import json
# url = "http://httpbin.org/cookies"
# res = requests.get(url, cookies={"c1": "123"})
# print(res)
# print(res.text)
# # Convert to JSON format
# print(res.json())

# POST request: the data parameter
url = "http://httpbin.org/post"
# res = requests.post(url, params={"a": "123"}, data={"data": "12:00"})
# print(res.text)

# POST request: the json parameter
# res = requests.post(url, params={"a": 123}, json={"json": "321"})
# print(res.text)

# The response object
# res = requests.get("https://sh.lianjia.com/ershoufang/")
# print(res.text)
# with open("lianjia.html", 'w', encoding="utf8") as f:
#     f.write(res.text)
# print(res.content)       # the body as bytes
# print(res.status_code)   # the status code
# print(res.headers)       # the response headers
# print(res.cookies)       # the response cookies

# autohome.com
# response = requests.get("http://www.autohome.com/news")
# with open("qichezhijia.html", 'w', encoding="utf8") as f:
#     f.write(response.text)

# Crawling images of 小泽玛利亚
# res = requests.get("http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg")
# with open("xzmly.png", 'wb') as f:
#     # For a 100 GB video, writing response.content out in one go is unreasonable
#     # f.write(res.content)
#     for line in res.iter_content():
#         f.write(line)

# Parsing JSON data
# res = requests.get('http://httpbin.org/get')
# print(res.text)
# import json
# print(json.loads(res.text))
# print(res.json())

# Redirection and History
# res = requests.get('http://github.com', allow_redirects=False)
# # print(res.text)
# with open("githup.html", "w", encoding="utf8") as f:
#     f.write(res.text)
# # print(res.history)
# print(res.url)
Homework: crawl the Douban movie site
import requests
import re

# response = requests.get("https://movie.douban.com/top250")
# print(response)
# print(response.text)

# Fetch the URL
def get_html(url):
    response = requests.get(url)
    return response

# Extract the tags with a regex; re.S makes '.' match every character, including newlines
def parser_html(response):
    # Match out the detail-page link, title, rating and number of reviewers
    res = re.findall(r'<div class="item">.*?<a href=(.*?)>.*?<span class="title">(.*?)</span>'
                     r'.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(\d+)人评价',
                     response.text, re.S)
    print(res)
    return res

# Persist the data in text mode
def store(ret):
    with open("douban.text", "a", encoding="utf8") as f:
        for item in ret:
            f.write(" ".join(item) + "\n")

# The three steps of a crawler
def do_spider(url):
    # 1. Fetch the resource
    # "https://movie.douban.com/top250"
    response = get_html(url)
    # 2. Parse the resource
    ret = parser_html(response)
    # 3. Persist the data
    store(ret)

# Kick off the crawler
def main():
    import time
    c = time.time()
    # starting page offset
    count = 0
    for i in range(10):
        # Douban's pagination follows this start/filter pattern
        url = "https://movie.douban.com/top250?start=%s&filter=" % count
        do_spider(url)
        count += 25
    print(time.time() - c)

if __name__ == '__main__':
    main()