Author: 挖数
Link: https://www.zhihu.com/question/20899988/answer/96904827
Source: Zhihu
Copyright belongs to the author. For reprints, please contact the author for permission.
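Below is my monster-slaying, level-up journey through learning Python web scraping. It was full of hardship and full of fun. I haven't beaten the final boss yet, but the scenery along the way is the best part, isn't it? I hope everyone gets what they came for!
Warning: lots of images ahead!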
<img src="https://i-blog.csdnimg.cn/blog_migrate/1093aa8f904138ea5f019ab19b814d07.png" data-rawwidth="288" data-rawheight="179" class="content_image" width="288">
<img src="https://i-blog.csdnimg.cn/blog_migrate/8575f84e79c6e2ba7d8892b632d931fe.png" data-rawwidth="242" data-rawheight="268" class="content_image" width="242">
<img src="https://i-blog.csdnimg.cn/blog_migrate/33571496fb02a58a8ec21f6756ad9a5f.png" data-rawwidth="254" data-rawheight="230" class="content_image" width="254">
<img src="https://i-blog.csdnimg.cn/blog_migrate/940aa16f7d2ca820130197e26ffed7cb.png" data-rawwidth="278" data-rawheight="320" class="content_image" width="278">
<img src="https://i-blog.csdnimg.cn/blog_migrate/5a40045b64811c39b188a7aeceac04c0.png" data-rawwidth="309" data-rawheight="318" class="content_image" width="309">
<img src="https://i-blog.csdnimg.cn/blog_migrate/aa8c35644beaba3d15ea248ff918f381.png" data-rawwidth="313" data-rawheight="264" class="content_image" width="313">
<img src="https://i-blog.csdnimg.cn/blog_migrate/49dd51ba59be8a8423bdf2afe0486da2.png" data-rawwidth="266" data-rawheight="240" class="content_image" width="266">
<img src="https://i-blog.csdnimg.cn/blog_migrate/605cd310e07d5d254b1477329daccd53.png" data-rawwidth="269" data-rawheight="246" class="content_image" width="269">
<img src="https://i-blog.csdnimg.cn/blog_migrate/c136245ee5db3454ad0e80d21b6c788f.png" data-rawwidth="299" data-rawheight="254" class="content_image" width="299">
<img src="https://i-blog.csdnimg.cn/blog_migrate/d8001d6b6bc5e99e33b25aeaf288f2d2.png" data-rawwidth="212" data-rawheight="266" class="content_image" width="212">
<img src="https://i-blog.csdnimg.cn/blog_migrate/c95e485334a777ac213b4c8ae7be6c11.png" data-rawwidth="313" data-rawheight="266" class="content_image" width="313">
<img src="https://i-blog.csdnimg.cn/blog_migrate/e666981e62186a79833136fa71ca11c8.png" data-rawwidth="304" data-rawheight="232" class="content_image" width="304">
<img src="https://i-blog.csdnimg.cn/blog_migrate/b2c172c283fdd55c09de6eba900428ad.png" data-rawwidth="287" data-rawheight="234" class="content_image" width="287">
<img src="https://i-blog.csdnimg.cn/blog_migrate/2c42b7414cf8f68e792b095b244698df.png" data-rawwidth="325" data-rawheight="354" class="content_image" width="325">
<img src="https://i-blog.csdnimg.cn/blog_migrate/6198b6e00146e3fcf91595dd43827c75.png" data-rawwidth="289" data-rawheight="243" class="content_image" width="289">
<img src="https://i-blog.csdnimg.cn/blog_migrate/845ff44713e9b3a7adeafef3961ac4c0.png" data-rawwidth="309" data-rawheight="189" class="content_image" width="309">
<img src="https://i-blog.csdnimg.cn/blog_migrate/2132b1e7c83dad3fae7621e8017e8216.png" data-rawwidth="266" data-rawheight="346" class="content_image" width="266">
<img src="https://i-blog.csdnimg.cn/blog_migrate/bce460fc2d3bc9e7a651229ea157e8cd.png" data-rawwidth="338" data-rawheight="269" class="content_image" width="338">
<img src="https://i-blog.csdnimg.cn/blog_migrate/cdb2b84ec74a6c4a092a1b62f55194d4.png" data-rawwidth="255" data-rawheight="175" class="content_image" width="255">
Here is a snippet that scrapes Zhihu avatars:
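import random
import re
import urllib.request
from time import sleep

import requests


def main():
    # NOTE: the topic URL appears to have been replaced by the page title when
    # this post was copied; substitute the URL of the topic page to scrape.
    url = ' 知乎 - 与世界分享你的知识、经验和见解'
    # I have a feeling this topic has a lot of pretty girls
    headers = {}  # omitted in the original; supply your own request headers
    i = 1
    for x in range(20, 3600, 20):
        data = {'start': '0',
                'offset': str(x),
                '_xsrf': 'a128464ef225a69348cef94c38f4e428'}
        # Zhihu uses offset to control how much is loaded; each response loads 20
        content = requests.post(url, headers=headers, data=data, timeout=10).text
        # Submit the form data with POST
        imgs = re.findall(r'<img src=\\"(.*?)_m\.jpg', content)
        # Pull image URLs out of the returned JSON with a regex; dropping _m gives the large image
        for img in imgs:
            try:
                img = img.replace('\\', '')
                # Strip the backslash characters that get in the way
                pic = img + '.jpg'
                path = 'd:\\bs4\\zhihu\\jpg\\' + str(i) + '.jpg'
                # Destination path and file name
                urllib.request.urlretrieve(pic, path)
                # Download the image
                print('Downloaded image ' + str(i))
                i += 1
                sleep(random.uniform(0.5, 1))
                # Sleep so we don't crawl too fast and get our IP banned
            except Exception:
                print('Missed 1 image')
        sleep(random.uniform(0.5, 1))


if __name__ == '__main__':
    main()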
Result:
<img src="https://i-blog.csdnimg.cn/blog_migrate/4bfd8c4aa9dc8fcc4be9b39f4379c285.png" data-rawwidth="710" data-rawheight="744" class="origin_image zh-lightbox-thumb" width="710" data-original="https://pic2.zhimg.com/b1fc67ee3e290376fe882113ff7d44fd_r.png">
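The regex step in the code above is the part most likely to trip people up, so here is a minimal offline sketch of it. The `SAMPLE` string, the `abc123`/`def456` file names, and the helper name `extract_avatar_urls` are all made up for illustration; the real Zhihu response is not shown in this post and may well look different. The idea is just that the HTML arrives inside JSON, so every quote and slash is backslash-escaped, and the pattern has to match a literal backslash before the quote.

```python
import re

# Hypothetical fragment of a JSON-escaped response (not real Zhihu output).
SAMPLE = (r'{"msg": ["<img src=\"https:\/\/pic1.zhimg.com\/abc123_m.jpg\">", '
          r'"<img src=\"https:\/\/pic2.zhimg.com\/def456_m.jpg\">"]}')

# `\\"` matches a literal backslash followed by a double quote.
AVATAR_PATTERN = re.compile(r'<img src=\\"(.*?)_m\.jpg')


def extract_avatar_urls(content):
    """Return large-image URLs found in a JSON-escaped HTML string."""
    urls = []
    for stem in AVATAR_PATTERN.findall(content):
        # Drop the JSON escape backslashes, then re-append the extension
        # without the `_m` suffix to point at the larger image.
        urls.append(stem.replace('\\', '') + '.jpg')
    return urls


if __name__ == '__main__':
    for url in extract_avatar_urls(SAMPLE):
        print(url)
```

Testing the pattern against a saved response like this, before pointing the loop at the live site, saves a lot of wasted requests.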
Finally, please follow me. I'll take good care of your timeline \( ^▽^ )/