Python Study Notes, Part 9 (Advanced Web Crawling)

xuanjat

Article link: https://blog.csdn.net/xuanjat/article/details/96489954

Friday, 2019-07-19 09:10:39

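Avoiding Crawler Blocks: Crawling Through a Proxy Server

Lesson outline

Homework review
What is a proxy server
Hands-on: crawling web pages through a proxy server

Homework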

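# Crawl every article linked from the CSDN blog home page
import os
import re
import urllib.request

url = "http://blog.csdn.net/"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # install the opener globally so urlopen and urlretrieve both send the header
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
# Capture the link inside each <h3 class="company_name"> heading
pat = '<h3 class="company_name"><a href="(.*?)"'
result = re.compile(pat).findall(data)
os.makedirs("F:/PL/csdnblog", exist_ok=True)  # urlretrieve does not create the target folder
for i, link in enumerate(result):
    file = "F:/PL/csdnblog/" + "CSDN blog article " + str(i) + ".html"
    urllib.request.urlretrieve(link, filename=file)
    print("Fetch " + str(i) + " succeeded")

Fetch 0 succeeded
Fetch 1 succeeded
Fetch 2 succeeded
Fetch 3 succeeded
Fetch 4 succeeded
Fetch 5 succeeded ...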
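
The link-extraction step in the homework can be checked without touching the network. The sketch below applies the same pattern to a small hand-written HTML snippet. The function name `extract_article_links` and the sample markup are hypothetical, introduced here only for illustration, and the live CSDN page structure may differ.

```python
import re
from urllib.parse import urljoin

# Same pattern as the homework listing: the href inside each company_name heading.
ARTICLE_PAT = re.compile(r'<h3 class="company_name"><a href="(.*?)"')


def extract_article_links(html, base_url="http://blog.csdn.net/"):
    """Return absolute article links in page order, without duplicates."""
    seen = set()
    links = []
    for href in ARTICLE_PAT.findall(html):
        full = urljoin(base_url, href)  # resolve relative links against the page URL
        if full not in seen:
            seen.add(full)
            links.append(full)
    return links


# Hypothetical sample markup, not copied from the live site.
sample = (
    '<h3 class="company_name"><a href="https://blog.csdn.net/a/article/details/1">A</a></h3>'
    '<h3 class="company_name"><a href="/b/article/details/2">B</a></h3>'
    '<h3 class="company_name"><a href="https://blog.csdn.net/a/article/details/1">A</a></h3>'
)
print(extract_article_links(sample))
```

Resolving links with `urljoin` and dropping duplicates keeps the later download loop from failing on relative links or saving the same article twice.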

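What Is a Proxy Server

A proxy server is a server that sits between us and the internet. When we browse through one, we send each request to the proxy server, the proxy server fetches the information from the internet, and it then passes the result back to us.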

# Fetch the CSDN blog home page again, directly and without a proxy
import re
import urllib.request

url = "http://blog.csdn.net/"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # install the opener globally
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")

HTTPError: HTTP Error 503: Too many open connections
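
One common response to an error like the one above is to send the request through a proxy server instead of connecting directly. The sketch below uses `urllib.request.ProxyHandler` from the standard library. The helper names `build_proxy_opener` and `fetch_via_proxy` are hypothetical, and the address `127.0.0.1:8888` is only a placeholder: substitute a proxy you are actually allowed to use. Whether a proxy avoids the 503 depends on the site.

```python
import urllib.request


def build_proxy_opener(proxy_addr, user_agent="Mozilla/5.0"):
    """Return an opener that routes HTTP and HTTPS requests through proxy_addr."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
    opener = urllib.request.build_opener(proxy)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch_via_proxy(url, proxy_addr, timeout=10):
    """Download url through the given proxy and return the decoded page text."""
    opener = build_proxy_opener(proxy_addr)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", "ignore")


# Example call (needs a reachable proxy; 127.0.0.1:8888 is only a placeholder):
# html = fetch_via_proxy("http://blog.csdn.net/", "127.0.0.1:8888")
```

Using a dedicated opener rather than `install_opener` keeps the proxy setting local to these calls, so other code in the same program still connects directly.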

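Image Crawlers

What is an image crawler

An image crawler is a crawler program that automatically downloads images from a remote server on the internet.

Example: crawling Taobao images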

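# Search URL found by packet capture; crawl the images on the result pages
import os
import re
import urllib.request

keyname = "连衣裙"  # search keyword ("dress")
key = urllib.request.quote(keyname)  # URL-encode the keyword
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # without this, urlopen ignores the custom header
os.makedirs("F:/PL/taobaoimg", exist_ok=True)
print("ok")
for i in range(0, 3):
    # Assumption: results advance 44 items per page, as the s=44 offset in the captured URL suggests;
    # the tracking parameters of the captured URL (spm, acm, scm, ...) are dropped
    url = "https://s.taobao.com/search?q=" + key + "&s=" + str(i * 44)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    # Assumption: the page embeds fields of the form "pic_url":"//host/path";
    # the original pattern 'pic_url(.*?)' could only ever capture empty strings
    pat = '"pic_url":"//(.*?)"'
    imagelist = re.compile(pat).findall(data)
    print("page " + str(i) + ": found " + str(len(imagelist)) + " images")
    for j in range(0, len(imagelist)):
        thisimg = imagelist[j]
        thisimgurl = "http://" + thisimg
        file = "F:/PL/taobaoimg/" + "dress style " + str(i) + str(j) + ".jpg"
        urllib.request.urlretrieve(thisimgurl, filename=file)
        print("Saved image " + str(i) + str(j))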
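
The two steps that matter in the Taobao example, building a search URL per page and pulling image addresses out of the response, can be tried offline. This is a minimal sketch under two assumptions that are not confirmed by the notes: that the `s` parameter advances by 44 per page (suggested by `s=44` in the captured URL), and that the page embeds fields shaped like `"pic_url":"//host/path"`. The names `PAGE_SIZE`, `build_search_url`, `extract_image_urls`, and the sample text are hypothetical.

```python
import re
from urllib.parse import quote

PAGE_SIZE = 44  # assumption: offset step per result page


def build_search_url(keyword, page):
    """Build the search URL for a zero-based page number."""
    return "https://s.taobao.com/search?q=" + quote(keyword) + "&s=" + str(page * PAGE_SIZE)


def extract_image_urls(page_text):
    """Return full image URLs from fields shaped like "pic_url":"//host/path"."""
    return ["http://" + rest for rest in re.findall(r'"pic_url":"//(.*?)"', page_text)]


# Hypothetical response fragment, not real Taobao data.
sample = '{"pic_url":"//img.example.com/a.jpg","title":"x"},{"pic_url":"//img.example.com/b.jpg"}'
print(build_search_url("连衣裙", 1))
print(extract_image_urls(sample))
```

The non-greedy group needs the closing quote after it: a pattern that ends in `(.*?)` with nothing following matches the empty string every time.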

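Packet Capture Analysis in Practice

Lesson outline

Homework review
Overview of packet capture analysis
Using Fiddler for packet capture analysis
Capturing HTTPS packets
Crawling Tencent Video comments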

# Crawl images from 58pic (Qiantu)
import re
import urllib.error  # imported explicitly because the except clause below uses it
import urllib.request

# Set up the opener once instead of on every page
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # install the opener globally
for i in range(1, 3):
    # pageurl = "https://www.58pic.com/piccate/17-280-0-p" + str(i) + ".html"
    pageurl = "http://www.58pic.com/piccate/11-0-0-p" + str(i) + ".html"
    data = urllib.request.urlopen(pageurl).read().decode("utf-8", "ignore")
    print("---- success ----")
    # Capture the thumbnail address up to ".jpg!" (the dot is escaped so it matches literally)
    pat = r'class="thumb-box".*?src="(.*?)\.jpg!'
    imglist = re.compile(pat).findall(data)
    print(imglist)
    for j in range(0, len(imglist)):
        try:
            thisimg = imglist[j]
            # thisimgurl = thisimg + "_small"
            thisimgurl = "http:" + thisimg + ".jpg!w1024_0.jpg"
            print(thisimgurl)
            file = "F:/sinanews/32/" + str(i) + str(j) + ".jpg"
            urllib.request.urlretrieve(thisimgurl, filename=file)
            print("Page " + str(i) + ", image " + str(j) + " fetched successfully")
        except urllib.error.URLError as e:
            # HTTPError carries a status code; a plain URLError only has a reason
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(e)

Update

# Crawl images from 58pic (Qiantu), updated to use HTTPS and the "_small" image suffix
import re
import urllib.error  # imported explicitly because the except clause below uses it
import urllib.request

# Set up the opener once instead of on every page
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # install the opener globally
for i in range(1, 3):
    # pageurl = "https://www.58pic.com/piccate/17-280-0-p" + str(i) + ".html"
    pageurl = "https://www.58pic.com/piccate/11-0-0-p" + str(i) + ".html"
    data = urllib.request.urlopen(pageurl).read().decode("utf-8", "ignore")
    print("---- success ----")
    # Capture the thumbnail address up to ".jpg!" (the dot is escaped so it matches literally)
    pat = r'class="thumb-box".*?src="(.*?)\.jpg!'
    imglist = re.compile(pat).findall(data)
    print(imglist)
    for j in range(0, len(imglist)):
        try:
            thisimg = imglist[j]
            # thisimgurl = thisimg + "_small"
            thisimgurl = "https:" + thisimg + ".jpg!w1024_small"
            print(thisimgurl)
            file = "F:/PL/qiantuimg/" + str(i) + str(j) + ".jpg"
            urllib.request.urlretrieve(thisimgurl, filename=file)
            print("Page " + str(i) + ", image " + str(j) + " fetched successfully")
        except urllib.error.URLError as e:
            # HTTPError carries a status code; a plain URLError only has a reason
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(e)
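
The try/except structure in the listing keeps one failed image from stopping the whole crawl. A possible way to reuse that pattern is to factor it into a helper that records failures instead of printing them. This is only a sketch: `download_all` and its `fetch` parameter are hypothetical names, and the dry run swaps in a fake fetch function and example URLs so it runs without network access.

```python
import os
import tempfile
import urllib.error
import urllib.request


def download_all(urls, folder, fetch=urllib.request.urlretrieve):
    """Download each URL into folder; return (saved_paths, failures)."""
    os.makedirs(folder, exist_ok=True)
    saved, failed = [], []
    for index, url in enumerate(urls):
        path = os.path.join(folder, str(index) + ".jpg")
        try:
            fetch(url, filename=path)
            saved.append(path)
        except urllib.error.URLError as e:
            # HTTPError (a URLError subclass) has .code; a plain URLError has .reason
            failed.append((url, getattr(e, "code", None) or getattr(e, "reason", None)))
        except Exception as e:
            failed.append((url, e))
    return saved, failed


def fake_fetch(url, filename):
    """Stand-in for urlretrieve used in the dry run below: no network access."""
    if "bad" in url:
        raise urllib.error.URLError("unreachable")
    with open(filename, "wb") as f:
        f.write(b"placeholder")


# Dry run with hypothetical URLs and a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved, failed = download_all(
        ["http://ok.example/1.jpg", "http://bad.example/2.jpg"], tmp, fetch=fake_fetch
    )
    print(len(saved), failed)
```

For a real crawl, leave `fetch` at its default and pass the list of image URLs built from the page, for example `download_all(image_urls, "F:/PL/qiantuimg")`.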
