Learning Web Crawlers (Part 1): Basic Structure

Latest recommended article published on 2024-07-17 23:50:36

又见智能商业

Category: web crawlers. Tags: crawler

Article link: https://blog.csdn.net/livan1234/article/details/80850555


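I am a learner fascinated by uncovering the value hidden in data, and I hope to keep digging for that value and its secrets in my everyday work and study. I believe the value of data is not limited to companies: individuals can feel its appeal too, using technology to decode behavior and letting big data give everyone a boost. Fellow learners are welcome to follow my WeChat public account so we can discuss the interesting things hidden in data.

My public account is: livandata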

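A web crawler is an essential tool for acquiring data: it pulls the data we need out of web pages. Because the technical barrier is fairly low and the payoff is obvious (large volumes of data can be collected), many people learn it, so here is a short introduction.

1. The urllib library: supports crawling both HTTP and HTTPS URLs (HTTPS requires a Python build with SSL support, which standard builds normally have).

A small Douban example: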

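#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import re

# Download the publisher list page and decode it as UTF-8.
data = urllib.request.urlopen("https://read.douban.com/provider/all").read()
data = data.decode("utf-8")

# Capture the text inside each <div class="name">...</div>.
pattern = '<div class="name">(.*?)</div>'
mydata = re.compile(pattern).findall(data)

# Write one publisher name per line ("出版社" means "publishers").
with open("出版社.txt", "w", encoding="utf-8") as fh:
    for name in mydata:
        fh.write(name + "\n")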
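The extraction step can be tried without touching the network. The following is a minimal sketch, assuming a small hand-written HTML fragment; `sample_html` and the helper `extract_names` are made up for illustration and do not come from the Douban page.

```python
import re

# Hypothetical HTML fragment shaped like the markup the pattern above targets.
sample_html = (
    '<div class="name">Publisher A</div>'
    '<div class="other">ignored</div>'
    '<div class="name">Publisher B</div>'
)

def extract_names(html):
    """Return the text of every <div class="name"> element, in page order."""
    # (.*?) is non-greedy, so each match stops at the first closing </div>.
    return re.findall(r'<div class="name">(.*?)</div>', html)

print(extract_names(sample_html))
```

A regular expression is enough for flat markup like this; nested or irregular markup is usually easier to handle with an HTML parser.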

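Common functions:

import urllib.request

# I. Common functions

# 1. Download the URL given as the first argument straight to the path in filename; the saved data is a web page.
#    A raw string keeps the backslashes in the Windows path from being read as escapes.
data = urllib.request.urlretrieve("http://www.hellobi.com", filename=r"F:\python_workspace\spider_douban")

# 2. Clean up temporary files that urlretrieve and similar calls may have left behind.
urllib.request.urlcleanup()

# 3. Fetch a page.
file = urllib.request.urlopen("http://www.hellobi.com")

# 4. Return the response metadata (the HTTP headers).
file.info()

# 5. Get the status code and the URL of the current page.
print(file.getcode())
print(file.geturl())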

2. Timeout settings:

Pass the timeout parameter (in seconds) to urlopen.

file = urllib.request.urlopen("http://www.hellobi.com", timeout=10)

for i in range(0, 100):
    try:
        file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred: " + str(e))
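The loop above retries the same URL and prints any failure. One way to package that idea is sketched below, under stated assumptions: `fetch_with_retry` and `flaky_opener` are hypothetical names, and the stand-in opener exists only so the sketch runs offline.

```python
import io
import urllib.error
import urllib.request

def fetch_with_retry(url, retries=3, timeout=5, opener=urllib.request.urlopen):
    """Try to download url up to `retries` times; return bytes, or None."""
    for attempt in range(1, retries + 1):
        try:
            # timeout is in seconds and applies to blocking socket operations.
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            print("attempt %d failed: %s" % (attempt, exc))
    return None

# Offline demonstration: a stand-in opener that fails twice, then succeeds.
attempts = []

def flaky_opener(url, timeout):
    attempts.append(timeout)
    if len(attempts) < 3:
        raise urllib.error.URLError("timed out")
    return io.BytesIO(b"page body")

result = fetch_with_retry("http://example.invalid/", retries=3, timeout=1,
                          opener=flaky_opener)
print(result, len(attempts))
```

In real use the `opener` argument would be left at its default, `urllib.request.urlopen`.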

3. Simulating HTTP requests automatically:

Handling a GET request:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.parse
import urllib.request

keywd = "python"

# Percent-encode the keyword so that non-ASCII text (such as Chinese) is safe to put in the URL.
keywd = urllib.parse.quote(keywd)

url = "http://www.baidu.com/s?wd=" + keywd + "&ie=utf-8&tn=96542061_hao_pg"

# Wrap the URL in a Request object.
req = urllib.request.Request(url)

data = urllib.request.urlopen(req).read()

# Save the raw response bytes.
with open("test.txt", "wb") as fh:
    fh.write(data)
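As an alternative to concatenating the query string by hand, `urllib.parse.urlencode` can build it from a dictionary and percent-encode every value. A minimal sketch, assuming the hypothetical helper name `build_search_url` and only the `wd` and `ie` parameters (the `tn` parameter from the example above is left out):

```python
from urllib.parse import urlencode

def build_search_url(keyword):
    """Build the search URL; urlencode percent-encodes each value."""
    params = {"wd": keyword, "ie": "utf-8"}
    return "http://www.baidu.com/s?" + urlencode(params)

print(build_search_url("python"))
print(build_search_url("爬虫"))  # a Chinese keyword is encoded as UTF-8 bytes
```

Building the query from a dictionary makes it hard to forget an `=` or an `&` between parameters.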

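Handling a POST request:

In this case the HTML form declares method="post".

The login form at login.sina.com.cn is one example: it submits with POST.

All we need from the form is the name attribute of each input field; those names become the keys of the data we submit.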

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.parse

url = "http://www.iqianyue.com/mypost/"

# Form data: each key passed to urlencode matches the name attribute of a form field.
mydata = urllib.parse.urlencode({
    "name": "ceo@iqianyue.com",
    "pass": "123456",
}).encode("utf-8")

# Turn the data into a request; supplying data makes it a POST.
req = urllib.request.Request(url, mydata)

# Send the request.
data = urllib.request.urlopen(req).read()

with open("test_post.txt", "wb") as fh:
    fh.write(data)
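To see what urllib will actually send, the request object can be inspected before any network call is made. A small sketch reusing the form values from the example above; nothing is transmitted here.

```python
import urllib.parse
import urllib.request

form = {"name": "ceo@iqianyue.com", "pass": "123456"}
body = urllib.parse.urlencode(form).encode("utf-8")

post_req = urllib.request.Request("http://www.iqianyue.com/mypost/", data=body)
get_req = urllib.request.Request("http://www.iqianyue.com/mypost/")

# A Request with data defaults to POST; without data it defaults to GET.
print(post_req.get_method(), get_req.get_method())
# The body is the URL-encoded form, as bytes.
print(post_req.data)
```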

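4. Exception handling in crawlers:

Exception handling mainly makes the code more robust.

URLError is raised when:

1. the server cannot be reached;
2. the remote URL does not exist;
3. there is no local network connection;
4. its HTTPError subclass is triggered (the server answered with an error status code).

A practical example: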

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.error
import urllib.request

try:
    urllib.request.urlopen("http://blog.csdssn.net")
    print("111")
except urllib.error.URLError as e:
    # HTTPError carries a status code; a plain URLError has only a reason.
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
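Instead of probing with `hasattr`, the two exception types can be told apart with `isinstance`. The sketch below is one possible arrangement, with the hypothetical helper `describe_error`; the exceptions are built by hand with a placeholder URL, so no request is sent.

```python
import io
import urllib.error

def describe_error(exc):
    """Summarize a urllib exception as a short string."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: %s" % (exc.reason,)
    return "other error: %s" % (exc,)

# Exceptions constructed for illustration only.
http_exc = urllib.error.HTTPError(
    "http://example.invalid/", 404, "Not Found", None, io.BytesIO())
url_exc = urllib.error.URLError("no route to host")

print(describe_error(http_exc))
print(describe_error(url_exc))
```

The same ordering applies to `except` clauses: `except urllib.error.HTTPError` must come before `except urllib.error.URLError`, otherwise the second clause is never reached for HTTP errors.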

5. Disguising the crawler (as a browser):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.error
import urllib.request

url = "http://blog.csdn.net/weiwei_pig/article/details/52123738"

# A header is a (name, value) tuple.
header = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")

# The opener is used to attach the header to outgoing requests.
opener = urllib.request.build_opener()

opener.addheaders = [header]

data = opener.open(url).read()

with open("test_header", "wb") as fh:
    fh.write(data)
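Headers can also be set on a single request rather than on the opener. A minimal sketch, assuming the hypothetical helper `make_request`; it only builds the request and inspects it, without sending anything.

```python
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) "
              "Gecko/20100101 Firefox/53.0")

def make_request(url):
    """Return a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = make_request("http://blog.csdn.net/")
# Request stores header names via str.capitalize(), hence "User-agent" here.
print(req.get_header("User-agent"))
```

Note that the header name uses a hyphen, `User-Agent`; a name written with an underscore is a different header and will not be recognized as the user agent.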

6. Crawling a news site:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.error
import urllib.request
import re

data = urllib.request.urlopen("http://news.sina.com.cn/").read()

# Ignore bytes that are not valid UTF-8 instead of failing.
data2 = data.decode("utf-8", "ignore")

# Capture every link that points back into news.sina.com.cn.
pat = 'href="(http://news.sina.com.cn/.*?)">'

allurl = re.compile(pat).findall(data2)

for i in range(0, len(allurl)):
    try:
        print("Crawl number " + str(i))
        thisurl = allurl[i]
        file = str(i) + ".html"
        urllib.request.urlretrieve(thisurl, file)
        print("------ success -------")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
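The link pattern above depends on the exact attribute layout of each anchor tag. As an alternative technique (not the one used in the example), the standard-library `html.parser` can collect links regardless of attribute order. The class name `LinkCollector` and the sample markup are assumptions made for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect de-duplicated absolute links that start with a given prefix."""

    def __init__(self, base_url, prefix):
        super().__init__()
        self.base_url = base_url
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if absolute.startswith(self.prefix) and absolute not in self.links:
            self.links.append(absolute)

# Made-up markup: one duplicate, one relative link, one off-site link.
sample = (
    '<a href="http://news.sina.com.cn/a.html">A</a>'
    '<a class="x" href="/b.html">B</a>'
    '<a href="http://other.example/c.html">C</a>'
    '<a href="http://news.sina.com.cn/a.html">A again</a>'
)
collector = LinkCollector("http://news.sina.com.cn/", "http://news.sina.com.cn/")
collector.feed(sample)
print(collector.links)
```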

7. Avoiding blocks: proxy servers:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import re

url = "http://blog.csdn.net/"

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")

# Create an opener that plays the role of the browser.
opener = urllib.request.build_opener()

# Attach the header to the opener.
opener.addheaders = [headers]

# Install the opener globally so that urlopen and urlretrieve use it.
urllib.request.install_opener(opener)

data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")

pat = '<h3 class="csdn-tracking-statistics" data-mod="popu_430" data-poputype="feed" data-feed-show="false" data-dsm="post"><a href="(.*?)"'

result = re.compile(pat).findall(data)

for i in range(0, len(result)):
    file = str(i) + ".html"
    urllib.request.urlretrieve(result[i], filename=file)
    print("Article " + str(i) + " crawled successfully")

# This downloads every article linked from the CSDN home page.

How to use a proxy:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request

def use_proxy(url, proxy_addr):
    # Route http:// requests through the given proxy address.
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    return data

# Proxy address in "host:port" form; public proxies like this one often stop working.
proxy_addr = "110.73.43.18:8123"

url = "http://www.baidu.com"

data = use_proxy(url, proxy_addr)

print(len(data))
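`install_opener` changes the behavior of every later `urlopen` call in the process. When only some requests should go through the proxy, the opener can be kept as a local object instead. A minimal sketch, assuming the hypothetical helper `build_proxy_opener`; it builds the opener and inspects it without sending a request.

```python
import urllib.request

def build_proxy_opener(proxy_addr):
    """Return an opener that routes http:// requests through proxy_addr."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    return urllib.request.build_opener(proxy)

opener = build_proxy_opener("110.73.43.18:8123")
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
# To fetch through the proxy: opener.open(url, timeout=10).read()
```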

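8. Image crawler in practice:

Different browsers sometimes get different results for the same query and therefore render different source code.

First use "Inspect Element" to identify the key fields of the element, then find the matching image location in "View Source" and work out the pattern of the image URLs.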

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.parse
import urllib.request
import re

keyname = "短裙"

key = urllib.parse.quote(keyname)

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)

for i in range(0, 10):
    # The s parameter is the result offset; it advances by 44 per page.
    url = ("https://s.taobao.com/search?q=" + key
           + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=4&ntoffset=4&p4ppushleft=1%2C48&s="
           + str(i * 44))
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    # Image addresses appear in the page source as "pic_url":"//host/path".
    pat = 'pic_url":"//(.*?)"'
    imagelist = re.compile(pat).findall(data)
    for j in range(0, len(imagelist)):
        thisimg = imagelist[j]
        thisimgurl = "http://" + thisimg
        file = "F:/python_workspace/test/pic/" + str(i) + str(j) + ".jpg"
        urllib.request.urlretrieve(thisimgurl, filename=file)
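The two moving parts of the loop above, the page offset and the image-URL extraction, can be checked in isolation. A small offline sketch, assuming the hypothetical helpers `page_offsets` and `extract_image_urls` and a made-up source fragment:

```python
import re

def page_offsets(pages, page_size=44):
    """Offsets for the `s` query parameter, one per result page."""
    return [page * page_size for page in range(pages)]

def extract_image_urls(text):
    """Find protocol-relative pic_url values and prefix them with a scheme."""
    return ["http://" + rest
            for rest in re.findall(r'pic_url":"//(.*?)"', text)]

# Made-up fragment imitating the data embedded in a search result page.
sample = '{"pic_url":"//img.example/a.jpg","x":1,"pic_url":"//img.example/b.jpg"}'

print(page_offsets(3))
print(extract_image_urls(sample))
```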

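When the page source alone is not enough, packet capture is needed.

Exercise: crawl 58pic.com (千图网); the cause of any error can be found by debugging: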

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.error
import urllib.request
import re

for i in range(1, 10):
    pageurl = "http://www.58pic.com/piccate/3-153-652-" + str(i) + ".html"
    data = urllib.request.urlopen(pageurl).read().decode("utf-8", "ignore")
    # Capture the thumbnail address up to ".jpg!" inside each thumb-box link.
    pat = '<a class="thumb-box".*?src="(.*?).jpg!"'
    imglist = re.compile(pat).findall(data)
    for j in range(0, len(imglist)):
        try:
            thisimg = imglist[j]
            # Appending "_1024.jpg" requests the larger version of the image.
            thisimgurl = thisimg + "_1024.jpg"
            file = "F:/python_workspace/test/pic2/" + str(i) + str(j) + ".jpg"
            urllib.request.urlretrieve(thisimgurl, filename=file)
            print("Page " + str(i) + ", image " + str(j) + " crawled successfully")
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(e)

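9. Packet-capture analysis in practice (1)

Getting Taobao review data, Tencent entertainment news, and similar content requires packet-capture analysis.

This part covers how to capture HTTPS packets and how to crawl the comments on Tencent Video.

TextView: the Fiddler tab that displays the returned content.

Use Fiddler to find the URL that carries the comments, copy it out, and study its pattern.

After configuring Fiddler, click through the page you want to crawl, go back to Fiddler, and pick out the request that returns the JS content:

The corresponding URL is: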

https://rate.tmall.com/list_detail_rate.htm?itemId=42679128869&spuId=315119437&sellerId=2166475645&order=3&currentPage=1&append=0&content=1&tagId=&posi=&picture=&ua=098%23E1hv%2FpvEvbQvUvCkvvvvvjiPP2Lw0jEbPL59AjnEPmPZQj1Pn2L9QjEvR2MwljE8vphvC9vhvvCvpvyCvhQvryGvCzox9WFIRfU6pwet9E7rejZIYExr1EuK46en3OkQrEttpR2y%2BnezrmphQRAn3feAOHPIAXcBKFyK2ixrlj7xD7QHYWsUtE97Kphv8vvvvvCvpvvvvvmCc6Cv2UIvvUnvphvpgvvv96CvpCCvvvmCXZCvhhmEvpvV2vvC9jx2uphvmvvv98GEKUM72QhvCvvvMMGtvpvhvvvvv8wCvvpvvUmm3QhvCvvhvvv%3D&isg=AoKCecM7b7NouHNtRCUm6rar0osk--IFkGgfUsyboPWxHyKZtOPWfQjduSCd&needFold=0&_ksTS=1508769919830_1070&callback=jsonp1071

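Then work out what fields such as itemId contain. Some of the parameters may not be needed and can simply be deleted, for example the ua field in the URL above.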
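Removing a parameter by hand from a long URL is error-prone; `urllib.parse` can do it mechanically. A minimal sketch, assuming the hypothetical helper `drop_params` and a shortened, made-up variant of the URL above in which the ua value is a placeholder:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def drop_params(url, unwanted):
    """Return url with the query parameters named in `unwanted` removed."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in unwanted]
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = ("https://rate.tmall.com/list_detail_rate.htm"
       "?itemId=42679128869&currentPage=1&ua=PLACEHOLDER&callback=jsonp1071")
print(drop_params(url, {"ua"}))
```

Whether the server still answers without a given parameter has to be checked by trying the trimmed URL.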

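To capture HTTPS traffic:

By default Fiddler captures only HTTP pages, not HTTPS pages, yet we often need HTTPS pages, for example when collecting Taobao data. Here, instructor Wei Wei (韦玮) explains how to capture HTTPS pages with Fiddler.

Open Fiddler, go to "Tools > Fiddler Options > HTTPS", and tick all the boxes at the bottom, as shown in the figure below:

Then click Actions and choose the second item, which exports the CA certificate to the desktop. Once it is exported, click OK in the dialog above to save the settings.

The exported certificate now sits on the desktop, as shown below:

Next, import the certificate into the browser. Open Firefox, go to "Options > Advanced > Certificates > Import", and select the certificate on the desktop. Fiddler can then capture HTTPS pages, as shown in the figure below.

Crawling Tencent Video comments:

The figure below shows the JS file that carries the comments (obtained from Fiddler):

The response contains several fields, commentid among them. Clicking "Load more" changes commentid. The address of the next batch of comments can be derived from the response of the first URL: its last field is the commentid of the next URL, and that is how the next URL is constructed.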

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import json
import re
import urllib.error
import urllib.request

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)

comid = "6323280825454961655"
base = "http://coral.qq.com/article/2102904258/comment?commentid="
suffix = "&reqnum=20&tag=&callback=jQuery1124020025941284059412_1508770934137&_=1508770934145"
url = base + comid + suffix
for i in range(0, 100):
    data = urllib.request.urlopen(url).read().decode()
    # The "last" field holds the commentid to use for the next batch.
    patnext = '"last":"(.*?)"'
    nextid = re.compile(patnext).findall(data)[0]
    patcom = '"content":"(.*?)",'
    comdata = re.compile(patcom).findall(data)
    for j in range(0, len(comdata)):
        print("------ Comment " + str(i) + str(j) + ":")
        # Decode the \uXXXX escapes as a JSON string rather than calling eval on remote data.
        print(json.loads('"' + comdata[j] + '"'))
    url = base + nextid + suffix
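Because the response is JSON wrapped in a callback, it can also be parsed as JSON instead of being matched with regular expressions. The sketch below shows that alternative; the helper `unwrap_jsonp` is hypothetical and the payload layout in `sample` is an assumption made up for illustration, so the key names would need to be checked against a real response.

```python
import json
import re

def unwrap_jsonp(text):
    """Strip a callback(...) wrapper and parse the JSON inside it."""
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))

# Made-up payload; the nesting under "data" is assumed, not taken from the site.
sample = ('jQuery112402_150877({"data":{"last":"6323",'
          '"commentid":[{"content":"\\u4f60\\u597d"}]}})')

payload = unwrap_jsonp(sample)
next_id = payload["data"]["last"]
comments = [c["content"] for c in payload["data"]["commentid"]]
print(next_id, comments)
```

Parsing the whole payload decodes the `\uXXXX` escapes automatically and avoids cutting a comment short when its text contains a quote.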

10. WeChat crawler in practice:

How can WeChat's restrictions be worked around?

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# http://weixin.sogou.com/
import re
import urllib.parse
import urllib.request
import time
import urllib.error

# Helper: fetch a URL through a proxy server; returns the raw bytes, or None on failure.
def use_proxy(proxy_addr, url):
    # Set up exception handling.
    try:
        req = urllib.request.Request(url)
        req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0")
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(req).read()
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        # On a URLError, wait 10 seconds before continuing.
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)

# Search keyword, percent-encoded once for use in the URL.
key = urllib.parse.quote("Python")
# Proxy server; it may stop working, in which case replace it with a valid one.
# Here the traffic is relayed through Fiddler.
proxy = "127.0.0.1:8888"
# Number of pages to crawl:
for i in range(0, 10):
    thispageurl = "http://weixin.sogou.com/weixin?type=2&query=" + key + "&page=" + str(i)
    #a="http://blog.csdn.net"
    thispagedata = use_proxy(proxy, thispageurl)
    print(len(str(thispagedata)))
    pat1 = '<a href="(.*?)"'
    rs1 = re.compile(pat1, re.S).findall(str(thispagedata))
    if len(rs1) == 0:
        print("Page " + str(i) + " failed")
        continue
    for j in range(0, len(rs1)):
        thisurl = rs1[j]

        # The extracted URL differs slightly from the one the browser actually opens;
        # comparing the two shows that the "amp;" fragments are extra.
        thisurl = thisurl.replace("amp;", "")
        file = "F:/python_workspace/test/wechat/第"+str(i)+"页第"+str(j)+"篇文章.html"
        thisdata = use_proxy(proxy, thisurl)
        try:
            # If the download failed, thisdata is None and write() raises, which is reported below.
            with open(file, "wb") as fh:
                fh.write(thisdata)
            print("Page " + str(i) + ", article " + str(j) + " succeeded")
        except Exception as e:
            print(e)
            print("Page " + str(i) + ", article " + str(j) + " failed")
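The WeChat crawler still works through a web page: it goes through Sogou's WeChat search (weixin.sogou.com) and builds the URLs for that page to retrieve the content.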
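The stray `amp;` fragments removed in the loop above come from HTML escaping: inside an attribute value, `&` is written as `&amp;`. The standard-library `html.unescape` reverses that escaping, along with any other entities. A minimal sketch with a made-up link:

```python
import html

# Made-up href as it might appear in page source, with "&" escaped as "&amp;".
raw_href = "http://weixin.sogou.com/link?a=1&amp;b=2&amp;c=3"
clean_href = html.unescape(raw_href)
print(clean_href)
```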

11. Multithreaded crawling in practice

Code for Qiushibaike (糗事百科):

Rewritten to use multiple threads, the program is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import re
import urllib.error
import threading

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)

# Thread One crawls the odd-numbered pages.
class One(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(1, 36, 2):
            url = "https://www.qiushibaike.com/8hr/page/" + str(i)
            pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            pat = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
            datalist = re.compile(pat, re.S).findall(pagedata)
            for j in range(0, len(datalist)):
                print("Page " + str(i) + ", joke " + str(j) + ":")
                print(datalist[j])

# Thread Two crawls the even-numbered pages.
class Two(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(0, 36, 2):
            url = "https://www.qiushibaike.com/8hr/page/" + str(i)
            pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            pat = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
            datalist = re.compile(pat, re.S).findall(pagedata)
            for j in range(0, len(datalist)):
                print("Page " + str(i) + ", joke " + str(j) + ":")
                print(datalist[j])

one = One()
one.start()

two = Two()
two.start()
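The two thread classes above differ only in which pages they visit. As an alternative technique, `concurrent.futures.ThreadPoolExecutor` can spread any list of pages over a pool of worker threads. The sketch below assumes the hypothetical names `crawl_pages` and `fake_fetch`; the stand-in fetch function replaces the real download so that the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor

PATTERN = re.compile(
    r'<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)

def crawl_pages(pages, fetch, workers=2):
    """Fetch pages concurrently; return {page: [joke, ...]}."""
    def work(page):
        return page, PATTERN.findall(fetch(page))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the input pages.
        return dict(pool.map(work, pages))

def fake_fetch(page):
    """Stand-in for a real download so the sketch runs offline."""
    return '<div class="content"><span>joke %d</span></div>' % page

result = crawl_pages(range(1, 5), fake_fetch)
print(result)
```

In real use, `fetch` would download the page for the given number, for example with `urllib.request.urlopen`, and return the decoded text.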