Python Crawlers and Anti-Crawler Measures

Latest recommended article published on 2024-05-13 15:58:31

木子葭

Article link: https://blog.csdn.net/qq_42725815/article/details/86920493


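I. What Is a Crawler

A crawler is a program that automatically fetches information from the internet and extracts the parts that are valuable to us.
1. Scraping all email addresses from a Tieba thread
Step 1: fetch the content of the page with the crawler

Open the target page with urlopen
Read the page content with .read(), which returns bytes
Call decode('utf-8') to decode those bytes into a str (Unicode text)

Step 2: apply a regular expression to the page content to collect every matching email address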

# 1. Fetch the content of the thread page with the crawler.
# 2. Find the page count in the content: <span class="red">31</span> =====> 31
from itertools import chain
from urllib.request import urlopen
import re

def getPageContent(url):
    """
    Fetch the source of a web page.
    :param url: URL of the page to fetch
    :return: page content as a str
    """
    with urlopen(url) as html:
        return html.read().decode('utf-8')

def parser_page(content):
    """
    Extract the total number of pages in the thread.
    :param content: page content
    :return: total page count, as a string of digits
    """
    pattern = r'<span class="red">(\d+)</span>'
    data = re.findall(pattern, content)
    return data[0]

def parser_all_page(pageCount):
    """
    Build the URL of every page of the thread and collect the email addresses on each.
    :param pageCount: total page count
    :return: a list holding one list of addresses per page
    """
    emails = []
    for page in range(int(pageCount)):
        url = 'http://tieba.baidu.com/p/2314539885?pn=%d' % (page + 1)
        print('Fetching: %s' % url)
        content = getPageContent(url)
        pattern = r'[a-zA-Z0-9][-\w.+]*@[A-Za-z0-9][-A-Za-z0-9]+\.+[A-Za-z]{2,14}'
        findEmail = re.findall(pattern, content)
        print(findEmail)
        emails.append(findEmail)
    return emails

def main():
    url = 'http://tieba.baidu.com/p/2314539885'
    content = getPageContent(url)
    pageCount = parser_page(content)
    emails = parser_all_page(pageCount)
    print(emails)
    # Flatten the per-page lists and write one address per line.
    with open('tiebaEmail.txt', 'w', encoding='utf-8') as f:
        for tieba in chain(*emails):
            f.write(tieba + '\n')

main()

A total of 794 email addresses were scraped.
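The two regular expressions used by the crawler can be tried out without touching the network. The snippet below is only a minimal sketch: the helper names `find_page_count` and `find_emails` and the sample HTML string are made up for illustration, while the patterns themselves are the ones from the crawler.

```python
import re

# The same two patterns the crawler uses.
PAGE_COUNT_PATTERN = r'<span class="red">(\d+)</span>'
EMAIL_PATTERN = r'[a-zA-Z0-9][-\w.+]*@[A-Za-z0-9][-A-Za-z0-9]+\.+[A-Za-z]{2,14}'

def find_page_count(content):
    """Return the page count as an int, or None if the marker is missing."""
    match = re.search(PAGE_COUNT_PATTERN, content)
    return int(match.group(1)) if match else None

def find_emails(content):
    """Return every substring of content that looks like an email address."""
    return re.findall(EMAIL_PATTERN, content)

# Made-up sample standing in for a downloaded page (an assumption, not real Tieba markup).
sample = ('<li>of <span class="red">31</span> pages</li>'
          '<p>alice@example.com, bob.smith+x@mail163.cn</p>')

print(find_page_count(sample))
print(find_emails(sample))
```

Checking the patterns on a small string like this first makes it easier to tell a regex problem from a network problem later on.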

2. Downloading a single image

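from urllib.request import urlopen

# The URL is split across two string literals only to keep the lines short.
url = ('http://imgsrc.baidu.com/forum/w%3D580/'
       'sign=e23a670db9b7d0a27bc90495fbee760d/38292df5e0fe9925f33f62ef3fa85edf8db17159.jpg')

# 1. Fetch the image content (raw bytes)
content = urlopen(url).read()

# 2. Write it to a local file in binary mode
with open('hello.jpg', 'wb') as f:
    f.write(content)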

3. Downloading all images on a given Tieba page

import os
import re
from urllib.request import urlopen

def get_content(url):
    """
    Fetch the raw content of a URL.
    :param url: URL of the page or image
    :return: content as bytes
    """
    with urlopen(url) as html:
        return html.read()

def parser_get_img_url(content):
    """
    Parse the thread page and collect the URLs of all the scenery images.
    :param content: page content as bytes
    :return: list of image URLs
    """
    pattern = r'<img class="BDE_Image".*?src="(http://.*?\.jpg)".*?>'
    imgUrl = re.findall(pattern, content.decode('utf-8').replace('\n', ' '))
    return imgUrl


def main():
    url = 'http://tieba.baidu.com/p/5437043553'
    content = get_content(url)
    imgLi = parser_get_img_url(content)
    # Make sure the output directory exists before writing into it.
    os.makedirs('img', exist_ok=True)
    for index, imgurl in enumerate(imgLi):
        # Fetch the content of each image from its URL.
        content = get_content(imgurl)
        with open('img/%s.jpg' % (index + 1), 'wb') as f:
            f.write(content)
            print("Image %s downloaded successfully." % (index + 1))

main()
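The regular expression above depends on the exact attribute order inside the `<img>` tag. As an alternative (a swapped-in technique, not the approach used in this post), the standard library's `html.parser` can pick out the same images by class name. This is only a sketch: `ImageCollector`, `collect_image_urls` and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag whose class list contains css_class."""

    def __init__(self, css_class='BDE_Image'):
        super().__init__()
        self.css_class = css_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attr = dict(attrs)
        classes = (attr.get('class') or '').split()
        src = attr.get('src')
        if self.css_class in classes and src:
            self.urls.append(src)

def collect_image_urls(html_text):
    collector = ImageCollector()
    collector.feed(html_text)
    return collector.urls

# Made-up markup for illustration; real Tieba pages may look different.
sample_html = (
    '<div>'
    '<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/a.jpg" width="560">'
    '<img class="avatar" src="http://imgsrc.baidu.com/forum/b.jpg">'
    '<img width="560" class="BDE_Image" src="https://imgsrc.baidu.com/forum/c.jpg">'
    '</div>'
)
print(collect_image_urls(sample_html))
```

Unlike the regex, this version does not care whether `class` comes before `src` or whether the URL starts with http or https.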

4. Downloading a video

On Linux, a video player also has to be installed before the downloaded video can be played.
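The code for this step is not preserved in the text, so what follows is only a hedged sketch of one way to do it, not the original implementation. Video files tend to be large, so the sketch streams the response to disk in chunks instead of calling `.read()` once. The function name `download`, the `chunk_size` parameter and the example URL are all assumptions.

```python
import shutil
from urllib.request import urlopen

def download(url, filename, chunk_size=64 * 1024):
    """Stream the response body to a file without loading it all into memory."""
    with urlopen(url) as response, open(filename, 'wb') as f:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, f, chunk_size)
    return filename

# Hypothetical usage; the URL is a placeholder, not taken from the original post:
# download('http://example.com/video.mp4', 'video.mp4')
```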

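II. A Closer Look at Crawlers

What happens when you browse a web page

The browser sends a request for the URL you entered (http://www.baidu.com/index.html)
The scheme (http) fixes the protocol and www.baidu.com is the domain being visited -> a DNS server resolves the domain to an IP address
The server works out which page is being requested -> the page content is returned to the browser (the response)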
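To make the request step concrete, here is a small sketch that splits a URL into the pieces named above and assembles the text of a minimal HTTP/1.1 GET request. It only builds the string and sends nothing. The function name `describe_request` is hypothetical, and a real browser sends many more headers than these.

```python
from urllib.parse import urlsplit

def describe_request(url):
    """Return (scheme, host, request_text) for a simple GET of url."""
    parts = urlsplit(url)
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    lines = [
        'GET {} HTTP/1.1'.format(path),   # request line: method, path, protocol version
        'Host: {}'.format(parts.netloc),  # the domain that DNS resolves to an IP address
        'Connection: close',
        '',                               # blank line ends the header section
        '',
    ]
    return parts.scheme, parts.hostname, '\r\n'.join(lines)

scheme, host, text = describe_request('http://www.baidu.com/index.html')
print(scheme, host)
print(text)
```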

Fetching a web page

1. Basic approach:
Fetch the page with urlopen from urllib.request

from urllib import request

response = request.urlopen('http://www.baidu.com')
content = response.read().decode('utf-8')
print(content)

2. Using a Request object (which lets you add extra header fields)

from urllib import request
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
# Create a Request object; this is where custom request headers can be set.
req = request.Request(url, headers=headers)
# urlopen accepts a Request object as well as a plain URL string.
content = request.urlopen(req).read().decode('utf-8')
print(content)

Headers can also be added after the Request object has been created:

from urllib import request

url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
# Create a Request object first, then attach the custom header to it.
req = request.Request(url)
req.add_header('User-Agent', user_agent)

# urlopen accepts a Request object as well as a plain URL string.
content = request.urlopen(req).read().decode('utf-8')
print(content)
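Both ways of setting the header end up with the same Request, which can be checked without sending anything. The sketch below is for illustration only; the helper name `build_requests` is made up. One detail worth knowing is that urllib stores header names in capitalized form, so the header is looked up as 'User-agent'.

```python
from urllib import request

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

def build_requests(url):
    """Build the same request twice: headers via the constructor, and via add_header."""
    via_constructor = request.Request(url, headers={'User-Agent': USER_AGENT})
    via_add_header = request.Request(url)
    via_add_header.add_header('User-Agent', USER_AGENT)
    return via_constructor, via_add_header

a, b = build_requests('http://www.cbrc.gov.cn/chinese/jrjg/index.html')
# Nothing is sent here; we only inspect the Request objects.
print(a.get_header('User-agent') == b.get_header('User-agent'))
print(a.header_items())
```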


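II. Anti-Crawler Measures ----> Spoofing a Browser

To keep out abusive crawlers, some sites deploy anti-crawler mechanisms, and their servers block requests that look like they come from a crawler. The most common mechanisms are:

1. Inspecting the headers of incoming requests
2. Monitoring user behavior, for example checking whether a single IP address hits the site too often within a short time
3. Serving dynamic pages, which makes the content harder to scrape

The first mechanism is the most widely used today: most sites that block crawlers inspect the 'User-Agent' field of the request headers to decide who is calling. Changing the User-Agent lets the crawler pass itself off as a browser.
Sites that use the second mechanism can usually be handled by going through proxy servers and switching between them frequently.
By default urllib identifies itself with a User-Agent of the form Python-urllib/3.x, which such sites reject, so the crawler has to present itself as a browser first.

The example below illustrates this; the URL used is that of a Chinese bank's website.
The plain request is refused with 403, but the server does not refuse browsers, so the crawler pretends to be one.

Once the crawler presents itself as a browser, the bank's page can be read; what comes back is an HTML page.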
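The flow just described (a plain request refused with 403, then a retry as a browser) can be sketched as follows. This is a minimal illustration under assumptions, not a definitive implementation: the function name `fetch` and the `opener` parameter are made up so that the logic can be exercised with a stand-in instead of a real server.

```python
from urllib import request, error

BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

def fetch(url, opener=request.urlopen):
    """Try a plain request first; on HTTP 403, retry with a browser User-Agent."""
    try:
        with opener(request.Request(url)) as resp:
            return resp.read()
    except error.HTTPError as e:
        if e.code != 403:
            raise  # only "forbidden" is handled here; other errors propagate
    # Second attempt: same URL, but presenting the crawler as a browser.
    req = request.Request(url, headers={'User-Agent': BROWSER_UA})
    with opener(req) as resp:
        return resp.read()

# Hypothetical usage (needs network access):
# html = fetch('http://www.cbrc.gov.cn/chinese/jrjg/index.html').decode('utf-8')
```

If the retry is refused as well, the second HTTPError is simply raised to the caller.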

The user_agent string can be copied from your own browser, for example from the request headers shown in its developer tools.

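III. Anti-Crawler Measures ----> IP Proxies

A crawler runs fast, so a single fixed IP address ends up hitting the site at a very high rate; if the site has an anti-crawler policy, it will ban that IP.

How can this be avoided?
- Add a delay between requests: time.sleep(random.randint(1,5))
- Use an IP proxy, so that another IP address visits the site in place of yours
Where can proxy IPs be found?
http://www.xicidaili.com/
What are the steps?

1). Create a urllib.request.ProxyHandler(proxies=None), which plays a role similar to the Request object
2). Build an opener, a customized counterpart of urlopen
3). Install the opener
4). Choose the proxy IP; requests made with urlopen then go through it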

from urllib import request

url = 'https://www.whatismyip.com/'
# Free proxies expire quickly; substitute addresses that currently work.
proxy = {'https': '116.209.58.47:9999', 'http': '182.88.247.177:9797'}
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
# 1). Create a ProxyHandler (plays a role similar to the Request object)
proxy_support = request.ProxyHandler(proxy)
# 2). Build an opener, a customized counterpart of urlopen
opener = request.build_opener(proxy_support)
# Present the crawler as a browser
opener.addheaders = [('User-Agent', user_agent)]
# 3). Install the opener so that urlopen uses it
request.install_opener(opener)
# 4). Open the URL; the request goes through the proxy chosen above
response = request.urlopen(url)
content = response.read().decode('utf-8')

print(content)
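The two remedies (a random delay and a proxy in front of your IP) are easy to combine. The sketch below is an illustration under assumptions: `polite_delay`, `opener_for` and `PROXY_POOL` are made-up names, and the addresses in the pool are documentation placeholders, not working proxies. It returns an opener instead of installing it globally, which makes it easy to use a different proxy for each request.

```python
import random
import time
from urllib import request

# Placeholder proxies (TEST-NET addresses); replace with proxies that actually work.
PROXY_POOL = [
    {'http': '192.0.2.10:8080', 'https': '192.0.2.10:8080'},
    {'http': '192.0.2.11:8080', 'https': '192.0.2.11:8080'},
]

def polite_delay(low=1, high=5, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [low, high] and return it."""
    seconds = random.randint(low, high)
    sleep(seconds)
    return seconds

def opener_for(proxy_pool, user_agent):
    """Build an opener that sends traffic through one randomly chosen proxy."""
    proxy = random.choice(proxy_pool)
    opener = request.build_opener(request.ProxyHandler(proxy))
    opener.addheaders = [('User-Agent', user_agent)]
    return opener, proxy

# Hypothetical usage (needs working proxies and network access):
# for url in urls:
#     opener, proxy = opener_for(PROXY_POOL, 'Mozilla/5.0 ...')
#     content = opener.open(url).read()
#     polite_delay()
```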


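IV. Saving Cookie Information

What is a cookie?

Some sites need to identify the user, for example when a page can only be accessed after logging in. To track the session, the site stores user-related data, such as the user name, on the user's local machine; that data is the cookie.

1. How to store cookies in a variable or in a file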

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# 1. How to store cookies in a variable or in a file
# 1) Create a CookieJar (class hierarchy: CookieJar ---> FileCookieJar ---> MozillaCookieJar)
cookie = cookiejar.CookieJar()

# 2) Create a cookie handler with HTTPCookieProcessor from urllib.request
handler = HTTPCookieProcessor(cookie)

# 3) Build an opener from the cookie handler (urlopen itself uses a default opener)
opener = request.build_opener(handler)

# 4) Open the URL
response = opener.open('http://www.baidu.com')

# 5) Print the cookies set by the page
print(cookie)
for item in cookie:
    print(item)

2. How to save cookies to a file in a specific format

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# 1) Name of the file the cookies will be saved to
cookieFilename = 'cookie.txt'

# 2) Create a MozillaCookieJar, which holds cookies and can write them to a file
cookie = cookiejar.MozillaCookieJar(filename=cookieFilename)

# 3) Create a cookie handler with HTTPCookieProcessor from urllib.request
handler = HTTPCookieProcessor(cookie)

# 4) Build an opener from the cookie handler (urlopen itself uses a default opener)
opener = request.build_opener(handler)

# 5) Open the URL
response = opener.open('http://www.baidu.com')

# 6) Print the cookies
print(cookie)
print(type(cookie))
# ignore_discard: save cookies even if they are marked to be discarded (session cookies)
# ignore_expires: save cookies even if they have already expired
cookie.save(ignore_discard=True, ignore_expires=True)

3. How to load cookies from a file and use them for a request

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# 1) Location of the cookie file
cookieFilename = 'cookie.txt'

# 2) Create a MozillaCookieJar, used here to read the cookies back from the file
cookie = cookiejar.MozillaCookieJar()

# 3) Load the cookies from the file; pass the same flags used when saving,
#    otherwise session cookies and expired cookies are skipped on load
cookie.load(filename=cookieFilename, ignore_discard=True, ignore_expires=True)

# 4) Create a cookie handler with HTTPCookieProcessor from urllib.request
handler = HTTPCookieProcessor(cookie)

# 5) Build an opener from the cookie handler (urlopen itself uses a default opener)
opener = request.build_opener(handler)

# 6) Open the URL
response = opener.open('http://www.baidu.com')

# 7) Print the page content
print(response.read().decode('utf-8'))
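The save and load steps can be checked offline with a hand-made cookie. The following is a sketch for illustration only: `make_session_cookie`, `save_and_reload` and the cookie values are hypothetical. It also shows why the `ignore_discard` flag matters when loading: a session cookie (one with no expiry time) is skipped unless the flag is set.

```python
import os
import tempfile
from http import cookiejar

def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (no expiry time, marked to be discarded)."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

def save_and_reload(path, ignore_discard):
    """Save one session cookie to path, then load the file into a fresh jar."""
    jar = cookiejar.MozillaCookieJar(path)
    jar.set_cookie(make_session_cookie('session', 'abc123', '.example.com'))
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = cookiejar.MozillaCookieJar()
    loaded.load(path, ignore_discard=ignore_discard, ignore_expires=True)
    return {c.name: c.value for c in loaded}

cookie_path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
print(save_and_reload(cookie_path, ignore_discard=True))   # the cookie survives
print(save_and_reload(cookie_path, ignore_discard=False))  # the session cookie is skipped
```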


V. Handling Common urllib Exceptions

Common exceptions: URLError, HTTPError, ContentTooShortError, ...
When it is not certain that a request will succeed, wrap it in try/except.

from urllib import request, error
try:
    url = 'https://www.baidu.com/hello.html'
    response = request.urlopen(url)
    print(response.read().decode('utf-8'))
# HTTPError is a subclass of URLError, so it has to be caught first.
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("Success")

Handling timeouts

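from urllib import request, error
import socket

try:
    url = 'https://www.baidu.com'
    # A deliberately tiny timeout, to force a timeout error.
    response = request.urlopen(url, timeout=0.01)
    print(response.read().decode('utf-8'))
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
    # A timeout while connecting is wrapped in URLError.
    if isinstance(e.reason, socket.timeout):
        print("Timed out")
except socket.timeout:
    # A timeout while reading the response is raised directly.
    print("Timed out")
else:
    print("Success")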
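The order of the checks matters because HTTPError is a subclass of URLError. The sketch below packs the same decisions into one function that can be tried on hand-made exception objects, with no network involved. The function name `describe_error` and its return strings are made up for illustration.

```python
import socket
from urllib import error

def describe_error(exc):
    """Classify an exception raised by urlopen into a short label."""
    # HTTPError is tested first: it is a subclass of URLError.
    if isinstance(exc, error.HTTPError):
        return 'http %d' % exc.code
    if isinstance(exc, error.URLError):
        # A connection timeout arrives wrapped in URLError.reason.
        if isinstance(exc.reason, socket.timeout):
            return 'timeout'
        return 'url error: %s' % exc.reason
    # A timeout while reading the response is raised unwrapped.
    if isinstance(exc, socket.timeout):
        return 'timeout'
    return 'other'

print(issubclass(error.HTTPError, error.URLError))
print(describe_error(error.URLError(socket.timeout('timed out'))))
print(describe_error(error.HTTPError('http://example.com/x', 404, 'Not Found', None, None)))
```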


VI. Parsing URLs

1. urlparse in urllib.parse splits a URL into its components

from urllib.parse import urlparse

url='http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search'
parse_info=urlparse(url)

print(parse_info)
print(parse_info.netloc)
print(parse_info.path)
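urlparse leaves the query string as one piece. To go one step further, `parse_qs` (also in urllib.parse) turns it into a dict. A small sketch follows; the helper name `query_params` is made up.

```python
from urllib.parse import urlparse, parse_qs

def query_params(url):
    """Return the query string of url as a dict mapping each name to a list of values."""
    return parse_qs(urlparse(url).query)

url = 'http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search'
# Each value is a list because a name may appear more than once; '+' decodes to a space.
print(query_params(url))
```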

2. urlencode, also in urllib.parse, encodes a dict of parameters into a query string that can be used to assemble a URL

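# 1. Parsing: get the scheme and the host being accessed;
# 2. Build a URL by encoding a dict of parameters;
from urllib.parse import urlencode
params = {
    'name': 'westos',
    'age': 20
}
base_url = 'http://www.baidu.com'
# The query string is separated from the rest of the URL by '?'.
url = base_url + '?' + urlencode(params)
print(url)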

Exercise:
Get the scheme and host of a URL, then assemble a new URL to a given specification

from urllib.parse import urlparse, urlencode

url1 = 'https://movie.douban.com/subject/4864908/comments?start=20&limit=20&sort=new_score&status=P'
parse_news = urlparse(url1)
print(parse_news.scheme)
print(parse_news.netloc)

params2 = {
    'start': 20,
    'limit': 20,
    'sort': 'new_score',
    'status': 'P'
}
base_url2 = 'https://movie.douban.com/subject/4864908/comments'
# The query string is separated from the path by '?'.
url2 = base_url2 + '?' + urlencode(params2)
print(url2)
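Instead of gluing strings together, the parsed result can be put back together with a new query string. This is just one possible sketch; the helper name `with_query` is hypothetical. The result of urlparse is a named tuple, so `_replace` swaps one field and `geturl()` reassembles the URL.

```python
from urllib.parse import urlparse, urlencode

def with_query(base_url, params):
    """Return base_url with its query string replaced by the encoded params."""
    return urlparse(base_url)._replace(query=urlencode(params)).geturl()

new_url = with_query(
    'https://movie.douban.com/subject/4864908/comments',
    {'start': 20, 'limit': 20, 'sort': 'new_score', 'status': 'P'},
)
print(new_url)
```

This way the '?' separator is inserted by the library, so it cannot be forgotten.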
