Crawler Series: Scraping the Contents of a Public Zhihu Collection

weixin_34082789

Tags: crawler, python, operating system

Original link: https://my.oschina.net/niithub/blog/864333

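Yesterday I saw someone post a message like this in Xiaomiquan.

In the twenty-first century, which ability matters most? Execution. So let's get straight to it: please enjoy this rookie-level crawler, presented by niithub, whose act is scraping the contents of a public Zhihu collection.

Before the show begins, a few quick facts about our rookie crawler, PP.

It runs on 64-bit Windows with Python 3.5.2, and the editor is Sublime Text 3.

Enough background; straight to the code.

First, let's look at which packages PP uses. The original post showed them in a screenshot; they are the imports at the top of the final listing at the end of this post.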

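With the preparation done, it is time to show some real skill.

As the saying goes, to take something out of a house you first have to get through the door.

So before scraping the collection, we first need to log in to Zhihu.

Capturing and analyzing the traffic of a Zhihu login tells us the following: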

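The login request URL is https://www.zhihu.com/login/phone_num
The request sends 2 parameters, password and phone_num (this login uses a phone number). Note: logging in too often triggers Zhihu's anti-crawler mechanism, and the dreaded CAPTCHA appears.
The Referer is https://www.zhihu.com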
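
The three observations above can be turned into a request object without sending anything. The following is a minimal sketch, assuming the endpoint and field names from the capture are still accurate; `build_login_request` and the sample credentials are hypothetical names used only for illustration.

```python
import urllib.parse
import urllib.request

LOGIN_URL = 'https://www.zhihu.com/login/phone_num'


def build_login_request(phone_num, password):
    """Build (but do not send) the login POST described above."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) '
                      'Gecko/20100101 Firefox/51.0',
        'Referer': 'https://www.zhihu.com',
    }
    # The two form fields seen in the captured login request
    form = {'password': password, 'phone_num': phone_num}
    data = urllib.parse.urlencode(form).encode('utf-8')
    # Passing data makes urllib issue a POST instead of a GET
    return urllib.request.Request(LOGIN_URL, data=data, headers=headers)


# Placeholder credentials, not a real account
req = build_login_request('13800000000', 'not-a-real-password')
print(req.get_method())
print(req.full_url)
print(req.data)
```

Building the request separately from sending it makes it easy to inspect the method, headers, and body before any traffic reaches the site.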

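With this information in hand, we can write code that simulates a Zhihu login.

However, our goal is to scrape Zhihu collections, so requests will be frequent. Logging in before every fetch would be far too inefficient. So after logging in we save the generated cookies to a file, and on later fetches we simply load those cookies instead of logging in again.

The login code is as follows: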
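
The save-then-reload cycle can be tried offline before involving Zhihu at all. This is a rough sketch, assuming a hand-made cookie stands in for whatever the server would set; `sample_cookie`, the cookie name and value, and the example.com domain are all hypothetical.

```python
import http.cookiejar
import os
import tempfile


def sample_cookie(name, value, domain):
    """Create a session cookie by hand (stand-in for a server-set cookie)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Write to a temporary directory so the sketch leaves no cookie.txt behind
cookie_path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = http.cookiejar.MozillaCookieJar(cookie_path)
jar.set_cookie(sample_cookie('session_id', 'abc123', '.example.com'))
# Session cookies are normally dropped on save; ignore_discard keeps them
jar.save(ignore_discard=True, ignore_expires=True)

# A fresh jar, as a later run of the crawler would create
restored = http.cookiejar.MozillaCookieJar(cookie_path)
restored.load(ignore_discard=True, ignore_expires=True)
restored_names = [c.name for c in restored]
print(restored_names)
```

Note that a new `MozillaCookieJar` starts empty: the cookies only come back after `load` is called on it.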

def login(password,name):
    # Login endpoint
    postUrl = 'https://www.zhihu.com/login/phone_num'
    # Request headers
    headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
                'Referer':'https://www.zhihu.com'}
    # Form parameters
    value = {'password':password,'phone_num':name}
    data = urllib.parse.urlencode(value).encode(encoding='UTF8')
    # Build the request
    request = urllib.request.Request(postUrl , data , headers)
    # Build an opener that records cookies into cookie.txt
    cookie_filename = 'cookie.txt'
    cookie = http.cookiejar.MozillaCookieJar(cookie_filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    try:
        # Send the login request once, through the cookie-aware opener
        response = opener.open(request)
        print(response.read().decode())
    except urllib.error.URLError as e:
        # Only HTTPError carries a status code, so fall back to None
        print(getattr(e, 'code', None), ':', e.reason)
        return
    # Save the cookies to cookie.txt
    cookie.save(ignore_discard=True, ignore_expires=True)
    print(cookie)

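The login result is as follows:

The printed log consists of the response returned after a successful login, followed by the cookie contents:

1. msg is Unicode-escaped; decoded into Chinese, it means the login succeeded.
2. <MozillaCookieJar...> holds the cookie information.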
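
To read an escaped msg field, the response body can be parsed as JSON, which decodes the \uXXXX sequences. This is only a sketch: the body below is a made-up sample in the shape the log suggests, not a captured Zhihu response, and the `r` field is an assumption.

```python
import json

# Hypothetical response body; the real one is not reproduced in this post
raw_body = '{"r": 0, "msg": "\\u767b\\u5f55\\u6210\\u529f"}'

# json.loads turns the \uXXXX escapes back into readable characters
reply = json.loads(raw_body)
print(reply['msg'])
```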

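The first step is now done. Next we need to scrape the contents of the collection.

Take one of my own public collections as an example. Its address is https://www.zhihu.com/collection/149331424 (this collection was created just for this rookie crawler).

The information to scrape falls into two categories: the titles, and the content of the answers.

First, we need to analyze the HTML.

Using Firebug, a Firefox extension, we can clearly see that the titles of the collected items all sit in h2 tags with class zm-item-title, and the collected content in div elements with class zh-summary summary clearfix. Good, let's run the first scraping experiment.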

def get_collection(get_url):
    cookie_filename = 'cookie.txt'
    cookie = http.cookiejar.MozillaCookieJar(cookie_filename)
    cookie.load(cookie_filename, ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
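    # Request the collection page using the saved cookies
    get_request = urllib.request.Request(get_url)
    get_response = opener.open(get_request)
    # Name the parser explicitly so BeautifulSoup does not have to guess
    soup = BeautifulSoup(get_response, 'html.parser')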
    content = soup.findAll('div',{'class':'zh-summary'})
    for i in content:
        print(i.text)
if __name__ == '__main__':
    get_collection('https://www.zhihu.com/collection/149331424')

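The result is shown below:

Not only are surprisingly few answers captured, each one also ends with "Show all". No need to think hard: the code has a bug, so let's go back and analyze the HTML again.

After analyzing the HTML more carefully, I found that the tag that really holds the content is textarea.

The earlier analysis of the HTML was not careful enough, which caused that blunder. Now that we know where the problem is, let's fix the code.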

def get_collection(get_url):
    cookie_filename = 'cookie.txt'
    cookie = http.cookiejar.MozillaCookieJar(cookie_filename)
    cookie.load(cookie_filename, ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    # Request the collection page using the saved cookies
    get_request = urllib.request.Request(get_url)
    get_response = opener.open(get_request)
    soup = BeautifulSoup(get_response, 'html.parser')
    title = soup.findAll('h2',{'class':'zm-item-title'})
    # The full answer text lives in textarea.content
    content = soup.findAll('textarea',{'class':'content'})
    for i in content:
        print(i.text)
if __name__ == '__main__':
    get_collection('https://www.zhihu.com/collection/149331424')

![Result of the second run](http://images2015.cnblogs.com/blog/994918/201703/994918-20170322005055893-1382698994.png "Result of the second run")

WTF, the content actually contains HTML tags. Fine, we need to upgrade the code further.
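
Stripping tags with a regular expression, the approach the next version of the code takes, can be tried on its own first. A minimal sketch, assuming simple markup; `strip_tags` and the sample string are hypothetical.

```python
import re

# Matches anything that looks like a tag: '<', then non-'>' characters, then '>'
TAG_PATTERN = re.compile(r'<[^>]+>', re.S)


def strip_tags(markup):
    """Remove tag-like spans and keep the text between them."""
    return TAG_PATTERN.sub('', markup)


sample = '<p>First paragraph.</p><br><b>Bold</b> text'
plain = strip_tags(sample)
print(plain)  # First paragraph.Bold text
```

This is good enough for printing answers, but a regex is not an HTML parser: a `>` inside an attribute value, for example, would cut a tag short.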

def get_collection(get_url):
    cookie_filename = 'cookie.txt'
    cookie = http.cookiejar.MozillaCookieJar(cookie_filename)
    cookie.load(cookie_filename, ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    # Request the collection page using the saved cookies
    get_request = urllib.request.Request(get_url)
    get_response = opener.open(get_request)
    soup = BeautifulSoup(get_response, 'html.parser')
    title = soup.findAll('h2',{'class':'zm-item-title'})
    content = soup.findAll('textarea',{'class':'content'})
    # Regex that strips the embedded HTML tags
    regular = re.compile(r'<[^>]+>',re.S)
    # Counter used to line each answer up with its question title
    num = 0
    for i in title:
        num += 1
        # The first matched h2 is skipped; answers start from the second
        if num > 1:
            print('Title:'+'\n'+i.text+'\n')
            answer = regular.sub('',content[num-2].text)
            print('Content:'+'\n'+answer+'\n'+'-------')
if __name__ == '__main__':
    get_collection('https://www.zhihu.com/collection/149331424')

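At this point the code is complete and achieves the goal we set.

Looking back over the code, we can see that a small piece is repeated. So we pull it out into its own function to decouple the code.

The final code is as follows: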
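
The counter in the loop above exists only to match the second title with the first answer, the third with the second, and so on. The same pairing can be sketched with slicing and `zip`; the sample lists are invented, and treating the first title as something other than a question is an assumption read off the loop, not something the page markup confirms here.

```python
def pair_titles_with_answers(titles, answers):
    """Pair titles[1], titles[2], ... with answers[0], answers[1], ..."""
    # titles[0] is skipped, mirroring the `num > 1` check in the loop
    return list(zip(titles[1:], answers))


# Invented sample data standing in for the scraped text
titles = ['Collection heading', 'Question A', 'Question B']
answers = ['Answer to A', 'Answer to B']

pairs = pair_titles_with_answers(titles, answers)
for title, answer in pairs:
    print('Title:', title)
    print('Content:', answer)
    print('-------')
```

One difference from the counter version: `zip` stops quietly at the shorter list, whereas indexing `content[num-2]` raises an IndexError when an answer is missing.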


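#coding:utf-8
# Simulated login

# Import the urllib submodules explicitly; `import urllib` alone does not load them
import urllib.request
import urllib.parse
import urllib.error
import http.cookiejar
from bs4 import BeautifulSoup
import re

# Log in to Zhihu with a phone number and password,
# store the resulting cookies,
# and reuse them on later requests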
def login(password,name):
    # Login endpoint
    postUrl = 'https://www.zhihu.com/login/phone_num'
    # Request headers
    headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
                'Referer':'https://www.zhihu.com'}
    # Form parameters
    value = {'password':password,'phone_num':name}
    data = urllib.parse.urlencode(value).encode(encoding='UTF8')
    # Build the request
    request = urllib.request.Request(postUrl , data , headers)
    # Get the cookie-aware opener and its cookie jar
    result = make_cookie()
    try:
        # Send the login request once, through the cookie-aware opener
        response = result[0].open(request)
    except urllib.error.URLError as e:
        # Only HTTPError carries a status code, so fall back to None
        print(getattr(e, 'code', None), ':', e.reason)
        return
    # Save the cookies we received to cookie.txt
    result[1].save(ignore_discard=True, ignore_expires=True)
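# Use the cookies saved by login()
# to scrape a public collection
def get_collection(get_url):
    # Get the cookie-aware opener and load the cookies saved in cookie.txt
    result = make_cookie()
    result[1].load(ignore_discard=True, ignore_expires=True)
    get_request = urllib.request.Request(get_url)
    get_response = result[0].open(get_request)
    soup = BeautifulSoup(get_response, 'html.parser')
    title = soup.findAll('h2',{'class':'zm-item-title'})
    content = soup.findAll('textarea',{'class':'content'})
    # Regex that strips the embedded HTML tags
    regular = re.compile(r'<[^>]+>',re.S)
    # Counter used to line each answer up with its question title
    num = 0
    for i in title:
        num += 1
        # The first matched h2 is skipped; answers start from the second
        if num > 1:
            print('Title:'+'\n'+i.text+'\n')
            answer = regular.sub('',content[num-2].text)
            print('Content:'+'\n'+answer+'\n'+'-------')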
def make_cookie():
    # File in which the cookies are stored
    cookie_filename = 'cookie.txt'
    cookie = http.cookiejar.MozillaCookieJar(cookie_filename)
    # Handler that records and sends cookies through this jar
    handler = urllib.request.HTTPCookieProcessor(cookie)
    # Build an opener that uses the handler
    opener = urllib.request.build_opener(handler)
    return opener,cookie

if __name__ == '__main__':
    login('####','####')
    get_collection('https://www.zhihu.com/collection/149331424')
