Python 3 crawler: simulated login and scraping of Douban movie data

Some background first.

If anyone has better ideas, please don't hold back.
After a day and a half of tinkering, I finally solved the Douban scraping problem:
the site requires a login and a captcha before it will let you keep pulling data.

You can grab the resulting data here:

"""
Link: https://pan.baidu.com/s/1StbBu4DDh0dQAwf8Ph5I9g  Extraction code: up6r
"""

The full code is available on my GitHub.

"""
lets begin
"""

Here is the story: I have a media-asset table full of film records with fields such as director, cast and genre. But a lot of the data is missing, and none of it is matched to up-to-date Douban ratings.
To build relationships between films, I need to fill in as many of their attributes as possible, so scraping Douban to enrich the table became the obvious solution.

The concrete approach: for each film name in my database, search Douban, pick the top-k best-matching results, and download the poster, rating, year and other details.

I thought about using the mighty Scrapy framework, but since I search one title at a time the framework would not buy me much, and I am not very familiar with Scrapy yet (two days of reading a book and a few blog posts). So I put the heavy artillery aside and implemented what I need with requests and XPath.

It turns out that a request to "https://movie.douban.com/j/subject_suggest?q=" + movie name returns information about that movie, or about movies whose titles look similar.


For example, searching for Terminator: https://movie.douban.com/j/subject_suggest?q=终结者

Looking closely at the JSON the page returns, it is easy to pick out the poster address ("img") and the link to the movie's detail page ("url").
Take the first entry as an example:

    {"episode":"",
    "img":"https://img3.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p1910909085.webp",
    "title":"终结者2:审判日",
    "url":"https:\/\/movie.douban.com\/subject\/1291844\/?suggest=%E7%BB%88%E7%BB%93%E8%80%85",
    "type":"movie",
    "year":"1991",
    "sub_title":"Terminator 2: Judgment Day",
    "id":"1291844"},

So my task becomes fairly simple:
1. Combine mainURL = "https://movie.douban.com/j/subject_suggest?q=" with the movie name to form the URL mainURL + NAME.
2. requests.get each URL and read the response. The body is JSON; parsing it gives the poster address and the detail-page link m_url.
3. Lightly clean up the poster address and detail-page link into proper URLs (the content stays the same; here it just means replacing every "\/" with "/").
4. Save the poster directly (urllib.request.urlretrieve(img, filename) writes it straight to disk).
5. requests.get(m_url) to fetch the detail page and pull out the movie information. Looking at the page source, the movie's own data is wrapped in <script type="application/ld+json"> and the related-movie recommendations sit inside div[@class="recommendations-bd"], so XPath can parse both with the code below (a sketch of steps 1-4 follows that code block):

    # collect movie info from the detail page
    def summaryCraw(url):
        while True:
            try:
                time.sleep(1)
                # User_Agents (a list of UA strings) and ipGet() (returns a proxies dict) are defined elsewhere in the project
                mresponse = requests.get(url, headers={'User-Agent': random.choice(User_Agents)},proxies=ipGet()).content
                break
            except:
                time.sleep(300)  # back off before retrying
                print("summaryCraw try to reconnect....")
        if len(mresponse):
            html = etree.HTML(mresponse, parser=etree.HTMLParser(encoding='utf-8'))
            # the movie's own data sits in the <script type="application/ld+json"> block
            mInfo = html.xpath('//script[@type="application/ld+json"]/text()')
            # related films: the alt text of the images inside div.recommendations-bd
            recommendation_m = html.xpath('//div[@class="recommendations-bd"]//img/@alt')
        else:
            return 0
        return mInfo,recommendation_m
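For steps 1-4 there is no standalone snippet in the post, so here is a minimal sketch. It leaves out the login, proxies and random User-Agent that are added later, and search_douban / save_poster are illustrative names rather than part of the project code:

    import json
    import urllib.request

    import requests

    mainURL = "https://movie.douban.com/j/subject_suggest?q="

    # steps 1-2: build the suggest url for a title and parse the JSON it returns
    def search_douban(name):
        resp = requests.get(mainURL + name, timeout=10)
        return json.loads(resp.text)                 # a list of candidate movies

    # steps 3-4: normalise the escaped links and save the poster to disk
    def save_poster(item, filename):
        # json.loads already turns the escaped "\/" back into "/", so this replace is
        # effectively a no-op; it just mirrors step 3 from the list above
        img = item['img'].replace('\\/', '/')
        m_url = item['url'].replace('\\/', '/')
        urllib.request.urlretrieve(img, filename)    # write the poster straight to disk
        return m_url                                 # detail-page link, fed to summaryCraw in step 5

    if __name__ == '__main__':
        candidates = search_douban('终结者')
        if candidates:
            print(save_poster(candidates[0], 'poster.jpg'))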


How do we deal with anti-scraping measures?
As the number of requests grows, Douban starts requiring a login before it will return the information above, and if the request rate gets too high the IP may be banned.
Others have tackled similar problems before, for example:
https://blog.csdn.net/qingminxiehui/article/details/81671161
https://blog.csdn.net/loner_fang/article/details/81132571
A survey of anti-scraping strategies: https://www.cnblogs.com/micro-chen/p/8676312.html
Scraping, anti-scraping and anti-anti-scraping: https://www.zhihu.com/question/28168585

In general, websites defend against crawlers with the following strategies:

  1. Ban an IP that sends requests too frequently
  2. Ban an account that sends requests too frequently
  3. Wrap the page content in JavaScript and load it asynchronously
  4. Require users to log in first

Crawlers fight back by:

  1. Adding a User-Agent to the request headers to masquerade as a browser
  2. Adding cookies to the request headers to handle sites that require login
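A minimal illustration of both counter-measures (the full code below does the same thing with fake_useragent and a logged-in requests.Session):

    import requests

    # 1. pretend to be a browser by sending a User-Agent header
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

    # 2. send login cookies with every request; normally they come from a requests.Session()
    #    that has logged in, or are copied out of a logged-in browser (placeholder values here)
    cookies = {'cookie_name': 'cookie_value'}
    r = requests.get('https://movie.douban.com/subject/1291844/', headers=headers, cookies=cookies)
    print(r.status_code)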

My own approach: log in first, then crawl, using proxy IPs and several accounts.
This is the request payload sent to Douban's login page (it includes the captcha):
We package the account information into a form of this shape and submit it with Session.post to log in.
If there is a captcha, we save the captcha image locally, type it in by hand, add it to the same form, and submit.
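Since the screenshot is not reproduced here, this is roughly what gets posted; the field names are the ones used by the login code further down, and the values are placeholders:

    import requests

    session = requests.Session()
    data = {
        "source": "None",
        "redir": "https://www.douban.com",
        "form_email": "user@example.com",       # account (placeholder)
        "form_password": "password",            # password (placeholder)
        "captcha-solution": "abcd",             # what you read off the saved captcha image
        "captcha-id": "id_scraped_from_page",   # hidden field taken from the login page html
        "login": "登录",
    }
    resp = session.post("https://www.douban.com/accounts/login", data=data)
    print(resp.status_code)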

Putting it all together, the concrete steps are:
1. Build a proxy IP pool.
"""
1. Scrape proxy IPs from the Xici (xicidaili) proxy site
2. Check each scraped IP against a chosen target URL to make sure it still works
3. Save the working IPs to the given path
4. Turn the stored IPs into the proxy format used by requests.get; proxies is a dict such as {'http': 'http://122.114.31.177:808'}
"""
Code adapted from: https://blog.csdn.net/OnlyloveCuracao/article/details/80968233
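For reference, the proxies dict from step 4 plugs straight into requests; the IP below is just the example from the text above:

    import requests

    proxies = {'http': 'http://122.114.31.177:808'}   # example ip; the real ones come from ip.txt
    # note: requests only routes http:// URLs through the 'http' entry,
    # so add an 'https' key as well when the target URL uses https
    r = requests.get('http://www.cnblogs.com/TurboWay/', proxies=proxies, timeout=5)
    print(r.status_code)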

import threading
import random
import requests
import datetime
from bs4 import BeautifulSoup

# ------------------------------------------------------ file helpers --------------------------------------------------
# append a line of text to a file
def write(path,text):
    with open(path,'a', encoding='utf-8') as f:
        f.writelines(text)
        f.write('\n')
# empty a file
def truncatefile(path):
    with open(path, 'w', encoding='utf-8') as f:
        f.truncate()
# read a file into a list of stripped lines
def read(path):
    with open(path, 'r', encoding='utf-8') as f:
        txt = []
        for s in f.readlines():
            txt.append(s.strip())
    return txt
# ----------------------------------------------------------------------------------------------------------------------
# compute a time difference, formatted as hh:mm:ss
def gettimediff(start,end):
    seconds = (end - start).seconds
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    diff = ("%02d:%02d:%02d" % (h, m, s))
    return diff
# ----------------------------------------------------------------------------------------------------------------------
# return a random request header (headers)
def getheaders():
    user_agent_list = [ \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" \
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    UserAgent=random.choice(user_agent_list)
    headers = {'User-Agent': UserAgent}
    return headers
# ----------------------------------------------------- check whether an ip works ---------------------------------------
def checkip(targeturl,ip):
    headers = getheaders()  # random request header
    proxies = {"http": "http://"+ip, "https": "http://"+ip}  # proxy ip
    try:
        response=requests.get(url=targeturl,proxies=proxies,headers=headers,timeout=5).status_code
        if response == 200 :
            return True
        else:
            return False
    except:
        return False

#------------------------------------------------------- fetching proxies ----------------------------------------------
# free proxies from XiciDaili
def findip(type,pagenum,targeturl,path): # ip type, page number, target url, path for storing ips
    list={'1': 'http://www.xicidaili.com/nt/', # xicidaili domestic plain proxies
          '2': 'http://www.xicidaili.com/nn/', # xicidaili domestic elite (high-anonymity) proxies
          '3': 'http://www.xicidaili.com/wn/', # xicidaili domestic https proxies
          '4': 'http://www.xicidaili.com/wt/'} # xicidaili foreign http proxies
    url=list[str(type)]+str(pagenum) # build the url
    headers = getheaders() # random request header
    html=requests.get(url=url,headers=headers,timeout = 5).text
    soup=BeautifulSoup(html,'lxml')
    all=soup.find_all('tr',class_='odd')
    for i in all:
        t=i.find_all('td')
        ip=t[1].text+':'+t[2].text
        is_avail = checkip(targeturl,ip)
        if is_avail == True:
            write(path=path,text=ip)
            print(ip)

#----------------------------------------------------- multi-threaded ip scraping entry point --------------------------
def getip(targeturl,path):
    truncatefile(path) # empty the file before scraping
    start = datetime.datetime.now() # start time
    threads=[]
    for type in range(4):   # four ip types, first three pages of each: 12 threads in total
        for pagenum in range(3):
            t=threading.Thread(target=findip,args=(type+1,pagenum+1,targeturl,path))
            threads.append(t)
    print('start scraping proxy ips')
    for s in threads: # start all threads
        s.start()
    for e in threads: # wait for every thread to finish
        e.join()
    print('scraping finished')
    end = datetime.datetime.now() # end time
    diff = gettimediff(start, end)  # elapsed time
    ips = read(path)  # read back the scraped ips
    print('scraped %s proxy ips in %s \n' % (len(ips), diff))

#------------------------------------------------------- entry point ---------------------------------------------------
if __name__ == '__main__':
    path = 'ip.txt' # file that stores the scraped ips
    targeturl = 'http://www.cnblogs.com/TurboWay/' # url used to verify that an ip works
    getip(targeturl,path)

2. Register several accounts.
Pick one of them at random for each login.
3. Wait a while between requests.

while True:
    try:
        time.sleep(random.random()*5)   # random pause before each round
        # ... do the crawling for one item here ...
    except:
        time.sleep(300)                 # back off on failure, then try again

4. Use a different header for each login to simulate different browsers.
I use the open-source package fake_useragent:

from fake_useragent import UserAgent
ua = UserAgent(verify_ssl=False)
# if ua = UserAgent() raises an error in your IDE, add verify_ssl=False
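Continuing the snippet above, each request can then pick a fresh random User-Agent:

    headers = {'User-Agent': ua.random}   # a different browser signature every time
    # e.g. requests.get(url, headers=headers, proxies=proxies)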

The code for logging in, and for saving and entering the captcha, is as follows:

# -*- coding: utf-8 -*- 
# @Time    : 19-1-16 2:35 PM
# @Author  : jayden.zheng 
# @FileName: login.py 
# @Software: PyCharm 
# @content :

import requests
import urllib.request
from lxml import etree
import random

# build the proxy ip pool before running this
# use a different proxy ip for every login
def login(session,agents,proxies):
    name_pass = {'@163.com': '98', '@qq.com': 'w534', '@qq.com': 'q98', '18300@qq.com': 'qw98'}  # redacted placeholder accounts (email: password); fill in your own
    url = 'https://www.douban.com/accounts/login'
    #proxies_ = ipGet()
    captchav,captchai = get_captcha(url,agents,proxies)
    email = random.sample(name_pass.keys(), 1)[0]
    passwd = name_pass[email]

    if len(captchav)>0:
        '''a captcha is present'''
        # save the captcha locally and have the user type it in
        urllib.request.urlretrieve(captchav[0],"/home/lenovo/dev/captcha.jpg")
        captcha_value = input('Open captcha.jpg and type the captcha: ')
        print ('captcha:',captcha_value)

        data={
            "source":"None",
            "redir": "https://www.douban.com",
            "form_email": email,
            "form_password": passwd,
            "captcha-solution": captcha_value,
            "captcha-id":captchai[0],
            "login": "登录",   # literal button label the form expects
        }
    else:
        '''no captcha this time'''
        print ('no captcha')

        data={
            "source":"None",
            "redir": "https://www.douban.com",
            "form_email": email,
            "form_password": passwd,
            "login": "登录",
        }
    print ('logging in......')
    #print('account: %20s password: %20s'%(email,passwd))
    return session.post(url,data=data,headers= {'User-Agent': agents},proxies=proxies)

# grab the captcha image url and captcha id from the login page
def get_captcha(url,agents,proxies):
    html = requests.get(url,headers= {'User-Agent': agents},proxies=proxies).content
    html = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
    captcha_value = html.xpath('//*[@id="captcha_image"]/@src')
    captcha_id = html.xpath('//*[@name="captcha-id"]/@value')
    return captcha_value,captcha_id


The code for the crawl itself:

# -*- coding: utf-8 -*- 
# @Time    : 19-1-15 2:53 PM
# @Author  : jayden.zheng 
# @FileName: spider.py 
# @Software: PyCharm 
# @content :
from lxml import etree
import json
import requests
import time
import urllib3
import warnings
import random
import urllib.request
import pandas as pd
from login import  login
from fake_useragent import UserAgent

# silence warning messages
warnings.simplefilter(action='ignore', category=FutureWarning)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

ua = UserAgent(verify_ssl=False)

# wsInput.dataInput() loads the local media-asset table (a DataFrame with at least NAME and C2CODE columns)
from wsInput import dataInput
movie = dataInput()
mainUrl = "https://movie.douban.com/j/subject_suggest?q="

# read the ips scraped into ip.txt and wrap each one in the {'http': 'http://ip:port'} dict that requests expects
def getip():
    ipdf = pd.read_csv('ip.txt',header=None)
    ipdf.columns =['ip']
    ipdf['ip'] = 'http://' + ipdf['ip']
    ipdf['ip'] = ipdf['ip'].apply(lambda x:{'http':x})
    return ipdf['ip']


# save the poster image locally, named after the film's C2CODE
def outputImg(c2code,img):
    imgTmp = img.split('.')[-1].lower()
    if imgTmp == 'jpg':
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".jpg"
    elif imgTmp == 'png':
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".png"
    elif imgTmp == 'jpeg':
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".jpeg"
    else:
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".pdf"
    urllib.request.urlretrieve(img, filename)

# collect movie info from the detail page
def summaryCraw(url,agents,proxies):
    while True:
        try:
            time.sleep(random.random()*5)
            mresponse = requests.get(url, headers={'User-Agent': agents},proxies=proxies).content
            break
        except:
            time.sleep(200)
            print("summaryCraw try to reconnect....")
    if len(mresponse):
        html = etree.HTML(mresponse, parser=etree.HTMLParser(encoding='utf-8'))
        mInfo = html.xpath('//script[@type="application/ld+json"]/text()')
        recommendation_m = html.xpath('//div[@class="recommendations-bd"]//img/@alt')
    else:
        return 0
    return mInfo,recommendation_m


# compare the suggest results against the queried title qstr, save posters and collect detail-page info
def infoCompare(response,qstr,movie,agents,proxies):
    imageDf = pd.DataFrame()
    c2code = movie[movie['NAME'] == qstr]['C2CODE'].iloc[0]
    imageLst = []
    movieUrlLst = []
    yearLst = []
    nameLst = []
    minfoLSt = []
    recommendationLst = []

    items = json.loads(response)   # the suggest endpoint returns a JSON list of candidates
    for item in items:
        # keep candidates whose title contains the query and that carry img/url/year fields,
        # or keep the single candidate when only one result comes back
        if (qstr in item['title'] and 'img' in item and 'url' in item and 'year' in item) or len(items) == 1:
            img = item['img'].replace('\\/','/')
            url = item['url'].replace('\\/','/')
            nameLst.append(item['title'])
            imageLst.append(img)
            movieUrlLst.append(url)
            yearLst.append(item['year'])
            print("img store ------------")
            print(img)
            outputImg(c2code, img)
            mInfo, recommendation_m = summaryCraw(url,agents,proxies)
            minfoLSt.append(mInfo)
            recommendationLst.append(recommendation_m)
            c2code = c2code + "_"

        else:
            continue

    imageDf['name'] = nameLst
    imageDf['img'] = imageLst
    imageDf['url'] = movieUrlLst
    imageDf['year'] = yearLst
    imageDf['mInfo'] = minfoLSt
    imageDf['recommendation_m'] = recommendationLst
    return imageDf


# fetch the suggest JSON text for one title
def targetUrl(mainUrl,qstr,agents,proxies):
    while True:
        try:
            time.sleep(random.random()*3)
            response = requests.get(mainUrl + qstr,headers= {'User-Agent': agents},proxies = proxies).text
            break
        except:
            time.sleep(120)
            print("targetUrl try to reconnect....")
    return response


# main loop: for each title, log in with a random account, proxy and User-Agent, then fetch and parse its data
def urlCollect(mainUrl,movie,ipLst):

    movieLst = movie['NAME']
    mlen = len(movie)
    urlDf = pd.DataFrame()
    crawlcount  = 1
    for qstr in movieLst:
        print("\ncraw times %d / %d ---@--- %s -------------->"%(crawlcount,mlen,qstr))

        session = requests.Session()
        agents = ua.random

        proxies = random.choice(ipLst)

        req = login(session,agents,proxies)
        print(req)
        response = targetUrl(mainUrl,qstr,agents,proxies)

        crawlcount = crawlcount + 1
        if len(response):

            imageDf = infoCompare(response,qstr,movie,agents,proxies)
            urlDf = pd.concat([urlDf,imageDf],sort=True)
            if crawlcount % 100 == 0:
                urlDf.to_csv("urlDf.csv")
        else:
            continue
    return urlDf



ipLst = getip()
urlDf = urlCollect(mainUrl,movie,ipLst)
print(urlDf)
urlDf.to_csv("urlDf.csv")


The full code is available on my GitHub.
