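Automatically rating downloaded CSDN resources with Python (selenium + requests)

1. Background:

Downloading resources from CSDN costs download points, and commenting on a resource you have already downloaded returns 1 point. I had previously downloaded more than 50 resources, but did not want to rate and comment on them one by one by hand. So I decided to write a small program that comments on all 50-plus resources automatically.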

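2. Assessment

With the requirement clear, the next step is the analysis. I will go straight to the result: the project breaks down into the following steps:

1. Log in to CSDN

2. Get the URLs of all resources that have not been commented on yet

3. Implement the comment function


a) There are two ways to log in to CSDN: (a) simulate the form submission in the background, or (b) drive a real browser login with selenium webdriver. Option (b) is the simplest.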

        iedriver = os.getcwd() + "\\IEDriverServer.exe"
        os.environ["webdriver.ie.driver"] = iedriver
        dr = webdriver.Ie(iedriver)

        dr.get("https://passport.csdn.net/account/login")
        dr.find_element_by_id("username").send_keys("your username")
        dr.find_element_by_id("password").send_keys("your password")
        dr.find_element_by_class_name("logging").click()


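b) Get the URLs of all the resource pages. For example, mine:


There are 58 resources in total, spread over 10 pages. So I need to iterate over the 10 download pages and extract the comment URLs from each page. Conveniently, the page URLs follow a simple pattern:

For example, the URL of the second page is actually: http://download.csdn.net/my/downloads/2

So iterating over all the pages is easy: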

        page_urls=[]
        baseurl="http://download.csdn.net/my/downloads"

        # page 1 is the base URL itself; pages 2..10 append "/<page number>"
        page_urls.append(baseurl)
        for i in range(2,11):
            page_urls.append(baseurl+"/"+str(i))
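
The same list can also be produced by a small helper that takes the page count as a parameter instead of hard-coding 10. This is only a minimal sketch: the name build_page_urls and its page_count parameter are illustrative and do not appear in the original script.

```python
BASE_URL = "http://download.csdn.net/my/downloads"


def build_page_urls(page_count, base_url=BASE_URL):
    """Return the URLs of download pages 1..page_count (hypothetical helper)."""
    # Page 1 is the bare base URL; pages 2..N append "/<page number>".
    urls = [base_url]
    for i in range(2, page_count + 1):
        urls.append(base_url + "/" + str(i))
    return urls


page_urls = build_page_urls(10)
print(page_urls[0])
print(page_urls[-1])
```

With 58 resources at 6 per page, page_count is 10, which matches the range(2,11) loop above.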

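Extract the URLs of the 6 resources on each page:

Inspecting the element in the browser shows the HTML of the comment link:

<a href="/detail/zg518/8652677#comment" class="btn-comment">立即评价</a>

So the next step is to grab these links with a regular expression: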

        comments_urls=[]
        # sample of the link to match:
        # <a href="/detail/u011000529/5727131#comment" class="btn-comment">立即评价</a>
        regl=r'<a href="(.*?)" class="btn-comment">立即评价</a>'
        # ret is the response returned by requesting one download page
        matchs=re.findall(regl, ret.content)
        for m in matchs:
            comments_urls.append("http://download.csdn.net"+m)
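
To see the regular expression at work without logging in, it can be run against a hand-written HTML fragment. The sketch below assumes Python 3, and SAMPLE_HTML is a made-up fragment modelled on the link shown above, not a real CSDN page. The function name extract_comment_urls is likewise illustrative.

```python
import re

# Made-up fragment modelled on the comment link shown above (an assumption).
SAMPLE_HTML = (
    '<a href="/detail/zg518/8652677#comment" class="btn-comment">立即评价</a>\n'
    '<a href="/detail/u011000529/5727131#comment" class="btn-comment">立即评价</a>\n'
    '<a href="/detail/other/1234567" class="btn-download">下载</a>\n'
)

# [^"]* keeps each match inside a single href attribute.
COMMENT_LINK = re.compile(r'<a href="([^"]*)" class="btn-comment">立即评价</a>')


def extract_comment_urls(html, host="http://download.csdn.net"):
    """Return absolute comment URLs for every 'btn-comment' link in html."""
    return [host + path for path in COMMENT_LINK.findall(html)]


comment_urls = extract_comment_urls(SAMPLE_HTML)
for url in comment_urls:
    print(url)
```

In Python 3 the pattern here is a str, so it should be applied to the decoded response text (ret.text in requests) rather than to the raw bytes in ret.content.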


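c) Implement the rating.

My approach was to pick one resource, rate it by hand, and capture the HTTP traffic. There are many capture tools, for example HttpFox for Firefox; I used the one built into the 360 browser.


What actually gets called is the code in comment.js: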


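OK, so after clicking submit, the page sends an ajax request in the background, and the request content is the value after data.
Looking at this data, it is easy to see that sourceid is simply the 7-digit number in the comment URL, for example 8637891 in: http://download.csdn.net/detail/lee118007/8637891#comment
content is the Chinese text of the comment, URL-encoded (percent-encoded UTF-8).
The request address is:
http://download.csdn.net/index.php/comment/post_comment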
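
The request data described above can be assembled with the standard library. The sketch below assumes Python 3 and uses the field names found in the captured request (sourceid, content, txt_validcode, rating, t). The helper names extract_source_id and build_comment_query are illustrative, and the ID is taken with a regular expression rather than by position, so it does not depend on the ID being exactly 7 digits.

```python
import re
import time
from urllib.parse import quote, urlencode


def extract_source_id(comment_url):
    """Return the numeric resource id from a .../detail/<user>/<id>#comment URL."""
    match = re.search(r"/detail/[^/]+/(\d+)", comment_url)
    if match is None:
        raise ValueError("not a resource detail URL: " + comment_url)
    return match.group(1)


def build_comment_query(comment_url, content, rating=5):
    """Build the query string for the comment request (hypothetical helper)."""
    fields = [
        ("sourceid", extract_source_id(comment_url)),
        ("content", content),              # urlencode percent-encodes it as UTF-8
        ("txt_validcode", "undefined"),    # value seen in the captured request
        ("rating", str(rating)),
        ("t", str(int(time.time() * 1000))),  # millisecond timestamp
    ]
    return urlencode(fields)


url = "http://download.csdn.net/detail/lee118007/8637891#comment"
print(extract_source_id(url))
print(quote("感谢分享"))
print(build_comment_query(url, "感谢分享"))
```

Encoding the text at run time makes it easy to vary the comment instead of pasting a pre-encoded string.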
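There is one especially important point here: the cookies and headers.
If the cookies and headers are wrong, the server rejects the request! So we need to extract the cookies from webdriver and use them in our requests call. The headers are simple: just copy them from the captured request.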
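
The cookie hand-off is just a reshaping step: selenium's get_cookies() returns a list of dictionaries, each with at least a name and a value, while requests accepts a plain name-to-value dict. A minimal sketch follows; the cookie names and values in sample_cookies are invented for illustration, and cookies_to_dict is a hypothetical helper name.

```python
def cookies_to_dict(webdriver_cookies):
    """Flatten webdriver-style cookie records into a {name: value} dict."""
    return {ck["name"]: ck["value"] for ck in webdriver_cookies}


# Invented records in the shape returned by selenium's get_cookies().
sample_cookies = [
    {"name": "UserName", "value": "demo_user", "domain": ".csdn.net", "path": "/"},
    {"name": "SessionToken", "value": "abc123", "domain": ".csdn.net", "path": "/"},
]

cookie_dict = cookies_to_dict(sample_cookies)
print(cookie_dict)
```

The resulting dict can then be passed to requests through its cookies argument, together with the headers copied from the capture.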
With that, all 3 steps can be implemented. Without further ado, here is the code:

#!/usr/bin/python
#coding=utf-8
import socket
import time
import binascii
import re
import urllib
import requests
import urllib2
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy
import os

# proxy_host="135.251.33.16"
# proxy_port="8080"
# firefox_profile = webdriver.FirefoxProfile()
# 
# firefox_profile.set_preference('network.proxy.type', 1)
# firefox_profile.set_preference('network.proxy.http', proxy_host)
# firefox_profile.set_preference('network.proxy.http_port', int(proxy_port))
# firefox_profile.set_preference('network.proxy.ssl', proxy_host)
# firefox_profile.set_preference('network.proxy.ssl_port', int(proxy_port))
# firefox_profile.set_preference('network.proxy.no_proxies_on', '127.0.0.1, localhost, .local')
# firefox_profile.update_preferences()
# 
# dr=webdriver.Firefox(firefox_profile=firefox_profile)
class CSDN():
    def __init__(self):
        iedriver = os.getcwd() + "\\IEDriverServer.exe"
        os.environ["webdriver.ie.driver"] = iedriver
        self.dr = webdriver.Ie(iedriver)
        self.comments_urls=[]
        self.page_urls=[]
    def login(self):
        self.dr.get("https://passport.csdn.net/account/login")
        self.dr.find_element_by_id("username").send_keys("XXXXXX")
        self.dr.find_element_by_id("password").send_keys("XXXXXXX")
        self.dr.find_element_by_class_name("logging").click()
        time.sleep(2)
    def logtext(self,msg): # print the log message and also append it to D:/log.txt
        print msg
        f=open("D:/log.txt","a+")
        f.write("".join(msg)+"\n")
        f.close()
    def get_all_links(self):
        baseurl="http://download.csdn.net/my/downloads"
        
        self.page_urls.append(baseurl)
        for i in range(2,11):
            self.page_urls.append(baseurl+"/"+str(i))
    def grep_comments_link(self,page):
        new_ck={}
        for ck in self.dr.get_cookies():
            new_ck[ck['name']]=ck['value']
        zyh_header={
                "Host": "download.csdn.net",
                'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
                "Accept":'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
                "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
                "Accept-Encoding":"gzip, deflate",
                "Upgrade-Insecure-Requests":"1",
                "Connection": "Keep-Alive",
                    }



        s=requests.session()
        ret = s.request("GET", page, headers=zyh_header, cookies=new_ck)
        #stra='<a href="/detail/u011000529/5727131#comment" class="btn-comment">立即评价</a>'
        regl=r'<a href="(.*?)" class="btn-comment">立即评价</a>'
        matchs=re.findall(regl, ret.content)
        for m in matchs:
            self.comments_urls.append("http://download.csdn.net"+m)
    def rate(self,comment_url):
        # self.dr.get(comment_url)
        self.logtext( "start rating:"+comment_url)
        # print self.dr.get_cookies()
        new_ck={}
        for ck in self.dr.get_cookies():
            new_ck[ck['name']]=ck['value'] ## this is the key step: extract the cookies from webdriver and pass them to requests.session()
        zyh_header={
                "Host": "download.csdn.net",
                'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
                "Accept":'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
                "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
                "Accept-Encoding":"gzip, deflate",
                "X-Requested-With":"XMLHttpRequest",
                "Referer":comment_url[:-8],
                "Connection": "Keep-Alive",
                    }
        
            
        s=requests.session()
        url="http://download.csdn.net/index.php/comment/post_comment"
        datas='sourceid='+comment_url[-15:-8]+'&content=%E6%84%9F%E8%B0%A2%E5%88%86%E4%BA%AB%E6%84%9F%E8%B0%A2%E5%88%86%E4%BA%AB&txt_validcode=undefined&rating=5&t='+str(int(time.time()*1000))
        ret=s.request("GET", url,datas, headers=zyh_header, cookies=new_ck)
        self.logtext("reply:"+ret.content)
if __name__ == '__main__':
    csdn=CSDN()
    csdn.login()
    csdn.get_all_links()
    csdn.logtext(csdn.page_urls)
    for page in csdn.page_urls:
        csdn.grep_comments_link(page)
    csdn.logtext("sum of comments_urls:"+str(len(csdn.comments_urls)))
    csdn.logtext(csdn.comments_urls)
    for cl in csdn.comments_urls:
        csdn.rate(cl)
        time.sleep(360) # CSDN enforces an interval between comments, so I submit one every 6 minutes

The printed output is as follows:

['http://download.csdn.net/my/downloads', 'http://download.csdn.net/my/downloads/2', 'http://download.csdn.net/my/downloads/3', 'http://download.csdn.net/my/downloads/4', 'http://download.csdn.net/my/downloads/5', 'http://download.csdn.net/my/downloads/6', 'http://download.csdn.net/my/downloads/7', 'http://download.csdn.net/my/downloads/8', 'http://download.csdn.net/my/downloads/9', 'http://download.csdn.net/my/downloads/10']
sum of comments_urls:40
['http://download.csdn.net/detail/st091zsc/9499197#comment', 'http://download.csdn.net/detail/mourendeyouxihao/5029371#comment', 'http://download.csdn.net/detail/lee118007/8637891#comment', 'http://download.csdn.net/detail/zhoujianghai/8160211#comment'# the rest is omitted here]
start rating:http://download.csdn.net/detail/st091zsc/9499197#comment
reply:({"succ":1})
start rating:http://download.csdn.net/detail/mourendeyouxihao/5029371#comment
reply:({"succ":1})
start rating:http://download.csdn.net/detail/lee118007/8637891#comment
reply:({"succ":1})
start rating:http://download.csdn.net/detail/zhoujianghai/8160211#comment
reply:({"succ":1})
start rating:http://download.csdn.net/detail/kayvid/8882275#comment
reply:({"succ":-4,"msg":"\u9a8c\u8bc1\u7801\u9519\u8bef"})  # it seems that after several submissions a captcha is required; I will deal with that later.
start rating:http://download.csdn.net/detail/ramissue/8451823#comment
reply:({"succ":-4,"msg":"\u9a8c\u8bc1\u7801\u9519\u8bef"})
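
The replies in this log are JSON wrapped in parentheses, so they can be decoded to tell the two cases apart. The sketch below assumes Python 3 and relies only on the two reply shapes seen in the log; treating succ == -4 as "captcha required" is an inference from this output, not documented CSDN behaviour. The names parse_reply and needs_captcha are illustrative.

```python
import json


def parse_reply(reply_text):
    """Strip the surrounding parentheses and decode the JSON body."""
    body = reply_text.strip()
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]
    return json.loads(body)


def needs_captcha(reply):
    # Assumption drawn from the log above: succ == -4 accompanies the captcha message.
    return reply.get("succ") == -4


ok = parse_reply('({"succ":1})')
blocked = parse_reply(r'({"succ":-4,"msg":"\u9a8c\u8bc1\u7801\u9519\u8bef"})')
print(ok["succ"])
print(blocked["msg"])
print(needs_captcha(blocked))
```

Decoded, the msg field reads 验证码错误 ("captcha error"). A loop could use such a check to stop submitting once the captcha is demanded rather than sleeping through the remaining URLs.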

Result screenshot:



