Simulating login and scraping articles with Python

1. Simulating login

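    You need the login URL, the request headers, and the hidden fields that the login form expects.

    The code that collects the form fields is as follows: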

def _prepare_login_form_data(self, username, password):
    """Get data for login submission"""
    # Fetch the login page through the shared session so its cookies are kept.
    response = self._session.get(CsdnHelper.csdn_login_url)
    login_page = BeautifulSoup(response.text, "lxml")
    login_form = login_page.find('form', id='fm1')

    # Hidden fields generated by the server; they are sent back unchanged on submit.
    lt = login_form.find('input', attrs={'name': 'lt'})['value']
    execution = login_form.find('input', attrs={'name': 'execution'})['value']
    event_id = login_form.find('input', attrs={'name': '_eventId'})['value']

    form = {
        "username": username,
        "password": password,
        "lt": lt,
        "execution": execution,
        "_eventId": event_id
    }

    return form
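The same idea can be tried offline. The sketch below is not the author's code: it uses only the standard library's html.parser instead of BeautifulSoup, and it collects every hidden input of the form rather than naming lt, execution and _eventId one by one. The HTML fragment, the field values and the helper names (HiddenInputCollector, build_form) are made up for illustration; only the form id fm1 and the field names come from the code above.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("type") == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical login page fragment, invented for this example.
SAMPLE_HTML = """
<form id="fm1" method="post">
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="hidden" name="lt" value="LT-123" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
"""


def build_form(html, username, password, form_id="fm1"):
    """Merge the hidden fields found in html with the user's credentials."""
    collector = HiddenInputCollector(form_id)
    collector.feed(html)
    form = dict(collector.fields)
    form["username"] = username
    form["password"] = password
    return form


print(build_form(SAMPLE_HTML, "alice", "secret"))
```

Collecting all hidden inputs keeps the form builder working if the page adds another hidden field, at the cost of sending fields you did not inspect.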
The username and password for the login form are read from the console with input (on Python 2 the equivalent is raw_input):
if __name__ == '__main__':
    csdn_helper = CsdnHelper()
    username = input("Input the username:")
    password = input("Input the password:")
    if csdn_helper.login(username, password):
        csdn_helper.readArticles()
    else:
        print("Login failed, skipping the article scrape.")
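As an optional variation, the password can be read without echoing it to the terminal. The sketch below assumes Python 3 and uses the standard getpass module; the function name read_credentials and its injectable prompt parameters are hypothetical and exist only so the function can be exercised without a terminal.

```python
import getpass


def read_credentials(ask_user=input, ask_password=getpass.getpass):
    """Return (username, password); the prompt functions can be swapped in tests."""
    username = ask_user("Input the username:").strip()
    # getpass.getpass hides what is typed, unlike input.
    password = ask_password("Input the password:")
    return username, password


# Typical use in the main block (commented out because it waits for input):
# username, password = read_credentials()
```

The two return values can be passed to csdn_helper.login in place of the two input calls.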

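Send the login request with the POST method and check the cookies that come back:
    def login(self, username, password):
        """login main function"""
        form_data = self._prepare_login_form_data(username, password)
        response = self._session.post(CsdnHelper.csdn_login_url, data=form_data)
        # The presence of the UserNick cookie is treated as the sign of a successful login.
        if 'UserNick' in response.cookies:
            nick = response.cookies['UserNick']
            print("Login succeeded")
            # The nickname is URL-encoded; on Python 3 urllib.parse.unquote decodes it
            # (urllib.unquote on Python 2).
            print(urllib.parse.unquote(nick))
            return True
        else:
            print("Login failed, invalid username or password")
            return False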
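The cookie check and the decoding step can be illustrated without a network call. This is a minimal sketch, assuming Python 3: a plain dict stands in for response.cookies, the nickname is invented, and nickname_from_cookies is a hypothetical helper rather than part of the original class.

```python
from urllib.parse import quote, unquote


def nickname_from_cookies(cookies):
    """Return the decoded UserNick value, or None when the cookie is absent."""
    raw = cookies.get("UserNick")
    if raw is None:
        return None
    return unquote(raw)


# Hypothetical cookie values; quote() produces the percent-encoded form.
logged_in = {"UserNick": quote("张三")}
print(logged_in["UserNick"])
print(nickname_from_cookies(logged_in))
print(nickname_from_cookies({}))
```

Returning None for a missing cookie lets the caller decide how to report the failed login.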

2. After a successful login, scrape the article titles and URLs
    def _get_blog_count(self):
        """Get the total number of posts and pages"""
        response = self._session.get(CsdnHelper.blog_url)
        blog_page = BeautifulSoup(response.text, 'lxml')
        # The pager text is expected to contain "N条 共M页" (N posts, M pages in total).
        span = blog_page.find('div', class_='page_nav').span
        print(span)
        a = span.string
        # Raw string so that \s and \d reach the regex engine unchanged.
        pattern = re.compile(r'(\s*\d+)条\s*共(\s*\d+)页')
        result = pattern.findall(a)
        blog_count = int((result[0][0]).strip())
        page_count = int((result[0][1]).strip())
        print("Total count is : ", blog_count, " , split into pages ", page_count)
        return (blog_count, page_count)

    def readArticles(self):
        """Print the title and URL of every post"""
        blog_count, page_count = self._get_blog_count()
        for index in range(1, page_count + 1):
            url = 'http://write.blog.csdn.net/postlist/0/0/enabled/' + str(index)
            print(url)
            response = self._session.get(url)
            page = BeautifulSoup(response.text, 'lxml')
            # Dots are escaped so that they match literally.
            links = page.find_all('a', href=re.compile(r'http://blog\.csdn\.net/flsmgf/article/details/(\d+)'))
            for link in links:
                # get_text() still returns the title when the <a> tag has nested children,
                # where .string would be None.
                blog_name = link.get_text()
                blog_url = link['href']
                print(blog_name + "," + blog_url)
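The parsing steps in the two methods above can be checked in isolation. The following is a rough sketch rather than the author's code: the pager string is a made-up sample shaped like the text the regular expression expects, and parse_pager, page_urls and article_id are hypothetical helper names. The URL prefix and the article URL pattern are taken from the code above.

```python
import re

PAGER_RE = re.compile(r'(\d+)\s*条\s*共\s*(\d+)\s*页')
ARTICLE_RE = re.compile(r'http://blog\.csdn\.net/flsmgf/article/details/(\d+)')
LIST_URL = 'http://write.blog.csdn.net/postlist/0/0/enabled/'


def parse_pager(text):
    """Return (post_count, page_count) parsed from the pager text."""
    match = PAGER_RE.search(text)
    if match is None:
        raise ValueError("pager text not recognised: %r" % text)
    return int(match.group(1)), int(match.group(2))


def page_urls(page_count):
    """Build the list-page URLs for pages 1..page_count."""
    return [LIST_URL + str(index) for index in range(1, page_count + 1)]


def article_id(href):
    """Return the numeric article id, or None if href is not an article URL."""
    match = ARTICLE_RE.match(href)
    return match.group(1) if match else None


# Hypothetical pager text, not copied from a real page.
print(parse_pager(" 33条  共2页"))
print(page_urls(2))
print(article_id('http://blog.csdn.net/flsmgf/article/details/53941127'))
```

Raising ValueError when the pager text does not match makes a layout change on the site fail loudly instead of producing an IndexError on result[0].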


The output is as follows:
http://write.blog.csdn.net/postlist/0/0/enabled/1
ORACLE-ASM,http://blog.csdn.net/flsmgf/article/details/53941127
Python 之 self,http://blog.csdn.net/flsmgf/article/details/53443606
macbook pro chrome 开发者模式快捷键,http://blog.csdn.net/flsmgf/article/details/53443526
pythong中文编码问题,http://blog.csdn.net/flsmgf/article/details/53356887
Spring AMQP + RabbitMQ 3.3.5 ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN,http://blog.csdn.net/flsmgf/article/details/52008215
Shiro 权限框架使用总结,http://blog.csdn.net/flsmgf/article/details/51874017
struts封装表单数据,http://blog.csdn.net/flsmgf/article/details/51510715
Struts2中关于"There is no Action mapped for namespace / and action name"的总结,http://blog.csdn.net/flsmgf/article/details/51510437
Struts2的convention插件,在步骤中使用Action注解跳转到其他jsp页面,http://blog.csdn.net/flsmgf/article/details/51503411
git@osc使用教程,http://blog.csdn.net/flsmgf/article/details/49705511
MyBatis+springMVC+easyUI (dataGirl)实现分页(转载),http://blog.csdn.net/flsmgf/article/details/48449253
解决tomcat端口占用问题,http://blog.csdn.net/flsmgf/article/details/48448543
mysql 添加/删除主键,http://blog.csdn.net/flsmgf/article/details/46496467
Spring JdbcTemplate RowMapper vs ResultSetExtractor,http://blog.csdn.net/flsmgf/article/details/46496245
doPost 与 doGet 的区别,http://blog.csdn.net/flsmgf/article/details/45919271
JAVA常见面试题之Forward和Redirect的区别,http://blog.csdn.net/flsmgf/article/details/45880033
java 多线程总结,http://blog.csdn.net/flsmgf/article/details/45833691
线程,http://blog.csdn.net/flsmgf/article/details/45833367
java开发多线程机制,http://blog.csdn.net/flsmgf/article/details/45804107
2015年第一次面试总结,http://blog.csdn.net/flsmgf/article/details/45799349
http://write.blog.csdn.net/postlist/0/0/enabled/2
Java--文件遍历并按层级输出,http://blog.csdn.net/flsmgf/article/details/45300097
Java--获取指定格式的文件并批量修改文件,http://blog.csdn.net/flsmgf/article/details/45157099
Java--获取指定目录下指定suffix的文件,http://blog.csdn.net/flsmgf/article/details/45137849
Java--获取指定目录下的所有文件,http://blog.csdn.net/flsmgf/article/details/45136883
Java 单文本替换并计算替换的个数,http://blog.csdn.net/flsmgf/article/details/45074059
Collections -- ArrayList vs LinkedList,http://blog.csdn.net/flsmgf/article/details/44958519
Maps--HashMap, LinkedHashMap, TreeMap,http://blog.csdn.net/flsmgf/article/details/44909709
Performance between Sets and List,http://blog.csdn.net/flsmgf/article/details/44896139
Set--HashSet, LinkedHashSet, TreeSet,http://blog.csdn.net/flsmgf/article/details/44838731
Overriding && Overloading,http://blog.csdn.net/flsmgf/article/details/44805583
Map 根据value 获取key,http://blog.csdn.net/flsmgf/article/details/44793151
String,StringBuffer与StringBuilder的区别??,http://blog.csdn.net/flsmgf/article/details/44765927
String == 与 equal 区别,http://blog.csdn.net/flsmgf/article/details/44764161
