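1. Simulating login
To log in programmatically you need the login URL and the request headers, and you have to scrape the hidden fields that the login form submits along with the credentials.
The code that scrapes the form fields is as follows: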
    def _prepare_login_form_data(self, username, password):
        """Get data for login submission"""
        response = self._session.get(CsdnHelper.csdn_login_url)
        login_page = BeautifulSoup(response.text, "lxml")
        login_form = login_page.find('form', id='fm1')

        # Hidden fields that the server expects to be posted back
        lt = login_form.find('input', attrs={'name': 'lt'})['value']
        execution = login_form.find('input', attrs={'name': 'execution'})['value']
        eventId = login_form.find('input', attrs={'name': '_eventId'})['value']
        form = {
            "username": username,
            "password": password,
            "lt": lt,
            "execution": execution,
            "_eventId": eventId
        }
        return form
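To try the field extraction without a network connection, the same idea can be sketched against a sample form. This sketch is written for Python 3 and swaps BeautifulSoup for the standard library's `html.parser`, so it is not the method used above. The `HiddenInputCollector` class, the `build_login_form` function and the sample markup are all made up for illustration; the real login page may look different.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input> tags inside one form (hypothetical helper)."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Made-up sample markup, not copied from the real login page.
SAMPLE_LOGIN_HTML = """
<form id="fm1" method="post">
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="hidden" name="lt" value="LT-123-abc" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
"""


def build_login_form(html, username, password):
    """Build the POST payload from the form's hidden fields plus the credentials."""
    parser = HiddenInputCollector("fm1")
    parser.feed(html)
    form = {name: parser.fields[name] for name in ("lt", "execution", "_eventId")}
    form["username"] = username
    form["password"] = password
    return form
```

Calling `build_login_form(SAMPLE_LOGIN_HTML, "alice", "secret")` returns a dictionary with the same five keys that `_prepare_login_form_data` builds.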
The username and password for the login form are read with raw_input:
    if __name__ == '__main__':
        csdn_helper = CsdnHelper()
        username = raw_input("Input the username:")
        password = raw_input("Input the password:")
        if csdn_helper.login(username, password):
            csdn_helper.readArticles()
        else:
            print("Login failed, denied for the former steps!")
The login request is sent with a POST, and the cookies are read from the response. The presence of a UserNick cookie is treated as the sign of a successful login:
    def login(self, username, password):
        """login main function"""
        form_data = self._prepare_login_form_data(username, password)
        response = self._session.post(CsdnHelper.csdn_login_url, data=form_data)
        #valid = False
        if 'UserNick' in response.cookies:
            # The nickname is URL-encoded in the cookie, so decode it before printing
            nick = response.cookies['UserNick']
            print("Login succeed")
            print(urllib.unquote(nick))
            return True
        else:
            print("Login failed, invalid username or password")
            return False
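The cookie check can be tried on its own, without a live session. The sketch below is for Python 3, where `urllib.unquote` lives at `urllib.parse.unquote`. The function name `nickname_from_cookies` and the plain dictionaries standing in for `response.cookies` are assumptions made for illustration.

```python
from urllib.parse import unquote


def nickname_from_cookies(cookies):
    """Return the decoded nickname, or None when the UserNick cookie is absent."""
    if "UserNick" not in cookies:
        return None
    # Cookie values are percent-encoded, e.g. "%20" stands for a space.
    return unquote(cookies["UserNick"])


# Hypothetical cookie jars; a real session would supply response.cookies.
logged_in = {"UserNick": "some%20user"}
logged_out = {}

if nickname_from_cookies(logged_in) is not None:
    print("Login succeed")
if nickname_from_cookies(logged_out) is None:
    print("Login failed, invalid username or password")
```

Returning `None` for a missing cookie keeps the success test and the decoding in one place, which is only one possible design.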
2. After a successful login, scrape the article titles and URLs
_get_blog_count reads the total number of posts and pages from the pagination bar, and readArticles then walks every page of the post list and prints each title with its URL:
    def _get_blog_count(self):
        """Get total counts of blog"""
        response = self._session.get(CsdnHelper.blog_url)
        blog_page = BeautifulSoup(response.text, 'lxml')
        span = blog_page.find('div', class_='page_nav').span
        print(span)
        a = span.string
        # The pagination text contains "<count>条 共<pages>页"
        pattern = re.compile(u'(\s*\d+)条\s*共(\s*\d+)页')
        result = pattern.findall(a)
        blog_count = int((result[0][0]).strip())
        page_count = int((result[0][1]).strip())
        print("Total count is: %d, split into %d pages" % (blog_count, page_count))
        return (blog_count, page_count)

    def readArticles(self):
        """Get article"""
        blog_count, page_count = self._get_blog_count()
        for index in range(1, page_count + 1):
            url = 'http://write.blog.csdn.net/postlist/0/0/enabled/' + str(index)
            print(url)
            response = self._session.get(url)
            page = BeautifulSoup(response.text, 'lxml')
            # Keep only links that point at this blog's article pages
            links = page.find_all('a', href=re.compile(r'http://blog.csdn.net/flsmgf/article/details/(\d+)'))
            for link in links:
                blog_name = link.string
                blog_url = link['href']
                print(blog_name + "," + blog_url)
The output is as follows:
http://write.blog.csdn.net/postlist/0/0/enabled/1
ORACLE-ASM,http://blog.csdn.net/flsmgf/article/details/53941127
Python 之 self,http://blog.csdn.net/flsmgf/article/details/53443606
macbook pro chrome 开发者模式快捷键,http://blog.csdn.net/flsmgf/article/details/53443526
pythong中文编码问题,http://blog.csdn.net/flsmgf/article/details/53356887
Spring AMQP + RabbitMQ 3.3.5 ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN,http://blog.csdn.net/flsmgf/article/details/52008215
Shiro 权限框架使用总结,http://blog.csdn.net/flsmgf/article/details/51874017
struts封装表单数据,http://blog.csdn.net/flsmgf/article/details/51510715
Struts2中关于"There is no Action mapped for namespace / and action name"的总结,http://blog.csdn.net/flsmgf/article/details/51510437
Struts2的convention插件,在步骤中使用Action注解跳转到其他jsp页面,http://blog.csdn.net/flsmgf/article/details/51503411
git@osc使用教程,http://blog.csdn.net/flsmgf/article/details/49705511
MyBatis+springMVC+easyUI (dataGirl)实现分页(转载),http://blog.csdn.net/flsmgf/article/details/48449253
解决tomcat端口占用问题,http://blog.csdn.net/flsmgf/article/details/48448543
mysql 添加/删除主键,http://blog.csdn.net/flsmgf/article/details/46496467
Spring JdbcTemplate RowMapper vs ResultSetExtractor,http://blog.csdn.net/flsmgf/article/details/46496245
doPost 与 doGet 的区别,http://blog.csdn.net/flsmgf/article/details/45919271
JAVA常见面试题之Forward和Redirect的区别,http://blog.csdn.net/flsmgf/article/details/45880033
java 多线程总结,http://blog.csdn.net/flsmgf/article/details/45833691
线程,http://blog.csdn.net/flsmgf/article/details/45833367
java开发多线程机制,http://blog.csdn.net/flsmgf/article/details/45804107
2015年第一次面试总结,http://blog.csdn.net/flsmgf/article/details/45799349
http://write.blog.csdn.net/postlist/0/0/enabled/2
Java--文件遍历并按层级输出,http://blog.csdn.net/flsmgf/article/details/45300097
Java--获取指定格式的文件并批量修改文件,http://blog.csdn.net/flsmgf/article/details/45157099
Java--获取指定目录下指定suffix的文件,http://blog.csdn.net/flsmgf/article/details/45137849
Java--获取指定目录下的所有文件,http://blog.csdn.net/flsmgf/article/details/45136883
Java 单文本替换并计算替换的个数,http://blog.csdn.net/flsmgf/article/details/45074059
Collections -- ArrayList vs LinkedList,http://blog.csdn.net/flsmgf/article/details/44958519
Maps--HashMap, LinkedHashMap, TreeMap,http://blog.csdn.net/flsmgf/article/details/44909709
Performance between Sets and List,http://blog.csdn.net/flsmgf/article/details/44896139
Set--HashSet, LinkedHashSet, TreeSet,http://blog.csdn.net/flsmgf/article/details/44838731
Overriding && Overloading,http://blog.csdn.net/flsmgf/article/details/44805583
Map 根据value 获取key,http://blog.csdn.net/flsmgf/article/details/44793151
String,StringBuffer与StringBuilder的区别??,http://blog.csdn.net/flsmgf/article/details/44765927
String == 与 equal 区别,http://blog.csdn.net/flsmgf/article/details/44764161
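
The page URLs in this output depend on the page count that `_get_blog_count` parses from the pagination text. That parsing step can be checked in isolation with a rough Python 3 sketch like the one below. The sample string is invented, since the exact text of the pagination span is not shown here, and the pattern is a slightly stricter variant of the original one. `parse_page_info` and `page_urls` are hypothetical helper names.

```python
import re

# Matches text such as "33条 共2页" ("33 items, 2 pages in total").
PAGE_INFO = re.compile(r"(\d+)\s*条\s*共\s*(\d+)\s*页")

POST_LIST_URL = "http://write.blog.csdn.net/postlist/0/0/enabled/"


def parse_page_info(text):
    """Return (blog_count, page_count) parsed from the pagination text."""
    match = PAGE_INFO.search(text)
    if match is None:
        raise ValueError("unrecognized pagination text: %r" % text)
    return int(match.group(1)), int(match.group(2))


def page_urls(page_count):
    """List the post-list URL of every page, numbered from 1."""
    return [POST_LIST_URL + str(index) for index in range(1, page_count + 1)]


# Invented sample text; the real span content may be formatted differently.
blog_count, page_count = parse_page_info(" 33条  共2页 ")
urls = page_urls(page_count)
```

Raising `ValueError` on unexpected text makes a change in the page layout fail loudly, instead of producing an `IndexError` from an empty `findall` result.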