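This article shows how to crawl pages with the map method of a multi-process pool, extract page content with json, and parse the page with xpath. The example page is: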
http://tieba.baidu.com/p/3522395718?pn=1
Page source:
<div class="l_post j_l_post l_post_bright " data-field="{"author":{"user_id":503570759,"user_name":"\u9893\u5e9f\u4e86\u8c01\u7684\u6e05\u7eaf","name_u":"%E9%A2%93%E5%BA%9F%E4%BA%86%E8%B0%81%E7%9A%84%E6%B8%85%E7%BA%AF&ie=utf-8","user_sex":2,"portrait":"47e1e9a293e5ba9fe4ba86e8b081e79a84e6b885e7baaf031e","is_like":1,"level_id":14,"level_name":"\u4f20\u5947\u679c\u7c89","cur_score":20947,"bawu":0,"props":null},"content":{"post_id":62866847607,"is_anonym":false,"open_id":"tbclient","open_type":"apple","date":"2015-01-11 16:39","vote_crypt":"","post_no":6,"type":"0","comment_num":123,"ptype":"0","is_saveface":false,"props":null,"post_index":4,"pb_tpoint":null}}">
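As a rough sketch of how the data-field attribute above could be read: xpath selects the attribute of each post div, and json.loads turns its value into a dict. SAMPLE_HTML is a shortened, hypothetical copy of the snippet above (the quotes inside the attribute are assumed to be escaped as &quot; in the real page), parse_posts is an illustrative helper name, and lxml is assumed to be installed.

```python
import json
from lxml import etree

# Shortened, hypothetical copy of the post div shown above.
# The inner quotes are written as &quot; so the attribute stays valid HTML.
SAMPLE_HTML = (
    '<div class="l_post j_l_post l_post_bright " '
    'data-field="{&quot;author&quot;:{&quot;user_id&quot;:503570759,&quot;level_id&quot;:14},'
    '&quot;content&quot;:{&quot;post_id&quot;:62866847607,'
    '&quot;date&quot;:&quot;2015-01-11 16:39&quot;,'
    '&quot;post_no&quot;:6,&quot;comment_num&quot;:123}}"></div>'
)


def parse_posts(html_text):
    """Return one dict per post div, decoded from its data-field attribute."""
    selector = etree.HTML(html_text)
    # contains() is used because the class attribute holds several class names
    fields = selector.xpath('//div[contains(@class, "l_post")]/@data-field')
    return [json.loads(field) for field in fields]


if __name__ == "__main__":
    for post in parse_posts(SAMPLE_HTML):
        print(post["content"]["post_no"], post["content"]["date"])
```

Each decoded dict has an "author" part and a "content" part, so fields such as the post date or the comment count can be read with ordinary dict lookups.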
Code:
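import requests
from lxml import etree

def spider(url):
    # Download the page and build an lxml tree for xpath queries
    html = requests.get(url)
    selector = etree.HTML(html.text)
    content_field = selector.xpath('//div[@class=&#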
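The map-based crawling mentioned at the start could look roughly like the sketch below. It is not necessarily the author's setup: it uses the thread-backed multiprocessing.dummy.Pool, which exposes the same map method as multiprocessing.Pool, and the names build_page_urls and crawl are hypothetical helpers. The URL pattern is taken from the example page, with pn assumed to be the page number.

```python
from multiprocessing.dummy import Pool  # thread-backed pool, same map() API as multiprocessing.Pool

# URL pattern of the example thread; pn is assumed to be the page number
BASE_URL = "http://tieba.baidu.com/p/3522395718?pn={}"


def build_page_urls(first_page, last_page):
    """Return the thread URLs for pages first_page..last_page inclusive."""
    return [BASE_URL.format(n) for n in range(first_page, last_page + 1)]


def crawl(urls, worker, pool_size=4):
    """Apply worker to every URL concurrently; results keep the input order."""
    with Pool(pool_size) as pool:
        return pool.map(worker, urls)


if __name__ == "__main__":
    # A stand-in worker that only reports the URL length, so the sketch runs
    # without network access; a real crawl would pass a function like spider.
    print(crawl(build_page_urls(1, 3), len))
```

Passing the worker as an argument keeps the pool logic separate from the page parsing, so the same crawl call works with any function that takes a single URL.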