python Baidu Tieba post time_Python-18: Source Code for Multithreaded Scraping of Baidu Tieba Post Content


weixin_39516956


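# -*- coding: utf8 -*-
# Note: this script targets Python 2 (it relies on reload(sys), unicode() and print statements).

from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json

# These three lines work around encoding problems (Python 2 only).
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

'''Delete content.txt before running the script again: the file is opened in append mode, so every run adds to whatever is already there.'''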

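# Write one reply to the output file in the format below.
# The three labels mean "reply time", "reply content" and "replied by".
# f is the module-level file object opened in the __main__ block.
def towrite(contentdict):
    f.writelines(u'回帖时间:' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'回帖内容:' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'回帖人:' + contentdict['user_name'] + '\n\n')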
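As a minimal sketch of the record format that towrite produces, the following Python 3 version passes the file handle explicitly and guards each write with a lock, since several pool threads may call it at the same time. The names `format_reply` and `write_reply`, the lock, the explicit UTF-8 encoding, the temporary output path and the sample reply text are illustrative assumptions, not part of the original script.

```python
import io
import os
import tempfile
import threading

# Hypothetical lock; the original script writes to a global file object `f`
# without any synchronisation.
_write_lock = threading.Lock()


def format_reply(reply):
    """Return one reply laid out the way towrite writes it."""
    return (
        u'回帖时间:' + str(reply['topic_reply_time']) + u'\n'
        + u'回帖内容:' + str(reply['topic_reply_content']) + u'\n'
        + u'回帖人:' + reply['user_name'] + u'\n\n'
    )


def write_reply(handle, reply):
    """Write one formatted reply; the lock keeps records from interleaving."""
    with _write_lock:
        handle.write(format_reply(reply))


if __name__ == '__main__':
    sample = {
        'topic_reply_time': '2015-01-11 22:09',
        'topic_reply_content': 'hello',  # made-up reply text
        'user_name': 'huluxiao855',
    }
    path = os.path.join(tempfile.mkdtemp(), 'content.txt')
    # Opening the file twice in append mode leaves two records in it,
    # which is why the script asks for content.txt to be deleted first.
    for _ in range(2):
        with io.open(path, 'a', encoding='utf-8') as out:
            write_reply(out, sample)
    with io.open(path, encoding='utf-8') as saved:
        print(saved.read().count(u'回帖人:'))  # prints 2
```

Passing the handle in, rather than relying on a global, also makes the function easy to exercise with an in-memory buffer such as `io.StringIO`.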

_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}

# Scrape the replies from the page at the given URL.
def spider(url):
    html = requests.get(url, headers=_header)
    print url
    selector = etree.HTML(html.text)
    # Select every post (floor) on this page.
    content_field = selector.xpath('//div[@class="l_post j_l_post l_post_bright "]')
    item = {}
    # Walk through the posts one by one.
    for each in content_field:
        '''Sample data-field attribute of one post:
        data-field="{
            "author": {
                "user_id": 830583117,
                "user_name": "huluxiao855",
                "name_u": "huluxiao855&ie=utf-8",
                "user_sex": 0,
                "portrait": "4db168756c757869616f3835358131",
                "is_like": 1,
                "level_id": 4,
                "level_name": "\u719f\u6089\u82f9\u679c",
                "cur_score": 31,
                "bawu": 0,
                "props": null
            },
            "content": {
                "post_id": 62881461599,
                "is_anonym": false,
                "open_id": "tbclient",
                "open_type": "apple",
                "date": "2015-01-11 22:09",
                "vote_crypt": "",
                "post_no": 203,
                "type": "0",
                "comment_num": 1,
                "ptype": "0",
                "is_saveface": false,
                "props": null,
                "post_index": 0,
                "pb_tpoint": null
            }
        }"'''
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot', ''))
        # reply_info is a dict; following the structure shown above, the author is read like this.
        author = reply_info['author']['user_name']
        # Note: [0] raises IndexError if a post contains no text node (for example, an image-only post).
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        print content
        print reply_time
        print author
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

'''When a .py file is executed directly, __name__ == '__main__' is True inside that file.
When the same file is imported from another .py file, __name__ holds the module's name (the file name without the .py extension) rather than '__main__'.

This has another use:
debugging code placed under "if __name__ == '__main__'" is skipped when an external module imports the file,
but it runs normally when the module file is executed directly to track down a problem.'''
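The behaviour described above can be checked with a small self-contained sketch, written for Python 3. It writes a throwaway module to a temporary directory, imports it, then runs the same file as a script, and reports what `__name__` evaluated to in each case. The module name `demo_mod` and the function name are made up for this illustration.

```python
import importlib
import os
import subprocess
import sys
import tempfile

# Source of a throwaway module; `demo_mod` is a hypothetical name.
MODULE_SOURCE = (
    "SEEN_NAME = __name__\n"
    "if __name__ == '__main__':\n"
    "    print('run directly: ' + __name__)\n"
)


def name_when_imported_and_run():
    """Return (__name__ seen on import, output printed when run as a script)."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, 'demo_mod.py')
    with open(path, 'w') as handle:
        handle.write(MODULE_SOURCE)

    # Importing: __name__ is the module name, so the guarded block is skipped.
    sys.path.insert(0, workdir)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module('demo_mod')
    finally:
        sys.path.remove(workdir)

    # Running directly: __name__ is '__main__', so the guarded block executes.
    output = subprocess.check_output([sys.executable, path])
    return module.SEEN_NAME, output.decode().strip()


if __name__ == '__main__':
    print(name_when_imported_and_run())
```

The first element of the returned pair is `'demo_mod'` and the second is `'run directly: __main__'`, which is the distinction the scraper relies on to keep its pool setup from running on import.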

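if __name__ == '__main__':
    # Create a pool of 4 worker threads.
    pool = ThreadPool(4)
    # The second argument, 'a', opens the file in append mode.
    f = open('content.txt', 'a')
    # List that holds the page URLs.
    page = []
    # Build the URLs for pages 1 through 20 and append them to the list.
    for i in range(1, 21):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)
    # Scrape the pages concurrently on the thread pool.
    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()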
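The two central steps of the script, decoding the data-field JSON and spreading work over a thread pool, can be tried offline without touching the network. The sketch below is written for Python 3 and uses only the standard library. The function names `parse_data_field` and `parse_all` and the trimmed payloads are assumptions made for illustration: the first payload reuses values from the sample in spider, the second is invented.

```python
import html
import json
from multiprocessing.dummy import Pool as ThreadPool

# Trimmed data-field values as they might appear in raw page source,
# with the double quotes escaped as &quot; entities.
RAW_FIELDS = [
    '{&quot;author&quot;:{&quot;user_name&quot;:&quot;huluxiao855&quot;},'
    '&quot;content&quot;:{&quot;date&quot;:&quot;2015-01-11 22:09&quot;,'
    '&quot;post_no&quot;:203}}',
    # Invented second record, for illustration only.
    '{&quot;author&quot;:{&quot;user_name&quot;:&quot;example_user&quot;},'
    '&quot;content&quot;:{&quot;date&quot;:&quot;2015-01-12 08:30&quot;,'
    '&quot;post_no&quot;:204}}',
]


def parse_data_field(raw):
    """Decode HTML entities, parse the JSON, and keep the fields the script uses."""
    info = json.loads(html.unescape(raw))
    return {
        'user_name': info['author']['user_name'],
        'topic_reply_time': info['content']['date'],
    }


def parse_all(raw_fields, workers=4):
    """Parse every payload on a thread pool; map() returns results in input order."""
    pool = ThreadPool(workers)
    try:
        return pool.map(parse_data_field, raw_fields)
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    for reply in parse_all(RAW_FIELDS):
        print(reply['topic_reply_time'], reply['user_name'])
```

When the attribute is read through lxml, as in spider, the entities have normally been decoded already, so `html.unescape` matters only when working from raw page source. Because `multiprocessing.dummy` uses threads rather than processes, it suits I/O-bound work such as waiting on HTTP responses.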
