Without further ado, here is the code:
import re
import csv

# A small scraping example: extract the username, post content, and post time
# for each floor (reply) of a Baidu Tieba thread (web version) using regular
# expressions. source3.txt holds the saved HTML source of the page.
with open('source3.txt', 'r', encoding='UTF-8') as f:
    source = f.read()

result_list = []
# Split the page into one chunk of HTML per floor.
every_floor = re.findall('"l_post j_l_post l_post_bright(.*?)<div class="clear"></div>', source, re.S)

for each in every_floor:
    # Start with a fresh dict on every iteration, fill it in,
    # then append the populated dict to the list.
    # Note: re.findall returns a list, so each field holds a list of matches.
    result = {}
    result['username'] = re.findall('username="(.*?)" class="" src="', each, re.S)
    result['content'] = re.findall('j_d_post_content clearfix" style="display:;">(.*?)</div><br>', each, re.S)
    result['reply_time'] = re.findall('date":"(.*?)","vote_crypt', each, re.S)
    result_list.append(result)

# newline='' stops the csv module from writing blank rows on Windows.
with open('hstieba2.csv', 'w', encoding='gbk', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['username', 'content', 'reply_time'])
    writer.writeheader()
    writer.writerows(result_list)
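One thing to be aware of: `re.findall` returns a list, so the CSV cells produced by the script above contain list representations such as `['name']` rather than plain strings. If plain strings are preferred, a minimal sketch of one alternative is to take only the first match with `re.search`. The helper name `first_match` and the `sample_floor` string below are hypothetical, made up for illustration; they are not real Tieba page source.

```python
import re

def first_match(pattern, text):
    """Return the first captured group of pattern in text, or '' if there is no match."""
    found = re.search(pattern, text, re.S)
    return found.group(1).strip() if found else ''

# Hypothetical, heavily simplified stand-in for the HTML of one floor.
sample_floor = 'username="alice" class="" src="x.png" ... date":"2018-01-01 12:00","vote_crypt'

row = {
    'username': first_match('username="(.*?)" class="" src="', sample_floor),
    'content': first_match('j_d_post_content clearfix" style="display:;">(.*?)</div><br>', sample_floor),
    'reply_time': first_match('date":"(.*?)","vote_crypt', sample_floor),
}
print(row)
```

A dict built this way can be passed to `csv.DictWriter` just like the ones in the original loop; a field whose pattern does not match simply becomes an empty string.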
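In essence, this just applies regular expressions to plain text. It is for reference only; if you have any questions, please leave a comment below.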