Practical Python crawling: analyzing the hour-by-hour posting patterns of the zjgsu and zju Baidu Tieba forums over a day
Let me tidy things up and start from the beginning! This began as a random idea between classes, and the more I thought about it the more fun it seemed: how do the posters at a "985" school differ from those at my own? (Limited by machine specs and time, the data set is every reply in every floor of all threads on the first 50 pages of each forum. Sub-replies inside floors are excluded, and per the requirements only the post time and the poster's nickname were collected.)
--- jackshenonly's first blog post
I. Prerequisites
1. A working knowledge of Python and its basic syntax
2. Basic usage of the urllib2 module
3. Familiarity with regular expressions
4. For the front-end display, some knowledge of drawing charts with Echarts
5. As for the analysis itself: to each their own
A bit of studying, a bit of reference reading, and it was pretty much time to get to work~
II. Requirements Analysis
To understand the daily posting patterns of the Tieba users of these two schools (zju, zjgsu), we naturally need a data set of those users' posting times.
1. Getting the data: where does it come from? Baidu certainly isn't going to hand it over, so we're on our own. A CS student obviously isn't going to page through the forum and record posts by hand; web crawler ("spider") techniques handle this kind of dull, mechanical work nicely. You just have to keep the little spider under control.
2. Simple data processing: load the data into a MySQL database and run some simple statistics according to the actual needs.
3. Visualization: nobody wants to stare at a pile of dry numbers, so a clear front-end display is still necessary. With Echarts, data visualization is no longer a hard problem.
III. Implementation
Implementation analysis:
1. Scouting the terrain: I wasn't very familiar with Tieba, so I browsed around a bit and went into the zjgsu and zju forums. Each listing page shows 50 threads, each thread has an ID, and opening a thread gives the time and the poster's nickname for every reply. The URL of each listing page changes in a regular pattern:
url = 'http://tieba.baidu.com/f?kw=浙江大学&ie=utf-8&pn=' #(page_id-1)*50
Each individual thread's URL is also uniform:
url = 'http://tieba.baidu.com/p/' + topic_id
Then use regular expressions to pull out the parts we need:
Thread ID:
'<a href="/p/(\d.*?)" title="'
Post time:
'<span.*?class="j_reply_data">(.*?)</span>'
Poster:
'<li class="d_name".*?>.*?<a.*?class="p_author.*?>(.*?)</a>.*?</li>'
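Just to illustrate the URL arithmetic and the thread-ID regex before the full classes below, here is a minimal standalone sketch (Python 2, same environment as the rest of the code; the regex is the one listed above, and the exact output depends on Tieba's live markup):
# -*- coding: utf-8 -*-
import re
import urllib
import urllib2

page_id = 2                                 # example: second listing page
kw = urllib.quote('浙江大学')                # percent-encode the forum name for the query string
url = 'http://tieba.baidu.com/f?kw=' + kw + '&ie=utf-8&pn=' + str((page_id - 1) * 50)
html = urllib2.urlopen(url).read().decode('utf8')
topic_ids = re.findall(r'<a href="/p/(\d.*?)" title="', html, re.S)
print topic_ids[:5]                         # first few thread IDs on page 2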
2. Lining up the targets: time to warm up and start coding. I opened my familiar editor, Sublime.
First, create one class for a listing page and one class for a thread. Without further ado, here is the code.
# -*- coding: utf-8 -*-
import re
import time
import urllib2

class Topic_Page:
    url = 'http://tieba.baidu.com/f?kw=浙江工商大学&ie=utf-8&pn='  # pn = (page_id-1)*50

    def __init__(self, page_id):
        self.url += str((page_id - 1) * 50)
        self.page_id = page_id
        self.topic_ids = []

    def MySpider(self):
        print "Fetching page " + str(self.page_id) + "..."
        mypage = urllib2.urlopen(self.url).read().decode("utf8")
        print "Fetched page " + str(self.page_id) + " successfully!"
        return mypage

    def GetTopicId(self):
        myMatch = re.findall('<a href="/p/(\d.*?)" title="', self.MySpider(), re.S)
        print "Matched the topic IDs on page " + str(self.page_id) + "!"
        for topicid in myMatch:
            self.topic_ids.append(topicid)
class Details_Page:
    url = 'http://tieba.baidu.com/p/'

    def __init__(self, topic_id):
        self.url += str(topic_id)

    def MySpider(self):
        mypage = urllib2.urlopen(self.url).read().decode("utf8")
        return mypage.replace('\n', '')

    def GetDetails_PutIntoFile(self):
        mypage = self.MySpider()   # fetch once, match twice
        myMatch = re.findall('<span.*?class="j_reply_data">(.*?)</span>', mypage, re.S)
        myMatch2 = re.findall('<li class="d_name".*?>.*?<a.*?class="p_author.*?>(.*?)</a>.*?</li>', mypage, re.S)
        # output file; should match the forum set in Topic_Page.url (zjgsu here, zju for the other run)
        f = open("zjgsu_time_username.txt", "a")
        for reply in range(len(myMatch)):
            # encode the whole line at once to avoid mixing unicode and utf-8 byte strings
            f.write((myMatch[reply] + ":00" + "\t" + myMatch2[reply] + "\n").encode('utf8'))
        f.close()

    def Page_Counter(self):
        myMatch = re.search(r'<span class="red">(\d*?)</span>', self.MySpider(), re.S)
        return myMatch.groups()[0]
3. Next comes the program entry point and the control variables. The Break_* variables are there for when the network gets flaky and a page fetch fails: the program stops, and using the information printed on screen you can restart it and pick up where it left off. Wrapping the fetches in try/except would probably be cleaner. The results we need are written to a file for later processing.
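For what it's worth, here is a minimal sketch of what that try/except approach could look like: a small retry wrapper around urllib2.urlopen (fetch_with_retry is a hypothetical helper, not part of the code below; the MySpider methods could call it instead of urlopen directly):
import time
import urllib2

def fetch_with_retry(url, retries=3, delay=5):
    # Hypothetical helper: try a few times before giving up, instead of
    # stopping the whole crawl and restarting it by hand via the Break_* variables.
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url, timeout=30).read().decode("utf8")
        except (urllib2.URLError, IOError) as e:
            print "Fetch failed (%s), attempt %d/%d" % (e, attempt + 1, retries)
            time.sleep(delay)
    raise IOError("Giving up on " + url)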
#----------program entry------------
begin_page = 1
end_page = 50
Break_Main_Page = 18
Break_tie = 43
Break_Next_Page = 0
beginTime = time.strftime("%Y-%m-%d %H:%M:%S") + "\t" + "data collection start time\n"
for MainPage in range(begin_page, end_page + 1):
    if MainPage < Break_Main_Page:      # skip listing pages already done before the last break
        continue
    a = Topic_Page(MainPage)
    a.GetTopicId()
    for index, topic_id in enumerate(a.topic_ids):
        if index + 1 < Break_tie:       # skip threads already done before the last break
            continue
        print "Fetching page " + str(MainPage) + ", thread " + str(index + 1) + " .../p/" + str(topic_id)
        b = Details_Page(topic_id)
        temp_url = b.url
        for NextPage in range(int(b.Page_Counter())):
            if NextPage + 1 < Break_Next_Page:
                continue
            b.url = temp_url + "?pn=" + str(NextPage + 1)
            b.GetDetails_PutIntoFile()
        Break_Next_Page = 0
    Break_tie = 0
Break_Main_Page = 0
endTime = time.strftime("%Y-%m-%d %H:%M:%S") + "\t" + "data collection end time\n"
f = open('zjgsu_time_username.txt', "a")
f.write(beginTime + endTime)
f.close()
4. Biu~biu~~, I hit Ctrl+B with a flourish and the program was off and running.
5. Importing the data into MySQL is, as anyone who has used it knows, easy enough:
load data infile "filepath/filename.txt" into table table_name (datetime, author);
What we actually need is only the time of day, but the scraped value also includes the date, so a small adjustment is needed:
alter table table_name add column time time after author;
update table_name set time = datetime;
When the datetime value is assigned to the TIME column, the date part is dropped automatically.
6. Data visualization:
Working with the data inside the database is very convenient. Next comes the per-interval counting. I split the day into after-class, in-class and between-class slots, 32 intervals in all:
time_data = ['00:00~01:00','01:00~02:00','02:00~03:00','03:00~04:00','04:00~05:00','05:00~06:00','06:00~07:00','07:00~08:05','08:05~09:35','09:35~09:50','09:50~10:35','10:35~10:40','10:40~11:25','11:25~11:30','11:30~12:15','12:15~13:40','13:40~14:25','14:25~14:35','14:35~15:20','15:20~15:30','15:30~16:15','16:15~16:25','16:25~17:10','17:10~18:30','18:30~19:15','19:15~19:25','19:25~20:10','20:10~20:20','20:20~21:05','21:05~22:00','22:00~23:00','23:00~24:00'];
Let's write a stored procedure to do the counting; I called it counter_zju(). The quick-and-dirty code is as follows:
DROP PROCEDURE IF EXISTS `counter_zju`;
CREATE DEFINER = `root`@`localhost` PROCEDURE `counter_zju`()
BEGIN
#Routine body goes here...
SELECT count(*) from zju where time <= '01:00:00' INTO @count1;
SELECT count(*) from zju where time > '01:00:00' AND time <= '02:00:00' INTO @count2;
SELECT count(*) from zju where time > '02:00:00' AND time <= '03:00:00' INTO @count3;
SELECT count(*) from zju where time > '03:00:00' AND time <= '04:00:00' INTO @count4;
SELECT count(*) from zju where time > '04:00:00' AND time <= '05:00:00' INTO @count5;
SELECT count(*) from zju where time > '05:00:00' AND time <= '06:00:00' INTO @count6;
SELECT count(*) from zju where time > '06:00:00' AND time <= '07:00:00' INTO @count7;
SELECT count(*) from zju where time > '07:00:00' AND time <= '08:05:00' INTO @count8;
SELECT count(*) from zju where time > '08:05:00' AND time <= '09:35:00' INTO @count9;
SELECT count(*) from zju where time > '09:35:00' AND time <= '09:50:00' INTO @count10;
SELECT count(*) from zju where time > '09:50:00' AND time <= '10:35:00' INTO @count11;
SELECT count(*) from zju where time > '10:35:00' AND time <= '10:40:00' INTO @count12;
SELECT count(*) from zju where time > '10:40:00' AND time <= '11:25:00' INTO @count13;
SELECT count(*) from zju where time > '11:25:00' AND time <= '11:30:00' INTO @count14;
SELECT count(*) from zju where time > '11:30:00' AND time <= '12:15:00' INTO @count15;
#afternoon
SELECT count(*) from zju where time > '12:15:00' AND time <= '13:40:00' INTO @count16;
SELECT count(*) from zju where time > '13:40:00' AND time <= '14:25:00' INTO @count17;
SELECT count(*) from zju where time > '14:25:00' AND time <= '14:35:00' INTO @count18;
SELECT count(*) from zju where time > '14:35:00' AND time <= '15:20:00' INTO @count19;
SELECT count(*) from zju where time > '15:20:00' AND time <= '15:30:00' INTO @count20;
SELECT count(*) from zju where time > '15:30:00' AND time <= '16:15:00' INTO @count21;
SELECT count(*) from zju where time > '16:15:00' AND time <= '16:25:00' INTO @count22;
SELECT count(*) from zju where time > '16:25:00' AND time <= '17:10:00' INTO @count23;
SELECT count(*) from zju where time > '17:10:00' AND time <= '18:30:00' INTO @count24;
SELECT count(*) from zju where time > '18:30:00' AND time <= '19:15:00' INTO @count25;
SELECT count(*) from zju where time > '19:15:00' AND time <= '19:25:00' INTO @count26;
SELECT count(*) from zju where time > '19:25:00' AND time <= '20:10:00' INTO @count27;
SELECT count(*) from zju where time > '20:10:00' AND time <= '20:20:00' INTO @count28;
SELECT count(*) from zju where time > '20:20:00' AND time <= '21:05:00' INTO @count29;
SELECT count(*) from zju where time > '21:05:00' AND time <= '22:00:00' INTO @count30;
SELECT count(*) from zju where time > '22:00:00' AND time <= '23:00:00' INTO @count31;
SELECT count(*) from zju where time > '23:00:00' INTO @count32;
select @count1,@count2,@count3,@count4,@count5,@count6,@count7,@count8,@count9,@count10,@count11,@count12,@count13,@count14,@count15,@count16,@count17,@count18,@count19,@count20,@count21,@count22,@count23,@count24,@count25,@count26,@count27,@count28,@count29,@count30,@count31,@count32;
END;
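If you would rather stay in Python, a rough sanity-check sketch could do the same binning directly on the crawler's output file (assuming the tab-separated "datetime<TAB>nickname" lines produced above, with the two "data collection start/end time" footer lines stripped first; the bucket edges mirror the stored procedure):
# -*- coding: utf-8 -*-
# Sketch: bin the scraped post times into the same 32 intervals as counter_zju().
from bisect import bisect_left

# Upper bound of each interval, matching the boundaries in the stored procedure.
edges = ['01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:05',
         '09:35', '09:50', '10:35', '10:40', '11:25', '11:30', '12:15', '13:40',
         '14:25', '14:35', '15:20', '15:30', '16:15', '16:25', '17:10', '18:30',
         '19:15', '19:25', '20:10', '20:20', '21:05', '22:00', '23:00', '24:00']

counts = [0] * len(edges)
with open('zjgsu_time_username.txt') as f:
    for line in f:
        stamp = line.split('\t')[0]         # 'YYYY-MM-DD HH:MM:SS'
        hhmm = stamp.split(' ')[1][:5]      # keep only 'HH:MM'
        counts[bisect_left(edges, hhmm)] += 1

print counts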
Now that we have the numbers, we can visualize them. I used Echarts: it is easy to pick up (a quick read of the official API docs is enough), the charts are interactive right in the browser, and both the interactivity and the visual effect are great. The final visualized result is shown below.
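For reference, here is a minimal sketch of the kind of bar-chart option that could drive such a chart, written as a Python dict dumped to JSON (the time_data labels and the counts are the values produced above, truncated here; the exact option fields can be adjusted to taste):
# -*- coding: utf-8 -*-
import json

time_data = ['00:00~01:00', '01:00~02:00', '02:00~03:00']   # ... all 32 labels as defined above
counts = [12, 5, 3]                                          # ... the 32 values returned by counter_zju()

option = {
    'title': {'text': 'Posts per time slot'},
    'tooltip': {},
    'xAxis': {'type': 'category', 'data': time_data},
    'yAxis': {'type': 'value'},
    'series': [{'name': 'posts', 'type': 'bar', 'data': counts}],
}

print json.dumps(option, ensure_ascii=False)
The printed JSON can then be pasted into the page's myChart.setOption(...) call.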
IV. Conclusion
The two schools show essentially the same activity pattern on Baidu Tieba: activity during breaks is consistently lower than during class, although there are of course kids who bury their heads in Tieba during lectures. zju is clearly more active than zjgsu, though its larger user base is surely part of the reason. I had also wanted to scrape the content of every reply and run word-frequency statistics to see what we university students actually talk about, but time has been tight lately and I have not studied word-frequency algorithms enough, so I am setting that aside for now. If you have ideas, give it a try.
V. Reflections
This is my first blog post on CSDN, and I am honestly quite nervous. I used to only read other people's articles, which helped me a lot, so I wanted to try sharing a small project of my own. Please don't laugh; advice, corrections and suggestions are all welcome, and so is encouragement, of course.
Er, time to go do my homework.