A Hands-on Python Crawler with Simple Visualization (ECharts): Hourly Posting Patterns on the zjgsu and zju Baidu Tieba Forums


Let me tidy this up and tell the story from the beginning! It started as a random idea between classes, and the more I thought about it the more interesting it seemed: how do the posters at a "985" university differ from those at our school? (Due to machine and time constraints, the data covers every reply in every post on the first 50 pages of each forum. Sub-replies are excluded, and per the requirements only the posting time and the poster's nickname were collected.)

--- jackshenonly's first blog post

I. Prerequisites

1. A working knowledge of Python and its general syntax
2. Familiarity with the basic usage of the urllib2 module
3. An understanding of regular expressions
4. For the front end, knowing how to draw charts with ECharts
5. As for the analysis: that's in the eye of the beholder

After some studying and browsing of references, it was about time to get to work!

II. Requirements Analysis

To understand the daily posting patterns of users in these two forums (zju and zjgsu), we naturally need a dataset of those users' posting times.
1. Get the data: So where does the data come from? Baidu certainly isn't going to hand it over, so you're on your own here. No computer science student would be silly enough to page through the forum and record everything by hand; a web crawler handles this tedious, mechanical work nicely. You just have to make this capable little "spider" obey your commands.
2. Simple data processing: store everything in a MySQL database and run some basic statistics according to the actual requirements.
3. Visualization: nobody enjoys staring at a pile of dry numbers, so a clear front-end display is necessary. With ECharts, data visualization is no longer a hard problem.

III. Implementation

Implementation analysis:
1. Scouting the terrain: I wasn't especially familiar with Tieba, so I browsed around the zjgsu and zju forums first. Each listing page shows 50 posts; each post has an ID, and opening it reveals the timestamp and poster nickname of every reply. The URL of each listing page varies in a regular pattern:
url = 'http://tieba.baidu.com/f?kw=浙江大学&ie=utf-8&pn=' #(page_id-1)*50
Each post's URL follows a uniform pattern as well:
url = 'http://tieba.baidu.com/p/' + topic_id
Regular expressions then extract the parts we need:
Topic ID:
'<a href="/p/(\d.*?)" title="'
Posting time:
'<span.*?class="j_reply_data">(.*?)</span>'

Poster:
'<li class="d_name".*?>.*?<a.*?class="p_author.*?>(.*?)</a>.*?</li>'
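Before unleashing the crawler, it is worth sanity-checking these patterns. The minimal sketch below runs them against a hand-written fragment; the HTML sample is made up for illustration and only mimics the relevant bits of Tieba's markup:

# -*- coding: utf-8 -*-
import re

# A fabricated fragment imitating the pieces of markup the patterns target.
sample = ('<a href="/p/3858965549" title="some post"></a>'
          '<li class="d_name" data-field="{}"><a class="p_author_name">someone</a></li>'
          '<span class="j_reply_data">2015-06-01 14:30</span>')

print re.findall('<a href="/p/(\d.*?)" title="', sample, re.S)    # ['3858965549']
print re.findall('<span.*?class="j_reply_data">(.*?)</span>', sample, re.S)    # ['2015-06-01 14:30']
print re.findall('<li class="d_name".*?>.*?<a.*?class="p_author.*?>(.*?)</a>.*?</li>', sample, re.S)    # ['someone']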
2. Defining the objects: time to limber up and start coding, in the familiar editor Sublime. I create one class for a listing page and one class for an individual post. Without further ado, the code:
# -*- coding: utf-8 -*-
import re
import time
import urllib2


class Topic_Page:
    """One listing page of the forum; collects the topic IDs it contains."""
    url = 'http://tieba.baidu.com/f?kw=浙江工商大学&ie=utf-8&pn='  # pn = (page_id-1)*50

    def __init__(self, page_id):
        self.url += str((page_id - 1) * 50)
        self.page_id = page_id
        self.topic_ids = []

    def MySpider(self):
        print "Fetching page " + str(self.page_id) + "..."
        mypage = urllib2.urlopen(self.url).read().decode("utf8")
        print "Fetched page " + str(self.page_id) + " successfully!"
        return mypage

    def GetTopicId(self):
        myMatch = re.findall('<a href="/p/(\d.*?)" title="', self.MySpider(), re.S)
        print "Matched the topic IDs on page " + str(self.page_id) + "!"
        for topicid in myMatch:
            self.topic_ids.append(topicid)


class Details_Page:
    """One post; writes its (reply time, author) pairs to a text file."""
    url = 'http://tieba.baidu.com/p/'

    def __init__(self, topic_id):
        self.url += str(topic_id)

    def MySpider(self):
        mypage = urllib2.urlopen(self.url).read().decode("utf8")
        return mypage.replace('\n', '')

    def GetDetails_PutIntoFile(self):
        # Fetch the page once and run both patterns over the same HTML.
        mypage = self.MySpider()
        times = re.findall('<span.*?class="j_reply_data">(.*?)</span>', mypage, re.S)
        authors = re.findall('<li class="d_name".*?>.*?<a.*?class="p_author.*?>(.*?)</a>.*?</li>', mypage, re.S)
        # Swap the filename (and the kw parameter above) when crawling the zju forum.
        f = open("zjgsu_time_username.txt", "a")
        for reply in range(len(times)):
            # Encode the whole line at once so Chinese nicknames don't trip
            # Python 2's implicit ASCII coercion.
            f.write((times[reply] + ":00" + "\t" + authors[reply] + "\n").encode('utf8'))
        f.close()

    def Page_Counter(self):
        # The post's page count sits in a <span class="red"> element.
        myMatch = re.search(r'<span class="red">(\d*?)</span>', self.MySpider(), re.S)
        return myMatch.groups()[0]
3. Next come the program's entry point and its control variables. The Break_* variables are there for when the network is unstable: if a fetch fails and the program dies, the on-screen progress messages tell you where to resume, and you set these variables accordingly before restarting. Admittedly, try/except would be the better approach (a retry sketch follows the code below). The results are written to a file to await processing.
#---------- program entry ----------
begin_page = 1
end_page = 50

# Resume point after a crash: page 18, post 43 (reset to 0 for a fresh run).
Break_Main_Page = 18
Break_tie = 43
Break_Next_Page = 0

beginTime = time.strftime("%Y-%m-%d %H:%M:%S") + "\t" + "data collection started\n"

for MainPage in range(begin_page, end_page + 1):
    if MainPage < Break_Main_Page:      # skip pages crawled before the break
        continue
    a = Topic_Page(MainPage)
    a.GetTopicId()
    for index, topic_id in enumerate(a.topic_ids):
        if index + 1 < Break_tie:       # skip posts crawled before the break
            continue
        print "Fetching page " + str(MainPage) + ", post " + str(index + 1) + " .../p/" + str(topic_id)
        b = Details_Page(topic_id)
        temp_url = b.url
        for NextPage in range(int(b.Page_Counter())):
            if NextPage + 1 < Break_Next_Page:
                continue
            b.url = temp_url + "?pn=" + str(NextPage + 1)
            b.GetDetails_PutIntoFile()
        Break_Next_Page = 0
    Break_tie = 0
Break_Main_Page = 0

endTime = time.strftime("%Y-%m-%d %H:%M:%S") + "\t" + "data collection finished\n"
f = open('zjgsu_time_username.txt', "a")
f.write(beginTime + endTime)
f.close()
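As said, try/except is a cleaner way to survive a flaky network than the Break_* bookkeeping. Here is a minimal sketch in that spirit; the function name, retry count, and delay are my own choices, not part of the original script:

def fetch_with_retry(url, retries=3, delay=5):
    # Try the fetch a few times, sleeping between attempts, before giving up.
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read().decode("utf8")
        except IOError as e:  # urllib2.URLError is a subclass of IOError
            print "Fetch failed (%s), attempt %d of %d" % (e, attempt + 1, retries)
            time.sleep(delay)
    raise IOError("giving up on " + url)

Each MySpider method could then call fetch_with_retry(self.url) instead of calling urllib2.urlopen directly.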
4. Biu~biu~~~ hit Ctrl+B with a flourish, and the program is off and running.


5. Importing the data into MySQL; anyone who knows it will find this easy:
LOAD DATA INFILE 'filepath/filename.txt' INTO TABLE table_name (datetime, author);
What we actually need is only the time of day, but what we collected also includes the date. A small fix:
ALTER TABLE table_name ADD COLUMN time TIME AFTER author;
UPDATE table_name SET time = datetime;
When the DATETIME value is assigned to the TIME column, the surplus date part is discarded automatically.
6. Data visualization:
Working with the data inside the database is very convenient. Next comes the per-slot tally: I divided the day into 32 slots covering free time, class periods, and the breaks between classes:
time_data = ['00:00~01:00', '01:00~02:00', '02:00~03:00', '03:00~04:00',
             '04:00~05:00', '05:00~06:00', '06:00~07:00', '07:00~08:05',
             '08:05~09:35', '09:35~09:50', '09:50~10:35', '10:35~10:40',
             '10:40~11:25', '11:25~11:30', '11:30~12:15', '12:15~13:40',
             '13:40~14:25', '14:25~14:35', '14:35~15:20', '15:20~15:30',
             '15:30~16:15', '16:15~16:25', '16:25~17:10', '17:10~18:30',
             '18:30~19:15', '19:15~19:25', '19:25~20:10', '20:10~20:20',
             '20:20~21:05', '21:05~22:00', '22:00~23:00', '23:00~24:00'];
Let's write a stored procedure to do the counting; call it counter(). The quick-and-dirty code:
DROP PROCEDURE IF EXISTS `counter_zju`;

CREATE DEFINER = `root`@`localhost` PROCEDURE `counter_zju`()
BEGIN
	#Routine body goes here...
SELECT count(*) from zju where                        time  <= '01:00:00' INTO @count1;
SELECT count(*) from zju where time  > '01:00:00' AND time  <= '02:00:00' INTO @count2;
SELECT count(*) from zju where time  > '02:00:00' AND time  <= '03:00:00' INTO @count3;
SELECT count(*) from zju where time  > '03:00:00' AND time  <= '04:00:00' INTO @count4;
SELECT count(*) from zju where time  > '04:00:00' AND time  <= '05:00:00' INTO @count5;
SELECT count(*) from zju where time  > '05:00:00' AND time  <= '06:00:00' INTO @count6;
SELECT count(*) from zju where time  > '06:00:00' AND time  <= '07:00:00' INTO @count7;
SELECT count(*) from zju where time  > '07:00:00' AND time  <= '08:05:00' INTO @count8;
SELECT count(*) from zju where time  > '08:05:00' AND time  <= '09:35:00' INTO @count9;
SELECT count(*) from zju where time  > '09:35:00' AND time  <= '09:50:00' INTO @count10;
SELECT count(*) from zju where time  > '09:50:00' AND time  <= '10:35:00' INTO @count11;
SELECT count(*) from zju where time  > '10:35:00' AND time  <= '10:40:00' INTO @count12;
SELECT count(*) from zju where time  > '10:40:00' AND time  <= '11:25:00' INTO @count13;
SELECT count(*) from zju where time  > '11:25:00' AND time  <= '11:30:00' INTO @count14;
SELECT count(*) from zju where time  > '11:30:00' AND time  <= '12:15:00' INTO @count15;
# afternoon
SELECT count(*) from zju where time  > '12:15:00' AND time  <= '13:40:00' INTO @count16;
SELECT count(*) from zju where time  > '13:40:00' AND time  <= '14:25:00' INTO @count17;
SELECT count(*) from zju where time  > '14:25:00' AND time  <= '14:35:00' INTO @count18;
SELECT count(*) from zju where time  > '14:35:00' AND time  <= '15:20:00' INTO @count19;
SELECT count(*) from zju where time  > '15:20:00' AND time  <= '15:30:00' INTO @count20;
SELECT count(*) from zju where time  > '15:30:00' AND time  <= '16:15:00' INTO @count21;
SELECT count(*) from zju where time  > '16:15:00' AND time  <= '16:25:00' INTO @count22;
SELECT count(*) from zju where time  > '16:25:00' AND time  <= '17:10:00' INTO @count23;
SELECT count(*) from zju where time  > '17:10:00' AND time  <= '18:30:00' INTO @count24;
SELECT count(*) from zju where time  > '18:30:00' AND time  <= '19:15:00' INTO @count25;
SELECT count(*) from zju where time  > '19:15:00' AND time  <= '19:25:00' INTO @count26;
SELECT count(*) from zju where time  > '19:25:00' AND time  <= '20:10:00' INTO @count27;
SELECT count(*) from zju where time  > '20:10:00' AND time  <= '20:20:00' INTO @count28;
SELECT count(*) from zju where time  > '20:20:00' AND time  <= '21:05:00' INTO @count29;
SELECT count(*) from zju where time  > '21:05:00' AND time  <= '22:00:00' INTO @count30;
SELECT count(*) from zju where time  > '22:00:00' AND time  <= '23:00:00' INTO @count31;
SELECT count(*) from zju where time  > '23:00:00'                         INTO @count32;

	SELECT @count1,  @count2,  @count3,  @count4,  @count5,  @count6,  @count7,  @count8,
	       @count9,  @count10, @count11, @count12, @count13, @count14, @count15, @count16,
	       @count17, @count18, @count19, @count20, @count21, @count22, @count23, @count24,
	       @count25, @count26, @count27, @count28, @count29, @count30, @count31, @count32;
END;
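Incidentally, if you would rather do this tally in Python than in SQL, the 32 WHERE clauses collapse into a single pass with bisect over the slot boundaries. A sketch under the same slot definitions (the boundary list is derived from time_data above; the function is my own, not part of the original pipeline):

import bisect

# Upper bounds of the 32 slots as 'HH:MM:SS' strings; lexicographic order
# matches chronological order for this fixed-width format.
boundaries = ['01:00:00', '02:00:00', '03:00:00', '04:00:00', '05:00:00', '06:00:00',
              '07:00:00', '08:05:00', '09:35:00', '09:50:00', '10:35:00', '10:40:00',
              '11:25:00', '11:30:00', '12:15:00', '13:40:00', '14:25:00', '14:35:00',
              '15:20:00', '15:30:00', '16:15:00', '16:25:00', '17:10:00', '18:30:00',
              '19:15:00', '19:25:00', '20:10:00', '20:20:00', '21:05:00', '22:00:00',
              '23:00:00', '24:00:00']

def count_slots(times):
    # Tally a list of 'HH:MM:SS' strings into the 32 slots in one pass.
    counts = [0] * 32
    for t in times:
        # bisect_left returns the first slot whose upper bound is >= t,
        # reproducing the stored procedure's "<=" bucketing.
        counts[bisect.bisect_left(boundaries, t)] += 1
    return counts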



Now that we have the numbers, we can visualize them. I used ECharts, which is easy and pleasant to work with: reading through the official API docs is enough to get going, the charts stay interactive in the browser, and both the interactivity and the visuals are excellent. The final result is shown below:

[Figure: the final ECharts visualization of per-slot post counts]
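How the counts travel from MySQL to the page is left open above; one simple route is to call the stored procedure from Python and dump the row as JSON for the chart to consume. A sketch assuming the MySQLdb driver, with placeholder connection parameters and file name:

# -*- coding: utf-8 -*-
import json
import MySQLdb

# Connection parameters are placeholders -- substitute your own.
conn = MySQLdb.connect(host='localhost', user='root', passwd='secret',
                       db='tieba', charset='utf8')
cur = conn.cursor()
cur.callproc('counter_zju')                 # runs the stored procedure above
counts = [int(c) for c in cur.fetchone()]   # the 32 per-slot tallies
cur.close()
conn.close()

# The ECharts page can load this file and plug it into series[0].data,
# with time_data supplying the x-axis labels.
with open('zju_counts.json', 'w') as f:
    json.dump(counts, f)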

IV. Conclusions

The two universities show essentially the same activity pattern on Baidu Tieba: activity during breaks is consistently lower than during class (there are, of course, kids who bury themselves in Tieba during lectures). zju is clearly more active than zjgsu, though its larger user base is surely part of the reason. I had also wanted to scrape the text of every reply and run a word-frequency count, to see what we students actually chat about; but time has been tight lately and I haven't studied word-frequency algorithms enough, so I've set that aside for now. Anyone with ideas should give it a try.

V. Reflections

This is my first blog post on CSDN, and I'm honestly quite nervous. I used to only read other people's articles, which helped me a great deal; now I want to try sharing the little things I build myself. Please be gentle; advice, corrections, and suggestions are all welcome, and of course so is encouragement.
Er, time to go do homework...
 