Practical Python crawling: analyzing the hour-by-hour posting patterns of the zjgsu and zju Baidu Tieba forums over a day
Let me tidy things up and start from the beginning! This began as a random idea between classes, and the more I thought about it the more fun it seemed: how do the posters at a "985" school differ from those at my own? (Limited by machine specs and time, the data set is every reply in every floor of all threads on the first 50 pages of each forum. Sub-replies inside floors are excluded, and per the requirements only the post time and the poster's nickname were collected.)
--- jackshenonly's first blog post
I. Prerequisites
1. A working knowledge of Python and its basic syntax
2. Basic usage of the urllib2 module
3. Familiarity with regular expressions
4. For the front-end display, some knowledge of drawing charts with Echarts
5. As for the analysis itself: to each their own
A bit of studying, a bit of reference reading, and it was pretty much time to get to work~
II. Requirements Analysis
To understand the daily posting patterns of the Tieba users of these two schools (zju, zjgsu), we naturally need a data set of those users' posting times.
1. Getting the data: where does it come from? Baidu certainly isn't going to hand it over, so we're on our own. A CS student obviously isn't going to page through the forum and record posts by hand; web crawler ("spider") techniques handle this kind of dull, mechanical work nicely. You just have to keep the little spider under control.
2. Simple data processing: load the data into a MySQL database and run some simple statistics according to the actual needs.
3. Visualization: nobody wants to stare at a pile of dry numbers, so a clear front-end display is still necessary. With Echarts, data visualization is no longer a hard problem.
III. Implementation
Implementation analysis:
1. Scouting the terrain: I wasn't very familiar with Tieba, so I browsed around a bit and went into the zjgsu and zju forums. Each listing page shows 50 threads, each thread has an ID, and opening a thread gives the time and the poster's nickname for every reply. The URL of each listing page changes in a regular pattern:
url = 'http://tieba.baidu.com/f?kw=浙江大学&ie=utf-8&pn=' #(page_id-1)*50
Each individual thread's URL is also uniform:
url = 'http://tieba.baidu.com/p/' + topic_id
Then use regular expressions to pull out the parts we need:
Thread ID:
'<a href="/p/(\d.*?)" title="'
Post time:
'<span.*?class="j_reply_data">(.*?)</span>'
Poster:
'<li class="d_name".*?>.*?<a.*?class="p_author.*?>(.*?)</a>.*?</li>'
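Just to illustrate the URL arithmetic and the thread-ID regex before the full classes below, here is a minimal standalone sketch (Python 2, same environment as the rest of the code; the regex is the one listed above, and the exact output depends on Tieba's live markup):
# -*- coding: utf-8 -*-
import re
import urllib
import urllib2

page_id = 2                                 # example: second listing page
kw = urllib.quote('浙江大学')                # percent-encode the forum name for the query string
url = 'http://tieba.baidu.com/f?kw=' + kw + '&ie=utf-8&pn=' + str((page_id - 1) * 50)
html = urllib2.urlopen(url).read().decode('utf8')
topic_ids = re.findall(r'<a href="/p/(\d.*?)" title="', html, re.S)
print topic_ids[:5]                         # first few thread IDs on page 2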
2. Lining up the targets: time to warm up and start coding. I opened my familiar editor, Sublime.
First, create one class for a listing page and one class for a thread. Without further ado, here is the code.
# -*- coding: utf-8 -*-
import re
import time
import urllib2

class Topic_Page:
    url = 'http://tieba.baidu.com/f?kw=浙江工商大学&ie=utf-8&pn='  # pn = (page_id-1)*50

    def __init__(self, page_id):
        self.url += str((page_id - 1) * 50)
        self.page_id = page_id
        self.topic_ids = []

    def MySpider(self):
        print "Fetching page " + str(self.page_id) + "..."
        mypage = urllib2.urlopen(self.url).read().decode("utf8")
        print "Fetched page " + str(self.page_id) + " successfully!"
        return mypage

    def GetTopicId(self):
        myMatch = re.findall('<a href="/p/(\d.*?)" title="', self.MySpider(), re.S)
        print "Matched the topic IDs on page " + str(self.page_id) + "!"
        for topicid in myMatch:
            self.topic_ids.append(topicid)
class Details_Page:
    url = 'http://tieba.baidu.com/p/'

    def __init__(self, topic_id):
        self.url += str(topic_id)

    def MySpider(self):
        mypage = urllib2.urlopen(self.url).read().decode("utf8")
        return mypage.replace('\n', '')

    def GetDetails_PutIntoFile(self):
        mypage = self.MySpider()   # fetch once, match twice
        myMatch = re.findall('<span.*?class="j_reply_data">(.*?)</span>', mypage, re.S)
        myMatch2 = re.findall('<li class="d_name".*?>.*?<a.*?class="p_author.*?>(.*?)</a>.*?</li>', mypage, re.S)
        # output file; should match the forum set in Topic_Page.url (zjgsu here, zju for the other run)
        f = open("zjgsu_time_username.txt", "a")
        for reply in range(len(myMatch)):
            # encode the whole line at once to avoid mixing unicode and utf-8 byte strings
            f.write((myMatch[reply] + ":00" + "\t" + myMatch2[reply] + "\n").encode('utf8'))
        f.close()

    def Page_Counter(self):
        myMatch = re.search(r'<span class="red">(\d*?)</span>', self.MySpider(), re.S)
        return myMatch.groups()[0]
3. Next comes the program entry point and the control variables. The Break_* variables are there for when the network gets flaky and a page fetch fails: the program stops, and using the information printed on screen you can restart it and pick up where it left off. Wrapping the fetches in try/except would probably be cleaner. The results we need are written to a file for later processing.
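For what it's worth, here is a minimal sketch of what that try/except approach could look like: a small retry wrapper around urllib2.urlopen (fetch_with_retry is a hypothetical helper, not part of the code below; the MySpider methods could call it instead of urlopen directly):
import time
import urllib2

def fetch_with_retry(url, retries=3, delay=5):
    # Hypothetical helper: try a few times before giving up, instead of
    # stopping the whole crawl and restarting it by hand via the Break_* variables.
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url, timeout=30).read().decode("utf8")
        except (urllib2.URLError, IOError) as e:
            print "Fetch failed (%s), attempt %d/%d" % (e, attempt + 1, retries)
            time.sleep(delay)
    raise IOError("Giving up on " + url)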
#----------program entry------------
begin_page = 1
end_page = 50
Break_Main_Page = 18
Break_tie = 43
Break_Next_Page = 0
beginTime = time.strftime("%Y-%m-%d %H:%M:%S") + "\t" + "data collection start time\n"
for MainPage in range(begin_page, end_page + 1):
    if MainPage < Break_Main_Page:      # skip listing pages already done before the last break
        continue
    a = Topic_Page(MainPage)
    a.GetTopicId()
    for index, topic_id in enumerate(a.topic_ids):
        if index + 1 < Break_tie:       # skip threads already done before the last break
            continue
        print "Fetching page " + str(MainPage) + ", thread " + str(index + 1) + " .../p/" + str(topic_id)
        b = Details_Page(topic_id)
        temp_url = b.url
        for NextPage in range(int(b.Page_Counter())):
            if NextPage + 1 < Break_Next_Page:
                continue
            b.url = temp_url + "?pn=" + str(NextPage + 1)
            b.GetDetails_PutIntoFile()
        Break_Next_Page = 0
    Break_tie = 0
Break_Main_Page = 0
endTime = time.strftime("%Y-%m-%d %H:%M:%S") + "\t" + "data collection end time\n"
f = open('zjgsu_time_username.txt', "a")
f.write(beginTime + endTime)
f.close()
4. Biu~biu~~, I hit Ctrl+B with a flourish and the program was off and running.
5. Importing the data into MySQL is, as anyone who has used it knows, easy enough:
load data infile "filepath/filename.txt" into table table_name (datetime, author);
What we actually need is only the time of day, but the scraped value also includes the date, so a small adjustment is needed:
alter table table_name add column time time after author;
update table_name set time = datetime;
When the datetime value is assigned to the TIME column, the date part is dropped automatically.
6. Data visualization:
Working with the data inside the database is very convenient. Next comes the per-interval counting. I split the day into after-class, in-class and between-class slots, 32 intervals in all:
time_data = ['00:00~01:00','01:00~02:00','02:00~03:00','03:00~04:00','04:00~05:00','05:00~06:00','06:00~07:00','07:00~08:05','08:05~09:35','09:35~09:50','09:50~10:35','10:35~10:40','10:40~11:25','11:25~11:30','11:30~12:15','12:15~13:40','13:40~14:25','14:25~14:35','14:35~15:20','15:20~15:30','15:30~16:15','16:15~16:25','16:25~17:10','17:10~18:30','18:30~19:15','19:15~19:25','19:25~20:10','20:10~20:20','20:20~21:05','21:05~22:00','22:00~23:00','23:00~24:00'];
Let's write a stored procedure to do the counting; I called it counter_zju(). The quick-and-dirty code is as follows:
DROP PROCEDURE IF EXISTS `counter_zju`;
CREATE DEFINER = `root`@`localhost` PROCEDURE `counter_zju`()
BEGIN
#Routine body goes here...
SELECT count(*) from zju where time <= '01:00:00' INTO @count1;
SELECT count(*) from zju where time > '01:00:00' AND time <= '02:00:00' INTO @count2;
SELECT count(*) from zju where time > '02:00:00' AND time <= '03:00:00' INTO @count3;
SELECT count(*) from zju where time > '03:00:00' AND time <= '04:00:00' INTO @count4;
SELECT count(*) from zju where time > '04:00:00' AND time <= '05:00:00' INTO @count5;
SELECT count(*) from zju where time > '05:00:00' AND time <= '06:00:00' INTO @count6;
SELECT count(*) from zju where time > '06:00:00' AND time <= '07:00:00' INTO @count7;
SELECT count(*) from zju where time > '07:00:00' AND time <= '08:05:00' INTO @count8;
SELECT count(*) from zju where time > '08:05:00' AND time <= '09:35:00' INTO @count9;
SELECT count(*) from zju where time > '09:35:00' AND time <= '09:50:00' INTO @count10;
SELECT count(*) from zju where time > '09:50:00' AND time <= '10:35:00' INTO @count11;
SELECT count(*) from zju where time > '10:35:00' AND time <= '10:40:00' INTO @count12;
SELECT count(*) from zju where time > '10:40:00' AND time <= '11:25:00' INTO @count13;
SELECT count(*) from zju where time > '11:25:00' AND time <= '11:30:00' INTO @count14;
SELECT count(*) from zju where time > '11:30:00' AND time <= '12:15:00' INTO @count15;
#afternoon
SELECT count(*) from zju where time > '12:15:00' AND time <= '13:40:00' INTO @count16;
SELECT count(*) from zju where time > '13:40:00' AND time <= '14:25:00' INTO @count17;
SELECT count(*) from zju where time > '14:25:00' AND time <= '14:35:00' INTO @count18;
SELECT count(*) from zju where time > '14:35:00' AND time <= '15:20:00' INTO @count19;
SELECT count(*) from zju where time > '15:20:00' AND time <= '15:30:00' INTO @count20;
SELECT count(*) from zju where time > '15:30:00' AND time <= '16:15:00' INTO @count21;
SELECT count(*) from zju where time > '16:15:00' AND time <= '16:25:00' INTO @count22;
SELECT count(*) from zju where time > '16:25:00' AND time <= '17:10:00' INTO @count23;
SELECT count(*) from zju where time > '17:10:00' AND time <= '18:30:00' INTO @count24;
SELECT count(*) from zju where time > '18:30:00' AND time <= '19:15:00' INTO @count25;
SELECT count(*) from zju where time > '19:15:00' AND time <= '19:25:00' INTO @count26;
SELECT count(*) from zju where time > '19:25:00' AND time <= '20:10:00' INTO @count27;
SELECT count(*) from zju where time > '20:10:00' AND time <= '20:20:00' INTO @count28;
SELECT count(*) from zju where time > '20:20:00' AND time <= '21:05:00' INTO @count29;
SELECT count(*) from zju where time > '21:05:00' AND time <= '22:00:00' INTO @count30;
SELECT count(*) from zju where time > '22:00:00' AND time <= '23:00:00' INTO @count31;
SELECT count(*) from zju where time > '23:00:00' INTO @count32;
select @count1,@count2,@count3,@count4,@count5,@count6,@count7,@count8,@count9,@count10,@count11,@count12,@count13,@count14,@count15,@count16,@count17,@count18,@count19,@count20,@count21,@count22,@count23,@count24,@count25,@count26,@count27,@count28,@count29,@count30,@count31,@count32;
END;
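If you would rather stay in Python, a rough sanity-check sketch could do the same binning directly on the crawler's output file (assuming the tab-separated "datetime<TAB>nickname" lines produced above, with the two "data collection start/end time" footer lines stripped first; the bucket edges mirror the stored procedure):
# -*- coding: utf-8 -*-
# Sketch: bin the scraped post times into the same 32 intervals as counter_zju().
from bisect import bisect_left

# Upper bound of each interval, matching the boundaries in the stored procedure.
edges = ['01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:05',
         '09:35', '09:50', '10:35', '10:40', '11:25', '11:30', '12:15', '13:40',
         '14:25', '14:35', '15:20', '15:30', '16:15', '16:25', '17:10', '18:30',
         '19:15', '19:25', '20:10', '20:20', '21:05', '22:00', '23:00', '24:00']

counts = [0] * len(edges)
with open('zjgsu_time_username.txt') as f:
    for line in f:
        stamp = line.split('\t')[0]         # 'YYYY-MM-DD HH:MM:SS'
        hhmm = stamp.split(' ')[1][:5]      # keep only 'HH:MM'
        counts[bisect_left(edges, hhmm)] += 1

print counts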
Now that we have the numbers, we can visualize them. I used Echarts: it is easy to pick up (a quick read of the official API docs is enough), the charts are interactive right in the browser, and both the interactivity and the visual effect are great. The final visualized result is shown below.
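For reference, here is a minimal sketch of the kind of bar-chart option that could drive such a chart, written as a Python dict dumped to JSON (the time_data labels and the counts are the values produced above, truncated here; the exact option fields can be adjusted to taste):
# -*- coding: utf-8 -*-
import json

time_data = ['00:00~01:00', '01:00~02:00', '02:00~03:00']   # ... all 32 labels as defined above
counts = [12, 5, 3]                                          # ... the 32 values returned by counter_zju()

option = {
    'title': {'text': 'Posts per time slot'},
    'tooltip': {},
    'xAxis': {'type': 'category', 'data': time_data},
    'yAxis': {'type': 'value'},
    'series': [{'name': 'posts', 'type': 'bar', 'data': counts}],
}

print json.dumps(option, ensure_ascii=False)
The printed JSON can then be pasted into the page's myChart.setOption(...) call.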
IV. Conclusion
The two schools show essentially the same activity pattern on Baidu Tieba: activity during breaks is consistently lower than during class, although there are of course kids who bury their heads in Tieba during lectures. zju is clearly more active than zjgsu, though its larger user base is surely part of the reason. I had also wanted to scrape the content of every reply and run word-frequency statistics to see what we university students actually talk about, but time has been tight lately and I have not studied word-frequency algorithms enough, so I am setting that aside for now. If you have ideas, give it a try.
V. Reflections
This is my first blog post on CSDN, and I am honestly quite nervous. I used to only read other people's articles, which helped me a lot, so I wanted to try sharing a small project of my own. Please don't laugh; advice, corrections and suggestions are all welcome, and so is encouragement, of course.
Er, time to go do my homework.