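python: selenium + matplotlib, finding the most popular NBA team on a sports forum

This is my first technical blog post, so I'm a little nervous...

I've been learning Python lately, including libraries such as selenium, numpy, pandas and matplotlib.

I love playing basketball, and I recently had the idea of scraping the news from a sports forum I visit a lot (hupu) to see which NBA team gets mentioned most often.


Example: scrape 1000 news articles from the forum, find and record every NBA team name that appears in the article bodies, and finally count how many times each team's name shows up.


Without further ado, here is the code:

1. Importing the libraries: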

#coding=utf-8

import time
from selenium import webdriver  
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
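1.1 Environment: Python 2.7.11.

1.2 The file uses utf-8 encoding because it contains Chinese characters.

1.3 The required imports: the time module prints timestamps;

   selenium webdriver scrapes the site content;

   numpy and pandas create the list and Series data we need;

   matplotlib.pyplot draws the analysis chart;

   the remaining three lines are explained later.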


2. Preparing the data:

###prepare###

##East##
Celtics = u'凯尔特人'
Nets = u'篮网'
Knicks = u'尼克斯'
Phil_76ers = u'76人'
Raptors = u'猛龙'
Bulls = u'公牛'
Cav = u'骑士'
Pistons = u'活塞'
Pacers = u'步行者'
Bucks = u'雄鹿'
Hawks = u'老鹰'
Bobcats = u'黄蜂'
Heat = u'热火'
Magic = u'魔术'
Wizards = u'奇才'


##West##
Mavericks = u'小牛'
Rockets = u'火箭'
Grizzlies = u'灰熊'
Pelican = u'鹈鹕'
Spurs = u'马刺'
Nuggets = u'掘金'
Timberwolves = u'森林狼'
Trial_Blazers = u'开拓者'
Thunder = u'雷霆'
Jazz = u'爵士'
Warriors = u'勇士'
Clippers = u'快船'
Lakers = u'湖人'
Suns = u'太阳'
Kings = u'国王'


##team_name_count_dict##
team_name_count_dict = {Celtics:0,
                        Nets:0,
                        Knicks:0,
                        Phil_76ers:0,
                        Raptors:0,
                        Bulls:0,
                        Cav:0,
                        Pistons:0,
                        Pacers:0,
                        Bucks:0,
                        Hawks:0,
                        Bobcats:0,
                        Heat:0,
                        Magic:0,
                        Wizards:0,
                        Mavericks:0,
                        Rockets:0,
                        Grizzlies:0,
                        Pelican:0,
                        Spurs:0,
                        Nuggets:0,
                        Timberwolves:0,
                        Trial_Blazers:0,
                        Thunder:0,
                        Jazz:0,
                        Warriors:0,
                        Clippers:0,
                        Lakers:0,
                        Suns:0,
                        Kings:0}
##for i in team_name_count_dict:
##     print i
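2.1 Prepare the name strings of the 30 current NBA teams.

2.2 Put those 30 teams into a dict, with every count initialized to 0.


3. Preparing the functions:

3.1 Look at the URL of a news article on the forum, e.g. https://voice.hupu.com/nba/2202618.html: the news number is 2202618.

So as a first step, I decided to scrape the number of the latest news article.

The code: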

def get_latest_news_number():
     driver.get('https://voice.hupu.com/nba')
     res = driver.find_element_by_class_name('list-hd')
     #print res.text
     latest_url = res.find_element_by_tag_name('a').get_attribute('href')
     latest_url1 = latest_url.split('/')
     latest_url2 = latest_url1[-1].split('.')
     return int(latest_url2[0])
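
The two split calls are the core of this function, and they can be tried without opening a browser. A minimal sketch, assuming article URLs always end in a number followed by .html as in the example above; news_number_from_url is a hypothetical helper name, not part of the original script:

```python
def news_number_from_url(url):
    """Return the numeric id from a URL such as .../nba/2202618.html."""
    # Last path segment, e.g. '2202618.html'
    filename = url.rstrip('/').split('/')[-1]
    # Drop the extension and convert what is left to an int
    return int(filename.split('.')[0])


print(news_number_from_url('https://voice.hupu.com/nba/2202618.html'))  # prints 2202618
```
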

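3.2 Next, open a news page and scrape the article body:

Open a news page by hand, right-click inside the article body and choose Inspect; the body turns out to have the class name artical-main-content,

so the code is: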

def get_page_content(i,url):
     driver.get(url)
     res = driver.find_element_by_class_name('artical-main-content')
     content = res.text
     contents.append(content)
     print str(i),' is done!'
ps: contents is a list defined beforehand; it stores the body text of every article.
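
Counting down from the latest number may land on a page that does not exist or has no article body, in which case the element lookup raises and the whole run stops. A minimal sketch of skipping such pages, assuming a hypothetical callable fetch_text(url) that wraps the driver.get call and element lookup above and raises on failure (with selenium, a failed lookup raises NoSuchElementException); collect_contents and fake_fetch are likewise names made up for illustration:

```python
def collect_contents(fetch_text, urls):
    """Fetch every URL, keeping the texts that load and noting the ones that fail."""
    contents = []
    failed = []
    for url in urls:
        try:
            contents.append(fetch_text(url))
        except Exception:
            # Assumption: any failure simply means "no article here", so skip it.
            failed.append(url)
    return contents, failed


# Stand-in for the selenium-backed fetcher so the sketch runs on its own.
def fake_fetch(url):
    if url.endswith('2.html'):
        raise LookupError('no article body on this page')
    return 'text of ' + url


texts, skipped = collect_contents(fake_fetch, ['1.html', '2.html', '3.html'])
print(len(texts), skipped)
```

Returning the failed URLs alongside the texts makes it easy to see afterwards how many of the requested pages actually contributed to the counts.
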


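3.3 Next, go through all the article bodies and look for each NBA team's name.

Example: check whether the name '勇士' (Warriors) appears in an article; if it appears twice, then team_name_count_dict[u'勇士'] += 2 in the dict defined earlier.

The code: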

def collect_team_name_number(contents):
     for item in contents:
          #print item
          for name in team_name_count_dict:
               num = item.count(name)
               team_name_count_dict[name] += num
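
To see what str.count does here, the same loop can be run on a couple of made-up sentences. A minimal sketch, written as a function that returns a fresh dict instead of updating the global one; that change, the name count_team_names and the sample texts are assumptions for illustration. Note that this is plain substring counting, so names that are also everyday words, such as '太阳' (sun), '火箭' (rocket) or '魔术' (magic), may be counted in sentences that have nothing to do with the team.

```python
def count_team_names(contents, team_names):
    """Count how many times each team name occurs across all texts."""
    counts = dict((name, 0) for name in team_names)
    for item in contents:
        for name in team_names:
            # str.count returns the number of non-overlapping occurrences
            counts[name] += item.count(name)
    return counts


# Made-up sample texts; the second one uses '太阳' both as "sun" and as the team
sample = [u'勇士击败骑士,勇士取得三连胜', u'今天太阳很好,太阳主场迎战湖人']
result = count_team_names(sample, [u'勇士', u'骑士', u'太阳', u'湖人'])
print(result[u'勇士'], result[u'骑士'], result[u'太阳'], result[u'湖人'])
```
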
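Now the news content has been scraped and counted, and we are ready to plot.


3.4 Based on how many times each team name appears, use matplotlib to draw a bar chart of the counts.

The code: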

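def draw_pictures(team_name_count_dict):
     dic = pd.Series(team_name_count_dict)
     sort_dic = dic.sort_values()
     # one bar per team, spaced 2 apart so the labels have room
     x_lim = np.arange(0,2*len(sort_dic),2)
     
     plt.figure()
     plt.bar(x_lim,sort_dic)
     for i in range(len(sort_dic)):
          # .iloc reads by position; the Series itself is indexed by team name
          plt.text(x_lim[i],sort_dic.iloc[i]+1,str(sort_dic.iloc[i]),ha='center',va='top')
     plt.xticks(x_lim,sort_dic.index)
     
     plt.show()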
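3.4.1 Use pandas to create a Series and sort it with .sort_values()

3.4.2 Use matplotlib to draw the bars (bar) and the value labels on them (text)

3.4.3 Draw the x tick names, i.e. the NBA team names, on the x-axis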
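
The bar positions, labels and heights depend only on the sorted counts, so that part can be tried without pandas or matplotlib. A minimal sketch using only the standard library; bar_layout and the sample counts are made up for illustration, and the three returned lists are meant to be passed to plt.bar, plt.xticks and plt.text in the same way as in draw_pictures:

```python
def bar_layout(counts, spacing=2):
    """Sort teams by count (ascending) and give each bar an x position."""
    # Ascending by count, the default order of Series.sort_values()
    pairs = sorted(counts.items(), key=lambda kv: kv[1])
    labels = [name for name, _ in pairs]
    heights = [value for _, value in pairs]
    # 0, 2, 4, ... one position per team, however many teams there are
    positions = [i * spacing for i in range(len(pairs))]
    return positions, labels, heights


# Made-up counts, only to show the shape of the result
positions, labels, heights = bar_layout({u'勇士': 12, u'湖人': 7, u'骑士': 9})
print(positions, heights)
```

Deriving the positions from the number of entries means the layout keeps working if the dict ever holds more or fewer than 30 teams.
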

ps: this is where the imports from the very beginning come in handy:

from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

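With these lines in place, Chinese text displays correctly on the x-axis; without them it comes out garbled.

3.5 Define our main function and do the initialization:

3.5.1: the main function: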

def main():
     number = 100
     print 'Now 100 news...'
     print 'Start time is:',time.ctime()

     latest_number = get_latest_news_number()
     print 'The latest NBA news number is:',latest_number
     
     for i in range(number):
          url = 'http://voice.hupu.com/nba/'+str(latest_number-i)+'.html'
          get_page_content(i,url)
          time.sleep(1)
          
     collect_team_name_number(contents)
     
     for key,value in team_name_count_dict.items():
               print key,':',value

     draw_pictures(team_name_count_dict)
     
     print 'End time is:',time.ctime()

3.5.2: running the code

if __name__ == '__main__':
     driver = webdriver.Chrome()
     contents = []
     main()
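
The loop in main builds one URL per article by counting down from the latest news number. A minimal sketch of that step as a standalone generator; news_urls is a hypothetical name, and the base URL is copied from the code above:

```python
def news_urls(latest_number, how_many, base='http://voice.hupu.com/nba/'):
    """Yield article URLs from the newest number downwards."""
    for i in range(how_many):
        yield base + str(latest_number - i) + '.html'


urls = list(news_urls(2202618, 3))
print(urls[0])   # http://voice.hupu.com/nba/2202618.html
print(urls[-1])  # http://voice.hupu.com/nba/2202616.html
```
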


4. The complete code:

There is surely a lot that still needs fixing; I'll revise it bit by bit later.

#coding=utf-8

import time
from selenium import webdriver
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False


###prepare###

##East##
Celtics = u'凯尔特人'
Nets = u'篮网'
Knicks = u'尼克斯'
Phil_76ers = u'76人'
Raptors = u'猛龙'
Bulls = u'公牛'
Cav = u'骑士'
Pistons = u'活塞'
Pacers = u'步行者'
Bucks = u'雄鹿'
Hawks = u'老鹰'
Bobcats = u'黄蜂'
Heat = u'热火'
Magic = u'魔术'
Wizards = u'奇才'


##West##
Mavericks = u'小牛'
Rockets = u'火箭'
Grizzlies = u'灰熊'
Pelican = u'鹈鹕'
Spurs = u'马刺'
Nuggets = u'掘金'
Timberwolves = u'森林狼'
Trial_Blazers = u'开拓者'
Thunder = u'雷霆'
Jazz = u'爵士'
Warriors = u'勇士'
Clippers = u'快船'
Lakers = u'湖人'
Suns = u'太阳'
Kings = u'国王'


##team_name_count_dict##
team_name_count_dict = {Celtics:0,
                        Nets:0,
                        Knicks:0,
                        Phil_76ers:0,
                        Raptors:0,
                        Bulls:0,
                        Cav:0,
                        Pistons:0,
                        Pacers:0,
                        Bucks:0,
                        Hawks:0,
                        Bobcats:0,
                        Heat:0,
                        Magic:0,
                        Wizards:0,
                        Mavericks:0,
                        Rockets:0,
                        Grizzlies:0,
                        Pelican:0,
                        Spurs:0,
                        Nuggets:0,
                        Timberwolves:0,
                        Trial_Blazers:0,
                        Thunder:0,
                        Jazz:0,
                        Warriors:0,
                        Clippers:0,
                        Lakers:0,
                        Suns:0,
                        Kings:0}
##for i in team_name_count_dict:
##     print i


def main():
     number = 100
     print 'Now 100 news...'
     print 'Start time is:',time.ctime()

     latest_number = get_latest_news_number()
     print 'The latest NBA news number is:',latest_number
     
     for i in range(number):
          url = 'http://voice.hupu.com/nba/'+str(latest_number-i)+'.html'
          get_page_content(i,url)
          time.sleep(1)
          
     collect_team_name_number(contents)
     
     for key,value in team_name_count_dict.items():
               print key,':',value

     draw_pictures(team_name_count_dict)
     
     print 'End time is:',time.ctime()

def get_latest_news_number():
     driver.get('https://voice.hupu.com/nba')
     res = driver.find_element_by_class_name('list-hd')
     #print res.text
     latest_url = res.find_element_by_tag_name('a').get_attribute('href')
     latest_url1 = latest_url.split('/')
     latest_url2 = latest_url1[-1].split('.')
     return int(latest_url2[0])
     


def get_page_content(i,url):
     driver.get(url)
     res = driver.find_element_by_class_name('artical-main-content')
     content = res.text
     contents.append(content)
     print str(i),' is done!'


def collect_team_name_number(contents):
     for item in contents:
          #print item
          for name in team_name_count_dict:
               num = item.count(name)
               team_name_count_dict[name] += num
     

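def draw_pictures(team_name_count_dict):
     dic = pd.Series(team_name_count_dict)
     sort_dic = dic.sort_values()
     # one bar per team, spaced 2 apart so the labels have room
     x_lim = np.arange(0,2*len(sort_dic),2)
     
     plt.figure()
     plt.bar(x_lim,sort_dic)
     for i in range(len(sort_dic)):
          # .iloc reads by position; the Series itself is indexed by team name
          plt.text(x_lim[i],sort_dic.iloc[i]+1,str(sort_dic.iloc[i]),ha='center',va='top')
     plt.xticks(x_lim,sort_dic.index)
     
     plt.show()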


if __name__ == '__main__':
     driver = webdriver.Chrome()
     contents = []
     main()


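5. Results:

I first tried scraping 100 news articles, and the final chart looked like this:

[Figure: bar chart of team name counts]

Still learning~ Let's keep each other going.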








   
