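My first technical blog post, so I'm a little nervous...
I've recently been learning Python, including libraries such as selenium, numpy, pandas, and matplotlib.
I love playing basketball, and I recently had the idea of scraping the news from a sports forum I visit often (hupu) to see which NBA team gets mentioned the most.
For example: scrape 1000 news articles from the forum, find and record every NBA team name that appears in the article bodies, and finally count how many times each team's name appears.
Without further ado, here is the code:
1. Import the libraries: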
#coding=utf-8
import time
from selenium import webdriver
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
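1.1 Environment: Python 2.7.11.
1.2 The file uses utf-8 encoding because it contains Chinese characters.
1.3 The required imports: the time module shows timestamps;
selenium webdriver scrapes the site content;
numpy and pandas create the list and Series the analysis needs;
matplotlib.pyplot draws the analysis chart;
the remaining three lines are explained later.
2. Prepare the data: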
###prepare###
##East##
Celtics = u'凯尔特人'
Nets = u'篮网'
Knicks = u'尼克斯'
Phil_76ers = u'76人'
Raptors = u'猛龙'
Bulls = u'公牛'
Cav = u'骑士'
Pistons = u'活塞'
Pacers = u'步行者'
Bucks = u'雄鹿'
Hawks = u'老鹰'
Bobcats = u'黄蜂'
Heat = u'热火'
Magic = u'魔术'
Wizards = u'奇才'
##West##
Mavericks = u'小牛'
Rockets = u'火箭'
Grizzlies = u'灰熊'
Pelican = u'鹈鹕'
Spurs = u'马刺'
Nuggets = u'掘金'
Timberwolves = u'森林狼'
Trial_Blazers = u'开拓者'
Thunder = u'雷霆'
Jazz = u'爵士'
Warriors = u'勇士'
Clippers = u'快船'
Lakers = u'湖人'
Suns = u'太阳'
Kings = u'国王'
##team_name_count_dict##
team_name_count_dict = {Celtics:0,
                        Nets:0,
                        Knicks:0,
                        Phil_76ers:0,
                        Raptors:0,
                        Bulls:0,
                        Cav:0,
                        Pistons:0,
                        Pacers:0,
                        Bucks:0,
                        Hawks:0,
                        Bobcats:0,
                        Heat:0,
                        Magic:0,
                        Wizards:0,
                        Mavericks:0,
                        Rockets:0,
                        Grizzlies:0,
                        Pelican:0,
                        Spurs:0,
                        Nuggets:0,
                        Timberwolves:0,
                        Trial_Blazers:0,
                        Thunder:0,
                        Jazz:0,
                        Warriors:0,
                        Clippers:0,
                        Lakers:0,
                        Suns:0,
                        Kings:0}
##for i in team_name_count_dict:
##    print i
2.1 Define a string for each of the 30 current NBA teams.
2.2 Put those 30 teams into a dictionary, with every count initialized to 0.
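The same dictionary can also be derived from a single list of names, which avoids typing every team twice. This is only a minimal sketch: TEAM_NAMES and make_count_dict are hypothetical names that do not appear in the script above, and the list simply repeats the 30 strings from step 2.1.

```python
# -*- coding: utf-8 -*-
# Hypothetical alternative to the hand-written dictionary: keep the names
# in one list and derive the counter dictionary from it.
TEAM_NAMES = [
    # East
    u'凯尔特人', u'篮网', u'尼克斯', u'76人', u'猛龙',
    u'公牛', u'骑士', u'活塞', u'步行者', u'雄鹿',
    u'老鹰', u'黄蜂', u'热火', u'魔术', u'奇才',
    # West
    u'小牛', u'火箭', u'灰熊', u'鹈鹕', u'马刺',
    u'掘金', u'森林狼', u'开拓者', u'雷霆', u'爵士',
    u'勇士', u'快船', u'湖人', u'太阳', u'国王',
]


def make_count_dict(names):
    """Return a dict that maps every name to an initial count of 0."""
    return dict.fromkeys(names, 0)


team_name_count_dict = make_count_dict(TEAM_NAMES)
```

dict.fromkeys is safe here because the shared default value 0 is immutable; each later += replaces the value for one key only.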
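3. Prepare the functions:
3.1 Look at a news URL on the forum, e.g. https://voice.hupu.com/nba/2202618.html: the news number is 2202618.
So as a first step, I decided to scrape the number of the latest news article.
The code: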
def get_latest_news_number():
    driver.get('https://voice.hupu.com/nba')
    # the first link inside the 'list-hd' element points to the latest article
    res = driver.find_element_by_class_name('list-hd')
    #print res.text
    latest_url = res.find_element_by_tag_name('a').get_attribute('href')
    # '.../nba/2202618.html' -> '2202618.html' -> '2202618'
    latest_url1 = latest_url.split('/')
    latest_url2 = latest_url1[-1].split('.')
    return int(latest_url2[0])
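The two split calls that pull the number out of the URL can be tried in isolation, without opening a browser. The sketch below does the same extraction with a regular expression; news_number_from_url is a hypothetical helper name, and it assumes the URL has the /<digits>.html shape shown in 3.1.

```python
import re


def news_number_from_url(url):
    """Extract the numeric news id from a URL like .../nba/2202618.html."""
    match = re.search(r'/(\d+)\.html', url)
    if match is None:
        # Fail loudly instead of returning a wrong number.
        raise ValueError('no news number found in %r' % (url,))
    return int(match.group(1))


latest = news_number_from_url('https://voice.hupu.com/nba/2202618.html')
```

Compared with chained splits, the pattern raises a clear error when the link is not an article URL, rather than failing inside int() with a less obvious message.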
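3.2 Next, open each news page and scrape the article body:
Opening a news page by hand, right-clicking the article body, and choosing Inspect shows that the body's class name is artical-main-content,
so the code is: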
def get_page_content(i,url):
    driver.get(url)
    res = driver.find_element_by_class_name('artical-main-content')
    content = res.text
    contents.append(content)
    print str(i),' is done!'
PS: contents is a list defined beforehand that stores the body text of every article.
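One thing to watch for: get_page_content assumes every URL leads to a page that contains the article element. If a page lacks it, selenium's find_element call raises NoSuchElementException and the whole run stops. A minimal sketch of a loop that skips such pages is shown below; collect_contents and fetch_text are hypothetical names, and fetch_text stands in for any function that returns the body text of one URL or raises on failure.

```python
def collect_contents(urls, fetch_text):
    """Fetch every URL with fetch_text and skip the ones that fail.

    fetch_text is a hypothetical callable: it takes a URL and returns the
    article body as text, or raises an exception if the page is unusable.
    Returns (contents, skipped_urls).
    """
    contents = []
    skipped = []
    for url in urls:
        try:
            contents.append(fetch_text(url))
        except Exception:
            # Deliberately broad in this sketch; real code would catch the
            # specific exception the scraper raises.
            skipped.append(url)
    return contents, skipped
```

Returning the skipped URLs keeps the run going while still making it visible how many articles were left out of the count.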
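3.3 Next, iterate over all the article bodies and find the NBA team names.
For example: check whether the name '勇士' (Warriors) appears in an article; if it appears twice, then team_name_count_dict[u'勇士'] += 2
The code: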
def collect_team_name_number(contents):
    for item in contents:
        #print item
        for name in team_name_count_dict:
            num = item.count(name)
            team_name_count_dict[name] += num
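To see the counting step on its own, here is a small self-contained sketch that runs the same nested loop over two made-up sentences. count_team_mentions and the sample texts are hypothetical and exist only for illustration.

```python
# -*- coding: utf-8 -*-
def count_team_mentions(contents, team_names):
    """Count how often each team name occurs across all texts."""
    counts = dict.fromkeys(team_names, 0)
    for text in contents:
        for name in team_names:
            # str.count returns the number of non-overlapping occurrences
            counts[name] += text.count(name)
    return counts


# Made-up sample texts, not real forum data.
sample = [u'勇士击败了骑士,勇士取得连胜。', u'湖人今日无比赛。']
result = count_team_mentions(sample, [u'勇士', u'骑士', u'湖人', u'火箭'])
# result[u'勇士'] is 2, result[u'骑士'] is 1,
# result[u'湖人'] is 1, result[u'火箭'] is 0
```

Because this is plain substring matching, names that are also ordinary words (for example 太阳 or 魔术) may be counted even when the text is not about the team, so the totals are best read as approximate.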
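Now the news content has been scraped and counted, so we can draw the chart.
3.4 Using the number of times each team name appears, draw a bar chart with matplotlib to display the results.
The code: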
def draw_pictures(team_name_count_dict):
    dic = pd.Series(team_name_count_dict)
    sort_dic = dic.sort_values()
    x_lim = np.arange(0,60,2)
    plt.figure()
    plt.bar(x_lim,sort_dic)
    for i in range(30):
        # .iloc picks the i-th value by position (the index holds team names)
        plt.text(x_lim[i],sort_dic.iloc[i]+1,sort_dic.iloc[i],ha='center',va='top')
    plt.xticks(x_lim,sort_dic.index)
    plt.show()
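3.4.1 Use pandas to create a Series and sort it with .sort_values()
3.4.2 Use matplotlib to draw the bar chart (bar) and the value label on each bar (text)
3.4.3 Draw the x-axis labels, i.e. the NBA team names, on the x axis
PS: this is where the imports from the beginning come in handy: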
from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
With these lines in place, Chinese text displays correctly on the x axis; otherwise it is garbled.
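draw_pictures hard-codes the number of bars (np.arange(0,60,2) and range(30)), so it only works for exactly 30 entries. As a rough sketch of one way to loosen that, the layout can be computed from the data itself. bar_layout and draw_bar_chart are hypothetical names, and this version uses plain Python sorting instead of a pandas Series.

```python
def bar_layout(counts, spacing=2):
    """Sort entries by count (ascending) and assign evenly spaced x positions.

    Returns (positions, labels, heights), all lists of the same length.
    """
    items = sorted(counts.items(), key=lambda kv: kv[1])
    labels = [name for name, _ in items]
    heights = [count for _, count in items]
    positions = [i * spacing for i in range(len(items))]
    return positions, labels, heights


def draw_bar_chart(counts):
    """Draw the sorted bar chart; needs matplotlib to be installed."""
    # Imported here so that bar_layout can be used without matplotlib.
    import matplotlib.pyplot as plt
    positions, labels, heights = bar_layout(counts)
    plt.figure()
    plt.bar(positions, heights)
    for x, h in zip(positions, heights):
        # write each count just above its bar
        plt.text(x, h, str(h), ha='center', va='bottom')
    plt.xticks(positions, labels)
    plt.show()
```

Calling draw_bar_chart(team_name_count_dict) after the font settings above would then work for any number of teams, not only 30.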
3.5 Define the main function and initialize:
3.5.1 The main function:
def main():
    number = 100
    print 'Now',number,'news...'
    print 'Start time is:',time.ctime()
    latest_number = get_latest_news_number()
    print 'The latest NBA news number is:',latest_number
    # walk backwards from the latest news number, one page per second
    for i in range(number):
        url = 'http://voice.hupu.com/nba/'+str(latest_number-i)+'.html'
        get_page_content(i,url)
        time.sleep(1)
    collect_team_name_number(contents)
    for key,value in team_name_count_dict.items():
        print key,':',value
    draw_pictures(team_name_count_dict)
    print 'End time is:',time.ctime()
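The URL construction inside the loop can be checked separately before starting a long scraping run. A minimal sketch follows; build_news_urls is a hypothetical helper, and it assumes, as main does, that consecutive news numbers below the latest one are valid article pages.

```python
def build_news_urls(latest_number, count, base='http://voice.hupu.com/nba/'):
    """Return the URLs of the `count` most recent news numbers, newest first."""
    return [base + str(latest_number - i) + '.html' for i in range(count)]


urls = build_news_urls(2202618, 3)
# urls[0] ends with '2202618.html', urls[2] ends with '2202616.html'
```

Printing the first few entries is a cheap way to confirm the pattern before the browser visits 100 or 1000 pages.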
3.5.2 Run the code:
if __name__ == '__main__':
    driver = webdriver.Chrome()
    contents = []
    try:
        main()
    finally:
        driver.quit()  # close the browser even if scraping fails
There is surely a lot that still needs improving; I'll revise it over time.
I first tried scraping 100 news articles, and the resulting chart looked like this:
Still learning; let's keep encouraging each other.