1. Import the required Python modules
from bs4 import BeautifulSoup
import os
import pandas as pd
2. Loop over the files in the target directory, open each one, and make the soup
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
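Note: open() uses the platform's default encoding, which is "gbk" on a Chinese-locale Windows system. When a file is not GBK-encoded, specify the encoding explicitly.
Specify the encoding as follows:
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding="utf-8") as file:
        soup = BeautifulSoup(file, "lxml")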
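To see why the encoding argument matters, here is a minimal sketch using only the standard library. The sample text and the file name are made up for illustration; the file is written as UTF-8 and then read back both as UTF-8 and as GBK.

```python
import os
import tempfile

# Hypothetical sample text containing a non-ASCII character.
text = "Café (1939)"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")

    # Write the file as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the matching encoding returns the original text.
    with open(path, encoding="utf-8") as f:
        as_utf8 = f.read()

    # Reading the same bytes as GBK yields different characters
    # (errors="replace" keeps this sketch from raising on invalid sequences).
    with open(path, encoding="gbk", errors="replace") as f:
        as_gbk = f.read()

print(as_utf8 == text)  # True
print(as_gbk == text)   # False
```

Depending on the bytes involved, a mismatched encoding either raises UnicodeDecodeError or, as here, quietly returns the wrong characters, which is the harder case to notice.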
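Note: a file can also be opened in binary mode ('rb'), in which case BeautifulSoup receives raw bytes and detects the encoding itself:
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), 'rb') as file:
        soup = BeautifulSoup(file, 'lxml')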
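The binary-mode approach can be tried on an in-memory byte string. This is only a sketch: the page content is invented, and html.parser is used instead of lxml so that the example depends on nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical UTF-8 encoded page, held as raw bytes (as 'rb' mode would give).
raw = '<html><head><meta charset="utf-8"><title>Café</title></head></html>'.encode("utf-8")

# BeautifulSoup decodes the bytes itself when it is given bytes, not str.
soup = BeautifulSoup(raw, "html.parser")

print(soup.original_encoding)  # the encoding BeautifulSoup settled on
print(soup.title.string)
```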
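3. Find content in the HTML file: title
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
    print(title)
    break  # stop after the first file while testing
Note: Python 2 uses the statement print title; Python 3 requires the function call print(title). Using the Python 2 form under Python 3 raises a SyntaxError.
Note: BeautifulSoup's find() returns the first matching tag within the searched region.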
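As a small illustration of the "first match" behaviour, the sketch below runs find() on an invented HTML snippet. It also shows that find() returns None when nothing matches, which is why a chained call such as soup.find(...).find(...) fails with AttributeError on a page that lacks the outer tag. html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <p> tags and no <table>.
html = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag.
first_p = soup.find("p")
print(first_p.contents[0])  # first

# With no match, find() returns None instead of raising.
missing = soup.find("table")
print(missing is None)  # True
```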
4. Find content in the HTML file: audience_score
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    audience_score = soup.find('div', class_='audience-score meter').find('span')
    print(audience_score)
    break
<span class="superPageFontColor" style="vertical-align:top">97%</span>
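Note: a variable name may contain underscores but not hyphens. audience_score is valid; audience-score is not, because Python reads the hyphen as a minus sign, so assigning to it raises a SyntaxError.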
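The naming rule can be checked directly with the standard library. This is a minimal sketch; the variable hyphen_is_rejected is introduced here only for illustration.

```python
# str.isidentifier() reports whether a string is a valid Python name.
print("audience_score".isidentifier())  # True
print("audience-score".isidentifier())  # False

# Compiling an assignment to a hyphenated name fails before anything runs,
# because the left-hand side parses as the subtraction "audience - score".
try:
    compile("audience-score = 97", "<example>", "exec")
except SyntaxError:
    hyphen_is_rejected = True
else:
    hyphen_is_rejected = False

print(hyphen_is_rejected)  # True
```

The hyphen in class_='audience-score meter' is unaffected, since there it is part of a string rather than a Python name.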
Refined version, which keeps only the number by slicing off the trailing %:
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
    print(audience_score)
    break
97
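The [:-1] slice removes the last character whatever it is. As a hedged alternative (not what the code above uses), rstrip("%") removes the sign only when it is present. The sample strings below are made up.

```python
raw_score = "97%"

# Slicing drops the final character unconditionally.
sliced = raw_score[:-1]
# rstrip("%") drops trailing percent signs only.
stripped = raw_score.rstrip("%")
print(int(sliced), int(stripped))  # 97 97

# On a value that has no percent sign, the two approaches differ.
no_sign = "97"
print(no_sign[:-1])         # 9
print(no_sign.rstrip("%"))  # 97
```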
5. Find content in the HTML file: num_audience_ratings
(1) When the structure is more complex, start from the enclosing class
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
    print(num_audience_ratings)
    break
<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
4.2/5
</div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
103,672</div>
</div>
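The div with this class contains two nested divs, and the rating count is in the second one. find_all() returns a list of matches, so the second item is at index [1] (a third item would be at index [2]).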
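The indexing can be traced on the HTML fragment printed above. The sketch below parses that fragment from a string (with html.parser, to avoid the lxml dependency) and inspects the second nested div; the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# The fragment shown in the output above, as a string.
html = """<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
4.2/5
</div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
103,672</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="audience-info hidden-xs superPageFontColor")

# find_all() returns every nested <div> as a list.
inner_divs = outer.find_all("div")
print(len(inner_divs))  # 2

# contents of the second div: whitespace text, the <span> tag,
# then the text node that carries the count.
second = inner_divs[1]
count_text = second.contents[2]
print(count_text.strip())  # 103,672
```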
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
    num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
    print(num_audience_ratings)
    break
103672
Note: strip() removes leading and trailing whitespace; replace() substitutes one substring for another.
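The effect of the two calls can be seen on a plain string. This is a minimal sketch; raw_count mimics the text node from the page, with a leading newline and a thousands separator.

```python
raw_count = "\n103,672"

# int() rejects the comma, so the raw text cannot be converted directly.
try:
    int(raw_count)
except ValueError:
    print("cannot convert the raw text")

# strip() removes the surrounding whitespace, replace() removes the comma.
cleaned = raw_count.strip().replace(",", "")
print(repr(cleaned))  # '103672'
print(int(cleaned))   # 103672
```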
# Without strip() and replace(), the value still contains whitespace and a comma
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
    num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2]
    print(num_audience_ratings)
    break
103,672
Combining the three extractions:
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
    audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
    num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
    num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
6. Build a list
df_list = []
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
    audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
    num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
    num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
    # One dictionary per movie; numeric fields are converted to int
    df_list.append({'title': title,
                    'audience_score': int(audience_score),
                    'num_audience_ratings': int(num_audience_ratings)})
7. Convert the list to a DataFrame
df_list = []
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
    audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
    num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
    num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
    df_list.append({'title': title,
                    'audience_score': int(audience_score),
                    'number_of_audience_ratings': int(num_audience_ratings)})
# Build the DataFrame once, after the loop has collected every movie
df2 = pd.DataFrame(df_list, columns=['title', 'audience_score', 'number_of_audience_ratings'])
Note: the names in columns must match the dictionary keys used in df_list.append; any column whose name has no matching key is filled with empty (NaN) values.
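The effect of a mismatched column name can be reproduced with a one-row list. This is a sketch with invented data; df_mismatch and df_match are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical row whose key is 'num_audience_ratings'.
rows = [{"title": "Movie A", "num_audience_ratings": 100}]

# The requested column name does not match the key, so the column is all NaN.
df_mismatch = pd.DataFrame(rows, columns=["title", "number_of_audience_ratings"])
print(df_mismatch["number_of_audience_ratings"].isna().all())  # True

# With a matching name the value comes through.
df_match = pd.DataFrame(rows, columns=["title", "num_audience_ratings"])
print(df_match["num_audience_ratings"].tolist())  # [100]
```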
8. Merge the critic-score and audience-score DataFrames
First, load the critic-score DataFrame (df1):
import pandas as pd
df1 = pd.read_csv('C:\\Users\\Administrator\\Desktop\\rt-html\\bestofrt.tsv', sep='\t')
Next, merge df1 and df2 on the title column:
df = pd.merge(df1, df2, on='title')
The merged df turns out to be empty:
df = pd.merge(df1,df2,on="title")
df
Out[50]:
Empty DataFrame
Columns: [ranking, critic_score, title, number_of_critic_ratings, audience_score, number_of_audience_ratings]
Index: []
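The cause lies in how the title is extracted: the slice drops the last 17 characters, whereas the suffix to remove is 18 characters long, so the titles in df1 and df2 do not match.
The fix is to append: replace(u'\xa0', u' ').strip()
replace(u'\xa0', u' ') replaces the non-breaking space before the year in the title with an ordinary space
strip() removes the whitespace at both ends of the title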
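The cleanup can be followed step by step on a single string. The raw title below is an assumption made for illustration (a suffix of ' - Rotten Tomatoes' and a non-breaking space before the year); it is not copied from the data set.

```python
# Hypothetical raw <title> text; \xa0 is a non-breaking space.
raw_title = "The Wizard of Oz\xa0(1939) - Rotten Tomatoes"

# Dropping 17 characters leaves one character of the 18-character suffix behind.
sliced = raw_title[:-len('--Rotten Tomatoes')]
print(repr(sliced))  # 'The Wizard of Oz\xa0(1939) '

# Replace the non-breaking space, then trim the leftover trailing space.
cleaned = sliced.replace('\xa0', ' ').strip()
print(repr(cleaned))  # 'The Wizard of Oz (1939)'
```

Printing with repr() makes the otherwise invisible \xa0 and trailing space show up, which is a convenient way to find out why two titles that look identical fail to match.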
df_list = []
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
    title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')].replace(u'\xa0', u' ').strip()
    audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
    num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
    num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
    df_list.append({'title': title,
                    'audience_score': int(audience_score),
                    'number_of_audience_ratings': int(num_audience_ratings)})
df2 = pd.DataFrame(df_list, columns=['title', 'audience_score', 'number_of_audience_ratings'])
Then merge df1 and df2 on title again.
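One possible way to confirm that the titles now line up is an outer merge with indicator=True, which labels each row as coming from both frames or from only one of them. The sketch below uses two small invented frames (left and right are hypothetical stand-ins for df1 and df2), one of which still carries a non-breaking space.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2.
left = pd.DataFrame({"title": ["Movie A (2000)", "Movie B (2001)"],
                     "critic_score": [99, 95]})
right = pd.DataFrame({"title": ["Movie A (2000)", "Movie B\xa0(2001)"],
                      "audience_score": [90, 85]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
check = pd.merge(left, right, on="title", how="outer", indicator=True)

# Rows that did not find a partner in the other frame.
unmatched = check[check["_merge"] != "both"]
print(len(unmatched))  # 2
print([repr(t) for t in unmatched["title"]])
```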
9. Plot the data
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df.audience_score, df.critic_score)