Finding the Best Movies: Extracting the Data You Want from HTML

I. Import the required Python modules

from bs4 import BeautifulSoup
import os
import pandas as pd

II. Loop over the files in the target directory, open each one, and make the soup

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open (os.path.join(folder,movie_html)) as file:
        soup = BeautifulSoup(file,'lxml')

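Note: when no encoding is given, open uses the platform's locale encoding, which is gbk on a Chinese-language Windows system. If a file is not gbk-encoded, reading it typically fails with a UnicodeDecodeError, so the encoding has to be given explicitly.
Specify the encoding as follows: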

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding="utf-8") as file:
        soup = BeautifulSoup(file,"lxml")
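
To make the encoding point concrete, here is a minimal sketch using only the standard library. The sample string and the temporary file are illustrative assumptions, not files from the rt_html folder. It writes some UTF-8 text and reads it back twice, once with the matching encoding and once with gbk.

```python
import os
import tempfile

# Illustrative text with non-ASCII characters (not taken from the real pages).
text = "Amélie\xa0(2001)"

# Write the text as UTF-8, the encoding the saved pages are assumed to use.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".html", delete=False
) as tmp:
    tmp.write(text)
    path = tmp.name

# Reading with the matching encoding recovers the text exactly.
with open(path, encoding="utf-8") as file:
    as_utf8 = file.read()

# Reading with a mismatched codec either raises UnicodeDecodeError
# or silently produces garbled text; both outcomes are handled here.
try:
    with open(path, encoding="gbk") as file:
        as_gbk = file.read()
except UnicodeDecodeError:
    as_gbk = None

os.remove(path)

print(as_utf8 == text)
print(as_gbk == text)
```

A mismatched codec does not always fail loudly, which is why passing encoding explicitly is safer than relying on the default.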

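Note: the file can also be opened in binary mode ('rb'). BeautifulSoup then receives bytes and detects the encoding itself: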

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),'rb') as file:
        soup = BeautifulSoup(file,'lxml')
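
The sketch below illustrates the binary-mode variant on an in-memory page instead of a file. The HTML is made up for the example, and html.parser is swapped in for lxml only so that the snippet has no extra dependency; the article's own code uses 'lxml'.

```python
from bs4 import BeautifulSoup

# Illustrative page as raw bytes, the form that open(..., 'rb') hands over.
html_bytes = (
    '<html><head><meta charset="utf-8">'
    "<title>Amélie (2001)</title></head><body></body></html>"
).encode("utf-8")

# Given bytes, BeautifulSoup works out the encoding itself
# (here the meta charset declaration helps) and decodes to str.
soup = BeautifulSoup(html_bytes, "html.parser")
title = soup.find("title").get_text()

print(title)
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
```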

III. Find the content in the HTML file: title

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
        print(title)
        break

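Note: Python 2 uses the statement form, print title, while Python 3 requires the function form, print(title). Using the Python 2 form under Python 3 raises a SyntaxError.
Note: BeautifulSoup's find function returns the first matching tag within the searched region.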
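
As a small illustration of how find behaves, the sketch below compares it with find_all on a made-up HTML snippet (the markup is an assumption for the example, and html.parser is used in place of lxml to avoid the extra dependency).

```python
from bs4 import BeautifulSoup

# Made-up markup with two matching tags.
html = """
<html><head><title>Example Movie (2017)</title></head>
<body>
  <span class="score">97%</span>
  <span class="score">88%</span>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_span = soup.find("span")      # only the first match
all_spans = soup.find_all("span")   # every match, as a list
missing = soup.find("table")        # no match at all

print(first_span.get_text())
print([span.get_text() for span in all_spans])
print(missing)
```

Because find returns None when nothing matches, chaining another call onto a failed lookup raises an AttributeError; that is worth keeping in mind when a page is laid out differently from the others.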

IV. Find the content in the HTML file: audience_score

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        audience_score = soup.find('div',class_='audience-score meter').find('span')
        print (audience_score)
        break

<span class="superPageFontColor" style="vertical-align:top">97%</span>

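Note: Python names may contain underscores but not hyphens. audience_score is a valid variable name, but audience-score is not, because Python reads the hyphen as a minus sign. (The hyphen inside the class_ string 'audience-score meter' is fine, since it is only text.)
Refined version, keeping only the number: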

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        audience_score = soup.find('div',class_='audience-score meter').find('span').contents[0][:-1]
        print (audience_score)
        break

97
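
The [:-1] slice drops the last character whatever it happens to be. A slightly more defensive variant is sketched below; parse_percent is a hypothetical helper name introduced for this example, not part of the original code.

```python
def parse_percent(text):
    """Convert a string such as '97%' to the integer 97."""
    cleaned = text.strip()
    if not cleaned.endswith("%"):
        # Fail loudly instead of silently cutting off a digit.
        raise ValueError(f"expected a percentage, got {text!r}")
    return int(cleaned[:-1])


print(parse_percent("97%"))
print(parse_percent(" 97% "))
```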

V. Find the content in the HTML file: num_audience_ratings
1. When the structure is more complex, start from the enclosing class

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        num_audience_ratings = soup.find('div',class_='audience-info hidden-xs superPageFontColor')
        print (num_audience_ratings)
        break

<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
            4.2/5
                </div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        103,672</div>
</div>

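The div with this class contains two nested divs, and the number of ratings sits in the second one. find_all returns a list, so take its second item, which has index [1] (a third item would have index [2]).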

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        num_audience_ratings = soup.find('div',class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')
        print (num_audience_ratings)
        break

103672

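Note: strip removes leading and trailing whitespace; replace substitutes characters (here it removes the thousands separator).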

# Without strip and replace, the value still contains whitespace and the comma
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        num_audience_ratings = soup.find('div',class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2]
        print (num_audience_ratings)
        break


        103,672
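
To see why the count is at contents[2], the sketch below lists the children of the second inner div. It reuses the HTML fragment printed earlier in this section as an in-memory string; html.parser replaces lxml only to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

html = """<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
            4.2/5
                </div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        103,672</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
info = soup.find("div", class_="audience-info hidden-xs superPageFontColor")
second_div = info.find_all("div")[1]

# The children are: a newline string, the <span> tag, and the text with the count.
for index, child in enumerate(second_div.contents):
    print(index, repr(child))

raw = second_div.contents[2]
cleaned = raw.strip().replace(",", "")
print(int(cleaned))
```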

Putting the three together:

folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
        audience_score = soup.find('div',class_='audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div',class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')

VI. Build the list

df_list = []
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
        audience_score = soup.find('div',class_='audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div',class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')
        df_list.append({'title':title,
                        'audience_score':int(audience_score),
                        'num_audience_ratings':int(num_audience_ratings)})
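
If some pages lack one of the expected tags, the chained calls above stop the loop with an AttributeError. One possible way to guard against that is sketched below. parse_audience is a hypothetical helper covering only the two numeric fields, the sample markup is made up, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup


def parse_audience(html):
    """Return the two audience fields as a dict, or None if a tag is missing."""
    soup = BeautifulSoup(html, "html.parser")
    score_div = soup.find("div", class_="audience-score meter")
    info_div = soup.find("div", class_="audience-info hidden-xs superPageFontColor")
    if score_div is None or info_div is None:
        return None
    span = score_div.find("span")
    inner_divs = info_div.find_all("div")
    if span is None or len(inner_divs) < 2:
        return None
    audience_score = span.contents[0][:-1]
    num_audience_ratings = inner_divs[1].contents[2].strip().replace(",", "")
    return {
        "audience_score": int(audience_score),
        "num_audience_ratings": int(num_audience_ratings),
    }


# Made-up page containing just the parts the function looks for.
sample = """
<div class="audience-score meter"><span class="superPageFontColor">97%</span></div>
<div class="audience-info hidden-xs superPageFontColor">
<div><span>Average Rating:</span> 4.2/5</div>
<div>
<span>User Ratings:</span>
        103,672</div>
</div>
"""

print(parse_audience(sample))
print(parse_audience("<html></html>"))
```

In the loop, a None result could simply be skipped instead of appended to df_list.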

VII. Convert the list to a DataFrame

df_list = []
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')]
        audience_score = soup.find('div',class_='audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div',class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')
        df_list.append({'title':title,
                        'audience_score':int(audience_score),
                        'number_of_audience_ratings':int(num_audience_ratings)})
df2 = pd.DataFrame(df_list,columns=['title','audience_score','number_of_audience_ratings'])

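Note: the names passed to columns must match the dictionary keys used in df_list.append; otherwise the mismatched column is filled with missing values (NaN).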
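
The effect of a mismatched column name can be reproduced with a single made-up row (the values are placeholders for illustration):

```python
import pandas as pd

# One illustrative row whose key is 'num_audience_ratings'.
rows = [
    {"title": "Example Movie (2017)", "audience_score": 97, "num_audience_ratings": 103672}
]

# 'number_of_audience_ratings' is not a key in the dicts, so that column gets no data.
mismatched = pd.DataFrame(
    rows, columns=["title", "audience_score", "number_of_audience_ratings"]
)
# With matching names the value comes through.
matched = pd.DataFrame(
    rows, columns=["title", "audience_score", "num_audience_ratings"]
)

print(mismatched["number_of_audience_ratings"].isna().all())
print(matched["num_audience_ratings"].iloc[0])
```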

VIII. Merge the critic-score and audience-score DataFrames

First, load the critic-score DataFrame (df1)

import pandas as pd
df1 = pd.read_csv('C:\\Users\\Administrator\\Desktop\\rt-html\\bestofrt.tsv',sep='\t')

Next, merge df1 and df2 on the title column

df = pd.merge(df1,df2,on='title')

The resulting df turns out to be empty:

df = pd.merge(df1,df2,on="title")

df
Out[50]: 
Empty DataFrame
Columns: [ranking, critic_score, title, number_of_critic_ratings, audience_score, number_of_audience_ratings]
Index: []
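
When an inner merge comes back empty, one way to see which keys failed to match is an outer merge with indicator=True, which labels each row by the frame it came from. The two small frames below are made-up stand-ins for df1 and df2 with placeholder scores, and the stray characters in the second title are an assumption about what a scraped title might look like.

```python
import pandas as pd

# Made-up stand-ins for df1 and df2 (scores are placeholders).
critics = pd.DataFrame({"title": ["Example Movie (2017)"], "critic_score": [90]})
audience = pd.DataFrame({"title": ["Example Movie\xa0(2017) "], "audience_score": [80]})

# The inner merge finds no common titles.
inner = pd.merge(critics, audience, on="title")
print(inner.empty)

# The outer merge keeps every row and records its origin in the _merge column.
check = pd.merge(critics, audience, on="title", how="outer", indicator=True)
for title, origin in zip(check["title"], check["_merge"]):
    # repr exposes invisible characters such as \xa0 and trailing spaces.
    print(repr(title), origin)
```

Printing the titles with repr makes invisible differences visible, which a plain print would hide.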

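The cause lies in how the title was extracted: the slice removes the last 17 characters, but 18 need to be removed, so the titles in df1 and df2 do not match.
The fix is to append replace(u'\xa0', u' ').strip() to the title expression.
replace(u'\xa0', u' ') turns the non-breaking space before the year in the title into an ordinary space.
strip() removes the whitespace at the beginning and end of the title.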

df_list = []
folder = "rt_html"
for movie_html in os.listdir(folder):
    with open(os.path.join(folder,movie_html),encoding='utf-8') as file:
        soup = BeautifulSoup(file,'lxml')
        title = soup.find('title').contents[0][:-len('--Rotten Tomatoes')].replace(u'\xa0', u' ').strip()
        audience_score = soup.find('div',class_='audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div',class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')
        df_list.append({'title':title,
                        'audience_score':int(audience_score),
                        'number_of_audience_ratings':int(num_audience_ratings)})
df2 = pd.DataFrame(df_list,columns=['title','audience_score','number_of_audience_ratings'])
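
The effect of the two extra calls can be checked on a single string. The raw title below is an assumed example of the page title format (a non-breaking space before the year and an 18-character suffix), not a value copied from the data.

```python
# Assumed page title: non-breaking space before the year, ' - Rotten Tomatoes' at the end.
raw_title = "Example Movie\xa0(2017) - Rotten Tomatoes"

# Slicing off 17 characters leaves one trailing space behind.
sliced = raw_title[:-len("--Rotten Tomatoes")]

# replace swaps the non-breaking space for a normal one; strip removes the leftover space.
cleaned = sliced.replace("\xa0", " ").strip()

print(repr(sliced))
print(repr(cleaned))
print(cleaned == "Example Movie (2017)")
```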

Then merge df1 and df2 again.

IX. Plot the results

import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df.audience_score,df.critic_score)
