This article uses the BeautifulSoup library to analyze a target web page, locate the fields identified in the page source, scrape them, and save them to a file, for reference.
I started from another article, "Python获取网页指定内容(BeautifulSoup工具的使用方法)". When I tried it myself, the request failed with urllib.error.HTTPError: HTTP Error 418; after some searching it turned out that some sites have anti-crawler mechanisms. The fix, described in "Python爬虫的urllib.error.HTTPError: HTTP Error 418错误", is to set a Headers field (User-Agent) so the request looks as if it comes from a browser, which lets you retrieve the data.
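As a minimal sketch of that fix (using the same URL and User-Agent string as the full script later in this article), wrapping the URL in a Request object with a browser-style User-Agent header is enough to avoid the 418 response:

from urllib.request import urlopen, Request

url = 'http://movie.douban.com/top250?format=text'
# Any common browser User-Agent string works; this one identifies as Chrome on Windows
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
req = Request(url, headers=headers)   # attach the headers to the request
html = urlopen(req).read()            # fetched without HTTP Error 418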
The site being scraped is Douban Movies (豆瓣电影 Top 250).
Inspecting the page source:
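(A simplified sketch of one list entry's markup, reconstructed from the fields the script below extracts; the real page carries more attributes and nested tags, and the values here are placeholders.)

<div class="info">
    <div class="hd">
        <a href="https://movie.douban.com/subject/XXXXXXX/">
            <span class="title">影片名</span>
        </a>
    </div>
    <div class="bd">
        <div class="star">
            <span class="rating5-t"></span>
            <span class="rating_num">9.X</span>
            <span></span>
            <span>XXXXXX人评价</span>
        </div>
    </div>
</div>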
Combining the code from the two articles above, the cleaned-up script is as follows:
# -*- coding:utf-8 -*-
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from pandas import DataFrame

url = 'http://movie.douban.com/top250?format=text'
# Pretend to be a browser so Douban does not reject the request with HTTP Error 418
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
ret = Request(url, headers=headers)
res = urlopen(ret)
contents = res.read()
soup = BeautifulSoup(contents, "html.parser")

print("豆瓣电影TOP250" + "\n" + " 影片名 评分 评价人数 链接 ")
df_ret = DataFrame(columns=["影片名", "评分", "评价人数", "链接"])
count = 0
# Each movie entry sits in a <div class="info"> block
for tag in soup.find_all('div', class_='info'):
    m_name = tag.find('span', class_='title').get_text()                      # movie title
    m_rating_score = float(tag.find('span', class_='rating_num').get_text())  # rating
    m_people = tag.find('div', class_="star")
    m_span = m_people.findAll('span')
    m_peoplecount = m_span[3].contents[0]                                      # number of ratings ("XXXX人评价")
    m_url = tag.find('a').get('href')                                          # link to the detail page
    print(m_name + " " + str(m_rating_score) + " " + m_peoplecount + " " + m_url)
    df_ret.loc[count] = [m_name, str(m_rating_score), m_peoplecount, m_url]
    count = count + 1

# Save the results to a CSV file
df_ret.to_csv('movies_names_set.csv', encoding='gbk')
print(df_ret.head())
The resulting CSV file looks like this:
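(A sketch of the column layout that to_csv produces; the leading unnamed column is the DataFrame index, and the values shown are placeholders rather than real scraped data.)

,影片名,评分,评价人数,链接
0,影片名1,9.X,XXXXXX人评价,https://movie.douban.com/subject/XXXXXXX/
1,影片名2,9.X,XXXXXX人评价,https://movie.douban.com/subject/XXXXXXX/
...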