Movie data on Douban is relatively easy to obtain, which makes it a popular practice target for people learning web scraping; when I first learned scraping I also practiced on Douban and Maoyan. As far as reliability goes, Douban's ratings are the more trustworthy. Previously I always scraped the movie ranking pages, which are fairly simple and already contain the basic information I wanted. For this exercise, I want to scrape the detail page of every movie in the Douban Top 250 and organize the information into (node, edge, node) triples, similar to a knowledge graph.
1. Scraping approach
First, collect every movie's link from the ranking pages and store them in a dictionary. Once all the links have been gathered, visit each of the 250 detail pages in turn and extract the desired fields.
2. Analyzing the Douban pages
The Top 250 ranking lives at https://movie.douban.com/top250, and each movie's detail page at https://movie.douban.com/subject/xxxxxxx/, where the seven trailing digits differ from movie to movie.
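The ranking itself is split across ten pages driven by a `start` query parameter, 25 movies per page, so all ten page URLs can be generated up front:

```python
# The ten ranking pages differ only in the `start` query parameter,
# which advances by 25 movies per page.
urls = ["https://movie.douban.com/top250?start=" + str(i * 25) for i in range(10)]
print(urls[0])   # https://movie.douban.com/top250?start=0
print(urls[-1])  # https://movie.douban.com/top250?start=225
```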
The ranking page
Using the browser's inspector, click a movie's title to locate the a tag that holds the movie's URL.
The first step is to collect the URLs of all 250 movies.
The movie detail page
The detail page is where the movie's basic information is collected.
After extracting this information (with whitespace already stripped), it turns out the fields are laid out irregularly, as shown in the figure below:
For example, the director, scriptwriters, and cast each sit in a span nested inside another span, while fields such as genre are wrapped in only a single outer span; moreover the genre, region, and language fields share the same tag class, which makes them hard to tell apart with PyQuery selectors. So I fell back on the most basic and most versatile tool: regular expressions.
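As a small illustration, here is how a regular expression pulls the director's name out of such a flattened fragment. The fragment below is hand-made to imitate the de-whitespaced #info string (the href is a dummy, not a real celebrity ID):

```python
import re

# Hand-made fragment imitating the #info block after whitespace removal
# (not actual Douban output; the href is a dummy).
info = '导演:<ahref="/celebrity/0000000/"rel="v:directedBy">弗兰克·德拉邦特</a><br/>'
director = re.match('.*?rel="v:directedBy">(.*?)</a>.*?', info).group(1)
print(director)  # 弗兰克·德拉邦特
```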
3. Getting the movie URLs
import time
from pyquery import PyQuery as pq

# Collect the detail-page URL of each of the 250 movies from the ranking pages
movie_urls = {}
movie_names = []
for i in range(10):
    url = "https://movie.douban.com/top250?start=" + str(i * 25)
    html = pq(url=url)
    a = html('.item .pic a')
    for b in a.items():
        movie_name = b('img').attr('alt')
        movie_names.append(movie_name)
        movie_url = b.attr('href')
        movie_urls[movie_name] = movie_url
        # print("%s:%s" % (movie_name, movie_url))
    print("ok" + str(i))
    time.sleep(2)  # wait two seconds after each page to avoid hammering the site
Here PyQuery locates the target a tags directly; reading each tag's href gives the movie's address, which is stored together with the movie's name in the movie_urls dictionary.
4. Scraping the movie detail pages
import re
from pyquery import PyQuery as pq

# url: one of the detail-page URLs collected in the previous step
html = pq(url=url)
# Get the movie's rating
sore = html(".rating_num").text()
# Locate the info block and convert the PyQuery object to a string
info = str(html(".subject.clearfix #info").children())
# Strip spaces and line breaks from the block
info = info.replace(' ', '').replace('\n', '').replace('\r', '')
# Match the director
director = ""
try:
    director = re.match('.*?rel="v:directedBy">(.*?)</a>.*?', info).group(1)
except AttributeError:
    print("failed to get the director")
# Match the scriptwriters: take the first two if there are two or more, else the single one
scriptwriters = []
try:
    scriptwriter = re.match('.*?编剧.*?ahref=.*?">(.*?)</a>.*?ahref=.*?">(.*?)</a>', info)
    scriptwriters.append(scriptwriter.group(1))
    scriptwriters.append(scriptwriter.group(2))
except AttributeError:
    scriptwriter = re.match('.*?编剧.*?ahref=.*?">(.*?)</a>.*?', info)
    if scriptwriter:
        scriptwriters.append(scriptwriter.group(1))
# Match the cast. Most movies list five or more actors, a few only four;
# this matches exactly five, and the rare movies with fewer can be filled in by hand
actors = []
try:
    actor = re.match('.*?主演.*?starring">(.*?)</a>.*?v:starring">(.*?)</a>.*?starring">(.*?)</a>.*?starring">(.*?)</a>.*?starring">(.*?)</a>.*?', info)
    for i in range(1, 6):
        actors.append(actor.group(i))
except AttributeError:
    print("no suitable cast information matched")
# Match the genres, at most two
movie_types = []
try:
    types = re.match('.*?类型.*?genre">(.*?)</span>.*?genre">(.*?)</span>.*?', info)
    for i in range(1, 3):
        movie_types.append(types.group(i))
except AttributeError:
    single = re.match('.*?类型.*?genre">(.*?)</span>.*?', info)
    if single:
        movie_types.append(single.group(1))
    else:
        print("failed to get the genres")
# Match the region
location = ""
try:
    location = re.match('.*?地区:</span>(.*?)<br/>', info).group(1)
except AttributeError:
    print("failed to get the region")
# Match the language
language = ""
try:
    language = re.match('.*?语言:</span>(.*?)<br/>', info).group(1)
except AttributeError:
    print("failed to get the language")
# Match the release date
movie_date = ""
try:
    movie_date = re.match('.*?上映日期.*?">(.*?)</span>', info).group(1)
except AttributeError:
    print("failed to get the release date")
# Match the running time
movie_time = ""
try:
    movie_time = re.match('.*?片长.*?">(.*?)</span>', info).group(1)
except AttributeError:
    print("failed to get the running time")
# Match the alternative titles
movie_other_name = ""
try:
    movie_other_name = re.match('.*?又名:</span>(.*?)<br/>', info).group(1)
except AttributeError:
    print("failed to get the alternative titles")
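The two-then-one fallback used for the scriptwriters above can be checked against hand-made fragments (again imitating the flattened #info string, not real page output):

```python
import re

# Hand-made fragments imitating the flattened #info string (not real output).
two_writers = '编剧:<ahref="/a/">甲</a>/<ahref="/b/">乙</a><br/>'
one_writer = '编剧:<ahref="/a/">甲</a><br/>'

def match_scriptwriters(info):
    # Try the two-writer pattern first, then fall back to a single writer
    m = re.match('.*?编剧.*?ahref=.*?">(.*?)</a>.*?ahref=.*?">(.*?)</a>', info)
    if m:
        return [m.group(1), m.group(2)]
    m = re.match('.*?编剧.*?ahref=.*?">(.*?)</a>.*?', info)
    return [m.group(1)] if m else []

print(match_scriptwriters(two_writers))  # ['甲', '乙']
print(match_scriptwriters(one_writer))   # ['甲']
```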
5. Storing the scraped movie information in MySQL
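The script below assumes a table named top already exists. Its schema is not shown in the original; a definition consistent with the INSERT statement might look like the following (the column names and order come from the code, but the column types are my assumptions):

```python
# Hypothetical DDL for the `top` table; the column list matches the INSERT
# in the script below, but the column types are assumptions.
CREATE_TOP = """
CREATE TABLE IF NOT EXISTS top (
    id INT PRIMARY KEY,
    movie_name VARCHAR(128),
    scriptwriters VARCHAR(256),
    sore VARCHAR(16),
    director VARCHAR(128),
    actors VARCHAR(256),
    types VARCHAR(64),
    location VARCHAR(128),
    language VARCHAR(128),
    movie_date VARCHAR(128),
    movie_time VARCHAR(64),
    movie_other_name VARCHAR(256)
) DEFAULT CHARSET = utf8mb4
"""
# Would be executed once with: cur.execute(CREATE_TOP)
```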
The complete code for scraping the movies and writing them into MySQL:
from pyquery import PyQuery as pq
import re
import time
import pymysql


def get_details(url):
    html = pq(url=url)
    # Get the movie's rating
    sore = html(".rating_num").text()
    # Locate the info block and convert the PyQuery object to a string
    info = str(html(".subject.clearfix #info").children())
    # Strip spaces and line breaks from the block
    info = info.replace(' ', '').replace('\n', '').replace('\r', '')
    # Match the director
    director = ""
    try:
        director = re.match('.*?rel="v:directedBy">(.*?)</a>.*?', info).group(1)
    except AttributeError:
        print("failed to get the director")
    # Match the scriptwriters: take the first two if there are two or more, else the single one
    scriptwriters = []
    try:
        scriptwriter = re.match('.*?编剧.*?ahref=.*?">(.*?)</a>.*?ahref=.*?">(.*?)</a>', info)
        scriptwriters.append(scriptwriter.group(1))
        scriptwriters.append(scriptwriter.group(2))
    except AttributeError:
        scriptwriter = re.match('.*?编剧.*?ahref=.*?">(.*?)</a>.*?', info)
        if scriptwriter:
            scriptwriters.append(scriptwriter.group(1))
    # Match the cast. Most movies list five or more actors, a few only four;
    # this matches exactly five, and the rare movies with fewer can be filled in by hand
    actors = []
    try:
        actor = re.match('.*?主演.*?starring">(.*?)</a>.*?v:starring">(.*?)</a>.*?starring">(.*?)</a>.*?starring">(.*?)</a>.*?starring">(.*?)</a>.*?', info)
        for i in range(1, 6):
            actors.append(actor.group(i))
    except AttributeError:
        print("no suitable cast information matched")
    # Match the genres, at most two
    movie_types = []
    try:
        types = re.match('.*?类型.*?genre">(.*?)</span>.*?genre">(.*?)</span>.*?', info)
        for i in range(1, 3):
            movie_types.append(types.group(i))
    except AttributeError:
        single = re.match('.*?类型.*?genre">(.*?)</span>.*?', info)
        if single:
            movie_types.append(single.group(1))
        else:
            print("failed to get the genres")
    # Match the region
    location = ""
    try:
        location = re.match('.*?地区:</span>(.*?)<br/>', info).group(1)
    except AttributeError:
        print("failed to get the region")
    # Match the language
    language = ""
    try:
        language = re.match('.*?语言:</span>(.*?)<br/>', info).group(1)
    except AttributeError:
        print("failed to get the language")
    # Match the release date
    movie_date = ""
    try:
        movie_date = re.match('.*?上映日期.*?">(.*?)</span>', info).group(1)
    except AttributeError:
        print("failed to get the release date")
    # Match the running time
    movie_time = ""
    try:
        movie_time = re.match('.*?片长.*?">(.*?)</span>', info).group(1)
    except AttributeError:
        print("failed to get the running time")
    # Match the alternative titles
    movie_other_name = ""
    try:
        movie_other_name = re.match('.*?又名:</span>(.*?)<br/>', info).group(1)
    except AttributeError:
        print("failed to get the alternative titles")
    context = {
        "sore": sore,
        "scriptwriters": scriptwriters,
        "director": director,
        "actors": actors,
        "movie_types": movie_types,
        "location": location,
        "language": language,
        "movie_date": movie_date,
        "movie_time": movie_time,
        "movie_other_name": movie_other_name
    }
    return context


if __name__ == '__main__':
    # Connect to the database
    db = pymysql.connect(host="localhost", user="root", password="******", db="movies", port=3306)
    cur = db.cursor()
    # Collect the detail-page URL of each of the 250 movies from the ranking pages
    movie_urls = {}
    movie_names = []
    for i in range(10):
        url = "https://movie.douban.com/top250?start=" + str(i * 25)
        html = pq(url=url)
        a = html('.item .pic a')
        for b in a.items():
            movie_name = b('img').attr('alt')
            movie_names.append(movie_name)
            movie_url = b.attr('href')
            movie_urls[movie_name] = movie_url
            # print("%s:%s" % (movie_name, movie_url))
        print("ok" + str(i))
        time.sleep(2)
    id = 1  # id stores the movie's rank
    for name in movie_names:
        print("start scraping %s" % name)
        movie_info = get_details(movie_urls[name])
        movie_info["name"] = name
        # Join the list fields into '/'-separated strings
        movie_info["actors"] = '/'.join(movie_info["actors"])
        movie_info["scriptwriters"] = '/'.join(movie_info["scriptwriters"])
        movie_info["movie_types"] = '/'.join(movie_info["movie_types"])
        data = (str(id), movie_info["name"], movie_info["scriptwriters"], movie_info["sore"], movie_info["director"],
                movie_info["actors"], movie_info["movie_types"], movie_info["location"], movie_info["language"],
                movie_info["movie_date"], movie_info["movie_time"], movie_info["movie_other_name"])
        print(data)
        sql_insert = """insert into top(id, movie_name, scriptwriters, sore, director, actors, types, location,
                        language, movie_date, movie_time, movie_other_name)
                        values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                     """
        reCount = cur.execute(sql_insert, data)
        db.commit()
        time.sleep(2)  # wait two seconds after each page
        id += 1
    cur.close()
    db.close()
After the code has run, the database contains the following data:
6. Using Neo4j to display the scraped movie data as relationships
Neo4j does provide a data-import interface: write the data into CSV files in the required format and import them. But as a beginner I build the graph database here in the most primitive way: fetch the movie records from MySQL one by one, turn each into the corresponding nodes and edges, and take care not to create duplicate nodes along the way. The construction code:
from py2neo import Graph, Node, Relationship, NodeMatcher
import pymysql

if __name__ == '__main__':
    # Connect to Neo4j
    graph = Graph('http://localhost:7474', username='neo4j', password='123456')
    # Connect to MySQL
    db = pymysql.connect(host="localhost", user="root", password="******", db="movies", port=3306)
    cur = db.cursor(cursor=pymysql.cursors.DictCursor)
    for i in range(1, 251):
        sql = "SELECT * FROM top WHERE id=%s" % str(i)
        cur.execute(sql)
        info = cur.fetchone()
        print(info)
        # Build the movie node (property keys kept in Chinese: 评分 = rating,
        # 类型 = genres, 地区 = region, 语言 = language, 上映日期 = release date,
        # 影片时长 = running time)
        movie_node = Node("movie", name=info['movie_name'], 评分=info['sore'], 类型=info['types'],
                          地区=info['location'], 语言=info['language'], 上映日期=info['movie_date'],
                          影片时长=info['movie_time'])
        graph.create(movie_node)
        # Build the person nodes. Before creating one, look it up in the graph;
        # if it already exists, point the new edge at the existing node instead
        matcher = NodeMatcher(graph)
        director_node = matcher.match("person", name=info['director']).first()
        if director_node is None:
            director_node = Node('person', name=info['director'])
            graph.create(director_node)
        # Build the relationship ("导演" = directed)
        director_relationship = Relationship(director_node, "导演", movie_node)
        graph.create(director_relationship)
        scriptwriters = info['scriptwriters'].split('/')
        for scriptwriter in scriptwriters:
            write_node = matcher.match("person", name=scriptwriter).first()
            if write_node is None:
                write_node = Node('person', name=scriptwriter)
                graph.create(write_node)
            # "编剧" = wrote
            write_relationship = Relationship(write_node, "编剧", movie_node)
            graph.create(write_relationship)
        actors = info['actors'].split('/')
        for actor in actors:
            actor_node = matcher.match("person", name=actor).first()
            if actor_node is None:
                actor_node = Node('person', name=actor)
                graph.create(actor_node)
            # "参演" = acted in
            actor_relationship = Relationship(actor_node, "参演", movie_node)
            graph.create(actor_relationship)
        print("finished movie《%s》" % info['movie_name'])
    cur.close()
    db.close()
When it is done, open Neo4j to see a visualized graph of the movie relationships:
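Beyond the browser view, the finished graph can also be queried programmatically. A hypothetical Cypher query like the following (which could be run via py2neo as `graph.run(QUERY, name="肖申克的救赎").data()`) lists everyone connected to a given movie together with their role:

```python
# Hypothetical Cypher query for exploring the finished graph; the labels
# `person` and `movie` match the nodes created above.
QUERY = """
MATCH (p:person)-[r]->(m:movie {name: $name})
RETURN p.name AS person, type(r) AS role
"""
```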