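I have been learning Neo4j recently. Using the Douban Top 250 movie list as the data set, I wrote a Python crawler that scrapes the pages online and writes the results into Neo4j to build a knowledge graph. The steps are described in detail below.
1. Knowledge graph design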
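Analyzing the page shows that each entry yields movie, country, type, time, director, actor, and score. Here I model movie, country, type, time, director, and actor as nodes and store score as a property of movie. Many tutorials online turn only movie, director, and actor into nodes and keep everything else as properties of movie. I tried that before, but the result was not what I wanted; the reason is covered in section 3. The node and relationship design is shown in the figure below.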
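For readers without the figure, the same design can be written down as plain data. This is only a sketch: the labels, property keys, and relationship types are read off the crawler code in section 2, while the names `NODE_LABELS`, `RELATIONSHIPS`, and `describe_schema` are made up for illustration.

```python
# Node labels and the properties stored on each one (read off the crawler code).
NODE_LABELS = {
    "Movie": ["name", "score"],
    "Person": ["name"],
    "Time": ["year"],
    "Country": ["name"],
    "Type": ["name"],
}

# (start label, relationship type, end label)
RELATIONSHIPS = [
    ("Person", "directed", "Movie"),
    ("Person", "acted_in", "Movie"),
    ("Movie", "created_in", "Time"),
    ("Movie", "produced_by", "Country"),
    ("Movie", "belong_to", "Type"),
]


def describe_schema():
    """Return one Cypher-style pattern string per relationship."""
    return ["(:%s)-[:%s]->(:%s)" % triple for triple in RELATIONSHIPS]


for pattern in describe_schema():
    print(pattern)
```

Note that directors and actors share the `Person` label; only the relationship type tells them apart.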
2. Crawling the data and writing it to Neo4j
Here is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import re
import random
from py2neo import Graph

# Pool of User-Agent strings; one is picked at random for each request
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",  # Chrome
    "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",  # Firefox
    "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",  # IE
    "Opera/9.99 (Windows NT 5.1; U; zh-CN) Presto/9.9.9",  # Opera
]

if __name__ == "__main__":
    # connect to graph
    graph = Graph(
        "http://localhost:11010/",
        username="admin",
        password="password"
    )
    # the Top 250 list spans 10 pages of 25 movies each
    for page_no in range(10):
        ua = random.choice(ua_list)
        url = 'https://movie.douban.com/top250?start=' + str(page_no * 25) + '&filter='
        req = Request(url, headers={'User-agent': ua})
        html = urlopen(req).read()
        soup = BeautifulSoup(html, 'lxml')
        page = soup.find_all('div', {'class': 'item'})
        punc = ':· - ...:-'
        list_item = []  # values that already have a node; re-created for every page
        for item in page:
            try:
                info = item.find('p', {'class': ""}).text.strip().split('\n')
                text0 = info[0]  # director / cast line
                text1 = info[1]  # year / country / type line
                # get film
                film = item.find('span', {'class': 'title'}).text.strip()
                film = re.sub(r"[%s]+" % punc, "", film.strip())
                # get score
                score = item.find('span', {'class': 'rating_num'}).text.strip()
                graph.run(
                    "CREATE (movie:Movie {name:'" + film + "', score:'" + score + "'})")
                # get director
                # Assumption: on the page, the director and cast fields are
                # separated by three non-breaking spaces
                directors = text0.strip().split('\xa0\xa0\xa0')[0].strip().split(':')[1]
                directors = re.sub(r"[%s]+" % punc, "", directors.strip())  # strip special characters first
                if len(directors.split('/')) > 1:
                    print(film + ' has more than one director')
                # create the director node
                if directors not in list_item:
                    graph.run(
                        "CREATE (director:Person {name:'" + directors + "'})")
                    list_item.append(directors)
                # create the director-movie relationship
                graph.run(
                    "match (p:Person{name:'" + directors + "'}),(b:Movie{name:'" + film + "'}) " + "CREATE (p)-[:directed]->(b)")
                # get actor
                actors = text0.strip().split('\xa0\xa0\xa0')[1].strip().split(':')[1]
                actors = re.sub(r"[%s]+" % punc, "", actors.strip())  # strip special characters first
                if len(actors.split('/')) == 1:
                    actor_list = [actors]
                else:
                    actor_list = actors.split('/')
                    if '...' in actor_list:
                        actor_list.remove('...')
                    # the last entry is skipped (the page truncates the cast
                    # list, so it is usually empty or incomplete)
                    actor_list = actor_list[:-1]
                for actor in actor_list:
                    if actor not in list_item:
                        graph.run(
                            "CREATE (actor:Person {name:'" + actor + "'})")
                        list_item.append(actor)
                    graph.run(
                        "match (p:Person{name:'" + actor + "'}),(b:Movie{name:'" + film + "'}) " + "CREATE (p)-[:acted_in]->(b)")
                # get time
                time = text1.strip().split('/')[0].strip()
                if time not in list_item:
                    graph.run(
                        "CREATE (time:Time {year:'" + time + "'})")
                    list_item.append(time)
                graph.run(
                    "match (p:Time{year:'" + time + "'}),(b:Movie{name:'" + film + "'}) " + "CREATE (b)-[:created_in]->(p)")
                # get country
                # there may be more than one; only the first is kept
                country = text1.strip().split('/')[1].strip().split(' ')[0]
                if country not in list_item:
                    graph.run(
                        "CREATE (country:Country {name:'" + country + "'})")
                    list_item.append(country)
                graph.run(
                    "match (p:Country {name:'" + country + "'}),(b:Movie{name:'" + film + "'}) " + "CREATE (b)-[:produced_by]->(p)")
                # get type (one or more per movie)
                types = text1.strip().split('/')[2].strip().split(' ')
                for type_name in types:
                    if type_name not in list_item:
                        graph.run(
                            "CREATE (type:Type {name:'" + type_name + "'})")
                        list_item.append(type_name)
                    graph.run(
                        "match (p:Type{name:'" + type_name + "'}),(b:Movie{name:'" + film + "'}) " + "CREATE (b)-[:belong_to]->(p)")
            except Exception as e:
                # skip the rest of this entry if it does not match the expected layout
                print('skipped part of an entry:', e)
                continue
The code is fairly rough and will be improved later.
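One possible first step in tidying it up is to pull the text parsing out into a function that can be tried without a network connection or a database. The sketch below rests on several assumptions: the info text of an entry has two lines; the director and cast fields are separated by three non-breaking spaces (`FIELD_SEP`); and the second line holds year, country, and type separated by `/`. The names `clean`, `parse_info`, and `SAMPLE` are hypothetical, and `SAMPLE` is typed by hand to mimic the page layout rather than captured from Douban.

```python
import re

PUNC = ':· - ...:-'        # same character set as in the crawler
FIELD_SEP = '\xa0\xa0\xa0'  # assumed separator between the director and cast fields


def clean(text):
    """Remove the characters listed in PUNC, as the crawler does with re.sub."""
    return re.sub(r"[%s]+" % PUNC, "", text.strip())


def parse_info(info_text):
    """Split the two-line info text of one list entry into its fields."""
    lines = info_text.strip().split('\n')
    people, facts = lines[0].strip(), lines[1].strip()

    fields = people.split(FIELD_SEP)
    director = clean(fields[0].split(':')[1])

    # The cast field may be missing or cut short on the page.
    actors = []
    if len(fields) > 1 and ':' in fields[1]:
        names = clean(fields[1].split(':')[1]).split('/')
        # With several names the last one is dropped, mirroring the crawler.
        actors = names if len(names) == 1 else names[:-1]

    year, country, types = [part.strip() for part in facts.split('/')[:3]]
    return {
        "director": director,
        "actors": actors,
        "year": year,
        "country": country.split(' ')[0],  # keep only the first country
        "types": types.split(' '),
    }


# Hand-written sample in the assumed layout (not scraped output).
SAMPLE = (
    "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...\n"
    "        1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情"
)
print(parse_info(SAMPLE))
```

Keeping the parsing separate from the `graph.run` calls also makes it easier to see which entries fail to parse, instead of losing them silently in the `except` branch.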
3. The resulting knowledge graph
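The overall result is shown in the figure above. Because country, type, and time are nodes in their own right, related movies can be found directly by following those nodes. If only movie, director, and actor were nodes, you would have to click on each individual node to see its country, type, and time properties.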
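The same kind of lookup can also be written as a Cypher query. The sketch below only builds the query text and its parameters, so it runs without a database. The relationship types and property keys come from the crawler code, while `LOOKUPS` and `movies_by` are hypothetical names, and the `$value` parameter syntax assumes a Neo4j version that supports it.

```python
# (relationship type, property key) for each node label, as used in the crawler
LOOKUPS = {
    "Country": ("produced_by", "name"),
    "Type": ("belong_to", "name"),
    "Time": ("created_in", "year"),
}


def movies_by(label, value):
    """Return a (cypher, parameters) pair listing the movies linked to one node."""
    rel, key = LOOKUPS[label]
    cypher = (
        "MATCH (m:Movie)-[:%s]->(n:%s {%s: $value}) "
        "RETURN m.name AS name, m.score AS score" % (rel, label, key)
    )
    return cypher, {"value": value}


cypher, params = movies_by("Country", "美国")
print(cypher)
print(params)
```

With the `graph` object from section 2, the pair could be executed as `graph.run(cypher, **params).data()`, which should return one dictionary per movie.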
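With that, a simple Douban Top 250 knowledge graph is built. One problem remains, however: duplicated data. After finishing, I found that not only nodes but also relationships are duplicated. I am still investigating this.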
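A possible explanation, offered as a guess rather than a confirmed diagnosis: `list_item` is re-created for every page, so a name that already got a node on an earlier page is created again with `CREATE`. Once two nodes share a name, the `match ... CREATE` statement adds a relationship for every matching pair, which would account for the duplicated relationships. Running the script twice would double everything as well. A common remedy is Cypher's `MERGE`, which creates a node or relationship only if it does not exist yet. The sketch below assumes py2neo's `Graph.run` accepts query parameters as keyword arguments; `merge_movie` and `link_person` are hypothetical helpers, and `graph` can be any object with a compatible `run` method.

```python
ALLOWED_RELS = ("directed", "acted_in")


def merge_movie(graph, name, score):
    """Create the Movie node if it is missing, then set its score."""
    graph.run(
        "MERGE (m:Movie {name: $name}) SET m.score = $score",
        name=name, score=score,
    )


def link_person(graph, person, film, rel_type):
    """Merge a Person node, the Movie node, and one relationship between them."""
    # Relationship types cannot be passed as parameters, so the value is
    # checked against a fixed list before being placed in the query text.
    if rel_type not in ALLOWED_RELS:
        raise ValueError("unexpected relationship type: %s" % rel_type)
    cypher = (
        "MERGE (m:Movie {name: $film}) "
        "MERGE (p:Person {name: $person}) "
        "MERGE (p)-[:%s]->(m)" % rel_type
    )
    graph.run(cypher, person=person, film=film)


# Usage with a real database (assumes py2neo and the server from section 2):
# from py2neo import Graph
# graph = Graph("http://localhost:11010/", username="admin", password="password")
# merge_movie(graph, film, score)
# link_person(graph, directors, film, "directed")
```

Passing names as parameters instead of concatenating them into the query also avoids syntax errors when a name contains a single quote.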