Traveling tourist, part 1: Import WikiData to Neo4j with the Neosemantics library

Latest recommended article published on 2024-07-19 14:06:16

李_涛

Original link: https://towardsdatascience.com/traveling-tourist-part-1-import-wikidata-to-neo4j-with-neosemantics-library-f80235f40dc5

This blog post shows how to import WikiData data into a Neo4j database with the Neosemantics library, laying the groundwork for tourism data analysis.

Traveling tourist

After a short summer break, I have prepared a new blog series. In this first part, we will construct a knowledge graph of monuments located in Spain. As you might know, I have lately gained a lot of interest in and respect for the wealth of knowledge that is available through the WikiData API. We will continue honing our SPARQL skills and fetch the information about the monuments located in Spain from the WikiData API. I wasn’t aware of this before, but scraping RDF data available online and importing it into Neo4j is such a popular topic that Dr. Jesus Barrasa developed the Neosemantics library to help us with this process.

In the next parts of the series, we will take a look at the pathfinding algorithms available in the Neo4j Graph Data Science library.

Agenda

Install Neosemantics library
Graph model
Construct WikiData SPARQL query
Import RDF Graph
Reverse Geocode with OSM API
Verify Data

Install Neosemantics library

In this blog series, we will use the standard APOC and GDS libraries, which we can install with a single click in the Neo4j Desktop application. On top of that, we will add the Neosemantics library to our stack. It is used to interact with RDF data in the Neo4j environment: we can either import RDF data into Neo4j or export a property graph model in RDF format.

To install the Neosemantics library, we download the latest release and save it to the Neo4j plugins folder. We also need to add the following line to the Neo4j configuration file.

dbms.unmanaged_extension_classes=n10s.endpoint=/rdf

We are now ready to start our Neo4j instance. First, we need to initialize the Neosemantics configuration with the following Cypher procedure.

CALL n10s.graphconfig.init({handleVocabUris: "IGNORE"})

Take a look at the documentation for information about the configuration options.
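
As a quick sanity check, the configuration that is currently in effect can be listed. This is a minimal sketch that is not part of the original walkthrough; it assumes the installed Neosemantics release provides the n10s.graphconfig.show procedure and that n10s.graphconfig.init has already been run.

```cypher
// List the Neosemantics graph configuration currently in effect.
// Assumption: n10s.graphconfig.init was already called on this database.
CALL n10s.graphconfig.show()
YIELD param, value
RETURN param, value
```

With the init call above, the handleVocabUris entry should report the IGNORE setting.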

Graph model

Monuments are in the center of our graph. We have their name and the URL of the image stored as a node property. The monuments have been influenced by various architectural styles, which we indicate as a relationship to an Architecture node. We will save the city and the state of the monument as a two-level hierarchical location tree.

The Neosemantics library requires a unique constraint on the property “uri” of the nodes labeled Resource. We will also add indexes for the State and City nodes. The apoc.schema.assert procedure allows us to define many indexes and unique constraints with a single call.

CALL apoc.schema.assert(
  {State:['id'], City:['id']},
  {Resource:['uri']})
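
To confirm that the schema was created as intended, the indexes and constraints can be listed afterwards. The following is a small sketch rather than a step from the original post; it assumes the APOC procedure apoc.schema.nodes is available in the installed APOC version.

```cypher
// List the indexes and constraints defined on node labels.
// After the apoc.schema.assert call we would expect entries for
// State(id), City(id) and a uniqueness constraint on Resource(uri).
CALL apoc.schema.nodes()
YIELD label, properties, type
RETURN label, properties, type
ORDER BY label
```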

Construct WikiData SPARQL query

For me, the easiest way to construct a new SPARQL query is using the WikiData query editor. It has a lovely autocomplete feature. It also helps with query debugging.

We want to retrieve all the instances of monuments located in Spain. I have found that the easiest way to find various entities on WikiData is by simply using Google search. You can then inspect all the available properties of the entity on the website. The SPARQL query I first came up with looks like this:

SELECT * 
WHERE { ?item wdt:P31 wd:Q4989906 . 
        ?item wdt:P17 wd:Q29 . 
        ?item rdfs:label ?monumentName . 
          filter(lang(?monumentName) = "en") 
        ?item wdt:P625 ?location . 
        ?item wdt:P149 ?architecture . 
        ?architecture rdfs:label ?architectureName . 
          filter(lang(?architectureName) = "en") 
        ?item wdt:P18 ?image }

The first two lines in the WHERE clause define the entities we are looking for:

# The entity is an instance of monument
?item wdt:P31 wd:Q4989906 . 
# The entity is located in Spain
?item wdt:P17 wd:Q29 .

Next, we also determine which properties of the entities we are interested in. In our case, we would like to retrieve the monument’s name, image, location, and architectural style. If we run this query in the query editor, we get the following results.

We have defined the information we would like to retrieve in the WHERE clause of the SPARQL query. We need to massage the data format a bit before we can import it with Neosemantics. The first and most crucial thing is to change the SELECT clause to CONSTRUCT. This way, we will get RDF triples returned instead of a table of information. You can read more about the difference between SELECT and CONSTRUCT in this blog post written by Mark Needham.

With the Neosemantics library, we can preview what our stored graph model would look like with the n10s.rdf.preview.fetch procedure. We will start by inspecting the graph schema of an empty CONSTRUCT statement.

WITH '
CONSTRUCT 
WHERE { ?item wdt:P31 wd:Q4989906 .
        ?item wdt:P17 wd:Q29 .
        ?item rdfs:label ?monumentName .
        ?item wdt:P625 ?location .
        ?item wdt:P149 ?architecture .
        ?architecture rdfs:label ?architectureName .
        ?item wdt:P18 ?image} limit 10 ' AS sparql
CALL n10s.rdf.preview.fetch(
  "https://query.wikidata.org/sparql?query=" +
     apoc.text.urlencode(sparql),"JSON-LD",
  { headerParams: { Accept: "application/ld+json"} ,
    handleVocabUris: "IGNORE"})
YIELD nodes, relationships
RETURN nodes, relationships

Results

One problem is that the nodes have no labels. You may also notice that the relationship types are not very informative, as P149 or P31 do not mean much if you don’t know the WikiData property mapping. Another thing, which is not very obvious from this visualization, is that the URL of the image is stored as a separate node. If you remember the graph model from before, we decided to save the image URL as a property of the monument node.

I won’t go into much detail, but inside the CONSTRUCT clause, we can define what our graph schema should look like in Neo4j. We also specify that we want to save the URL of the image as a property of the monument node instead of a separate node, using the following syntax:

?item wdt:P18 ?image . 
  bind(str(?image) as ?imageAsStr)

We can now preview the new query with the updated CONSTRUCT statement.

WITH ' PREFIX sch: <http://schema.org/> 
CONSTRUCT{ ?item a sch:Monument; 
            sch:name ?monumentName; 
            sch:location ?location; 
            sch:img ?imageAsStr; 
            sch:ARCHITECTURE ?architecture. 
          ?architecture a sch:Architecture; 
           sch:name ?architectureName } 
WHERE { ?item wdt:P31 wd:Q4989906 . 
        ?item wdt:P17 wd:Q29 . 
        ?item rdfs:label ?monumentName . 
          filter(lang(?monumentName) = "en") 
        ?item wdt:P625 ?location . 
        ?item wdt:P149 ?architecture . 
        ?architecture rdfs:label ?architectureName .  
          filter(lang(?architectureName) = "en") 
        ?item wdt:P18 ?image . 
          bind(str(?image) as ?imageAsStr) } limit 100 ' AS sparql
CALL n10s.rdf.preview.fetch(
  "https://query.wikidata.org/sparql?query=" +  
      apoc.text.urlencode(sparql),"JSON-LD", 
    { headerParams: { Accept: "application/ld+json"} ,   
      handleVocabUris: "IGNORE"})
YIELD nodes, relationships 
RETURN nodes, relationships

Results

Import RDF graph

Now we can go ahead and import the graph to Neo4j. Instead of the n10s.rdf.preview.fetch procedure, we use n10s.rdf.import.fetch and keep the rest of the query identical, apart from dropping the limit.

WITH 'PREFIX sch: <http://schema.org/> 
CONSTRUCT{ ?item a sch:Monument; 
            sch:name ?monumentName; 
            sch:location ?location; 
            sch:img ?imageAsStr; 
            sch:ARCHITECTURE ?architecture. 
           ?architecture a sch:Architecture; 
            sch:name ?architectureName } 
WHERE { ?item wdt:P31 wd:Q4989906 . 
        ?item wdt:P17 wd:Q29 . 
        ?item rdfs:label ?monumentName . 
         filter(lang(?monumentName) = "en") 
        ?item wdt:P625 ?location . 
        ?item wdt:P149 ?architecture . 
        ?architecture rdfs:label ?architectureName .
         filter(lang(?architectureName) = "en") 
        ?item wdt:P18 ?image . 
         bind(str(?image) as ?imageAsStr) }' AS sparql 
CALL n10s.rdf.import.fetch(
  "https://query.wikidata.org/sparql?query=" +   
   apoc.text.urlencode(sparql),"JSON-LD", 
   { headerParams: { Accept: "application/ld+json"} , 
     handleVocabUris: "IGNORE"}) 
YIELD terminationStatus, triplesLoaded
RETURN terminationStatus, triplesLoaded

Let’s start with some exploratory graph queries. We will first count the number of monuments in our graph.

MATCH (n:Monument) 
RETURN count(*)

We have imported 1401 monuments into our graph. We will continue by counting the number of monuments grouped by architectural style.

MATCH (n:Architecture) 
RETURN n.name as architecture, 
       size((n)<--()) as count 
ORDER BY count DESC 
LIMIT 5

Results

Romanesque and Gothic are the architectural styles that influenced the most monuments. While exploring WikiData, I noticed that an architectural style can be a subclass of other architectural styles. As an exercise, we will import the architectural hierarchy relationships into our graph. Our query iterates over all architectural styles stored in the graph, fetches any parent architectural style from WikiData, and saves it back to the graph.

MATCH (a:Architecture) 
WITH ' PREFIX sch: <http://schema.org/> 
CONSTRUCT { ?item a sch:Architecture; 
             sch:SUBCLASS_OF ?style. 
            ?style a sch:Architecture; 
             sch:name ?styleName;} 
WHERE { filter (?item = <' + a.uri + '>) 
        ?item wdt:P279 ?style . 
        ?style rdfs:label ?styleName 
         filter(lang(?styleName) = "en") } ' AS sparql 
CALL n10s.rdf.import.fetch(
  "https://query.wikidata.org/sparql?query=" + 
    apoc.text.urlencode(sparql),"JSON-LD", 
  { headerParams: { Accept: "application/ld+json"} , 
    handleVocabUris: "IGNORE"}) 
YIELD terminationStatus, triplesLoaded 
RETURN terminationStatus, triplesLoaded

We can now look at some examples of architectural hierarchy.

MATCH (a:Architecture)-[:SUBCLASS_OF]->(b:Architecture)
RETURN a.name as child_architecture,
       b.name as parent_architecture
LIMIT 5

Results

It seems that modernism is a child category of Art Nouveau, and Art Nouveau is a child category of decorative arts.
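
Chains like this span more than one hop, so a variable-length pattern can list a style together with its ancestors. This is a minimal sketch that is not in the original post; it reuses the SUBCLASS_OF relationship type and name property from the import above, and the upper bound of three hops is an arbitrary assumption.

```cypher
// Follow up to three SUBCLASS_OF hops and return the whole chain of names.
// The hop limit (3) is an illustrative choice, not a value from the post.
MATCH path = (a:Architecture)-[:SUBCLASS_OF*1..3]->(:Architecture)
RETURN a.name AS style,
       [n IN nodes(path) | n.name] AS lineage
LIMIT 5
```

Each returned lineage starts with the child style and ends with the most distant ancestor reached on that path.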

Spatial enrichment

At first, I wanted to include the municipality information of monuments available on WikiData, but as it turned out, this information is relatively sparse. No worries though, I later realized we could use a reverse geocoding API to retrieve this information. APOC has a dedicated procedure available for reverse geocoding. By default, it uses the Open Street Map API, but we can customize it to work with other providers as well. Check the documentation for more information.

First, we have to transform the location information to a spatial point data type.

MATCH (m:Monument) 
WITH m, 
   split(substring(m.location, 6, size(m.location) - 7)," ") as point 
SET m.location_point = point(
  {latitude: toFloat(point[1]), 
   longitude: toFloat(point[0])})
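
The substring arithmetic is easier to follow on a single literal. The sketch below assumes, as the query above implies, that the location is stored as a string of the form Point(longitude latitude); the coordinates are made-up sample values, not data from the graph.

```cypher
// 'Point(' is 6 characters long, so we start at offset 6 and drop
// 7 characters in total (the 6-character prefix plus the closing ')').
WITH 'Point(2.3522 41.8167)' AS location
WITH split(substring(location, 6, size(location) - 7), ' ') AS parts
// parts is ["2.3522", "41.8167"]: longitude first, latitude second
RETURN parts,
       point({latitude: toFloat(parts[1]),
              longitude: toFloat(parts[0])}) AS location_point
```

Because the longitude comes first in the stored string, the indexes are swapped when building the point.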

Check a sample response from the OSM reverse GeoCode API.

MATCH (m:Monument)
WITH m LIMIT 1
CALL apoc.spatial.reverseGeocode(
  m.location_point.latitude,
  m.location_point.longitude)
YIELD data
RETURN data

Results

{   
    "country": "España",   
    "country_code": "es",
    "isolated_dwelling": "La Castanya",
    "historic": "Monument als caiguts en atac Carlista 1874",      
    "road": "Camí de Collformic a Matagalls",   
    "city": "el Brull",
    "municipality": "Osona",
    "county": "Barcelona",
    "postcode": "08559",
    "state": "Catalunya" 
}

The Open Street Map API is a tad interesting as it differentiates between cities, towns, and villages. Also, the monuments located in the Canaries have no state available but are a part of the Canaries archipelago. We will treat the archipelago as a state and lump city, town, and village under a single label, City. For batching purposes, we will use the apoc.periodic.iterate procedure.

CALL apoc.periodic.iterate(
  'MATCH (m:Monument) RETURN m',
  'WITH m
   CALL apoc.spatial.reverseGeocode(
      m.location_point.latitude,m.location_point.longitude)
    YIELD data
   WHERE (data.state IS NOT NULL OR 
          data.archipelago IS NOT NULL)
   MERGE (s:State{id:coalesce(data.state, data.archipelago)})
   MERGE (c:City{id:coalesce(data.city, data.town, 
                             data.village, data.county)})
   MERGE (c)-[:IS_IN]->(s)
   MERGE (m)-[:IS_IN]->(c)',
   {batchSize:10})

This query will take a long time because the default throttle delay setting is five seconds. If you don’t have the time to wait, I have saved the spatial results to GitHub, and you can easily import them with the following query in less than five seconds.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blogs/master/Traveling_tourist/traveling_tourist_cities.csv" as row
MATCH (m:Monument{uri:row.uri})
MERGE (c:City{id:row.city})
MERGE (s:State{id:row.state})
MERGE (m)-[:IS_IN]->(c)
MERGE (c)-[:IS_IN]->(s);

We will first inspect if there are any missing spatial values for monuments.

MATCH (m:Monument) 
RETURN exists ((m)-[:IS_IN]->()) as has_location, 
       count(*) as count

Results

We have retrieved the spatial information for almost all of the monuments. Something you need to be careful about when creating location hierarchy trees is that every node in the tree has only a single outgoing relationship to its parent. If this rule is broken, the structural integrity of the location tree is lost, as some entities will have more than a single location. Check my Location trees post for more information on how to work around this problem.

MATCH (c:City)
WHERE size((c)-[:IS_IN]->()) > 1
RETURN c
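
The same rule applies one level down, at the monument level. As a hedged sketch that is not part of the original post, the analogous check for monuments attached to more than one city could look like this, written in the same Neo4j 4.x style as the query above:

```cypher
// Find monuments that ended up linked to more than one City node.
// An empty result means every monument has at most one parent location.
MATCH (m:Monument)
WHERE size((m)-[:IS_IN]->()) > 1
RETURN m.name AS monument,
       [(m)-[:IS_IN]->(c) | c.id] AS cities
```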

Luckily we don’t run into this problem here. We can now explore the results of our spatial enrichment. We will look at the count of monuments grouped by architectural style located in Catalunya.

MATCH (s:State{id:'Catalunya'})<-[:IS_IN*2..2]-(:Monument)-[:ARCHITECTURE]->(architecture)
RETURN architecture.name as architecture,
       count(*) as count
ORDER BY count DESC
LIMIT 5

Results

Let’s quickly look at the WikiData definition of vernacular architecture for educational purposes.

Vernacular architecture is architecture characterized by the use of local materials and knowledge, usually without the supervision of professional architects.

We can also look at the most frequent architectural style of monuments by state. We will use the subquery syntax introduced in Neo4j 4.0.

MATCH (s:State)
CALL {
  WITH s
  MATCH (s)<-[:IS_IN*2..2]-()-[:ARCHITECTURE]->(a)
  RETURN a.name as architecture, count(*) as count
  ORDER BY count DESC LIMIT 1
}
RETURN s.id as state, architecture, count
ORDER BY count DESC 
LIMIT 5

Results

Conclusion

If you have followed the steps in this post, your graph should look something like the picture above. I have always been impressed by how easy it is to fetch data from various APIs using only Cypher. And if you want to call any custom endpoint, you can still use the apoc.load.json procedure.

In the next part, we will dig into the pathfinding algorithms. In the meantime, try Neo4j and join the Twin4j newsletter.

As always, the code is available on GitHub.