5 Reasons You Should Not Waste Time Researching Knowledge Graphs

Authors: Dr Dongsheng Wang and Dr Hongyin Zhu

Figure 1. Knowledge Graph Visualization Example (From Web)

A Knowledge Graph (KG) organizes knowledge as a graph of triples, i.e., entities (nodes) and relations (edges). KG has been one of the most popular technologies in the AI community and has attracted increasing attention in recent years. However, KG is by no means a new technology; this article details five reasons why KG will never thrive (or at least will not in the next two foreseeable decades), by tracing its past and analyzing its present. 

Starting from the semantic web, proposed more than two decades ago, the core technology has remained the same (or at least similar), although the naming has evolved from semantic technology to knowledge base, linked data, and the current knowledge graph. Despite more than 20 years of massive effort, all of these versions failed to thrive. The latest version, KG, has set itself a smaller goal (a way of organizing knowledge) than its ancestor the semantic web (leading a whole web3.0 revolution). The question is: will KG be able to thrive in the next few decades? To answer it, we trace KG technology thoroughly from past to present. We acknowledge the remarkable recent development of KG driven by knowledge representation learning. However, the crucial barriers of the last two decades are still there, neither eliminated nor even mitigated. 

We will first illustrate the problems that doomed the semantic web and have been inherited without being solved. Then we depict five problems from the KG angle, including but not limited to: entities suffering from granularity and ambiguity, relation splurging to the point of explosion, the complexity hidden behind the triple format, hard access, and knowledge acquisition latency. We also explain why knowledge representation learning will not mitigate these problems either; on the contrary, it might exacerbate them. Therefore, KG will not be capable of producing any breakthrough along this route in the foreseeable future. This is alarming, because we can hardly see any successful, easy-to-use KG applications around us, while some researchers hold unrealistic expectations of it, i.e., that it will achieve cognitive intelligence or advanced intelligence. 

At the end of the article, we also bring some suggestions to the table based on our best R&D knowledge. We believe radical changes are needed to drive us away from whirlpools and navigate us to ground that can genuinely benefit us. 

1 Origin of Knowledge Graph

1.1 Grand proposal for Web3.0 - Semantic Web

The initial grand plan for the Web was to evolve the whole Web era from 2.0 to 3.0. We can summarize web development in three chronological eras: web1.0, web2.0, and the targeted web3.0.

Specifically, web1.0 is a single-direction service mode in which a few editors maintain servers that serve massive numbers of users, e.g., Yahoo, Sohu, etc. Web2.0 is a dual-direction service mode with no boundary between users and editors, e.g., social networks, blogs, Twitter, YouTube, Instagram, etc. With web2.0, the users themselves become the editors, resulting in a vast amount of data. Web3.0 brings the machine into a triangle of relations (machine, users, editors); it is called the machine-readable/understandable web, or Semantic Web.

1.2 One word to understand semantic web - Unifying

The semantic web was a mystery for a long time after it was first proposed two decades ago, because it is often wrapped in terms such as machine-readable, semantics, reasoning, or whatever fancy words you might have seen somewhere. However, the semantic web is so simple that one word can explain everything: 'unifying'. Its only purpose is to unify the data form as much as possible, and the RDF triple is the unified solution for global knowledge representation. Let us see how those fancy terms can all be interpreted as the word 'unifying':

Unifying = formalized, standardized data

Unifying = could be distributed (because of the same triple format)

Unifying = Machine-readable (by designing a resolver to this unified format)

Unifying = re-use & share (again, same data format)

If someone is talking about "standardized, formalized, semantics, machine-readable, reasoning, etc.", you can just follow up: "I gotcha, you are talking about unifying". 

This unified form, RDF, is nothing more than a triple composition with a URI adopted in each string for uniqueness. Why the triple (i.e., subject, predicate, object) format as the standard? The philosophy is that the triple is the minimum unit that can describe world knowledge. In other words, a 2-tuple just cannot, and a 4-tuple can always be reduced to triples. This is why a relational database can always be published as RDF data, as the picture below shows. 
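To make the reduction concrete, here is a minimal Python sketch that flattens an n-tuple (a table row) into triples; the table columns and values are made up for illustration:

```python
# A hypothetical relational row (a 5-tuple) for a film record.
row = {"id": "m101", "title": "The Matrix", "year": 1999,
       "director": "Lana Wachowski"}

def row_to_triples(row, key="id"):
    """Reduce an n-tuple (table row) to RDF-style triples: the key
    column becomes the shared subject, and every other column becomes
    one (subject, predicate, object) triple."""
    subject = row[key]
    return [(subject, pred, obj) for pred, obj in row.items() if pred != key]

triples = row_to_triples(row)
# Every n-tuple reduces to (n - 1) triples sharing one subject.
```

The same trick in reverse (grouping triples back by subject) is exactly the assembly cost RDF consumers pay later, as discussed in section 1.3.2.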

Figure 2. A 5-tuple or relational database to RDF triples

Subsequently, the whole SW community was obsessed with ontology generation (i.e., transforming the existing web of data into RDF data) for many years, as many semantic technology researchers will be aware. 

Machine-readable only works when there is a parser that can recognize the unified RDF format; no data format is born with reasoning capacity. For example, we humans are parsers of natural language because we have learned English, Chinese, Korean, etc. Libraries like Jena and query standards like SPARQL are such parsers, developed so that the unified RDF data form becomes machine-understandable. 
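As a toy illustration of such a parser (a deliberately minimal sketch, nowhere near a full N-Triples parser), a few lines of Python are enough to make the unified `<s> <p> <o> .` shape machine-readable:

```python
import re

def parse_ntriple(line):
    """Minimal parser for one N-Triples-style line. Machine-readability
    comes from the fixed '<s> <p> <o> .' shape, not from the data itself."""
    m = re.match(r'<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*")\s*\.', line)
    if not m:
        raise ValueError("not a valid triple line")
    s, p, o = m.groups()
    return s, p, o.strip('<>"')  # drop URI brackets / literal quotes

s, p, o = parse_ntriple('<urn:m101> <urn:title> "The Matrix" .')
```

This is the whole secret: once the form is unified, one small resolver unlocks every dataset that follows it.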

1.3 Four reasons why SW has faded

More than two decades have passed. The semantic web has somehow faded away, or more precisely, it has not developed into the influential web3.0 era we expected. Why is that? We introduce the four main barriers that have prevented the development of the Semantic Web.

1.3.1 RDF Sharing disability

Academic agents are only obsessed with building their own knowledge bases. These knowledge bases are supposed to be global datasets, but they often suffer from being hard to understand, outdated, unreliable, and substantially diverse. Therefore these RDF databases turn into just another type of local database in one way or another. Linked data has proved to be hard.

Industrial agents are simply realistic and pragmatic. There is still no incentive for them to transform or share their data as RDF widely and effectively. At present, the internet displays the data that each agent wants it to; data delivery in JSON is often more than enough for developers. Even when some private companies and industries are willing to share their data, they do not want to make the additional effort to publish RDF, because it requires some effort but brings little reward. 

1.3.2 Expensive to consume RDF

Consuming RDF data requires massive effort. From an engineering perspective, existing applications usually interact with local data or data from a specific API, mostly transmitted in JSON format. The fields in JSON are structured, accessible, and sufficiently descriptive. When people have to use RDF data, they must first study its schema and then parse it into local memory to assemble the triples into complete instances. This process is necessary because the triples of the same entity can be distributed, and the predicates (or relations) can be splurging and arbitrary. 

When they do not even adopt external RDF effectively, their local RDF database is no better than a simple relational database. Therefore, today, people are still more willing to write a program that parses structured JSON than to learn an RDF schema first and assemble triples in memory to look things up. In short, the whole process of manipulating RDF is usually tedious, time-consuming, and expensive, outweighing the benefit it can bring.
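The "assemble triples in memory" step can be sketched as follows (the subjects and predicates here are illustrative, not from any real dataset):

```python
from collections import defaultdict

# Triples for the same subject may arrive from different lines,
# files, or endpoints.
scattered = [
    ("m101", "title", "The Matrix"),
    ("p7", "name", "Lana Wachowski"),
    ("m101", "year", "1999"),
    ("m101", "director", "p7"),
]

def assemble(triples):
    """Group triples by subject to rebuild complete instances in
    memory -- the extra step that JSON consumers never pay for."""
    instances = defaultdict(dict)
    for s, p, o in triples:
        instances[s][p] = o
    return dict(instances)

instances = assemble(scattered)
```

Only after this pass does the data resemble the ready-made records a JSON API would have handed over directly.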

1.3.3 Ontology definition splurging

The ideal is for everyone to use shared schemes of words to define an RDF dataset, i.e., to unify the ontology schemes as much as possible. However, scheme growth is usually exponential rather than linear, since the schemes come from different groups of people in different places. This is why we have to spend a long time understanding a new RDF dataset from someone we do not know. Even though we have had LOV (Linked Open Vocabularies) for years, we are still struggling to unify the schemes today. We cannot foresee a breakthrough in this field either, given the same or similar effort for decades.

 

Figure 3. LOV(2018)

1.3.4 Expensive to use first-order logics

The propaganda about reasoning capacity by earlier researchers pushed the semantic web into an embarrassing situation. It did attract intensive attention in the early days, but the semantic web never fulfilled its promise of applications with powerful reasoning capacity. Although the scope of today's reasoning has expanded, and many path-reasoning methods using machine learning have appeared, the accuracy of the inference results is far from satisfactory, and we cannot afford to use unreliable results. There are three reasons this direction turned out to be a dead end. 

First, the existing first-order logics did not boom in traditional AI systems, not to mention their reasoning efficiency and effectiveness.

Second, description logics require high-quality, small-scale data because inference rules are costly in human labour; thus, they are infeasible for the large-scale dirty data generated every day. 

Last but not least, other methodologies can be superior to description logics in both efficiency and effectiveness: hidden information can be predicted by machine learning models trained on massive data.

In short, first-order logic is expensive and replaceable. 

2 Five reasons KG will not thrive

Nobody mentions the semantic web anymore, which is good. However, its descendant, KG, has been proposed under a new name to solve NLP and IR problems as a semantic representation framework. KG focuses on organizing information as entities and their relations. This is a smaller target than revolutionizing the whole web. 

Some researchers have become highly optimistic about it because they have adopted knowledge representation learning to promote the gains of KG. Nevertheless, we maintain that KG will not lead to any breakthrough either, since the same problems of the past two decades are still there, and it might, in turn, exacerbate them. To us, this is like a kid who cannot yet crawl but starts trying to run. We detail this next from the angle of KG. 

2.1 Entities suffer from granularity and ambiguity

Granularity and ambiguity are the two critical problems with entities, and they prevent entities from being generalized. First of all, it is tough to define entity granularity. Entities often do not have clear boundaries, like 'Mouse Brain' vs. 'Brain', or 'Apple' vs. 'Green Apple' (e.g., "I bought some green apples from the market."). These entities can exist independently in different scenarios, although one might include the other. Which entities to create and define is a case-by-case problem. Entity granularity often has to be adapted to different domains or scenarios, which makes it hard to normalize for general usage. Ambiguity is another well-known challenge. For example, 'Green Apple' can be the fruit, a song name, a book title, a brand name, or even a chapter of an article. Do we really need these fine-grained entities for disambiguation all the time? 

Inspired by word embedding technology, knowledge embedding has been a popular research topic in recent years. Knowledge embedding models (such as TransE and TransR) express entities and relations as low-dimensional distributed semantics, as the example in Figure 4 shows. The vector representation has led to improvements on tasks like link prediction (predicting the entity that holds a specific relation with another given entity). However, this does not mitigate the granularity and ambiguity problems at all. Though quite a few KG embeddings have been built in academia, they have not been widely used in practice. Moreover, the accuracy of knowledge embedding is still questionable, and the distributed representation is not interpretable as a symbolic representation. 
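For readers unfamiliar with TransE, its core scoring idea fits in a few lines. This sketch uses toy random embeddings (a real model learns them from the whole KG), so the score is only meaningful relative to the scores of other candidate triples:

```python
import math
import random

random.seed(0)
dim = 50
# Toy embeddings for illustration; TransE learns these by training.
emb = {name: [random.gauss(0, 1) for _ in range(dim)]
       for name in ("Beijing", "China", "capital_of")}

def transe_score(h, r, t):
    """TransE models a true triple (h, r, t) as a translation
    h + r ≈ t, so a lower ||h + r - t|| means a more plausible triple."""
    return math.sqrt(sum((hv + rv - tv) ** 2
                         for hv, rv, tv in zip(emb[h], emb[r], emb[t])))

score = transe_score("Beijing", "capital_of", "China")
```

Link prediction then amounts to ranking all candidate tails by this score; note that nothing in the geometry resolves which 'Green Apple' an entity vector actually denotes.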

 

Figure 4. knowledge graph embeddings (from web)

In short, entities are hard to generalize or align for universal usage. This was not solved in the earlier semantic web era either, e.g., ontology definition and alignment (for those familiar with semantic web technologies). Plus, we cannot handle the ambiguity that arises when we use multiple KGs.

2.2 Relation splurging to explosion 

If you look at Figure 1 again, you will find it easy to read because the relations are mostly multi-word phrases. However, the relations can become exponentially complex, derived right from the extremely simple triple format. A knowledge graph often has to use these compound phrases to express the natural relations between entities. As a result, facing hundreds of relations like song_of, written_by, created_by, is_author_of, build_time, caused_by, has_been_to, has_visited, etc., we humans would be confused by the ambiguity, let alone machines. This is why you are sometimes expected to define a domain and range if you expect certain relations to hold between specific classes. So consumers have to figure out the concept and relation definitions of the KG first. However, as relations from different sources accumulate, they become increasingly ambiguous with respect to each other, e.g., belongs_to, part_of, included_by, etc. As a result, they are hard to disambiguate for either humans or machines.

In recent years, researchers have been obsessed with knowledge representation learning (sometimes combined with knowledge graph embedding) using deep neural models on NLP corpora. They project entities and relations into a low-dimensional (embedding) space to assist KG tasks like knowledge graph completion (KGC), entity recognition, relation extraction, etc. It is worth mentioning that pre-trained models (e.g., BERT) have recently improved the performance of these knowledge acquisition tasks remarkably. However, this will eventually exaggerate the ambiguity, because the massive numbers of learned entities and/or relations will lead to more diverse granularity and complexity. In addition, managing the newly acquired triples is itself challenging work. 

In short, KG relations are splurging. With heterogeneous KGs, relation splurging can develop into relation explosion, making the graphs extremely hard to consume. Before solving that, the obsession with knowledge representation learning is like a kid trying to fly before he can walk. 

2.3 Simple triple format leads to higher complexity

The triple format has simplified data sharing as a minimum unit for representing world knowledge. However, KG creators, in turn, have to define complex relations with compound phrases for accurate description, as discussed in section 2.2. 

The triple knowledge form eventually brings even more complexity than we imagined. When we ask who the United States president is, we can retrieve the triple (X, president, United States). This actually hides a default: we are asking about the current president. When the question is slightly different, like "Who was the United States president in 1989?", we have to use a query like (X, is_the_40th_president, United States) or (X, president_period, from 1981 to 1989). This immediately increases the complexity of retrieval, and, as you can imagine, this triple knowledge itself becomes an ambiguity for the previous simple question. In reality, knowledge containing constraints that involve multiple relations is more common than single-relation knowledge. As the figure below shows, this is more than trivial for both those who design it and those who consume it. 

Figure 5. from web

We should not blame developers who sometimes ask whether it would be more straightforward to design a relational database table. With a relational database table called 'president', you can easily define the properties that belong only to this table, like name, x_th, period, spouse, spouse_period, etc. This concept-oriented design is, in general, more accessible and human-understandable than an RDF-oriented design. 
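The concept-oriented alternative can be sketched with an in-memory SQLite table (a minimal sketch; the schema and values are illustrative):

```python
import sqlite3

# One 'president' table whose columns replace the compound relations
# (x_th, president_period, ...) that the triple form would need.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE president (
    name TEXT, x_th INTEGER, period_start INTEGER, period_end INTEGER)""")
con.execute("INSERT INTO president VALUES ('Ronald Reagan', 40, 1981, 1989)")

# 'Who was the US president in 1989?' becomes a plain range query,
# with no ambiguity about which relation encodes the time constraint:
row = con.execute(
    "SELECT name FROM president WHERE 1989 BETWEEN period_start AND period_end"
).fetchone()
```

Both the temporal and the ordinal constraints live as ordinary columns of one concept, instead of being scattered across separately named relations.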

Having confused us humans, knowledge with multiple relations is also ambiguous to machines. Even with state-of-the-art knowledge representation learning models, the F1 performance is generally lower than 50% for complex question answering over KGs, which is almost meaningless for industrial applications. And when there are errors in the KG, the correction process is usually far more tedious than in a relational database.

In short, knowledge is often diverse and conditional rather than simple and deterministic. When we use the simplified triple format to express complex knowledge, we end up with higher complexity. 

2.4 Hard to access Knowledge Graph as a database

Consuming triple knowledge requires a deep understanding of the ontology schemes and of RDF querying from whoever wants access. People inevitably have to parse and load all the triples into memory and manipulate them with a program (like the Jena library), a SPARQL query, or sometimes both, because the triples can be located in different lines, files, or databases. We know this mechanism is the advantage of the triple format that enables distributed storage, but on the other side it sets a barrier to looking things up quickly. 

Moreover, the triples are usually not readable, due to URIs and distribution. What we see first are triples like (<uri:28809> <uri:creator> <uri:201339>); then we have to figure out the readable relations like 'rdfs:label', 'skos:prefLabel', or 'dbpedia:name' in order to find triples like (<uri:28809> <rdfs:label> "The Matrix"). This process always forces consumers to spend quite a bit of time figuring out which relations an entity might have. In addition, SPARQL queries are more complicated than SQL because of the diverse URIs and complex query grammar. 
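The label-resolution hop described above might look like this in code (a minimal sketch reusing the URIs from the example; the candidate label predicates are the usual suspects, not an exhaustive list):

```python
triples = [
    ("uri:28809", "uri:creator", "uri:201339"),
    ("uri:28809", "rdfs:label", "The Matrix"),
    ("uri:201339", "rdfs:label", "Lana Wachowski"),
]

LABEL_PREDICATES = ("rdfs:label", "skos:prefLabel", "dbpedia:name")

def label(node, triples):
    """Find a human-readable name for a URI node, falling back to the
    raw URI -- the extra hop every RDF consumer has to implement."""
    for s, p, o in triples:
        if s == node and p in LABEL_PREDICATES:
            return o
    return node
```

Every opaque node in a result set needs this second lookup before a human can read it, which is pure overhead compared with a JSON field that is readable by construction.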

In short, the URI mechanism and SPARQL have been barriers that stop many developers from accessing KGs, and relation splurging adds more complexity than the convenience they bring to KG consumption. 

2.5 Unreliable quality and generation latency

The constructed RDF usually suffers from quality and latency problems. First, no technology so far can guarantee the accuracy of RDF triples.

Even if the triples are accurate, triple knowledge acquisition is always slower than the emergence of new data. In other words, RDF generation is a linear effort on our side, while new information grows exponentially. For example, when Elon Musk tweets a message, it is commented on and retweeted tens of thousands of times within a few minutes. This is why end-to-end deep learning models are more and more popular: they avoid extracting the information as structured data and instead consume the raw data directly to predict the ultimate target.

In short, we do not want a structured dataset that is not reliable and is of high latency in general. 

3 Why people still insist on it

We have reviewed the development from the origin of KG, the semantic web, to its current version and purpose. However, the main problems are still there, neither genuinely solved nor even mitigated. We have discussed why knowledge representation learning will not mitigate those problems either; on the contrary, it might exaggerate them. For example, the fact that performance on complex question answering can hardly exceed 0.5 (F1), even with state-of-the-art models, indicates that we should not be so optimistic. Since many researchers might already be aware of this, why are many still obsessed with it? If you review the five reasons above, you will struggle to understand why some top researchers claim KG could promote us to 'cognitive intelligence' or 'advanced intelligence'. 

We try to explain it with the following four reasons: 

1. KG is the only plausible technology that is principally close to the human way of thinking;

2. There are no other alternative technologies to sustain the hope of the next breakthrough, especially after the breakthrough of deep learning from the connectionism community;

3. We have to admit that there is a compromise between symbolism researchers and connectionism researchers;

4. Technology media are only responsible for propaganda, not for verifying claims or being held accountable; researchers do not have to pay for bragging. 

4 Our Suggestions 

4.1 Emphasizing knowledge rather than the form or graph

Natural language and RDF are just different forms of knowledge. We need to shift the emphasis from the form to the knowledge itself. Natural language understanding and generation have greatly surpassed the development of KG. Meanwhile, graph neural networks and graph algorithms themselves have demonstrated their value on many tasks; therefore, we can explore making the knowledge graph directly compatible with natural language understanding. It is worth mentioning that some recent researchers claim relational knowledge is already present in a wide range of state-of-the-art pre-trained language models (see the paper "Language Models as Knowledge Bases?", though we have some reservations about this view). While we have explained that the general KG is, to a large extent, a dead end, the token-level semantics from BERT representations can be an effective ingredient for constructing domain entities for specific usage.

Technically, we can consider picking up an n-tuple knowledge form. A trade-off between the simplified triple format and the informative n-tuple structure would take us somewhere more comfortable. Neo4j (a graph database) goes somewhat in this direction, but it was not born for the knowledge graph and has its own complexity. The new standard would require researchers and developers to define the schema and n-tuple knowledge again, so that humans and machines can understand and access the knowledge easily. 

4.2 Ease the KG technology as much as possible

URIs can be replaced with cheaper alternatives. We maintain that even if global uniqueness is a vital requirement, some alternatives are cheaper and more realistic than the URI mechanism. For example, an extremely cheap one is the universally unique identifier (UUID), which many engineers are already familiar with. Specifically, we can use UUIDs to identify entities both locally and globally; when entities from different datasets are merged, they remain globally unique. Therefore, we can lightly remove the URI from entities and predicates. 
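A minimal sketch of the UUID idea (the entity labels are illustrative):

```python
import uuid

# Each team mints identifiers independently; uuid4 values are globally
# unique without any central URI naming authority.
entity_a = uuid.uuid4()  # minted by dataset A
entity_b = uuid.uuid4()  # minted by dataset B

# Merging the two datasets needs no renaming or namespace negotiation:
merged = {entity_a: {"label": "Green Apple (fruit)"},
          entity_b: {"label": "Green Apple (album)"}}
```

The identifier stays opaque, just like a URI, but it costs nothing to mint and requires no hosting or dereferencing infrastructure.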

If an n-tuple knowledge form is created, we should make the knowledge pieces accessible both by manual check-up and by query languages like SPARQL (but a more lightweight version). We maintain that confidence and label should be two of the mandatory relations of n-tuple knowledge. For example: 
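A minimal sketch of what such an n-tuple record could look like (this is our illustrative proposal, not an existing standard; all field names are assumptions):

```python
# A hypothetical n-tuple knowledge record. 'label' and 'confidence'
# are mandatory fields; the remaining slots carry the constraints that
# plain triples would force into compound relations.
fact = {
    "label": "Ronald Reagan was the 40th US president (1981-1989)",
    "confidence": 0.98,
    "subject": "Ronald Reagan",
    "relation": "president_of",
    "object": "United States",
    "ordinal": 40,
    "period": (1981, 1989),
}

def in_office(fact, year):
    """Answer a time-constrained question directly from one n-tuple,
    with no multi-triple join required."""
    start, end = fact["period"]
    return start <= year <= end
```

The label makes the record human-checkable at a glance, and the confidence lets consumers filter unreliable knowledge instead of trusting every triple equally.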

Final words

Finally, it is worth bearing in mind that we are not saying KG is absolutely meaningless; what we are trying to highlight is that the development of KG will fall far below expectations, as the semantic web already did. This article expresses our thoughts and is for reference only. We welcome any comments or disagreements. 

Authors introduction: 

The two authors have 7 years of frontier R&D experience in KG and NLP. Dongsheng Wang received his PhD from the University of Copenhagen, and Hongyin Zhu received his PhD from the University of Chinese Academy of Sciences. Dongsheng Wang (dswang2011@gmail.com) currently works in industry on conversational AI, and Hongyin Zhu (hongyin_zhu@163.com) is doing his postdoc at Tsinghua University.
