1. a single graph representation derived from a given dataset
2. multiple graph models representing a given dataset.
The classical approach for detecting outliers in a dataset is to model the data as a single graph and to apply a single outlier detection method, as sketched in Figure 1(a).
By using this approach, the identification of outliers is biased by the given model and the selected algorithm.
Alternatively, one could use an ensemble approach to apply a set of complementary outlier detection methods on a single graph and combine their results, such that the algorithm bias is reduced. This approach is sketched in Figure 1(b).
Existing work on outlier detection in graphs follows the methodologies in Figures 1(a) and 1(b). As a consequence, the built-in bias from the graph model selection is not addressed.
Here we propose a new methodology that reduces the graph model bias in outlier detection by generating multiple graph models to represent the same data.
The overall workflow for an ensemble method combining outlier detection results from multiple graphs is depicted in Figure 1(c).
First, multiple graph models represent the same dataset, possibly taking different aspects of the dataset into account for deriving different graph models. We assume, though, that the nodes in different graphs represent the same entities. Only their relations change from model to model.
Next, an algorithm to detect (node) outliers in graphs is applied to each graph model.
In the last step, results from the outlier detection on the different graph representations are combined.
By building an ensemble of different graphs that model the same data, we can expect increased precision and robustness of the outlier detection.
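The three steps of this workflow can be illustrated with a minimal sketch. It is not the implementation evaluated in this study: the degree-based detector, the averaging combination, the function names `degree_outlier_scores` and `ensemble_scores`, and the two toy graphs are all assumptions chosen to keep the example short. Graphs are stored as adjacency dictionaries over the same node set, reflecting the assumption that only the relations change from model to model.

```python
from statistics import mean


def degree_outlier_scores(graph):
    """Stand-in detector (an assumption, not the study's algorithm):
    weakly connected nodes receive higher scores in [0, 1]."""
    max_degree = max((len(nbrs) for nbrs in graph.values()), default=0)
    if max_degree == 0:
        return {node: 0.0 for node in graph}
    return {node: 1.0 - len(nbrs) / max_degree for node, nbrs in graph.items()}


def ensemble_scores(graphs, detector=degree_outlier_scores, combine=mean):
    """Apply `detector` to every graph model and combine the scores per node."""
    node_sets = [set(g) for g in graphs]
    if any(nodes != node_sets[0] for nodes in node_sets):
        raise ValueError("all graph models must share the same node set")
    per_graph = [detector(g) for g in graphs]  # step 2: detect on each graph
    return {n: combine([scores[n] for scores in per_graph])  # step 3: combine
            for n in per_graph[0]}


# Step 1: two hypothetical graph models over the same entities a..d.
friendship = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
coauthor = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

combined = ensemble_scores([friendship, coauthor])
ranking = sorted(combined, key=combined.get, reverse=True)
```

In this toy example, node `d` is isolated in one model and only weakly connected in the other, so it ends up at the top of `ranking`. Any node-level detector that returns one score per node could be passed as `detector` instead.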
Conclusion
Outlier detection is a subjective and unsupervised task that demands good knowledge and understanding of the data.
Using a single graph model of relation-rich datasets may only model some aspects of the data, thus not making proper use of potential information.
Using multiple graph models may capture more and complementary information.
Based on our findings, we therefore suggest exploring real-world data using multiple graph models that are as complementary as possible.
In a practical application, a data analyst is interested in certain entities that lend themselves to being modeled as the nodes of a graph, while several attributes or inter-relational connections may be represented as edges between nodes. Instead of looking for the one and only, best-ever graph representation of some given raw data, the data analyst should therefore generate multiple graph models describing different aspects of the raw data, capturing a large variety of characteristics, or putting different emphasis on certain characteristics. That is, the graphs may differ both quantitatively (how dense they are) and qualitatively (which relationships are expressed in the graph structure).
These multiple graph models aim to materialize the various perspectives that the analyst wants to highlight, that is, they should cover the problem scenario as well as possible and in as many different ways as suitable.
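As an illustration of how such models might be derived, the following sketch builds several k-nearest-neighbour graphs from the same records. The construction is an assumption made for this example, not the procedure used in our evaluation; the helper `knn_graph`, the attribute names, and the toy records are hypothetical. Varying `k` changes the graphs quantitatively (density), while varying the attribute subset changes them qualitatively (which relationships are expressed).

```python
import math


def knn_graph(records, features, k):
    """Undirected k-nearest-neighbour graph over `records`,
    measuring Euclidean distance on the given `features` only."""
    def dist(u, v):
        return math.dist([records[u][f] for f in features],
                         [records[v][f] for f in features])

    graph = {node: set() for node in records}
    for u in records:
        # Break distance ties by node name so the result is deterministic.
        others = sorted((v for v in records if v != u),
                        key=lambda v: (dist(u, v), v))
        for v in others[:k]:
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Hypothetical raw data: the same four entities in every graph model.
records = {
    "p1": {"age": 25, "income": 30, "visits": 2},
    "p2": {"age": 27, "income": 32, "visits": 3},
    "p3": {"age": 26, "income": 90, "visits": 2},
    "p4": {"age": 60, "income": 31, "visits": 40},
}

graphs = {
    "age_k1": knn_graph(records, ["age"], k=1),              # sparse
    "age_k2": knn_graph(records, ["age"], k=2),              # denser, same aspect
    "income_visits_k1": knn_graph(records, ["income", "visits"], k=1),  # other aspect
}
```

All three graphs share the node set `p1` to `p4`, so the outlier scores computed on them refer to the same entities and can be combined per node.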
Clearly, many questions remain open. In this study, we focused purely on the impact of using multiple graph models for a given dataset.
We evaluated this impact using two different outlier detection algorithms, four combination functions, and two similarity measures on synthetic and real-world data.
For a practical application, various aspects will strongly influence the achievable quality, for example the algorithm used to detect outliers on the individual graphs and the method used to combine the individual results (as we have seen in this evaluation).
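To make the influence of the combination step concrete, the sketch below applies several common combination functions to the same per-graph scores. The functions shown (mean, maximum, minimum, median) and the min-max normalization are assumptions chosen for illustration and are not necessarily the four combination functions used in our evaluation; the names `min_max_normalize`, `COMBINERS`, and `combine`, as well as the score values, are hypothetical.

```python
from statistics import mean, median


def min_max_normalize(scores):
    """Rescale one graph's outlier scores to [0, 1] so graphs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {n: 0.0 for n in scores}
    return {n: (s - lo) / (hi - lo) for n, s in scores.items()}


# Illustrative combination functions (an assumption, see the text above).
COMBINERS = {"mean": mean, "max": max, "min": min, "median": median}


def combine(score_lists, how):
    """Combine per-graph scores node by node with the chosen function."""
    normalized = [min_max_normalize(s) for s in score_lists]
    fn = COMBINERS[how]
    return {n: fn([s[n] for s in normalized]) for n in normalized[0]}


# Hypothetical scores for nodes a, b, c on three graph models.
scores_g1 = {"a": 0.1, "b": 0.2, "c": 0.9}
scores_g2 = {"a": 5.0, "b": 1.0, "c": 3.0}
scores_g3 = {"a": 10, "b": 20, "c": 30}
all_scores = [scores_g1, scores_g2, scores_g3]

results = {how: combine(all_scores, how) for how in COMBINERS}
```

With these toy scores, the maximum rates nodes `a` and `c` equally outlying, whereas the mean, minimum, and median each rank `c` clearly above `a`. The choice of combination function can therefore change which nodes are reported.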
However, based on our study, we maintain the recommendation to consider several different graph representations in any case.