Not Exactly Sure How to Find What You Are Looking For? Try Graph Analysis


You have probably heard it many times already, but the amount of data in the world is growing at an incredible rate. This is in large part because data storage has become so cheap that where 20+ years ago you would keep only certain (golden) records, you now store everything “just in case”. (This reluctance to throw anything out has in fact come back to haunt many banks as they now scramble to dig out that one source of the truth; cue heavenly light and angel choir.) However, it also has a lot to do with the interconnectedness between appliances, devices, cars, and the list goes on. It is a connectedness that will only increase in the coming years. It is obvious that this new digital paradigm calls for a new way of storing and working with data, one which naturally caters for the connectivity.

[…] you are looking for something in a big (and growing) haystack of data, you just don’t know precisely what

Before you get too carried away with that philosophical gaze, staring into the future, I would like to stop you and instead have you imagine that you have recently joined an apparently progressive fraud-detection and prevention unit in a bank or insurance company (this is your imagination, so feel free to pick another company, government entity, etc. where fraud can happen, which would be practically everywhere). Your first task is to find anomalies or suspicious combinations in a BIG dataset. The only problem is that what constitutes anomalous and suspicious is not well defined, and in many cases not known. Not to mention that this will change over time: fraudsters tend to be quite agile, or maybe we have just never really applied the right tools to find them. In other words, your boss tells you, you are looking for something in a big (and growing) haystack of data, you just don’t know precisely what. Plus, there are plenty of other stacks around which you may need to add to the mix.

Being the awesome data scientist that you are, you think “no matter, I’ll just run through the motions of exploratory data analysis (EDA), perhaps whip out a few of my favorite clustering algorithms to enrich the data and attach outlier scores, perform some dimensionality reduction tricks and cook up a few nice vizes and wham-bam, we’ll not only have that haystack sorted, but we will also be able to spit out family albums of the data where it will be clear who’s the black sheep”. No surprise really, that you are feeling pretty good about yourself, as you take a sip of coffee and pop open Databricks, Alteryx, a Jupyter notebook or whatever your weapon of choice is for this sort of work, only to realize that nearly all the data fields are pieces of information like account numbers, free text fields, transaction IDs, receiving account IDs, and personal information on the account holders. Curious about the ensuing swearing, your newfound office-bff leans over to ask what’s going on, only to learn that there will be very little wham and even less bam on account of such data having no obvious ordering and as such not lending itself too well to the analysis approach you had been daydreaming about just before you went for coffee.

[…] you end up taking a nice face-in-the-key-board-napppppppppp…

After a few deep breaths you get on your not so merry way and look at frequencies of entries, apply k-modes clustering, etc. You even learn (as I recently did) that there is something called multiple correspondence analysis, which is the PCA of the categorical world (apparently). However, while this does get you some of the way, and churns out some outliers and a few suspicious cases, it turns out that these are the not so agile criminals. Having worked really hard on this for at least 45 mins, you end up taking a nice face-in-the-key-board-napppppppppp…. You dream yourself away to a quiet vacation with your friends, where you visit museums, try the local cuisine and go to the trebuchet range, all the while taking turns to pay and then transferring money to each other afterwards. As you are woken by your office-bff laughing at cat videos you suddenly have an epiphany: what if you could easily see who was transferring money to whom? And what if, on top of that, you could add information such as whether a person was a company owner, who else was part of the ownership of that company, and whether they were customers as well? You realize that what you need is a data model that is more flexible than the classical relational database: one which permits easy integration of new data, but also one that stores information not only about the single customer or employee (or whatever your object of scrutiny is), but also about how they are connected. What you are looking for is a graph database!

What is in a graph?

But what is a graph? In mathematical terms, a graph, G, consists of a set of vertices, V, and a set of edges, E, that connect these vertices. It is as simple as that! But just to unfold this a little more, we need to return to the dream for a second (you were probably dozing off anyway at this point). You and your friends each make for a so-called node or vertex (think of a dot), and together you form a set of nodes or vertices; i.e., we now have the V. Each node can contain information about number of accounts, credit cards, mortgage types, use of mobile banking (y/n) and so on, but also any other type of information that could be relevant, such as age, education and marital status. This is the type of information that would usually be present in a row (your row, or rows) in a relational database. The transfers between you make for connections or edges: think of lines connecting your dot to your friends’ dots.
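
As a minimal sketch of the definition above, the snippet below stores V as a dictionary of node properties and E as a set of transfer edges. The people, property names and values are all made up for illustration; a real graph database would hold far richer records.

```python
# Hypothetical people and attributes, invented for illustration only.
V = {
    "you":   {"accounts": 2, "mobile_bank": True},
    "alice": {"accounts": 1, "mobile_bank": True},
    "bob":   {"accounts": 3, "mobile_bank": False},
}

# E: each edge is a (sender, receiver) pair, i.e. one line between two dots.
E = {("you", "alice"), ("alice", "bob"), ("bob", "you")}


def neighbours(node, edges):
    """Return every node that shares an edge with `node`, ignoring direction."""
    linked = set()
    for sender, receiver in edges:
        if sender == node:
            linked.add(receiver)
        elif receiver == node:
            linked.add(sender)
    return linked


# Every edge must connect two known vertices for G = (V, E) to be well formed.
assert all(s in V and r in V for s, r in E)
print(sorted(neighbours("you", E)))
```

Keeping the node properties in V and the relationships in E is the whole trick: the properties are what a relational row would hold, while E is the part a row-by-row view tends to hide.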

[…] with regular databases, you are moving in one to two dimensions […] but you only get the full picture when you step out of the plane and see all the connections

The set of these edges comprises the E in our graph. Mentally filling in all the lines between each of the nodes, i.e. using the elements of E to connect the elements of V, your group constitutes a little mesh, which is the graph, G. However, your friends transfer money to other friends. So, your little bundle or cluster is actually part of a bigger graph. In addition, we could add other types of connections, such as paying for a membership at the same gym. This connection you will most likely only share with a subset of your friends, so this connection type defines another clustering dimension. Other connection types could be paying at/to the same shops, charities, concerts or even getting on the bus at the same stop if we want to go all big brother. How much and how often you transfer to one of these other nodes, or how many connection types you share, could be used as a proxy for how close you are to this node.
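
To make the closeness proxy a bit more concrete, here is a small sketch that summarises each pair of nodes by the connection types they share and by how much and how often they transfer to each other. The edge list, the connection type names (`transfer`, `same_gym`) and the amounts are hypothetical, and how to weigh these numbers into one score is left open on purpose.

```python
from collections import defaultdict

# Hypothetical typed edges: (node_a, node_b, connection_type, amount).
# Amount is only meaningful for transfers; other types use 0.
edges = [
    ("you", "alice", "transfer", 120.0),
    ("you", "alice", "transfer", 30.0),
    ("you", "alice", "same_gym", 0.0),
    ("you", "bob", "transfer", 15.0),
]


def closeness(edges):
    """Summarise each undirected pair by shared connection types and transfers."""
    summary = defaultdict(lambda: {"types": set(), "transfers": 0, "total": 0.0})
    for a, b, kind, amount in edges:
        pair = tuple(sorted((a, b)))  # treat (a, b) and (b, a) as the same pair
        summary[pair]["types"].add(kind)
        if kind == "transfer":
            summary[pair]["transfers"] += 1
            summary[pair]["total"] += amount
    return dict(summary)


scores = closeness(edges)
for pair, info in sorted(scores.items()):
    print(pair, len(info["types"]), info["transfers"], info["total"])
```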

Why won’t anyone think about the criminals!?

Now think about the criminals in such a graph: they will have connections that appear similar to yours, and maybe some will even be (in)directly linked to you. Big whoop! But you could in principle go further and scrape information off the internet about who they know on Facebook or LinkedIn, which in many cases, however, would not be legal. So maybe just stick with adding readily available intel: how many phone numbers they have, with whom they share an address or bank account, whether one of them has an important position in politics or is related to someone who does, whether the others are owners of companies suspected of dubious activity, and whether they appear in searches on news media. Then interesting patterns start to emerge, patterns that can be uncovered using methods from graph theory. And what is really cool about graphs is that they allow you to visualize these relationships, like in the following figure.
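
One simple pattern of this kind can be sketched in a few lines: link customers whenever they share a phone number or an address, then collect the connected groups with a breadth-first search. The customer records below are invented, and using connected components is only one illustrative choice among the many graph methods one could apply here.

```python
from collections import defaultdict, deque

# Hypothetical customer records, invented for illustration.
customers = {
    "c1": {"phone": "555-0101", "address": "1 Main St"},
    "c2": {"phone": "555-0102", "address": "1 Main St"},
    "c3": {"phone": "555-0102", "address": "9 Oak Ave"},
    "c4": {"phone": "555-0199", "address": "4 Elm Rd"},
}


def linked_groups(customers):
    """Group customers that are connected through any shared attribute value."""
    by_value = defaultdict(set)
    for cid, attrs in customers.items():
        for field, value in attrs.items():
            by_value[(field, value)].add(cid)

    # Two customers are adjacent when they share at least one value.
    adjacent = defaultdict(set)
    for members in by_value.values():
        for cid in members:
            adjacent[cid] |= members - {cid}

    groups, seen = [], set()
    for start in customers:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        while queue:  # breadth-first search over the shared-attribute edges
            cid = queue.popleft()
            if cid in group:
                continue
            group.add(cid)
            queue.extend(adjacent[cid] - group)
        seen |= group
        groups.append(group)
    return groups


print(linked_groups(customers))
```

Note that c1 and c3 share nothing directly; they only end up in the same group because both are linked to c2. That indirect link is exactly what a row-by-row look at the records would miss.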

[Figure: a graph visualization of the relationships described above]

I sometimes think about it as though, with regular databases, you are moving in one to two dimensions when you are performing classical EDA (of course the data can have however many dimensions you want). You can perform analysis on aggregations of node information, e.g. calculate distributions for bank account information and subdivide this based on age, income, etc. You may even be able to run along the edges and build a local road network in your head by connecting receiver/sender account IDs, but you only get the full picture when you step out of the plane and see all the connections.
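
The "running along the edges" idea can be sketched by turning sender/receiver rows into an adjacency map and asking which accounts are reachable within a few hops. The rows, column names and account IDs below are hypothetical stand-ins for a relational transactions table.

```python
from collections import defaultdict

# Hypothetical rows as they might sit in a relational transactions table.
rows = [
    {"tx_id": 1, "sender": "A", "receiver": "B"},
    {"tx_id": 2, "sender": "B", "receiver": "C"},
    {"tx_id": 3, "sender": "C", "receiver": "D"},
    {"tx_id": 4, "sender": "E", "receiver": "A"},
]


def reachable(rows, start, max_hops):
    """Accounts that money from `start` can reach within `max_hops` transfers."""
    outgoing = defaultdict(set)
    for row in rows:
        outgoing[row["sender"]].add(row["receiver"])

    frontier, found = {start}, set()
    for _ in range(max_hops):
        # Step one hop outward, skipping accounts we have already visited.
        frontier = {nxt for acc in frontier for nxt in outgoing[acc]} - found - {start}
        if not frontier:
            break
        found |= frontier
    return found


print(sorted(reachable(rows, "A", 2)))
```

In a relational database each extra hop typically means another self-join on the transactions table, which is part of why following long chains by hand gets tedious quickly.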

Laying the Data Fabric for Connecting the Dots

If, at this point, you find yourself intrigued by the promises of graph analysis, you will probably be thinking about how to approach the immensely daunting task of bringing all your company’s disparate data together in such a way as to start the graph journey.

if you can achieve this, I promise that what you will be hearing is the chief data officer, data managers, analysts and scientists clapping profusely

All I can say is that I appreciate where that nervous look is coming from! What you would be looking to do is integrate and clean data from all manner of (legacy) systems with different naming conventions, date specifications, duplicates and other inconsistencies, which is not an easy feat. On top of that, it is desirable, if not necessary, to be able to trace data back to its source (data lineage) so you know how it was processed before you see it, to ensure access is limited to the right people, and to make it easy to find the data one is looking for (data cataloging). All in all, a pretty tall order in most companies today. However, if you can achieve this, I promise that what you will be hearing is the chief data officer, data managers, analysts and scientists clapping profusely, soon to be joined by the chief risk and data protection officers as they realize how this will enable them to do extended KYC (know your customer) and guarantee GDPR compliance.

Before all this enthusiasm grows into a parade down the imaginary company corridors with flotillas, conga lines and all, you are probably wondering if this will remain a pipe dream or if there is a way this can be achieved. Luckily (or naturally), there are companies out there looking to make your life easier (and of course make some money at the same time), by semi-automating exactly this type of project. Some of those who promise to bring you this data-nirvana are household names such as SAP, IBM, Microsoft, Oracle and Informatica. However, seeing as our focus here is on graphs, I want to call attention to a relative newcomer, namely CluedIn. CluedIn’s platform — which won the 2020 Cool Data Vendor Award from Gartner — has as a key component the graph database Neo4j. As I mentioned before, graph databases, compared to relational databases, have a very flexible data model or schema. CluedIn utilize this to weave all those different data sources into a nice data fabric. As a bonus, the flexible schema makes addition of new systems or external data sources much less of a headache than with relational databases.

No free lunch

Of course, this does not happen by itself, and to make it all possible, CluedIn — like some of the other players mentioned above — have prebuilt connectors for a bunch of common (legacy) systems. And according to CluedIn, it is then a matter of plugging your different data sources into their platform (using the connectors), after which a set of proprietary algorithms will crawl the data, make connections across systems, i.e. match data points, and perform the data integration and cleaning automatically. When the dust has settled, you should end up with a one-stop shop for your data where you can, among other things, see data quality KPIs (and how to improve them) and search all your company’s data; e.g., find all the places where a given customer’s information is stored. All of this is, by and large, made possible due to the flexibility that comes from using a graph.

When does graph analysis make sense?

We started out this story with a case where the data did not have too much of a natural ordering. Does this mean that graphs are only suited for such data? No, that was just because I got the idea for this short text when working on a problem with such data. But hopefully it is clear that graph analysis is futile if there are no connections between the nodes. This, however, will rarely be the case.

It is also important to note that I am not claiming that graph analysis will necessarily tell you what you are looking for. However, because graphs and graph databases carry more information about the connectedness and shape of the data, they (or rather analysis methods applied to them) can help find interesting patterns in the data, which might not be easy to discern through the approaches commonly applied to data stored in relational databases. For those interested, a survey of graph clustering algorithms can be found here, and a free book from Neo4j on graph algorithms (with machine learning) here. Using the patterns identified, one can then structure the analysis to seek out similar patterns.

A word of caution

By now you may think that graph analysis is a magic bullet, but like all analysis and modelling, a lot will depend on your data quality. Another important matter is how you choose to view your data. What should be considered nodes and what should be considered edges? This may not always be obvious.
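
A small, hypothetical example of why this modelling choice matters: an address can live as a property on each person node, or it can be a node of its own that people connect to. The records and the `LIVES_AT` edge label below are invented for illustration, and neither choice is claimed to be the right one in general.

```python
# Hypothetical records, invented for illustration.
people = {"p1": "1 Main St", "p2": "1 Main St", "p3": "9 Oak Ave"}

# Choice 1: address is a property stored on each person node.
nodes_v1 = {pid: {"address": addr} for pid, addr in people.items()}
edges_v1 = set()  # sharing an address is invisible unless you compare properties

# Choice 2: each address is its own node, joined to people by LIVES_AT edges.
nodes_v2 = set(people) | set(people.values())
edges_v2 = {(pid, "LIVES_AT", addr) for pid, addr in people.items()}

# With choice 2, shared addresses show up as address nodes with degree > 1.
degree = {}
for _, _, addr in edges_v2:
    degree[addr] = degree.get(addr, 0) + 1
shared = {addr for addr, d in degree.items() if d > 1}
print(shared)
```

The second model has more nodes and edges to maintain, but it turns "who shares an address?" into a question about the shape of the graph rather than a comparison of property values.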

Finally, today’s setting-the-scene-story was centered on fraud, but it could just as well be used to monitor identities and accesses across several IT systems, in a compliance unit to have a constant overview of who has access to which pieces of GDPR-sensitive data, for figuring out which customers are likely to want the same types of services, which teams in a company interact the most, and so on, and so on… The key point is that graphs appear naturally everywhere, from road networks to social networks be they online or irl, so why not add graph DBs to your technology stack and graph analysis to your tool kit?

… oh yeah, if you are still interested in how to detect fraud using graph analysis, Neo4j have an example here. And if, after that, you are interested in how to get started with Neo4j using, say, Python to build your graph, have a look at my article here, where I use data on exports to build a graph based on the trade between countries and combine it with democracy scores to analyze, among other things, which democracies trade the most with authoritarian regimes.

Originally published at https://www.linkedin.com.

Translated from: https://medium.com/swlh/not-exactly-sure-how-to-find-what-you-are-looking-for-try-graph-analysis-e96bc47489fb
