Big Data: Concepts, Security, and Use Cases

Introduction

Big Data refers to data collections so large and complex that traditional database tools struggle to manage them. It is widely regarded as a foundation for the future of Information Technology (IT). Organizations today depend on ever larger volumes of data, which is why their interest in Big Data analytics keeps growing. The key to Big Data is organizing data for quick reference, so that the source records can be reached from summaries and indexes. Amazon AWS uses DDN with Lustre, Microsoft has been using Cray with Lustre, and Google uses FUSE or its own storage [1][2][3][4][5].

Knowledge of Big Data helps you craft the right plan or strategy and prepares you to compete in your industry. As in any other field, though, newcomers face a number of challenges. This article covers typical Big Data challenges that organizations face, along with their solutions.

Understanding

Many organizations fail to learn the advantages and disadvantages of Big Data as a new technology in the market. They also fail to understand how important Big Data is to their business. Without reliable information they form assumptions, for example that it may be dangerous for a project or that it is too expensive.

Do proper research to understand the advantages and disadvantages of Big Data. Never accept or reject a technology without understanding the concepts behind it. To see how Big Data is regarded at different levels, attend workshops and Big Data events. You can also contact partners who currently use the technology and profit from it. Big Data is a given, and it is a requirement for Artificial Intelligence and Deep Learning training [6]. Training deep learning models needs as much data as possible, because part of the point of Deep Learning is to find patterns you may not see yourself. If you are not doing deep learning, you need to process the data with other algorithms and try to keep up with the information as it arrives. Big Data itself is not processed in real time. We train on Big Data and use the results to find algorithms that we then apply in real time, as in self-driving cars.
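
The split between training on stored data and applying the result in real time can be sketched in a few lines. This is a minimal illustration, not a deep learning pipeline: it swaps in an ordinary least-squares line fit, and the names `train_offline`, `apply_online`, and the `HISTORY` values are hypothetical, made up for the example.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def train_offline(history: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Fit y = slope * x + intercept by least squares on a stored batch."""
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in history)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in history)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned function is the cheap "algorithm" used at run time.
    return lambda x: slope * x + intercept


def apply_online(model: Callable[[float], float],
                 stream: Iterable[float]) -> Iterator[float]:
    """Apply the already-trained model to each value as it arrives."""
    for x in stream:
        yield model(x)


# Hypothetical historical batch; the points lie exactly on y = 2x + 1.
HISTORY = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

model = train_offline(HISTORY)                       # slow, done ahead of time
predictions = list(apply_online(model, [4.0, 10.0])) # fast, done per event
```

The expensive step runs once over the whole batch, while the per-event step is a single multiplication and addition, which is what makes real-time use feasible.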

Concepts 

Data structures should be established to manage Big Data better. They allow large data sets to be managed and indexed effectively. In this context, data is generally classified as either structured or unstructured [7].

Structured

  • Definition: Data that generally lives in a relational database management system (RDBMS).
  • Examples: Tabular data containing names, phone numbers, addresses, social security numbers, and any other items found in client records.
  • Database: Relational databases, queried with Structured Query Language (SQL).

Unstructured 

  • Definition: Everything that does not fall under structured data.
  • Examples: Text files, email, social media, websites, text messages, phone calls, location data, media files, imagery, and sensor data, to name a few.
  • Database: The most common type of database for this data is NoSQL ("not only SQL").
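
The difference between the two categories can be sketched with the Python standard library. This is an illustration under stated assumptions: the `clients` table, the sample records, and the `search` helper are hypothetical, and the JSON documents stand in for a NoSQL store (strictly speaking, JSON is semi-structured).

```python
import json
import sqlite3

# Structured: a fixed schema in a relational table, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO clients (name, phone) VALUES (?, ?)",
    [("Ada", "555-0100"), ("Grace", "555-0101")],
)
structured_hit = conn.execute(
    "SELECT phone FROM clients WHERE name = ?", ("Grace",)
).fetchone()[0]
conn.close()

# Unstructured: documents that share no schema, so every record
# may carry different fields.
documents = [
    json.dumps({"type": "email", "body": "Meeting moved to 3pm"}),
    json.dumps({"type": "sms", "text": "Running late", "location": [51.5, -0.1]}),
    json.dumps({"type": "sensor", "reading": 21.7}),
]


def search(docs, keyword):
    """Return parsed documents whose serialized text mentions the keyword."""
    return [json.loads(d) for d in docs if keyword.lower() in d.lower()]


unstructured_hits = search(documents, "late")
```

The SQL query can rely on the `name` and `phone` columns existing. The document search cannot assume any field exists, so it falls back to scanning the text, which is why indexing unstructured data takes more work.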

According to common definitions and guidelines, the attributes of Big Data are summarized as the "5 Vs": Volume, Variety, Velocity, Value, and Veracity. Keep in mind that this is a growing field [8][9].

The base definition rests on three Vs: Variety, Volume, and Velocity.

  • Variety: Multiple forms of data. Variety refers to the many types of data that come from many sources.
  • Volume: The scale or size of data. Volume is the amount of data being generated.
  • Velocity: Analysis of moving or streaming data. Velocity refers to the speed at which data is generated and the rate at which it is processed.

The importance of Big Data lies in the value added by measurable, reliable data. The modern notion of Big Data still follows the definition of very large, complex data, but it has recently been expanded to include two more Vs: Value and Veracity.

  • Value: The benefits of understanding the data.
  • Veracity: Uncertainty of data. Veracity refers to the quality of the data: is it accurate and reliable?

Because Big Data keeps evolving, so do its main concepts. Our current understanding will move beyond the 5 Vs as we further define what Big Data means. Some possible additions to the Vs are the following:

  • Validity: refers more specifically to the precision and accuracy of data.
  • Vulnerability: relates to cybersecurity risk.
  • Volatility: refers to how quickly the data becomes irrelevant and invalid.
  • Visualization: represents the many ways we can view Big Data.
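
Two of these attributes, validity and volatility, translate directly into checks a pipeline can run. The sketch below is an assumption-laden illustration: the `Reading` type, the accepted value range, and the one-hour freshness window are hypothetical choices, not part of any standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class Reading:
    sensor_id: str
    value: Optional[float]
    recorded_at: datetime


def is_valid(r: Reading, low: float = -50.0, high: float = 150.0) -> bool:
    """Validity: the value is present and inside an accepted range."""
    return r.value is not None and low <= r.value <= high


def is_fresh(r: Reading, now: datetime, max_age: timedelta) -> bool:
    """Volatility: the reading is still recent enough to be relevant."""
    return now - r.recorded_at <= max_age


def quality_report(readings: List[Reading], now: datetime,
                   max_age: timedelta) -> Dict[str, int]:
    """Count how many readings survive each check."""
    valid = [r for r in readings if is_valid(r)]
    usable = [r for r in valid if is_fresh(r, now, max_age)]
    return {"total": len(readings), "valid": len(valid), "usable": len(usable)}


NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
SAMPLE = [
    Reading("a", 21.5, NOW - timedelta(minutes=5)),
    Reading("b", None, NOW - timedelta(minutes=1)),   # missing value
    Reading("c", 999.0, NOW - timedelta(minutes=2)),  # out of range
    Reading("d", 19.0, NOW - timedelta(days=2)),      # stale
]
report = quality_report(SAMPLE, NOW, timedelta(hours=1))
```

Reporting the counts at each stage, rather than only the final total, shows whether data is being lost to bad values or to age.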

Security

Big Data involves integrating data across the various divisions of a business. Many organizations see Big Data as a potential threat when they share information through third-party software to make data visible to other departments. Big Data typically relies on a large amount of dispersed backend storage that different platforms do not support locally. The third-party software is meant only to view the data, but it may still be able to access the data for its own use.

As new technologies are introduced and Big Data is used in more ways, its security and confidentiality have become a concern. Big Data raises a range of security and privacy issues. The main issues in Big Data Security (BDS) are protecting and verifying data [10][11].

Because of the volume, velocity, and variety of Big Data, processing it is challenging for conventional security models. This presents a challenge to security professionals, who must adapt to the massive scope of Big Data. The following table lists common threats to Big Data:

| Threat | Description |
| --- | --- |
| Breach of privacy | Big Data is often used to store great volumes of personal information. Such a large store of data may make it easier for an attacker to steal sensitive personal information in one comprehensive attack. |
| Privilege escalation | Because Big Data can represent wide swaths of information, some users may be able to view data that they are not authorized to view. This is especially true if no systems are in place to restrict how users can view and edit database entries. Multiple users with unrestricted visibility into data can threaten its confidentiality. |
| Repudiation | The size of Big Data may make event monitoring difficult or infeasible. Without proper controls for non-repudiation, an attacker may be able to change data and then plausibly deny having done so. |
| Forensic complications | Accurately securing, collecting, and evaluating Big Data sets is especially difficult because Big Data implementations often lack a consistent structure and draw on a variety of different sources. |
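
One building block against the repudiation threat is a tamper-evident audit log, in which each entry's hash covers the previous entry. The sketch below is a minimal illustration using only `hashlib` and `json`. The entry layout and the helper names `append_event` and `verify` are hypothetical, and hash chaining alone detects edits but does not prove who made them; real non-repudiation would also need signatures tied to user identities.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_event(log: list, event: dict) -> None:
    """Append an event, chaining it to the hash of the last entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": _digest(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list = []
append_event(audit_log, {"user": "alice", "action": "update", "row": 42})
append_event(audit_log, {"user": "bob", "action": "delete", "row": 7})
intact = verify(audit_log)

# Simulate an attacker rewriting history after the fact.
audit_log[0]["event"]["user"] = "mallory"
tampered_detected = not verify(audit_log)
```

Because each hash depends on everything before it, changing one old entry would force the attacker to recompute every later hash, which is easy to detect if the latest hash is stored somewhere the attacker cannot reach.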

 

Cloud

Big Data is often described as a data warehouse where organizations can store huge amounts of data, and in many cases that storage is cloud-based. Such platforms are built to handle, clean, and process data and to perform various other operations on it. Today's businesses hold massive amounts of data, and many of them keep it in the cloud as Big Data.

Big Data is not the cloud, however. Big Data is large, fast, and diverse data; the cloud is one tool that offers a solution. In-house computing, set up correctly, is effectively an internal cloud where the data is accessible only to the people you directly grant access to. Keeping truly sensitive data in a public cloud (such as AWS or Azure) raises a major security concern: a foreign government, another company, and their contractors all have potential access to your data, and you have limited control [12].

Another challenge organizations face is the cost of storing data in Big Data systems. Most companies assume that Big Data will cost them far more than traditional data storage methods, but this is a myth. The cost depends on your needs and requirements. Setting up internally requires hardware, software, maintenance, and highly skilled people to build and maintain the internal cloud. Cloud providers benefit from economies of scale, which they can use to their advantage in cost, scale, co-location, and speed.

Example Use Cases

Organizations can quickly get lost in the wide range of Big Data technologies on the market. The variety can make it hard to choose one for a business or project. If you explore this ocean with incomplete knowledge, you will never have a clear view of what to expect from an application or technology. For example, Big Data tools such as Google BigQuery and Apache Hadoop can be useful platforms for developing your own analysis tools. Third-party cloud-based apps also provide log analysis services.
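
To give a feel for the kind of log analysis such platforms run at scale, here is a pure-Python sketch of the map-and-reduce pattern that Hadoop popularized. It does not use the Hadoop or BigQuery APIs: the log lines, the chunk size, and the helper names are hypothetical, and the chunks are processed in one process where a real cluster would spread them across machines.

```python
from collections import Counter
from functools import reduce
from typing import List

# Hypothetical log lines in "timestamp LEVEL message" form.
LOG_LINES = [
    "2024-01-01T00:00:01 INFO login ok user=alice",
    "2024-01-01T00:00:02 ERROR disk full on node3",
    "2024-01-01T00:00:03 INFO login ok user=bob",
    "2024-01-01T00:00:04 WARN slow query 2300ms",
    "2024-01-01T00:00:05 ERROR timeout calling billing",
]


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split the input the way a cluster would split a large file."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(lines: List[str]) -> Counter:
    """Map step (with local combining): count log levels in one chunk."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > 1:
            counts[fields[1]] += 1
    return counts


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


partials = [map_chunk(chunk) for chunk in chunked(LOG_LINES, 2)]
level_counts = reduce(reduce_counts, partials, Counter())
```

Each chunk is counted independently, so the map step can run in parallel, and the reduce step only ever merges small summaries instead of raw log lines.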

Big Data has no value in itself, but it has great potential. It is used in every aspect of modern life, and we use information in everything we do. Since information is now easily accessed and shared, everyone should be aware of what their connection to Big Data looks like. Big Data can be used to solve efficiency problems by looking at how people and processes affect an organization's overall workflow [13][14][15][16][17].

  • CCTV: Governments use camera surveillance to control populations, track terrorism, and catch criminals with facial recognition. It also helps in understanding traffic patterns, to make roadways safer or transportation more efficient. Camera data can even help decide where to place access controls, such as card readers, to make them more secure. This area is boundless and will continue to shape security in new ways.
  • Phones: We use Big Data on our phones every day. The notification that you parked your car in a certain location, or your map knowing your home address, are examples of Big Data analytics at work. This is just one of many ways mobile devices are shaping Big Data and cybersecurity.
  • Network Anomalies: The amount of data logged on organizations' networks has reached the point where, without Big Data, it would be impossible to detect attackers. This is why security information and event management systems (SIEMs) have become a standard component of almost any mid-size to enterprise network architecture. These tools allow advanced correlation across large data sets. On the engineering side, these systems end up limited by how well they handle the Big Data problem: if they cannot deal with the massive amounts of logged data, the security benefits are limited. Many cybersecurity professionals run Big Data analysis before the data even reaches the SIEM, because networks produce so much data that it is almost impossible to handle, even with Big Data. Network problems almost always come down to latency and throughput, which is a reason to handle this analysis on internal resources rather than in an external cloud, where denial-of-service (DoS) attacks, networking issues, and server load can block processing.
  • Intrusion Detection: Big Data architectures are replacing traditional intrusion detection systems (IDS) because of the massive amounts of data, the high throughput requirements, and the need to understand events as close to real time as possible. Intrusion detection is an area where Big Data is relatively new and is only starting to be heavily researched. A significant number of white papers have now been published on the topic, especially on reducing "false positives". If the current hypothesis is correct, we will reach a point where we can trust that a reported security event is a real threat, which would eliminate the false-positive fatigue analysts currently face.
  • Internet of Things (IoT): IoT devices are everywhere and generate huge data footprints, yet they have minimal storage or logging capabilities. Because these devices connect to other systems, they can report a lot of data, and Big Data can handle this unstructured data in valuable ways. This data may allow us to detect a health issue from a smart watch before the wearer notices it (we are already seeing this happen), to know when a device needs repair before it breaks (think of vibration monitoring systems in manufacturing), to understand inefficiencies in a process, or to predict when a person is walking up to a store so that what they are going to buy is ready at the cash register. The applications of IoT with Big Data are boundless and will likely reshape how we live.
  • Compliance: Big Data and risk scoring are reshaping compliance. In many industries you must meet specific government compliance requirements, and Big Data lets organizations use their compliance levels to define risk scores. There are even tools that let someone upload full network diagrams and then derive a risk score from them. That score is aggregated with all the other data required for compliance to give a more accurate understanding of risk and to ensure the organization can meet its compliance requirements. In many cases, such Big Data analysis produces a better risk score, which leads to a more secure environment. Risk scoring will become more valuable to industry as it looks to secure networks and data from attackers.
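
Several of these use cases, such as network anomalies and vibration monitoring, reduce to flagging values that deviate sharply from a baseline. The sketch below shows one simple way to do that, a modified z-score based on the median absolute deviation (MAD). It is an assumption for illustration, not how any particular SIEM or IDS works, and the per-minute event counts are made up.

```python
from statistics import median
from typing import List


def robust_anomalies(counts: List[float], threshold: float = 3.5) -> List[int]:
    """Return indexes whose modified z-score exceeds the threshold.

    The median and MAD are used instead of the mean and standard
    deviation so that one huge spike does not hide itself by
    inflating the baseline.
    """
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    # 0.6745 rescales the MAD to be comparable with a standard deviation.
    return [i for i, c in enumerate(counts)
            if 0.6745 * abs(c - med) / mad > threshold]


# Hypothetical events-per-minute counts from a network log.
EVENTS_PER_MINUTE = [120, 118, 125, 122, 119, 121, 940, 123, 117, 120]
suspicious = robust_anomalies(EVENTS_PER_MINUTE)
```

A filter like this is cheap enough to run before data reaches heavier analysis, which is one way to reduce both the volume forwarded and the number of false positives analysts must review.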

Conclusion

Big Data is regarded as a foundation for the future of Information Technology. Its goal is to automate multiple processes to help find value. Big Data has turned out to be one of the most promising and successful innovations for anticipating future patterns. It is advisable to do proper research and explore the technology as much as you can.

References:

[1] https://aws.amazon.com/big-data/what-is-big-data/

[2] https://www.oracle.com/big-data/what-is-big-data.html

[3] https://aws.amazon.com/fsx/lustre/

[4] https://www.cray.com/solutions/supercomputing-as-a-service/cray-clusterstor-in-azure

[5] https://cloud.google.com/storage/docs/gcs-fuse

[6] https://www.ibm.com/blogs/systems/ai-machine-learning-and-deep-learning-whats-the-difference/

[7] https://blogs.oracle.com/bigdata/structured-vs-unstructured-data

[8] https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx

[9] https://thesai.org/Downloads/Volume7No3/Paper_37-Extract_Five_Categories_CPIVW.pdf

[10] https://journalofbigdata.springeropen.com/articles/10.1186/s40537-016-0059-y

[11] https://www.sciencedirect.com/science/article/pii/S1877050916322864

[12] https://www.hindawi.com/journals/sp/2018/5418679/

[13] https://intellipaat.com/blog/7-big-data-examples-application-of-big-data-in-real-life/

[14] https://arxiv.org/ftp/arxiv/papers/1905/1905.00490.pdf

[15] https://insidebigdata.com/white-paper/risk-scoring-big-data-and-data-analytics/

[16] https://medium.com/xnewdata/iot-big-data-success-case-1646291b55cb

[17] https://hadoop.apache.org/

Translated from: https://www.experts-exchange.com/articles/34091/Big-Data-Concepts-Security-and-Use-Cases.html
