Data Mining and Data Exploration: Data Exploration in Data Mining

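This article introduces the data exploration process in data mining, including the definition of data exploration, the statistical description of data, and the concept of data visualization. Data exploration involves gathering data through automated and manual activities such as data profiling, data visualization, and statistical description. Statistical description covers measures of central tendency, dispersion, and skewness and kurtosis. Data visualization is a key tool for interpreting data and helps in understanding trends, anomalies, and patterns. The article also lists various data visualization techniques and chart types.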

In the previous article, we learned about data mining along with its advantages, disadvantages, and applications. Now let us go deeper into data mining, which involves several steps for handling data, starting with data exploration. This article covers:

  1. Definition of Data Exploration

  2. Statistical Description of Data

  3. Concept of Data Visualization

  4. Various Techniques of Data Visualization

1) Definition of Data Exploration

Data exploration is the process of gathering information about the characteristics of the data relevant to a target object or field. These characteristics include the size or quantity of the data, its completeness, its correctness, and the possible relationships among data elements or among the files/tables in the data.

Data exploration is usually conducted using a combination of automated and manual activities. Automated activities can include data profiling, data visualization, or tabular reports that give the analyst an initial view of the data and an understanding of its key characteristics. This is usually followed by manual drill-down or filtering of the data to identify anomalies or patterns flagged by the automated steps.
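As an illustration of the automated profiling step, the following minimal sketch counts rows, missing values, and distinct values per column using only the Python standard library. The sample records, the column names, and the `profile` helper are hypothetical and are not part of the original article; a real project would more likely rely on a dedicated profiling tool.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
rows = [
    {"age": 34, "city": "Pune", "income": 52000},
    {"age": 29, "city": "Delhi", "income": None},
    {"age": 41, "city": "Pune", "income": 61000},
    {"age": None, "city": "Mumbai", "income": 48000},
]


def profile(records):
    """Return per-column counts of rows, missing values, and distinct values."""
    columns = sorted({key for record in records for key in record})
    report = {}
    for column in columns:
        values = [record.get(column) for record in records]
        present = [v for v in values if v is not None]
        report[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            # Most frequent non-missing value, or None for an empty column.
            "most_common": Counter(present).most_common(1)[0][0] if present else None,
        }
    return report


profile_report = profile(rows)
for column, stats in profile_report.items():
    print(column, stats)
```

A report like this gives the analyst the initial view described above: which columns have gaps, and which columns have few or many distinct values.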

Data exploration can also require manual scripting and queries against the data (e.g. using languages such as SQL or R), or the use of spreadsheets or similar tools to view the raw data. All of these activities aim to build a mental model and understanding of the data in the analyst's mind, and to define basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis. Once this initial understanding is in place, the data can be pruned or refined by removing unusable parts (data cleansing), correcting poorly formatted elements, and defining relevant relationships across data sets. This process is also referred to as determining data quality.
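The manual querying described above might look like the following sketch, which runs SQL through Python's built-in `sqlite3` module. The `orders` table, its columns, and the threshold of 1000 are assumptions chosen for illustration only.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "north", 120.0),
        (2, "north", 135.5),
        (3, "south", 99.0),
        (4, "south", None),     # missing amount
        (5, "south", 9800.0),   # suspiciously large amount
    ],
)

# Summary query: row count, missing amounts, and value range per region.
summary = conn.execute(
    """
    SELECT region,
           COUNT(*)                 AS n_rows,
           COUNT(*) - COUNT(amount) AS n_missing,
           MIN(amount)              AS min_amount,
           MAX(amount)              AS max_amount
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()

# Drill-down: list the individual rows above an arbitrary threshold.
suspects = conn.execute(
    "SELECT id, region, amount FROM orders WHERE amount > ? ORDER BY id",
    (1000,),
).fetchall()

for row in summary:
    print(row)
print(suspects)
conn.close()
```

The summary query plays the role of the tabular report, and the second query is the drill-down that isolates the rows behind an unusual maximum.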

2) Statistical Description of Data

Statistics plays an important role in every field. It helps in collecting data in any domain and in analyzing that data with statistical techniques. A statistical description condenses the collected data into a small set of numbers, and the predictions drawn from those calculations can then support one conclusion or another.

Common statistical measures include:

2.1) Measure of Central Tendency

In statistics, a central tendency may be referred to as the middle or location of a distribution. Measures of central tendency are often called averages. The most common measures of central tendency are:

  1. Arithmetic mean: the sum of all numerical values divided by the number of values.

  2. Median: the midpoint of the data after arranging it in ascending order. For an even number of values, it is the average of the two middle values.

  3. Mode: the most frequently occurring value in the data.
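The three measures above can be computed with Python's standard `statistics` module, as in this small sketch. The sample values are hypothetical.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical sample

# Arithmetic mean: sum of the values divided by how many there are.
mean = statistics.mean(values)

# Median: middle of the sorted data; with six values it is the
# average of the third and fourth values (3 and 5).
median = statistics.median(values)

# Mode: the value that occurs most often (3 appears twice).
mode = statistics.mode(values)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Here the mean (5) is pulled above the median (4.0) by the larger values 7 and 10, which is a quick hint that the sample is not symmetric.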

2.2) Measure of Dispersion

In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. It describes how much the data values differ from one another and gives a clear picture of how the data is distributed. A measure of dispersion shows the homogeneity or heterogeneity of the distribution of the observations. Common measures of statistical dispersion are:

  1. Range: the difference between the highest value and the lowest value.

  2. Variance: the sum of the squared deviations from the sample mean, divided by one less than the sample size (the sample variance).

  3. Standard Deviation: the square root of the variance.

  4. Interquartile Range: the IQR is a measure of variability based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that separate the parts are called the first, second, and third quartiles, denoted by Q1, Q2, and Q3. The IQR is the difference Q3 - Q1.
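A minimal sketch of these dispersion measures, again using the standard `statistics` module on a hypothetical sample. Note that several conventions exist for computing quartiles; `statistics.quantiles` is used here with its default method, and other tools may return slightly different Q1 and Q3 values for the same data.

```python
import math
import statistics

values = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample, mean = 6

# Range: highest value minus lowest value.
value_range = max(values) - min(values)

# Sample variance: squared deviations from the mean, summed and
# divided by (n - 1). Here 60 / 6 = 10.
variance = statistics.variance(values)

# Standard deviation: square root of the variance.
std_dev = statistics.stdev(values)

# Quartiles Q1, Q2, Q3 and the interquartile range Q3 - Q1.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print("range:", value_range)
print("variance:", variance)
print("standard deviation:", round(std_dev, 4))
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
```

The variance can be checked by hand: the squared deviations from the mean 6 are 16, 4, 4, 1, 1, 9, and 25, which sum to 60, and 60 divided by 6 gives 10.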

2.3) Measure of Skewness and Kurtosis

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A data set is symmetric if it looks the same to the left and right of the center point.

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution would be the extreme case.
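The article gives no formulas for these two measures. One common choice, assumed here, is the simple moment-based definition: skewness is the third central moment divided by the 1.5 power of the second, and excess kurtosis is the fourth central moment divided by the square of the second, minus 3 so that a normal distribution scores 0. Statistical libraries often apply bias corrections, so their results can differ from this sketch. The helper names and sample data are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment: mean of (x - mean) ** k over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]       # evenly spread, symmetric about 3
right_skewed = [1, 1, 1, 2, 10]   # one large value stretches the right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
print("excess kurtosis (right-skewed):", excess_kurtosis(right_skewed))
```

The evenly spread sample has zero skewness and negative excess kurtosis (light tails), while the sample with a single large value has positive skewness.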

3) Concept of Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements such as charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Visualization is an increasingly key tool for making sense of the trillions of rows of data generated every day.

Data visualization helps to tell stories by curating data into a form that is easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from the data and highlighting the useful information. In the world of big data, data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions.

4) Various Techniques of Data Visualization

4.1) Common general types of data visualization

  • Charts

  • Tables

  • Graphs

  • Maps

  • Infographics

  • Dashboards

4.2) More specific examples of methods to visualize data

  • Area Chart

  • Bar Chart

  • Box-and-whisker Plots

  • Bubble Cloud

  • Bullet Graph

  • Cartogram

  • Circle View

  • Dot Distribution Map

  • Gantt Chart

  • Heat Map

  • Highlight Table

  • Histogram

  • Matrix

  • Network

  • Polar Area

  • Radial Tree

  • Scatter Plot (2D or 3D)

  • Streamgraph

  • Text Tables

  • Timeline

  • Treemap
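To make one of the listed chart types concrete, the sketch below draws a histogram as plain text using only the Python standard library. The `text_histogram` helper, the bin width, and the sample ages are hypothetical; in practice a plotting library would normally be used instead of text bars.

```python
from collections import Counter


def text_histogram(data, bin_width):
    """Group integer values into fixed-width bins and return one text bar per bin."""
    counts = Counter((value // bin_width) * bin_width for value in data)
    lines = []
    # Walk every bin between the lowest and highest so that empty bins
    # remain visible as gaps in the chart.
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        label = f"{start:>3}-{start + bin_width - 1:<3}"
        lines.append(f"{label} | {'#' * counts[start]}")
    return lines


ages = [21, 23, 25, 31, 32, 35, 36, 38, 44, 47, 52, 95]  # hypothetical sample
histogram_lines = text_histogram(ages, 10)
print("\n".join(histogram_lines))
```

Even in this crude form, the empty bins between the 50s and the 90s make the single value 95 stand out as a likely outlier, which is the kind of pattern a histogram is meant to reveal.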

Translated from: https://www.includehelp.com/basics/data-exploration-in-data-mining.aspx
