Data Exploration, Simplified


What is Data Exploration?

Data exploration, as defined by Wikipedia, is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems.


It is at this stage of the data analysis process that data scientists, analysts, and AI or ML engineers try to really understand the data they are working with. They familiarize themselves with the dataset so they can judge how useful it will be for solving the problem at hand and how much further processing it will need. Doing this well usually means reaching for a large collection of tools and libraries, which makes thorough data exploration hard to accomplish, especially if you are not familiar with the essential tools, libraries and methods for exploring datasets. The good news: imagine being able to explore your datasets efficiently, and very quickly, in just one line of code!


xplore

To address these concerns and problems around exploring data effectively, my teammates Benjamin Acquaah and Adam Labaran and I wrote some automation scripts that drastically simplify the data exploration process. We built the script on top of the open-source Pandas library, making use of the many methods and functions it provides. After fixing a couple of bugs, optimizing the code and running a series of tests, we finally built and packaged xplore 🎉.


xplore is a Python package built for data scientists, analysts, and AI or ML engineers to explore the features of a dataset in a single line of code, for quick analysis before data wrangling and feature extraction. xplore also utilizes the full power of pandas-profiling and, if the user wants, can generate a very advanced and detailed report on the exploration of the data.


xplore is fully open-source, and interested contributors or code enthusiasts can find the complete source code and test files on GitHub. The project has also been published on PyPI, so anyone interested can easily install it and import it into their local projects.


[Image: xplore, by Divine Alorvor]

How To Explore Your Data Using xplore

In this section, I will walk you through how to install, import and use the xplore package in your local project.


Installing the xplore package

The minimum requirement for successfully exploring your data with the xplore package is to have Python installed on your computer. As a bonus, you can also have Anaconda installed and added to PATH. With that done, open your command prompt (Windows), terminal (Linux/macOS) or Anaconda prompt and run the command:


pip install xplore

Running this command installs the latest version of xplore and its dependencies on your local machine, so you can easily import and use the package in your Python files later.

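Once the installation finishes, a quick way to confirm that everything is in place is to try the import in a Python shell. A minimal sanity check, using the same import path introduced in the next section:

# Sanity check: this import should succeed without raising ImportError
from xplore.data import xplore
print(xplore)  # prints the imported object if the installation worked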

Importing the xplore module into your code

In any Python file where you would like to explore your data using the xplore package, you have to import the xplore module directly into your code so you can easily access and use the built-in methods that come with it. This is done by adding the following line alongside the other imports in your code:


from xplore.data import xplore

Preparing your data for exploration

As of the time this article was published, the xplore package has been optimized to work with labelled data only. Preparing your data for exploration with xplore is as easy as assigning the data you have read into your code to a variable.


import pandas as pd
data = pd.read_csv('name_of_data_file.csv')
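
Since xplore currently works with labelled data only, it is worth confirming that your columns carry meaningful names before exploring. Here is a small sketch, reusing the placeholder file name from above:

import pandas as pd

# read_csv treats the first row of the file as column labels by default;
# pass header=0 explicitly if you want to make that assumption visible
data = pd.read_csv('name_of_data_file.csv', header=0)
print(data.columns)  # confirm the dataset is labelled before calling xplore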

Using the xplore method

The process of actually exploring your data is simplified to just one line of code. Everything else is handled automatically by the automation script we wrote. You can explore your data using the command:


xplore(data)

Running this prints the analysis produced by exploring your data. Almost all the vital analysis you need will be printed out, but if you want a very detailed and advanced report, type 'y' when the prompt asks whether you would like a detailed report on the exploration of your data. Otherwise, simply type 'n' at that prompt if you are comfortable with the output already printed.

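Putting the installation, import and exploration steps together, a complete run looks something like the sketch below ('name_of_data_file.csv' is the same placeholder file name used earlier; substitute your own labelled dataset):

# explore_data.py -- a minimal end-to-end sketch of exploring a dataset with xplore
# (assumes xplore has already been installed with: pip install xplore)
import pandas as pd
from xplore.data import xplore

data = pd.read_csv('name_of_data_file.csv')  # read the labelled dataset into a DataFrame
xplore(data)  # prints the exploratory summary and asks whether to generate a detailed report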

Sample output from running the xplore method

After running code that uses the xplore method to explore your data, here is what your output should look like:


------------------------------------
The first 5 entries of your dataset are:
rank country_full country_abrv total_points ... three_year_ago_avg three_year_ago_weighted confederation rank_date
0 1 Germany GER 0.0 ... 0.0 0.0 UEFA 1993-08-08
1 2 Italy ITA 0.0 ... 0.0 0.0 UEFA 1993-08-08
2 3 Switzerland SUI 0.0 ... 0.0 0.0 UEFA 1993-08-08
3 4 Sweden SWE 0.0 ... 0.0 0.0 UEFA 1993-08-08
4 5 Argentina ARG 0.0 ... 0.0 0.0 CONMEBOL 1993-08-08
[5 rows x 16 columns]
------------------------------------
The last 5 entries of your dataset are:
rank country_full country_abrv total_points ... three_year_ago_avg three_year_ago_weighted confederation rank_date
57788 206 Anguilla AIA 0.0 ... 0.0 0.0 CONCACAF 2018-06-07
57789 206 Bahamas BAH 0.0 ... 0.0 0.0 CONCACAF 2018-06-07
57790 206 Eritrea ERI 0.0 ... 0.0 0.0 CAF 2018-06-07
57791 206 Somalia SOM 0.0 ... 0.0 0.0 CAF 2018-06-07
57792 206 Tonga TGA 0.0 ... 0.0 0.0 OFC 2018-06-07
[5 rows x 16 columns]
------------------------------------
Stats on your dataset:
<bound method NDFrame.describe of rank country_full country_abrv total_points ... three_year_ago_avg three_year_ago_weighted confederation rank_date
0 1 Germany GER 0.0 ... 0.0 0.0 UEFA 1993-08-08
1 2 Italy ITA 0.0 ... 0.0 0.0 UEFA 1993-08-08
2 3 Switzerland SUI 0.0 ... 0.0 0.0 UEFA 1993-08-08
3 4 Sweden SWE 0.0 ... 0.0 0.0 UEFA 1993-08-08
4 5 Argentina ARG 0.0 ... 0.0 0.0 CONMEBOL 1993-08-08
... ... ... ... ... ... ... ... ... ...
57788 206 Anguilla AIA 0.0 ... 0.0 0.0 CONCACAF 2018-06-07
57789 206 Bahamas BAH 0.0 ... 0.0 0.0 CONCACAF 2018-06-07
57790 206 Eritrea ERI 0.0 ... 0.0 0.0 CAF 2018-06-07
57791 206 Somalia SOM 0.0 ... 0.0 0.0 CAF 2018-06-07
57792 206 Tonga TGA 0.0 ... 0.0 0.0 OFC 2018-06-07
[57793 rows x 16 columns]>
------------------------------------
The Value types of each column are:
rank int64
country_full object
country_abrv object
total_points float64
previous_points int64
rank_change int64
cur_year_avg float64
cur_year_avg_weighted float64
last_year_avg float64
last_year_avg_weighted float64
two_year_ago_avg float64
two_year_ago_weighted float64
three_year_ago_avg float64
three_year_ago_weighted float64
confederation object
rank_date object
dtype: object
------------------------------------
Info on your Dataset:
<bound method DataFrame.info of rank country_full country_abrv total_points ... three_year_ago_avg three_year_ago_weighted confederation rank_date
0 1 Germany GER 0.0 ... 0.0 0.0 UEFA 1993-08-08
1 2 Italy ITA 0.0 ... 0.0 0.0 UEFA 1993-08-08
2 3 Switzerland SUI 0.0 ... 0.0 0.0 UEFA 1993-08-08
3 4 Sweden SWE 0.0 ... 0.0 0.0 UEFA 1993-08-08
4 5 Argentina ARG 0.0 ... 0.0 0.0 CONMEBOL 1993-08-08
... ... ... ... ... ... ... ... ... ...
57788 206 Anguilla AIA 0.0 ... 0.0 0.0 CONCACAF 2018-06-07
57789 206 Bahamas BAH 0.0 ... 0.0 0.0 CONCACAF 2018-06-07
57790 206 Eritrea ERI 0.0 ... 0.0 0.0 CAF 2018-06-07
57791 206 Somalia SOM 0.0 ... 0.0 0.0 CAF 2018-06-07
57792 206 Tonga TGA 0.0 ... 0.0 0.0 OFC 2018-06-07
[57793 rows x 16 columns]>
------------------------------------
The shape of your dataset in the order of rows and columns is:
(57793, 16)
------------------------------------
The features of your dataset are:
Index(['rank', 'country_full', 'country_abrv', 'total_points',
'previous_points', 'rank_change', 'cur_year_avg',
'cur_year_avg_weighted', 'last_year_avg', 'last_year_avg_weighted',
'two_year_ago_avg', 'two_year_ago_weighted', 'three_year_ago_avg',
'three_year_ago_weighted', 'confederation', 'rank_date'],
dtype='object')
------------------------------------
The total number of null values from individual columns of your dataset are:
rank 0
country_full 0
country_abrv 0
total_points 0
previous_points 0
rank_change 0
cur_year_avg 0
cur_year_avg_weighted 0
last_year_avg 0
last_year_avg_weighted 0
two_year_ago_avg 0
two_year_ago_weighted 0
three_year_ago_avg 0
three_year_ago_weighted 0
confederation 0
rank_date 0
dtype: int64
------------------------------------
The number of rows in your dataset are:
57793
------------------------------------
The values in your dataset are:
[[1 'Germany' 'GER' ... 0.0 'UEFA' '1993-08-08']
[2 'Italy' 'ITA' ... 0.0 'UEFA' '1993-08-08']
[3 'Switzerland' 'SUI' ... 0.0 'UEFA' '1993-08-08']
...
[206 'Eritrea' 'ERI' ... 0.0 'CAF' '2018-06-07']
[206 'Somalia' 'SOM' ... 0.0 'CAF' '2018-06-07']
[206 'Tonga' 'TGA' ... 0.0 'OFC' '2018-06-07']]
------------------------------------
Do you want to generate a detailed report on the exploration of your dataset?
[y/n]: y
Generating report...
Summarize dataset: 100%|████████████████████████████████████████████████████████████████████████████| 30/30 [03:34<00:00, 7.14s/it, Completed]
Generate report structure: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:31<00:00, 31.42s/it]
Render HTML: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:12<00:00, 12.07s/it]
Export report to file: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8.00it/s]
Your Report has been generated and saved as 'output.html'
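
The detailed report at the end of this run is produced with pandas-profiling, which xplore uses under the hood. If you ever want a report like 'output.html' without going through the prompt, a similar report can be generated with pandas-profiling's own API (a sketch, separate from xplore):

# Generate a detailed HTML report directly with pandas-profiling,
# the library xplore uses for its advanced report (this bypasses the [y/n] prompt)
import pandas as pd
from pandas_profiling import ProfileReport

data = pd.read_csv('name_of_data_file.csv')
profile = ProfileReport(data, title='Data Exploration Report')
profile.to_file('output.html')  # writes the HTML report next to your script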

Python is the most powerful language you can still read. — Paul Dubois


Being an avid learner, I am very passionate about making things easier for the next generation of learners, and automation is one of the easiest ways to achieve this. Save yourself the stress of wasting so much time on data exploration. The next time you explore your data, be sure to xplore your data 😉


If you love what my teammates and I did with this project, kindly spare a few minutes to leave a star ⭐️ on the GitHub repo and tell your friends on Twitter about xplore by clicking this link.


Thank you for sparing a few minutes of your time to read this. I hope it has been educative and helpful 😊 I am available if you want to chat personally on Twitter or on LinkedIn. Happy Programming!


My heartfelt gratitude goes to Anna Ayiku for proofreading and correcting the many mistakes I made writing this.


Translated from: https://towardsdatascience.com/data-exploration-simplified-2c045a495fe4
