Data Wrangling

Latest recommended article published on 2024-10-10 07:20:50

salt2020

Category: Study Notes. Tags: data analysis, data wrangling

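Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.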

Article link: https://blog.csdn.net/Guo_ya_nan/article/details/79982582

Data Wrangling

Data wrangling can be summarized in three steps:
- Gather
- Assess
- Clean

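Gather

There are many ways to gather data. The simplest and most common is to download a ready-made dataset, for example from Kaggle.

For scalability and reproducibility, however, it is sometimes necessary to download data programmatically, for example when there are hundreds or thousands of files to fetch, possibly spread across different pages.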
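
A minimal sketch of a programmatic download, using only the Python standard library. The function name `download_all` and the URLs in the comment are illustrative assumptions, not part of the original notes; error handling and retries are left out.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def download_all(urls, dest_dir):
    """Download every URL into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        name = Path(urlparse(url).path).name
        target = dest / name
        with urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved


# Hypothetical usage: the URL pattern below is made up for illustration.
# urls = [f"https://example.com/data/part-{i}.csv" for i in range(100)]
# download_all(urls, "raw_data")
```

Because the list of URLs lives in code, the same download can be rerun later or extended to more files without manual clicking.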

Scrape data from the web, for example from Zhihu or Douban.
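
As a rough illustration of scraping, the sketch below extracts links from an HTML string with the standard-library `html.parser`. The class name `LinkExtractor`, the helper `extract_links`, and the sample markup are assumptions made for this example; a real page would first be downloaded and would likely need more robust handling.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a> tag with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page content, standing in for a downloaded HTML document.
sample = (
    '<ul><li><a href="/movie/1">Movie One</a></li>'
    '<li><a href="/movie/2">Movie Two</a></li></ul>'
)
print(extract_links(sample))
```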

Obtain data from APIs, for example movie data APIs, stock data APIs, the Twitter API, and so on.
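
Many web APIs return JSON, so gathering from them usually means building a query URL and parsing the response body. The sketch below assumes a hypothetical endpoint (`https://api.example.com/movies`) and a hypothetical response shape with a `results` list; neither belongs to a real service.

```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, **params):
    """Append query parameters to an API endpoint."""
    return f"{base_url}?{urlencode(params)}"


def parse_movies(payload):
    """Turn a JSON response body into a list of (title, year) tuples."""
    data = json.loads(payload)
    return [(item["title"], item["year"]) for item in data["results"]]


# Hypothetical endpoint and made-up response body, for illustration only.
url = build_request_url("https://api.example.com/movies", query="heat", page=1)
body = '{"results": [{"title": "Movie A", "year": 2001}]}'
print(url)
print(parse_movies(body))
```

In practice the body would come from an HTTP request to `url` (for instance with `urllib.request.urlopen`), and most APIs also require an access key.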

Assess

Data can be assessed along two dimensions: quality and tidiness.

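Quality

Low-quality data is often called dirty data. Examples:
- Missing data (missing values).
- Invalid data.
- Inaccurate data.
- Inconsistent data, such as using different units of length (inches and centimeters).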
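
Some of these quality issues can be detected with simple programmatic checks. The sketch below uses made-up records and an assumed rule that heights must be positive and recorded in centimeters. Inaccurate values (plausible but wrong) generally cannot be caught this way without a trusted reference.

```python
# Made-up records for illustration.
records = [
    {"name": "A", "height": "170", "unit": "cm"},
    {"name": "B", "height": "", "unit": "cm"},    # missing value
    {"name": "C", "height": "-5", "unit": "cm"},  # invalid value
    {"name": "D", "height": "68", "unit": "in"},  # inconsistent unit
]


def assess_quality(rows):
    """Return the names of rows affected by each kind of quality issue."""
    issues = {"missing": [], "invalid": [], "inconsistent_unit": []}
    for row in rows:
        if row["height"] == "":
            issues["missing"].append(row["name"])
        elif float(row["height"]) <= 0:
            issues["invalid"].append(row["name"])
        if row["unit"] != "cm":
            issues["inconsistent_unit"].append(row["name"])
    return issues


print(assess_quality(records))
```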

Tidiness

Untidy data is often called messy data, a concept introduced by Hadley Wickham, a statistician, professor, and all-around data expert.

A dataset is messy or tidy depending on how rows, columns, and tables are matched up with observations, variables, and types. In tidy data:

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
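
To make the definition concrete, here is a small sketch that reshapes a messy "wide" table, whose column headers are really values of a variable, into tidy "long" form. The helper `melt` and the sample table are assumptions for illustration; the data is made up.

```python
def melt(rows, id_key, var_name, value_name):
    """Reshape wide rows into long form: one (id, variable, value) row each."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Made-up messy table: the headers "2016" and "2017" are values of a "year" variable.
messy = [
    {"city": "X", "2016": 10, "2017": 12},
    {"city": "Y", "2016": 7, "2017": 9},
]
tidy = melt(messy, id_key="city", var_name="year", value_name="count")
for row in tidy:
    print(row)
```

After reshaping, each row is one observation (a city in a given year) and each of `city`, `year`, and `count` is its own column.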

Clean

Cleaning can be done manually or programmatically.

Programmatic cleaning:

- Define: convert our assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
- Code: convert those definitions to code and run that code.
- Test: test your dataset, visually or with code, to make sure your cleaning operations worked.

Always make copies of the original pieces of data before cleaning!
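
The Define, Code, Test cycle and the advice about copying can be sketched as follows. The records and the chosen cleaning task (converting inches to centimeters) are assumptions picked to match the inconsistent-units example; they are not from a real dataset.

```python
import copy

# Made-up raw data with inconsistent units.
raw = [
    {"name": "A", "height": 170.0, "unit": "cm"},
    {"name": "D", "height": 68.0, "unit": "in"},
]

# Work on a copy so the original data stays untouched.
clean = copy.deepcopy(raw)

# Define: convert heights recorded in inches to centimeters (1 in = 2.54 cm).

# Code: apply the conversion to every affected row.
for row in clean:
    if row["unit"] == "in":
        row["height"] = round(row["height"] * 2.54, 2)
        row["unit"] = "cm"

# Test: every row now uses centimeters, and the original is unchanged.
assert all(row["unit"] == "cm" for row in clean)
assert raw[1]["unit"] == "in"
print(clean[1])
```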

Reassess and Iterate

After cleaning, always reassess and iterate on any of the data wrangling steps if necessary.

Store (Optional)

Store data, in a file or database for example, if you need to use it in the future.
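
A minimal sketch of both storage options, using the standard-library `csv` and `sqlite3` modules. The table name `people`, the column names, and the rows are illustrative assumptions; in-memory targets are used so the example has no side effects.

```python
import csv
import io
import sqlite3

# Made-up cleaned rows: (name, height in cm).
rows = [("A", 170.0), ("D", 172.72)]

# Option 1: CSV. The in-memory buffer stands in for
# open("clean.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "height_cm"])
writer.writerows(rows)

# Option 2: SQLite. ":memory:" keeps the sketch side-effect free;
# pass a file path instead to persist the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, height_cm REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
conn.commit()
stored = conn.execute("SELECT name, height_cm FROM people ORDER BY name").fetchall()
print(stored)
```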
