by Zhen Liu

How to build your data science muscle memory: Slicing and Mapping Data for Machine Learning

When processing data using the Pandas library in Python, are you always confused when it comes to loc and iloc, or map, apply and applymap? Want to quickly select the subset you need and create some new features before creating your machine learning models? Use this tutorial to practice for 10 minutes every morning, and repeat it for a week.

It’s like doing a few small crunches a day — not for your abs, but for your data science muscles. Gradually, you’ll notice the change.

Following my previous “Data Science Workout” on data preprocessing, in this tutorial we’ll focus on 1) subsetting data and 2) creating new features.

Contents:
1) Slicing and dicing data to create your feature matrix (loc, iloc, etc.)
2) Assigning, mapping and transforming data to the ideal scale or label for modeling (map, apply, applymap and more)

First, load the libraries and Zillow data for our exercise:

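As a minimal sketch of this setup step, assuming a toy frame in place of the real Zillow file (the file name in the comment is hypothetical, every value is made up, and column names other than SizeRank and 2018-04 are assumptions):

```python
import pandas as pd

# With a local copy of the Zillow file you would load it with something like
# pd.read_csv("zillow_data.csv")  (hypothetical file name).
# This toy frame stands in for it; every value is made up.
raw_df = pd.DataFrame({
    "RegionID": [101, 102, 103, 104, 105, 106],
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "State": ["NY", "CA", "IL", "TX", "PA", "AZ"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
    "2018-03": [480000, 440000, 220000, 180000, 150000, 230000],
    "2018-04": [490000, 460000, 225000, 182000, 151000, 232000],
})

print(raw_df.shape)   # (rows, columns)
print(raw_df.head())  # first five rows
```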

1. Slicing and dicing data

1.1 Slicing Columns

What are loc and iloc?

In pandas, .loc[] and .iloc[] are two ways to select rows and columns: .loc[] selects by label(s) or a Boolean array, and .iloc[] selects by integer position.

.loc[]: you use the row's index label (it can be an integer or a string, depending on what the index is; for example, the index can hold names or numbers) and the column name. You can't use an integer position to pick a column.

.iloc[] : you can only use integers to do position-based indexing.

Example: select columns by names using .loc[]:

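A small sketch of selecting columns by name, once with .loc[] and once with plain brackets (toy data with made-up values; the column names are assumptions):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago"],
    "State": ["NY", "CA", "IL"],
    "SizeRank": [1, 2, 3],
})

# ":" keeps every row; the list names the columns to keep.
by_loc = raw_df.loc[:, ["RegionName", "State"]]

# Plain double brackets select the same columns.
by_brackets = raw_df[["RegionName", "State"]]

print(by_loc.equals(by_brackets))
```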

The two expressions above give you the same result.

What if I want to select the first 5 columns?

Now we use .iloc[]: it slices columns or rows by location.

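For instance, a sketch of taking the first 5 columns by position (toy data, made-up values, assumed column names):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionID": [101, 102, 103],
    "RegionName": ["New York", "Los Angeles", "Chicago"],
    "State": ["NY", "CA", "IL"],
    "SizeRank": [1, 2, 3],
    "2018-03": [480000, 440000, 220000],
    "2018-04": [490000, 460000, 225000],
})

# All rows, columns at positions 0 to 4 (the end position 5 is excluded).
first_five = raw_df.iloc[:, :5]
print(list(first_five.columns))
```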

Confused by .loc[] already? Don't worry, I'll show you more examples! But keep in mind: .loc[] -> index (label) based vs. .iloc[] -> position based.

1.2 Slicing Rows

Select rows by index using .loc[] (the current index in the dataframe is the row number assigned automatically; it starts at 0).
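A quick sketch of row selection by index label, assuming the default automatic index and a toy frame with made-up values:

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago", "Houston"],
    "SizeRank": [1, 2, 3, 4],
})

# A list of index labels picks exactly those rows.
picked = raw_df.loc[[0, 1, 2]]

# A label slice with .loc[] includes BOTH endpoints (0, 1 and 2).
sliced = raw_df.loc[0:2]

print(picked.equals(sliced))
```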

Select rows using location by .iloc[]:

To select the 2nd, 3rd and 5th rows in order, use [1, 2, 4] (remember that Python counts from 0 when it works with positions).
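Sketched on a toy frame (made-up values), the positional selection looks like this:

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
})

# Positions 1, 2 and 4 are the 2nd, 3rd and 5th rows.
rows = raw_df.iloc[[1, 2, 4]]
print(rows)
```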

1.3 Select both columns and rows

Getting rows 1–5 and the first 6 columns by location with iloc can also be achieved with loc, using the row index and column names. Remember that Python slicing does not include the ending index, so .iloc[1:6, …] only selects rows 1–5 by position, while .loc[1:5, …] slices by label and includes both endpoints.
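A sketch comparing the two forms on a toy frame with the default index (made-up values; column names other than SizeRank and 2018-04 are assumptions):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionID": [101, 102, 103, 104, 105, 106],
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "State": ["NY", "CA", "IL", "TX", "PA", "AZ"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
    "2018-03": [480000, 440000, 220000, 180000, 150000, 230000],
    "2018-04": [490000, 460000, 225000, 182000, 151000, 232000],
})

# Position based: rows 1..5 (6 is excluded), columns 0..5 (6 is excluded).
by_position = raw_df.iloc[1:6, :6]

# Label based: rows labelled 1..5 and columns from "RegionID" to "2018-04",
# with both ends of each slice included.
by_label = raw_df.loc[1:5, "RegionID":"2018-04"]

print(by_position.equals(by_label))
```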

What's the difference between iloc and loc?

To demonstrate the difference better, we change the index from the default order to the 'SizeRank' column, which is the rank of the size of the region.

Select by index [1,2,4]: it gives you the rows whose index (size rank) is 1, 2 or 4.

Select using location [1,2,4]:

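The contrast can be sketched on a toy frame (made-up values) once SizeRank becomes the index:

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
})

# Use the size rank as the index instead of the automatic row number.
ranked = raw_df.set_index("SizeRank")

# .loc[] looks up index LABELS: the regions ranked 1, 2 and 4.
by_label = ranked.loc[[1, 2, 4]]

# .iloc[] looks up POSITIONS: the 2nd, 3rd and 5th rows.
by_position = ranked.iloc[[1, 2, 4]]

print(by_label["RegionName"].tolist())
print(by_position["RegionName"].tolist())
```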

1.4 Get one specific cell by location
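One way to do this, sketched on a toy frame with made-up values, is to pass a row position and a column position to .iloc[] (or to .iat[], which is meant for single values):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago"],
    "State": ["NY", "CA", "IL"],
})

# Row position 1, column position 0.
cell = raw_df.iloc[1, 0]

# .iat[] does the same lookup for a single scalar.
same_cell = raw_df.iat[1, 0]

print(cell, same_cell)
```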
1.5 Example in machine learning process: slicing data for features matrix (X), and response vector (y)

If you want to see whether monthly rent can be used as training data to identify the state, then your X is the monthly rent and y is the state. (This is just an example of slicing data into features and a response variable; you can try to see whether this prediction works.)

dataframe.values gives you the data as an array, which you can use directly in sklearn as X and y.
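A minimal sketch of that slicing, assuming the 2018-04 column as the only feature and a State column as the response (toy data, made-up values):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "State": ["NY", "CA", "IL", "TX"],
    "2018-04": [490000, 460000, 225000, 182000],
})

# Double brackets keep X two-dimensional: (n_samples, n_features).
X = raw_df[["2018-04"]].values

# Single brackets give a one-dimensional response vector: (n_samples,).
y = raw_df["State"].values

print(X.shape, y.shape)
```

sklearn estimators expect X to be two-dimensional even with a single feature, which is why the feature column is selected with a list.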

1.6 Subset based on conditions

If we want to select the top 10 biggest regions:

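A sketch of one way to express this, assuming SizeRank 1 marks the biggest region (toy data with placeholder region names):

```python
import pandas as pd

# Toy stand-in: twelve placeholder regions ranked 1..12.
raw_df = pd.DataFrame({
    "RegionName": [f"Region {i}" for i in range(1, 13)],
    "SizeRank": list(range(1, 13)),
})

# The comparison yields a Boolean Series; rows where it is True are kept.
top10 = raw_df[raw_df["SizeRank"] <= 10]
print(len(top10))
```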

Other variations:

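A few possible variations, sketched on toy data; these are illustrative choices and not necessarily the ones the original tutorial showed:

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "State": ["NY", "CA", "IL", "TX", "PA", "AZ"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
})

# Combine conditions with & (and) or | (or); each one needs parentheses.
top3_not_ny = raw_df[(raw_df["SizeRank"] <= 3) & (raw_df["State"] != "NY")]

# Keep rows whose value is in a list.
ca_or_tx = raw_df[raw_df["State"].isin(["CA", "TX"])]

# Negate a condition with ~.
not_top3 = raw_df[~(raw_df["SizeRank"] <= 3)]

print(top3_not_ny, ca_or_tx, not_top3, sep="\n\n")
```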

What happens if we apply a rule on the entire dataframe? It won't filter out rows or columns, but will show NaN for the cells that don't meet the requirements:
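A small sketch of this behaviour on a numeric toy frame (made-up values; the threshold is arbitrary):

```python
import pandas as pd

# Toy numeric stand-in for the Zillow data; values are made up.
num_df = pd.DataFrame({
    "2018-03": [480000, 220000, 150000],
    "2018-04": [490000, 225000, 151000],
})

# A Boolean DataFrame used as a mask keeps the shape of the frame and
# replaces every cell that fails the rule with NaN.
masked = num_df[num_df > 200000]
print(masked)
```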

If we filter on a function of a column's values:

What's a lambda function?

Lambda functions can be used wherever function objects are required. A lambda is anonymous, but you can assign it to a variable, for example:

You can set f = lambda x: max(x) - min(x). Here we filter for the regions whose SizeRank is an even number.
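Both ideas can be sketched together, assuming a toy frame with made-up values:

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
})

# A lambda assigned to a name behaves like any other function.
f = lambda x: max(x) - min(x)
spread = f([3, 9, 4])

# apply() calls the lambda on each SizeRank value and returns a Boolean
# Series, which then filters the rows.
even_rank = raw_df[raw_df["SizeRank"].apply(lambda rank: rank % 2 == 0)]

print(spread)
print(even_rank)
```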

Use lambda to apply a rule on more than one column:

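As an illustration, here is a sketch with a hypothetical rule (keep regions whose value rose by at least 10,000 between two months); the rule and the data are made up for the example:

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago", "Houston"],
    "2018-03": [480000, 440000, 220000, 180000],
    "2018-04": [490000, 460000, 225000, 182000],
})

# axis=1 hands each ROW to the lambda, so it can read several columns.
rose_a_lot = raw_df[["2018-03", "2018-04"]].apply(
    lambda row: row["2018-04"] - row["2018-03"] >= 10000, axis=1
)

selected = raw_df[rose_a_lot]
print(selected)
```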

Examples of filtering both columns and rows

It gives an error if we run raw_df[raw_df.loc[0]>450000] because there are non-numeric columns like state or city. Using what we learned from my last article, we select numerical columns only.

If we want to select the data ranked top 5 in size, and only keep the months when the rent is greater than 450,000 for the first row (index == 0):
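A sketch of this two-sided filter, assuming select_dtypes() is used to keep the numeric columns (toy data, made-up values):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionID": [101, 102, 103, 104, 105, 106],
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
    "2018-03": [480000, 440000, 220000, 180000, 150000, 230000],
    "2018-04": [490000, 460000, 225000, 182000, 151000, 232000],
})

# Keep only the numeric columns so the comparison below is valid.
num_df = raw_df.select_dtypes(include="number")

# Left of the comma: a Boolean Series over ROWS (top 5 by size).
# Right of the comma: a Boolean Series over COLUMNS (row 0 above 450,000).
subset = num_df.loc[num_df["SizeRank"] <= 5, num_df.loc[0] > 450000]
print(subset)
```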

Now we go back to raw_df with all the columns, select the data ranked top 5 in size, and keep only the string columns this time.
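One way to sketch this, assuming "string columns" means every non-numeric column (toy data, made-up values):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago",
                   "Houston", "Philadelphia", "Phoenix"],
    "State": ["NY", "CA", "IL", "TX", "PA", "AZ"],
    "SizeRank": [1, 2, 3, 4, 5, 6],
    "2018-04": [490000, 460000, 225000, 182000, 151000, 232000],
})

# Boolean Series over the columns: True where the column is not numeric.
is_text = pd.Series(
    [not pd.api.types.is_numeric_dtype(raw_df[col]) for col in raw_df.columns],
    index=raw_df.columns,
)

# Rows: top 5 by size. Columns: the non-numeric ones.
text_top5 = raw_df.loc[raw_df["SizeRank"] <= 5, is_text]
print(text_top5)
```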

For this type of filtering to work, the 2 elements inside the [] have to each yield a series of Boolean results (true, false) on their own. Otherwise it won’t work.

For example:

num_df.loc[num_df['SizeRank']<=5, num_df.loc[0:3]>450000.0]

will fail, because num_df.loc[0:3]>450000.0 doesn't give a one-dimensional series of Booleans; it's a two-dimensional DataFrame of Booleans.

A format like df.loc[df.A>0, df.loc['index']>0] will work because each condition deals with only one column or one row, so it's selecting by 2 series of Booleans.
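The difference in shape between the two kinds of condition can be checked directly; a sketch on toy numeric data (made-up values):

```python
import pandas as pd

# Toy numeric stand-in for the Zillow data; values are made up.
num_df = pd.DataFrame({
    "SizeRank": [1, 2, 3, 4, 5, 6],
    "2018-03": [480000, 440000, 220000, 180000, 150000, 230000],
    "2018-04": [490000, 460000, 225000, 182000, 151000, 232000],
})

# One row compared with a number: a one-dimensional Boolean Series.
one_row = num_df.loc[0] > 450000.0

# Several rows compared with a number: a two-dimensional Boolean DataFrame,
# which cannot serve as the column selector inside .loc[].
several_rows = num_df.loc[0:3] > 450000.0

print(type(one_row).__name__, one_row.ndim)
print(type(several_rows).__name__, several_rows.ndim)

# With two one-dimensional conditions the selection works.
subset = num_df.loc[num_df["SizeRank"] <= 5, one_row]
```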

Be careful of the syntax!

Passing a condition on columns as the only argument to .loc[] gives an error, because this format assumes the condition is about rows while the command is actually selecting columns. .loc[] needs a : on the left side if the condition is about columns.

If the condition is about rows, you can ignore the : on the right side.

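A sketch of the three cases on toy numeric data (made-up values); the failing call is wrapped in try/except so the snippet runs to the end:

```python
import pandas as pd

# Toy numeric stand-in for the Zillow data; values are made up.
num_df = pd.DataFrame({
    "SizeRank": [1, 2, 3],
    "2018-03": [480000, 440000, 220000],
    "2018-04": [490000, 460000, 225000],
})

column_condition = num_df.loc[0] > 450000  # Boolean Series over columns

# Wrong: a lone argument is read as a ROW selector, and a Series indexed
# by column names cannot be aligned with the row index.
error_name = None
try:
    num_df.loc[column_condition]
except Exception as err:
    error_name = type(err).__name__
print("column condition without ':' raised", error_name)

# Right: ":" on the left keeps all rows, the condition picks columns.
cols_ok = num_df.loc[:, column_condition]

# A row condition can stand alone; the trailing ", :" is optional.
rows_ok = num_df.loc[num_df["SizeRank"] <= 2]
```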

2. Assign, map and transform data to the ideal scale

2.1 Assign Values

Use .copy() if you want to copy the data for some transformation while still keeping the original data untouched.

We are going to use this copied dataframe to practice assigning values.

  • Assign values to rows using .loc[] or .iloc[]

  • Assign values to columns

  • Create a new column by assigning values by condition
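The three kinds of assignment can be sketched as follows, on a copy of a toy frame. The new values, the Country and Expensive column names, and the use of np.where for the condition are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago"],
    "2018-04": [490000, 460000, 225000],
})

# Work on a copy so raw_df stays untouched.
df = raw_df.copy()

# Rows: set one cell by label, another by position.
df.loc[0, "2018-04"] = 500000
df.iloc[1, df.columns.get_loc("2018-04")] = 470000

# Columns: a single value is broadcast to every row.
df["Country"] = "US"

# New column by condition: "yes" where the value is above 400,000.
df["Expensive"] = np.where(df["2018-04"] > 400000, "yes", "no")

print(df)
```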

Create a new column by using existing columns: Map or Apply

  • Map: useful when many values need to be changed, by creating a dictionary
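A sketch of map with a dictionary, assuming we want full state names from abbreviations (the dictionary and the StateName column are made up for the example):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago"],
    "State": ["NY", "CA", "IL"],
})

# Each key is an existing value, each dictionary value its replacement.
state_names = {"NY": "New York", "CA": "California", "IL": "Illinois"}

df = raw_df.copy()
df["StateName"] = df["State"].map(state_names)
print(df)
```

Values missing from the dictionary come back as NaN, so it is worth checking that the dictionary covers every value in the column.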

2.2 Map: it iterates over each element of a series, but only one series. We can use map to change values in one column.

For example, when we index a column like this: raw_df['2018-04'], it is a series, so we can use map to change the unit of the values in 2018-04 to thousands by multiplying each value in the series by 0.001:
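Sketched on a toy frame (made-up values):

```python
import pandas as pd

# Toy stand-in for the Zillow data; values are made up.
raw_df = pd.DataFrame({
    "RegionName": ["New York", "Los Angeles", "Chicago"],
    "2018-04": [490000, 460000, 225000],
})

# map() calls the lambda once per element of this one series.
in_thousands = raw_df["2018-04"].map(lambda x: x * 0.001)
print(in_thousands)
```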

If we want to change more than one column to thousands, use applymap.

2.3 ApplyMap: This helps to apply a function to each ELEMENT of the dataframe.

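A sketch on toy numeric data (made-up values). Note that pandas 2.1 renamed DataFrame.applymap to DataFrame.map and deprecated the old name, so this example picks whichever one the installed version provides:

```python
import pandas as pd

# Toy numeric stand-in for the Zillow data; values are made up.
num_df = pd.DataFrame({
    "2018-03": [480000, 220000, 150000],
    "2018-04": [490000, 225000, 151000],
})

# Element-wise method: DataFrame.map on pandas >= 2.1, applymap before that.
elementwise = num_df.map if hasattr(num_df, "map") else num_df.applymap

# The lambda is called on every single cell of the frame.
in_thousands = elementwise(lambda x: x * 0.001)
print(in_thousands)
```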

2.4 Apply: use it when we need to apply a function to one or more columns more specifically.

As the name suggests, it applies a function ALONG any axis of the DataFrame.

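A sketch of apply along each axis, using a range (max minus min) function as the example on toy numeric data with made-up values:

```python
import pandas as pd

# Toy numeric stand-in for the Zillow data; values are made up.
num_df = pd.DataFrame({
    "2018-03": [480000, 220000, 150000],
    "2018-04": [490000, 225000, 151000],
})

# axis=0 (the default): the function receives one COLUMN at a time.
range_per_column = num_df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives one ROW at a time.
range_per_row = num_df.apply(lambda row: row.max() - row.min(), axis=1)

print(range_per_column)
print(range_per_row)
```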

Review: what's the difference between map, applymap and apply?

map: operation on every element in one series, or one column of a df

applymap: every element in a df (same operation for elements in all the columns and rows)

apply: an operation that takes multiple columns from a df

Special form of apply: df[['col1','col2']].apply(sum) returns the sum of the values in each of col1 and col2.

  • Special form of apply in pandas to get aggregated values:

Or use agg to get more types of descriptive statistics:

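Both forms can be sketched on toy numeric data (made-up values; the list of statistics passed to agg is just an example):

```python
import pandas as pd

# Toy numeric stand-in for the Zillow data; values are made up.
df = pd.DataFrame({
    "2018-03": [480000, 220000, 150000],
    "2018-04": [490000, 225000, 151000],
})

# apply(sum) returns one total per selected column.
totals = df[["2018-03", "2018-04"]].apply(sum)

# agg() takes a list of statistics and returns one row per statistic.
stats = df[["2018-03", "2018-04"]].agg(["sum", "mean", "max"])

print(totals)
print(stats)
```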

2.5 Use apply to rescale data for machine learning:

Normalize and Standardize data in Python (you can use standard scaler from sklearn, but this is the concept).

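A sketch of the concept, assuming "normalize" means min-max scaling to the 0 to 1 range and "standardize" means z-scores (toy numeric data, made-up values):

```python
import pandas as pd

# Toy numeric stand-in for the Zillow data; values are made up.
num_df = pd.DataFrame({
    "2018-03": [480000, 220000, 150000],
    "2018-04": [490000, 225000, 151000],
})

# Normalize: rescale each column to the range [0, 1].
normalized = num_df.apply(lambda col: (col - col.min()) / (col.max() - col.min()))

# Standardize: subtract the column mean, divide by the standard deviation.
# pandas' std() uses the sample formula (ddof=1), while sklearn's
# StandardScaler uses the population formula (ddof=0), so the numbers
# differ slightly between the two.
standardized = num_df.apply(lambda col: (col - col.mean()) / col.std())

print(normalized)
print(standardized)
```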

That’s it for the second part of my series on building muscle memory for data science in Python. The first part is linked at the end.

Stay tuned! My next tutorial will show you how to ‘curl the data science muscles’ for joining and pivoting data.

Follow me and give me a few claps if you find this helpful :)

You might also be interested in my analysis on rental seasonality:

How to Analyze Rental Seasonality and Trend to Save Money on Your Lease. When I was looking for a new apartment to rent, I started to wonder: is there any seasonality impact? Is there a month… (medium.freecodecamp.org)

Translated from: https://www.freecodecamp.org/news/how-to-build-your-data-science-muscle-memory-slicing-and-mapping-data-for-machine-learning-d38e65986c69/
