How to Use Pandas Sample to Select Rows and Columns

In this tutorial we will learn how to use Pandas sample to randomly select rows and columns from a Pandas dataframe. There are several reasons to randomly sample our data; for instance, we may have a very large dataset and want to build our models on a smaller sample of the data. Other examples are bootstrapping and cross-validation. Here we will learn how to select rows at random, set a random seed, sample by group, sample using weights, and sample with conditions, among other useful things.

How to Take a Random Sample of Rows

In this section we are going to learn how to take a random sample of a Pandas dataframe. We are going to use an Excel file that can be downloaded here. First, we start by importing Pandas and we use read_excel to load the Excel file into a dataframe:

import pandas as pd

df = pd.read_excel('MLBPlayerSalaries.xlsx')
df.head()

We use the shape attribute to see how many rows and columns we have in our dataframe. It is very similar to the dim function in the R statistical programming language.

df.shape

  • Read the Pandas Excel Tutorial to learn more about loading Excel files into Pandas dataframes.

Now we know how many rows and columns there are (19543 rows and 5 columns) and we will continue by using Pandas sample. In the example below we are not going to use any parameters. The default behavior, when not using any parameters, is to sample one row:

df.sample()

In most cases we want to take a random sample of more than one row. Thus, in the next Pandas sample example we are going to take a random sample of size 200. We are going to use the parameter n to accomplish this:

df.sample(n=200).head(10)

In the code above we also used the head method to print only the first 10 of the randomly sampled rows. In most cases, we may want to save the randomly sampled rows. To accomplish this, we will create a new dataframe:

df200 = df.sample(n=200)
df200.shape
# Output: (200, 5)

In the code above we created a new dataframe, called df200, with 200 randomly selected rows. Again, we used the shape attribute to see how many rows (and columns) we now have.
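
For readers without the Excel file at hand, the same calls can be tried on a small synthetic dataframe. The sketch below is only an illustration: the dataframe toy, its column names, and its values are made up and are not the salary data used in this tutorial.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the salary data (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Year": rng.integers(1988, 2017, size=1000),
    "Player": [f"Player {i % 50}" for i in range(1000)],
    "Salary": rng.integers(100_000, 5_000_000, size=1000),
})

one_row = toy.sample()          # default: a single random row
sample_200 = toy.sample(n=200)  # 200 distinct rows, original index labels kept

print(one_row.shape)               # (1, 3)
print(sample_200.shape)            # (200, 3)
print(sample_200.index.is_unique)  # True, because sampling is without replacement
```

Note that the sampled rows keep their original index labels, which is what makes it possible to look them up (or drop them) in the original dataframe later.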

Random Sampling Rows using NumPy Choice

The Pandas sample method is of course the most convenient way to take a random sample of rows. Note, however, that it’s also possible to use NumPy and random.choice. In the example below we draw 200 random rows, as above, by using np.random.choice.

As usual when working with Python modules, we start by importing NumPy. We then create an array of randomly chosen index labels and use the Pandas loc method to select the rows with those labels. Note that np.random.choice samples with replacement by default, so we pass replace=False to avoid drawing the same row twice:

import numpy as np

rows = np.random.choice(df.index.values, 200, replace=False)
df200 = df.loc[rows]
df200.head()
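
One detail that is easy to miss with np.random.choice is its replace argument, which defaults to True, whereas Pandas sample draws without replacement by default. The sketch below contrasts the two settings on a made-up dataframe (toy is hypothetical, and the seed is there only to make the run repeatable).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Salary": np.arange(1000)})  # made-up data

np.random.seed(0)  # seed only so the sketch is repeatable

# Default behaviour: the same label may be drawn more than once.
with_dupes = np.random.choice(toy.index.values, 200)
# replace=False mirrors the default behaviour of DataFrame.sample.
no_dupes = np.random.choice(toy.index.values, 200, replace=False)

print(len(np.unique(with_dupes)))  # at most 200, usually fewer
print(len(np.unique(no_dupes)))    # 200

picked = toy.loc[no_dupes]
print(picked.shape)                # (200, 1)
```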

How to Sample Pandas Dataframe using frac

Now that we have used NumPy we will continue this Pandas dataframe sample tutorial by using sample’s frac parameter. This parameter specifies the fraction of rows to return in the random sample, given as a number between 0 and 1 (e.g., .5 for 50%). This means that setting frac to 1 (frac=1) will return all rows, in random order. That is, if we just want to shuffle the dataframe, it can be done using sample and the parameter frac.

df.sample(frac=1).head()

As can be seen in the output, the order of the rows is now random. We can use shape, again, to see that we have the same number of rows:

df.sample(frac=1).shape
# Output: (19543, 5)

As expected there are as many rows and columns as in the original dataframe.
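
As a small check that frac=1 only reorders the rows, the sketch below shuffles a made-up five-row dataframe (toy is hypothetical) and compares the index labels before and after. It also shows reset_index(drop=True), which is one way to renumber the rows once the old labels are no longer needed.

```python
import pandas as pd

toy = pd.DataFrame({"Player": list("ABCDE"), "Salary": [5, 4, 3, 2, 1]})

shuffled = toy.sample(frac=1, random_state=42)
print(sorted(shuffled.index) == list(toy.index))  # True: same rows, new order

# Renumber the rows 0..n-1 if the old labels are not needed.
renumbered = shuffled.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2, 3, 4]
```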

How to Shuffle Pandas Dataframe using NumPy

Here we will use another method to shuffle the dataframe. In the example code below we will use the Python module NumPy again. We have to use reindex (Pandas) and random.permutation (NumPy). More specifically, we will permute the dataframe using its index:

df_shuffled = df.reindex(np.random.permutation(df.index))

We can also use frac to get 200 randomly selected rows. Before doing this we will, of course, need to calculate what fraction of our total number of rows 200 is. In this case it’s approximately 1% of the data (200 / 19543 ≈ .01023), and using the code below will also give us 200 random rows from the dataframe.

df200 = df.sample(frac=.01023)

Note that the frac parameter cannot be used together with n. If we pass both, we will get a ValueError that states that we cannot enter a value for both frac and n.
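
The fraction does not have to be worked out by hand. The sketch below derives it from the desired number of rows and then shows the error raised when n and frac are passed together. The dataframe toy is a made-up stand-in with the same number of rows as the salary data, and the names n_wanted and both_error are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"Salary": range(19543)})  # made-up data, same row count

n_wanted = 200
frac = n_wanted / len(toy)
print(round(frac, 5))  # 0.01023

by_frac = toy.sample(frac=frac)
print(len(by_frac))    # 200

# Passing both n and frac is rejected.
both_error = None
try:
    toy.sample(n=200, frac=0.5)
except ValueError as err:
    both_error = err
    print("ValueError:", err)
```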

Pandas Sample with Replacement

We can also, of course, sample with replacement. By default Pandas sample will sample without replacement. In some cases we have to sample with replacement (e.g., when bootstrapping, or when we need more rows than the dataframe contains). If we want to sample with replacement we should use the replace parameter:

df5 = df.sample(n=5, replace=True)
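
To make the difference concrete, the sketch below draws 10 rows from a made-up three-row dataframe (toy is hypothetical). With replace=True this works and rows repeat; with the default replace=False, Pandas raises a ValueError because the sample would be larger than the dataframe.

```python
import pandas as pd

toy = pd.DataFrame({"Salary": [10, 20, 30]})  # made-up data

# With replacement: 10 draws from 3 rows, so rows must repeat.
boot = toy.sample(n=10, replace=True, random_state=0)
print(len(boot))             # 10
print(boot.index.is_unique)  # False

# Without replacement the same request fails.
too_many_error = None
try:
    toy.sample(n=10)
except ValueError as err:
    too_many_error = err
    print("ValueError:", err)
```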

Sample Dataframe with Seed

If we want to be able to reproduce our random sample of rows we can use the random_state parameter. This is the seed for the random number generator; here we pass an integer:

df200 = df.sample(n=200, random_state=1111)

We can, of course, use the parameters frac and random_state, or n and random_state, together. In the example below we randomly select 50% of the rows and use random_state. It is also possible to use replace=True together with frac and random_state to get a reproducible fraction of rows sampled with replacement.

df200 = df.sample(frac=.5, replace=True, random_state=1111)
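
A quick way to convince ourselves that random_state makes the sample reproducible is to draw twice with the same seed and compare the results. The sketch below does this on a made-up dataframe (toy is hypothetical; the seed value itself is arbitrary).

```python
import pandas as pd

toy = pd.DataFrame({"Salary": range(100)})  # made-up data

first = toy.sample(n=5, random_state=1111)
second = toy.sample(n=5, random_state=1111)
print(first.equals(second))  # True: same seed, same rows in the same order

# The same holds when frac and replace are combined with a seed.
half_a = toy.sample(frac=.5, replace=True, random_state=1111)
half_b = toy.sample(frac=.5, replace=True, random_state=1111)
print(half_a.equals(half_b))  # True
```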

Pandas Sample with Weights

The sample method also has the parameter weights, which can be used if we want to increase the probability of certain rows being sampled. We start the next Pandas sample example by importing NumPy.

import numpy as np

df['Weights'] = np.where(df['Year'] <= 2000, .75, .25)
df['Weights'].unique()

# Output: array([0.75 , 0.25])

In the code above we used NumPy’s where to create a new column ‘Weights’. Rows up to and including the year 2000 get the weight .75 and later rows get .25. This increases the probability that Pandas sample selects rows from 2000 or earlier:

df2 = df.sample(frac=.5, random_state=1111, weights='Weights')
df2.shape

# Output: (9772, 6)
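
The effect of the weights is easier to see on data where the two groups are equally large. In the sketch below, half of the made-up rows are from 1995 and half from 2010 (toy and its values are hypothetical), so an unweighted sample would contain about 50% early rows. The weights do not need to sum to 1, since Pandas normalizes them.

```python
import numpy as np
import pandas as pd

# Made-up data: 5000 rows from 1995 and 5000 rows from 2010.
toy = pd.DataFrame({"Year": np.repeat([1995, 2010], 5000)})
toy["Weights"] = np.where(toy["Year"] <= 2000, 0.75, 0.25)

picked = toy.sample(n=1000, weights="Weights", random_state=1111)
share_early = (picked["Year"] <= 2000).mean()
print(share_early)  # well above 0.5, since early rows are three times as likely
```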

Pandas Sample by Group

It’s also possible to sample from each group after we have used the Pandas groupby method. In the example below we are going to group the dataframe by player and then take 2 random rows from each player:

grouped = df.groupby('Player')
grouped.apply(lambda x: x.sample(n=2, replace=True)).head()

The code above may need some clarification. In the second line, we used the Pandas apply method and an anonymous Python function (lambda). What it will do is run sample on each subset (i.e., for each Player) and take 2 random rows. Note that replace=True is needed here only because some players may have fewer than 2 rows; sampling 2 rows without replacement from such a group raises a ValueError.
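
Two alternatives may be worth knowing about. Assuming pandas 1.1 or later, groupby objects have their own sample method, which avoids apply. And if duplicated rows are not wanted, one option is to cap n at the group size instead of using replace=True. The sketch below shows both on made-up data; toy, two_each, and capped are illustrative names, not part of the original example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "C"],
    "Salary": [1, 2, 3, 4, 5, 6],
})  # made-up data; player C has a single row

# Option 1 (assumes pandas >= 1.1): sample directly on the groupby object.
two_each = toy.groupby("Player").sample(n=2, replace=True, random_state=0)
print(two_each.groupby("Player").size().tolist())  # [2, 2, 2]

# Option 2: no replacement, take at most 2 rows per player.
capped = pd.concat([
    group.sample(n=min(len(group), 2), random_state=0)
    for _, group in toy.groupby("Player")
])
print(capped["Player"].tolist())  # ['A', 'A', 'B', 'B', 'C']
```

With the second option small groups simply contribute fewer rows, so the result contains no repeated rows.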

Pandas Random Sample with Condition

Say that we want to take a random sample of players with a salary under 421000 (or, more precisely, of rows where the salary is under this number, which may be only certain years for some players). This is quite easy; in the example below we sample 10% of the rows that meet this condition.

df[df['Salary'] < 421000].sample(frac=.1).head()

It’s also possible to have more than one condition. We just have to add some code to the above example. Now we are going to sample salaries under 421000 and prior to the year 2000:

df[(df['Salary'] < 421000) & (df['Year'] < 2000)].sample(frac=.1).head()
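
When conditions are combined, it can help to build the boolean mask as a separate step so that the filtered rows can be inspected before sampling. The sketch below does this on made-up data (toy, mask, and eligible are illustrative names). Each comparison needs its own parentheses because & binds more tightly than < in Python.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1995, 1998, 2005, 1999, 2010, 1997],
    "Salary": [300_000, 500_000, 200_000, 400_000, 100_000, 410_000],
})  # made-up data

# Rows with a salary under 421000 AND a year before 2000.
mask = (toy["Salary"] < 421_000) & (toy["Year"] < 2000)
eligible = toy[mask]
print(len(eligible))  # 3

picked = eligible.sample(n=2, random_state=0)
print((picked["Salary"] < 421_000).all())  # True
print((picked["Year"] < 2000).all())       # True
```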

Using Pandas Sample and Remove

We may want to take a random sample from our dataframe and remove those rows. Or maybe we want to split the data into two different dataframes. Both of these things can, of course, be done using sample and the drop method. In the code example below we create two new dataframes; one with 80% of the rows and one with the remaining 20%.

df1 = df.sample(frac=0.8, random_state=138)
df2 = df.drop(df1.index)
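
This way of splitting relies on the index labels being unique, because drop removes every row whose label appears in the sample. The sketch below checks on made-up data that the two parts do not overlap and together cover the whole dataframe; the names toy, train, and test are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"Salary": range(100)})  # made-up data with a unique index

train = toy.sample(frac=0.8, random_state=138)
test = toy.drop(train.index)

print(len(train), len(test))                              # 80 20
print(train.index.intersection(test.index).empty)         # True: no overlap
print(sorted(train.index.union(test.index)) == list(toy.index))  # True: nothing lost
```

If the index has duplicate labels, calling reset_index(drop=True) on the dataframe first is one way to make the split safe.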

If we merely want to remove random rows we can use drop and the inplace parameter:

df.drop(df1.index, inplace=True)
df.shape

# Same as: df.drop(df.sample(frac=0.8, random_state=138).index,
#                  inplace=True)
# Output: (3909, 5)

Saving the Pandas Sample

Finally, we may also want to save the sample to work on later. In the example code below we are going to save a Pandas sample to csv. To accomplish this we use the to_csv method. The first parameter is the filename and, because we don’t want an index column in the file, we use index=False (index_col is a parameter of read_csv, not of to_csv).

import pandas as pd

df = pd.read_excel('MLBPlayerSalaries.xlsx')

df.sample(200, random_state=1111).to_csv('MBPlayerSalaries200Sample.csv',
                                         index=False)
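
To see what index=False does, the sketch below writes a small sample to a temporary file and reads it back. The dataframe toy and the file name sample.csv are made up for the illustration; the file is written to a temporary directory that is deleted afterwards.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Player": ["A", "B", "C", "D"], "Salary": [1, 2, 3, 4]})
sample = toy.sample(n=2, random_state=1111)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical file name
    sample.to_csv(path, index=False)         # do not write the index labels
    restored = pd.read_csv(path)

print(list(restored.columns))  # ['Player', 'Salary']
print(len(restored))           # 2
```

Without index=False the file would contain an extra unnamed first column holding the original row labels.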

Summary

In this brief Pandas tutorial we have learned how to use the sample method. More specifically, we have learned how to:

  1. take a random sample of data using the n (a number of rows) and frac (a fraction of rows) parameters,
  2. get reproducible results using a seed (random_state),
  3. sample by group, sample using weights, and sample with conditions,
  4. create two samples and delete random rows,
  5. save the Pandas sample.

That was it! Now we should know how to use Pandas sample.

Translated from: https://www.pybloggers.com/2018/11/how-to-use-pandas-sample-to-select-rows-and-columns/
