熊猫数据集_数据科学家的熊猫

最新推荐文章于 2024-03-23 09:45:59 发布

李_涛

最新推荐文章于 2024-03-23 09:45:59 发布

阅读量649

点赞数

文章标签： python 机器学习大数据人工智能

原文链接：https://medium.com/the-innovation/pandas-for-data-scientist-a2d8a8e81d04

版权

熊猫数据集

There are many tutorials about pandas on the internet and books. Pandas library is one of the most used libraries by data scientists and data engineers. In this tutorial, I am going to list the pandas’ functions I use the most.

互联网上有很多关于熊猫的教程和书籍。熊猫图书馆是数据科学家和数据工程师最常用的图书馆之一。在本教程中，我将列出我最常使用的熊猫功能。

While there are many functions to select here are my top ten most used ones. Please let me know yours in the comment section. I will be happy to add them to my collections. I will be focusing on pandas Data Frame rather than series.

虽然这里有很多功能可供选择，但它们是我最常使用的十大功能。请在评论部分告诉我您的信息。我很乐意将它们添加到我的收藏中。我将专注于熊猫数据框而不是系列。

Let’s starts with importing pandas.

让我们从导入熊猫开始。

import pandas as pd

Files Operation
文件操作

Excel

电子表格

import pandas as pd
 data = pd.read_excel(‘path_to_your_excel file’, sep=‘ ’, index_col=‘name’, dtype=dtypes, sheet_name=’’)

CSV

import pandas as pd
 data = pd.read_excel(‘path_to_your_excel file’, sep=‘ ’ , index_col=‘name’, dtype=dtypes)

Read HTML

阅读HTML

import pandas as pd
data = pd.read_html(‘data.html’, index_col=0)

Note: you can add datatype when you import the files. i.e and if you want to omit the header, set header=False.

注意：导入文件时可以添加数据类型。即，如果要省略标题，请设置header = False。

dtypes = {‘colname’: ’datatype’, ‘Weight’: ‘float32’}

Pandas considers these as missing values

熊猫认为这些是缺失的价值观

(’ ’),’nan’,’-nan’,’NA’,’N/A’,’NaN’,’null’

('')，'nan'，'-nan'，'NA'，'N / A'，'NaN'，'null'

When reading and writing files you can replace them like

读写文件时，您可以像替换它们一样

df.to_csv('new-data.csv', na_rep='(missing)')

2. Statistics in Pandas Data Frame

2.熊猫数据框中的统计数据

Sum

和

df['colName'].sum()

Mean

意思

df['colName'].mean()

Cumulative Sum

累计总和

df['colName'].cumsum()

Summary Statistics

统计摘要

df['colName'].describe()

Count

计数

df['colName'].count()

Min/Max

最小/最大

df['colName'].min()
df['colName'].max()

Median

中位数

df['colName'].median()

Sample Variance

样本差异

df['colName'].var()

Standard Deviation

标准偏差

df['colName'].std()

Skewness

偏度

df['colName'].skew()

Kurtosis

峰度

df['colName'].kurt()

Correlation Matrix Of Values

值的相关矩阵

df.corr()

3. Join Operations

3.加入运营

Pandas provides SQL like capabilities in large data sets

Pandas在大型数据集中提供类似SQL的功能

Simple Join

简单加入

pd.concat([df_a, df_b], axis=1)

Merge two data frames with condition_column value

合并两个带condition_column值的数据帧

pd.merge(df_new, df_n, on=‘join_condition_column_name’)

Merge with outer join

与外部联接合并

pd.merge(df_a, df_b, on=‘condition_column', how='outer’)

Merge with inner join

与内部联接合并

pd.merge(df_a, df_b, on=‘condition_column', how='inner’)

Merge with right join

以正确的合并合并

pd.merge(df_a, df_b, on=’common_column_id’, how=’right’)

Merge with left join

与左联接合并

pd.merge(df_a, df_b, on=‘common_column_id ‘, how=’left’)

Merge while adding a suffix to duplicate column names

在添加后缀以重复列名称时合并

pd.merge(df_a, df_b, on=’ common_column_id ‘, how=’left’, suffixes=(‘_left’, ‘_right’))

Merge based on indexes

根据索引合并

pd.merge(df_a, df_b, right_index=True, left_index=True)

4. Pandas Data Analysis

4.熊猫数据分析

See the first 10 entries

查看前10个条目

dataFrame.head(10)

See the last 10 entries

查看最近的10个条目

dataFrame.tail(10)

Total Number of records in Datasets

数据集中的记录总数

dataFrame.shape[0]

Number of columns in the datasets

数据集中的列数

dataframes.columns

Datasets indexed Detail

数据集索引明细

data.index

Summarize the Data Frame

汇总数据框

data.describe()

Summarize all the columns

汇总所有列

data.describe(include = “all”)

Mean value of column

列的平均值

round(data.columnname.mean())

Least occurred value in column

列中出现最少的值

data.columnnsme.value_counts().tail()

5. Slice and dice

5.切片切块

loc in Pandas : from https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

Loc in Pandas：来自https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

loc is label-based, which means that we have to specify the name of the rows and columns that we need to filter out.

loc 是基于标签的，这意味着我们必须指定需要过滤的行和列的名称。

iloc in Pandas : from https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

大熊猫iloc：来自https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

On the other hand, iloc is integer index-based. So here, we have to specify rows and columns by their integer index.

另一方面，iloc是基于整数索引的。 因此，在这里，我们必须通过整数索引指定行和列。

Examples

例子

import numpy as np # imported numpy to make arrays.
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
print(df)

Output :

输出：

# Using `iloc[]`
print(df.iloc[0][0])

Output: 1

输出1

# Using `loc[]`
print(df.loc[0][2])

Output : 3

输出3

# Using `at[]`
print(df.at[1,2])

Output : 6

输出6

# Using `iat[]`
print(df.iat[0,2])

Output : 3

输出3

6. Appending Column to existing dataframe

6.将列追加到现有数据框

Example

例

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])
print(df)# Append a column to `df`
df.loc[:, 'D'] = pd.Series(['5', '6' ,'7'], index=df.index)
print(df)

Output:

输出：

7. Dropping Duplicate value from dataFrames

7.从dataFrames中删除重复的值

Example

例

df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
print(df)# droping duplicate value from A and C
df = df.drop_duplicates(subset=['A', 'C'], keep=False) 
print(df)# froping the whole column in index 1, which remove the whole row
df = df.drop(df.index[[1]]) 
print(df)

Output :

输出：

8. Update Value

8.更新价值

Example

例

df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
print(df)# Replace strings by number (0-4)
df = df.replace(['foo', 'A','B' ], [10,90 , 80]) 
print(df)# Replace using `regex`
df = df.replace({r'[^0-9]+': 'apple'}, regex=True) 
print(df)

Output:

输出：

9. Apply Function

9.应用功能

we can make a custom function or use a lambda function to change the record in the data frame once and dynamically.

我们可以创建自定义函数或使用lambda函数一次动态地更改数据帧中的记录。

Example

例

df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
print(df)
print('\n')
doubler = lambda x: x*2
df = df.replace({r'[^0-9]+': 10}, regex=True)
df = df.apply(doubler)
print(df)

Output:

输出：

Creating new column from existing columns

从现有列创建新列

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(dfa)
dfa = dfa.assign(C=lambda x: x['A'] + x['B'],D=lambda x: x['A'] + x['C'])
dfa

Output:

输出：

10. Taking dictionary, series as input in data frame.

10.以字典，系列作为数据框中的输入。

Take a dictionary as input to your DataFrame

将字典作为DataFrame的输入

my_dict = {1: [‘1’, ‘3’], 2: [‘1’, ‘2’], 3: [‘2’, ‘4’]}
print(pd.DataFrame(my_dict))

Output:

输出：

Series as input to your Data Frame

系列作为数据框的输入

my_series = pd.Series({“England”:”London”, “Nepal”:”kathmandu”, “China”:”Baiging”, “Belgium”:”Brussels”})
print(pd.DataFrame(my_series))
print(len(my_series.shape))

Output:

输出：

There are so many functions i haven’t added in this list. But these are the ones I use the most. Panda is a rich library and it has very good documentation from the official web site. You can read from this link. https://pandas.pydata.org

我没有在列表中添加太多功能。但是这些是我使用最多的。熊猫图书馆是一个丰富的图书馆，它拥有来自官方网站的非常好的文档。您可以从此链接阅读。 https://pandas.pydata.org