Manipulating DataFrames with pandas (DataCamp)

Article link: https://blog.csdn.net/Detective_0/article/details/106166229

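This article shows how to extract and transform data in a DataFrame with the pandas library. It covers indexing with iloc and loc, filtering data, and the apply, applymap, and map transformation methods. It also touches on advanced indexing, hierarchical indexing, data reshaping (stacking, unstacking, and melting DataFrames), and groupby operations such as grouped aggregation and transformation.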


Extracting and transforming data

1.Indexing DataFrames

iloc (integer location) selects by integer position;
loc selects by row (index) labels and column labels.

df.loc[rowname,colname]
df.iloc[num,num]
df[colname]			#Series
df[[colname]]		#DataFrame

#sample
# Print the boolean equivalence
print(election.iloc[4, 4] == election.loc['Bedford', 'winner'])
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']

# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]
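
To make the access patterns above concrete, here is a minimal sketch on a made-up frame. The real `election` data is not reproduced in these notes, so the county names and vote figures below are invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `election` data (all values are invented)
election = pd.DataFrame(
    {
        "total": [100, 200, 150],
        "Obama": [60.0, 45.0, 52.0],
        "winner": ["Obama", "Romney", "Obama"],
    },
    index=["Adams", "Bedford", "Berks"],
)

by_label = election.loc["Bedford", "winner"]  # label-based lookup
by_position = election.iloc[1, 2]             # position-based lookup (row 1, column 2)
print(by_label == by_position)

as_series = election["winner"]    # single brackets -> Series
as_frame = election[["winner"]]   # double brackets -> one-column DataFrame

# Label slices with .loc include the end label, unlike positional slices
left_columns = election.loc[:, :"Obama"]
print(list(left_columns.columns))
```

Note that `.loc` slices are inclusive of the end label, while `.iloc` slices follow the usual Python half-open convention.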

2.Filtering

df[condition]

#sample
# Create the boolean array: 
condition = df['a'] > 70

# Filter the df DataFrame with the condition array: 
df_con = df[condition]
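
A slightly fuller sketch of boolean filtering, assuming a small invented frame with columns `a` and `b` (column `b` does not appear in the notes). It shows how several conditions can be combined and how the same mask works with `.loc`.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"a": [65, 80, 90, 72], "b": [1, 2, 3, 4]})

# A comparison on a column yields a boolean Series aligned with df's index
condition = df["a"] > 70

# Combine conditions with & (and), | (or), ~ (not); wrap each one in parentheses
both = df[(df["a"] > 70) & (df["b"] < 4)]

# .loc accepts the same mask and can pick columns at the same time
a_only = df.loc[condition, "a"]
print(both)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.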

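3.Transforming DataFrames

*apply: used on a DataFrame to run a computation over each row or each column;

*applymap: used on a DataFrame; an element-wise operation (pandas 2.1 and later rename it DataFrame.map and deprecate applymap);

*map: a pandas Series method (distinct from Python's built-in map); an element-wise operation.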

#sample
# Write a function to convert degrees Fahrenheit to degrees Celsius: 
def to_celsius(F):
    return 5/9*(F - 32)

# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': 
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)

# Reassign the column labels of df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']

# Print the output of df_celsius.head()
print(df_celsius.head())
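
The difference between the three methods is easier to see side by side. The sketch below uses a two-row invented `weather` frame; the `Events` column and the lookup dictionary are assumptions made for the example. The `hasattr` check is there only because the element-wise DataFrame method is named `map` in newer pandas and `applymap` in older releases.

```python
import pandas as pd

# Invented data for illustration
weather = pd.DataFrame(
    {"Mean TemperatureF": [32.0, 212.0], "Events": ["Rain", "Snow"]}
)

def to_celsius(F):
    return 5 / 9 * (F - 32)

# apply on a DataFrame: the function receives one whole column (Series) at a time
df_celsius = weather[["Mean TemperatureF"]].apply(to_celsius)

# Series.map: element-wise; it also accepts a dict used as a lookup table
codes = weather["Events"].map({"Rain": "wet", "Snow": "cold"})

# Element-wise over a whole DataFrame
temps = weather[["Mean TemperatureF"]]
elementwise = temps.map if hasattr(temps, "map") else temps.applymap
rounded = elementwise(lambda x: round(to_celsius(x)))
print(rounded)
```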

Advanced indexing

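1.Index objects

Both Series and DataFrame carry an index.

Immutable (like dictionary keys): individual entries cannot be changed in place, only the whole index can be replaced.

Homogeneous in data type.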

pd.read_csv(filename, index_col=___)
df.index = ___
df.index.name=___
df.columns.name=___

#sample
# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

# Assign months to sales.index
sales.index = months
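
The immutability point can be checked directly. This is a small sketch with invented `sales` values: replacing the whole index works, while assigning to a single entry raises a `TypeError`.

```python
import pandas as pd

# Invented data for illustration
sales = pd.DataFrame({"eggs": [47, 110, 221], "salt": [12.0, 50.0, 89.0]})

# Replacing the whole index is allowed
sales.index = ["Jan", "Feb", "Mar"]
sales.index.name = "month"
sales.columns.name = "products"

# Changing one entry in place is not: Index objects are immutable
try:
    sales.index[0] = "JAN"
    mutated = True
except TypeError:
    mutated = False

# To change labels, build a new index and assign it as a whole
sales.index = sales.index.str.upper()
print(sales)
```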

2.Hierarchical indexing

df = df.set_index([___,___,...])
df = df.sort_index()
df.index.names

#sample
print(sales.loc[['CA', 'TX']])

print(sales.loc['CA':'TX'])

# Access the inner month index and look up data for all states in month 2: 
all_month2 = sales.loc[(slice(None),2),:]
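
A runnable sketch of the lookups above, assuming a `sales` frame with a `state` and a `month` column (the values are invented). Sorting the index first matters, because label slicing on a MultiIndex expects it to be sorted.

```python
import pandas as pd

# Invented data for illustration
sales = pd.DataFrame(
    {
        "state": ["TX", "CA", "NY", "CA", "TX", "NY"],
        "month": [1, 1, 1, 2, 2, 2],
        "eggs": [132, 47, 221, 110, 205, 77],
    }
)

# Build a two-level index, then sort it so that label slicing works
sales = sales.set_index(["state", "month"]).sort_index()
print(sales.index.names)

ca_month1 = sales.loc[("CA", 1), "eggs"]      # one row: tuple of (outer, inner)
ca_to_ny = sales.loc["CA":"NY"]               # slice on the outer level
all_month2 = sales.loc[(slice(None), 2), :]   # every state, inner level == 2

# pd.IndexSlice is an equivalent spelling of slice(None) that reads like a slice
idx = pd.IndexSlice
same_month2 = sales.loc[idx[:, 2], :]
```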

Rearranging and reshaping data

1.Pivoting DataFrame

df_pivot = df.pivot(index='___', columns='___', values='___')
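
As an illustration of the call above, here is a sketch on an invented long-format table of visitors per weekday and city. It also shows the limitation of `pivot`: it raises a `ValueError` when an (index, columns) pair occurs more than once.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame(
    {
        "weekday": ["Mon", "Mon", "Tue", "Tue"],
        "city": ["Austin", "Dallas", "Austin", "Dallas"],
        "visitors": [139, 237, 326, 456],
    }
)

# One row per weekday, one column per city, cells filled from 'visitors'
df_pivot = df.pivot(index="weekday", columns="city", values="visitors")
print(df_pivot)

# pivot refuses duplicate (index, columns) pairs
dup = pd.concat([df, df.iloc[[0]]])
try:
    dup.pivot(index="weekday", columns="city", values="visitors")
    raised = False
except ValueError:
    raised = True
```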

2.Stacking & unstacking

df_pivot.unstack(level=___)				#move a row-index level into the columns
df_piv_uns.stack(level=___)				#move a column level back into the row index
df_sw = df_piv_uns.swaplevel(___, ___)		#swap two index levels
df_sorted = df_sw.sort_index()

#sample
byweekday = df.unstack(level='weekday')
print(byweekday.stack(level='weekday'))
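
The round trip between `unstack` and `stack` can be sketched as follows, assuming an invented frame indexed by `city` and `weekday`. Recent pandas versions may print a warning about the stack implementation; the result is the same for this data.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame(
    {
        "city": ["Austin", "Austin", "Dallas", "Dallas"],
        "weekday": ["Mon", "Tue", "Mon", "Tue"],
        "visitors": [139, 326, 237, 456],
    }
).set_index(["city", "weekday"])

byweekday = df.unstack(level="weekday")        # 'weekday' moves from rows to columns
restacked = byweekday.stack(level="weekday")   # and back into the row index

# Swap the two row levels, then sort so the new outer level is grouped
swapped = restacked.swaplevel(0, 1).sort_index()
print(swapped)
```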

3.Melting DataFrames

pd.melt(df, id_vars=[], value_vars=[], var_name=___, value_name=___)
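
Melting is the reverse of pivoting: wide columns are folded back into rows. A minimal sketch, assuming an invented wide table with one column per city:

```python
import pandas as pd

# Invented data for illustration
wide = pd.DataFrame(
    {"weekday": ["Mon", "Tue"], "Austin": [139, 326], "Dallas": [237, 456]}
)

long = pd.melt(
    wide,
    id_vars=["weekday"],              # columns kept as identifiers
    value_vars=["Austin", "Dallas"],  # columns folded into rows
    var_name="city",                  # name for the former column labels
    value_name="visitors",            # name for the former cell values
)
print(long)
```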

4.Pivot table

Use pivot_table when the index/column pairs contain duplicate entries; the duplicates are combined with aggfunc.

df_pt = df.pivot_table(index=___, columns=___, values=___, aggfunc=___)
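
A short sketch of how duplicates are handled, using invented data in which the pair ('Mon', 'Austin') appears twice. Combinations that never occur come out as NaN unless `fill_value` is given.

```python
import pandas as pd

# Invented data for illustration; ('Mon', 'Austin') appears twice
df = pd.DataFrame(
    {
        "weekday": ["Mon", "Mon", "Mon", "Tue"],
        "city": ["Austin", "Austin", "Dallas", "Dallas"],
        "visitors": [100, 39, 237, 456],
    }
)

# Duplicate entries are combined with the aggregation function
df_pt = df.pivot_table(index="weekday", columns="city", values="visitors", aggfunc="sum")
print(df_pt)

# Counting rows per cell, with missing combinations filled with 0
counts = df.pivot_table(
    index="weekday", columns="city", values="visitors", aggfunc="count", fill_value=0
)
print(counts)
```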

Grouping data

1.categoricals and groupby

df.groupby(colname)

2.groupby and aggregation

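With agg you specify the columns to aggregate, and each function receives one column at a time; apply passes the whole DataFrame of each group by default.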

df.groupby(colname).agg([___,___,...])

#sample
df.groupby('a')[['b', 'c']].agg({'b':'sum','c':data_range})

aggregated = titanic.groupby('pclass')[['age','fare']].agg(['max', 'median'])
print(aggregated.loc[:, ('age','max')])
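
The sample above relies on a `data_range` function that the notes do not define. The sketch below supplies one plausible definition (max minus min) as an assumption, and runs both aggregation styles on a few invented `titanic` rows.

```python
import pandas as pd

# Invented rows for illustration
titanic = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 2, 3],
        "age": [29.0, 40.0, 30.0, 18.0, 22.0],
        "fare": [211.0, 151.0, 13.0, 26.0, 7.0],
    }
)

def data_range(series):
    # Assumed definition: spread between the largest and smallest value
    return series.max() - series.min()

# A list of functions gives a column MultiIndex of (column, function)
aggregated = titanic.groupby("pclass")[["age", "fare"]].agg(["max", "median"])
max_age = aggregated.loc[:, ("age", "max")]

# A dict picks a different aggregation per column
custom = titanic.groupby("pclass").agg({"age": "max", "fare": data_range})
print(custom)
```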

3.groupby and transformation

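transform performs an element-wise transformation of the input within each group, handling one Series (column) at a time.

It returns a result with the same rows (same index) as the data passed in.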

df.groupby(___).transform(___)
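
A sketch of a grouped transformation, assuming an invented frame with `Company` and `Units` columns; the helper name `demean` is made up for the example. The result keeps one value per original row, so it can be assigned straight back as a new column.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame(
    {"Company": ["Acme", "Acme", "Hooli", "Hooli"], "Units": [10, 30, 5, 15]}
)

def demean(series):
    # Called once per group; returns a Series as long as its input
    return series - series.mean()

df["Units_demeaned"] = df.groupby("Company")["Units"].transform(demean)

# A string name broadcasts the group aggregate back to every row of the group
df["Company_total"] = df.groupby("Company")["Units"].transform("sum")
print(df)
```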

4.groupby and filtering

df.groupby(___).filter(___)

#sample
# Keep only the companies whose total 'Units' is > 35: by_com_filt
by_com_filt = df.groupby('Company').filter(lambda g: g['Units'].sum() > 35)
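
A runnable version of the filter, on invented `Company` and `Units` values. The function passed to `filter` receives each group's DataFrame and must return a single True or False; groups returning False are dropped whole. The second half shows an equivalent boolean mask built with `transform`.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame(
    {"Company": ["Acme", "Acme", "Hooli", "Hooli"], "Units": [10, 30, 5, 15]}
)

# Keep every row of the groups whose total Units exceeds 35
by_com_filt = df.groupby("Company").filter(lambda g: g["Units"].sum() > 35)
print(by_com_filt)

# Same selection expressed as a row mask
mask = df.groupby("Company")["Units"].transform("sum") > 35
same = df[mask]
```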