Data Analysis in Practice with Python

Thomas_Cai

Last modified 2024-03-19 11:00:41

Category: Python technical articles. Tags: data analysis, python, data mining

First published 2019-09-27 15:33:45

Article link: https://blog.csdn.net/ThomasCai001/article/details/101520912

Contents

I. pandas for data analysis
II. pandas practice notes (miscellaneous)
References

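I. pandas for data analysis

1. Displaying basic table information

import pandas as pd

pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_rows', None)  # show all rows
pd.set_option('expand_frame_repr', False)  # do not wrap wide frames across several lines
print(train_transaction.head(n))  # first n rows
print(train_transaction.tail(n))  # last n rows
# verbose=True lists every column; show_counts=True shows the non-null counts
# (show_counts replaces null_counts, which was removed in pandas 2.0)
train_transaction.info(verbose=True, show_counts=True)  # info() prints directly and returns None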

2. Handling missing values

2.1 Filling a single column

Median or mean (median, mean):

x_median = train_transaction.iloc[:, i].median()
train_transaction.iloc[:, i] = train_transaction.iloc[:, i].fillna(x_median)

Mode:

x_mode = train_transaction.iloc[:, i].mode()[0]
train_transaction.iloc[:, i] = train_transaction.iloc[:, i].fillna(x_mode)

2.2 Filling multiple columns

Median or mean (median, mean), for the non-object columns:

df_ma = df.columns[df.dtypes != 'object']
df.fillna(df.loc[:, df_ma].median(), inplace=True)

Mode, for the object columns:

df_ma = df.columns[df.dtypes == 'object']
df.fillna(df.loc[:, df_ma].mode().iloc[0], inplace=True)
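
As a rough illustration of the multi-column fill above, the sketch below runs it on a tiny made-up frame. The column names and values are hypothetical, and `select_dtypes` is used here instead of comparing `dtypes` to `'object'`; that is a substitution for this example, not the original approach.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two numeric columns and one text column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "fare": [7.0, 8.0, np.nan, 9.0],
    "port": ["S", None, "S", "C"],
})

num_cols = df.select_dtypes(include="number").columns
other_cols = df.select_dtypes(exclude="number").columns

# fillna with a Series fills each column using the value under its own label.
df = df.fillna(df[num_cols].median())
df = df.fillna(df[other_cols].mode().iloc[0])

print(df)
```

Passing a Series to `fillna` means each column receives its own statistic, so one call covers all numeric columns and a second call covers the remaining ones.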

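3. Selecting rows or columns

3.1 Selecting columns

(1) Direct indexing

df = pd.DataFrame()
df[col_name]  # selects a column; a slice such as df[0:2] selects rows instead

(2) loc (by index label)

df = pd.DataFrame()
df.loc[:, col_name]  # returns a Series
df.loc[:, [col_name]]  # returns a DataFrame
df.loc[:, col_name1:col_name2]  # DataFrame from col_name1 through col_name2, both included

(3) iloc (by index position)

df = pd.DataFrame()
df.iloc[:, col_pos]  # returns a Series
df.iloc[:, [col_pos]]  # returns a DataFrame
df.iloc[:, col_pos1:col_pos2]  # DataFrame from col_pos1 through col_pos2 - 1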

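3.2 Selecting rows

(1) ix (accepted either labels or positions). ix was deprecated in pandas 0.20 and removed in 1.0; use loc or iloc instead.

df = pd.DataFrame()
df.ix[row_label]  # only works on pandas < 1.0

(2) loc (by index label)

df = pd.DataFrame()
df.loc[row_label]  # returns a Series
df.loc[[row_label]]  # the extra brackets return a DataFrame
df.loc[row_label1:row_label2]  # DataFrame from row_label1 through row_label2, both included

(3) iloc (by index position)

df = pd.DataFrame()
df.iloc[row_pos]  # returns a Series
df.iloc[[row_pos]]  # returns a DataFrame
df.iloc[row_pos1:row_pos2]  # DataFrame from row_pos1 through row_pos2 - 1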
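
The difference between label slices and position slices is easy to get wrong, so here is a minimal sketch on a made-up 3x3 frame (the labels `x`, `y`, `z` and columns `a`, `b`, `c` are hypothetical).

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

by_label = df.loc["x":"y", "a":"b"]   # label slices include both endpoints
by_position = df.iloc[0:2, 0:2]       # position slices exclude the stop

row_as_series = df.loc["x"]           # single label -> Series
row_as_frame = df.loc[["x"]]          # list of labels -> DataFrame

print(by_label.equals(by_position))
print(type(row_as_series).__name__, type(row_as_frame).__name__)
```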

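3.3 Selecting a sub-table by condition

df = pd.DataFrame()
df.iloc[2, 4]  # row at position 2, column at position 4 (positions start at 0)
df.loc[df[col_name] == ' ']
df.loc[df[col_name].isnull()]  # rows where the column is null
df.loc[df[col_name].notnull()]  # rows where the column is not null
df.loc[condition, [col_name]]  # returns a DataFrame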
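
A small sketch of the boolean-mask selections listed above, on a hypothetical passenger table (names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "cabin": ["C85", np.nan, "E46"],
    "fare": [71.3, 7.9, 53.1],
})

missing_cabin = df.loc[df["cabin"].isnull()]                 # rows with no cabin
known_cabin_fares = df.loc[df["cabin"].notnull(), ["fare"]]  # list of columns -> DataFrame
expensive_names = df.loc[df["fare"] > 50, "name"]            # single column -> Series

print(missing_cabin)
print(known_cabin_fares)
print(expensive_names)
```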


def Family_feature(df):
    # Bucket family size (SibSp + Parch) into three levels: 1, 2, 3.
    df['Fam_Size'] = df['SibSp'] + df['Parch']
    # Use df.loc[mask, column] rather than df[column].loc[mask] = ...,
    # which is chained assignment and may not modify df.
    df.loc[df['Fam_Size'] == 0, 'Fam_Size'] = 1
    df.loc[(df['Fam_Size'] > 1) & (df['Fam_Size'] <= 3), 'Fam_Size'] = 2
    df.loc[df['Fam_Size'] > 3, 'Fam_Size'] = 3
    return df

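4. Data analysis techniques

Reference: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12282042.0.0.1dce20429Jt3oQ&postId=6772

4.1 groupby

Group by one column and compute the count or mean of another column. Screenshot from an Aliyun Tianchi analysis:

Note: the order of Sex and Survived in the groupby call matches the order in the result, and selecting ['Survived'] or ['Sex'] afterwards gives the same result.

[Figure: groupby screenshot]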
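
Since the screenshot is not available, the following is only a guess at the kind of call it showed, using made-up Titanic-style data with the `Sex` and `Survived` columns mentioned in the note.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 1, 0],
})

# Count rows per (Sex, Survived) pair; the result index follows the key order.
counts = df.groupby(["Sex", "Survived"])["Survived"].count()
counts_alt = df.groupby(["Sex", "Survived"])["Sex"].count()

# Mean of Survived per Sex, i.e. the survival rate.
rate = df.groupby("Sex")["Survived"].mean()

print(counts)
print(rate)
```

Because `count()` only counts non-null rows in each group, selecting either key column yields the same numbers here.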

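4.2 Violin plot

[Figure: violin plot screenshot]

4.3 Histogram and quantile plot

[Figure: histogram and quantile plot screenshot]

4.4 Distribution plot

[Figure: distribution plot screenshot]

4.5 Statistics of one column within each value of another column

[Figure: per-group statistics screenshot]

4.6 Descriptive statistics and discretizing continuous data

[Figure: describe and discretization screenshot]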
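
The original screenshot is missing, so this is only a sketch of one common way to do both steps in pandas: `describe()` for summary statistics and `pd.cut` for binning. The ages, bin edges, and labels are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([2, 15, 22, 37, 64, 80], name="Age")

summary = ages.describe()  # count, mean, std, min, quartiles, max

# Bins are right-inclusive by default: (0, 12], (12, 18], (18, 65], (65, 100].
bins = [0, 12, 18, 65, 100]
labels = ["child", "teenager", "adult", "elderly"]
age_group = pd.cut(ages, bins=bins, labels=labels)

print(summary)
print(age_group)
```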

4.7 Pie chart

[Figure: pie chart screenshot]

4.8 Feature correlation

[Figure: feature correlation screenshot]
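
The correlation figure itself is not recoverable. As a minimal sketch, the numbers behind such a plot can be obtained with `DataFrame.corr()`; the three columns below are made up so that the expected correlations are obvious.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # perfectly increasing with x
    "z": [4, 3, 2, 1],   # perfectly decreasing with x
})

corr = df.corr()  # Pearson correlation between every pair of numeric columns

print(corr)
```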

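4.9 Pairwise data distribution between features

[Figure: pairwise feature distribution screenshot]

5. Feature engineering

5.1 Aggregating a statistic (mean, variance) of one column grouped by another

Code example: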

temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(columns={agg_type: new_col_name})
temp_df.index = list(temp_df[col])
temp_df = temp_df[new_col_name].to_dict()
train_df[new_col_name] = train_df[col].map(temp_df).astype('float32')
test_df[new_col_name] = test_df[col].map(temp_df).astype('float32')
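
The snippet above relies on variables defined elsewhere. The sketch below fills them in with hypothetical values (`card`, `amount`, and the frames are made up) and assumes the statistics are computed from the training rows only, which the original does not state.

```python
import pandas as pd

train_df = pd.DataFrame({"card": ["a", "a", "b"], "amount": [10.0, 30.0, 5.0]})
test_df = pd.DataFrame({"card": ["b", "a", "c"], "amount": [7.0, 1.0, 2.0]})

col, main_column, agg_type = "card", "amount", "mean"
new_col_name = f"{main_column}_{col}_{agg_type}"

# Assumption: aggregate over the training rows only.
mapping = train_df.groupby(col)[main_column].agg(agg_type).to_dict()

# Map each row's key to its group statistic; unseen keys become NaN.
train_df[new_col_name] = train_df[col].map(mapping).astype("float32")
test_df[new_col_name] = test_df[col].map(mapping).astype("float32")

print(train_df)
print(test_df)
```

Keys that appear only in the test set, such as `c` here, have no entry in the mapping and therefore end up as NaN.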

II. pandas practice notes (miscellaneous)

1. Replacing data in a DataFrame:

# Copy feature2 into feature1 on the rows where the condition holds.
mask = df1['feature'] == '***'
df1.loc[mask, 'feature1'] = df1.loc[mask, 'feature2']
# e.g.
mask = result['date'] == '2050-05-05'
result.loc[mask, 'quantity_x'] = result.loc[mask, 'quantity_y']

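2. Reading data and processing dates

Method 1:

df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d")  # parse to a uniform datetime format
df = df.sort_values('date')  # sort rows by date; assigning a sorted Series back to the column would not reorder anything

Method 2:

df_city.sort_values(by='日期', inplace=True)  # '日期' is the date column; sort_index(by=...) is no longer supported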
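
A brief sketch of why the sort has to be applied to the whole frame: pandas aligns an assigned Series on the index, so writing a sorted column back leaves the row order unchanged. The sample dates and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019-03-02", "2019-01-15", "2019-02-01"],
    "value": [3, 1, 2],
})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Assigning a sorted Series back aligns on the index, so nothing moves.
unsorted = df.copy()
unsorted["date"] = unsorted["date"].sort_values()

# Sorting the frame itself reorders every column together.
df = df.sort_values("date").reset_index(drop=True)

print(unsorted)
print(df)
```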

3. Converting time formats: str to Timestamp

Reference: https://blog.csdn.net/pipisorry/article/details/52209377

>>> pd.to_datetime('2016-07-01')
Timestamp('2016-07-01 00:00:00')

3.1 Normalizing a DataFrame date column to the "%Y-%m-%d" format

df_liu['date'] = pd.to_datetime(df_liu['date']).dt.strftime('%Y-%m-%d')  # keeps the original index alignment

4. Tips for saving tables

Most common approach:

def save_csv(csv_dict, save_path):
    # error_info = {"img_name": error_img_name,
    #               "error_type": error_type,
    #               "explain: 0 can't find pic;1 label no match;2 can't find edge;3 other error": explain_loc}
    error_info_df = pd.DataFrame(csv_dict)
    error_info_df.to_csv(save_path, index=False)

Method 1:

>>> import pandas as pd
>>> import numpy as np
>>> res = pd.DataFrame()
>>> ary = np.array([])
>>> ary = np.concatenate([ary , ['string']], axis=0)
>>> res['test'] = ary
>>> res
 test
0  string
>>> res.to_csv('.csv', index=False, sep=',')

Method 2:

import pandas as pd
import numpy as np


df = pd.DataFrame(columns=['hao', 'e'])
temp = 0
for i in range(2):
    df.loc[i] = np.array([1, 2])
df.to_csv('test.csv', index=False, sep=',')

Method 3:

import pandas as pd
import numpy as np


df_value = []
for i in range(2):
    df_value.append([1, 2])
df = pd.DataFrame(df_value, columns=['XX', 'XX'])
df.to_csv('test.csv', index=False, sep=',')

5. Array/string conversion

>>> import json
>>> json.dumps(['1234'], ensure_ascii=False) # ensure_ascii=False keeps non-ASCII text such as Chinese instead of \uXXXX escapes
'["1234"]'
>>> test = json.dumps(['1234'])
>>> json.loads(test)
['1234']

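6. matplotlib plotting issues

Saved figures overwriting each other: create a new figure before plotting so that nothing is drawn onto the previous one.
Fix:

plt.figure()

RuntimeWarning: More than 20 figures have been opened.
Fix:

Reusing a figure saves a lot of time. matplotlib emits this warning once more than 20 figures are open (the threshold is the rcParam figure.max_open_warning), and continually creating new figure instances without closing them easily leaks memory, so reuse figures where it makes sense. In some cases, not clearing a figure can also make the lines plotted in the first figure reappear in the second.

The relevant calls are:
plt.cla() # clear the currently active axes of the current figure; other axes are left untouched.
plt.clf() # clear all axes of the current figure without closing its window, so it can be reused for other plots.
plt.close() # close a window; defaults to the current window if none is specified.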

Close a figure window.
``close()`` by itself closes the current figure
``close(fig)`` closes the `~.Figure` instance *fig*
``close(num)`` closes the figure number *num*
``close(name)`` where *name* is a string, closes figure with that label
``close('all')`` closes all the figure windows
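
One way to apply the advice above is to create a single figure, clear its axes between plots, and close it at the end. This is a minimal sketch; the `Agg` backend and the commented-out file name are assumptions so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

plt.close("all")  # start from a clean state

fig, ax = plt.subplots()
for k in range(30):
    ax.plot([0, 1], [0, k])
    # fig.savefig(f"plot_{k}.png")  # hypothetical output file name
    ax.cla()  # clear the axes so the next plot starts empty

open_before_close = len(plt.get_fignums())
plt.close(fig)
open_after_close = len(plt.get_fignums())

print(open_before_close, open_after_close)
```

Thirty plots are drawn but only one figure is ever open, so the "more than 20 figures" warning is not triggered.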

7. Handling xlsx files with pandas

Reading an xlsx file

import re
import pandas as pd

def read_file(file, *args, **kwargs):
    """
    read excel or csv file
    """
    if re.search(r'\.xlsx$', file):
        return pd.read_excel(file, *args, **kwargs)
    elif re.search(r'\.csv$', file):
        return pd.read_csv(file, *args, **kwargs)
df_city = read_file(r'file_path', sheet_name=0)  # placeholder path to an .xlsx file
df_flow = read_file(r'file_path', sheet_name=1)

Saving an xlsx file

https://blog.csdn.net/weixin_42130167/article/details/89705581

import os
import pandas as pd

# The with block writes the file to disk on exit; writer.save() was removed in pandas 2.0.
with pd.ExcelWriter(os.path.join(os.getcwd(), 'custom.xlsx')) as writer:
    df1.to_excel(writer, sheet_name='custom_sheet_name')  # optional: startcol=**, startrow=**
    df2.to_excel(writer, sheet_name='custom_sheet_name')  # optional: startcol=**, startrow=**
    df3.to_excel(writer, sheet_name='custom_sheet_name')  # optional: startcol=**, startrow=**
    ...

Saving an html file

import os

HEADER = '''
    <html>
        <head>
            <meta charset="UTF-8">
        </head>
        <body>
    '''
FOOTER = '''
        </body>
    </html>
    '''

# Write as UTF-8 so the file matches the charset declared in HEADER.
with open(os.path.join(os.getcwd(), 'custom_file_name.html'), 'w', encoding='utf-8') as f:
    f.write(HEADER)
    for df in [df1, df2, df3]:  # list every DataFrame to export
        #f.write('<h1><strong>' + 'custom dataframe name' +'</strong></h1>')
        f.write(df.to_html(classes='custom_classname'))
    f.write(FOOTER)

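8. Splitting dates into features and generating dates

1 Parse the dates when reading the table

df = pd.read_csv('xxx_file_path', parse_dates=['日期'], index_col='日期')  # '日期' is the date column

2 Define which date parts to encode

encode_cols = ['Month', 'DayofWeek', 'WeekofYear']

3 This function returns the DataFrame with the date parts split out

def date_transform(df, encode_cols):
    df['Year'] = df.index.year
    df['Month'] = df.index.month
    # DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df['WeekofYear'] = df.index.isocalendar().week
    df['DayofWeek'] = df.index.weekday
    # df['Hour'] = df.index.hour
    # df['Minute'] = df.index.minute
    for col in encode_cols:
        df[col] = df[col].astype('category')
    df = pd.get_dummies(df, columns=encode_cols)  # one-hot encode the selected parts
    return df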
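
To make the date splitting concrete, here is a small sketch on a made-up three-day index. The `value` column is hypothetical, and only `DayofWeek` is one-hot encoded to keep the output short.

```python
import pandas as pd

index = pd.date_range("2019-11-18", periods=3, freq="D")  # Monday to Wednesday
df = pd.DataFrame({"value": [1, 2, 3]}, index=index)

df["Month"] = df.index.month
df["DayofWeek"] = df.index.weekday                              # Monday == 0
df["WeekofYear"] = df.index.isocalendar().week.astype("int64")  # ISO week number

# One indicator column per distinct weekday value.
encoded = pd.get_dummies(df, columns=["DayofWeek"])

print(encoded)
```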

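4 Generating dates: the function returns a DataFrame indexed by the generated timestamps, with the specified column

unseen_start_date = '2019-11-19 00:00:00'
steps = 6  # days
def produce_unseen_data(unseen_start_date, steps, column_name='任务量', bucket_size='1440min'):
    # '任务量' is the workload column; 1440 minutes is one day ('T' is a deprecated alias for 'min')
    index = pd.date_range(unseen_start_date, periods=steps, freq=bucket_size)
    # Build the column from a dict so the zeros line up with the date index
    df = pd.DataFrame({column_name: np.zeros(steps)}, index=index)
    return df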
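
A quick sketch of what the generated frame looks like, using the same start date and step count as above; the column name `tasks` is a stand-in chosen for this example.

```python
import numpy as np
import pandas as pd

steps = 6
index = pd.date_range("2019-11-19 00:00:00", periods=steps, freq="1440min")

# Zero-filled placeholder column aligned with the generated timestamps.
future = pd.DataFrame({"tasks": np.zeros(steps)}, index=index)

print(future)
```

Each step is 1440 minutes, so consecutive rows are exactly one day apart.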

References

[1] http://appblog.cn/2019/02/08/matplotlib%E4%B9%8Bplt.figure/

[2] https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12282042.0.0.1dce20429Jt3oQ&postId=6772

[3] https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

[4] https://www.pypandas.cn/
