1. Common usage
This article uses pandas_profiling 3.1.0.
Taking the Titanic dataset as an example, a typical pandas-profiling script looks like this:
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('train.csv',index_col=['PassengerId'])
report = ProfileReport(df)
report.to_notebook_iframe()
report.to_file('result.html')
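Open the resulting HTML page and you will see that the report is divided into several sections. The rest of this article covers how to customize them.
2. Custom parameters
Here is a brief look at some of the parameters I use most often.
For more details, see the official GitHub repository and documentation.
1. Style settings
Two alternative styles are available: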
dark_mode
orange_mode
report = ProfileReport(df,dark_mode = True)
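2. More (or fewer) computations and statistics
For statistics, the package ships two configuration files, which correspond to two modes: minimal and default. minimal is intended for large datasets (it computes only a subset of what default computes), and default is the configuration used when nothing else is specified.
To switch to minimal: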
report = ProfileReport(df,minimal = True)
There is also an explorative parameter, which computes additional features:
report = ProfileReport(df, explorative=True)
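Since minimal targets large datasets and explorative adds work, one option is to pick between them from the shape of the DataFrame. The following is a minimal sketch of that idea; the helper name choose_profile_mode and the max_cells cutoff are hypothetical and are not part of pandas-profiling.

```python
def choose_profile_mode(n_rows, n_cols, max_cells=1_000_000):
    """Return ProfileReport keyword arguments based on dataset size.

    max_cells is an arbitrary, hypothetical cutoff chosen for illustration;
    it is not a pandas-profiling setting.
    """
    if n_rows * n_cols > max_cells:
        # Large dataset: compute only the reduced set of statistics.
        return {"minimal": True}
    # Small dataset: the extra analyses are affordable.
    return {"explorative": True}


# Usage with a DataFrame `df` (requires pandas_profiling):
# from pandas_profiling import ProfileReport
# report = ProfileReport(df, **choose_profile_mode(*df.shape))
print(choose_profile_mode(891, 11))
```

The cutoff should be tuned to your own machine and patience; the point is only that the two flags are ordinary keyword arguments and can be chosen programmatically.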
3. Adding a title
report = ProfileReport(df, title = 'Pandas Profiling Report')
4. Exporting to JSON
report.to_file('result.json')
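Once the JSON file exists, it can be inspected with the standard library. This is a small sketch that assumes result.json has already been written by the call above and that its top level is a JSON object; the helper name top_level_sections is mine, and the exact keys depend on the pandas-profiling version.

```python
import json


def top_level_sections(json_path):
    """Return the sorted top-level keys of a JSON report file."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    # Sorting a dict yields its keys in alphabetical order.
    return sorted(data)


# Usage, assuming report.to_file('result.json') has already been run:
# print(top_level_sections("result.json"))
```

Listing the keys first is a cheap way to find out which parts of the report are available before digging into any one of them.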
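3. Custom configuration file
Almost every setting can be changed by editing the configuration file. I will change a few properties more or less at random to illustrate, and show before/after comparisons.
Copy config_default.yaml from the pandas_profiling package folder to a new file, config_custom.yaml, and make the changes there.
If you cannot find the pandas_profiling folder, run: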
import pandas_profiling
print(pandas_profiling.__file__)
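The copy step can also be scripted. The sketch below assumes config_default.yaml sits directly in the package folder, as described above for version 3.1.0; the helper names find_package_dir and copy_default_config are hypothetical.

```python
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(package="pandas_profiling"):
    """Return the folder of an installed package, or None if it is missing."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py file.
    return Path(spec.origin).parent


def copy_default_config(package_dir, dest="config_custom.yaml"):
    """Copy config_default.yaml from package_dir to dest and return dest."""
    src = Path(package_dir) / "config_default.yaml"
    if not src.is_file():
        raise FileNotFoundError(src)
    shutil.copyfile(src, dest)
    return Path(dest)


# Usage (requires pandas_profiling to be installed):
# copy_default_config(find_package_dir())
```

Working on a copy leaves the packaged defaults untouched, so a reinstall or upgrade will not overwrite your changes.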
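How the configuration file structure maps to the changes in the report
First put config_custom.yaml in the same directory as your script, then run the following to load it: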
file = 'config_custom.yaml'
report = ProfileReport(df,dark_mode = True,config_file = file)
1. Basic settings
# Title of the document
title: "Pandas Profiling Report"
# Metadata
dataset:
  description: "wuhu"
  creator: "fk"
  author: "ym"
  copyright_holder: ""
  copyright_year: ""
  url: ""
variables:
  descriptions: {
    'Sex': 'OMG'
  }
# infer dtypes
infer_dtypes: true
# Show the description at each variable (in addition to the overview tab)
show_variable_description: true
# Number of workers (0=multiprocessing.cpu_count())
pool_size: 0
# Show the progress bar
progress_bar: true
Before:
After:
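The same metadata can, as far as I know, also be passed as keyword arguments instead of going through a YAML file. The sketch below mirrors the values from the YAML above; build_report is a hypothetical wrapper, and the import is done lazily so the dictionary can be inspected without pandas_profiling installed.

```python
# Same metadata as in the YAML above, expressed as keyword arguments.
report_kwargs = {
    "title": "Pandas Profiling Report",
    "dataset": {
        "description": "wuhu",
        "creator": "fk",
        "author": "ym",
    },
    # Per-column descriptions, keyed by column name.
    "variables": {"descriptions": {"Sex": "OMG"}},
}


def build_report(df):
    """Build a report from df using the metadata above (hypothetical helper)."""
    # Imported here so the module loads even without pandas_profiling.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, **report_kwargs)


# Usage with a DataFrame `df`:
# build_report(df).to_file("result.html")
```

Keyword arguments are convenient for one-off tweaks; a YAML file is easier to share and reuse across projects.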
2. Variables section (I left this unchanged)
vars:
  num: # numeric variables
    quantiles: # quantiles reported in the statistics
      - 0.05
      - 0.25
      - 0.5
      - 0.75
      - 0.95
    skewness_threshold: 20
    low_categorical_threshold: 5
    # Set to zero to disable
    chi_squared_threshold: 0.999
  cat: # categorical variables
    length: true
    characters: true
    words: true
    cardinality_threshold: 50
    n_obs: 5
    # Set to zero to disable
    chi_squared_threshold: 0.999
    coerce_str_to_date: false
    redact: false
    histogram_largest: 50
  bool:
    n_obs: 3
    # string to boolean mapping dict
    mappings:
      t: true
      f: false
      yes: true
      no: false
      y: true
      n: false
      true: true
      false: false
  file:
    active: false
  image:
    active: true
    exif: true
    hash: true
  path:
    active: false
  url:
    active: false
3. Remaining statistics sections
# Sort the variables. Possible values: "ascending", "descending" or null (leaves original sorting)
sort: null
# which diagrams to show
missing_diagrams:
  bar: true
  matrix: false
  heatmap: false
  dendrogram: false
correlations:
  pearson:
    calculate: false
    warn_high_correlations: false
    threshold: 0.9
  spearman:
    calculate: true
    warn_high_correlations: true
    threshold: 0.9
  kendall:
    calculate: false
    warn_high_correlations: false
    threshold: 0.9
  phi_k:
    calculate: true
    warn_high_correlations: false
    threshold: 0.9
  cramers:
    calculate: false
    warn_high_correlations: false
    threshold: 0.9
# Bivariate / Pairwise relations
interactions:
  targets: []
  continuous: true
# Configuration related to the samples area
samples:
  head: 10
  tail: 10
  random: 0
# For categorical
categorical_maximum_correlation_distinct: 100
report:
  precision: 10
Comparison of the report before and after the change.
4. Plots (usually you only change the colors, i.e. cmap)
# Plot-specific settings
plot:
  # Image format (svg or png)
  image_format: "svg"
  dpi: 800
  scatter_threshold: 1000
  correlation:
    cmap: 'RdBu'
    bad: '#000000'
  missing:
    cmap: 'RdBu'
    # Force labels when there are > 50 variables
    # https://github.com/ResidentMario/missingno/issues/93#issuecomment-513322615
    force_labels: true
  pie: # pie chart
    # display a pie chart if the number of distinct values is smaller or equal (set to 0 to disable)
    max_unique: 10
  histogram: # the histogram shown next to each variable
    x_axis_labels: true
    # Number of bins (set to 0 to automatically detect the bin size)
    # bins: 50
    bins: 50
    # Maximum number of bins (when bins=0)
    # max_bins: 250
    max_bins: 250
5. Other
# The number of observations to show
n_obs_unique: 5
n_extreme_obs: 5
n_freq_table_max: 10
# Use `deep` flag for memory_usage
memory_deep: false
# Configuration related to the duplicates
duplicates:
  head: 10
  key: "# duplicates"
# Configuration related to the rejection of variables
reject_variables: true
# When in a Jupyter notebook
notebook:
  iframe:
    height: '800px'
    width: '100%'
    # or 'src'
    attribute: 'srcdoc'
html:
  # Minify the html
  minify_html: true
  # Offline support
  use_local_assets: true
  # If true, single file, else directory with assets
  inline: true
  # Show navbar
  navbar_show: true
  # Assets prefix if inline = true
  assets_prefix: null
  # Styling options for the HTML report
  style:
    theme: null
    logo: ""
    primary_color: "#337ab7"
    full_width: false
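4. Summary
This article covered what can be done by changing parameters and the configuration file. Topics such as warnings and inspecting individual variables will get a separate post (this one is already long enough to be painful to read, doge).
After reading the official documentation, a few problems remain that do not seem solvable through parameters or the configuration file:
1. Producing the report in Chinese.
2. Changing the plot type, for example replacing the histogram with a KDE plot.
3. Changing the structure of the report. GitHub has a solution for older versions (around 2.5.0), but it currently seems to have problems.
I am not very experienced, so the only workarounds I know are editing the generated HTML directly or changing DOM nodes with the browser's developer tools.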
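The HTML-editing workaround for a Chinese report can be sketched with plain string replacement. This is only a rough illustration: the TRANSLATIONS mapping is hypothetical and incomplete, the helper names are mine, and the approach assumes each label appears as the complete text of an element in the generated file.

```python
from pathlib import Path

# Hypothetical, incomplete mapping from English labels to Chinese ones.
TRANSLATIONS = {
    "Overview": "概览",
    "Variables": "变量",
    "Correlations": "相关性",
    "Missing values": "缺失值",
    "Sample": "样本",
}


def translate_labels(html, translations=TRANSLATIONS):
    """Replace labels that form the complete text content of an element."""
    for english, chinese in translations.items():
        # Matching on ">label<" leaves attributes such as href or id untouched.
        html = html.replace(f">{english}<", f">{chinese}<")
    return html


def translate_file(src="result.html", dest="result_zh.html"):
    """Write a translated copy of src to dest."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dest).write_text(translate_labels(text), encoding="utf-8")


print(translate_labels('<a href="#overview">Overview</a>'))
```

Labels surrounded by whitespace or mixed with other text are not matched, so a real translation would need a larger mapping and probably an HTML parser.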