Generating Reports in Python with the pandas_profiling Library

Hui Ge's Blog (辉哥的博客)

Last modified: 2023-02-20 19:49:35

First published: 2022-01-27 15:46:51

Article link: https://blog.csdn.net/qq_43278973/article/details/122718777

Installing pandas_profiling

Uninstall pandas_profiling
pip uninstall pandas_profiling

Install from the command line (the first command pins a specific version)
pip install pandas_profiling==2.10.1
pip install pandas_profiling

Install from the Tsinghua mirror
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas_profiling
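After installing, it can help to confirm which version pip actually installed, especially when a version was pinned. The following is a minimal sketch using the standard library's importlib.metadata; installed_version is a hypothetical helper name, and the distribution name "pandas-profiling" is taken from the pip command in the official documentation quoted later in this post.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string such as the pinned one, or None if not installed.
    print(installed_version("pandas-profiling"))
```

If the printed version is not the one you pinned, the install probably went into a different Python environment than the one running the script.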

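Troubleshooting installation errors

Error after installing pandas_profiling: ImportError: DLL load failed: 找不到指定的程序。 (The specified procedure could not be found.)
Fix: read the traceback to see which dependency failed to load, then reinstall that package, for example: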

pip uninstall scipy
pip install scipy

pip uninstall missingno
pip install missingno
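To find out which dependency needs reinstalling without reading the whole traceback, each suspect can be imported on its own. This is a rough sketch, not part of pandas_profiling: find_broken_imports is a hypothetical helper, and the module list is only the two packages reinstalled above.

```python
import importlib


def find_broken_imports(module_names):
    """Try to import each module and collect the ones that fail."""
    broken = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:
            # ImportError is the usual case (it covers "DLL load failed");
            # any other exception raised at import time is recorded too.
            broken[name] = f"{type(exc).__name__}: {exc}"
    return broken


if __name__ == "__main__":
    # scipy and missingno are the two packages reinstalled above.
    for name, message in find_broken_imports(["scipy", "missingno"]).items():
        print(f"{name} -> {message}")
```

Any module that shows up in the output is a candidate for pip uninstall followed by pip install.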

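Note on naming your Python file:
pandas_profiling.py  (fails: the script has the same name as the library, so "import pandas_profiling" picks up the script itself)
pcx_pandas_profiling.py  (OK)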
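The naming rule can be checked mechanically: a script shadows a library when its file name without the extension equals the module name. The sketch below illustrates that check; shadows_module is a hypothetical helper, not something pandas_profiling provides.

```python
from pathlib import Path


def shadows_module(script_path: str, module_name: str = "pandas_profiling") -> bool:
    """Return True if the script's file name would shadow the given module."""
    # Path.stem is the file name without its final extension.
    return Path(script_path).stem == module_name


if __name__ == "__main__":
    for candidate in ["pandas_profiling.py", "pcx_pandas_profiling.py"]:
        print(candidate, shadows_module(candidate))
```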

Error:
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Fix: pip cannot remove the distutils-installed copy, so install over it with --ignore-installed (see the commands below), then install pandas_profiling again.

Installing a package online through a mirror (scrapy as an example)
pip install -i https://pypi.douban.com/simple scrapy

Install pandas_profiling
pip install PyHamcrest==1.9.0
pip install PyYAML --ignore-installed
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas_profiling

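Commonly used PyPI mirrors
Douban (fairly stable and fast)
https://pypi.douban.com/simple

Tsinghua University
https://pypi.tuna.tsinghua.edu.cn/simple

University of Science and Technology of China (USTC)
https://mirrors.ustc.edu.cn/pypi/web/simple

Alibaba Cloud (Aliyun)
https://mirrors.aliyun.com/pypi/simple/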
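When installs are scripted, the mirror can be passed to pip programmatically. The sketch below builds the same "pip install -i <mirror> <package>" command shown earlier and runs it with the current interpreter; pip_install_command and pip_install are hypothetical helper names, and the mirror URL is the Tsinghua one from the list above.

```python
import subprocess
import sys

TSINGHUA_INDEX = "https://pypi.tuna.tsinghua.edu.cn/simple"


def pip_install_command(package, index_url=None):
    """Build the pip command as an argument list, optionally with a mirror."""
    # sys.executable makes sure pip runs for the interpreter in use.
    command = [sys.executable, "-m", "pip", "install"]
    if index_url:
        command += ["-i", index_url]
    command.append(package)
    return command


def pip_install(package, index_url=None):
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.run(pip_install_command(package, index_url), check=True)


if __name__ == "__main__":
    # Only prints the command; call pip_install(...) to actually install.
    print(" ".join(pip_install_command("pandas_profiling", TSINGHUA_INDEX)))
```

Building the command as a list avoids shell quoting problems and keeps the mirror choice in one place.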

The Python code is as follows:

import os

import pandas as pd
from pandas_profiling import ProfileReport

input_dir = r"../test_data"
output_dir = "../test_data"
hospital = "XX"

for path, dir_list, file_list in os.walk(input_dir):
    for file_name in file_list:
        if file_name == "XX.csv":  # used when profiling a single table
            file_path = os.path.join(path, file_name)
            df = pd.read_csv(file_path)
            # Table name: the file name without its extension, lowercased
            table_name = os.path.splitext(file_name)[0].lower()
            # minimal=True skips expensive computations;
            # leave it out to get a more detailed report
            profile = ProfileReport(
                df,
                title=f"{hospital} {table_name} table data quality report",
                minimal=True,
            )
            profile.to_file(output_file=os.path.join(output_dir, table_name + ".html"))
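One detail to watch when deriving a table name from a file name: the pattern \w+ matches the extension as well as the base name, so iterating over its matches for "XX.csv" yields two names and would write a second report called csv.html. The sketch below contrasts this with os.path.splitext; table_name_from_file is a hypothetical helper used only for illustration.

```python
import os
import re


def table_name_from_file(file_name: str) -> str:
    """Return the lowercased file name without directory or extension."""
    stem, _extension = os.path.splitext(os.path.basename(file_name))
    return stem.lower()


if __name__ == "__main__":
    # \w+ splits on the dot, so the extension becomes a second match.
    print(re.findall(r"\w+", "XX.csv"))    # ['XX', 'csv']
    print(table_name_from_file("XX.csv"))  # xx
```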

The following is the content of the official Pandas Profiling (version 2.11) documentation:

Pandas Profiling

Generates profile reports from a pandas DataFrame.

The pandas df.describe() function is great but a little basic for serious exploratory data analysis.
pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

Type inference: detect the types of columns in a dataframe.
Essentials: type, unique values, missing values
Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
Missing values: matrix, count, heatmap and dendrogram of missing values
Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
File and Image analysis: extract file sizes, creation dates and dimensions and scan for truncated images or those containing EXIF information.
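To make the quantile and descriptive statistics in the list above concrete, here is a small worked example using only Python's standard library. It is an illustration of the definitions, not the library's own implementation: the quantile interpolation method ("inclusive") is an assumption, and pandas_profiling may compute quantiles differently.

```python
import statistics


def quantile_summary(data):
    """Minimum, quartiles, maximum, range and interquartile range."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "minimum": min(data),
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "maximum": max(data),
        "range": max(data) - min(data),
        "interquartile range": q3 - q1,
    }


def descriptive_summary(data):
    """Mean, sample standard deviation, sum and coefficient of variation."""
    mean = statistics.mean(data)
    std = statistics.stdev(data)  # sample standard deviation (n - 1)
    return {
        "mean": mean,
        "standard deviation": std,
        "sum": sum(data),
        "coefficient of variation": std / mean,
    }


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(quantile_summary(values))
    print(descriptive_summary(values))
```

For the values 1 through 9 this gives Q1 = 3, median = 5, Q3 = 7, so the interquartile range is 4 and the range is 8.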

Announcements

Version v2.10.0rc1 released

v2.10.0rc1 includes a major overhaul of the type system, now fully reliant on visions.
See the changelog below to know what has changed.

Spark backend in progress

We can happily announce that we’re nearing v1 for the Spark backend for generating profile reports.
Stay tuned.

Support `pandas-profiling`

The development of pandas-profiling relies completely on contributions.
If you find value in the package, we welcome you to support the project through GitHub Sponsors!
It’s extra exciting that GitHub matches your contribution for the first year.

Find more information here:

January 5, 2021 💘

Examples

The following examples can give you an impression of what the package can do:

Census Income (US Adult Census data relating income)
NZA (open data from the Dutch Healthcare Authority)
Stata Auto (1978 Automobile data)
Vektis (Vektis Dutch Healthcare data)
Colors (a simple colors dataset)
UCI Bank Dataset (banking marketing dataset)

Specific features:

Russian Vocabulary (demonstrates text analysis)
Cats and Dogs (demonstrates image analysis from the file system)
Celebrity Faces (demonstrates image analysis with EXIF information)
Website Inaccessibility (demonstrates URL analysis)
Orange prices and Coal prices (showcases report themes)

Installation

Using pip

You can install using the pip package manager by running

pip install pandas-profiling[notebook]

Alternatively, you could install the latest version directly from Github:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda

You can install using the conda package manager by running

conda install -c conda-forge pandas-profiling

From source

Download the source code by cloning the repository or by pressing ‘Download ZIP’ on this page.

Install by navigating to the proper directory and running:

python setup.py install

Documentation

The documentation for pandas_profiling can be found here. Previous documentation is still available here.

Getting started

Start by loading in your pandas DataFrame, e.g. by using:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=["a", "b", "c", "d", "e"]
)

To generate the report, run:

profile = ProfileReport(df, title="Pandas Profiling Report")

Explore deeper

You can configure the profile report in any way you like. The example code below loads the explorative configuration file, which includes many features for text (length distribution, unicode information), files (file size, creation time) and images (dimensions, exif information). If you are interested in which exact settings were used, you can compare it with the default configuration file.

profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True)

Learn more about configuring pandas-profiling on the Advanced usage page.

Jupyter Notebook

We recommend generating reports interactively by using the Jupyter notebook.
There are two interfaces (see animations below): through widgets and through an HTML report.

This is achieved by simply displaying the report. In the Jupyter Notebook, run:

profile.to_widgets()

The HTML report can be included in a Jupyter notebook:

Run the following code:

profile.to_notebook_iframe()

Saving the report

If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile.to_file("your_report.html")

Alternatively, you can obtain the data as JSON:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
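Since to_json() returns a string, it can be parsed with the standard json module and inspected like any dictionary. The sketch below assumes only that the string is a JSON object; top_level_sections is a hypothetical helper, and the key names in the stand-in string are placeholders, not the report's real structure, which should be checked against an actual report.

```python
import json


def top_level_sections(json_data: str):
    """Parse a JSON report string and list its top-level keys, sorted."""
    report = json.loads(json_data)
    return sorted(report)


if __name__ == "__main__":
    # Stand-in string with made-up keys; in practice pass profile.to_json().
    example_json = '{"section_b": {"rows": 100}, "section_a": {}}'
    print(top_level_sections(example_json))  # ['section_a', 'section_b']
```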

Large datasets

Version 2.4 introduces minimal mode.

This is a default configuration that disables expensive computations (such as correlations and dynamic binning).

Use the following syntax:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")

Command line usage

For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable.

Run the following for information about options and arguments.

pandas_profiling -h

Advanced usage

A set of options is available in order to adapt the report generated.

title (str): Title for the report (‘Pandas Profiling Report’ by default).
pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
progress_bar (bool): If True, pandas-profiling will display a progress bar.

More settings can be found in the default configuration file, minimal configuration file and dark themed configuration file.

Example

profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file("output.html")

Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without support of our gracious sponsors.

Lambda workstations, servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. Lambda Cloud offers 4 & 8 GPU instances starting at $1.50 / hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.

We would like to thank our generous Github Sponsors supporters who make pandas-profiling possible:

Martin Sotir, Joseph Yuen, Brian Lee, Stephanie Rivera, nscsekhar, abdulAziz

More info if you would like to appear here: Github Sponsor page

Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float, etc.).
pandas-profiling currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.

We have developed a type system for Python, tailored for data analysis: visions.
Selecting the right typeset drastically reduces the complexity of the code of your analysis.
Future versions of pandas-profiling will have extended type support through visions!

Contributing

Read about getting involved in the Contribution Guide.

A low threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. Join the Slack community.

Editor integration

PyCharm integration

1. Install pandas-profiling via the instructions above.

2. Locate your pandas-profiling executable.

On macOS / Linux / BSD:

$ which pandas_profiling
(example) /usr/local/bin/pandas_profiling

On Windows:

$ where pandas_profiling
(example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe

3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools.
4. Click the + icon to add a new external tool.
5. Insert the following values:
- Name: Pandas Profiling
- Program: The location obtained in step 2
- Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
- Working Directory: $ProjectFileDir$

To use the PyCharm Integration, right click on any dataset file:

External Tools > Pandas Profiling.
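The same invocation that the PyCharm external tool performs can be scripted. The sketch below mirrors the Arguments field above (input file, then an output path next to it ending in _report.html) and locates the executable the way "which" or "where" does; report_path_for, profiling_command and run_profiling are hypothetical helper names, and no command-line options beyond the two positional paths are assumed.

```python
import shutil
import subprocess
from pathlib import Path


def report_path_for(dataset_path):
    """Output path next to the dataset, named <name>_report.html."""
    dataset = Path(dataset_path)
    # Drop every extension, like PyCharm's $FileNameWithoutAllExtensions$.
    name = dataset.name.split(".")[0]
    return dataset.with_name(f"{name}_report.html")


def profiling_command(dataset_path, executable="pandas_profiling"):
    """Argument list: executable, input file, output report."""
    return [executable, str(dataset_path), str(report_path_for(dataset_path))]


def run_profiling(dataset_path):
    """Find the executable on PATH and run it on one dataset file."""
    executable = shutil.which("pandas_profiling")
    if executable is None:
        raise FileNotFoundError("pandas_profiling executable not found on PATH")
    subprocess.run(profiling_command(dataset_path, executable), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_profiling(...) to execute it.
    print(profiling_command("data/sales.csv"))
```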

Other integrations

Other editor integrations may be contributed via pull requests.

Dependencies

The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

Filename	Requirements
requirements.txt	Package requirements
requirements-dev.txt	Requirements for development
requirements-test.txt	Requirements for testing
setup.py	Requirements for Widgets etc.
