什么是Python数据分析与可视化
Python数据分析与可视化是指使用Python编程语言进行数据处理、分析和可视化的过程。Python拥有许多强大的库和工具,使得数据分析和可视化变得更加简单和高效。
数据分析通常涉及对数据进行清洗、转换、整合和统计等处理,以获取有关数据的洞察和见解。Python提供了一些常用的数据分析库,如NumPy、Pandas和SciPy,它们提供了各种数据结构、函数和算法,用于处理和分析数据。
可视化是将数据以图表、图形或其他可视形式展示出来的过程。通过可视化,我们可以更直观地理解数据、发现模式和趋势,并向其他人传达数据的信息。Python中最常用的数据可视化库是Matplotlib和Seaborn,它们提供了各种绘图函数和样式,使得绘制各种类型的图表变得简单。
除了Matplotlib和Seaborn,还有其他一些流行的数据可视化库,如Plotly、Bokeh和Altair,它们提供了更交互性和高级的可视化功能。
使用Python进行数据分析和可视化的基本步骤包括:
1. 数据获取:从文件、数据库或API等来源获取数据。
2. 数据清洗和预处理:对数据进行清洗、转换和处理,去除异常值、处理缺失数据等。
3. 数据分析:使用统计方法、机器学习算法等对数据进行分析,探索数据的特征和关系。
4. 数据可视化:使用合适的图表、图形或地图等方式将数据可视化,以便更好地理解和传达数据信息。
5. 结果解释和报告:对分析和可视化结果进行解释,并撰写相应的报告或文档。
Python数据分析与可视化的优势包括语法简洁易学、强大的库和工具支持、丰富的社区资源和广泛的应用领域。无论是从事科学研究、商业分析还是数据驱动的决策,Python都是一个强大的工具。
Python数据分析依赖的两个对象
Python数据分析依赖的两个主要对象是NumPy数组和Pandas数据帧。
1. NumPy数组(N-dimensional Array):NumPy是Python中一个强大的数值计算库,提供了多维数组对象和一系列用于操作数组的函数。在数据分析中,NumPy数组是一种高效的数据结构,用于存储和处理大规模数值数据。NumPy数组具有快速的数值运算和广播功能,可以进行向量化操作,使得处理大型数据集变得更加高效。它还提供了各种数学、统计和线性代数函数,用于数据计算和分析。
2. Pandas数据帧(DataFrame):Pandas是一个基于NumPy构建的数据分析工具,提供了高级数据结构和数据操作功能。其中最常用的数据结构是DataFrame,它类似于表格或电子表格的数据结构,包含行和列的标签,以及灵活的索引和选择功能。Pandas的DataFrame可以处理异构数据,支持处理时间序列数据、缺失数据、重复数据等。它提供了丰富的数据操作、转换和处理函数,如数据过滤、排序、合并、聚合等,方便进行数据清洗、转换和分析。
这两个对象在Python数据分析中扮演着重要的角色。NumPy提供了底层的数值计算和数组操作能力,而Pandas构建在NumPy之上,提供了更高级的数据结构和数据处理功能,使得数据分析更加便捷和灵活。
除了NumPy和Pandas,Python数据分析还涉及其他一些常用的库和工具,如SciPy(科学计算库)、Matplotlib(绘图库)、Seaborn(统计可视化库)等。这些库共同构成了Python数据分析的生态系统,为数据科学家和分析师提供了丰富的工具和资源。
导入外部数据
要导入外部数据到Python中进行数据分析,可以使用多种方法,具体取决于数据的来源和格式。下面是几种常见的导入外部数据的方法:
1.CSV文件:如果数据存储在逗号分隔值(CSV)文件中,可以使用pandas
库的read_csv()
函数来读取数据并创建一个数据帧(DataFrame)对象。例如:
import pandas as pd
data = pd.read_csv('data.csv') # 替换为你的CSV文件路径
2.Excel文件:如果数据存储在Excel文件中,可以使用pandas
库的read_excel()
函数来读取数据。需要先安装openpyxl
或xlrd
库。例如:
import pandas as pd
data = pd.read_excel('data.xlsx', sheet_name='Sheet1') # 替换为你的Excel文件路径和工作表名称
3.JSON文件:如果数据存储在JSON文件中,可以使用json
库来读取数据。例如:
import json
open('data.json') as f: # 替换为你的JSON文件路径
data = json.load(f)
4.数据库:如果数据存储在数据库中,可以使用p
库的read_sql()
函数来执行SQL查询并将结果读取为数据帧。需要先安装相应的数据库驱动程序,如pyodbc
、pymysql
、psycopg2
等。例如:
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db') # 替换为你的数据库连接信息
query = 'SELECT * FROM table' # 替换为你的SQL查询语句
data = pd.read_sql(query, conn)
以上只是一些常见的导入外部数据的方法,实际情况可能因数据来源和格式而有所不同。无论是从文件、数据库、API还是其他来源导入数据,Python提供了广泛的库和工具来处理各种数据格式。
例如:
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "6677b697",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd "
]
},
{
"cell_type": "markdown",
"id": "7266548f",
"metadata": {},
"source": [
"# 1. 读取csv数据"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3c15c8e6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>姓名</th>\n",
" <th>性别</th>\n",
" <th>省份</th>\n",
" <th>城市</th>\n",
" <th>年龄</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>超级管理员</td>\n",
" <td>男</td>\n",
" <td>广东广州</td>\n",
" <td>泰迪科技</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>叶亦凯</td>\n",
" <td>男</td>\n",
" <td>广东广州</td>\n",
" <td>广州</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>邓彬彬</td>\n",
" <td>女</td>\n",
" <td>广东广州</td>\n",
" <td>广州</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>张建涛</td>\n",
" <td>男</td>\n",
" <td>广东广州</td>\n",
" <td>广州</td>\n",
" <td>22.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>莫诗怡</td>\n",
" <td>女</td>\n",
" <td>广西</td>\n",
" <td>广州</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>728</th>\n",
" <td>杜殿雨</td>\n",
" <td>女</td>\n",
" <td>江西九江</td>\n",
" <td>佛山</td>\n",
" <td>48.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>729</th>\n",
" <td>杜鹦超</td>\n",
" <td>男</td>\n",
" <td>天津</td>\n",
" <td>广州</td>\n",
" <td>43.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>730</th>\n",
" <td>杜小悦</td>\n",
" <td>女</td>\n",
" <td>广东广州</td>\n",
" <td>佛山</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>731</th>\n",
" <td>杜雨玲</td>\n",
" <td>女</td>\n",
" <td>福建厦门</td>\n",
" <td>广州</td>\n",
" <td>37.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>732</th>\n",
" <td>杜依醇</td>\n",
" <td>女</td>\n",
" <td>广西南宁</td>\n",
" <td>佛山</td>\n",
" <td>46.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>733 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" 姓名 性别 省份 城市 年龄\n",
"0 超级管理员 男 广东广州 泰迪科技 23.0\n",
"1 叶亦凯 男 广东广州 广州 21.0\n",
"2 邓彬彬 女 广东广州 广州 21.0\n",
"3 张建涛 男 广东广州 广州 22.0\n",
"4 莫诗怡 女 广西 广州 23.0\n",
".. ... .. ... ... ...\n",
"728 杜殿雨 女 江西九江 佛山 48.0\n",
"729 杜鹦超 男 天津 广州 43.0\n",
"730 杜小悦 女 广东广州 佛山 23.0\n",
"731 杜雨玲 女 福建厦门 广州 37.0\n",
"732 杜依醇 女 广西南宁 佛山 46.0\n",
"\n",
"[733 rows x 5 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data1 = pd.read_csv(r'../../数据集/新用户表.csv',\n",
" sep=',',\n",
" encoding='gbk'\n",
" )\n",
"data1"
]
},
{
"cell_type": "markdown",
"id": "a0be77ea",
"metadata": {},
"source": [
"# 2. 读取excel数据"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "63cd631c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>明细编号</th>\n",
" <th>订单编号</th>\n",
" <th>菜名</th>\n",
" <th>数量</th>\n",
" <th>单价</th>\n",
" <th>菜品添加时间</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2352</td>\n",
" <td>366</td>\n",
" <td>香酥两吃大虾</td>\n",
" <td>1</td>\n",
" <td>89</td>\n",
" <td>2016-08-11 11:49:43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2354</td>\n",
" <td>366</td>\n",
" <td>姜葱炒花蟹</td>\n",
" <td>1</td>\n",
" <td>45</td>\n",
" <td>2016-08-11 11:51:17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2356</td>\n",
" <td>366</td>\n",
" <td>香烤牛排\\r\\n\\r\\n\\r\\n</td>\n",
" <td>1</td>\n",
" <td>55</td>\n",
" <td>2016-08-11 11:52:48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2358</td>\n",
" <td>366</td>\n",
" <td>铁板牛肉</td>\n",
" <td>1</td>\n",
" <td>66</td>\n",
" <td>2016-08-11 11:53:47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2361</td>\n",
" <td>366</td>\n",
" <td>蒜香包</td>\n",
" <td>1</td>\n",
" <td>13</td>\n",
" <td>2016-08-11 11:54:12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3642</th>\n",
" <td>3856</td>\n",
" <td>1125</td>\n",
" <td>白饭/大碗</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>2016-08-20 22:06:33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3643</th>\n",
" <td>3857</td>\n",
" <td>1125</td>\n",
" <td>泡椒凤爪\\r\\n\\r\\n\\r\\n</td>\n",
" <td>1</td>\n",
" <td>58</td>\n",
" <td>2016-08-20 22:07:31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3644</th>\n",
" <td>3858</td>\n",
" <td>1125</td>\n",
" <td>拌土豆丝</td>\n",
" <td>1</td>\n",
" <td>25</td>\n",
" <td>2016-08-20 22:07:47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3645</th>\n",
" <td>3859</td>\n",
" <td>1125</td>\n",
" <td>牛尾汤</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>2016-08-20 22:08:38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3646</th>\n",
" <td>3860</td>\n",
" <td>1125</td>\n",
" <td>照烧鸡腿\\r\\n\\r\\n\\r\\n</td>\n",
" <td>1</td>\n",
" <td>16</td>\n",
" <td>2016-08-20 22:09:17</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3647 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" 明细编号 订单编号 菜名 数量 单价 菜品添加时间\n",
"0 2352 366 香酥两吃大虾 1 89 2016-08-11 11:49:43\n",
"1 2354 366 姜葱炒花蟹 1 45 2016-08-11 11:51:17\n",
"2 2356 366 香烤牛排\\r\\n\\r\\n\\r\\n 1 55 2016-08-11 11:52:48\n",
"3 2358 366 铁板牛肉 1 66 2016-08-11 11:53:47\n",
"4 2361 366 蒜香包 1 13 2016-08-11 11:54:12\n",
"... ... ... ... .. .. ...\n",
"3642 3856 1125 白饭/大碗 1 10 2016-08-20 22:06:33\n",
"3643 3857 1125 泡椒凤爪\\r\\n\\r\\n\\r\\n 1 58 2016-08-20 22:07:31\n",
"3644 3858 1125 拌土豆丝 1 25 2016-08-20 22:07:47\n",
"3645 3859 1125 牛尾汤 1 40 2016-08-20 22:08:38\n",
"3646 3860 1125 照烧鸡腿\\r\\n\\r\\n\\r\\n 1 16 2016-08-20 22:09:17\n",
"\n",
"[3647 rows x 6 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data2 = pd.read_excel(r'../../数据集/新明细表.xlsx')\n",
"data2"
]
},
{
"cell_type": "markdown",
"id": "3cac14fb",
"metadata": {},
"source": [
"# 3. 将表格上传至数据库"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "356f94b8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Engine(mysql+pymysql://root:***@localhost:3306/data_analysis)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sqlalchemy import create_engine \n",
"user = 'root' #用户名\n",
"password = '123456' #密码\n",
"ip = 'localhost' # ip地址\n",
"port = '3306' # 端口\n",
"database = 'data_analysis' # 数据库名\n",
"\n",
"# 创建mysql引擎对象,将上述信息写入函数的参数中\n",
"engine = create_engine(f'mysql+pymysql://{user}:{password}@{ip}:{port}/{database}') \n",
"engine"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b8c95629",
"metadata": {},
"outputs": [],
"source": [
"# 将表data1写入engine中的数据库中的user表中\n",
"data1.to_sql('user',engine,index=False)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3ef8d997",
"metadata": {},
"outputs": [],
"source": [
"# 将表data2写入engine中的数据库中的detail表中\n",
"data2.to_sql('detail',engine,index=False)"
]
},
{
"cell_type": "markdown",
"id": "ba20bfdd",
"metadata": {},
"source": [
"# 4. 读取Mysql数据库数据"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "28d2bdf8",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd # 导入pandas库并重命名\n",
"from sqlalchemy import create_engine "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1a8affae",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Engine(mysql+pymysql://root:***@localhost:3306/data_analysis)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user = 'root' #用户名\n",
"password = '123456' #密码\n",
"ip = 'localhost' # ip地址\n",
"port = '3306' # 端口\n",
"database = 'data_analysis' # 数据库名\n",
"\n",
"# 创建mysql引擎对象,将上述信息写入函数的参数中\n",
"engine = create_engine(f'mysql+pymysql://{user}:{password}@{ip}:{port}/{database}') \n",
"engine"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1089f0b8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>订单编号</th>\n",
" <th>用餐人数</th>\n",
" <th>订单总花费</th>\n",
" <th>点菜总数</th>\n",
" <th>用餐开始时间</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>417</td>\n",
" <td>4</td>\n",
" <td>165</td>\n",
" <td>5</td>\n",
" <td>2016-08-01 11:05:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>301</td>\n",
" <td>3</td>\n",
" <td>321</td>\n",
" <td>6</td>\n",
" <td>2016-08-01 11:15:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>413</td>\n",
" <td>6</td>\n",
" <td>854</td>\n",
" <td>15</td>\n",
" <td>2016-08-01 12:42:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>415</td>\n",
" <td>4</td>\n",
" <td>466</td>\n",
" <td>10</td>\n",
" <td>2016-08-01 12:51:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>392</td>\n",
" <td>10</td>\n",
" <td>704</td>\n",
" <td>24</td>\n",
" <td>2016-08-01 12:58:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>930</th>\n",
" <td>641</td>\n",
" <td>8</td>\n",
" <td>679</td>\n",
" <td>12</td>\n",
" <td>2016-08-31 21:23:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>931</th>\n",
" <td>672</td>\n",
" <td>6</td>\n",
" <td>800</td>\n",
" <td>24</td>\n",
" <td>2016-08-31 21:24:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>932</th>\n",
" <td>692</td>\n",
" <td>8</td>\n",
" <td>735</td>\n",
" <td>10</td>\n",
" <td>2016-08-31 21:25:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>933</th>\n",
" <td>647</td>\n",
" <td>4</td>\n",
" <td>262</td>\n",
" <td>9</td>\n",
" <td>2016-08-31 21:37:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>934</th>\n",
" <td>570</td>\n",
" <td>8</td>\n",
" <td>589</td>\n",
" <td>13</td>\n",
" <td>2016-08-31 21:41:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>935 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" 订单编号 用餐人数 订单总花费 点菜总数 用餐开始时间\n",
"0 417 4 165 5 2016-08-01 11:05:00\n",
"1 301 3 321 6 2016-08-01 11:15:00\n",
"2 413 6 854 15 2016-08-01 12:42:00\n",
"3 415 4 466 10 2016-08-01 12:51:00\n",
"4 392 10 704 24 2016-08-01 12:58:00\n",
".. ... ... ... ... ...\n",
"930 641 8 679 12 2016-08-31 21:23:00\n",
"931 672 6 800 24 2016-08-31 21:24:00\n",
"932 692 8 735 10 2016-08-31 21:25:00\n",
"933 647 4 262 9 2016-08-31 21:37:00\n",
"934 570 8 589 13 2016-08-31 21:41:00\n",
"\n",
"[935 rows x 5 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sql查询语句\n",
"sql = 'select * from data_analysis.order'\n",
"# 调用pandas库的read_sql_query函数查询数据库数据\n",
"df = pd.read_sql_query(sql,engine) # 传入sql语句和engine对象\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b67a09a5",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Series序列对象
Series是Pandas库中的一种数据结构,它表示一维标记的、带有轴标签的数据序列。Series可以包含任意类型的数据(整数、浮点数、字符串等),类似于一维数组或列表。
创建Series对象时,需要传入一个数据序列(如列表、NumPy数组等)和一个可选的索引序列。索引序列用于标识和访问Series中的数据,默认情况下,索引序列将自动创建为整数序列。
下面是创建和操作Series对象的示例:
import pandas as pd
# 创建一个简单的Series对象
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
# 输出:
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# dtype: int64
# 指定索引序列
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)
# 输出:
# a 1
# b 2
# c 3
# d 4
# e 5
# dtype: int64
# 访问Series中的数据
print(series[2]) # 输出:3
print(series['c']) # 输出:3
# Series运算
print(series + 1)
# 输出:
# a 2
# b 3
# c 4
# d 5
# e 6
# dtype: int64
# Series统计计算
print(series.mean()) # 输出:3.0
print(series.max()) # 输出:5
Series对象可以像数组一样进行索引、切片和运算,同时还提供了许多方便的函数和方法用于数据处理和分析。通过使用Series对象,可以更方便地对一维数据进行操作和分析。
下面是操作Series对象的示例:
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 概念\n",
"- 利用pandas库从外部读取数据到python中形成的表格叫做DataFrame表格对象\n",
"- Series序列对象即DataFrame表格对象中的某一列数据\n",
"- Series序列对象简称序列对象,其数据类型是Series\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 获取序列对象的方法"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 67663\n",
"1 6016\n",
"2 3148\n",
"3 1470\n",
"4 1409\n",
"5 1183\n",
"6 993\n",
"7 768\n",
"8 749\n",
"9 524\n",
"10 453\n",
"11 443\n",
"12 372\n",
"13 351\n",
"14 351\n",
"15 324\n",
"16 305\n",
"17 295\n",
"18 248\n",
"19 207\n",
"20 201\n",
"21 99\n",
"22 18\n",
"23 9\n",
"24 5\n",
"25 3\n",
"26 2\n",
"27 1\n",
"28 1\n",
"Name: 销售总数, dtype: int64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1. 通过表格对象获取\n",
"df = pd.read_excel('../数据集/各省份车辆销售数量.xlsx')\n",
"df['销售总数']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 a\n",
"1 b\n",
"2 c\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 通过pd.Series类生成\n",
"series = pd.Series(['a','b','c'])\n",
"series"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(series)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Series序列对象的属性"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['a' 'b' 'c']\n",
"RangeIndex(start=0, stop=3, step=1)\n",
"None\n",
"object\n",
"3\n"
]
}
],
"source": [
"series = pd.Series(['a','b','c'])\n",
"print(series.values) # 值\n",
"print(series.index) # 索引\n",
"print(series.name) # 名称\n",
"print(series.dtypes) # 数据类型\n",
"print(series.size) # 数据个数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Series序列对象中局部数据的访问方法\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 a\n",
"1 b\n",
"2 c\n",
"dtype: object\n",
"--------------------\n",
"a\n",
"--------------------\n",
"0 a\n",
"1 b\n",
"dtype: object\n",
"--------------------\n",
"1 b\n",
"2 c\n",
"dtype: object\n"
]
}
],
"source": [
"series = pd.Series(['a','b','c'])\n",
"print(series)\n",
"print('-'*20)\n",
"print(series[0])\n",
"print('-'*20)\n",
"print(series[0:2])\n",
"print('-'*20)\n",
"print(series[series!='a'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 序列对象的运算"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"序列对象:\n",
" 0 1\n",
"1 2\n",
"2 3\n",
"dtype: int64\n",
"加法运算:\n",
" 0 11\n",
"1 12\n",
"2 13\n",
"dtype: int64\n",
"乘法运算:\n",
" 0 10\n",
"1 20\n",
"2 30\n",
"dtype: int64\n",
"两个序列对象的运算:\n",
" 0 10\n",
"1 20\n",
"2 30\n",
"dtype: int64\n",
"字符串序列加法运算:\n",
" 0 1天\n",
"1 2天\n",
"2 3天\n",
"dtype: object\n",
"判断运算:\n",
" 0 False\n",
"1 True\n",
"2 False\n",
"dtype: bool\n"
]
}
],
"source": [
"series = pd.Series([1,2,3])\n",
"print('序列对象:\\n',series)\n",
"print('加法运算:\\n',series+10)\n",
"print('乘法运算:\\n',series*10)\n",
"print('两个序列对象的运算:\\n',series*6+series*4)\n",
"print('字符串序列加法运算:\\n',series.astype(str)+'天')\n",
"print('判断运算:\\n',series==2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 序列对象的常用方法"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## astype() 数据类型转换方法"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"dtype: int64\n",
"0 1.0\n",
"1 2.0\n",
"2 3.0\n",
"dtype: float64\n",
"0 1\n",
"1 2\n",
"2 3\n",
"dtype: object\n",
"0 1个\n",
"1 2个\n",
"2 3个\n",
"dtype: object\n"
]
}
],
"source": [
"series = pd.Series([1,2,3])\n",
"print(series)\n",
"print(series.astype('float'))\n",
"print(series.astype('str'))\n",
"print(series.astype('str')+'个')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## value_counts() 统计序列中元素出现的次数"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 3\n",
"dtype: int64\n",
"3 2\n",
"1 1\n",
"2 1\n",
"dtype: int64\n"
]
}
],
"source": [
"series = pd.Series([1,2,3,3])\n",
"print(series)\n",
"print(series.value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## sort_values() 对序列中的数据进行排序\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 5\n",
"3 3\n",
"4 4\n",
"dtype: int64\n",
"0 1\n",
"1 2\n",
"3 3\n",
"4 4\n",
"2 5\n",
"dtype: int64\n"
]
}
],
"source": [
"series = pd.Series([1,2,5,3,4])\n",
"print(series)\n",
"print(series.sort_values())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## rank() 返回序列中数据大小的排名序号\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 100\n",
"2 10\n",
"dtype: int64\n",
"0 1.0\n",
"1 3.0\n",
"2 2.0\n",
"dtype: float64\n",
"0 3.0\n",
"1 1.0\n",
"2 2.0\n",
"dtype: float64\n"
]
}
],
"source": [
"series = pd.Series([1,100,10])\n",
"print(series)\n",
"print(series.rank())\n",
"print(series.rank(ascending=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# round() 控制数字型序列的小数点位"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0.333333\n",
"1 0.166667\n",
"2 0.125000\n",
"dtype: float64\n",
"0 0.33\n",
"1 0.17\n",
"2 0.12\n",
"dtype: float64\n"
]
}
],
"source": [
"series = pd.Series([1/3,1/6,1/8])\n",
"print(series)\n",
"print(series.round(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 序列对象.str.方法名() \n",
"一系列用于批量处理字符串序列对象中元素的方法\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"原序列:\n",
" 0 hello world\n",
"1 good job\n",
"dtype: object\n",
"是否包含字符o:\n",
" 0 True\n",
"1 True\n",
"dtype: bool\n",
"根据空格分割字符:\n",
" 0 [hello, world]\n",
"1 [good, job]\n",
"dtype: object\n",
"替换字符中的空格:\n",
" 0 hello-world\n",
"1 good-job\n",
"dtype: object\n",
"删除字符中的空格:\n",
" 0 1\n",
"1 0\n",
"dtype: int64\n"
]
}
],
"source": [
"series = pd.Series(['hello world','good job'])\n",
"print('原序列:\\n',series)\n",
"print('是否包含字符o:\\n',series.str.contains('o'))\n",
"print('根据空格分割字符:\\n',series.str.split(' '))\n",
"print('替换字符中的空格:\\n',series.str.replace(' ','-'))\n",
"print('统计e字符出现的次数:\\n',series.str.count('e'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## agg() 对序列对象的元素进行加工的方法\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"原序列:\n",
" 0 15岁\n",
"1 20岁\n",
"2 30岁\n",
"dtype: object\n",
"提取数字序列:\n",
" 0 15\n",
"1 20\n",
"2 30\n",
"dtype: int64\n",
"加工为是否成年序列:\n",
" 0 未成年\n",
"1 成年\n",
"2 成年\n",
"dtype: object\n"
]
}
],
"source": [
"series = pd.Series(['15岁','20岁','30岁'])\n",
"print('原序列:\\n',series)\n",
"series2 = series.agg(lambda x:int(x.split('岁')[0]))\n",
"print('提取数字序列:\\n',series2)\n",
"series3 = series2.agg(lambda x:'成年' if x>18 else '未成年') \n",
"print('加工为是否成年序列:\\n',series3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# max()/min()/mean()/median()/count()/var()/...\n",
"计算数字型序列对象中所有元素的统计量\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"最大值: 4\n",
"最小值: 1\n",
"平均值: 2.5\n",
"中位数: 2.5\n",
"元素个数: 4\n",
"方差: 1.6666666666666667\n"
]
}
],
"source": [
"series = pd.Series([1,2,3,4])\n",
"print('最大值:',series.max())\n",
"print('最小值:',series.min())\n",
"print('平均值:',series.mean())\n",
"print('中位数:',series.median())\n",
"print('元素个数:',series.count())\n",
"print('方差:',series.var())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
表格对象实现统计分析
在Pandas库中,表格对象主要是指DataFrame(数据帧)对象。DataFrame是一个二维的、类似于表格的数据结构,由行和列组成。它可以容纳不同类型的数据,并且提供了丰富的功能来进行统计分析。
下面是一些常见的统计分析操作示例:
1.描述性统计:可以使用describe()
方法获取DataFrame中数值列的基本统计信息,如计数、均值、标准差、最小值、最大值等。
import pandas as pd
# 创建一个简单的DataFrame对象
data = {'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# 描述性统计
print(df.describe())
2.汇总统计:可以使用sum()
、mean()
、median()
等方法对DataFrame进行列或行的汇总统计。
# 汇总统计
print(df.sum()) # 列的总和
print(df.mean()) # 列的平均值
print(df.median()) # 列的中位数
print(df.sum(axis=1)) # 行的总和
3.分组统计:可以使用groupby()
方法按照某一列或多个列进行分组,然后应用聚合函数进行统计分析。
# 分组统计
grouped = df.groupby('A')
print(grouped.mean()) # 按照列'A'分组,计算其他列的平均值
4.相关性分析:可以使用corr()
方法计算DataFrame中各列之间的相关性。
# 相关性分析
print(df.corr()) # 计算各列之间的相关系数
5.数据透视表:可以使用pivot_table()
方法创建数据透视表,用于进行多维度的汇总和统计。
# 数据透视表
pivot = df.pivot_table(index='A', columns='B', values='C', aggfunc='sum')
print(pivot)
以上只是一些常见的统计分析操作示例,Pandas提供了更多的函数和方法用于数据清洗、转换、分析和可视化。通过利用DataFrame对象的强大功能,可以更方便地进行统计分析和探索性数据分析。