Python数据分析与可视化一课一得

什么是Python数据分析与可视化

Python数据分析与可视化是指使用Python编程语言进行数据处理、分析和可视化的过程。Python拥有许多强大的库和工具,使得数据分析和可视化变得更加简单和高效。

数据分析通常涉及对数据进行清洗、转换、整合和统计等处理,以获取有关数据的洞察和见解。Python提供了一些常用的数据分析库,如NumPy、Pandas和SciPy,它们提供了各种数据结构、函数和算法,用于处理和分析数据。

可视化是将数据以图表、图形或其他可视形式展示出来的过程。通过可视化,我们可以更直观地理解数据、发现模式和趋势,并向其他人传达数据的信息。Python中最常用的数据可视化库是Matplotlib和Seaborn,它们提供了各种绘图函数和样式,使得绘制各种类型的图表变得简单。

除了Matplotlib和Seaborn,还有其他一些流行的数据可视化库,如Plotly、Bokeh和Altair,它们提供了更交互性和高级的可视化功能。

使用Python进行数据分析和可视化的基本步骤包括:

1. 数据获取:从文件、数据库或API等来源获取数据。
2. 数据清洗和预处理:对数据进行清洗、转换和处理,去除异常值、处理缺失数据等。
3. 数据分析:使用统计方法、机器学习算法等对数据进行分析,探索数据的特征和关系。
4. 数据可视化:使用合适的图表、图形或地图等方式将数据可视化,以便更好地理解和传达数据信息。
5. 结果解释和报告:对分析和可视化结果进行解释,并撰写相应的报告或文档。

Python数据分析与可视化的优势包括语法简洁易学、强大的库和工具支持、丰富的社区资源和广泛的应用领域。无论是从事科学研究、商业分析还是数据驱动的决策,Python都是一个强大的工具。

Python数据分析依赖的两个对象

Python数据分析依赖的两个主要对象是NumPy数组和Pandas数据帧。

1. NumPy数组(N-dimensional Array):NumPy是Python中一个强大的数值计算库,提供了多维数组对象和一系列用于操作数组的函数。在数据分析中,NumPy数组是一种高效的数据结构,用于存储和处理大规模数值数据。NumPy数组具有快速的数值运算和广播功能,可以进行向量化操作,使得处理大型数据集变得更加高效。它还提供了各种数学、统计和线性代数函数,用于数据计算和分析。

2. Pandas数据帧(DataFrame):Pandas是一个基于NumPy构建的数据分析工具,提供了高级数据结构和数据操作功能。其中最常用的数据结构是DataFrame,它类似于表格或电子表格的数据结构,包含行和列的标签,以及灵活的索引和选择功能。Pandas的DataFrame可以处理异构数据,支持处理时间序列数据、缺失数据、重复数据等。它提供了丰富的数据操作、转换和处理函数,如数据过滤、排序、合并、聚合等,方便进行数据清洗、转换和分析。

这两个对象在Python数据分析中扮演着重要的角色。NumPy提供了底层的数值计算和数组操作能力,而Pandas构建在NumPy之上,提供了更高级的数据结构和数据处理功能,使得数据分析更加便捷和灵活。

除了NumPy和Pandas,Python数据分析还涉及其他一些常用的库和工具,如SciPy(科学计算库)、Matplotlib(绘图库)、Seaborn(统计可视化库)等。这些库共同构成了Python数据分析的生态系统,为数据科学家和分析师提供了丰富的工具和资源。

导入外部数据

 要导入外部数据到Python中进行数据分析,可以使用多种方法,具体取决于数据的来源和格式。下面是几种常见的导入外部数据的方法:

1.CSV文件:如果数据存储在逗号分隔值(CSV)文件中,可以使用pandas库的read_csv()函数来读取数据并创建一个数据帧(DataFrame)对象。例如:

import pandas as pd

data = pd.read_csv('data.csv')  # 替换为你的CSV文件路径

2.Excel文件:如果数据存储在Excel文件中,可以使用pandas库的read_excel()函数来读取数据。需要先安装openpyxlxlrd库。例如:

import pandas as pd

data = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # 替换为你的Excel文件路径和工作表名称

3.JSON文件:如果数据存储在JSON文件中,可以使用json库来读取数据。例如:

import json

 open('data.json') as f:  # 替换为你的JSON文件路径
    data = json.load(f)

4.数据库:如果数据存储在数据库中,可以使用p库的read_sql()函数来执行SQL查询并将结果读取为数据帧。需要先安装相应的数据库驱动程序,如pyodbcpymysqlpsycopg2等。例如:

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')  # 替换为你的数据库连接信息
query = 'SELECT * FROM table'  # 替换为你的SQL查询语句
data = pd.read_sql(query, conn)

以上只是一些常见的导入外部数据的方法,实际情况可能因数据来源和格式而有所不同。无论是从文件、数据库、API还是其他来源导入数据,Python提供了广泛的库和工具来处理各种数据格式。

例如:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6677b697",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7266548f",
   "metadata": {},
   "source": [
    "# 1. 读取csv数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3c15c8e6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>姓名</th>\n",
       "      <th>性别</th>\n",
       "      <th>省份</th>\n",
       "      <th>城市</th>\n",
       "      <th>年龄</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>超级管理员</td>\n",
       "      <td>男</td>\n",
       "      <td>广东广州</td>\n",
       "      <td>泰迪科技</td>\n",
       "      <td>23.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>叶亦凯</td>\n",
       "      <td>男</td>\n",
       "      <td>广东广州</td>\n",
       "      <td>广州</td>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>邓彬彬</td>\n",
       "      <td>女</td>\n",
       "      <td>广东广州</td>\n",
       "      <td>广州</td>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>张建涛</td>\n",
       "      <td>男</td>\n",
       "      <td>广东广州</td>\n",
       "      <td>广州</td>\n",
       "      <td>22.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>莫诗怡</td>\n",
       "      <td>女</td>\n",
       "      <td>广西</td>\n",
       "      <td>广州</td>\n",
       "      <td>23.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>728</th>\n",
       "      <td>杜殿雨</td>\n",
       "      <td>女</td>\n",
       "      <td>江西九江</td>\n",
       "      <td>佛山</td>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>729</th>\n",
       "      <td>杜鹦超</td>\n",
       "      <td>男</td>\n",
       "      <td>天津</td>\n",
       "      <td>广州</td>\n",
       "      <td>43.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>730</th>\n",
       "      <td>杜小悦</td>\n",
       "      <td>女</td>\n",
       "      <td>广东广州</td>\n",
       "      <td>佛山</td>\n",
       "      <td>23.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>731</th>\n",
       "      <td>杜雨玲</td>\n",
       "      <td>女</td>\n",
       "      <td>福建厦门</td>\n",
       "      <td>广州</td>\n",
       "      <td>37.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>732</th>\n",
       "      <td>杜依醇</td>\n",
       "      <td>女</td>\n",
       "      <td>广西南宁</td>\n",
       "      <td>佛山</td>\n",
       "      <td>46.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>733 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        姓名 性别    省份    城市    年龄\n",
       "0    超级管理员  男  广东广州  泰迪科技  23.0\n",
       "1      叶亦凯  男  广东广州    广州  21.0\n",
       "2      邓彬彬  女  广东广州    广州  21.0\n",
       "3      张建涛  男  广东广州    广州  22.0\n",
       "4      莫诗怡  女    广西    广州  23.0\n",
       "..     ... ..   ...   ...   ...\n",
       "728    杜殿雨  女  江西九江    佛山  48.0\n",
       "729    杜鹦超  男    天津    广州  43.0\n",
       "730    杜小悦  女  广东广州    佛山  23.0\n",
       "731    杜雨玲  女  福建厦门    广州  37.0\n",
       "732    杜依醇  女  广西南宁    佛山  46.0\n",
       "\n",
       "[733 rows x 5 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data1 = pd.read_csv(r'../../数据集/新用户表.csv',\n",
    "                   sep=',',\n",
    "                   encoding='gbk'\n",
    "                   )\n",
    "data1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0be77ea",
   "metadata": {},
   "source": [
    "# 2. 读取excel数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "63cd631c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>明细编号</th>\n",
       "      <th>订单编号</th>\n",
       "      <th>菜名</th>\n",
       "      <th>数量</th>\n",
       "      <th>单价</th>\n",
       "      <th>菜品添加时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2352</td>\n",
       "      <td>366</td>\n",
       "      <td>香酥两吃大虾</td>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>2016-08-11 11:49:43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2354</td>\n",
       "      <td>366</td>\n",
       "      <td>姜葱炒花蟹</td>\n",
       "      <td>1</td>\n",
       "      <td>45</td>\n",
       "      <td>2016-08-11 11:51:17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2356</td>\n",
       "      <td>366</td>\n",
       "      <td>香烤牛排\\r\\n\\r\\n\\r\\n</td>\n",
       "      <td>1</td>\n",
       "      <td>55</td>\n",
       "      <td>2016-08-11 11:52:48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2358</td>\n",
       "      <td>366</td>\n",
       "      <td>铁板牛肉</td>\n",
       "      <td>1</td>\n",
       "      <td>66</td>\n",
       "      <td>2016-08-11 11:53:47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2361</td>\n",
       "      <td>366</td>\n",
       "      <td>蒜香包</td>\n",
       "      <td>1</td>\n",
       "      <td>13</td>\n",
       "      <td>2016-08-11 11:54:12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3642</th>\n",
       "      <td>3856</td>\n",
       "      <td>1125</td>\n",
       "      <td>白饭/大碗</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>2016-08-20 22:06:33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3643</th>\n",
       "      <td>3857</td>\n",
       "      <td>1125</td>\n",
       "      <td>泡椒凤爪\\r\\n\\r\\n\\r\\n</td>\n",
       "      <td>1</td>\n",
       "      <td>58</td>\n",
       "      <td>2016-08-20 22:07:31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3644</th>\n",
       "      <td>3858</td>\n",
       "      <td>1125</td>\n",
       "      <td>拌土豆丝</td>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>2016-08-20 22:07:47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3645</th>\n",
       "      <td>3859</td>\n",
       "      <td>1125</td>\n",
       "      <td>牛尾汤</td>\n",
       "      <td>1</td>\n",
       "      <td>40</td>\n",
       "      <td>2016-08-20 22:08:38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3646</th>\n",
       "      <td>3860</td>\n",
       "      <td>1125</td>\n",
       "      <td>照烧鸡腿\\r\\n\\r\\n\\r\\n</td>\n",
       "      <td>1</td>\n",
       "      <td>16</td>\n",
       "      <td>2016-08-20 22:09:17</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3647 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      明细编号  订单编号                菜名  数量  单价              菜品添加时间\n",
       "0     2352   366            香酥两吃大虾   1  89 2016-08-11 11:49:43\n",
       "1     2354   366             姜葱炒花蟹   1  45 2016-08-11 11:51:17\n",
       "2     2356   366  香烤牛排\\r\\n\\r\\n\\r\\n   1  55 2016-08-11 11:52:48\n",
       "3     2358   366              铁板牛肉   1  66 2016-08-11 11:53:47\n",
       "4     2361   366               蒜香包   1  13 2016-08-11 11:54:12\n",
       "...    ...   ...               ...  ..  ..                 ...\n",
       "3642  3856  1125             白饭/大碗   1  10 2016-08-20 22:06:33\n",
       "3643  3857  1125  泡椒凤爪\\r\\n\\r\\n\\r\\n   1  58 2016-08-20 22:07:31\n",
       "3644  3858  1125              拌土豆丝   1  25 2016-08-20 22:07:47\n",
       "3645  3859  1125               牛尾汤   1  40 2016-08-20 22:08:38\n",
       "3646  3860  1125  照烧鸡腿\\r\\n\\r\\n\\r\\n   1  16 2016-08-20 22:09:17\n",
       "\n",
       "[3647 rows x 6 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data2 = pd.read_excel(r'../../数据集/新明细表.xlsx')\n",
    "data2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cac14fb",
   "metadata": {},
   "source": [
    "# 3. 将表格上传至数据库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "356f94b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Engine(mysql+pymysql://root:***@localhost:3306/data_analysis)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sqlalchemy import create_engine \n",
    "user = 'root' #用户名\n",
    "password = '123456' #密码\n",
    "ip = 'localhost' # ip地址\n",
    "port = '3306' # 端口\n",
    "database = 'data_analysis' # 数据库名\n",
    "\n",
    "# 创建mysql引擎对象,将上述信息写入函数的参数中\n",
    "engine = create_engine(f'mysql+pymysql://{user}:{password}@{ip}:{port}/{database}') \n",
    "engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b8c95629",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将表data1写入engine中的数据库中的user表中\n",
    "data1.to_sql('user',engine,index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3ef8d997",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将表data2写入engine中的数据库中的detail表中\n",
    "data2.to_sql('detail',engine,index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba20bfdd",
   "metadata": {},
   "source": [
    "# 4. 读取Mysql数据库数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "28d2bdf8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd # 导入pandas库并重命名\n",
    "from sqlalchemy import create_engine "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1a8affae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Engine(mysql+pymysql://root:***@localhost:3306/data_analysis)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user = 'root' #用户名\n",
    "password = '123456' #密码\n",
    "ip = 'localhost' # ip地址\n",
    "port = '3306' # 端口\n",
    "database = 'data_analysis' # 数据库名\n",
    "\n",
    "# 创建mysql引擎对象,将上述信息写入函数的参数中\n",
    "engine = create_engine(f'mysql+pymysql://{user}:{password}@{ip}:{port}/{database}') \n",
    "engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1089f0b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>订单编号</th>\n",
       "      <th>用餐人数</th>\n",
       "      <th>订单总花费</th>\n",
       "      <th>点菜总数</th>\n",
       "      <th>用餐开始时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>417</td>\n",
       "      <td>4</td>\n",
       "      <td>165</td>\n",
       "      <td>5</td>\n",
       "      <td>2016-08-01 11:05:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>301</td>\n",
       "      <td>3</td>\n",
       "      <td>321</td>\n",
       "      <td>6</td>\n",
       "      <td>2016-08-01 11:15:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>413</td>\n",
       "      <td>6</td>\n",
       "      <td>854</td>\n",
       "      <td>15</td>\n",
       "      <td>2016-08-01 12:42:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>415</td>\n",
       "      <td>4</td>\n",
       "      <td>466</td>\n",
       "      <td>10</td>\n",
       "      <td>2016-08-01 12:51:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>392</td>\n",
       "      <td>10</td>\n",
       "      <td>704</td>\n",
       "      <td>24</td>\n",
       "      <td>2016-08-01 12:58:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>930</th>\n",
       "      <td>641</td>\n",
       "      <td>8</td>\n",
       "      <td>679</td>\n",
       "      <td>12</td>\n",
       "      <td>2016-08-31 21:23:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>931</th>\n",
       "      <td>672</td>\n",
       "      <td>6</td>\n",
       "      <td>800</td>\n",
       "      <td>24</td>\n",
       "      <td>2016-08-31 21:24:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>932</th>\n",
       "      <td>692</td>\n",
       "      <td>8</td>\n",
       "      <td>735</td>\n",
       "      <td>10</td>\n",
       "      <td>2016-08-31 21:25:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>933</th>\n",
       "      <td>647</td>\n",
       "      <td>4</td>\n",
       "      <td>262</td>\n",
       "      <td>9</td>\n",
       "      <td>2016-08-31 21:37:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>934</th>\n",
       "      <td>570</td>\n",
       "      <td>8</td>\n",
       "      <td>589</td>\n",
       "      <td>13</td>\n",
       "      <td>2016-08-31 21:41:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>935 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     订单编号  用餐人数  订单总花费  点菜总数              用餐开始时间\n",
       "0     417     4    165     5 2016-08-01 11:05:00\n",
       "1     301     3    321     6 2016-08-01 11:15:00\n",
       "2     413     6    854    15 2016-08-01 12:42:00\n",
       "3     415     4    466    10 2016-08-01 12:51:00\n",
       "4     392    10    704    24 2016-08-01 12:58:00\n",
       "..    ...   ...    ...   ...                 ...\n",
       "930   641     8    679    12 2016-08-31 21:23:00\n",
       "931   672     6    800    24 2016-08-31 21:24:00\n",
       "932   692     8    735    10 2016-08-31 21:25:00\n",
       "933   647     4    262     9 2016-08-31 21:37:00\n",
       "934   570     8    589    13 2016-08-31 21:41:00\n",
       "\n",
       "[935 rows x 5 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sql查询语句\n",
    "sql = 'select * from data_analysis.order'\n",
    "# 调用pandas库的read_sql_query函数查询数据库数据\n",
    "df = pd.read_sql_query(sql,engine) # 传入sql语句和engine对象\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b67a09a5",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

Series序列对象

Series是Pandas库中的一种数据结构,它表示一维标记的、带有轴标签的数据序列。Series可以包含任意类型的数据(整数、浮点数、字符串等),类似于一维数组或列表。

创建Series对象时,需要传入一个数据序列(如列表、NumPy数组等)和一个可选的索引序列。索引序列用于标识和访问Series中的数据,默认情况下,索引序列将自动创建为整数序列。

下面是创建和操作Series对象的示例:

import pandas as pd

# 创建一个简单的Series对象
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
# 输出:
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# dtype: int64

# 指定索引序列
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)
# 输出:
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int64

# 访问Series中的数据
print(series[2])  # 输出:3
print(series['c'])  # 输出:3

# Series运算
print(series + 1)
# 输出:
# a    2
# b    3
# c    4
# d    5
# e    6
# dtype: int64

# Series统计计算
print(series.mean())  # 输出:3.0
print(series.max())  # 输出:5

Series对象可以像数组一样进行索引、切片和运算,同时还提供了许多方便的函数和方法用于数据处理和分析。通过使用Series对象,可以更方便地对一维数据进行操作和分析。

下面是操作Series对象的示例:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 概念\n",
    "- 利用pandas库从外部读取数据到python中形成的表格叫做DataFrame表格对象\n",
    "- Series序列对象即DataFrame表格对象中的某一列数据\n",
    "- Series序列对象简称序列对象,其数据类型是Series\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 获取序列对象的方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     67663\n",
       "1      6016\n",
       "2      3148\n",
       "3      1470\n",
       "4      1409\n",
       "5      1183\n",
       "6       993\n",
       "7       768\n",
       "8       749\n",
       "9       524\n",
       "10      453\n",
       "11      443\n",
       "12      372\n",
       "13      351\n",
       "14      351\n",
       "15      324\n",
       "16      305\n",
       "17      295\n",
       "18      248\n",
       "19      207\n",
       "20      201\n",
       "21       99\n",
       "22       18\n",
       "23        9\n",
       "24        5\n",
       "25        3\n",
       "26        2\n",
       "27        1\n",
       "28        1\n",
       "Name: 销售总数, dtype: int64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 1. 通过表格对象获取\n",
    "df = pd.read_excel('../数据集/各省份车辆销售数量.xlsx')\n",
    "df['销售总数']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    a\n",
       "1    b\n",
       "2    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过pd.Series类生成\n",
    "series = pd.Series(['a','b','c'])\n",
    "series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(series)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Series序列对象的属性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['a' 'b' 'c']\n",
      "RangeIndex(start=0, stop=3, step=1)\n",
      "None\n",
      "object\n",
      "3\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series(['a','b','c'])\n",
    "print(series.values) # 值\n",
    "print(series.index) # 索引\n",
    "print(series.name) # 名称\n",
    "print(series.dtypes) # 数据类型\n",
    "print(series.size) # 数据个数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Series序列对象中局部数据的访问方法\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    a\n",
      "1    b\n",
      "2    c\n",
      "dtype: object\n",
      "--------------------\n",
      "a\n",
      "--------------------\n",
      "0    a\n",
      "1    b\n",
      "dtype: object\n",
      "--------------------\n",
      "1    b\n",
      "2    c\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series(['a','b','c'])\n",
    "print(series)\n",
    "print('-'*20)\n",
    "print(series[0])\n",
    "print('-'*20)\n",
    "print(series[0:2])\n",
    "print('-'*20)\n",
    "print(series[series!='a'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 序列对象的运算"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "序列对象:\n",
      " 0    1\n",
      "1    2\n",
      "2    3\n",
      "dtype: int64\n",
      "加法运算:\n",
      " 0    11\n",
      "1    12\n",
      "2    13\n",
      "dtype: int64\n",
      "乘法运算:\n",
      " 0    10\n",
      "1    20\n",
      "2    30\n",
      "dtype: int64\n",
      "两个序列对象的运算:\n",
      " 0    10\n",
      "1    20\n",
      "2    30\n",
      "dtype: int64\n",
      "字符串序列加法运算:\n",
      " 0    1天\n",
      "1    2天\n",
      "2    3天\n",
      "dtype: object\n",
      "判断运算:\n",
      " 0    False\n",
      "1     True\n",
      "2    False\n",
      "dtype: bool\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series([1,2,3])\n",
    "print('序列对象:\\n',series)\n",
    "print('加法运算:\\n',series+10)\n",
    "print('乘法运算:\\n',series*10)\n",
    "print('两个序列对象的运算:\\n',series*6+series*4)\n",
    "print('字符串序列加法运算:\\n',series.astype(str)+'天')\n",
    "print('判断运算:\\n',series==2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 序列对象的常用方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## astype() 数据类型转换方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "dtype: int64\n",
      "0    1.0\n",
      "1    2.0\n",
      "2    3.0\n",
      "dtype: float64\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "dtype: object\n",
      "0    1个\n",
      "1    2个\n",
      "2    3个\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series([1,2,3])\n",
    "print(series)\n",
    "print(series.astype('float'))\n",
    "print(series.astype('str'))\n",
    "print(series.astype('str')+'个')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## value_counts() 统计序列中元素出现的次数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3    3\n",
      "dtype: int64\n",
      "3    2\n",
      "1    1\n",
      "2    1\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series([1,2,3,3])\n",
    "print(series)\n",
    "print(series.value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## sort_values() 对序列中的数据进行排序\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    5\n",
      "3    3\n",
      "4    4\n",
      "dtype: int64\n",
      "0    1\n",
      "1    2\n",
      "3    3\n",
      "4    4\n",
      "2    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series([1,2,5,3,4])\n",
    "print(series)\n",
    "print(series.sort_values())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  rank() 返回序列中数据大小的排名序号\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0      1\n",
      "1    100\n",
      "2     10\n",
      "dtype: int64\n",
      "0    1.0\n",
      "1    3.0\n",
      "2    2.0\n",
      "dtype: float64\n",
      "0    3.0\n",
      "1    1.0\n",
      "2    2.0\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series([1,100,10])\n",
    "print(series)\n",
    "print(series.rank())\n",
    "print(series.rank(ascending=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# round() 控制数字型序列的小数点位"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    0.333333\n",
      "1    0.166667\n",
      "2    0.125000\n",
      "dtype: float64\n",
      "0    0.33\n",
      "1    0.17\n",
      "2    0.12\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series([1/3,1/6,1/8])\n",
    "print(series)\n",
    "print(series.round(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 序列对象.str.方法名() \n",
    "一系列用于批量处理字符串序列对象中元素的方法\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "原序列:\n",
      " 0    hello world\n",
      "1       good job\n",
      "dtype: object\n",
      "是否包含字符o:\n",
      " 0    True\n",
      "1    True\n",
      "dtype: bool\n",
      "根据空格分割字符:\n",
      " 0    [hello, world]\n",
      "1       [good, job]\n",
      "dtype: object\n",
      "替换字符中的空格:\n",
      " 0    hello-world\n",
      "1       good-job\n",
      "dtype: object\n",
      "删除字符中的空格:\n",
      " 0    1\n",
      "1    0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series(['hello world','good job'])\n",
    "print('原序列:\\n',series)\n",
    "print('是否包含字符o:\\n',series.str.contains('o'))\n",
    "print('根据空格分割字符:\\n',series.str.split(' '))\n",
    "print('替换字符中的空格:\\n',series.str.replace(' ','-'))\n",
    "print('统计e字符出现的次数:\\n',series.str.count('e'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## agg() 对序列对象的元素进行加工的方法\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "原序列:\n",
      " 0    15岁\n",
      "1    20岁\n",
      "2    30岁\n",
      "dtype: object\n",
      "提取数字序列:\n",
      " 0    15\n",
      "1    20\n",
      "2    30\n",
      "dtype: int64\n",
      "加工为是否成年序列:\n",
      " 0    未成年\n",
      "1     成年\n",
      "2     成年\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series(['15岁','20岁','30岁'])\n",
    "print('原序列:\\n',series)\n",
    "series2 = series.agg(lambda x:int(x.split('岁')[0]))\n",
    "print('提取数字序列:\\n',series2)\n",
    "series3 = series2.agg(lambda x:'成年' if x>18 else '未成年')                     \n",
    "print('加工为是否成年序列:\\n',series3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# max()/min()/mean()/median()/count()/var()/...\n",
    "计算数字型序列对象中所有元素的统计量\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "最大值: 4\n",
      "最小值: 1\n",
      "平均值: 2.5\n",
      "中位数: 2.5\n",
      "元素个数: 4\n",
      "方差: 1.6666666666666667\n"
     ]
    }
   ],
   "source": [
    "series = pd.Series([1,2,3,4])\n",
    "print('最大值:',series.max())\n",
    "print('最小值:',series.min())\n",
    "print('平均值:',series.mean())\n",
    "print('中位数:',series.median())\n",
    "print('元素个数:',series.count())\n",
    "print('方差:',series.var())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

表格对象实现统计分析

在Pandas库中,表格对象主要是指DataFrame(数据帧)对象。DataFrame是一个二维的、类似于表格的数据结构,由行和列组成。它可以容纳不同类型的数据,并且提供了丰富的功能来进行统计分析。

下面是一些常见的统计分析操作示例:

1.描述性统计:可以使用describe()方法获取DataFrame中数值列的基本统计信息,如计数、均值、标准差、最小值、最大值等。

import pandas as pd

# 创建一个简单的DataFrame对象
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)

# 描述性统计
print(df.describe())

2.汇总统计:可以使用sum()mean()median()等方法对DataFrame进行列或行的汇总统计。

# 汇总统计
print(df.sum())  # 列的总和
print(df.mean())  # 列的平均值
print(df.median())  # 列的中位数
print(df.sum(axis=1))  # 行的总和

3.分组统计:可以使用groupby()方法按照某一列或多个列进行分组,然后应用聚合函数进行统计分析。

# 分组统计
grouped = df.groupby('A')
print(grouped.mean())  # 按照列'A'分组,计算其他列的平均值

4.相关性分析:可以使用corr()方法计算DataFrame中各列之间的相关性。

# 相关性分析
print(df.corr())  # 计算各列之间的相关系数

5.数据透视表:可以使用pivot_table()方法创建数据透视表,用于进行多维度的汇总和统计。

# 数据透视表
pivot = df.pivot_table(index='A', columns='B', values='C', aggfunc='sum')
print(pivot)

以上只是一些常见的统计分析操作示例,Pandas提供了更多的函数和方法用于数据清洗、转换、分析和可视化。通过利用DataFrame对象的强大功能,可以更方便地进行统计分析和探索性数据分析。

  • 18
    点赞
  • 30
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
### 回答1: Python数据分析与可是指使用Python语言进行数据收集、清理、转换、统计和可的过程。常用的Python数据分析和可工具包括pandas、numpy、matplotlib、seaborn等。通过这些工具包,可以对结构和非结构数据进行处理和可,从而发现数据中的规律和趋势,做出更准确的数据分析和业务决策。 ### 回答2: Python数据分析和可是使用Python编程语言进行数据处理和分析的过程。Python是一种通用的编程语言,它具有易学易用和强大的数据处理能力,因此被广泛应用于数据分析和可。 在数据分析方面,Python提供了许多强大的数据处理和统计分析库,如NumPy、Pandas和SciPy等。这些库可以用于加载、清洗、分析和处理各种类型的数据。NumPy提供了高效的多维数组和矩阵操作,Pandas则提供了灵活的数据结构和数据分析工具,SciPy则提供了常用的科学计算和统计函数。这些库的组合使得Python变得特别适合用于数据预处理和数据分析。 此外,在数据可方面,Python也有很多流行的库,如Matplotlib、Seaborn和Plotly。这些库提供了绘制各种类型图表和可效果的功能,如折线图、柱状图、散点图、热图等。通过这些库,我们可以将数据可,展示数据的趋势、关联以及其他有用的信息。 Python数据分析和可的流程通常包括以下几个步骤:数据准备,如加载和清洗数据;数据分析,如统计描述、数据挖掘和机器学习算法;数据可,如绘制图表和可效果;结果解释和报告,如解读数据结果和撰写报告。 总的来说,Python数据分析和可提供了一个强大的工具集合,使得我们可以通过编程的方式更好地理解和分析数据。无论是在科学研究、商业分析还是其他领域,Python都是一个强大且受欢迎的选择。 ### 回答3: Python数据分析与可是利用Python编程语言进行数据处理、分析和可的一种方法。Python以其简洁、易读且功能强大的特点,成为了数据科学领域的首选语言。 在Python中,有许多强大的数据处理和分析库,如NumPy、Pandas和SciPy。NumPy提供了高性能的多维数组对象,以及用于数组操作的工具。Pandas提供了数据结构和数据分析工具,简了数据处理的流程。而SciPy则提供了许多科学计算和统计分析的功能。 Python的可库也非常丰富,其中最著名的就是Matplotlib和Seaborn。Matplotlib提供了各种绘图功能,包括线图、散点图、柱状图等。Seaborn则提供了一些更高级的统计可工具,例如热力图和分类散点图。 除了以上提到的库,还有一些其他数据分析和可工具,如Scikit-learn用于机器学习、iPython用于交互式计算和Jupyter Notebook用于数据报告等。 Python数据分析与可的主要步骤包括数据收集与清洗、数据处理与分析、数据可以及结果报告。通过Python的强大库和工具,可以用较少的代码完成复杂的数据分析与可任务,从而更好地理解和展示数据。 总而言之,Python数据分析与可是一种用Python进行数据处理、分析和可的方法。通过使用各种强大的库和工具,可以轻松地进行数据分析和可,并从中获得有价值的信息。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值