Python Pandas读取文件
当使用 Pandas 做数据分析的时,需要读取事先准备好的数据集,这是做数据分析的第一步。Panda 提供了多种读取数据的方法:
- read_csv() 用于读取文本文件
- read_json() 用于读取 json 文件
- read_sql_query() 读取 sql 语句的,
本节将对上述方法做详细介绍。
CSV文件读取
CSV 又称逗号分隔值文件,是一种简单的文件格式,以特定的结构来排列表格数据。 CSV 文件能够以纯文本形式存储表格数据,比如电子表格、数据库文件,并具有数据交换的通用格式。CSV 文件会在 Excel 文件中被打开,其行和列都定义了标准的数据格式。
将 CSV 中的数据转换为 DataFrame 对象是非常便捷的。和一般文件读写不一样,它不需要你做打开文件、读取文件、关闭文件等操作。相反,您只需要一行代码就可以完成上述所有步骤,并将数据存储在 DataFrame 中。
下面进行实例演示,首先您需要创建一组数据,并将其保存为 CSV 格式,数据如下:
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902.0,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30
7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30
7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20
7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30
7698,BLAKE,MANAGER,7839.0,1981-05-01,2850,,30
7782,CLARK,MANAGER,7839.0,1981-06-09,2450,,10
7788,SCOTT,ANALYST,7566.0,1987-04-19,3000,,20
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7844,TURNER,SALESMAN,7698.0,1981-09-08,1500,0.0,30
7876,ADAMS,CLERK,7788.0,1987-05-23,1100,,20
7900,JAMES,CLERK,7698.0,1981-12-03,950,,30
7902,FORD,ANALYST,7566.0,1981-12-03,3000,,20
7934,MILLER,CLERK,7782.0,1982-01-23,1300,,10
注意:将上述数据保存到.txt
的文本文件中,然后将文件的扩展名后缀修改为 csv,即可完成 csv 文件的创建。
接下来,我们使用下列代码读写数据:
import pandas as pd
import matplotlib.pyplot as plt
#仅仅一行代码就完成了数据读取,但是注意文件路径不要写错
df = pd.read_csv("C:\\Users\\qwy\\Desktop\\data\\empdata.csv")
print(df)
输出结果:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
json读取文件
emp.json:
{"EMPNO":{"0":7369,"1":7499,"2":7521,"3":7566,"4":7654,"5":7698,"6":7782,"7":7788,"8":7839,"9":7844,"10":7876,"11":7900,"12":7902,"13":7934},"ENAME":{"0":"SMITH","1":"ALLEN","2":"WARD","3":"JONES","4":"MARTIN","5":"BLAKE","6":"CLARK","7":"SCOTT","8":"KING","9":"TURNER","10":"ADAMS","11":"JAMES","12":"FORD","13":"MILLER"},"JOB":{"0":"CLERK","1":"SALESMAN","2":"SALESMAN","3":"MANAGER","4":"SALESMAN","5":"MANAGER","6":"MANAGER","7":"ANALYST","8":"PRESIDENT","9":"SALESMAN","10":"CLERK","11":"CLERK","12":"ANALYST","13":"CLERK"},"MGR":{"0":7902.0,"1":7698.0,"2":7698.0,"3":7839.0,"4":7698.0,"5":7839.0,"6":7839.0,"7":7566.0,"8":null,"9":7698.0,"10":7788.0,"11":7698.0,"12":7566.0,"13":7782.0},"HIREDATE":{"0":"1980-12-17","1":"1981-02-20","2":"1981-02-22","3":"1981-04-02","4":"1981-09-28","5":"1981-05-01","6":"1981-06-09","7":"1987-04-19","8":"1981-11-17","9":"1981-09-08","10":"1987-05-23","11":"1981-12-03","12":"1981-12-03","13":"1982-01-23"},"SAL":{"0":800,"1":1600,"2":1250,"3":2975,"4":1250,"5":2850,"6":2450,"7":3000,"8":5000,"9":1500,"10":1100,"11":950,"12":3000,"13":1300},"COMM":{"0":null,"1":300.0,"2":500.0,"3":null,"4":1400.0,"5":null,"6":null,"7":null,"8":null,"9":0.0,"10":null,"11":null,"12":null,"13":null},"DEPTNO":{"0":20,"1":30,"2":30,"3":20,"4":30,"5":30,"6":10,"7":20,"8":10,"9":30,"10":20,"11":30,"12":20,"13":10}}
您可以通过下列方法来读取一个 json 文件,如下所示:
import pandas as pd
df = pd.read_json("C:\\Users\\qwy\\Desktop\data\\empdata.json")
print(df)
输出结果:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
SQL数据库读取
如果想要从 SQL 数据库读取数据,首先您应该使用 Python 和数据库建立连接,然后将查询语句传递给 read_sql_query() 方法,下面做简单地演示:
1) 安装pymysql模块
pip install pymysql
2) 建立数据连接
from sqlalchemy import create_engine
# 创建一个MySQL连接器,用户名为root,密码为1234# 地址为127.0.0.1,数据库名称为testdb,编码为UTF-8
engine = create_engine('mysql+pymysql://root:1234@127.0.0.1:3306/testdb?charset=utf8')
print(engine)
3) 数据库读取数据
数据库中的表结构:
在 SQLite 数据库中创建一张信息表,您可以随意添加一些信息,最后使用下列方法读取数据即可:
from sqlalchemy import create_engine
import pandas as pd
# 创建一个MySQL连接器,用户名为root,密码为1234# 地址为127.0.0.1,数据库名称为testdb,编码为UTF-8
engine = create_engine('mysql+pymysql://root:admin1@127.0.0.1:3306/mydb?charset=utf8')
df = pd.read_sql('emp',con=engine)
print(df)
运行结果:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10