1. Pandas CSV Files
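CSV (Comma-Separated Values; sometimes called character-separated values, because the delimiter does not have to be a comma) files store tabular data (numbers and text) as plain text.
CSV is a general-purpose, relatively simple file format that is widely used in consumer, business, and scientific applications.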
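Since the delimiter does not have to be a comma, here is a minimal sketch of reading semicolon-separated text with the sep parameter of read_csv(). The sample text and its column names are made up for illustration and are not part of nba.csv.

```python
import io
import pandas as pd

# Hypothetical sample data that uses ';' instead of ',' as the delimiter.
text = "name;age\nAnn;31\nBob;27\n"

# sep tells read_csv() which character separates the fields.
df = pd.read_csv(io.StringIO(text), sep=";")
print(df)
```

The same call works with a file path in place of the in-memory buffer.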
Pandas makes it easy to work with CSV files. This article uses nba.csv as the example; you can download nba.csv or open it to inspect its contents.
import pandas as pd
df = pd.read_csv('E:\\Edge浏览器文件\\nba.csv')
print(df.to_string())
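to_string() returns the entire DataFrame rendered as a string, so every row is printed. Without it, printing a large DataFrame shows only the first 5 and last 5 rows, with the middle replaced by an ellipsis (...).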
Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
5 Amir Johnson Boston Celtics 90.0 PF 29.0 6-9 240.0 NaN 12000000.0
6 Jordan Mickey Boston Celtics 55.0 PF 21.0 6-8 235.0 LSU 1170960.0
7 Kelly Olynyk Boston Celtics 41.0 C 25.0 7-0 238.0 Gonzaga 2165160.0
8 Terry Rozier Boston Celtics 12.0 PG 22.0 6-2 190.0 Louisville 1824360.0
9 Marcus Smart Boston Celtics 36.0 PG 22.0 6-4 220.0 Oklahoma State 3431040.0
10 Jared Sullinger Boston Celtics 7.0 C 24.0 6-9 260.0 Ohio State 2569260.0
11 Isaiah Thomas Boston Celtics 4.0 PG 27.0 5-9 185.0 Washington 6912869.0
12 Evan Turner Boston Celtics 11.0 SG 27.0 6-7 220.0 Ohio State 3425510.0
13 James Young Boston Celtics 13.0 SG 20.0 6-6 215.0 Kentucky 1749840.0
14 Tyler Zeller Boston Celtics 44.0 C 26.0 7-0 253.0 North Carolina 2616975.0
15 Bojan Bogdanovic Brooklyn Nets 44.0 SG 27.0 6-8 216.0 NaN 3425510.0
16 Markel Brown Brooklyn Nets 22.0 SG 24.0 6-3 190.0 Oklahoma State 845059.0
17 Wayne Ellington Brooklyn Nets 21.0 SG 28.0 6-4 200.0 North Carolina 1500000.0
18 Rondae Hollis-Jefferson Brooklyn Nets 24.0 SG 21.0 6-7 220.0 Arizona 1335480.0
19 Jarrett Jack Brooklyn Nets 2.0 PG 32.0 6-3 200.0 Georgia Tech 6300000.0
20 Sergey Karasev Brooklyn Nets 10.0 SG 22.0 6-7 208.0 NaN 1599840.0
21 Sean Kilpatrick Brooklyn Nets 6.0 SG 26.0 6-4 219.0 Cincinnati 134215.0
22 Shane Larkin Brooklyn Nets 0.0 PG 23.0 5-11 175.0 Miami (FL) 1500000.0
23 Brook Lopez Brooklyn Nets 11.0 C 28.0 7-0 275.0 Stanford 19689000.0
24 Chris McCullough Brooklyn Nets 1.0 PF 21.0 6-11 200.0 Syracuse 1140240.0
25 Willie Reed Brooklyn Nets 33.0 PF 26.0 6-10 220.0 Saint Louis 947276.0
26 Thomas Robinson Brooklyn Nets 41.0 PF 25.0 6-10 237.0 Kansas 981348.0
27 Henry Sims Brooklyn Nets 14.0 C 26.0 6-10 248.0 Georgetown 947276.0
28 Donald Sloan Brooklyn Nets 15.0 PG 28.0 6-3 205.0 Texas A&M 947276.0
29 Thaddeus Young Brooklyn Nets 30.0 PF 27.0 6-8 221.0 Georgia Tech 11235955.0
30 Arron Afflalo New York Knicks 4.0 SG 30.0 6-5 210.0 UCLA 8000000.0
31 Lou Amundson New York Knicks 17.0 PF 33.0 6-9 220.0 UNLV 1635476.0
32 Thanasis Antetokounmpo New York Knicks 43.0 SF 23.0 6-7 205.0 NaN 30888.0
33 Carmelo Anthony New York Knicks 7.0 SF 32.0 6-8 240.0 Syracuse 22875000.0
34 Jose Calderon New York Knicks 3.0 PG 34.0 6-3 200.0 NaN 7402812.0
35 Cleanthony Early New York Knicks 11.0 SF 25.0 6-8 210.0 Wichita State 845059.0
36 Langston Galloway New York Knicks 2.0 SG 24.0 6-2 200.0 Saint Joseph's 845059.0
37 Jerian Grant New York Knicks 13.0 PG 23.0 6-4 195.0 Notre Dame 1572360.0
38 Robin Lopez New York Knicks 8.0 C 28.0 7-0 255.0 Stanford 12650000.0
39 Kyle O'Quinn New York Knicks 9.0 PF 26.0 6-10 250.0 Norfolk State 3750000.0
40 Kristaps Porzingis New York Knicks 6.0 PF 20.0 7-3 240.0 NaN 4131720.0
41 Kevin Seraphin New York Knicks 1.0 C 26.0 6-10 278.0 NaN 2814000.0
42 Lance Thomas New York Knicks 42.0 SF 28.0 6-8 235.0 Duke 1636842.0
43 Sasha Vujacic New York Knicks 18.0 SG 32.0 6-7 195.0 NaN 947276.0
44 Derrick Williams New York Knicks 23.0 PF 25.0 6-8 240.0 Arizona 4000000.0
45 Tony Wroten New York Knicks 5.0 SG 23.0 6-6 205.0 Washington 167406.0
46 Elton Brand Philadelphia 76ers 42.0 PF 37.0 6-9 254.0 Duke NaN
47 Isaiah Canaan Philadelphia 76ers 0.0 PG 25.0 6-0 201.0 Murray State 947276.0
48 Robert Covington Philadelphia 76ers 33.0 SF 25.0 6-9 215.0 Tennessee State 1000000.0
49 Joel Embiid Philadelphia 76ers 21.0 C 22.0 7-0 250.0 Kansas 4626960.0
50 Jerami Grant Philadelphia 76ers 39.0 SF 22.0 6-8 210.0 Syracuse 845059.0
51 Richaun Holmes Philadelphia 76ers 22.0 PF 22.0 6-10 245.0 Bowling Green 1074169.0
52 Carl Landry Philadelphia 76ers 7.0 PF 32.0 6-9 248.0 Purdue 6500000.0
53 Kendall Marshall Philadelphia 76ers 5.0 PG 24.0 6-4 200.0 North Carolina 2144772.0
54 T.J. McConnell Philadelphia 76ers 12.0 PG 24.0 6-2 200.0 Arizona 525093.0
55 Nerlens Noel Philadelphia 76ers 4.0 PF 22.0 6-11 228.0 Kentucky 3457800.0
56 Jahlil Okafor Philadelphia 76ers 8.0 C 20.0 6-11 275.0 Duke 4582680.0
57 Ish Smith Philadelphia 76ers 1.0 PG 27.0 6-0 175.0 Wake Forest 947276.0
58 Nik Stauskas Philadelphia 76ers 11.0 SG 22.0 6-6 205.0 Michigan 2869440.0
59 Hollis Thompson Philadelphia 76ers 31.0 SG 25.0 6-8 206.0 Georgetown 947276.0
60 Christian Wood Philadelphia 76ers 35.0 PF 20.0 6-11 220.0 UNLV 525093.0
61 Bismack Biyombo Toronto Raptors 8.0 C 23.0 6-9 245.0 NaN 2814000.0
62 Bruno Caboclo Toronto Raptors 20.0 SF 20.0 6-9 205.0 NaN 1524000.0
63 DeMarre Carroll Toronto Raptors 5.0 SF 29.0 6-8 212.0 Missouri 13600000.0
64 DeMar DeRozan Toronto Raptors 10.0 SG 26.0 6-7 220.0 USC 10050000.0
65 James Johnson Toronto Raptors 3.0 PF 29.0 6-9 250.0 Wake Forest 2500000.0
66 Cory Joseph Toronto Raptors 6.0 PG 24.0 6-3 190.0 Texas 7000000.0
67 Kyle Lowry Toronto Raptors 7.0 PG 30.0 6-0 205.0 Villanova 12000000.0
68 Lucas Nogueira Toronto Raptors 92.0 C 23.0 7-0 220.0 NaN 1842000.0
69 Patrick Patterson Toronto Raptors 54.0 PF 27.0 6-9 235.0 Kentucky 6268675.0
70 Norman Powell Toronto Raptors 24.0 SG 23.0 6-4 215.0 UCLA 650000.0
71 Terrence Ross Toronto Raptors 31.0 SF 25.0 6-7 195.0 Washington 3553917.0
72 Luis Scola Toronto Raptors 4.0 PF 36.0 6-9 240.0 NaN 2900000.0
73 Jason Thompson Toronto Raptors 1.0 PF 29.0 6-11 250.0 Rider 245177.0
74 Jonas Valanciunas Toronto Raptors 17.0 C 24.0 7-0 255.0 NaN 4660482.0
75 Delon Wright Toronto Raptors 55.0 PG 24.0 6-5 190.0 Utah 1509360.0
76 Leandro Barbosa Golden State Warriors 19.0 SG 33.0 6-3 194.0 NaN 2500000.0
77 Harrison Barnes Golden State Warriors 40.0 SF 24.0 6-8 225.0 North Carolina 3873398.0
78 Andrew Bogut Golden State Warriors 12.0 C 31.0 7-0 260.0 Utah 13800000.0
79 Ian Clark Golden State Warriors 21.0 SG 25.0 6-3 175.0 Belmont 947276.0
80 Stephen Curry Golden State Warriors 30.0 PG 28.0 6-3 190.0 Davidson 11370786.0
81 Festus Ezeli Golden State Warriors 31.0 C 26.0 6-11 265.0 Vanderbilt 2008748.0
82 Draymond Green Golden State Warriors 23.0 PF 26.0 6-7 230.0 Michigan State 14260870.0
83 Andre Iguodala Golden State Warriors 9.0 SF 32.0 6-6 215.0 Arizona 11710456.0
84 Shaun Livingston Golden State Warriors 34.0 PG 30.0 6-7 192.0 NaN 5543725.0
85 Kevon Looney Golden State Warriors 36.0 SF 20.0 6-9 220.0 UCLA 1131960.0
86 James Michael McAdoo Golden State Warriors 20.0 SF 23.0 6-9 240.0 North Carolina 845059.0
87 Brandon Rush Golden State Warriors 4.0 SF 30.0 6-6 220.0 Kansas 1270964.0
88 Marreese Speights Golden State Warriors 5.0 C 28.0 6-10 255.0 Florida 3815000.0
89 Klay Thompson Golden State Warriors 11.0 SG 26.0 6-7 215.0 Washington State 15501000.0
90 Anderson Varejao Golden State Warriors 18.0 PF 33.0 6-11 273.0 NaN 289755.0
91 Cole Aldrich Los Angeles Clippers 45.0 C 27.0 6-11 250.0 Kansas 1100602.0
92 Jeff Ayres Los Angeles Clippers 19.0 PF 29.0 6-9 250.0 Arizona State 111444.0
93 Jamal Crawford Los Angeles Clippers 11.0 SG 36.0 6-5 195.0 Michigan 5675000.0
94 Branden Dawson Los Angeles Clippers 22.0 SF 23.0 6-6 225.0 Michigan State 525093.0
95 Jeff Green Los Angeles Clippers 8.0 SF 29.0 6-9 235.0 Georgetown 9650000.0
96 Blake Griffin Los Angeles Clippers 32.0 PF 27.0 6-10 251.0 Oklahoma 18907726.0
97 Wesley Johnson Los Angeles Clippers 33.0 SF 28.0 6-7 215.0 Syracuse 1100602.0
(remaining rows omitted)
import pandas as pd
df = pd.read_csv('E:\\Edge浏览器文件\\nba.csv')
print(df)
Output:
Name Team ... College Salary
0 Avery Bradley Boston Celtics ... Texas 7730337.0
1 Jae Crowder Boston Celtics ... Marquette 6796117.0
2 John Holland Boston Celtics ... Boston University NaN
3 R.J. Hunter Boston Celtics ... Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics ... NaN 5000000.0
.. ... ... ... ... ...
453 Shelvin Mack Utah Jazz ... Butler 2433333.0
454 Raul Neto Utah Jazz ... NaN 900000.0
455 Tibor Pleiss Utah Jazz ... NaN 2900000.0
456 Jeff Withey Utah Jazz ... Kansas 947276.0
457 NaN NaN ... NaN NaN
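The difference between the two printing styles can be reproduced without nba.csv. The sketch below builds a hypothetical 100-row table and compares the truncated default rendering with to_string(); the display options are pinned to the usual pandas defaults (an assumption about your environment) so the result does not depend on local settings.

```python
import pandas as pd

# Hypothetical 100-row table, just large enough to trigger truncation.
df = pd.DataFrame({"id": range(100), "square": [i * i for i in range(100)]})

# Pin the display options so the result does not depend on local settings.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    short = repr(df)   # what print(df) shows: first rows, "..", last rows

full = df.to_string()  # every row, regardless of the display options

print(short)
print(len(full.splitlines()))  # one header line plus 100 data rows
```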
We can also use the to_csv() method to save a DataFrame as a CSV file:
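import pandas as pd
# Three fields: name, site, age
names = ["Google", "Runoob", "Taobao", "Wiki"]
sites = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ages = [90, 40, 80, 98]
# Map each column name to its values
# (named data rather than dict, to avoid shadowing the built-in type)
data = {'name': names, 'site': sites, 'age': ages}
df = pd.DataFrame(data)
# Save the DataFrame to site.csv in the current working directory
df.to_csv('site.csv')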
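After the script runs successfully, opening site.csv shows the saved rows. Because to_csv() writes the row index by default, the first column of the file is an unnamed index column.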
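To see what to_csv() writes, and how to leave out the index column, here is a small sketch. It uses a shortened version of the table and returns the CSV text directly instead of writing a file, which is what to_csv() does when no path is given.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob"],
    "site": ["www.google.com", "www.runoob.com"],
    "age": [90, 40],
})

# With no path argument, to_csv() returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index field
print(without_index.splitlines()[0])  # header holds only the column names

# Reading the index-free text back gives the same rows and columns.
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

Passing index=False is usually what you want when the index is just the default 0, 1, 2, ... numbering.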
2. Data Processing
2.1 head()
The head(n) method returns the first n rows. If n is omitted, it returns 5 rows by default.
Read the first 5 rows:
import pandas as pd
df = pd.read_csv('E:\\Edge浏览器文件\\nba.csv')
print(df.head())
Output:
Name Team Number ... Weight College Salary
0 Avery Bradley Boston Celtics 0.0 ... 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 ... 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 ... 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 ... 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 ... 231.0 NaN 5000000.0
[5 rows x 9 columns]
Read the first 10 rows:
import pandas as pd
df = pd.read_csv('E:\\Edge浏览器文件\\nba.csv')
print(df.head(10))
Output:
Name Team Number ... Weight College Salary
0 Avery Bradley Boston Celtics 0.0 ... 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 ... 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 ... 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 ... 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 ... 231.0 NaN 5000000.0
5 Amir Johnson Boston Celtics 90.0 ... 240.0 NaN 12000000.0
6 Jordan Mickey Boston Celtics 55.0 ... 235.0 LSU 1170960.0
7 Kelly Olynyk Boston Celtics 41.0 ... 238.0 Gonzaga 2165160.0
8 Terry Rozier Boston Celtics 12.0 ... 190.0 Louisville 1824360.0
9 Marcus Smart Boston Celtics 36.0 ... 220.0 Oklahoma State 3431040.0
[10 rows x 9 columns]
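The behaviour of head() can also be checked on a small table. The sketch below uses a hypothetical 8-row DataFrame standing in for nba.csv, and additionally shows that a negative n returns everything except the last |n| rows.

```python
import pandas as pd

# Hypothetical 8-row table standing in for nba.csv.
df = pd.DataFrame({"player": [f"P{i}" for i in range(8)], "number": range(8)})

first_default = df.head()       # n omitted: first 5 rows
first_three = df.head(3)        # first 3 rows
all_but_last_two = df.head(-2)  # negative n: all rows except the last 2

print(first_three)
print(len(first_default), len(all_but_last_two))
```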
2.2 tail()
The tail(n) method returns the last n rows. If n is omitted, it returns 5 rows by default. Every field of an empty row is shown as NaN.
Read the last 5 rows:
import pandas as pd
df = pd.read_csv('E:\\Edge浏览器文件\\nba.csv')
print(df.tail())
Output:
Name Team Number Position ... Height Weight College Salary
453 Shelvin Mack Utah Jazz 8.0 PG ... 6-3 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 PG ... 6-1 179.0 NaN 900000.0
455 Tibor Pleiss Utah Jazz 21.0 C ... 7-3 256.0 NaN 2900000.0
456 Jeff Withey Utah Jazz 24.0 C ... 7-0 231.0 Kansas 947276.0
457 NaN NaN NaN NaN ... NaN NaN NaN NaN
[5 rows x 9 columns]
Read the last 10 rows:
import pandas as pd
df = pd.read_csv('E:\\Edge浏览器文件\\nba.csv')
print(df.tail(10))
Output:
Name Team Number ... Weight College Salary
448 Gordon Hayward Utah Jazz 20.0 ... 226.0 Butler 15409570.0
449 Rodney Hood Utah Jazz 5.0 ... 206.0 Duke 1348440.0
450 Joe Ingles Utah Jazz 2.0 ... 226.0 NaN 2050000.0
451 Chris Johnson Utah Jazz 23.0 ... 206.0 Dayton 981348.0
452 Trey Lyles Utah Jazz 41.0 ... 234.0 Kentucky 2239800.0
453 Shelvin Mack Utah Jazz 8.0 ... 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 ... 179.0 NaN 900000.0
455 Tibor Pleiss Utah Jazz 21.0 ... 256.0 NaN 2900000.0
456 Jeff Withey Utah Jazz 24.0 ... 231.0 Kansas 947276.0
457 NaN NaN NaN ... NaN NaN NaN
[10 rows x 9 columns]
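The all-NaN row at the end of the output comes from a record whose fields are all empty. One possible way to handle it, sketched below on hypothetical sample text rather than nba.csv, is to drop such rows with dropna(how="all") before calling tail().

```python
import io
import pandas as pd

# Hypothetical CSV text whose last record contains only empty fields,
# mimicking the all-NaN row at the end of the output above.
text = "Name,Team,Salary\nAnn,Red,100\nBob,Blue,200\n,,\n"
df = pd.read_csv(io.StringIO(text))

last = df.tail(1)               # the record in which every field is NaN
cleaned = df.dropna(how="all")  # drop rows where every field is missing

print(last)
print(cleaned.tail(1))          # now the last real record
```

Rows that are only partly empty are kept, because how="all" removes a row only when every field is missing.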
2.3 info()
The info() method prints a summary of the DataFrame:
import pandas as pd
df = pd.read_csv('E:\\Edge浏览器文件\\nba.csv')
print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 457 non-null object
1 Team 457 non-null object
2 Number 457 non-null float64
3 Position 457 non-null object
4 Age 457 non-null float64
5 Height 457 non-null object
6 Weight 457 non-null float64
7 College 373 non-null object
8 Salary 446 non-null float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB
None
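The non-null count is the number of values that are not missing. From the output above, the table has 458 rows in total, and the College column has the most missing values (373 non-null, so 85 missing). The trailing None appears because info() prints its report directly and returns None, which print() then displays.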
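Missing values can also be counted directly instead of being read off the info() report. The sketch below uses a hypothetical 4-row table (not nba.csv) and captures the report through the buf parameter of info(), which shows that the method itself returns None.

```python
import io
import pandas as pd

# Hypothetical table with some missing values.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", None],
    "College": ["Duke", None, None, None],
    "Salary": [100.0, 200.0, None, None],
})

# info() writes its report and returns None; pass buf= to capture the text.
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()

# Missing values per column, and the column with the most of them.
missing = df.isna().sum()
worst = missing.idxmax()

print(report)
print(missing)
print(worst)
```

Calling df.info() on its own line, without print(), avoids the extra None in the output.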