[python skill] A Brief Analysis of the DataFrame Data Structure

刀尔東

Tags: python

Article link: https://blog.csdn.net/weixin_38760323/article/details/81540045

I recently went back over the DataFrame data structure in pandas; here is a short summary:

import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()

Output:

	Cost	Item Purchased	Name
Store 1	22.5	Dog Food	Chris
Store 1	2.5	Kitty Litter	Kevyn
Store 2	5.0	Bird Seed	Vinod

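A DataFrame can be thought of as a higher-dimensional Series: each row is a Series (although, as the type check below shows, selecting a row label that occurs more than once returns a DataFrame). The DataFrame's index is therefore used to look up those row Series, while the index of each original Series becomes the columns, i.e. the column names.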

costs = df['Cost']
store = df.loc['Store 1']
print(type(costs))
print(type(store))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

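Rows are looked up with the same syntax a Series uses for its own elements. (A row lookup cannot be shortened to df['Store 2'] or df[2], because a single key inside df[...] is interpreted as a column name. df[0:2], the first two rows, does work, because a slice inside df[...] is interpreted as a row slice.) So the row lookup is still: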

df.loc['Store 2']

Cost                      5
Item Purchased    Bird Seed
Name                  Vinod
Name: Store 2, dtype: object

Note that loc and iloc are attributes of the DataFrame, not methods, so they take square brackets [ ]. The row retrieved for Store 2 has the type

pandas.core.series.Series

that is, a Series.
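The lookup rules above can be checked with a small sketch. The frame is a cut-down version of the purchase table, and names such as `first_two` and `label_lookup_failed` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'], 'Cost': [22.5, 2.5, 5.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

# A single key inside df[...] is looked up among the columns, so a row label fails.
try:
    df['Store 2']
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

first_two = df[0:2]            # a slice inside df[...] selects rows by position
by_label = df.loc['Store 2']   # unique label -> Series
by_position = df.iloc[2]       # third row by position -> Series
repeated = df.loc['Store 1']   # repeated label -> DataFrame with two rows

print(label_lookup_failed)
print(first_two['Name'].tolist())
print(type(by_label).__name__, type(repeated).__name__)
```

Using .loc and .iloc explicitly avoids having to remember which interpretation df[...] picks for a given key.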

Because there is now an extra dimension, a DataFrame also supports two-dimensional lookups (.loc is required), for example:

df.loc['Store 1', 'Cost']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64
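As a rough illustration of the two-dimensional form, the sketch below shows that the shape of the result depends on what is passed for the row and column parts. The variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.5, 2.5, 5.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

costs_store1 = df.loc['Store 1', 'Cost']   # repeated row label + one column -> Series
scalar_cost = df.loc['Store 2', 'Cost']    # unique row label + one column   -> scalar
subset = df.loc[:, ['Name', 'Cost']]       # all rows + list of columns      -> DataFrame

print(costs_store1.tolist())
print(scalar_cost)
print(subset.shape)
```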

You can likewise select along the other axis, by column:

df['Cost']

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64

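The .drop() method removes rows by label (columns can be removed as well, by passing axis=1 or columns=...). It returns a new DataFrame rather than working in place, so the original DataFrame is not changed: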

df.drop('Store 1')

	Cost	Item Purchased	Name
Store 2	5.0	Bird Seed	Vinod

df

	Cost	Item Purchased	Name
Store 1	22.5	Dog Food	Chris
Store 1	2.5	Kitty Litter	Kevyn
Store 2	5.0	Bird Seed	Vinod

# To apply the change to the original DataFrame, assign the result back:

df = df.drop('Store 1')
df
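A minimal sketch of the copy semantics of drop, using a cut-down version of the same table. The names `without_store1` and `without_name` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'], 'Cost': [22.5, 2.5, 5.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

without_store1 = df.drop('Store 1')        # drops both 'Store 1' rows; df is untouched
without_name = df.drop(columns=['Name'])   # drops a column; also returns a new frame

print(df.shape)                            # the original keeps all rows and columns
print(without_store1.index.tolist())
print(without_name.columns.tolist())
```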

To delete a column, you can use del:

del df['Name']
df

	Cost	Item Purchased
Store 1	22.5	Dog Food
Store 1	2.5	Kitty Litter
Store 2	5.0	Bird Seed

Adding a column directly:

df['Location'] = None
df

	Cost	Item Purchased	Name	Location
Store 1	22.5	Dog Food	Chris	None
Store 1	2.5	Kitty Litter	Kevyn	None
Store 2	5.0	Bird Seed	Vinod	None
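For reference, a short sketch of adding and deleting columns in place. The 'Quantity' column and its values are hypothetical and not part of the original data.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cost': [22.5, 2.5, 5.0], 'Name': ['Chris', 'Kevyn', 'Vinod']},
    index=['Store 1', 'Store 1', 'Store 2'],
)

df['Location'] = None        # a scalar is broadcast to every row
df['Quantity'] = [1, 3, 2]   # a list is assigned by position; its length must match
del df['Name']               # removes the column in place

print(df.columns.tolist())
```

Unlike drop, both column assignment and del modify the DataFrame itself.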

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Some notes on importing CSV files with pandas:

pandas.read_csv

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)

ref：https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Parameters:

filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)

The string can be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.

Example of reading a local file: ://localhost/path/to/table.csv

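sep : str, default ','

The delimiter to use; a comma if not specified. A separator longer than one character and different from '\s+' is interpreted as a regular expression and forces the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'

delimiter : str, default None

Alternative argument name for sep (if given, it is used in place of sep).

delim_whitespace : boolean, default False

Specifies whether whitespace (e.g. ' ' or '\t') is used as the separator; equivalent to setting sep='\s+'. If this is set to True, nothing should be passed for the delimiter parameter.

New in version 0.18.1.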
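To make the separator rules concrete, here is a small sketch that reads in-memory text through io.StringIO. The sample strings are invented for illustration.

```python
import io
import pandas as pd

# Whitespace-separated data: sep=r'\s+' treats any run of spaces or tabs as one delimiter.
raw_ws = "a  b   c\n1 2  3\n4   5 6\n"
df_ws = pd.read_csv(io.StringIO(raw_ws), sep=r'\s+')

# Any other multi-character regex separator requires the Python engine.
raw_semi = "a ; b\n1 ; 2\n"
df_semi = pd.read_csv(io.StringIO(raw_semi), sep=r'\s*;\s*', engine='python')

print(df_ws.values.tolist())
print(df_semi.columns.tolist())
```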

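header : int or list of ints, default 'infer'

Row number(s) to use as the column names and as the start of the data. (header says which row holds the column names: pass 0 if they are on row 0, and data reading starts after that row; pass None if the file has no header row.)

Note: with skip_blank_lines=True, header ignores commented lines and blank lines, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None

List of column names to use for the result. If the file has no header row, explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True.

index_col : int or sequence or False, default None

(At first this looked much like header to me, and in one experiment both seemed to work for setting the labels, yet the header version raised an error in later code. The two are in fact different: header picks the row that supplies the column names, while index_col picks the column(s) that supply the row labels.)

Column number or column name to use as the row labels. If a sequence is given, a MultiIndex is used.

If the file is malformed, with a delimiter at the end of each line, index_col=False forces pandas not to use the first column as the index.

usecols : array-like, default None

Return a subset of the columns. Every element must be either positional (an integer index into the file's columns) or a string matching a column name in the file. For example, a valid usecols argument would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter gives faster parsing and lower memory use.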
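The difference between header, names, index_col and usecols is easiest to see side by side. This is only a sketch on made-up sample text; the column names id, city, temp and unit are assumptions for the example.

```python
import io
import pandas as pd

raw = "id,city,temp,unit\n1,Oslo,3.5,C\n2,Lima,19.0,C\n"

# header=0 takes the column names from the first line;
# index_col picks the column that becomes the row labels.
by_id = pd.read_csv(io.StringIO(raw), header=0, index_col='id')

# usecols keeps only the listed columns.
slim = pd.read_csv(io.StringIO(raw), usecols=['city', 'temp'])

# For a file without a header line, pass header=None and supply names.
no_header = pd.read_csv(io.StringIO("1,Oslo\n2,Lima\n"),
                        header=None, names=['id', 'city'])

print(by_id.index.tolist(), by_id.columns.tolist())
print(slim.columns.tolist())
print(no_header['city'].tolist())
```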

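as_recarray : boolean, default False

Deprecated: this argument will be removed in a future version. Use pd.read_csv(...).to_records() instead.

Return a NumPy recarray instead of a DataFrame. If set to True, this option takes precedence over squeeze. In addition, the row indices are no longer available and the index columns are ignored.

squeeze : boolean, default False

If the parsed data contains only one column, return a Series.

prefix : str, default None

Prefix to add to column numbers when there is no header, e.g. 'X' for X0, X1, ...

mangle_dupe_cols : boolean, default True

Duplicate columns are renamed 'X', 'X.1', ... 'X.N' rather than all being called 'X'. Passing False causes data to be overwritten when column names are duplicated.

dtype : Type name or dict of column -> type, default None

Data type for the data or for individual columns, e.g. {'a': np.float64, 'b': np.int32}

engine : {'c', 'python'}, optional

Parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.

converters : dict, default None

Dict of functions for converting values in certain columns. Keys can be column names or column positions.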
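As a hedged example of dtype and converters, the sketch below reads the same invented text three ways. The column names code and price are assumptions for illustration.

```python
import io
import pandas as pd

raw = "code,price\n007,1.50\n042,2.25\n"

inferred = pd.read_csv(io.StringIO(raw))                       # code is inferred as integer
typed = pd.read_csv(io.StringIO(raw), dtype={'code': str})     # keep leading zeros as text
converted = pd.read_csv(
    io.StringIO(raw),
    converters={'price': lambda s: round(float(s) * 100)},    # raw string -> integer cents
)

print(inferred['code'].tolist())
print(typed['code'].tolist())
print(converted['price'].tolist())
```

A converter receives the raw string of each cell, so it is the more flexible of the two, while dtype is the simpler choice when only the type needs pinning down.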

true_values : list, default None

Values to consider as True

false_values : list, default None

Values to consider as False

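skipinitialspace : boolean, default False

Skip spaces after the delimiter (default False, i.e. they are kept).

skiprows : list-like or integer, default None

Number of lines to skip at the start of the file (int), or a list of line numbers to skip (0-indexed).

skipfooter : int, default 0

Number of lines at the bottom of the file to skip (unsupported with engine='c').

skip_footer : int, default 0

Deprecated: use skipfooter instead; it does the same thing.

nrows : int, default None

Number of rows of the file to read (counted from the start).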
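A quick sketch of the row-trimming options on invented sample text; the variable names are illustrative only.

```python
import io
import pandas as pd

raw = "exported by some tool\na,b\n1,2\n3,4\n5,6\n"

# Skip the first line of the file, then read only two data rows.
head = pd.read_csv(io.StringIO(raw), skiprows=1, nrows=2)

# Skip specific 0-indexed lines: the junk line and the first data line.
picked = pd.read_csv(io.StringIO(raw), skiprows=[0, 2])

# skipfooter needs the Python engine.
no_tail = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=1, engine='python')

print(head.values.tolist())
print(picked.values.tolist())
print(no_tail.values.tolist())
```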

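na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If a dict is passed, it gives per-column NA values. The values treated as NaN by default include '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.

keep_default_na : bool, default True

If na_values is specified and keep_default_na=False, the default NaN values are overridden; otherwise the given values are appended to them.

na_filter : boolean, default True

Detect missing-value markers (empty strings and the values in na_values). For large files with no missing data, na_filter=False can speed up reading.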
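The interaction between na_values and keep_default_na can be sketched as follows. The marker 'missing' and the sample rows are invented for the example.

```python
import io
import pandas as pd

raw = "name,score\nann,10\nbob,missing\ncat,NA\n"

default = pd.read_csv(io.StringIO(raw))                          # only 'NA' becomes NaN
custom = pd.read_csv(io.StringIO(raw), na_values=['missing'])    # 'missing' and 'NA' become NaN
strict = pd.read_csv(io.StringIO(raw), na_values=['missing'],
                     keep_default_na=False)                      # only 'missing' becomes NaN

print(default['score'].isna().sum())
print(custom['score'].isna().sum())
print(strict['score'].tolist())
```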

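verbose : boolean, default False

Print assorted parser output, e.g. the number of NA values placed in non-numeric columns.

skip_blank_lines : boolean, default True

If True, skip over blank lines rather than interpreting them as NaN values.

parse_dates : boolean or list of ints or names or list of lists or dict, default False

boolean. True -> try parsing the index
list of ints or names. e.g. [1, 2, 3] -> try parsing columns 1, 2 and 3 each as a separate date column
list of lists. e.g. [[1, 3]] -> combine columns 1 and 3 and parse them as a single date column
dict, e.g. {'foo' : [1, 3]} -> parse columns 1 and 3 as a date and call the result 'foo'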
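A minimal sketch of the two simplest parse_dates forms (a list of column names, and True for the index). The column names and dates are made up, and the column-combining forms are left out here.

```python
import io
import pandas as pd

raw = "when,value\n2018-08-09,1\n2018-08-10,2\n"

plain = pd.read_csv(io.StringIO(raw))                          # 'when' stays text
dated = pd.read_csv(io.StringIO(raw), parse_dates=['when'])    # parse one named column
indexed = pd.read_csv(io.StringIO(raw), index_col='when',
                      parse_dates=True)                        # parse the index

print(plain['when'].iloc[0])
print(dated['when'].dtype)
print(type(indexed.index).__name__)
```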

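infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas attempts to infer the format of the datetime strings in the columns and, if it can be inferred, switches to a faster parsing method. In some cases this speeds up parsing by 5-10x.

keep_date_col : boolean, default False

If True and parse_dates combines multiple columns, keep the original columns as well.

date_parser : function, default None

Function used to convert a sequence of string columns to datetime values; the default is dateutil.parser.parser. pandas tries to call date_parser in three different ways, moving on to the next if an exception occurs:

1. Pass one or more arrays (as defined by parse_dates) as arguments;

2. Concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that;

3. Call date_parser once per row, with one or more strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst : boolean, default False

DD/MM format dates (international and European format).

iterator : boolean, default False

Return a TextFileReader object for iteration or for getting chunks with get_chunk().

chunksize : int, default None

Number of rows per chunk; when set, a TextFileReader is returned for iteration. See the IO Tools docs for more information on iterator and chunksize.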
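Chunked reading could look roughly like this. The ten-row input is generated in memory purely for illustration, and `sizes` and `total` are hypothetical names.

```python
import io
import pandas as pd

raw = "x\n" + "\n".join(str(i) for i in range(10)) + "\n"

sizes = []
total = 0
# With chunksize set, read_csv returns an iterator of DataFrames instead of one DataFrame.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    sizes.append(len(chunk))
    total += int(chunk['x'].sum())

print(sizes)
print(total)
```

Each chunk is an ordinary DataFrame, so per-chunk aggregation like the running sum above keeps memory use bounded by the chunk size.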

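compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'

For on-the-fly decompression of on-disk data. With 'infer', gzip, bz2, zip or xz is used if filepath_or_buffer is a string ending in '.gz', '.bz2', '.zip' or '.xz' respectively, and no decompression otherwise. With 'zip', the ZIP file must contain only one data file. Set to None for no decompression.

New in version 0.18.1: support for 'zip' and 'xz' compression.

thousands : str, default None

Thousands separator, e.g. ',' or '.'

decimal : str, default '.'

Character to recognize as the decimal point (e.g. ',' for European data).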
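For European-style numbers, thousands and decimal are typically set together. The sketch below uses invented prices and a semicolon separator as assumptions.

```python
import io
import pandas as pd

raw = "item;price\nA;1.234,50\nB;99,90\n"

# '.' groups thousands and ',' marks the decimal point in this sample.
df_eu = pd.read_csv(io.StringIO(raw), sep=';', thousands='.', decimal=',')

print(df_eu['price'].tolist())
```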

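float_precision : string, default None

Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.

lineterminator : str (length 1), default None

Character used to break the file into lines. Only valid with the C parser.

quotechar : str (length 1), optional

The character used to denote the start and end of a quoted item. Delimiters inside a quoted item are ignored.

quoting : int or csv.QUOTE_* instance, default 0

Controls field quoting behavior per the csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

doublequote : boolean, default True

When quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field are interpreted as a single quotechar character.

escapechar : str (length 1), default None

One-character string used to escape the delimiter when quoting is QUOTE_NONE.

comment : str, default None

Indicates that the remainder of a line should not be parsed. If found at the beginning of a line, the line is ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by header but not by skiprows. For example, with comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 results in 'a,b,c' being treated as the header.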
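The example string from the comment description can be run directly; this sketch simply feeds it through io.StringIO.

```python
import io
import pandas as pd

raw = "#empty\na,b,c\n1,2,3"

# The fully commented first line is ignored, so header=0 lands on 'a,b,c'.
df_c = pd.read_csv(io.StringIO(raw), comment='#', header=0)

print(df_c.columns.tolist())
print(df_c.values.tolist())
```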

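encoding : str, default None

Encoding to use when reading, typically 'utf-8'. See the list of Python standard encodings.

dialect : str or csv.Dialect instance, default None

If None, defaults to the Excel dialect. Ignored if sep is longer than one character. See the csv.Dialect documentation for details.

tupleize_cols : boolean, default False

Leave a list of tuples on columns as is (default is to convert to a MultiIndex on the columns).

error_bad_lines : boolean, default True

Lines with too many fields (e.g. a csv line with too many commas) by default cause an exception to be raised, and no DataFrame is returned. If False, these "bad lines" are dropped from the DataFrame that is returned (only valid with the C parser).

warn_bad_lines : boolean, default True

If error_bad_lines=False and warn_bad_lines=True, a warning is output for each "bad line" (only valid with the C parser).

low_memory : boolean, default True

Internally process the file in chunks, which lowers memory use while parsing but can lead to mixed type inference. To ensure no mixed types, either set this to False or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks (only valid with the C parser).

buffer_lines : int, default None

Deprecated: this argument will be removed in a future version because its value is not respected by the parser.

compact_ints : boolean, default False

Deprecated: this argument will be removed in a future version.

If compact_ints=True, any column that is of integer dtype is stored as the smallest integer dtype possible, signed or unsigned depending on the use_unsigned parameter.

use_unsigned : boolean, default False

Deprecated: this argument will be removed in a future version.

If integer columns are being compacted (i.e. compact_ints=True), specifies whether the column should be compacted to the smallest signed or unsigned integer dtype.

memory_map : boolean, default False

If a file path is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

ref: https://www.cnblogs.com/datablog/p/6127000.html

Thanks~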

Listing the column names:

df.columns

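As mentioned earlier, each row of a DataFrame is a Series (or a DataFrame when the row label repeats), and each column is a Series as well, so many Series features can be used to query a DataFrame, for example: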

df['Gold'] > 0

Afghanistan (AFG)                               False
Algeria (ALG)                                    True
Argentina (ARG)                                  True
Armenia (ARM)                                    True
Australasia (ANZ) [ANZ]                          True
Australia (AUS) [AUS] [Z]                        True
Austria (AUT)                                    True
Azerbaijan (AZE)                                 True
Bahamas (BAH)                                    True
Bahrain (BRN)                                   False
Barbados (BAR) [BAR]                            False
Belarus (BLR)                                    True
Belgium (BEL)                                    True

This returns a Boolean Series, which can then be used to filter the DataFrame:

only_gold = df.where(df['Gold'] > 0)
only_gold.head(100)

	# Summer	Gold	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined total
Afghanistan (AFG)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Algeria (ALG)	12.0	5.0	2.0	8.0	15.0	3.0	0.0	0.0	0.0	0.0	15.0	5.0	2.0	8.0	15.0
Argentina (ARG)	23.0	18.0	24.0	28.0	70.0	18.0	0.0	0.0	0.0	0.0	41.0	18.0	24.0	28.0	70.0
Armenia (ARM)	5.0	1.0	2.0	9.0	12.0	6.0	0.0	0.0	0.0	0.0	11.0	1.0	2.0	9.0	12.0
Australasia (ANZ) [ANZ]	2.0	3.0	4.0	5.0	12.0	0.0	0.0	0.0	0.0	0.0	2.0	3.0	4.0	5.0	12.0
Australia (AUS) [AUS] [Z]	25.0	139.0	152.0	177.0	468.0	18.0	5.0	3.0	4.0	12.0	43.0	144.0	155.0	181.0	480.0
Austria (AUT)	26.0	18.0	33.0	35.0	86.0	22.0	59.0	78.0	81.0	218.0	48.0	77.0	111.0	116.0	304.0
Azerbaijan (AZE)	5.0	6.0	5.0	15.0	26.0	5.0	0.0	0.0	0.0	0.0	10.0	6.0	5.0	15.0	26.0
Bahamas (BAH)	15.0	5.0	2.0	5.0	12.0	0.0	0.0	0.0	0.0	0.0	15.0	5.0	2.0	5.0	12.0
Bahrain (BRN)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Barbados (BAR) [BAR]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Belarus (BLR)	5.0	12.0	24.0	39.0	75.0	6.0	6.0	4.0	5.0	15.0	11.0	18.0	28.0	44.0	90.0
Belgium (BEL)	25.0	37.0	52.0	53.0	142.0	20.0	1.0	1.0	3.0	5.0	45.0	38.0	53.0	56.0	147.0
Bermuda (BER)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Bohemia (BOH) [BOH] [Z]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Combined with methods such as dropna, this completes the filtering, for example:

only_gold = only_gold.dropna()
only_gold.head()

	# Summer	Gold	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined total
Algeria (ALG)	12.0	5.0	2.0	8.0	15.0	3.0	0.0	0.0	0.0	0.0	15.0	5.0	2.0	8.0	15.0
Argentina (ARG)	23.0	18.0	24.0	28.0	70.0	18.0	0.0	0.0	0.0	0.0	41.0	18.0	24.0	28.0	70.0
Armenia (ARM)	5.0	1.0	2.0	9.0	12.0	6.0	0.0	0.0	0.0	0.0	11.0	1.0	2.0	9.0	12.0
Australasia (ANZ) [ANZ]	2.0	3.0	4.0	5.0	12.0	0.0	0.0	0.0	0.0	0.0	2.0	3.0	4.0	5.0	12.0
Australia (AUS) [AUS] [Z]	25.0	139.0	152.0	177.0	468.0	18.0	5.0	3.0	4.0	12.0	43.0	144.0	155.0	181.0	480.0

#only_gold['Gold'].count()

#100

#.count() counts the non-NaN entries; .sum() adds up the values of all entries
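The where / dropna pipeline can be compared with plain Boolean indexing in a small sketch. The three-row medal table below uses invented counts and is only a stand-in for the real data.

```python
import pandas as pd

medals = pd.DataFrame(
    {'Gold': [0, 5, 23], 'Silver': [0, 2, 24]},   # made-up counts for illustration
    index=['Afghanistan (AFG)', 'Algeria (ALG)', 'Argentina (ARG)'],
)

mask = medals['Gold'] > 0        # Boolean Series aligned with the rows
masked = medals.where(mask)      # same shape; rows failing the test become NaN
only_gold = masked.dropna()      # drop the NaN rows
direct = medals[mask]            # Boolean indexing does both steps at once

print(masked['Gold'].count(), masked['Gold'].sum())
print(only_gold.index.tolist() == direct.index.tolist())
```

One difference worth knowing: where introduces NaN, so integer columns come back as floats, whereas Boolean indexing keeps the original dtypes.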

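A DataFrame's index is set with .set_index() (the index is the Store 1 / Store 2 labels from the first example, in effect the name of each sample). A single column can serve as the index, and so can several columns, for example:

#df['country'] = df.index  # method 1

#df = df.set_index('Gold')  # method 2

df = df.set_index(['STNAME', 'CTYNAME'])  # two columns as the index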
df.head()

		BIRTHS2010	BIRTHS2011	BIRTHS2012	BIRTHS2013	BIRTHS2014	BIRTHS2015	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
STNAME	CTYNAME
Alabama	Autauga County	151	636	615	574	623	600	54660	55253	55175	55038	55290	55347
	Baldwin County	517	2187	2092	2160	2186	2240	183193	186659	190396	195126	199713	203709
	Barbour County	70	335	300	283	260	269	27341	27226	27159	26973	26815	26489
	Bibb County	44	266	245	259	247	253	22861	22733	22642	22512	22549	22583
	Blount County	183	744	710	646	618	603	57373	57711	57776	57734	57658	57673

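Here CTYNAME acts as a second-level key that classifies the data more finely. You can look up by the first-level index with df.loc['Alabama'], or by both levels together with df.loc['Alabama', 'Autauga County'] or df.loc[ [('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')] ], but you cannot look up by the second level alone: df.loc['Autauga County'] fails.

.reset_index() undoes this: it moves the previously set index back into the columns and uses the integers 0 to n-1 as the index instead.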
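A compact sketch of the multi-level lookups and the set_index / reset_index round trip. The Alabama figures come from the table above; the Michigan values are placeholders invented for the example. The xs() call is an addition of mine, shown as one way to select on the inner level.

```python
import pandas as pd

df = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Michigan', 'Michigan'],
    'CTYNAME': ['Autauga County', 'Baldwin County',
                'Washtenaw County', 'Wayne County'],
    'BIRTHS2010': [151, 517, 10, 20],   # Michigan values are placeholders
})

indexed = df.set_index(['STNAME', 'CTYNAME'])

state = indexed.loc['Alabama']                       # first level only -> DataFrame
one = indexed.loc[('Alabama', 'Autauga County')]     # both levels -> Series (one row)
several = indexed.loc[[('Michigan', 'Washtenaw County'),
                       ('Michigan', 'Wayne County')]]

# The second level on its own is not a valid .loc key ...
try:
    indexed.loc['Autauga County']
    second_level_ok = True
except KeyError:
    second_level_ok = False

# ... but xs() can select on a named inner level.
county = indexed.xs('Autauga County', level='CTYNAME')

restored = indexed.reset_index()   # index levels go back to ordinary columns

print(state.index.tolist())
print(one['BIRTHS2010'], second_level_ok)
print(restored.columns.tolist())
```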

Listing the distinct values in a Series:

df['SUMLEV'].unique()

array([40, 50])
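For comparison, unique() lists the distinct values, nunique() counts them, and value_counts() tallies each one. The Series below is a made-up stand-in for the SUMLEV column.

```python
import pandas as pd

sumlev = pd.Series([40, 50, 50, 50, 40, 50], name='SUMLEV')

distinct = sumlev.unique()        # distinct values, in order of first appearance
n_distinct = sumlev.nunique()     # how many distinct values there are
tally = sumlev.value_counts()     # occurrences of each value

print(list(distinct))
print(n_distinct)
print(tally.loc[50], tally.loc[40])
```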

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Handling missing values

		video	playback position	paused	volume
time	user
1469974424	cheryl	intro.html	5	False	10.0
1469974424	sue	advanced.html	23	False	10.0
1469974454	cheryl	intro.html	6	NaN	NaN
1469974454	sue	advanced.html	24	NaN	NaN
1469974484	cheryl	intro.html	7	NaN	NaN
1469974514	cheryl	intro.html	8	NaN	NaN
1469974524	sue	advanced.html	25	NaN	NaN
1469974544	cheryl	intro.html	9	NaN	NaN
1469974554	sue	advanced.html	26	NaN	NaN
1469974574	cheryl	intro.html	10	NaN	NaN
1469974604	cheryl	intro.html	11	NaN	NaN
1469974624	sue	advanced.html	27	NaN	NaN
1469974634	cheryl	intro.html	12	NaN	NaN
1469974654	sue	advanced.html	28	NaN	5.0
1469974664	cheryl	intro.html	13	NaN	NaN
1469974694	cheryl	intro.html	14	NaN	NaN
1469974724	cheryl	intro.html	15	NaN	NaN
1469974724	sue	advanced.html	29	NaN	NaN
1469974754	sue	advanced.html	30	NaN	NaN
1469974824	sue	advanced.html	31	NaN	NaN
1469974854	sue	advanced.html	32	NaN	NaN
1469974924	sue	advanced.html	33	NaN	NaN
1469977424	bob	intro.html	1	True	10.0
1469977454	bob	intro.html	1	NaN	NaN
1469977484	bob	intro.html	1	NaN	NaN
1469977514	bob	intro.html	1	NaN	NaN
1469977544	bob	intro.html	1	NaN	NaN
1469977574	bob	intro.html	1	NaN	NaN
1469977604	bob	intro.html	1	NaN	NaN
1469977634	bob	intro.html	1	NaN	NaN
1469977664	bob	intro.html	1	NaN	NaN
1469977694	bob	intro.html	1	NaN	NaN
1469977724	bob	intro.html	1	NaN	NaN

The data above has missing values in many places; the df.fillna method can fill them in:

pandas.DataFrame.fillna

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Fill NA/NaN values using the specified method

Parameters:
value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use NEXT valid observation to fill gap.

axis : {0 or ‘index’, 1 or ‘columns’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:
filled : DataFrame

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                    columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

Replace all NaN elements with 0s.

>>> df.fillna(0)  # fill with 0
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')  # fill from the neighboring (previous) value
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)  # fill with custom per-column values
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  # custom values, limiting how many are filled
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4

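df = df.fillna(method='ffill')  # forward fill from the previous value (recent pandas versions deprecate method= in favor of df.ffill())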

		video	playback position	paused	volume
time	user
1469974424	cheryl	intro.html	5	False	10.0
1469974424	sue	advanced.html	23	False	10.0
1469974454	cheryl	intro.html	6	False	10.0
1469974454	sue	advanced.html	24	False	10.0
1469974484	cheryl	intro.html	7	False	10.0
1469974514	cheryl	intro.html	8	False	10.0
1469974524	sue	advanced.html	25	False	10.0
1469974544	cheryl	intro.html	9	False	10.0
1469974554	sue	advanced.html	26	False	10.0
1469974574	cheryl	intro.html	10	False	10.0
1469974604	cheryl	intro.html	11	False	10.0
1469974624	sue	advanced.html	27	False	10.0
1469974634	cheryl	intro.html	12	False	10.0
1469974654	sue	advanced.html	28	False	5.0
1469974664	cheryl	intro.html	13	False	5.0
1469974694	cheryl	intro.html	14	False	5.0
1469974724	cheryl	intro.html	15	False	5.0
1469974724	sue	advanced.html	29	False	5.0
1469974754	sue	advanced.html	30	False	5.0
1469974824	sue	advanced.html	31	False	5.0
1469974854	sue	advanced.html	32	False	5.0
1469974924	sue	advanced.html	33	False	5.0
1469977424	bob	intro.html	1	True	10.0
1469977454	bob	intro.html	1	True	10.0
1469977484	bob	intro.html	1	True	10.0
1469977514	bob	intro.html	1	True	10.0
1469977544	bob	intro.html	1	True	10.0
1469977574	bob	intro.html	1	True	10.0
1469977604	bob	intro.html	1	True	10.0
1469977634	bob	intro.html	1	True	10.0
1469977664	bob	intro.html	1	True	10.0
1469977694	bob	intro.html	1	True	10.0
1469977724	bob	intro.html	1	True	10.0
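Forward filling only makes sense once the rows are in time order, so a sketch of the whole step might sort the index first. The tiny log below is invented, and df.ffill() is used here as the method form of forward fill.

```python
import numpy as np
import pandas as pd

log = pd.DataFrame({
    'time': [30, 10, 20, 40],                  # deliberately out of order
    'user': ['cheryl', 'cheryl', 'cheryl', 'cheryl'],
    'volume': [np.nan, 10.0, np.nan, 5.0],
})

log = log.set_index(['time', 'user']).sort_index()   # order rows by time, then user

filled = log.ffill()             # carry the last valid value forward
limited = log.ffill(limit=1)     # fill at most one consecutive NaN
zeros = log.fillna(0)            # or replace every NaN with a constant

print(filled['volume'].tolist())
print(limited['volume'].isna().sum())
print(zeros['volume'].tolist())
```

Without the sort, the value carried forward would come from whichever row happened to sit above the gap rather than from the previous point in time.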
