1. Data Structures and Sequences
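1.1 Tuples: a fixed-length, immutable sequence of Python objects
tup = 1, 2, 3 tup : (1, 2, 3)
tup = tuple([4, 0, 2]) tup : (4, 0, 2)
tup[0] : 4
Changing an element inside a tuple: the tuple itself cannot be modified, but a mutable object stored in it (such as a list) can be modified in place
tup = tuple(["foo", [1, 2], True])
tup[1].append(3)
tup : ("foo", [1, 2, 3], True)
Longer tuples can be built with + and * (a one-element tuple needs a trailing comma): (1, 2) + ("bar",) = (1, 2, "bar")
("foo", "bar") * 2 = ("foo", "bar", "foo", "bar")
Unpacking tuples: tup = (1, 2, 3)
a, b, c = tup
b : 2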
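The tuple rules above can be checked with a short sketch. The names record, combined, first and rest are made up for illustration, and starred unpacking is an extra that the notes do not cover.

```python
# A tuple's slots are fixed, but a mutable object stored in a slot can still change.
record = ("foo", [1, 2], True)

try:
    record[0] = "bar"          # rebinding a slot raises TypeError
except TypeError as exc:
    error_message = str(exc)

record[1].append(3)            # mutating the list inside the tuple is allowed

# Concatenation needs a real one-element tuple: note the trailing comma.
combined = (1, 2) + ("bar",)

# Unpacking also gives a compact way to swap two names.
a, b = 1, 2
a, b = b, a

# Starred unpacking collects the remaining elements into a list.
first, *rest = (1, 2, 3, 4)
```

Without the trailing comma, ("bar") is just the string "bar", and adding it to a tuple raises TypeError.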
1.2 Lists: variable-length and mutable; created with [ ] or the list function
a_list = [2, 3, 7, None]
b = ("foo", "bar", "baz")
b_list = list(b)
b_list : ["foo", "bar", "baz"]
Adding elements: b_list.append(2) b_list : ["foo", "bar", "baz", 2]
b_list.extend(["a", "b"]) b_list : ["foo", "bar", "baz", 2, "a", "b"]
b_list.insert(1, "two") b_list : ["foo", "two", "bar", "baz", 2, "a", "b"]
Removing elements: b_list.pop(3) b_list : ["foo", "two", "bar", 2, "a", "b"] (pop removes and returns the element at index 3, here "baz")
b_list.remove("two") b_list : ["foo", "bar", 2, "a", "b"] (remove deletes the first matching value)
Sorting: sort (sorts the list in place)
a = [2, 3, 1, 5, 4]
a.sort() a : [1, 2, 3, 4, 5]
b = ["as", "swa", "smaill", "c", "six"]
b.sort(key=len) b : ["c", "as", "swa", "six", "smaill"]
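Two details of the list methods above are easy to get wrong: sort works in place and returns None, and append and extend treat their argument differently. A minimal sketch, with variable names chosen only for illustration:

```python
words = ["as", "swa", "smaill", "c", "six"]

# list.sort() sorts in place and returns None.
in_place = list(words)
result = in_place.sort(key=len)

# sorted() leaves its input untouched and returns a new list.
by_length = sorted(words, key=len)

# append adds its argument as a single element ...
nested = [1, 2]
nested.append([3, 4])

# ... while extend adds each element of the iterable.
flat = [1, 2]
flat.extend([3, 4])
```

Both sorts are stable, so "swa" stays ahead of "six" because it came first among the words of length 3.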
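1.3 Built-in sequence functions
enumerate: returns a sequence of (i, value) tuples, where i is the index of the element and value is the element itself
some_list = ["foo", "bar", "baz"]
mapping = {}
for i, value in enumerate(some_list):
    mapping[value] = i
mapping : {"foo": 0, "bar": 1, "baz": 2}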
sorted: returns a new sorted list built from the elements of any sequence
sorted([7, 3, 5, 1])
[1, 3, 5, 7]
sorted(mapping.items(), key=lambda x: x[1], reverse=True)
[("baz", 2), ("bar", 1), ("foo", 0)]
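zip: pairs up the elements of lists, tuples, or other sequences into tuples; in Python 3 it returns an iterator, so wrap it in list to get a list of tuples
a = ["foo", "bar", "baz"]
b = ["one", "two", "three"]
Pairing: zipped = list(zip(a, b))
zipped : [("foo", "one"), ("bar", "two"), ("baz", "three")]
Unpairing: firstname, lastname = zip(*zipped)
firstname : ("foo", "bar", "baz")
lastname : ("one", "two", "three")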
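reversed: iterates over the elements of a sequence in reverse order (it returns an iterator, so use list to materialize it)
list(reversed(range(10)))
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]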
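In Python 3, zip returns a one-shot iterator, which is easy to trip over when pairing and then unpairing. A small sketch of that behaviour, with names such as pairs, leftover and labelled chosen only for illustration:

```python
a = ["foo", "bar", "baz"]
b = ["one", "two", "three"]

zipped = zip(a, b)
pairs = list(zipped)       # consumes the iterator
leftover = list(zipped)    # the iterator is now exhausted

# "Unzipping" therefore works on the stored list, not on the spent iterator.
first_items, second_items = zip(*pairs)

# enumerate and zip combine naturally when an index is needed as well.
labelled = [(i, x, y) for i, (x, y) in enumerate(zip(a, b))]
```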
1.4 Dicts: keys must be unique
data = {"a": "some value", "b": [1, 2, 3, 4], "c": "an integer"}
Deleting entries: data.pop("b") data : {"a": "some value", "c": "an integer"}
del data["a"] data : {"c": "an integer"}
Merging two dicts: data.update({"d": 12, "e": "foo"})
data : {"c": "an integer", "d": 12, "e": "foo"}
Building a dict from sequences: mapping = dict(zip(range(5), reversed(range(5))))
mapping : {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
Getting a value with a default: value = some_dict.get(key, default_value) returns default_value when key is not in the dict
data.keys() returns the keys; data.values() returns the values; data.items() returns (key, value) pairs
A common pattern is for the values of a dict to be other collections, such as lists. For example, grouping words by their first letter:
words = ["apple", "bat", "bar", "stom", "book"]
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)
by_letter : {"a": ["apple"], "b": ["bat", "bar", "book"], "s": ["stom"]}
The code above can be replaced with:
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
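A third way to write the same grouping is collections.defaultdict from the standard library. The notes above do not use it; this sketch is offered only as an alternative.

```python
from collections import defaultdict

words = ["apple", "bat", "bar", "stom", "book"]

# defaultdict(list) creates an empty list the first time a key is accessed.
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

# Convert back to a plain dict for display or comparison.
grouped = dict(by_letter)
```

setdefault keeps a plain dict, whereas defaultdict also inserts a key on any lookup of a missing key, which may or may not be what is wanted.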
1.5 Sets: an unordered collection of unique elements; created with set or with curly braces
set([1, 1, 2, 3, 4]) {1, 2, 3, 4}
{1, 1, 2, 3, 3, 4} {1, 2, 3, 4}
1.6 List, set, and dict comprehensions
1.6.1 Lists:
[expr for val in collection if condition]
is equivalent to
result = []
for val in collection:
    if condition:
        result.append(expr)
Example: string = ["a", "as", "bat", "car", "dove", "python"]
[x.upper() for x in string if len(x) > 2]
["BAT", "CAR", "DOVE", "PYTHON"]
1.6.2 Sets: a set comprehension looks like a list comprehension with the square brackets replaced by curly braces
{expr for val in collection if condition}
Example: string = ["a", "as", "bat", "car", "dove", "python"]
unique = {len(x) for x in string}
{1, 2, 3, 4, 6}
1.6.3 Dicts
{key_expr: value_expr for value in collection if condition}
Example: string = ["a", "as", "bat", "car", "dove", "python"]
{val: index for index, val in enumerate(string)}
{"a": 0, "as": 1, "bat": 2, "car": 3, "dove": 4, "python": 5}
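1.6.4 Nested list comprehensions
all_data = [["john", "emily", "mary"], ["maria", "juan", "natalis"]]
[name for names in all_data for name in names if name.count("a") >= 2]
["maria", "natalis"]
To keep one result list per inner list instead of a flat list, put a comprehension inside a comprehension: [[name for name in names if name.count("a") >= 2] for names in all_data]
[[], ["maria", "natalis"]]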
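The for clauses of a nested comprehension read in the same order as the equivalent nested loops. A minimal sketch that spells this out, with flat_loop and nested as illustrative names:

```python
all_data = [["john", "emily", "mary"], ["maria", "juan", "natalis"]]

# Flat comprehension: the for clauses appear in the same order as nested loops.
flat = [name for names in all_data for name in names if name.count("a") >= 2]

# The same logic written as explicit loops.
flat_loop = []
for names in all_data:
    for name in names:
        if name.count("a") >= 2:
            flat_loop.append(name)

# A comprehension inside a comprehension keeps one result list per inner list.
nested = [[name for name in names if name.count("a") >= 2] for names in all_data]
```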
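2. NumPy
import numpy as np
Creating an array: data = np.array([[-0.2, 0.4, -0.5],
[0.6, 0.99, 0.3]])
Creating a random array: data = np.random.randn(2, 3) (values differ on every run; sample output shown)
[[-0.2, 0.4, -0.5],
[0.6, 0.99, 0.3]]
Number of dimensions: data.ndim 2
Shape: data.shape (2, 3)
Data type: data.dtype dtype("float64")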
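Array creation functions
arange        array version of the built-in Python range; returns an array
ones          creates an array of all 1s with the given shape and data type
ones_like     creates an array of all 1s with the same shape as the given array
zeros         creates an array of all 0s with the given shape and data type
zeros_like    creates an array of all 0s with the same shape as the given array
empty         creates an array with the given shape without initializing its values
empty_like    creates an array with the same shape as the given array without initializing its values
full          creates an array with the given shape and data type, filled with the specified value
full_like     creates an array with the same shape as the given array, filled with the specified value
astype: an array method (not a creation function) that explicitly converts an array to another data type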
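arr = np.array([1, 2, 3, 4, 5])
arr.dtype dtype("int64") (the default integer type can differ by platform)
float_arr = arr.astype(np.float64)
float_arr.dtype dtype("float64")
Array slices are views on the original data. To get a copy of a slice rather than a view, copy it explicitly, for example arr[5:8].copy()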
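The difference between a view and a copy is easiest to see by writing through each one. A small sketch, assuming a 10-element array so that the slice arr[5:8] is non-empty:

```python
import numpy as np

arr = np.arange(10)

# A slice is a view: writing through it changes the original array.
view = arr[5:8]
view[:] = 12

# An explicit copy owns its data: writing to it leaves arr alone.
copied = arr[5:8].copy()
copied[:] = 0
```

np.shares_memory(arr, view) is a convenient way to check whether two arrays overlap in memory.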
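Boolean indexing
names = np.array(["bob", "joe", "will", "bob"])
data = np.random.randn(4, 4)
data (sample values)
[[0.1, 0.3, 0.2, -0.8],
[0.4, -0.2, 0.5, 0.1],
[0.1, 0.9, 0.5, -0.4],
[0.5, 0.3, -0.2, 0.3]]
data[names == "bob"]
[[0.1, 0.3, 0.2, -0.8],
[0.5, 0.3, -0.2, 0.3]]
data[names == "bob", 2:]
[[0.2, -0.8],
[-0.2, 0.3]]
To negate a condition, put ~ in front of the expression
data[~(names == "bob")]
[[0.4, -0.2, 0.5, 0.1],
[0.1, 0.9, 0.5, -0.4]]
To select on several names, combine the boolean conditions with the element-wise operators & (and) and | (or); the Python keywords and and or do not work on boolean arrays
For example: mask = (names == "bob") | (names == "joe")
data[mask]; or data[np.isin(names, ["bob", "joe"])]
data[data < 0] = 0 sets every value below 0 to 0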
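The boolean-indexing steps above, collected into one runnable sketch. It uses the fixed sample values from the notes instead of random draws so the results are reproducible; same_mask, not_bob and clipped are illustrative names.

```python
import numpy as np

names = np.array(["bob", "joe", "will", "bob"])
data = np.array([[0.1, 0.3, 0.2, -0.8],
                 [0.4, -0.2, 0.5, 0.1],
                 [0.1, 0.9, 0.5, -0.4],
                 [0.5, 0.3, -0.2, 0.3]])

# Rows whose name is "bob" (rows 0 and 3).
bob_rows = data[names == "bob"]

# Two equivalent ways to select "bob" or "joe".
mask = (names == "bob") | (names == "joe")
same_mask = np.isin(names, ["bob", "joe"])

# Negation with ~ selects every other row.
not_bob = data[~(names == "bob")]

# Assigning through a boolean mask modifies the array in place,
# so work on a copy to keep the original values.
clipped = data.copy()
clipped[clipped < 0] = 0
```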
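Transposing an array: data.T
Unary universal functions (ufuncs)
Example: np.sqrt(data)
Function                 Description
abs, fabs                absolute value of each element
sqrt                     square root of each element (equivalent to data ** 0.5)
square                   square of each element (equivalent to data ** 2)
exp                      exponential e ** x of each element
log, log10, log2, log1p  natural logarithm (base e), base 10, base 2, and log(1 + x), respectively
rint                     rounds each element to the nearest integer, preserving the dtype
isnan                    returns a boolean array indicating whether each element is NaN (Not a Number)
... ...
Conditional logic: np.where(cond, xarr, yarr) works like the IF function in Excel: it takes the value from xarr where cond is True and from yarr otherwise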
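A short sketch of np.where, first choosing between two arrays and then between two scalars. The sample values are invented for illustration.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Take the element from xarr where cond is True, otherwise from yarr.
picked = np.where(cond, xarr, yarr)

# The second and third arguments may also be scalars.
values = np.array([[-1.5, 0.5], [2.0, -0.1]])
signs = np.where(values > 0, 1, -1)
```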
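Mathematical and statistical methods
data.sum() or np.sum(data)
Method           Description
sum              sum of all elements in the array, or along an axis
mean             arithmetic mean; the mean of a zero-length array is NaN
std, var         standard deviation and variance
min, max         minimum and maximum
argmin, argmax   indices of the minimum and maximum elements
cumsum           cumulative sum of elements, starting from 0
cumprod          cumulative product of elements, starting from 1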
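Most of these methods accept an optional axis argument. A minimal sketch on a 3 x 3 array, where axis=0 works down the rows (one result per column) and axis=1 works across the columns (one result per row):

```python
import numpy as np

arr = np.array([[0, 1, 2],
                [3, 4, 5],
                [6, 7, 8]])

total = arr.sum()               # all elements: 36
col_sums = arr.sum(axis=0)      # one sum per column: [9, 12, 15]
row_means = arr.mean(axis=1)    # one mean per row: [1.0, 4.0, 7.0]
running = arr.cumsum(axis=0)    # running sum down each column
flat_argmax = arr.argmax()      # position in the flattened array: 8
```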
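Boolean arrays
arr = np.random.randn(100)
(arr > 0).sum() # number of positive values
42 (sample output; the count varies between runs)
any checks whether at least one value in the array is True, and all checks whether every value is True
bools = np.array([False, False, True, False])
bools.any() True
bools.all() False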
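Unique values and other set logic
np.unique: returns the sorted unique values
names = np.array(["a", "b", "b", "c"])
np.unique(names) ["a", "b", "c"]
np.in1d: tests whether each value of one array is present in another array and returns a boolean array (np.isin is the newer equivalent)
arr = np.array([1, 2, 3])
np.in1d(arr, [1, 2]) [True, True, False]
Method             Description
unique(x)          sorted unique elements of x
intersect1d(x, y)  sorted elements common to x and y (intersection, like x & y for sets)
union1d(x, y)      sorted union of the elements of x and y (like x | y for sets)
in1d(x, y)         boolean array indicating whether each element of x is contained in y
setdiff1d(x, y)    set difference: elements in x that are not in y
setxor1d(x, y)     symmetric difference: elements that are in x or in y, but not in both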
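A sketch that runs each of these set routines on two small integer arrays. The inputs are made up for illustration, and np.isin is used for the membership test.

```python
import numpy as np

x = np.array([3, 1, 2, 3, 5])
y = np.array([2, 3, 4])

unique_x = np.unique(x)          # [1, 2, 3, 5]
common = np.intersect1d(x, y)    # [2, 3]
either = np.union1d(x, y)        # [1, 2, 3, 4, 5]
only_x = np.setdiff1d(x, y)      # [1, 5]
not_both = np.setxor1d(x, y)     # [1, 4, 5]
member = np.isin(x, y)           # elementwise: is x[i] in y?
```

All of these except the membership test return sorted, de-duplicated arrays; the membership test returns one boolean per element of x.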
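Example: generating a random walk
import random
start = 0                      # current position
walk = [start]                 # positions visited so far
n = 1000                       # number of steps
for i in range(n):
    # step up or down by 1 with equal probability
    step = 1 if random.randint(0, 1) else -1
    start += step
    walk.append(start)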
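The same walk can be written without a Python loop by drawing all steps at once and taking a cumulative sum. This is a sketch, assuming NumPy 1.17 or later for np.random.default_rng; the seed 12345 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12345)   # seeded generator for reproducibility
n = 1000

# Draw n coin flips (0 or 1) and map them to steps of -1 or +1.
draws = rng.integers(0, 2, size=n)
steps = np.where(draws > 0, 1, -1)

# The walk is the running sum of the steps, with the starting position 0 in front.
walk = np.concatenate(([0], steps.cumsum()))

# Simple statistics fall out of the array methods.
lowest, highest = walk.min(), walk.max()
```

Like the loop version, walk holds n + 1 positions, and consecutive positions always differ by exactly 1.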
3. pandas
import pandas as pd
Creating a DataFrame
data = pd.DataFrame(data, index = [], columns = []) (pass the row labels as index and the column labels as columns)
Use the isnull and notnull functions to detect missing data
obj = pd.Series({"a": np.nan, "b": 1200, "c": 1400})
obj
a NaN
b 1200.0
c 1400.0
pd.isnull(obj) or obj.isnull()
a True
b False
c False
notnull returns the opposite
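The boolean Series returned by isnull and notnull can be used directly for counting and filtering. A short sketch; fillna is an extra pandas method not mentioned in the notes above.

```python
import numpy as np
import pandas as pd

obj = pd.Series({"a": np.nan, "b": 1200, "c": 1400})

missing = obj.isnull()            # boolean Series, True where the value is NaN
n_missing = int(missing.sum())    # True counts as 1
present = obj[obj.notnull()]      # keep only the non-missing entries
filled = obj.fillna(0)            # replace NaN with 0 in a new Series
```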
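Deleting a column
del data["column_name"] (deletes in place) or data.drop("column_name", axis = 1) (returns a new object and leaves data unchanged)
A column selected from a DataFrame is a view on the underlying data, not a copy, so modifying the Series is reflected in the DataFrame (in pandas versions without copy-on-write enabled). If a copy is needed, call the copy method explicitly.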
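A small sketch contrasting the two ways of removing a column, plus an explicit copy of a column. The frame and its column names x, y, z are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

# drop returns a new DataFrame; frame itself still has all three columns.
dropped = frame.drop("z", axis=1)

# del removes the column from frame in place.
del frame["y"]

# An explicit copy can be modified without touching the DataFrame.
col = frame["x"].copy()
col.iloc[0] = 100
```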
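3.1 Reindexing: reindex
reindex is an important pandas method. It creates a new object that conforms to a new index, with the data rearranged to follow that index; any index value that was not present before introduces a missing value. The method argument fills these gaps with ffill (forward fill) or bfill (backward fill), and requires the existing index to be sorted (monotonic). reindex also works on columns.
obj = pd.Series([4, 7, -5, 3], index = ["d", "b", "a", "c"])
d 4
b 7
a -5
c 3
obj1 = obj.reindex(index = ["a", "b", "c", "d", "e"])
a -5.0
b 7.0
c 3.0
d 4.0
e NaN
obj2 = obj.sort_index().reindex(index = ["a", "b", "c", "d", "e"], method = "ffill") (sort first, because filling needs a monotonic index)
a -5
b 7
c 3
d 4
e 4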
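Arguments of the reindex method
Argument        Description
index/columns   new sequence to use as the index; can be an Index instance or any other sequence-like Python data structure
method          interpolation (fill) method: "ffill" fills forward, "bfill" fills backward
fill_value      substitute value to use when reindexing introduces missing data
limit           when filling forward or backward, the maximum size gap (in number of elements) to fill
tolerance       when filling forward or backward, the maximum size gap (in absolute numeric distance) to fill for inexact matches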
level
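A few of the reindex arguments from the table in one sketch. The integer-indexed Series colors and the DataFrame frame are made-up examples; an integer index is used for the forward fill because method needs a monotonic index.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
new_index = ["a", "b", "c", "d", "e"]

# Labels that did not exist before become NaN ...
plain = obj.reindex(new_index)

# ... unless fill_value supplies a replacement.
zero_filled = obj.reindex(new_index, fill_value=0)

# Forward fill carries the last valid value into the gaps.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
ffilled = colors.reindex(range(6), method="ffill")

# limit caps how many consecutive gaps are filled.
sparse = pd.Series(["blue"], index=[0])
limited = sparse.reindex(range(4), method="ffill", limit=1)

# reindex can also reorder or add columns of a DataFrame.
frame = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
recol = frame.reindex(columns=["y", "x", "w"])
```

Here limited keeps "blue" at labels 0 and 1 only, because limit=1 stops the fill after one consecutive gap.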