NumPy and One-Liners in Data Science
Basic 2D array computation
Creating 1D, 2D, and 3D arrays
import numpy as np
a = np.array([1,2,3,4,5,6,7])
b=np.array([[1,2],
[3,4]])
c=np.array([[[1,2],[3,4]],
[[5,6],[7,8]]])
# Check the number of dimensions of the 1D array
print(a.ndim)
1
# Check the number of dimensions of the 2D and 3D arrays
print(b.ndim)
print(c.ndim)
2
3
Basic arithmetic on 2D arrays
a=np.array([[1,0,0],
[1,1,1],
[2,0,0]])
b=np.array([[1,1,1],
[1,2,1],
[1,0,2]])
a+b
array([[2, 1, 1],
[2, 3, 2],
[3, 0, 2]])
a-b
array([[ 0, -1, -1],
[ 0, -1, 0],
[ 1, 0, -2]])
a*b
array([[1, 0, 0],
[1, 2, 1],
[2, 0, 0]])
# 0/0 is an invalid operation, but NumPy does not raise an error: it emits a RuntimeWarning and stores nan in the result.
a/b
C:\Users\A1\AppData\Local\Temp/ipykernel_7944/1348051284.py:1: RuntimeWarning: invalid value encountered in true_divide
a/b
array([[1. , 0. , 0. ],
[1. , 0.5, 1. ],
[2. , nan, 0. ]])
# Arithmetic on NumPy arrays is always applied element by element
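The warning above comes from the single 0/0 entry. As a hedged sketch that is not part of the original notebook, `np.divide` with its `out` and `where` arguments can skip the zero denominators altogether. The fill value 0.0 is an arbitrary choice made for illustration; pick whatever placeholder suits your data.

```python
import numpy as np

a = np.array([[1, 0, 0],
              [1, 1, 1],
              [2, 0, 0]])
b = np.array([[1, 1, 1],
              [1, 2, 1],
              [1, 0, 2]])

# Divide only where the denominator is non-zero. Positions that are skipped
# keep the value already stored in `out` (0.0 here, an assumed fill value).
safe = np.divide(a, b, out=np.zeros(a.shape, dtype=float), where=(b != 0))
print(safe)
```

No RuntimeWarning is emitted, because the 0/0 element is never computed.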
Aggregate functions: np.max(), np.min(), np.average()
np.max(a)
2
np.min(b)
0
np.average(a)
0.6666666666666666
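Given the annual salaries and tax rates of a group of people, find the highest after-tax income
# Data: annual income for the three years [2017, 2018, 2019]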
alice = [99,101,103]
bob = [ 110,108,105]
tim = [90,88,85]
salaries = np.array([alice,bob,tim])
taxation = np.array([[0.2,0.25,.22],
[.4,.5,.5],
[.1,.2,.1]])
# One-liner
max_income=np.max(salaries-salaries*taxation)
max_income
81.0
print(salaries-salaries*taxation) # after-tax income for every person and year
[[79.2 75.75 80.34]
[66. 54. 52.5 ]
[81. 70.4 76.5 ]]
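The one-liner returns the highest after-tax amount but not who earned it. A minimal sketch of recovering the person and year follows; the `names` and `years` label lists are assumptions added here to mirror the row and column order of the data.

```python
import numpy as np

names = ["alice", "bob", "tim"]      # assumed row labels, same order as the salary rows
years = [2017, 2018, 2019]           # assumed column labels
salaries = np.array([[99, 101, 103],
                     [110, 108, 105],
                     [90, 88, 85]])
taxation = np.array([[0.2, 0.25, 0.22],
                     [0.4, 0.5, 0.5],
                     [0.1, 0.2, 0.1]])

net = salaries * (1 - taxation)      # after-tax income, element by element
# argmax works on the flattened array; unravel_index maps it back to (row, column)
row, col = np.unravel_index(np.argmax(net), net.shape)
print(names[row], years[col], net[row, col])
```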
NumPy array slicing, broadcasting, and data types
Markdown overview
Markdown is a lightweight markup language.
Horizontal rule
Bold
Italic
Bold italic
Strikethrough
[Link text](http://www.baidu.com "title")
Paragraph: press Enter twice
This is a new paragraph
Block quote
This is the quoted text
List
- Unordered list item 1
- Sub-item 1
- Unordered list item 1
- Sub-item 1
- Unordered list item 1
- Sub-item 1
- Ordered list item
- Sub-item
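Slicing and indexing
NumPy can index several dimensions of a multidimensional array at once; the index for each dimension is separated by a comma.
1D slicing examples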
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,0])
a
array([1, 2, 3, 4, 5, 6, 7, 8, 9, 0])
a[:]
array([1, 2, 3, 4, 5, 6, 7, 8, 9, 0])
a[2:]
array([3, 4, 5, 6, 7, 8, 9, 0])
a[1:4]
array([2, 3, 4])
a[2:-2]
array([3, 4, 5, 6, 7, 8])
a[::2]
array([1, 3, 5, 7, 9])
a[2::2]
array([3, 5, 7, 9])
a[::-1]
array([0, 9, 8, 7, 6, 5, 4, 3, 2, 1])
a[:1:-2]
array([0, 8, 6, 4])
a[-1:1:-2]
array([0, 8, 6, 4])
2D slicing examples
a = np.array([[0,1,2,3],
[4,5,6,7],
[8,9,10,11],
[12,13,14,15]]
)
# All rows, third column
a[:,2]
array([ 2, 6, 10, 14])
# Second row, all columns
a[1,:]
array([4, 5, 6, 7])
# Third row, every second column
a[2,::2]
array([ 8, 10])
# All rows, without the last column
a[:,:-1]
array([[ 0, 1, 2],
[ 4, 5, 6],
[ 8, 9, 10],
[12, 13, 14]])
# A single slice applies to axis 0; the remaining axes are taken in full
a[:-2]
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
**Summary:** the general form of a multidimensional slice is ndarray[slice1, slice2, slice3, …], where each slice is start:stop:step.
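One consequence of basic slicing that is easy to miss: a slice is a view onto the original data, not a copy. The sketch below illustrates this with a hypothetical 4x4 array named `grid`; the names are chosen for this example only.

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)
block = grid[1:3, ::2]                  # rows 1-2, every second column
print(block.shape)

block[:] = -1                           # writes through to grid, because block is a view
print(np.shares_memory(grid, block))
print(grid)

independent = grid[1:3, ::2].copy()     # .copy() detaches the data
independent[:] = 0                      # grid is not affected this time
```

Call `.copy()` whenever the slice must be modified without touching the source array.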
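Broadcasting
Broadcasting is the process by which NumPy automatically brings two ndarrays to the same shape.
To let two arrays of different shapes take part in one operation, NumPy conceptually stretches the lower-dimensional array, repeating its values, until it matches the higher-dimensional one.
# Number of dimensions of an array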
import numpy as np
a = np.array([1,2,3])
b = np.array([[1,2,3],
[4,5,6]])
c=np.array([[[1,2,3],[4,5,6]],
[[7,8,9],[10,11,12]]])
a.ndim
1
b.ndim
2
c.ndim
3
# shape returns a tuple holding the number of elements along each dimension
a.shape
(3,)
b.shape
(2, 3)
c.shape
(2, 2, 3)
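Summary: each time arrays are nested one level deeper, the new axis becomes axis 0, and axis i of the lower-dimensional array becomes axis i+1 of the higher-dimensional array.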
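To make the broadcasting rule concrete, here is a small sketch. The arrays `row`, `table`, and `col` are illustrative names, not taken from the original. Shapes are compared from the trailing axis backwards, and an axis of length 1 (or a missing axis) is stretched to match.

```python
import numpy as np

row = np.array([1, 2, 3])               # shape (3,)
table = np.array([[10, 20, 30],
                  [40, 50, 60]])        # shape (2, 3)

# (2, 3) combined with (3,): row is reused for every row of table.
total = table + row
print(total)

# A column of shape (2, 1) is stretched along axis 1 instead.
col = np.array([[100], [200]])
shifted = table + col
print(shifted)

# Trailing sizes 3 and 2 do not match, so NumPy raises ValueError.
try:
    table + np.array([1, 2])
except ValueError as err:
    print("cannot broadcast:", err)
```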
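Homogeneous: every element of an array must have the same type
bool 1 byte
int 4 or 8 bytes by default, depending on the platform
np.int8 1 byte
np.int16 2 bytes
np.int32 4 bytes
np.int64 8 bytes
float 8 bytes by default
np.float16 2 bytes
np.float32 4 bytes
np.float64 8 bytes
complex 16 bytes by default
# Specify the element type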
a=np.array([1,2,3],dtype=np.int16)
b=np.array([11,22,33],dtype=np.float32)
b.dtype
dtype('float32')
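The sizes listed above can be checked directly: every dtype exposes `itemsize` (bytes per element), and every array exposes `nbytes` (total bytes of data). A short sketch, with `small` as an illustrative array name:

```python
import numpy as np

# Print the size in bytes of one element for several dtypes
for dt in (np.bool_, np.int8, np.int16, np.int32, np.int64,
           np.float16, np.float32, np.float64, np.complex128):
    print(np.dtype(dt).name, np.dtype(dt).itemsize)

small = np.array([1, 2, 3], dtype=np.int16)
print(small.itemsize, small.nbytes)     # bytes per element, bytes in total
```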
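Raising the data scientist's salary by 10% every other year
A 2D array holds the salaries of several professions for 2025, 2026, and 2027; raise the data scientist's salary by 10% every other year.
import numpy as np
# Data: annual income for [2025, 2026, 2027]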
datascientist = [130,132,137]
productmanager=[127,140,145]
designer = [118,118,127]
softwareEngineer=[129,131,137]
employees=np.array([datascientist,
productmanager,
designer,
softwareEngineer])
# One-liner
employees[0,::2]=employees[0,::2]*1.1
employees
array([[143, 132, 150],
[127, 140, 145],
[118, 118, 127],
[129, 131, 137]])
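**Summary:** the one-liner combines slicing, slice assignment, and broadcasting (the sliced array is multiplied by a float scalar). Because the result is assigned back into an integer array, the element type stays integer and the fractional part is discarded: 137 * 1.1 = 150.7 is stored as 150.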
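If the fractional part of the raise matters, one option is to create the array with a floating-point dtype up front. This is a sketch of that variation, not the original author's approach, and it uses only the first two rows of the data for brevity.

```python
import numpy as np

employees = np.array([[130, 132, 137],      # data scientist
                      [127, 140, 145]],     # product manager
                     dtype=float)           # float storage keeps the decimals

employees[0, ::2] *= 1.1                    # raise years 2025 and 2027 by 10%
print(employees[0])
```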
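Conditional array search, filtering, and broadcasting to detect outliers
Background
# nonzero() returns the indices of the non-zero elements of an array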
x=np.array([[1,0,0],
[0,2,2],
[3,0,3]])
print(np.nonzero(x))
(array([0, 1, 1, 2, 2], dtype=int64), array([0, 1, 2, 0, 2], dtype=int64))
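The result is a tuple of NumPy arrays, one per dimension: the first holds the row indices of the non-zero values and the second holds their column indices.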
y=np.array([[[1,2,3,0],[1,2,0,4]],
[[0,1,3,0],[0,0,4,5]]])
print(np.nonzero(y))
(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int64), array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1], dtype=int64), array([0, 1, 2, 0, 1, 3, 1, 2, 2, 3], dtype=int64))
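The per-axis index arrays returned by `np.nonzero` can be hard to read. The sketch below pairs them up into coordinates for the 2D example, and shows that the tuple can be used directly as an index.

```python
import numpy as np

x = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 3]])

rows, cols = np.nonzero(x)
coords = list(zip(rows.tolist(), cols.tolist()))   # (row, column) pairs
print(coords)

# Indexing with the two arrays pulls out the non-zero values themselves.
values = x[rows, cols]
print(values)

# np.argwhere returns the same coordinates as a single (n, 2) array.
print(np.argwhere(x))
```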
# Broadcasting a comparison to get a Boolean array
a= np.array([[1,0,0],
[0,2,2],
[3,0,0]])
print(a==2)
[[False False False]
[False True True]
[False False False]]
Finding the cities whose pollution peaks exceed the average
# Data: air quality index (row = city)
x=np.array([
[42,40,41,43,44,43],#Hong Kong
[30,31,29,29,29,30],#New York
[8,13,31,11,11,9],# Berlin
[11,11,12,13,11,12]] )# Montreal
cities = np.array(['Hong Kong','New York','Berlin','Montreal'])
## One-liner
polluted = set(cities[np.nonzero(x>np.average(x))[0]])
polluted
{'Berlin', 'Hong Kong', 'New York'}
x>np.average(x)
array([[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[False, False, True, False, False, False],
[False, False, False, False, False, False]])
np.nonzero(x>np.average(x))
(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2], dtype=int64),
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 2], dtype=int64))
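The `set(...)` in the one-liner is needed because `np.nonzero` reports a city once per above-average reading. An alternative sketch, offered as one possible variation, reduces the Boolean array along each row with `any(axis=1)` so that every city appears at most once and the original order is kept.

```python
import numpy as np

x = np.array([[42, 40, 41, 43, 44, 43],    # Hong Kong
              [30, 31, 29, 29, 29, 30],    # New York
              [8, 13, 31, 11, 11, 9],      # Berlin
              [11, 11, 12, 13, 11, 12]])   # Montreal
cities = np.array(['Hong Kong', 'New York', 'Berlin', 'Montreal'])

# True for a city if at least one of its readings exceeds the overall average
above = (x > np.average(x)).any(axis=1)
polluted = cities[above]
print(polluted)
```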
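Advanced indexing: NumPy also accepts a sequence as an index, so the selection does not have to be a contiguous slice.
Pass either a sequence of integers (the indices to select) or a sequence of Booleans (selecting the positions whose value is True) to pick elements from the array.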
x=[1,2,3,4,5,6]
x=np.array(x)
x[[1,4,5]]
array([2, 5, 6])
x[[0,1,1,0]]
array([1, 2, 2, 1])
x[[True,False,False,True,True,True]]
array([1, 4, 5, 6])
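In practice the Boolean sequence is rarely typed by hand; it usually comes from a condition on the array itself. The sketch below also shows that advanced indexing returns a copy, unlike basic slicing. The names `mask`, `evens`, and `picked` are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

mask = x % 2 == 0          # Boolean array built from a condition
evens = x[mask]
print(evens)

picked = x[[1, 4, 5]]      # integer-sequence indexing returns a copy
picked[0] = 99             # so changing it leaves x untouched
print(picked, x)
```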
Filtering a 2D array with Boolean indexing
Background
# Data array and index array
a = np.array([[1,2,3], # data array
[4,5,6],
[7,8,9]])
indices = np.array([[False,False,True], # index array
[True,False,False],
[False,True,True]])
a[indices]
array([3, 4, 8, 9])
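# Given a 2D array in which each row describes one influencer (first column: follower count in millions, second column: name), find the names of the influencers with more than 100 million followers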
inst = np.array([[232,"李佳琪"],
[133,'老罗'],
[120,"薇薇安"],
[111,"唐糖"],
[76,"一地鸡毛"]])
superstar = inst[inst[:,0].astype(float)>100,1]
superstar
array(['李佳琪', '老罗', '薇薇安', '唐糖'], dtype='<U11')
# astype(float) converts the sliced column to floats; because the original array mixes integers and strings, NumPy converted every element to a string
inst.dtype
dtype('<U11')
inst[:,0].astype(float)>100
array([ True, True, True, True, False])
Cleaning array elements at a fixed step with broadcasting, slice assignment, and reshaping
Background
# Slice assignment
a=np.array([4]*16)
a
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
a[1:]=[32]*15
a
array([ 4, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32])
a[2:10:3]=16
a
array([ 4, 32, 16, 32, 32, 16, 32, 32, 16, 32, 32, 32, 32, 32, 32, 32])
# Reshaping with reshape
a=np.array([1,2,3,4,5,6,7,8,9])
a.reshape((3,3))
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
a
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
a.reshape((3,-1)) # a dimension given as -1 is computed automatically by NumPy
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# The axis argument
a=np.array(range(10))
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
b=a.reshape((2,-1))
b
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
np.average(b,axis=0) # average down each column (collapses axis 0)
array([2.5, 3.5, 4.5, 5.5, 6.5])
np.average(b,axis=1) # average across each row (collapses axis 1)
array([2., 7.])
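A reduction along an axis removes that axis from the result. When the result has to be combined with the original array again, `keepdims=True` keeps the reduced axis with length 1 so that broadcasting lines up. A hedged sketch of centering each row around its own mean:

```python
import numpy as np

b = np.arange(10).reshape(2, -1)              # same 2x5 array as above

row_means = b.mean(axis=1, keepdims=True)     # shape (2, 1) instead of (2,)
centered = b - row_means                      # broadcasts across the 5 columns
print(row_means.shape)
print(centered)
```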
# Given an array of temperature readings, replace every seventh value with the average of the seven days of its week
data = [1,2,3,4,5,6,12,
2,3,4,5,4,3,1,
3,5,3,2,3,4,9]
tmp=np.array(data)
# One-liner
data[6::7]=np.average(tmp.reshape((-1,7)),axis=1)
data
[1,
2,
3,
4,
5,
6,
4.714285714285714,
2,
3,
4,
5,
4,
3,
3.142857142857143,
3,
5,
3,
2,
3,
4,
4.142857142857143]
tmp.reshape((-1,7))
array([[ 1, 2, 3, 4, 5, 6, 12],
[ 2, 3, 4, 5, 4, 3, 1],
[ 3, 5, 3, 2, 3, 4, 9]])
np.average(tmp.reshape((-1,7)),axis=1)
array([4.71428571, 3.14285714, 4.14285714])
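The one-liner above assigns into the Python list `data`, which is why the output is a list that mixes integers and floats. A variation that stays inside NumPy is sketched below; the names `weeks` and `cleaned` are chosen for this example.

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 12,
        2, 3, 4, 5, 4, 3, 1,
        3, 5, 3, 2, 3, 4, 9]

# Float dtype so the averages are not truncated; one row per week
weeks = np.array(data, dtype=float).reshape(-1, 7)
# The right-hand side is evaluated first, so the means still include
# the original seventh-day readings.
weeks[:, 6] = weeks.mean(axis=1)
cleaned = weeks.ravel()                # back to a flat 1D array
print(cleaned)
```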
Sorting in NumPy
sort() and argsort()
# argsort() returns the array of indices into the original array that would sort it
import numpy as np
a=list(reversed([0,1,2,3,4,5,6,7]))
a=np.array(a)
np.sort(a)
array([0, 1, 2, 3, 4, 5, 6, 7])
np.argsort(a)
array([7, 6, 5, 4, 3, 2, 1, 0], dtype=int64)
a=np.array([10,6,8,2,5,4,9,10])
np.sort(a)
array([ 2, 4, 5, 6, 8, 9, 10, 10])
np.argsort(a)
array([3, 5, 4, 1, 2, 6, 0, 7], dtype=int64)
# Sorting along a given axis
a=np.array([[1,6,2],
[5,1,1],
[8,4,3]])
np.sort(a,axis=0)
array([[1, 1, 1],
[5, 4, 2],
[8, 6, 3]])
np.sort(a,axis=1)
array([[1, 2, 6],
[1, 1, 5],
[3, 4, 8]])
Names of the three highest-scoring students
# Data: scores of different students
score=np.array([1100,1256,1543,1043,998,1200,1533])
students=np.array(['bob','tom','kohn','john','jim','anmi','rose'])
# One-liner
top_3=students[
np.argsort(score)[:-4:-1]]
top_3
array(['kohn', 'rose', 'tom'], dtype='<U4')
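The slice `[:-4:-1]` walks the ascending order backwards and stops after three items. Two equivalent spellings are sketched below with an explicit `k`: reversing and then taking the first `k`, and `np.argpartition`, which avoids a full sort and can be useful for large arrays. The variable names are illustrative.

```python
import numpy as np

score = np.array([1100, 1256, 1543, 1043, 998, 1200, 1533])
students = np.array(['bob', 'tom', 'kohn', 'john', 'jim', 'anmi', 'rose'])
k = 3

# Full sort, reversed to descending order, first k indices
order = np.argsort(score)[::-1][:k]
print(students[order])

# Partial sort: the last k positions hold the k largest scores in no
# particular order, so they are sorted afterwards.
top = np.argpartition(score, -k)[-k:]
top = top[np.argsort(score[top])[::-1]]
print(students[top])
```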
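Filtering arrays with a lambda function and Boolean indexing
Creating a filter function
# The function takes an array of books x and a minimum rating y,
# and returns the potential bestsellers whose rating is above y
# Data (row = [title, rating])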
books = np.array([['笨办法学python',4.9],
['流畅的python',5.9],
['python从入门到项目实践',5.2],
['从0到1javascript快速上手',4.7],
['从1到无穷大',5.1]])
# One-liner
predict_bestseller=lambda x,y:x[x[:,1].astype(float)>y]
print(predict_bestseller(books,5.0))
[['流畅的python' '5.9']
['python从入门到项目实践' '5.2']
['从1到无穷大' '5.1']]
**Summary:** the lambda creates a function of two parameters: x, the array of books, and y, the minimum rating.
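For comparison, the same filter written as a named function is sketched below. The behavior is the same as the lambda; the longer form leaves room for a docstring and an intermediate variable. The book titles here are placeholders invented for the example.

```python
import numpy as np

def predict_bestseller(books, min_rating):
    """Return the rows of books whose rating (column 1) exceeds min_rating."""
    ratings = books[:, 1].astype(float)    # ratings are stored as strings
    return books[ratings > min_rating]

# Placeholder data: row = [title, rating]
books = np.array([['Book A', 4.9],
                  ['Book B', 5.9],
                  ['Book C', 5.2]])
hits = predict_bestseller(books, 5.0)
print(hits)
```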
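Advanced array filters
# Outlier detection: an observation is defined as an outlier if it deviates from the mean by more than one standard deviation
Mean and standard deviation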
import numpy as np
import matplotlib.pyplot as plt
# np.random.normal(mean, deviation, shape) draws random samples from a normal distribution with the given mean and standard deviation and returns them as a NumPy array
sequence = np.random.normal(10.0,1.0,500)
sequence
array([11.60273162, 6.65293584, 10.12605379, 11.85986644, 8.63110802,
10.06998214, 10.45973071, 9.98725049, 9.09705239, 10.13015269,
9.85089427, 9.99808394, 8.78441828, 9.63904572, 10.85481516,
7.35075106, 10.94845724, 11.0869893 , 9.28596955, 8.97815481,
9.26186245, 9.53746784, 10.87090498, 9.12047918, 9.02657023,
10.0868035 , 9.20006406, 8.73189255, 9.13076184, 11.10047448,
10.43225348, 9.58469227, 10.40723005, 10.55369215, 9.87143636,
10.29960485, 10.25535646, 9.79499688, 9.46107401, 7.7014166 ,
9.68435112, 9.4686142 , 8.46912804, 9.35320587, 12.15193188,
10.50262112, 9.38940117, 9.47157601, 10.25920525, 12.05583665,
11.43411283, 9.13550521, 9.93624269, 8.63198423, 9.29192518,
8.36295787, 9.97559304, 11.97734564, 9.6559952 , 10.90758338,
9.03529344, 6.91678558, 9.23626327, 11.17717063, 10.55490176,
12.24268514, 10.10515331, 8.20452911, 9.6762657 , 9.56359262,
13.02888554, 10.06103927, 10.71648254, 9.29524475, 9.93546077,
10.8254731 , 9.77770707, 10.11214372, 11.11175554, 9.30043339,
10.7485596 , 10.97295057, 10.66158835, 7.892232 , 9.93887982,
10.7188748 , 11.62641763, 9.22422878, 11.96244947, 11.37277716,
12.02332692, 8.90228317, 9.43515463, 10.70487583, 10.23028306,
9.66090627, 10.94709562, 8.6961016 , 10.41037259, 10.17828131,
11.80130789, 9.31310925, 10.09872767, 8.20013718, 10.92970956,
10.2495051 , 10.36134533, 8.91298481, 10.02369332, 10.68721058,
9.00054432, 10.32584858, 8.75465787, 12.21659052, 9.95105381,
10.28374084, 9.09521556, 11.75475578, 9.75949037, 9.89331747,
9.23478178, 7.90975969, 9.4273364 , 8.9090529 , 11.10110534,
10.36954102, 10.19726741, 10.28052735, 10.30538537, 8.70120177,
9.3671505 , 10.11196977, 8.36794355, 10.32525939, 9.65441533,
9.51908772, 10.8800997 , 10.09716578, 10.91563748, 10.72492996,
11.10679298, 10.02064013, 11.07823158, 9.5317347 , 10.2914028 ,
10.01976486, 9.72845379, 10.65084245, 12.41439712, 9.69910187,
9.75108477, 9.9845896 , 9.81770095, 11.29157327, 10.15456955,
10.60837797, 9.45917681, 10.24010858, 10.5761626 , 9.55445776,
9.57869162, 10.80719746, 11.5905502 , 9.56478353, 9.65956002,
11.30053638, 10.59873521, 10.28842162, 8.54243158, 10.33120558,
9.59322875, 11.58458479, 10.09302003, 11.15638722, 12.23678871,
8.15472985, 10.42502666, 10.04885823, 10.81404769, 8.15788842,
11.68225804, 9.36783949, 9.49919482, 10.52601385, 11.89667602,
9.67034409, 10.36543152, 7.83546884, 10.63937759, 11.47507461,
10.13818656, 9.51148501, 11.52854464, 11.26931747, 10.99338663,
9.50413836, 7.24050328, 9.6154753 , 9.63745882, 10.94203385,
8.86364438, 8.27484969, 11.29900913, 10.63114097, 10.67904167,
8.6837591 , 8.88040556, 10.58496528, 9.33096014, 8.52821296,
9.5348672 , 10.75886848, 9.51366472, 11.28135789, 9.12082393,
9.97388793, 9.82316507, 9.88920019, 10.24871057, 8.75774533,
8.33304482, 8.65544812, 10.00809262, 10.62840715, 10.11816525,
9.9628467 , 10.04342218, 8.48637003, 9.33254844, 9.76771249,
8.38893789, 11.05047808, 9.67126876, 9.83964206, 9.17303963,
10.57315043, 10.38521662, 10.84684624, 7.85330557, 10.20538821,
10.81687365, 8.64936151, 10.12903228, 10.56758503, 9.14424382,
9.64866383, 10.9616145 , 9.98213592, 10.92951974, 7.47230314,
10.63895034, 11.0604198 , 9.72761195, 10.60446029, 10.43152824,
9.00839484, 9.83700604, 12.45059843, 10.43414501, 12.34487213,
10.75545494, 10.27786507, 12.55689347, 10.34912244, 9.29060352,
10.72588034, 9.94346514, 10.54777849, 11.72420947, 9.8708743 ,
9.22212126, 9.68541625, 9.59774448, 9.11221574, 9.91278983,
10.4820126 , 8.25422937, 10.53147771, 10.04705301, 9.05978545,
10.49055762, 12.42477809, 12.07271904, 7.61849858, 8.3178447 ,
10.55941704, 10.38182936, 11.0665193 , 11.28441137, 9.66078923,
9.38680616, 10.38885545, 8.23828454, 10.13555809, 9.30452756,
9.99692358, 10.46199192, 9.77339638, 11.3772616 , 9.1032097 ,
9.66978 , 10.89886416, 11.7536681 , 9.59221274, 10.73252456,
7.0888786 , 9.45314876, 8.86301785, 8.75563987, 8.3921786 ,
11.4712737 , 8.9499308 , 11.45150041, 10.24883007, 9.92543423,
10.30065446, 8.84153306, 10.78120675, 11.97610638, 8.18525749,
11.66789819, 10.33936961, 10.54462992, 10.79140861, 9.28616639,
9.99606121, 9.37771233, 7.69844409, 10.50744526, 10.05817203,
9.90328269, 9.21993504, 8.49566261, 8.59472478, 8.88297634,
9.05656056, 11.1155842 , 8.71964028, 10.3287043 , 11.03014754,
10.74346088, 9.6133147 , 8.94829591, 11.10567045, 11.53995834,
10.79425045, 8.29533343, 11.73033694, 9.05390855, 9.96661056,
9.65544677, 9.28845565, 9.73081175, 10.79029066, 8.71006262,
10.21652827, 11.55566433, 8.40294093, 8.81970601, 9.32839453,
9.72528028, 10.49817569, 11.32272509, 11.40344254, 10.15429825,
8.77501545, 10.17652889, 7.49672875, 11.23734041, 10.35285293,
10.83609291, 9.39994134, 9.79438055, 11.60539785, 11.1956015 ,
10.0348099 , 8.64140396, 9.35232112, 10.33737302, 11.2767812 ,
10.35535787, 9.22550179, 9.53862746, 11.1766378 , 9.59710827,
8.68138898, 11.14109352, 9.6750869 , 11.29629946, 7.92121702,
8.93229853, 9.46711141, 10.53834349, 9.12461323, 10.48423828,
8.16854633, 9.5213865 , 12.62846863, 9.65370518, 8.66493306,
10.26409626, 8.80259482, 10.72081245, 11.77263016, 9.75088966,
8.99110743, 9.88220114, 10.40220736, 9.00647373, 9.71239656,
9.49302085, 8.98309053, 9.66597825, 10.27313298, 9.73224669,
10.98600021, 8.42882756, 10.54587761, 9.3863515 , 10.63129959,
9.54279797, 11.91960088, 10.95716162, 10.34511289, 9.00471249,
9.64651796, 10.07223761, 9.47935434, 9.61415037, 8.82044289,
9.25640485, 9.17837895, 10.72889737, 11.68353699, 10.29589292,
11.35906728, 10.92925076, 8.72458937, 12.56023535, 9.72806574,
10.6106498 , 10.84630502, 11.29221825, 10.45026532, 9.40195773,
10.56191066, 9.25029589, 8.81880033, 8.10918461, 9.64208128,
9.8698753 , 9.03411736, 11.51661176, 9.94319404, 10.72636352,
10.42417327, 10.15396807, 10.87273094, 7.84001872, 10.71706689,
10.25032915, 9.8371806 , 10.9336922 , 9.52777884, 11.45964879,
8.97052672, 10.73160785, 9.60847006, 11.13603076, 11.35955011,
9.85793642, 9.61844113, 9.54059302, 10.27081728, 9.38568048,
9.94275759, 11.35352371, 7.89166765, 9.30537635, 9.46898646,
10.89801433, 10.0963657 , 9.52512553, 9.67269072, 9.46775335,
9.4408703 , 9.88111873, 9.00232895, 9.74406787, 12.19007662,
10.01493886, 10.61546349, 10.52380967, 9.05287545, 9.72946157,
8.28731793, 9.2750119 , 10.76406343, 10.15305887, 9.77640742])
plt.xkcd() # plot style
plt.hist(sequence) # draw the histogram
plt.annotate(r"$\omega_1=9$",(9,70))
plt.annotate(r"$\omega_2=11$",(11,70))
plt.annotate(r"$\mu=10$",(10,90))
plt.savefig("plot.jpg")
plt.show()
[Figure: histogram of sequence, annotated at 9, 10, and 11 (output_144_0.png)]
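For normally distributed data, roughly 68% of the values fall within one standard deviation of the mean, so the one-standard-deviation rule used in this section flags about a third of the points. The sketch below checks this on a fresh sample; the generator and its seed are assumptions made for reproducibility and are not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)       # arbitrary seed, for reproducibility only
sample = rng.normal(10.0, 1.0, 500)

mean, std = sample.mean(), sample.std()
within = np.abs(sample - mean) <= std     # True for values inside one standard deviation
fraction = within.mean()                  # the mean of a Boolean array is the share of True
print(mean, std, fraction)
```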
Absolute value
a=np.array([-1,2,-3,4,-2])
np.abs(a)
array([1, 2, 3, 4, 2])
Logical AND
a=np.array([False,True,False,True])
b=np.array([True,True,False,True])
np.logical_and(a,b)# multiplying the two Boolean arrays gives the same result
array([False, True, False, True])
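For Boolean arrays the `&` operator is a third spelling of the same operation. A short sketch follows; note that comparisons need parentheses when combined with `&`, because `&` binds more tightly than `>` and `<`. The `values` array is an example added here.

```python
import numpy as np

a = np.array([False, True, False, True])
b = np.array([True, True, False, True])

both = a & b                               # same result as np.logical_and(a, b)
print(both)

values = np.array([3, 8, 12, 5])
in_range = (values > 4) & (values < 10)    # parentheses are required here
print(in_range)
```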
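Finding the days on which every statistic deviates from its mean by more than one standard deviation
# Website analytics data (each row is one day; the columns are active users, bounce rate, average session duration)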
a = np.array([[815,70,115],
[767,80,50],
[554,88,70],
[1008,65,128]])
mean,stdev=np.mean(a,axis=0),np.std(a,axis=0)
out = ((np.abs(a[:,0]-mean[0])>stdev[0])
*(np.abs(a[:,1]-mean[1])>stdev[1])
*(np.abs(a[:,2]-mean[2])>stdev[2]))
out
array([False, False, False, True])
a[out]
array([[1008, 65, 128]])
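The three per-column comparisons can be collapsed into one expression by letting broadcasting handle the columns and reducing with `np.all` along each row. This is a sketch of an equivalent formulation, using the same data:

```python
import numpy as np

a = np.array([[815, 70, 115],
              [767, 80, 50],
              [554, 88, 70],
              [1008, 65, 128]])

# Per-column mean and std have shape (3,) and broadcast against each row
deviation = np.abs(a - a.mean(axis=0))
# A day is flagged only if all three of its statistics are outliers
out = np.all(deviation > a.std(axis=0), axis=1)
print(out)
print(a[out])
```

Replacing `np.all` with `np.any` would flag days on which at least one statistic is an outlier, which is a looser definition than the one used above.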
Simple association analysis
# Data: each row is one customer's shopping basket
# row = [course1, course2, ebook1, ebook2]
# a value of 1 means the item was purchased
basket = np.array([[0,1,1,0],
[0,0,0,1],
[1,1,0,0],
[0,1,1,1],
[1,1,1,0],
[0,1,1,0],
[1,1,0,1],
[1,1,1,1]])
# One-liner
res=np.sum(np.all(basket[:,2:],axis=1))/basket.shape[0]
res
0.25
basket[:,2:]
array([[1, 0],
[0, 1],
[0, 0],
[1, 1],
[1, 0],
[1, 0],
[0, 1],
[1, 1]])
np.all(basket[:,2:],axis=1)
array([False, False, False, True, False, False, False, True])
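# The first value of res is the fraction of customers who bought both ebooks; next, count the joint purchases for every pair of products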
res=[(i,j,np.sum(basket[:,i] + basket[:,j] == 2))
for i in range(4) for j in range(i+1,4)]
res
[(0, 1, 4), (0, 2, 2), (0, 3, 2), (1, 2, 5), (1, 3, 3), (2, 3, 2)]
max(res,key=lambda x:x[2])
(1, 2, 5)
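The pairwise counts can also be obtained in one step with a matrix product: for a 0/1 basket matrix, entry (i, j) of `basket.T @ basket` is the number of customers who bought both product i and product j. A sketch under that observation, with `co` as an illustrative name:

```python
import numpy as np

basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

co = basket.T @ basket            # 4x4 co-purchase counts; the diagonal holds per-product totals
print(co)

# Keep only pairs i < j (upper triangle without the diagonal), then locate the maximum
pairs = np.triu(co, k=1)
i, j = np.unravel_index(np.argmax(pairs), pairs.shape)
print(i, j, co[i, j])
```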