I've been interning recently, working entirely in Python, so I'm writing down some very handy functions I ran into in real business scenarios — along with the big pitfalls! (Debugging them almost brought me to tears...)
How to tell whether one row of a multi-dimensional array is greater than all the other rows?
Data
```python
import numpy as np

centroid = np.array([[0.93817355, 0.72219008, 0.87606612],
                     [0.94956682, 0.66898157, 0.60680184],
                     [0.6135,     0.6666,     0.72787]])
centroid
```
```
array([[0.93817355, 0.72219008, 0.87606612],
       [0.94956682, 0.66898157, 0.60680184],
       [0.6135    , 0.6666    , 0.72787   ]])
```
Requirement
Suppose the array above holds the cluster centers from a clustering run (three clusters). How do we judge whether the clustering is valid? Our rule: if one cluster center is greater than every other center's corresponding value in every dimension, then the clustering is valid!

So here's the core question: how do we check that? If it were just two-dimensional, a direct comparison with an `if` would do — but a 3 × 3 array is not so easy!

No fear! numpy has a very handy function for exactly this: `np.where`!
```python
c1 = np.where(centroid == np.max(centroid, axis=0))  # positions of each column's maximum
print(c1)
c2 = c1[0]  # the row indices of those column maxima
print(c2)
if c2[0] == c2[1] == c2[2]:
    print('This clustering is valid!!! Applause! Confetti!')
else:
    print('This clustering is invalid!!!')
```
```
(array([0, 0, 1]), array([1, 2, 0]))
[0 0 1]
This clustering is invalid!!!
```
```python
centroid2 = np.array([[0.94020979, 0.69494406, 0.81282517],
                      [0.60467857, 0.65509821, 0.69383036],
                      [0.93487778, 0.61891111, 0.59659444]])
centroid2
```
```
array([[0.94020979, 0.69494406, 0.81282517],
       [0.60467857, 0.65509821, 0.69383036],
       [0.93487778, 0.61891111, 0.59659444]])
```
```python
c1 = np.where(centroid2 == np.max(centroid2, axis=0))  # which row holds each column's maximum
print(c1)
c2 = c1[0]
print(c2)  # all three column maxima sit in row 0!
if c2[0] == c2[1] == c2[2]:
    print('This clustering is valid!!! Applause! Confetti!')
else:
    print('This clustering is invalid!!!')
```
```
(array([0, 0, 0]), array([0, 1, 2]))
[0 0 0]
This clustering is valid!!! Applause! Confetti!
```
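As a side note, the same check can be written a bit more directly with `np.argmax`, which also generalizes beyond three clusters — a small sketch of the idea (the function name `clustering_valid` is my own, not from the original code):

```python
import numpy as np

# For each column, argmax along axis=0 gives the row holding that column's
# maximum. The clustering is "valid" exactly when every column's maximum
# sits in the same row -- regardless of how many clusters there are.
def clustering_valid(centroid):
    rows = np.argmax(centroid, axis=0)   # row index of each column's max
    return bool(np.all(rows == rows[0])) # valid iff one row dominates all columns

centroid = np.array([[0.93817355, 0.72219008, 0.87606612],
                     [0.94956682, 0.66898157, 0.60680184],
                     [0.6135,     0.6666,     0.72787]])
centroid2 = np.array([[0.94020979, 0.69494406, 0.81282517],
                      [0.60467857, 0.65509821, 0.69383036],
                      [0.93487778, 0.61891111, 0.59659444]])
print(clustering_valid(centroid))   # False
print(clustering_valid(centroid2))  # True
```

Unlike the hard-coded `c2[0] == c2[1] == c2[2]` comparison, this works for any number of columns.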
Advantages of Python over SQL

- SQL cannot filter on a multi-dimensional array
- Python handles multi-dimensional filtering with no trouble at all! See the example below
Big pitfalls!

- When reading data with pandas on Windows, do not put Chinese characters in the file name — otherwise you get OSError: Initializing from file failed
- When processing strings, if your logic and code all look fine but you still get errors or results that don't match your expectations, check for **whitespace!!! Whitespace!!!** A groupby of mine once came up short for exactly this reason: stray spaces turned visually identical categories into different ones, so they weren't grouped together!
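That groupby pitfall is easy to reproduce in a few lines — a toy sketch with made-up data (not the real dataset):

```python
import pandas as pd

# 'a', 'a ' and ' a' look identical when printed in a table,
# but groupby treats them as three different categories.
df = pd.DataFrame({'cat': ['a', 'a ', ' a'], 'val': [1, 2, 3]})
print(df.groupby('cat')['val'].sum().to_dict())  # three separate groups!

# Stripping whitespace first collapses them back into one category:
clean = df.assign(cat=df['cat'].str.strip())
print(clean.groupby('cat')['val'].sum().to_dict())  # {'a': 6}
```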
```python
def Filid(df, label):
    s1 = str(label)
    # normalize: single -> double quotes, full-width -> half-width commas, drop spaces
    s2 = s1.replace("'", '"').replace(",", ",").replace(" ", "")
    df_1 = df[df['interests_array'] == s2]
    df_1.to_csv('./data/[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
                index=False, encoding='gbk')
```
- Why define this function at all? What's the backstory? Let's walk through it!
Loading the data
```python
import pandas as pd

df_test = pd.read_csv('../example_data.csv', encoding='gbk')
print(df_test.shape)
df_test.head()
```
```
(774, 2)
```
| | interests_array | device_uuid |
|---|---|---|
| 0 | ["高考","两性","互联网","美女"] | CQk5NDZiMjM1NGVmMjQyZTAyCTNIWDAyMTcxMTcwMDI4ND... |
| 1 | ["高考","两性","养生","医疗"] | a0000059cb1d20 |
| 2 | ["高考","两性","医疗","房产"] | CQk5NzFlYWI2ZWZhNzRhYTAJM0hYMDIxNzgxMDAwNjIxMA... |
| 3 | ["高考","两性","古代史"] | 7D665A51-DD3D-4231-BE82-2A21D5A31D60 |
| 4 | ["高考","两性","教育","育儿"] | 4EB70A7C-3DA9-4B76-9710-FE1C50552E7B |
```python
df_test_list = df_test['interests_array'].value_counts()[:5].index.tolist()
df_test_list
```
```
['["高考","高校"]',
 '["高考","中小学教育"]',
 '["高考","国际足球"]',
 '["高考","购房"]',
 '["高考","养生"]']
```
Requirement

- Now we want to pull out all the ids labeled ["高考","高校"]
```python
# Easily done!
df_test[df_test['interests_array'] == '["高考","高校"]'].head()
```
| | interests_array | device_uuid |
|---|---|---|
| 554 | ["高考","高校"] | 014C4B39-8B4F-4E24-87BD-C8006A8CF989 |
| 555 | ["高考","高校"] | 1E62BEB3-1F23-459B-ACA1-ACEA6BD9E352 |
| 556 | ["高考","高校"] | 28BBAE46-8EEC-4B22-9C36-8C1CBC4F9A47 |
| 557 | ["高考","高校"] | 293CB176-C76C-4028-AA0B-2053E6AFBF25 |
| 558 | ["高考","高校"] | 2CB42104-BCB4-41E0-BC28-E201FE4671ED |
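When several labels need filtering at once, `Series.isin` can do it in a single pass instead of one `==` comparison per label — a sketch with a hypothetical toy frame of the same shape as `df_test` (not the real data):

```python
import pandas as pd

# Toy frame with the same columns as df_test:
df = pd.DataFrame({
    'interests_array': ['["高考","高校"]', '["高考","养生"]', '["高考","购房"]'],
    'device_uuid': ['id1', 'id2', 'id3'],
})

# isin matches any of the listed labels in one vectorized pass:
wanted = ['["高考","高校"]', '["高考","养生"]']
subset = df[df['interests_array'].isin(wanted)]
print(subset['device_uuid'].tolist())  # ['id1', 'id2']
```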
New requirement

Now we want to do this in bulk (there are many labels to handle): wrap it in a function and write each result out as a csv file, so the strategy can go live
```python
def Filid(df, label):
    df_1 = df[df['interests_array'] == label]
    print(df_1.shape)
    df_1.to_csv('[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
                index=False, encoding='gbk')
```
- Suppose the labels we need to handle are:
```python
df_test_list
```
```
['["高考","高校"]',
 '["高考","中小学教育"]',
 '["高考","国际足球"]',
 '["高考","购房"]',
 '["高考","养生"]']
```
```python
df_test_list[4]
```
```
'["高考","养生"]'
```
```python
for i in range(len(df_test_list)):
    Filid(df_test, df_test_list[i])
```
```
(65, 2)
(13, 2)
(10, 2)
(9, 2)
(9, 2)
```
A screenshot of the output files is shown below (image not reproduced here):

Huh — why only 4 files instead of 5? Because `label` here is a string, `label[0]` and `label[1]` are just its first two characters, `[` and `"`, so the file names differ only by row count — and two of the outputs are both 9 × 2, so one overwrote the other! Clearly a bug: the function needs a saner naming scheme!
```python
def Filid2(df, label):
    df_1 = df[df['interests_array'] == label]
    print(df_1.shape)
    # eval turns the label string back into a list, so [0]/[1] index real words
    df_1.to_csv('[%d]_%s_%s.csv' % (df_1.shape[0], eval(label)[0], eval(label)[1]),
                index=False, encoding='gbk')
```
```python
for i in range(len(df_test_list)):
    Filid2(df_test, df_test_list[i])
```
```
(65, 2)
(13, 2)
(10, 2)
(9, 2)
(9, 2)
```
Problem solved!
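One caveat worth noting: `eval` will execute arbitrary code, so for parsing these JSON-style label strings, the standard library's `ast.literal_eval` is a safer drop-in (my suggestion, not part of the original code):

```python
import ast

# literal_eval only accepts Python literals (lists, strings, numbers, ...),
# so unlike eval it cannot run arbitrary code hidden in the input string.
label = '["高考","养生"]'
parts = ast.literal_eval(label)
print(parts)      # ['高考', '养生']
print(parts[0])   # 高考
```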
So where's the pitfall?

You might be thinking: wasn't all of that pretty easy? Where's the pitfall? Patience!

Here it is: that afternoon, the data my leader handed me was not the frequency-counted label list above — it came out of Excel, like this:
```python
all_list = [["高考","高校"],
            ["高考","中小学教育"],
            ["高考","购房"],
            ["高考","国际足球"],
            ["高考","养生"]]
all_list
```
```
[['高考', '高校'], ['高考', '中小学教育'], ['高考', '购房'], ['高考', '国际足球'], ['高考', '养生']]
```
```python
all_list[0]
```
```
['高考', '高校']
```
```python
print(all_list[0])
Filid(df_test, all_list[0])
```
```
['高考', '高校']
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-fd5af72927ba> in <module>()
      1 print(all_list[0])
----> 2 Filid(df_test, all_list[0])

<ipython-input-34-4bc0ca4b91e4> in Filid(df, label)
      1 def Filid(df, label):
----> 2     df_1 = df[df['interests_array'] == label]
      3     print(df_1.shape)
      4     df_1.to_csv('[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
      5                 index=False, encoding='gbk')

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1281
   1282         with np.errstate(all='ignore'):
-> 1283             res = na_op(values, other)
   1284         if is_scalar(res):
   1285             raise TypeError('Could not compare {typ} type with Series'

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1141
   1142     elif is_object_dtype(x.dtype):
-> 1143         result = _comp_method_OBJECT_ARRAY(op, x, y)
   1144
   1145     elif is_datetimelike_v_numeric(x, y):

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in _comp_method_OBJECT_ARRAY(op, x, y)
   1118             y = y.values
   1119
-> 1120         result = libops.vec_compare(x, y, op)
   1121     else:
   1122         result = libops.scalar_compare(x, y, op)

pandas/_libs/ops.pyx in pandas._libs.ops.vec_compare()

ValueError: Arrays were different lengths: 774 vs 2
```
See? It errors out! The main reason:

- Passed in directly, all_list[0] is ['高考', '高校'] — a Python list — not the string '["高考","高校"]' that matches the original dataframe
- So what do we do?
- First replace the single quotes with double quotes; then there's another trap: the commas have to match too — full-width (Chinese input mode) and half-width (English input mode) commas are different characters! And finally, the ultimate trap: remember to strip the spaces as well!!! The spaces!!! Huge thanks to my leader Yang, who spotted the problem instantly. Legend!
```python
def Filid_plus(df, label):
    s1 = str(label)  # e.g. "['高考', '高校']"
    # single -> double quotes, full-width -> half-width commas, drop spaces
    s2 = s1.replace("'", '"').replace(",", ",").replace(" ", "")
    df_1 = df[df['interests_array'] == s2]
    df_1.to_csv('./data/[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
                index=False, encoding='gbk')

for i in range(len(all_list)):
    Filid_plus(df_test, all_list[i])
```
All good now! Nicely done!!!

That's it for today's share! If you want to try this yourself, download the dataset below. Good night!
Data
- example_data: https://pan.baidu.com/s/1jApSNlAnYQaaA1cK2Dq3MQ