I've been interning recently, working entirely in Python, so I'm writing down some very handy functions I ran into in real business scenarios — along with the big pitfalls! (Debugging them almost brought me to tears...)
How to tell whether one row of a multi-dimensional array is greater than all the other rows?
Data
```python
import numpy as np

centroid = np.array([[0.93817355, 0.72219008, 0.87606612],
                     [0.94956682, 0.66898157, 0.60680184],
                     [0.6135,     0.6666,     0.72787]])
centroid
```
```
array([[0.93817355, 0.72219008, 0.87606612],
       [0.94956682, 0.66898157, 0.60680184],
       [0.6135    , 0.6666    , 0.72787   ]])
```
Requirement
Suppose the array above holds the cluster centers from a clustering run (three clusters). How do we judge whether the clustering is valid? Our rule: if one cluster center is greater than every other center's corresponding value in every dimension, then the clustering is valid!

So here's the core question: how do we check that? If it were just two-dimensional, a direct comparison with an `if` would do — but a 3 × 3 array is not so easy!

No fear! numpy has a very handy function for exactly this: `np.where`!
```python
c1 = np.where(centroid == np.max(centroid, axis=0))  # positions of each column's maximum
print(c1)
c2 = c1[0]  # the row indices of those column maxima
print(c2)
if c2[0] == c2[1] == c2[2]:
    print('This clustering is valid!!! Applause! Confetti!')
else:
    print('This clustering is invalid!!!')
```
```
(array([0, 0, 1]), array([1, 2, 0]))
[0 0 1]
This clustering is invalid!!!
```
```python
centroid2 = np.array([[0.94020979, 0.69494406, 0.81282517],
                      [0.60467857, 0.65509821, 0.69383036],
                      [0.93487778, 0.61891111, 0.59659444]])
centroid2
```
```
array([[0.94020979, 0.69494406, 0.81282517],
       [0.60467857, 0.65509821, 0.69383036],
       [0.93487778, 0.61891111, 0.59659444]])
```
```python
c1 = np.where(centroid2 == np.max(centroid2, axis=0))  # which row holds each column's maximum
print(c1)
c2 = c1[0]
print(c2)  # all three column maxima sit in row 0!
if c2[0] == c2[1] == c2[2]:
    print('This clustering is valid!!! Applause! Confetti!')
else:
    print('This clustering is invalid!!!')
```
```
(array([0, 0, 0]), array([0, 1, 2]))
[0 0 0]
This clustering is valid!!! Applause! Confetti!
```
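As a side note, the same check can be written a bit more directly with `np.argmax`, which also generalizes beyond three clusters — a small sketch of the idea (the function name `clustering_valid` is my own, not from the original code):

```python
import numpy as np

# For each column, argmax along axis=0 gives the row holding that column's
# maximum. The clustering is "valid" exactly when every column's maximum
# sits in the same row -- regardless of how many clusters there are.
def clustering_valid(centroid):
    rows = np.argmax(centroid, axis=0)   # row index of each column's max
    return bool(np.all(rows == rows[0])) # valid iff one row dominates all columns

centroid = np.array([[0.93817355, 0.72219008, 0.87606612],
                     [0.94956682, 0.66898157, 0.60680184],
                     [0.6135,     0.6666,     0.72787]])
centroid2 = np.array([[0.94020979, 0.69494406, 0.81282517],
                      [0.60467857, 0.65509821, 0.69383036],
                      [0.93487778, 0.61891111, 0.59659444]])
print(clustering_valid(centroid))   # False
print(clustering_valid(centroid2))  # True
```

Unlike the hard-coded `c2[0] == c2[1] == c2[2]` comparison, this works for any number of columns.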
Advantages of Python over SQL

- SQL cannot filter on a multi-dimensional array
- Python handles multi-dimensional filtering with no trouble at all! See the example below
Big pitfalls!

- When reading data with pandas on Windows, do not put Chinese characters in the file name — otherwise you get OSError: Initializing from file failed
- When processing strings, if your logic and code all look fine but you still get errors or results that don't match your expectations, check for **whitespace!!! Whitespace!!!** A groupby of mine once came up short for exactly this reason: stray spaces turned visually identical categories into different ones, so they weren't grouped together!
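That groupby pitfall is easy to reproduce in a few lines — a toy sketch with made-up data (not the real dataset):

```python
import pandas as pd

# 'a', 'a ' and ' a' look identical when printed in a table,
# but groupby treats them as three different categories.
df = pd.DataFrame({'cat': ['a', 'a ', ' a'], 'val': [1, 2, 3]})
print(df.groupby('cat')['val'].sum().to_dict())  # three separate groups!

# Stripping whitespace first collapses them back into one category:
clean = df.assign(cat=df['cat'].str.strip())
print(clean.groupby('cat')['val'].sum().to_dict())  # {'a': 6}
```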
```python
def Filid(df, label):
    s1 = str(label)
    # normalize: single -> double quotes, full-width -> half-width commas, drop spaces
    s2 = s1.replace("'", '"').replace(",", ",").replace(" ", "")
    df_1 = df[df['interests_array'] == s2]
    df_1.to_csv('./data/[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
                index=False, encoding='gbk')
```
- Why define this function at all? What's the backstory? Let's walk through it!
Loading the data
```python
import pandas as pd

df_test = pd.read_csv('../example_data.csv', encoding='gbk')
print(df_test.shape)
df_test.head()
```
```
(774, 2)
```
| | interests_array | device_uuid |
|---|---|---|
| 0 | ["高考","两性","互联网","美女"] | CQk5NDZiMjM1NGVmMjQyZTAyCTNIWDAyMTcxMTcwMDI4ND... |
| 1 | ["高考","两性","养生","医疗"] | a0000059cb1d20 |
| 2 | ["高考","两性","医疗","房产"] | CQk5NzFlYWI2ZWZhNzRhYTAJM0hYMDIxNzgxMDAwNjIxMA... |
| 3 | ["高考","两性","古代史"] | 7D665A51-DD3D-4231-BE82-2A21D5A31D60 |
| 4 | ["高考","两性","教育","育儿"] | 4EB70A7C-3DA9-4B76-9710-FE1C50552E7B |
```python
df_test_list = df_test['interests_array'].value_counts()[:5].index.tolist()
df_test_list
```
```
['["高考","高校"]',
 '["高考","中小学教育"]',
 '["高考","国际足球"]',
 '["高考","购房"]',
 '["高考","养生"]']
```
Requirement

- Now we want to pull out all the ids labeled ["高考","高校"]
```python
# Easily done!
df_test[df_test['interests_array'] == '["高考","高校"]'].head()
```
| | interests_array | device_uuid |
|---|---|---|
| 554 | ["高考","高校"] | 014C4B39-8B4F-4E24-87BD-C8006A8CF989 |
| 555 | ["高考","高校"] | 1E62BEB3-1F23-459B-ACA1-ACEA6BD9E352 |
| 556 | ["高考","高校"] | 28BBAE46-8EEC-4B22-9C36-8C1CBC4F9A47 |
| 557 | ["高考","高校"] | 293CB176-C76C-4028-AA0B-2053E6AFBF25 |
| 558 | ["高考","高校"] | 2CB42104-BCB4-41E0-BC28-E201FE4671ED |
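When several labels need filtering at once, `Series.isin` can do it in a single pass instead of one `==` comparison per label — a sketch with a hypothetical toy frame of the same shape as `df_test` (not the real data):

```python
import pandas as pd

# Toy frame with the same columns as df_test:
df = pd.DataFrame({
    'interests_array': ['["高考","高校"]', '["高考","养生"]', '["高考","购房"]'],
    'device_uuid': ['id1', 'id2', 'id3'],
})

# isin matches any of the listed labels in one vectorized pass:
wanted = ['["高考","高校"]', '["高考","养生"]']
subset = df[df['interests_array'].isin(wanted)]
print(subset['device_uuid'].tolist())  # ['id1', 'id2']
```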
New requirement

Now we want to do this in bulk (there are many labels to handle): wrap it in a function and write each result out as a csv file, so the strategy can go live
```python
def Filid(df, label):
    df_1 = df[df['interests_array'] == label]
    print(df_1.shape)
    df_1.to_csv('[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
                index=False, encoding='gbk')
```
- Suppose the labels we need to handle are:
```python
df_test_list
```
```
['["高考","高校"]',
 '["高考","中小学教育"]',
 '["高考","国际足球"]',
 '["高考","购房"]',
 '["高考","养生"]']
```
```python
df_test_list[4]
```
```
'["高考","养生"]'
```
```python
for i in range(len(df_test_list)):
    Filid(df_test, df_test_list[i])
```
```
(65, 2)
(13, 2)
(10, 2)
(9, 2)
(9, 2)
```
A screenshot of the output files is shown below (image not reproduced here):

Huh — why only 4 files instead of 5? Because `label` here is a string, `label[0]` and `label[1]` are just its first two characters, `[` and `"`, so the file names differ only by row count — and two of the outputs are both 9 × 2, so one overwrote the other! Clearly a bug: the function needs a saner naming scheme!
```python
def Filid2(df, label):
    df_1 = df[df['interests_array'] == label]
    print(df_1.shape)
    # eval turns the label string back into a list, so [0]/[1] index real words
    df_1.to_csv('[%d]_%s_%s.csv' % (df_1.shape[0], eval(label)[0], eval(label)[1]),
                index=False, encoding='gbk')
```
```python
for i in range(len(df_test_list)):
    Filid2(df_test, df_test_list[i])
```
```
(65, 2)
(13, 2)
(10, 2)
(9, 2)
(9, 2)
```
Problem solved!
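One caveat worth noting: `eval` will execute arbitrary code, so for parsing these JSON-style label strings, the standard library's `ast.literal_eval` is a safer drop-in (my suggestion, not part of the original code):

```python
import ast

# literal_eval only accepts Python literals (lists, strings, numbers, ...),
# so unlike eval it cannot run arbitrary code hidden in the input string.
label = '["高考","养生"]'
parts = ast.literal_eval(label)
print(parts)      # ['高考', '养生']
print(parts[0])   # 高考
```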
So where's the pitfall?

You might be thinking: wasn't all of that pretty easy? Where's the pitfall? Patience!

Here it is: that afternoon, the data my leader handed me was not the frequency-counted label list above — it came out of Excel, like this:
```python
all_list = [["高考","高校"],
            ["高考","中小学教育"],
            ["高考","购房"],
            ["高考","国际足球"],
            ["高考","养生"]]
all_list
```
```
[['高考', '高校'], ['高考', '中小学教育'], ['高考', '购房'], ['高考', '国际足球'], ['高考', '养生']]
```
```python
all_list[0]
```
```
['高考', '高校']
```
```python
print(all_list[0])
Filid(df_test, all_list[0])
```
```
['高考', '高校']
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-fd5af72927ba> in <module>()
      1 print(all_list[0])
----> 2 Filid(df_test, all_list[0])

<ipython-input-34-4bc0ca4b91e4> in Filid(df, label)
      1 def Filid(df, label):
----> 2     df_1 = df[df['interests_array'] == label]
      3     print(df_1.shape)
      4     df_1.to_csv('[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
      5                 index=False, encoding='gbk')

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1281
   1282         with np.errstate(all='ignore'):
-> 1283             res = na_op(values, other)
   1284         if is_scalar(res):
   1285             raise TypeError('Could not compare {typ} type with Series'

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1141
   1142     elif is_object_dtype(x.dtype):
-> 1143         result = _comp_method_OBJECT_ARRAY(op, x, y)
   1144
   1145     elif is_datetimelike_v_numeric(x, y):

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in _comp_method_OBJECT_ARRAY(op, x, y)
   1118             y = y.values
   1119
-> 1120         result = libops.vec_compare(x, y, op)
   1121     else:
   1122         result = libops.scalar_compare(x, y, op)

pandas/_libs/ops.pyx in pandas._libs.ops.vec_compare()

ValueError: Arrays were different lengths: 774 vs 2
```
See? It errors out! The main reason:

- Passed in directly, all_list[0] is ['高考', '高校'] — a Python list — not the string '["高考","高校"]' that matches the original dataframe
- So what do we do?
- First replace the single quotes with double quotes; then there's another trap: the commas have to match too — full-width (Chinese input mode) and half-width (English input mode) commas are different characters! And finally, the ultimate trap: remember to strip the spaces as well!!! The spaces!!! Huge thanks to my leader Yang, who spotted the problem instantly. Legend!
```python
def Filid_plus(df, label):
    s1 = str(label)  # e.g. "['高考', '高校']"
    # single -> double quotes, full-width -> half-width commas, drop spaces
    s2 = s1.replace("'", '"').replace(",", ",").replace(" ", "")
    df_1 = df[df['interests_array'] == s2]
    df_1.to_csv('./data/[%d]_%s_%s.csv' % (df_1.shape[0], label[0], label[1]),
                index=False, encoding='gbk')

for i in range(len(all_list)):
    Filid_plus(df_test, all_list[i])
```
All good now! Nicely done!!!

That's it for today's share! If you want to try this yourself, download the dataset below. Good night!
Data
- example_data: https://pan.baidu.com/s/1jApSNlAnYQaaA1cK2Dq3MQ