TinyMind RMB Face Value & Serial Number Code Recognition Challenge - Warm-up Round

K5niper

Article link: https://blog.csdn.net/zhaoyin214/article/details/90721051

1 Data Exploration and Cleaning

1.1 Labels

Read the label CSV file

import os
import pandas as pd

train_lables_face_value_path = os.path.join(DATASET, TRAIN_LABELS_FACE_VALUE_FILE)
df_rmb_face_value_labels = pd.read_csv(train_lables_face_value_path, index_col=None, header=0,
                                       sep=",", skipinitialspace=True)
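
A label file whose fields are separated by a comma followed by a space can be parsed with a plain comma separator plus `skipinitialspace=True`. This avoids passing a multi-character delimiter, which pandas treats as a regular expression and handles with the slower Python parsing engine. The following is a minimal sketch on a hypothetical in-memory excerpt (the two rows are copied from the output shown in this section; `csv_text` and `df_demo` are illustrative names, not part of the original code):

```python
import io

import pandas as pd

# Hypothetical excerpt in the same "name, label" layout as the label file.
csv_text = "name, label\n013MNV9B.jpg, 100.0\n018SUTBA.jpg, 0.1\n"

# skipinitialspace=True drops the space that follows each comma,
# in the header row as well as in the data rows.
df_demo = pd.read_csv(io.StringIO(csv_text), sep=",", skipinitialspace=True)

print(df_demo.columns.tolist())
print(df_demo.dtypes)
```

The column names come out without a leading space, and the label column is parsed as a float.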

Summary statistics

print("*" * 10)
print(df_rmb_face_value_labels.head(10))
print("*" * 10)
print(df_rmb_face_value_labels.columns)
print("*" * 10)
print(df_rmb_face_value_labels.dtypes)
print("*" * 10)
print(df_rmb_face_value_labels.info())
print("*" * 10)
print(df_rmb_face_value_labels["label"].describe())

**********
           name  label
0  013MNV9B.jpg  100.0
1  016ETNGG.jpg   50.0
2  018SUTBA.jpg    0.1
3  0192G5IC.jpg    5.0
4  01953EH7.jpg  100.0
5  01AUV9WG.jpg   10.0
6  01B68AKT.jpg    1.0
7  01DMQGVG.jpg    0.1
8  01E9AUX7.jpg    0.1
9  01EAXZMY.jpg    0.2
**********
Index(['name', 'label'], dtype='object')
**********
name      object
label    float64
dtype: object
**********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39620 entries, 0 to 39619
Data columns (total 2 columns):
name     39620 non-null object
label    39620 non-null float64
dtypes: float64(1), object(1)
memory usage: 619.1+ KB
None
**********
count    39620.000000
mean        19.405411
std         33.075336
min          0.100000
25%          0.500000
50%          2.000000
75%         10.000000
max        100.000000
Name: label, dtype: float64

Face value classes

import numpy as np

face_values = np.sort(df_rmb_face_value_labels["label"].unique())
print(face_values)

[  0.1   0.2   0.5   1.    2.    5.   10.   50.  100. ]

Sample distribution

Check whether the classes are balanced.

import matplotlib.pyplot as plt

# Count the samples of each face value; keys are strings so they serve as bar labels
distribution_rmb_face_value = dict()

for face_value in face_values:
    distribution_rmb_face_value[str(face_value)] = len(
        df_rmb_face_value_labels[df_rmb_face_value_labels["label"] == face_value])

distribution_rmb_face_value = pd.Series(distribution_rmb_face_value)

print(distribution_rmb_face_value)
distribution_rmb_face_value.plot(kind="bar")
plt.xticks(rotation=45)
plt.title("distribution of face values")
plt.show()

0.1      4233
0.2      4373
0.5      4407
1.0      4424
2.0      4411
5.0      4413
10.0     4283
50.0     4408
100.0    4668
dtype: int64
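
The same per-class counts can also be obtained with `Series.value_counts`, and the ratio between the largest and smallest class gives a quick numeric check of balance. The sketch below rebuilds a label column from the counts printed above rather than from the real CSV; `counts_from_article`, `labels`, and `imbalance_ratio` are illustrative names, not part of the original code.

```python
import pandas as pd

# Per-class counts copied from the output above (face value -> sample count).
counts_from_article = {
    0.1: 4233, 0.2: 4373, 0.5: 4407, 1.0: 4424, 2.0: 4411,
    5.0: 4413, 10.0: 4283, 50.0: 4408, 100.0: 4668,
}

# Rebuild a synthetic label column that has exactly these counts.
labels = pd.Series(
    [value for value, n in counts_from_article.items() for _ in range(n)],
    name="label",
)

# value_counts() tallies each distinct label; sort_index() orders by face value.
distribution = labels.value_counts().sort_index()

# Ratio of the most frequent class to the least frequent one (1.0 = perfectly balanced).
imbalance_ratio = distribution.max() / distribution.min()

print(distribution)
print(f"max/min class ratio: {imbalance_ratio:.3f}")
```

With these counts the ratio is about 1.10, so the largest class (100.0) has roughly 10% more samples than the smallest (0.1), which suggests the classes are close to balanced.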

[Figure: bar chart titled "distribution of face values"]

1.2 Samples

Compute the aspect ratio (width / height) of every sample image to check whether the samples contain outliers.

import glob
import cv2

df_samples = pd.DataFrame(columns=["width", "height", "ratio"])

for sample_path in glob.glob(os.path.join(DATASET, TRAIN_IMGS_DIR, "*.jpg")):
    str_sample_name = os.path.basename(sample_path)

    img = cv2.imread(sample_path)
    if img is None:
        # cv2.imread returns None instead of raising when a file cannot be decoded
        continue
    height, width = img.shape[:2]

    # Index by file name; ratio is the aspect ratio width / height
    df_samples.loc[str_sample_name] = [width, height, width / height]

print(df_samples.sample(10))

               width  height     ratio
BCEMGK97.jpg  1139.0   579.0  1.967185
4WRU1SYD.jpg  1109.0   571.0  1.942207
V8S7YE61.jpg  1168.0   534.0  2.187266
E63QHM07.jpg   750.0   353.0  2.124646
5Z8OQ73V.jpg  1226.0   600.0  2.043333
CELA9UQV.jpg  1304.0   600.0  2.173333
TEX65YQR.jpg   825.0   394.0  2.093909
WYGL2EFU.jpg  1164.0   600.0  1.940000
3KTP4EOL.jpg  1183.0   600.0  1.971667
H3QNT5ZS.jpg  1310.0   600.0  2.183333

# One box plot per column (width, height, ratio), each on its own axes
df_samples.plot(kind="box", subplots=True, figsize=(8, 6))
plt.show()

Outliers

Use a box plot to find the outlier samples

p = df_samples["ratio"].plot(kind="box", return_type='dict')
plt.show()

outliers = p["fliers"][0].get_ydata()
outliers.sort()
print(outliers)
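
The fliers drawn by a box plot are, under matplotlib's default settings, the points lying more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The same rule can be applied without plotting. The following is a minimal sketch on made-up aspect ratios; `iqr_outliers`, `ratios`, and the sample names are hypothetical and not taken from the original code.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the entries lying more than whis * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)].sort_values()


# Made-up aspect ratios: eight typical banknote images plus two odd ones.
ratios = pd.Series(
    [1.94, 1.97, 2.04, 2.09, 2.12, 2.17, 2.18, 2.19, 0.48, 4.10],
    index=[f"sample_{i}.jpg" for i in range(10)],
)

flagged = iqr_outliers(ratios)
print(flagged)
```

Here only the two extreme ratios (0.48 and 4.10) are flagged. Keeping the file names as the index means the flagged entries can be traced back to the image files for inspection.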

outlier_samples = df_samples[(df_samples["ratio"] >
