TinyMind RMB Face Value & Serial Number Recognition Challenge - Warm-up Round
1 Data Exploration and Cleaning
1.1 Labels
- Read the label CSV file
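train_lables_face_value_path = os.path.join(DATASET, TRAIN_LABELS_FACE_VALUE_FILE)
# The separator is two characters (", "), which only the Python parser engine
# supports; naming the engine explicitly avoids the ParserWarning on fallback.
df_rmb_face_value_labels = pd.read_csv(train_lables_face_value_path,
                                       index_col=None, header=0,
                                       delimiter=", ", engine="python")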
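Since the call above treats a comma followed by a space as the separator, another option is to split on the comma alone and let pandas drop the space after it with `skipinitialspace=True`. The following is a minimal sketch on an in-memory stand-in for the file, assuming it really is laid out as `name, label`; `csv_text` and `df_demo` are hypothetical names, and the three rows are copied from the printed head of the label table.

```python
import io

import pandas as pd

# In-memory stand-in for the label file, assuming ", " separates the fields.
csv_text = (
    "name, label\n"
    "013MNV9B.jpg, 100.0\n"
    "016ETNGG.jpg, 50.0\n"
    "018SUTBA.jpg, 0.1\n"
)

# Split on "," and skip the space that follows each delimiter.
df_demo = pd.read_csv(io.StringIO(csv_text), sep=",", skipinitialspace=True)

print(df_demo.columns.tolist())
print(df_demo.dtypes)
```

With a single-character separator, pandas can use its default C parser instead of the Python engine.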
- Statistics
print("*" * 10)
print(df_rmb_face_value_labels.head(10))
print("*" * 10)
print(df_rmb_face_value_labels.columns)
print("*" * 10)
print(df_rmb_face_value_labels.dtypes)
print("*" * 10)
print(df_rmb_face_value_labels.info())
print("*" * 10)
print(df_rmb_face_value_labels["label"].describe())
**********
           name  label
0  013MNV9B.jpg  100.0
1  016ETNGG.jpg   50.0
2  018SUTBA.jpg    0.1
3  0192G5IC.jpg    5.0
4  01953EH7.jpg  100.0
5  01AUV9WG.jpg   10.0
6  01B68AKT.jpg    1.0
7  01DMQGVG.jpg    0.1
8  01E9AUX7.jpg    0.1
9  01EAXZMY.jpg    0.2
**********
Index(['name', 'label'], dtype='object')
**********
name      object
label    float64
dtype: object
**********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39620 entries, 0 to 39619
Data columns (total 2 columns):
name     39620 non-null object
label    39620 non-null float64
dtypes: float64(1), object(1)
memory usage: 619.1+ KB
None
**********
count    39620.000000
mean        19.405411
std         33.075336
min          0.100000
25%          0.500000
50%          2.000000
75%         10.000000
max        100.000000
Name: label, dtype: float64
- Face value classes
face_values = np.sort(df_rmb_face_value_labels["label"].unique())
print(face_values)
[ 0.1 0.2 0.5 1. 2. 5. 10. 50. 100. ]
- Sample distribution
Check whether the classes are balanced.
distribution_rmb_face_value = dict()
for face_value in face_values:
    distribution_rmb_face_value[str(face_value)] = len(
        df_rmb_face_value_labels[df_rmb_face_value_labels["label"] == face_value])
distribution_rmb_face_value = pd.Series(distribution_rmb_face_value)
print(distribution_rmb_face_value)
distribution_rmb_face_value.plot(kind="bar")
plt.xticks(rotation=45)
plt.title("distribution of face values")
plt.show()
0.1      4233
0.2      4373
0.5      4407
1.0      4424
2.0      4411
5.0      4413
10.0     4283
50.0     4408
100.0    4668
dtype: int64
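The per-class counts above can also be obtained in one step with `value_counts()`, and a single number such as the ratio of the largest to the smallest class makes the balance check explicit. This is a minimal sketch on a hypothetical miniature label table (`labels`, `counts`, and `imbalance_ratio` are illustrative names, and the six rows are made up); it is not the notebook's own code.

```python
import pandas as pd

# Hypothetical miniature label table; the real one is df_rmb_face_value_labels.
labels = pd.DataFrame({
    "name": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"],
    "label": [0.1, 0.1, 0.5, 100.0, 0.5, 0.1],
})

# Count samples per face value and order the result by face value.
counts = labels["label"].value_counts().sort_index()

# Largest class size divided by smallest; values near 1 indicate balance.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(imbalance_ratio)
```

Applied to the counts printed above, the same ratio is 4668 / 4233, about 1.10, so the nine classes are close to balanced.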
1.2 Samples
Compute the aspect ratio (width / height) of every sample image to check whether the samples contain outliers.
df_samples = pd.DataFrame(columns=["width", "height", "ratio"])
for sample_path in glob.glob(os.path.join(DATASET, TRAIN_IMGS_DIR, "*.jpg")):
    str_sample_name = os.path.split(sample_path)[-1]
    # print(sample_path)
    # print(str_sample_name)
    img = cv2.imread(sample_path)
    height, width = img.shape[:2]
    df_samples.loc[str_sample_name] = [width, height, width / height]
print(df_samples.sample(10))
               width  height     ratio
BCEMGK97.jpg  1139.0   579.0  1.967185
4WRU1SYD.jpg  1109.0   571.0  1.942207
V8S7YE61.jpg  1168.0   534.0  2.187266
E63QHM07.jpg   750.0   353.0  2.124646
5Z8OQ73V.jpg  1226.0   600.0  2.043333
CELA9UQV.jpg  1304.0   600.0  2.173333
TEX65YQR.jpg   825.0   394.0  2.093909
WYGL2EFU.jpg  1164.0   600.0  1.940000
3KTP4EOL.jpg  1183.0   600.0  1.971667
H3QNT5ZS.jpg  1310.0   600.0  2.183333
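Growing a DataFrame one row at a time with `.loc` gets slow for tens of thousands of images, and `cv2.imread` returns `None` rather than raising when a file cannot be decoded. The sketch below is one possible variant of the size scan under those two observations: it gathers rows in a list, builds the table once, and records unreadable files. The helper `collect_image_sizes`, its `read_image` parameter, and `fake_reader` are hypothetical names introduced only for illustration; in the notebook the reader would be `cv2.imread`.

```python
import os

import numpy as np
import pandas as pd


def collect_image_sizes(paths, read_image):
    """Return (DataFrame of width/height/ratio by file name, unreadable paths).

    read_image is any callable that maps a path to an array shaped
    (height, width, ...) or to None when the file cannot be decoded.
    """
    rows, names, unreadable = [], [], []
    for path in paths:
        img = read_image(path)
        if img is None:
            # Keep track of broken files instead of crashing on img.shape.
            unreadable.append(path)
            continue
        height, width = img.shape[:2]
        rows.append((width, height, width / height))
        names.append(os.path.basename(path))
    df = pd.DataFrame(rows, columns=["width", "height", "ratio"], index=names)
    return df, unreadable


def fake_reader(path):
    # Stand-in for cv2.imread so the sketch runs without image files.
    if path.endswith("bad.jpg"):
        return None
    return np.zeros((600, 1200, 3), dtype=np.uint8)


df_demo, unreadable = collect_image_sizes(["imgs/a.jpg", "imgs/bad.jpg"], fake_reader)
print(df_demo)
print(unreadable)
```

Any file reported in `unreadable` is itself a candidate for cleaning, independent of its aspect ratio.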
fig = plt.figure(figsize=(8, 6))
df_samples.plot(kind="box", ax=fig.add_subplot(111), subplots=True)
plt.show()
- Outliers
Use a box plot to find outlier samples.
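A box plot draws as fliers the points that fall beyond its whiskers. Assuming matplotlib's default whisker setting (`whis=1.5`), those are the values more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below applies that rule directly, without plotting; the helper `iqr_outlier_mask` and the sample ratios are hypothetical and not taken from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(values, whis=1.5):
    """Boolean mask of values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return (values < lower) | (values > upper)


# Made-up aspect ratios: most are near 2, two are clearly off.
ratios = pd.Series(
    [1.94, 1.97, 2.04, 2.09, 2.12, 2.17, 2.18, 0.50, 4.00],
    index=["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg",
           "f.jpg", "g.jpg", "x.jpg", "y.jpg"],
)

mask = iqr_outlier_mask(ratios)
print(ratios[mask])
```

Computing the mask numerically gives the outlier rows directly, so they can be selected from the table without reading values back from the plot.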
p = df_samples["ratio"].plot(kind="box", return_type='dict')
plt.show()
outliers = p["fliers"][0].get_ydata()
outliers.sort()
print(outliers)
outlier_samples = df_samples[(df_samples["ratio"] >