TensorFlow: Simple CAPTCHA Classification with K-means Clustering

I previously built an SVM-based CAPTCHA recognizer (https://blog.csdn.net/qq_42686550/article/details/81514233).
This time I use the same data with an unsupervised method to see how it compares (the results are not as good as the SVM's).
The clustering code is adapted from: https://blog.csdn.net/qq_40077103/article/details/82747283
Content:
1. Clustering
2. Getting the data
3. Clustering code

1. What is clustering
Simply put, K-means clustering repeatedly computes the Euclidean distance from each sample to each cluster center and reassigns samples, iterating until the centers no longer change. It is one of the simpler unsupervised learning methods, and here we implement it with Google's TensorFlow.
First, the data source: https://pan.baidu.com/s/1sC5HFDMAvm9uGpjm115EeQ (extraction code: s29m)
The archive contains images of the ten digits 0-9, organised into 10 folders, one per digit.
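Before the TensorFlow version below, the iteration just described can be written out in plain NumPy. This is a minimal sketch on made-up 2D toy data, not the code used in this post:

```python
import numpy as np

def kmeans(points, k, iters=100):
    # Initialise the centers with the first k points (the same strategy
    # the TensorFlow code below uses).
    centroids = points[:k].copy()
    assign = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        new_centroids = np.array([points[assign == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centers stopped moving: converged
        centroids = new_centroids
    return centroids, assign

pts = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centers, assign = kmeans(pts, 2)
print(assign)  # the two well-separated pairs land in two different clusters
```

The stopping condition ("centers no longer change") is exactly the `did_assignments_change` check that appears in the TensorFlow code later.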
2. Getting the data
First, how do we turn a JPG image into numbers suitable for machine learning?

def get_feature(address, assume, dir, file, f, a):
    # `assume` is the assumed (true) label, written at the start of the line
    f.write(assume)
    im = Image.open(address + dir + a + file)
    width, height = im.size
    for i in range(height):                      # black pixels per row
        c = 0
        for j in range(width):
            if im.getpixel((j, i)) == 0: c += 1  # pixel value 0 means black
        f.write(' %d' % c)
    for i in range(width):                       # black pixels per column
        c = 0
        for j in range(height):
            if im.getpixel((i, j)) == 0: c += 1
        f.write(' %d' % c)
    f.write('\n')

The idea is simple. The images are 9×16 and already binarised, so we count the black pixels along every row and every column. For example, with width 9 we scan from the first column to the ninth and record the number of black pixels in each column. Since 9 + 16 = 25, every image is described by 25 features.
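The counting scheme can be sketched in vectorised form with NumPy. The image here is a made-up 9×16 array, not one from the dataset:

```python
import numpy as np

# Hypothetical 9x16 binarised image: pixel value 0 = black, 255 = white.
img = np.full((16, 9), 255)   # 16 rows (height), 9 columns (width)
img[3:6, 2:7] = 0             # paint a small black block

row_counts = (img == 0).sum(axis=1)  # black pixels in each of the 16 rows
col_counts = (img == 0).sum(axis=0)  # black pixels in each of the 9 columns
feature = np.concatenate([row_counts, col_counts])
print(feature.shape)  # (25,) -- 16 + 9 features per image
```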
This gives us a txt file recording the features of all 862 digit images.
Next we load the txt file into Python as an array; the logic is straightforward:

f = open('./train.txt')
line = f.readline()
data_list = []
while line:
    num = list(map(float,line.split()))
    data_list.append(num)
    line = f.readline()
f.close()
data_array = np.array(data_list)
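As an aside, the manual readline loop can be replaced by a single np.loadtxt call, assuming every line of train.txt holds only whitespace-separated numbers. A StringIO stands in for the real file here:

```python
import io
import numpy as np

# Equivalent one-liner for the read loop above; the in-memory buffer
# stands in for './train.txt'.
sample = io.StringIO("3 0 0 5 5 5\n7 1 2 3 4 5\n")
data_array = np.loadtxt(sample)
print(data_array.shape)  # (2, 6)
```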

3. Clustering code
With the data loaded, we come to the clustering code itself.
First we initialise and define a few quantities:

N=862
K=10
variables=25
points = tf.Variable(data_array)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))  # cluster assignment of each sample

centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, variables]))  # initial centers: the first K samples
sess = tf.Session()
sess.run(tf.global_variables_initializer())

sess.run(centroids)
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, variables])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, variables])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                            axis=2)        # squared Euclidean distance
best_centroids = tf.argmin(sum_squares, 1) # index of the nearest center for each sample
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))

A quick explanation: N is the number of samples, K is the number of clusters (the digits 0-9, hence 10), and variables is the number of features.
The procedure is to initialise the cluster centers, tile and reshape the data into [N, K, variables] tensors that TensorFlow can subtract elementwise, and then compute the Euclidean distances.
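To see what the tile/reshape step actually computes, here is the same pairwise distance matrix built two ways in NumPy, on toy shapes, for illustration only:

```python
import numpy as np

N, K, V = 6, 3, 4
points = np.random.rand(N, V)
centroids = points[:K]

# The tile/reshape trick from the TensorFlow code, reproduced in NumPy:
rep_centroids = np.tile(centroids, (N, 1)).reshape(N, K, V)
rep_points = np.tile(points, (1, K)).reshape(N, K, V)
tiled = ((rep_points - rep_centroids) ** 2).sum(axis=2)

# Plain broadcasting produces the same N x K distance matrix:
broadcast = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
assert np.allclose(tiled, broadcast)
```

Entry [i, j] of the result is the squared distance from sample i to center j, which is why argmin along axis 1 picks each sample's nearest center.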
That is essentially the whole algorithm; all that remains is to set an iteration limit and a stopping condition. For display purposes we can also plot the result with matplotlib.
The scatter plot shows no obvious separation, which is normal: the data has 25 dimensions, so a 2D plot of the first two features reveals little. Let's look at the assignments instead.
Clusters 0, 4 and 5 come out fairly clean, while the rest are mediocre. That is understandable for a clustering problem, since no class was manually labelled (the 0, 4, 5 here are cluster names, not necessarily the digits '0', '4', '5'). If you are interested, try experimenting further yourself. The complete code:

import tensorflow as tf
import numpy as np
import time
import matplotlib.pyplot as plt
import cv2
from PIL import Image
import os

start = time.time()

def _get_dynamic_binary_image(img_name):
  # Preprocessing helper (not called in this script): greyscale + adaptive-threshold binarisation.
  filename ='./out_img/' + img_name.split('.')[0] + '-binary.jpg'
  img_name = './out_img' + '/' + img_name
  print('.....' + img_name)
  image = cv2.imread(img_name)
  image2 = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
  th1 = cv2.adaptiveThreshold(image2, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 21, 1)
  cv2.imwrite(filename,th1)
  return th1


def get_feature(address, assume, dir, file, f, a):
    # `assume` is the assumed (true) label, written at the start of the line
    f.write(assume)
    im = Image.open(address + dir + a + file)
    width, height = im.size
    for i in range(height):                      # black pixels per row
        c = 0
        for j in range(width):
            if im.getpixel((j, i)) == 0: c += 1  # pixel value 0 means black
        f.write(' %d' % c)
    for i in range(width):                       # black pixels per column
        c = 0
        for j in range(height):
            if im.getpixel((i, j)) == 0: c += 1
        f.write(' %d' % c)
    f.write('\n')

# One-off feature extraction; uncomment to regenerate train.txt.
'''
address = './dataset/'
f1 = open('./train.txt', 'w')
dirs = os.listdir(address)
for dir in dirs:
    files = os.listdir(address + dir)
    for file in files:
        get_feature(address, dir, dir, file, f1, '/')
f1.close()
'''
f = open('./train.txt')
line = f.readline()
data_list = []
while line:
    num = list(map(float,line.split()))
    data_list.append(num)
    line = f.readline()
f.close()
data_array = np.array(data_list)



N=862
K=10
variables=25
points = tf.Variable(data_array)  # note: if train.txt keeps the leading label column, strip it first (data_array[:, 1:])
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))  # cluster assignment of each sample

centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, variables]))  # initial centers: the first K samples
sess = tf.Session()
sess.run(tf.global_variables_initializer())

sess.run(centroids)
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, variables])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, variables])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                            axis=2)        # squared Euclidean distance
best_centroids = tf.argmin(sum_squares, 1) # index of the nearest center for each sample
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))


def bucket_mean(data, bucket_ids, num_buckets):
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count


means = bucket_mean(points, best_centroids, K)
with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(
    centroids.assign(means),
    cluster_assignments.assign(best_centroids))

changed = True
MAX_ITERS=1000
iters = 1

colourindexes = [2, 1, 4, 3, 5, 6, 7, 8, 9, 10]

while changed and iters < MAX_ITERS:
    iters += 1
    [changed, _] = sess.run([did_assignments_change, do_updates])
    [centers, assignments] = sess.run([centroids, cluster_assignments])

# Plot only the first two of the 25 feature dimensions.
fig, ax = plt.subplots()
points_np = sess.run(points)
ax.scatter(points_np.transpose()[0], points_np.transpose()[1], marker='o', s=200,
           c=assignments, cmap=plt.cm.coolwarm)
ax.scatter(centers[:, 0], centers[:, 1], marker='^', s=550, c=colourindexes, cmap=plt.cm.plasma)
ax.set_title('Iteration ' + str(iters))
plt.savefig("kmeans" + str(iters) + ".png")

print(assignments)
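To quantify how good the split is without eyeballing the printed assignments, each cluster id can be mapped to a digit by majority vote, provided the true labels are known (here they would come from the folder names). A sketch with made-up toy data:

```python
import numpy as np

# Toy stand-ins: `labels` would come from the dataset's folder names,
# `assignments` from the K-means run above.
labels      = np.array([0, 0, 0, 1, 1, 2, 2, 2])
assignments = np.array([5, 5, 3, 3, 7, 7, 7, 7])  # arbitrary cluster ids

cluster_to_label = {}
for c in np.unique(assignments):
    members = labels[assignments == c]
    cluster_to_label[c] = np.bincount(members).argmax()  # majority vote

predicted = np.array([cluster_to_label[c] for c in assignments])
accuracy = (predicted == labels).mean()
print(accuracy)  # 0.75 on this toy example
```

This is the usual way to compare an unsupervised clustering against a supervised baseline like the earlier SVM post.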