Machine Learning: Captcha Recognition (Part One)

This is my second article about IT.

I tried to simulate a login with a web spider in order to fetch data from the educational administration system. The difficult part of this operation is that the captcha must be recognized accurately when the spider posts the login data to the server.

I did not know how others solve this problem, so I used machine learning (ML) methods, since I am studying ML at the moment. While an SVM is a reasonable choice, a Naive Bayes classifier seemed more appropriate, because this is obviously a classification problem. The final score, which I did not consider good, was 0.44 under 10-fold cross-validation.
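For readers unfamiliar with the scoring scheme, here is a minimal sketch of how a 10-fold cross-validation score can be computed. It is not the code used for the 0.44 result: the function names and the `fit` and `predict` callbacks are hypothetical placeholders for whatever classifier is plugged in, and it assumes at least `k` samples that have already been shuffled.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_score(samples, labels, fit, predict, k=10):
    """Average accuracy over k folds; `fit` and `predict` are hypothetical callbacks."""
    scores = []
    for train, test in k_fold_splits(len(samples), k):
        # Train on k-1 folds, then measure accuracy on the held-out fold.
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Each sample is used for testing exactly once, so the averaged accuracy is less sensitive to one lucky or unlucky split than a single train/test partition.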

First, a number of image samples were needed to train the model, so I captured them from the educational administration system with the following code:

import requests

# Request headers copied from a browser session; the cookie value is session-specific.
verification_code_header = {
    'Accept': 'image/webp,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Cookie': 'ASP.NET_SessionId=2me54k55qfs433jdazjuxo45',
    'Host': '202.201.80.68',
    'Referer': 'http://202.201.80.68/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36'
}
verification_code_url = 'http://202.201.80.68/CheckCode.aspx'

# Download 2000 captcha images and save them as 0.jpg, 1.jpg, ...
for i in range(2000):
    code = requests.get(verification_code_url, headers=verification_code_header)
    # str(i) is required: concatenating an int with '.jpg' raises a TypeError.
    with open(str(i) + '.jpg', 'wb') as fp:
        fp.write(code.content)

The code above collected a set of captcha images.



Fig. 1 Crawled captcha images

 

 

Next, each image had to be cropped into 4 segments. Note that the segments do not all have the same width. I used an algorithm of my own for this, with the following steps:

1. Preprocess the image by binarization.

2. Scan the columns strictly from left to right, and take as the start column the first column in which fewer than 2 black pixels belong to a connected domain.

3. Continue scanning from the start column, and take as the end column the first column in which more than 2 black pixels do not belong to a connected domain.

4. Repeat steps 1-3 until the 4th part has been cut out.
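The column scan in the steps above can be illustrated with a simplified Python sketch. It is an assumption-laden stand-in, not a port of the MATLAB code: it replaces the connected-domain check with a plain count of black pixels per column (a vertical projection), represents the image as a list of rows, and marks black ink as 1, whereas MATLAB's im2bw marks black as 0. All function names are hypothetical.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = black ink, 0 = white."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def column_ink(binary):
    """Count black pixels in every column."""
    return [sum(col) for col in zip(*binary)]


def segment_columns(binary, min_ink=1):
    """Return (start, end) column ranges, end exclusive, of consecutive inked columns."""
    segments, start = [], None
    for x, ink in enumerate(column_ink(binary)):
        if ink >= min_ink and start is None:
            start = x                      # first inked column: a character begins
        elif ink < min_ink and start is not None:
            segments.append((start, x))    # first blank column: the character ends
            start = None
    if start is not None:                  # character touching the right edge
        segments.append((start, len(binary[0])))
    return segments


def crop(binary, start, end):
    """Cut the columns start..end-1 out of every row."""
    return [row[start:end] for row in binary]


# Tiny 3x8 example: one character in columns 1-2, another in column 5.
gray = [
    [255, 0, 0, 255, 255, 0, 255, 255],
    [255, 0, 255, 255, 255, 0, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255],
]
binary = binarize(gray)
print(segment_columns(binary))
```

A pure projection like this only separates characters that have at least one blank column between them, which is why characters that touch each other need extra handling.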

The MATLAB code is as follows:



vertify_code_crop.m

function [img_ba_crop1,img_ba_crop2,img_ba_crop3,img_ba_surplus]=vertify_code_crop(src)
img=imread(src);
img_ba=im2bw(img);
img_ba_surplus=img_ba;
for num=1:4
   [m,n]=size(img_ba_surplus);
    for i=1:n-1
        if length(find(img_ba_surplus(:,i)==0))>3
            if isLink(i,img_ba_surplus,find(img_ba_surplus(:,i)==0))==0
                frist=i;
                break
            end
        end
    end
    for i=1:n-1
        if length(find(img_ba_surplus(:,i)==0))>3
            for j = i:n-1
               
                if length(find(img_ba_surplus(:,j)==0))<2
                    if isLink(j,img_ba_surplus,find(img_ba_surplus(:,j)==0))==1
                       last=j;
                        break
                    end
                end
            end
            break
        end
    end
    if num==1
        if frist>2
            if last<n-2
               img_ba_crop1=imcrop(img_ba_surplus,[frist-2 0 last-frist+2 m]);
            else
               img_ba_crop1=imcrop(img_ba_surplus,[frist-2 0 last-frist m]);
            end
        else
            if last<n-2
               img_ba_crop1=imcrop(img_ba_surplus,[frist 0 last-frist+2 m]);
            else
               img_ba_crop1=imcrop(img_ba_surplus,[frist 0 last-frist m]);
            end
        end
    elseif num==2
        if frist>2
            if last<n-2
               img_ba_crop2=imcrop(img_ba_surplus,[frist-2 0 last-frist+2 m]);
            else
               img_ba_crop2=imcrop(img_ba_surplus,[frist-2 0 last-frist m]);
            end
        else
            if last<n-2
               img_ba_crop2=imcrop(img_ba_surplus,[frist 0 last-frist+2 m]);
            else
               img_ba_crop2=imcrop(img_ba_surplus,[frist 0 last-frist m]);
            end
        end
    elseif num==3
        if frist>2
            if last<n-2
               img_ba_crop3=imcrop(img_ba_surplus,[frist-2 0 last-frist+2 m]);
            else
               img_ba_crop3=imcrop(img_ba_surplus,[frist-2 0 last-frist m]);
            end
        else
            if last<n-2
               img_ba_crop3=imcrop(img_ba_surplus,[frist 0 last-frist+2 m]);
            else
               img_ba_crop3=imcrop(img_ba_surplus,[frist 0 last-frist m]);
            end
        end
   
    end
    if m>0||n-last>0
   img_ba_surplus=imcrop(img_ba_surplus,[last 0 n-last m]);
    end
end


isLink.m


function is=isLink(colum_index,img_ba_surplus,index)
is=1;
for i = 1:length(index)
   if img_ba_surplus(index(i),colum_index-1)==1&&img_ba_surplus(index(i),colum_index+1)==1&&img_ba_surplus(index(i)+1,colum_index)==1&&img_ba_surplus(index(i)-1,colum_index)==1
       continue
   else
       is=0;
       break
   end
end
 



main.m

clc;clear;
for i =1:1000
   try
   src=strcat(strcat('D:\temp\',num2str(i)),'.jpg');
   [a,b,c,d]=kmeas_cluster_crop(src);
   if ~isempty(a)
   imwrite(a,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_1.jpg'));
   end
   if ~isempty(b)
   imwrite(b,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_2.jpg'));
   end
   if ~isempty(c)
   imwrite(c,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_3.jpg'));
   end
   if ~isempty(d)
   imwrite(d,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_4.jpg'));
   end
   catch
      i
   end
end


Running the code above gives the following result:


Fig. 2 Cropped captcha images

 

The cropping algorithm has one defect: it fails to separate captcha characters that stick together.


Fig. 3 Sticky images

 

I also tried a cropping algorithm whose kernel is K-means. I will not explain the specific steps here; the code is as follows:

kmeas_cluster_crop.m

function [img_ba_crop1,img_ba_crop2,img_ba_crop3,img_ba_crop4]=kmeas_cluster_crop(src)
img=imread(src);
img_ba=im2bw(img);
[m,n]=size(double(img_ba));
img_ba=imcrop(img_ba,[0 0 n-18 m]);
[m,n]=size(double(img_ba));
[i_black,j_black]=ind2sub([m,n],find(img_ba==0));
[i_white,j_white]=ind2sub([m,n],find(img_ba==1));
data=[i_black j_black zeros(length(i_black),1);i_white j_white ones(length(i_white),1)];
[idx,C,sumd,D]=kmeans(data,8);
img_ba_clus=zeros(m,n);
for i = 1:size(data,1)
   img_ba_clus(data(i,1),data(i,2))=idx(i);
end
%imshow(label2rgb(img_ba_clus));
values=unique(img_ba_clus(1,:));
count=1;
img_ba_surplus=img_ba;
margin=1;
for value = values
   [i,j]=ind2sub([m,n],find(img_ba_clus==value));
   if count==1
        if min(j)>margin
           if min(j)<n-margin
               
               img_ba_crop1=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
               img_ba_crop1=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)  m]);
           end
       else
           if min(j)<n-margin
               img_ba_crop1=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
               img_ba_crop1=imcrop(img_ba_surplus,[ min(j)  0 max(j)-min(j)+margin  m]);
           end
       end
       count=count+1;
   elseif count==2
       if min(j)>margin
           if min(j)<n-margin
               
                img_ba_crop2=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
               img_ba_crop2=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)  m]);
           end
       else
           if min(j)<n-margin
               img_ba_crop2=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
               img_ba_crop2=imcrop(img_ba_surplus,[ min(j)  0 max(j)-min(j)+margin  m]);
           end
       end
       count=count+1;
       
   elseif count==3
              if min(j)>margin
           if min(j)<n-margin
               
               img_ba_crop3=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
                img_ba_crop3=imcrop(img_ba_surplus,[min(j)-margin  0 max(j)-min(j)  m]);
           end
       else
           if min(j)<n-margin
               img_ba_crop3=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
                img_ba_crop3=imcrop(img_ba_surplus,[min(j)  0 max(j)-min(j)+margin  m]);
           end
       end
       
       count=count+1;
   elseif count==4
       if min(j)>margin
           if min(j)<n-margin
               
                img_ba_crop4=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
               img_ba_crop4=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)  m]);
           end
       else
           if min(j)<n-margin
               img_ba_crop4=imcrop(img_ba_surplus,[ min(j)-margin  0 max(j)-min(j)+margin  m]);
           else
               img_ba_crop4=imcrop(img_ba_surplus,[ min(j)  0 max(j)-min(j)+margin  m]);
           end
       end
   end
end
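Since the K-means steps are not spelled out, the following Python sketch shows only the general idea of cutting characters by clustering pixel positions. It is a deliberate simplification and not a port of the MATLAB function: it clusters just the column coordinates of black pixels into 4 groups (the MATLAB code clusters row, column and colour into 8), and it starts from evenly spaced centers so the result is reproducible. The function names are hypothetical.

```python
def kmeans_1d(values, k=4, iterations=20):
    """Cluster scalar values with Lloyd's algorithm; returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the centers evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest center for every value.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels


def cluster_bounds(values, labels, k=4):
    """Left and right column of each non-empty cluster, ordered left to right."""
    bounds = []
    for c in range(k):
        members = [v for v, l in zip(values, labels) if l == c]
        if members:
            bounds.append((min(members), max(members)))
    return sorted(bounds)


# Column coordinates of black pixels from four made-up characters.
columns = [2, 3, 4, 12, 13, 14, 22, 23, 24, 32, 33, 34]
centers, labels = kmeans_1d(columns, k=4)
print(cluster_bounds(columns, labels))
```

Each (left, right) pair can then serve as the horizontal extent of one crop. Unlike a blank-column scan, clustering always produces a cut even when characters touch, although the cut position is only as good as the clusters.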



 

Fig. 4 K-means cropping display (1)


 

Fig. 5 K-means cropping display (2)
