This is my second IT-related post.
I tried to use a web spider to simulate logging in to an educational administration system and fetch data from it. The difficult part of this is that the captcha must be recognized accurately when the spider posts the login data to the server.
I did not know how others solve this problem, so I used machine learning (ML) methods, which I am studying at the moment. This is clearly a classification problem; while SVM is a reasonable choice, a Naive Bayes classifier seemed more appropriate. The final score under 10-fold cross-validation was 0.44, which I did not consider good.
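For readers unfamiliar with this kind of evaluation, the following is a minimal sketch of a Bernoulli naive Bayes classifier scored by k-fold cross-validation. It is not the code behind the 0.44 score: the function names (train_bernoulli_nb, predict, k_fold_accuracy), the use of binary pixel vectors as features, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on binary (0/1) feature vectors."""
    n_features = len(samples[0])
    class_count = defaultdict(int)
    ones_count = defaultdict(lambda: [0] * n_features)
    for x, y in zip(samples, labels):
        class_count[y] += 1
        counts = ones_count[y]
        for k, v in enumerate(x):
            counts[k] += v
    total = len(samples)
    model = {}
    for y, n_y in class_count.items():
        log_prior = math.log(n_y / total)
        # Laplace smoothing keeps every probability strictly inside (0, 1).
        probs = [(c + alpha) / (n_y + 2 * alpha) for c in ones_count[y]]
        model[y] = (log_prior, probs)
    return model


def predict(model, x):
    """Return the label with the highest log posterior for vector x."""
    best_label, best_score = None, -math.inf
    for y, (log_prior, probs) in model.items():
        score = log_prior
        for v, p in zip(x, probs):
            score += math.log(p if v else 1.0 - p)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


def k_fold_accuracy(samples, labels, k=10, seed=0):
    """Mean accuracy over k folds; assumes at least k samples."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train_bernoulli_nb([samples[i] for i in train_idx],
                                   [labels[i] for i in train_idx])
        correct = sum(predict(model, samples[i]) == labels[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


# Toy data (hypothetical): two perfectly separable 4-pixel "characters".
toy_samples = [[1, 1, 0, 0]] * 20 + [[0, 0, 1, 1]] * 20
toy_labels = ['a'] * 20 + ['b'] * 20
print(k_fold_accuracy(toy_samples, toy_labels, k=10))
```

In a real run each sample would be the flattened pixels of one cropped character, and the score would depend heavily on how cleanly the characters were segmented.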
First, a number of image samples were needed to train the model, so they were captured from the educational administration system. The code is as follows:
import requests
verification_code_header={
'Accept':'image/webp,image/*,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch',
'Accept-Language':'zh-CN,zh;q=0.8',
'Connection':'keep-alive',
'Cookie':'ASP.NET_SessionId=2me54k55qfs433jdazjuxo45',
'Host':'202.201.80.68',
'Referer':'http://202.201.80.68/',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36'
}
verification_code_url='http://202.201.80.68/CheckCode.aspx'
for i in range(2000):
    code=requests.get(verification_code_url,headers=verification_code_header)
    # i is an int, so it must be converted to str before building the file name
    with open(str(i)+'.jpg','wb') as fp:
        fp.write(code.content)
A number of captcha images were obtained by the above code.
Fig. 1 crawled captcha images
Next, each image had to be cropped into 4 segments, one per character. Note that the segments do not all have the same width. I wrote my own algorithm for this; its steps are as follows:
1. Binarize the image.
2. Check every column strictly from left to right, and take as the start column the first column whose black pixels belong to a connected region (in the code below: more than 3 black pixels, and isLink returns 0).
3. Continue from the start column to find the end column: the first column whose black pixels do not belong to a connected region (in the code below: fewer than 2 black pixels, and isLink returns 1).
4. Repeat steps 2-3 on the remainder of the image until the fourth segment has been cut out.
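Before the MATLAB implementation, the idea behind the steps above can be sketched in a few lines of Python. This is a simplified illustration, not a port: it decides on column pixel counts alone and skips the connectivity check, the threshold of 128 is an assumption, and the names binarize, split_columns and crop are hypothetical.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 0 = black, 1 = white."""
    return [[0 if px < threshold else 1 for px in row] for row in gray]


def column_black_counts(binary):
    """Count the black pixels in every column."""
    width = len(binary[0])
    return [sum(1 for row in binary if row[x] == 0) for x in range(width)]


def split_columns(binary, min_black=1):
    """Return (start, end) column ranges, end exclusive, of runs of ink."""
    counts = column_black_counts(binary)
    segments = []
    start = None
    for x, c in enumerate(counts):
        if c >= min_black and start is None:
            start = x                      # first ink column of a character
        elif c < min_black and start is not None:
            segments.append((start, x))    # first blank column ends it
            start = None
    if start is not None:                  # character touching the right edge
        segments.append((start, len(counts)))
    return segments


def crop(binary, start, end):
    """Cut the columns start..end-1 out of every row."""
    return [row[start:end] for row in binary]


# Tiny hand-made image (hypothetical): two "characters" separated by blanks.
gray = [
    [255, 0, 0, 255, 255, 0, 255, 255],
    [255, 0, 255, 255, 255, 0, 0, 255],
    [255, 255, 255, 255, 255, 255, 255, 255],
]
binary = binarize(gray)
for start, end in split_columns(binary):
    print(start, end, crop(binary, start, end))
```

Because a character ends only at a blank column, this simplified scan cannot separate characters that touch each other.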
The MATLAB code is as follows:
vertify_code_crop.m
function [img_ba_crop1,img_ba_crop2,img_ba_crop3,img_ba_surplus]=vertify_code_crop(src)
img=imread(src);
img_ba=im2bw(img);
img_ba_surplus=img_ba;
for num=1:4
    [m,n]=size(img_ba_surplus);
    % start column: first column with more than 3 black pixels that isLink reports as connected
    for i=1:n-1
        if length(find(img_ba_surplus(:,i)==0))>3
            if isLink(i,img_ba_surplus,find(img_ba_surplus(:,i)==0))==0
                frist=i;
                break
            end
        end
    end
    % end column: first column after the character with fewer than 2 black pixels, all of them isolated
    for i=1:n-1
        if length(find(img_ba_surplus(:,i)==0))>3
            for j = i:n-1
                if length(find(img_ba_surplus(:,j)==0))<2
                    if isLink(j,img_ba_surplus,find(img_ba_surplus(:,j)==0))==1
                        last=j;
                        break
                    end
                end
            end
            break
        end
    end
    % crop the segment, adding a 2-pixel margin where the image borders allow it
    if num==1
        if frist>2
            if last<n-2
                img_ba_crop1=imcrop(img_ba_surplus,[frist-2 0 last-frist+2 m]);
            else
                img_ba_crop1=imcrop(img_ba_surplus,[frist-2 0 last-frist m]);
            end
        else
            if last<n-2
                img_ba_crop1=imcrop(img_ba_surplus,[frist 0 last-frist+2 m]);
            else
                img_ba_crop1=imcrop(img_ba_surplus,[frist 0 last-frist m]);
            end
        end
    elseif num==2
        if frist>2
            if last<n-2
                img_ba_crop2=imcrop(img_ba_surplus,[frist-2 0 last-frist+2 m]);
            else
                img_ba_crop2=imcrop(img_ba_surplus,[frist-2 0 last-frist m]);
            end
        else
            if last<n-2
                img_ba_crop2=imcrop(img_ba_surplus,[frist 0 last-frist+2 m]);
            else
                img_ba_crop2=imcrop(img_ba_surplus,[frist 0 last-frist m]);
            end
        end
    elseif num==3
        if frist>2
            if last<n-2
                img_ba_crop3=imcrop(img_ba_surplus,[frist-2 0 last-frist+2 m]);
            else
                img_ba_crop3=imcrop(img_ba_surplus,[frist-2 0 last-frist m]);
            end
        else
            if last<n-2
                img_ba_crop3=imcrop(img_ba_surplus,[frist 0 last-frist+2 m]);
            else
                img_ba_crop3=imcrop(img_ba_surplus,[frist 0 last-frist m]);
            end
        end
    end
    % keep only the part of the image to the right of the segment just cut
    if m>0||n-last>0
        img_ba_surplus=imcrop(img_ba_surplus,[last 0 n-last m]);
    end
end
isLink.m
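function is=isLink(colum_index,img_ba_surplus,index)
% Returns 1 when every black pixel listed in index has four white neighbours
% (left, right, below, above), and 0 as soon as one of them touches a
% non-white pixel. Note: there is no bounds check, so pixels on the image
% border raise an indexing error.
is=1;
for i = 1:length(index)
    if img_ba_surplus(index(i),colum_index-1)==1&&img_ba_surplus(index(i),colum_index+1)==1&&img_ba_surplus(index(i)+1,colum_index)==1&&img_ba_surplus(index(i)-1,colum_index)==1
        continue
    else
        is=0;
        break
    end
end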
main.m
clc;clear;
for i =1:1000
    try
        src=strcat(strcat('D:\temp\',num2str(i)),'.jpg');
        [a,b,c,d]=kmeas_cluster_crop(src);
        if ~isempty(a)
            imwrite(a,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_1.jpg'));
        end
        if ~isempty(b)
            imwrite(b,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_2.jpg'));
        end
        if ~isempty(c)
            imwrite(c,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_3.jpg'));
        end
        if ~isempty(d)
            imwrite(d,strcat(strcat('D:\temp\cluster_crop\',num2str(i)),'_4.jpg'));
        end
    catch
        i
    end
end
The result of running the code above is as follows:
Fig. 2 cropped captcha images
A defect of this cropping algorithm is that it fails to separate captcha characters that stick together.
Fig. 3 sticky images
I also tried a cropping algorithm whose core is K-means. I will not explain its specific steps here; the code is as follows:
kmeas_cluster_crop.m
function [img_ba_crop1,img_ba_crop2,img_ba_crop3,img_ba_crop4]=kmeas_cluster_crop(src)
img=imread(src);
img_ba=im2bw(img);
[m,n]=size(double(img_ba));
img_ba=imcrop(img_ba,[0 0 n-18 m]);
[m,n]=size(double(img_ba));
[i_black,j_black]=ind2sub([m,n],find(img_ba==0));
[i_white,j_white]=ind2sub([m,n],find(img_ba==1));
% every pixel becomes a (row, column, color) sample; color is 0 for black and 1 for white
data=[i_black j_black zeros(length(i_black),1);i_white j_white ones(length(i_white),1)];
[idx,C,sumd,D]=kmeans(data,8);
img_ba_clus=zeros(m,n);
for i = 1:size(data,1)
    img_ba_clus(data(i,1),data(i,2))=idx(i);
end
%imshow(label2rgb(img_ba_clus));
% labels of the clusters that touch the first image row
values=unique(img_ba_clus(1,:));
count=1;
img_ba_surplus=img_ba;
margin=1;
for value = values
    [i,j]=ind2sub([m,n],find(img_ba_clus==value));
    if count==1
        if min(j)>margin
            if min(j)<n-margin
                img_ba_crop1=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop1=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j) m]);
            end
        else
            if min(j)<n-margin
                img_ba_crop1=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop1=imcrop(img_ba_surplus,[ min(j) 0 max(j)-min(j)+margin m]);
            end
        end
        count=count+1;
    elseif count==2
        if min(j)>margin
            if min(j)<n-margin
                img_ba_crop2=imcrop(img_ba_surplus,[min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop2=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j) m]);
            end
        else
            if min(j)<n-margin
                img_ba_crop2=imcrop(img_ba_surplus,[min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop2=imcrop(img_ba_surplus,[ min(j) 0 max(j)-min(j)+margin m]);
            end
        end
        count=count+1;
    elseif count==3
        if min(j)>margin
            if min(j)<n-margin
                img_ba_crop3=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop3=imcrop(img_ba_surplus,[min(j)-margin 0 max(j)-min(j) m]);
            end
        else
            if min(j)<n-margin
                img_ba_crop3=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop3=imcrop(img_ba_surplus,[min(j) 0 max(j)-min(j)+margin m]);
            end
        end
        count=count+1;
    elseif count==4
        if min(j)>margin
            if min(j)<n-margin
                img_ba_crop4=imcrop(img_ba_surplus,[min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop4=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j) m]);
            end
        else
            if min(j)<n-margin
                img_ba_crop4=imcrop(img_ba_surplus,[ min(j)-margin 0 max(j)-min(j)+margin m]);
            else
                img_ba_crop4=imcrop(img_ba_surplus,[ min(j) 0 max(j)-min(j)+margin m]);
            end
        end
    end
end
Fig. 4 K-means cropping display(1)
Fig. 5 K-means cropping display(2)
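Since the steps of the K-means variant are not spelled out above, here is a simplified sketch of the underlying idea in Python. It is an assumption-laden illustration rather than a port of kmeas_cluster_crop: it clusters only the column coordinates of black pixels into one cluster per character, whereas the MATLAB code clusters (row, column, color) samples into 8 clusters. The names kmeans_1d and character_spans, the deterministic initialization and the toy image are hypothetical.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's algorithm on a list of numbers; returns (centers, groups)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


def character_spans(binary, k=4):
    """Cluster the columns of black pixels (0) and return (min, max) per cluster."""
    xs = [x for row in binary for x, px in enumerate(row) if px == 0]
    _, groups = kmeans_1d(xs, k)
    return sorted((min(g), max(g)) for g in groups if g)


# Toy binary image (hypothetical): three blobs of black pixels, 14 columns wide.
width = 14
row0 = [0 if x in (1, 2, 6, 7, 11, 12) else 1 for x in range(width)]
row1 = [0 if x in (1, 7, 12) else 1 for x in range(width)]
print(character_spans([row0, row1], k=3))
```

Unlike the column scan, clustering does not need a blank column between characters, which is why it is a candidate for the sticky captchas; how well it splits them still depends on the characters having similar widths.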