A First Try with Tesseract-OCR: Recognizing CAPTCHAs in Simulated Requests

kendyhj9999

Category: .Net

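I had a slightly evil requirement: recognizing CAPTCHAs. Typing them in by hand a few thousand times would have wrecked my fingers, hence this short post, which doubles as a tribute to the handsome Tesseract-OCR. It really is as awesome as the legends say!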

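First, download the Tesseract-OCR DLL and the matching language packs from Google Code. (The code below calls it through the tessnet2 .NET wrapper.)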

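Once downloaded, add a reference to the DLL in your project and unzip the language pack into the debug directory. Any location works as long as the program has permission to read it; note the path, because you will need it later to configure tesseract-ocr.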

Now you can start writing code!

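First, you need to simulate the HTTP request for the CAPTCHA image. That request may depend on cookies, so before requesting the image you will most likely have to request the pages that set those cookies and keep them in a CookieContainer for the steps that follow.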

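Let's prepare two helpers for simulating HTTP requests. The cookieContainers field they use is static, so every request shares the same set of cookies. This matters a lot: if the CAPTCHA image comes back fine and the recognized value you send is correct, yet the site still reports a wrong CAPTCHA, the most likely cause is that the HTTP request for the image was missing its cookies.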
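To see why the shared container matters, here is a small offline sketch of how a CookieContainer hands a stored cookie to a later request on the same host. The class name CookieDemo, the cookie name SESSIONID and its value are made up for illustration; on a real site the cookie would be stored by the server's Set-Cookie response rather than added by hand.

```csharp
using System;
using System.Net;

public static class CookieDemo
{
    // Returns the Cookie header that a request to captchaUrl would carry
    // after a cookie has been stored for pageUrl in the same container.
    public static string HeaderFor(string pageUrl, string captchaUrl)
    {
        var jar = new CookieContainer();
        // Stand-in for what the server's Set-Cookie header would do.
        jar.Add(new Uri(pageUrl), new Cookie("SESSIONID", "abc123", "/"));
        return jar.GetCookieHeader(new Uri(captchaUrl));
    }
}
```

A second, freshly created CookieContainer would return an empty header for the same URL, which is exactly the "correct CAPTCHA, still rejected" situation described above.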

The method that requests a page's text:

    // Requires: using System; using System.IO; using System.Net; using System.Text;
    // Both helpers are members of a static class named HttpHelper.

    // Shared by every request so that they all send the same cookies.
    private static readonly CookieContainer cookieContainers = new CookieContainer();

    // Assumed value: the original post does not show how IE7 is defined.
    private const string IE7 = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";

    public static string GetResponse(string url, string method, string data, string encode)
    {
        try
        {
            // Accept any server certificate. Set before the request is sent,
            // and only acceptable against sites you trust.
            ServicePointManager.ServerCertificateValidationCallback =
                (se, cert, chain, sslerror) => true;

            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.KeepAlive = true;
            req.Method = method.ToUpper();
            req.AllowAutoRedirect = true;
            req.CookieContainer = cookieContainers;
            req.ContentType = "application/x-www-form-urlencoded";
            req.UserAgent = IE7;
            req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            req.Timeout = 50000;

            if (req.Method == "POST" && data != null)
            {
                // The body is expected to be URL-encoded already, so ASCII is enough.
                byte[] postBytes = Encoding.ASCII.GetBytes(data);
                req.ContentLength = postBytes.Length;
                using (Stream st = req.GetRequestStream())
                {
                    st.Write(postBytes, 0, postBytes.Length);
                }
            }

            Encoding myEncoding = Encoding.GetEncoding(encode);
            using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
            using (StreamReader sr = new StreamReader(res.GetResponseStream(), myEncoding))
            {
                return sr.ReadToEnd();
            }
        }
        catch (Exception)
        {
            // Any failure (network, HTTP error, unknown encoding name) yields an empty string.
            return string.Empty;
        }
    }

The method that requests the CAPTCHA image:

    public static Stream GetResponseImage(string url)
    {
        try
        {
            // Accept any server certificate; only acceptable against sites you trust.
            ServicePointManager.ServerCertificateValidationCallback =
                (se, cert, chain, sslerror) => true;

            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.KeepAlive = true;
            req.Method = "GET";
            req.AllowAutoRedirect = true;
            // Same static container as GetResponse, so the session cookies are sent along.
            req.CookieContainer = cookieContainers;
            req.ContentType = "application/x-www-form-urlencoded";
            req.UserAgent = IE7;
            req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            req.Timeout = 50000;

            HttpWebResponse res = (HttpWebResponse)req.GetResponse();
            // The caller owns the returned stream and should dispose it after use.
            return res.GetResponseStream();
        }
        catch
        {
            // Signal failure to the caller with null.
            return null;
        }
    }

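Capture the browser's requests with Fiddler and note the relevant URLs and POST data. You can then use GetResponse to fetch a page's HTML text and GetResponseImage to fetch the CAPTCHA image as a Stream.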

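Now it is time for Tesseract-OCR to take the stage. It is only a few lines, but they are the most important ones!

    // Requires: using System.Collections.Generic; using System.Drawing;
    //           using System.IO; using System.Windows.Forms;
    private string Recognize(string url)
    {
        using (Stream imageStream = HttpHelper.GetResponseImage(url))
        {
            // GetResponseImage returns null when the request fails.
            if (imageStream == null) return string.Empty;

            using (Bitmap bitmap = (Bitmap)Bitmap.FromStream(imageStream))
            {
                // If your CAPTCHA has heavy interference, preprocess the image here
                // (binarization, noise removal and so on). Mine is a lucky case with
                // almost no interference, so it can be recognized directly. :-)

                // Initialize the OCR engine.
                tessnet2.Tesseract ocr = new tessnet2.Tesseract();
                // Restrict what can be recognized; here, the digits 0 to 9.
                ocr.SetVariable("tessedit_char_whitelist", "0123456789");
                // Load the language pack: the first argument is the language pack
                // directory, the second is the language name.
                ocr.Init(Application.StartupPath + @"\tessdata", "eng", true);
                // Run the recognition.
                List<tessnet2.Word> result = ocr.DoOCR(bitmap, Rectangle.Empty);
                // Return the recognized text (empty if nothing was found).
                return result.Count > 0 ? result[0].Text : string.Empty;
            }
        }
    }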
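The comment in Recognize mentions binarization as a possible preprocessing step for noisier CAPTCHAs. A minimal sketch of a fixed-threshold binarization follows. The class name CaptchaPreprocess and the idea of averaging the three color channels are illustrative choices, not something the original post prescribes, and a good threshold has to be found by experiment for each site.

```csharp
using System.Drawing;

public static class CaptchaPreprocess
{
    // Returns a black-and-white copy of source: pixels whose average
    // channel value is below threshold become black, all others white.
    public static Bitmap Binarize(Bitmap source, int threshold)
    {
        var result = new Bitmap(source.Width, source.Height);
        for (int y = 0; y < source.Height; y++)
        {
            for (int x = 0; x < source.Width; x++)
            {
                Color c = source.GetPixel(x, y);
                int gray = (c.R + c.G + c.B) / 3;
                result.SetPixel(x, y, gray < threshold ? Color.Black : Color.White);
            }
        }
        return result;
    }
}
```

GetPixel and SetPixel are slow, but CAPTCHA images are small enough that this is rarely a problem. The binarized bitmap would be passed to DoOCR in place of the original one.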

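Calling Recognize now gives you the CAPTCHA text. Finally, send a POST request carrying the CAPTCHA and the account information, then check the response status code, or look in the response body for a string that indicates a successful login, and you are done!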
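The last step can be sketched as two small helpers: one that builds the POST body and one that checks the response text. The class name LoginSketch and the form field names username, password and captcha are hypothetical; the real names come from the Fiddler capture of the site in question, as does the success marker.

```csharp
using System;

public static class LoginSketch
{
    // Builds an application/x-www-form-urlencoded body.
    // The field names are placeholders, not taken from any real site.
    public static string BuildBody(string user, string password, string captcha)
    {
        return "username=" + Uri.EscapeDataString(user)
             + "&password=" + Uri.EscapeDataString(password)
             + "&captcha=" + Uri.EscapeDataString(captcha);
    }

    // Treats the login as successful when the response contains marker.
    public static bool LooksLoggedIn(string responseHtml, string marker)
    {
        return !string.IsNullOrEmpty(responseHtml)
            && responseHtml.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

The body would then be passed as the data argument of GetResponse with the method set to "POST". Percent-encoding the values keeps the body pure ASCII, which suits the ASCII encoding GetResponse uses, and because GetResponse returns an empty string on failure, LooksLoggedIn reports false in that case.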

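PS: I consulted a few articles while writing this. They are all on Google; search for tesseract and you will find them. When I have time I will dig deeper into tesseract-ocr, it is a lot of fun.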
