Automatic file encoding detection in SciTE

##########################Start of file fileDetect.py#############################
#encoding:utf8
# Detect file encoding
# Simple method that just checks that the first 1000 lines are valid in each encoding
# and chooses first from set that is valid for all lines checked.
# A better version would allow for a small proportion of failures and rank encodings
# depending on how well they match the input.
import sys 
import os

encodings = [ 
    ['utf-8', 65001, 0], 
    ['cp932', 932, 128],
    ['cp936', 936, 134],
    ['cp949', 949, 129],
    ['cp950', 950, 136],
]

codings = [e[0] for e in encodings]

def EncodingWorks(encoding, text):
    try:
        text.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False
    
# Read up to the first 1000 lines of the file
if len(sys.argv) > 1 and os.path.isfile(sys.argv[1]):
    with open(sys.argv[1], "rb") as f:
        # Iterate lazily so that only the lines actually checked are read
        for lineNumber, line in enumerate(f, 1):
            if lineNumber > 1000:
                break
            # Filter out any encodings that fail
            codings = [c for c in codings if EncodingWorks(c, line)]

codingsKnow = False

comment = ''
for c in codings:
    for e in encodings:
        if e[0] == c:
            codingsKnow = True
            codePage, characterSet = e[1:]
            if codePage:
                print('%scode.page=%s' % (comment, codePage))
            if characterSet:
                print('%scharacter.set=%s' % (comment, characterSet))
            # Display other matches as comments so can check results
            comment = '#' 
# If no encoding could be detected, fall back to cp936 (GBK)
if not codingsKnow:
    print('code.page=936')
    print('character.set=134')
# Change the caret colour so we can see that something happened
print('caret.fore=#4499FF')
############################End of file#######################################
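Then add the following line to the SciTEGlobal.properties configuration file:
command.discover.properties=python /path/to/fileDetect.py "$(FilePath)"
With this in place, SciTE detects the file encoding automatically. The script above can detect UTF-8, GBK, Big5 and similar encodings, which is enough for everyday use.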
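
The comments at the top of the script mention that a better version would tolerate a small proportion of failures and rank the encodings by how well they match. Below is a minimal sketch of that idea, not part of the original script. The function name `rank_encodings` and the 5% default threshold are assumptions chosen for illustration.

```python
def rank_encodings(lines, candidates, max_failure_ratio=0.05):
    """Rank candidate encodings by how many lines fail to decode.

    lines: iterable of bytes objects (for example, the first 1000 lines).
    candidates: encoding names in priority order.
    max_failure_ratio: hypothetical tolerance; encodings that fail on a
    larger share of lines are dropped.
    """
    failures = {name: 0 for name in candidates}
    total = 0
    for line in lines:
        total += 1
        for name in candidates:
            try:
                line.decode(name)
            except UnicodeDecodeError:
                failures[name] += 1
    if total == 0:
        # Nothing to check, so keep the original priority order
        return list(candidates)
    allowed = total * max_failure_ratio
    ranked = [name for name in candidates if failures[name] <= allowed]
    # list.sort is stable, so ties keep the original priority order
    ranked.sort(key=lambda name: failures[name])
    return ranked
```

The first entry of the returned list would take the place of the first surviving entry of `codings` in the script. An empty list means that no encoding stayed within the tolerance, which corresponds to the cp936 fallback.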

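P.S. The code above was written by someone else. It was tested on Linux and requires a Python environment to be installed.
Because it was pasted directly into a web page, copying it straight into a source file may cause problems.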

Encodings

SciTE will automatically detect the encoding scheme used for Unicode files that start with a Byte Order Mark (BOM). The UTF-8 and UTF-16 encodings are recognised including both Little Endian and Big Endian variants of UTF-16.
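
The BOM check described above can be sketched in Python as follows. This only illustrates the idea and is not SciTE's actual implementation; the function name `detect_bom` is hypothetical, and only the UTF-8 and UTF-16 marks named in the text are covered.

```python
import codecs

# Byte Order Marks for the encodings mentioned in the text
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return the encoding whose BOM starts the bytes in data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

For example, calling `detect_bom` on the first few bytes of a file opened in `"rb"` mode returns `None` for a file without a BOM, in which case the coding cookie or the `code.page` setting would decide.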

UTF-8 files will also be recognised when they contain a coding cookie on one of the first two lines. A coding cookie looks similar to "coding: utf-8" ("coding" followed by ':' or '=', optional whitespace, optional quote, "utf-8") and is normally contained in a comment:

# -*- coding: utf-8 -*-
For XML there is a declaration:
<?xml version='1.0' encoding='utf-8'?>
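
The cookie rule quoted above ("coding" followed by ':' or '=', optional whitespace, optional quote, "utf-8", on one of the first two lines) can be approximated with a regular expression. This is a sketch under that reading of the rule, not SciTE's own code; the name `has_utf8_cookie` is hypothetical, and the match is assumed to be case-sensitive.

```python
import re

# "coding", then ':' or '=', optional whitespace, optional quote, "utf-8"
COOKIE = re.compile(rb"coding[:=][ \t]*[\"']?utf-8")

def has_utf8_cookie(data):
    """Return True if one of the first two lines of data has a utf-8 cookie."""
    first_two = data.splitlines()[:2]
    return any(COOKIE.search(line) for line in first_two)
```

Because "encoding" ends in "coding", the same pattern also matches the XML declaration shown above.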

For other encodings set the code.page and character.set properties.
