The example code is as follows.
Typing the variable name a followed by Enter prints the hex-escaped encoding, while print a prints the actual Chinese characters. How is that done?
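A Python 2 str is a plain byte string, so the two behaviours can be sketched on Python 3 with bytes (the 汉/GBK values below match the scenario discussed here):

```python
# Typing `a` + Enter in the Python 2 REPL shows repr(a) (hex escapes);
# `print a` writes the raw bytes, which a GBK (cp936) console renders
# as the character itself.
gbk_bytes = '汉'.encode('gbk')      # the bytes a GBK console stores for 汉
print(repr(gbk_bytes))              # the `a` + Enter view: hex escapes
print(gbk_bytes.decode('gbk'))      # the `print a` view: the character 汉
```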
The variable name + Enter approach
First, note that we are in the interactive environment, where input is parsed as soon as it is entered. At the root, standard input is read just like a file:
int
Py_Main(int argc, char **argv)
{
    ...
    sts = PyRun_AnyFileExFlags(
        fp,
        filename == NULL ? "<stdin>" : filename,
        filename != NULL, &cf) != 0;
    ...
This eventually yields the opcode PRINT_EXPR:
PyObject *
PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
{
    ...
    case PRINT_EXPR:
        v = POP();
        w = PySys_GetObject("displayhook");
        if (w == NULL) {
            PyErr_SetString(PyExc_RuntimeError,
                            "lost sys.displayhook");
            err = -1;
            x = NULL;
        }
        if (err == 0) {
            x = PyTuple_Pack(1, v);
            if (x == NULL)
                err = -1;
        }
        if (err == 0) {
            w = PyEval_CallObject(w, x);
            Py_XDECREF(w);
            if (w == NULL)
                err = -1;
        }
        Py_DECREF(v);
        Py_XDECREF(x);
        break;
    ...
We can see that it looks up a function called displayhook in the sys module and uses it for output.
static PyMethodDef sys_methods[] = {
...
{"displayhook", sys_displayhook, METH_O, displayhook_doc},
...
}
static PyObject *
sys_displayhook(PyObject *self, PyObject *o)
{
    ...
    outf = PySys_GetObject("stdout");
    if (outf == NULL) {
        PyErr_SetString(PyExc_RuntimeError, "lost sys.stdout");
        return NULL;
    }
    if (PyFile_WriteObject(o, outf, 0) != 0)
        return NULL;
    ...
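sys.displayhook still exists on Python 3 and behaves the same way: it writes repr(value) plus a newline to sys.stdout. A quick sketch to confirm:

```python
import io
import sys

# Capture what displayhook writes by temporarily swapping sys.stdout.
buf = io.StringIO()
old_stdout = sys.stdout
sys.stdout = buf
try:
    sys.displayhook('汉')   # what the REPL calls for a bare expression
finally:
    sys.stdout = old_stdout

print(buf.getvalue())       # the repr form, with quotes
```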
Just like writing to a file, the variable o is written to the <stdout> (standard output) file. Note that flags=0 here, and that this value is passed all the way down:
- PyFile_WriteObject
- file_PyObject_Print
- PyObject_Print
- internal_print
- PyObject_Print
- file_PyObject_Print
What flags controls is whether the final string s gets quotes or not, i.e. whether to emit the raw string content without any extra processing.
/* Flag bits for printing: */
#define Py_PRINT_RAW 1 /* No string quotes etc. */
For str and unicode the upstream flow is identical; they split into their own paths starting in internal_print, because PyString_Type's tp_print is non-NULL.
Printing a str variable
We finally reach string_print. Because flags=0, the meaning is "I don't want the raw string, dress it up for me!", so a pile of escaping follows.
static int
string_print(PyStringObject *op, FILE *fp, int flags)
{
    ...
    /* figure out which quote to use; single is preferred */
    quote = '\'';
    if (memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
        !memchr(op->ob_sval, '"', Py_SIZE(op)))
        quote = '"';

    str_len = Py_SIZE(op);
    Py_BEGIN_ALLOW_THREADS
    fputc(quote, fp);
    for (i = 0; i < str_len; i++) {
        /* Since strings are immutable and the caller should have a
           reference, accessing the internal buffer should not be an issue
           with the GIL released. */
        c = op->ob_sval[i];
        if (c == quote || c == '\\')
            fprintf(fp, "\\%c", c);
        else if (c == '\t')
            fprintf(fp, "\\t");
        else if (c == '\n')
            fprintf(fp, "\\n");
        else if (c == '\r')
            fprintf(fp, "\\r");
        else if (c < ' ' || c >= 0x7f)
            fprintf(fp, "\\x%02x", c & 0xff);
        else
            fputc(c, fp);
    }
    fputc(quote, fp);
    Py_END_ALLOW_THREADS
    return 0;
}
Why does printing a show '\xba\xba'? The core line is fprintf(fp, "\\x%02x", c & 0xff);
Since a is a str holding Chinese characters, each character occupies 2 bytes; every byte is printed as two hex digits via %02x, with the prefix \x marking it as hexadecimal.
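The escaping loop can be transcribed into Python almost line for line (py2_str_repr is a hypothetical helper name; Python 3 bytes stand in for the Python 2 str buffer):

```python
def py2_str_repr(data: bytes) -> str:
    """Mimic CPython 2's string_print escaping path (flags=0)."""
    # figure out which quote to use; single is preferred
    quote = "'"
    if b"'" in data and b'"' not in data:
        quote = '"'
    out = [quote]
    for c in data:
        ch = chr(c)
        if ch == quote or ch == '\\':
            out.append('\\' + ch)
        elif ch == '\t':
            out.append('\\t')
        elif ch == '\n':
            out.append('\\n')
        elif ch == '\r':
            out.append('\\r')
        elif c < 0x20 or c >= 0x7f:        # non-printable or high byte
            out.append('\\x%02x' % c)
        else:
            out.append(ch)
    out.append(quote)
    return ''.join(out)

print(py2_str_repr('汉'.encode('gbk')))   # '\xba\xba'
```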
Printing a unicode variable
This is where flags finally matters: we go through PyObject_Repr, which turns the variable op into a string s, which is then printed to fp (that is, <stdout>) via internal_print.
static int
internal_print(PyObject *op, FILE *fp, int flags, int nesting)
{
    ...
    else if (Py_TYPE(op)->tp_print == NULL) {
        PyObject *s;
        if (flags & Py_PRINT_RAW)
            s = PyObject_Str(op);
        else
            s = PyObject_Repr(op);
        if (s == NULL)
            ret = -1;
        else {
            ret = internal_print(s, fp, Py_PRINT_RAW,
                                 nesting+1);
        }
        Py_XDECREF(s);
    ...
PyObject *
PyObject_Repr(PyObject *v)
{
    ...
    if (v == NULL)
        return PyString_FromString("<NULL>");
    else if (Py_TYPE(v)->tp_repr == NULL)
        return PyString_FromFormat("<%s object at %p>",
                                   Py_TYPE(v)->tp_name, v);
    else {
        PyObject *res;
        res = (*Py_TYPE(v)->tp_repr)(v);
    ...
- PyObject_Repr
- unicode_repr
- unicodeescape_string
- unicode_repr
All the unicode tricks live here:
const Py_ssize_t expandsize = 6;
...
repr = PyString_FromStringAndSize(NULL,
2
+ expandsize*size
+ 1);
This is how the space for the returned repr is allocated:
the leading 2: the opening u'
the trailing 1: the closing '
the middle: each code point may expand to up to 6 bytes (expandsize), in the form \uXXXX, where each X is one hex digit (4 bits)
if (ch >= 256) {
    *p++ = '\\';
    *p++ = 'u';
    *p++ = hexdigit[(ch >> 12) & 0x000F];
    *p++ = hexdigit[(ch >> 8) & 0x000F];
    *p++ = hexdigit[(ch >> 4) & 0x000F];
    *p++ = hexdigit[ch & 0x000F];
} /* note: the hex digits are written most-significant first (big-endian order) */
In the end we get u'\u6c49'.
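The BMP branch above can be sketched in Python (py2_unicode_repr is a hypothetical name; the latin-1 and non-BMP branches of unicodeescape_string are elided):

```python
def py2_unicode_repr(s: str) -> str:
    """Mimic unicode_repr/unicodeescape_string for BMP code points."""
    hexdigit = '0123456789abcdef'
    out = ["u'"]                       # the leading 2 bytes: u'
    for ch in s:
        cp = ord(ch)
        if cp >= 256:                  # the \uXXXX branch shown above
            out.append('\\u'
                       + hexdigit[(cp >> 12) & 0xF]
                       + hexdigit[(cp >> 8) & 0xF]
                       + hexdigit[(cp >> 4) & 0xF]
                       + hexdigit[cp & 0xF])
        else:
            out.append(ch)             # the real code escapes some of these too
    out.append("'")                    # the trailing 1 byte: '
    return ''.join(out)

print(py2_unicode_repr('汉'))          # u'\u6c49'
```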
The print statement
First, establish that print compiles to the opcode PRINT_ITEM:
>>> def foo():
... print a, b
...
>>> import dis
>>> dis.dis(foo)
2 0 LOAD_GLOBAL 0 (a)
3 PRINT_ITEM
4 LOAD_GLOBAL 1 (b)
7 PRINT_ITEM
8 PRINT_NEWLINE
9 LOAD_CONST 0 (None)
12 RETURN_VALUE
Do not mistake this for looking up a print function with LOAD_GLOBAL and then calling it with CALL_FUNCTION. That only happens if you write it like this:
>>> def foo():
... getattr(__builtins__,'print')('a')
...
>>> dis.dis(foo)
2 0 LOAD_GLOBAL 0 (getattr)
3 LOAD_GLOBAL 1 (__builtins__)
6 LOAD_CONST 1 ('print')
9 CALL_FUNCTION 2
12 LOAD_CONST 2 ('a')
15 CALL_FUNCTION 1
18 POP_TOP
19 LOAD_CONST 0 (None)
22 RETURN_VALUE
Back to the point: how does PRINT_ITEM end up printing the Chinese characters?
TARGET_NOARG(PRINT_ITEM)
{
    v = POP();
    if (stream == NULL || stream == Py_None) {
        w = PySys_GetObject("stdout");
        if (w == NULL) {
            PyErr_SetString(PyExc_RuntimeError,
                            "lost sys.stdout");
            err = -1;
        }
    }
    /* PyFile_SoftSpace() can execute arbitrary code
       if sys.stdout is an instance with a __getattr__.
       If __getattr__ raises an exception, w will
       be freed, so we need to prevent that temporarily. */
    Py_XINCREF(w);
    /* if everything went fine so far, check whether a space needs inserting */
    if (w != NULL && PyFile_SoftSpace(w, 0))
        err = PyFile_WriteString(" ", w);
    if (err == 0)
        err = PyFile_WriteObject(v, w, Py_PRINT_RAW);
    if (err == 0) {
        /* XXX move into writeobject() ? */
        if (PyString_Check(v)) {
            /* for a str object: decide from its last character
               whether to call PyFile_SoftSpace(w, 1) */
            ...
        }
#ifdef Py_USING_UNICODE
        else if (PyUnicode_Check(v)) {
            /* same last-character check for a unicode object */
            ...
        }
#endif
        else
            /* otherwise, always set the flag */
            PyFile_SoftSpace(w, 1);
    }
    Py_XDECREF(w);
    Py_DECREF(v);
    Py_XDECREF(stream);
    stream = NULL;
    if (err == 0) DISPATCH();
    break;
}
The PyFile_SoftSpace at the end exists to put a space between print items; the space itself is written by the PyFile_WriteString(" ", w) above.
/* Interface for the 'soft space' between print items. */
int
PyFile_SoftSpace(PyObject *f, int newflag)
{
    ...
    else if (PyFile_Check(f)) {
        oldflag = ((PyFileObject *)f)->f_softspace;
        ((PyFileObject *)f)->f_softspace = newflag;
    }
    ...
}
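In Python terms, the softspace dance amounts to this (SoftSpaceFile and write_item are hypothetical names; the trailing-newline special cases are ignored):

```python
import io

class SoftSpaceFile:
    """Sketch of how PRINT_ITEM uses PyFile_SoftSpace."""
    def __init__(self, raw):
        self.raw = raw
        self.softspace = 0
    def write_item(self, s):
        # PyFile_SoftSpace(w, 0): read the old flag and clear it
        if self.softspace:
            self.raw.write(' ')   # PyFile_WriteString(" ", w)
        self.softspace = 0
        self.raw.write(s)         # PyFile_WriteObject(v, w, Py_PRINT_RAW)
        self.softspace = 1        # PyFile_SoftSpace(w, 1)

buf = io.StringIO()
f = SoftSpaceFile(buf)
f.write_item('a')
f.write_item('b')
print(buf.getvalue())   # a b
```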
The core is PyFile_WriteObject(v, w, Py_PRINT_RAW); this time the raw, unprocessed string is finally requested. Is that what makes the Chinese characters appear?
print str
The call chain is as follows; we are back at string_print:
- PyFile_WriteObject
- file_PyObject_Print
- PyObject_Print
- internal_print
- string_print
- internal_print
- PyObject_Print
- file_PyObject_Print
This time it is this code that does the work:
if (flags & Py_PRINT_RAW) {
    char *data = op->ob_sval;
    Py_ssize_t size = Py_SIZE(op);
    Py_BEGIN_ALLOW_THREADS
    while (size > INT_MAX) {
        /* Very long strings are written in chunks of
           INT_MAX & ~0x3FFF bytes, i.e. INT_MAX rounded down
           to a multiple of 16 KiB (2^14 bytes).
           Does splitting the bytes like this risk corrupting the
           output? See the open questions at the end. */
        const int chunk_size = INT_MAX & ~0x3FFF;
        fwrite(data, 1, chunk_size, fp);
        data += chunk_size;
        size -= chunk_size;
    }
    fwrite(data, 1, (size_t)size, fp);
    Py_END_ALLOW_THREADS
    return 0;
}
Here fwrite replaces the fputc and fprintf used for non-raw string output.
print unicode
print unicode takes a completely different route: the unicode object is converted into a str value already inside PyFile_WriteObject:
#ifdef Py_USING_UNICODE
    if ((flags & Py_PRINT_RAW) &&
        PyUnicode_Check(v) && enc != Py_None) {
        char *cenc = PyString_AS_STRING(enc);
        char *errors = fobj->f_errors == Py_None ?
                  "strict" : PyString_AS_STRING(fobj->f_errors);
        value = PyUnicode_AsEncodedString(v, cenc, errors);
        if (value == NULL)
            return -1;
    } else {
        ...
    }
    result = file_PyObject_Print(value, fobj, flags);
    Py_DECREF(value);
    return result;
cenc holds the terminal's character encoding, 'cp936', and errors is 'strict'; the core is the call PyUnicode_AsEncodedString(v, cenc, errors);
- PyUnicode_AsEncodedString
  - _PyCodec_EncodeText (_PyCodec_TextEncoder: look up the encoder for this encoding)
    - _PyCodec_EncodeInternal
_PyCodec_EncodeInternal itself is simple (it just calls the encoder function), so we need to focus on how _PyCodec_TextEncoder finds that encoder.
The core is a call to _PyCodec_Lookup(encoding), which initializes a module named encodings, whose directory is
{PythonDir}\Lib\encodings\
During its __init__.py initialization, a search function named search_function is defined and registered with codecs, which takes us back into a C module:
[__init__.py]
codecs.register(search_function)
[_codecsmodule.c]
static PyMethodDef _codecs_functions[] = {
    {"register", codec_register, METH_O,
        register__doc__},
    ...
}

static
PyObject *codec_register(PyObject *self, PyObject *search_function)
{
    if (PyCodec_Register(search_function))
        return NULL;
    Py_RETURN_NONE;
}
int PyCodec_Register(PyObject *search_function)
{
    PyInterpreterState *interp = PyThreadState_GET()->interp;

    if (interp->codec_search_path == NULL && _PyCodecRegistry_Init())
        goto onError;
    if (search_function == NULL) {
        PyErr_BadArgument();
        goto onError;
    }
    if (!PyCallable_Check(search_function)) {
        PyErr_SetString(PyExc_TypeError, "argument must be callable");
        goto onError;
    }
    return PyList_Append(interp->codec_search_path, search_function);

 onError:
    return -1;
}
search_function's logic: use _aliases to map the encoding to the module filename aliased_encoding; try loading the module by that filename first, falling back to using encoding itself as the filename.
From aliases.py we can confirm that cp936 maps to the gbk.py file.
search_function returns a codecs.CodecInfo, which is a subclass of tuple; see {PythonDir}\Lib\codecs.py.
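This alias resolution is easy to verify from Python itself (same behaviour on Python 3):

```python
import codecs

info = codecs.lookup('cp936')     # goes through the registered search functions
print(info.name)                  # 'gbk': cp936 resolves via aliases.py
print(isinstance(info, tuple))    # True: CodecInfo is a tuple subclass
```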
Actually reaching the encoder function is not easy either: it is a PyCFunction, and both _codecs_cn and the gbk functions are defined via macros.
[gbk.py]
codec = _codecs_cn.getcodec('gbk')

class Codec(codecs.Codec):
    encode = codec.encode
    decode = codec.decode

[_codecs_cn.c]
BEGIN_CODECS_LIST
  CODEC_STATELESS(gb2312)
  CODEC_STATELESS(gbk)
  CODEC_STATELESS(gb18030)
  CODEC_STATEFUL(hz)
END_CODECS_LIST

I_AM_A_MODULE_FOR(cn)
// Expanded, the macros amount to this: a few map tables plus encode/decode functions
static const struct dbcs_map _mapping_list[] = {
    { "gb2312", NULL, (void*)gb2312_decmap },
    { "gbkext", NULL, (void*)gbkext_decmap },
    { "gbcommon", (void*)gbcommon_encmap, NULL },
    { "gb18030ext", (void*)gb18030ext_encmap, NULL },
    { "", NULL, NULL } };
static const struct dbcs_map *mapping_list = (const struct dbcs_map *)_mapping_list;

static const MultibyteCodec _codec_list[] = {
    { "gb2312", NULL, NULL, gb2312_encode, NULL, NULL, gb2312_decode, NULL, NULL },
    { "gbk", NULL, NULL, gbk_encode, NULL, NULL, gbk_decode, NULL, NULL },
    { "gb18030", NULL, NULL, gb18030_encode, NULL, NULL, gb18030_decode, NULL, NULL },
    { "hz", NULL, NULL, hz_encode, NULL, NULL, hz_decode, NULL, NULL },
    { "", NULL, } };
static const MultibyteCodec *codec_list = (const MultibyteCodec *)_codec_list;

void
init_codecs_cn(void)
{
    PyObject *m = Py_InitModule("_codecs_cn", __methods);
    if (m != NULL)
        (void)register_maps(m);
}
The first line of [gbk.py] essentially translates to:
1. In the _codecs_cn module's codec_list, find the MultibyteCodec entry whose encoding is 'gbk'; call it codec
2. Create a PyCapsule object codecobj wrapping that codec (capsule->pointer = pointer;)
3. Call _multibytecodec.__create_codec(codecobj), which takes the MultibyteCodec back out and wraps it in a MultibyteCodecObject (self->codec = codec;)
4. Return that MultibyteCodecObject; it is the codec you see in gbk.py
5. Taking encode = codec.encode as an example, the methods of a MultibyteCodecObject are given by the definition of MultibyteCodec_Type:
static struct PyMethodDef multibytecodec_methods[] = {
    {"encode", (PyCFunction)MultibyteCodec_Encode,
     METH_VARARGS | METH_KEYWORDS,
     MultibyteCodec_Encode__doc__},
    {"decode", (PyCFunction)MultibyteCodec_Decode,
     METH_VARARGS | METH_KEYWORDS,
     MultibyteCodec_Decode__doc__},
    {NULL, NULL},
};
To pin down how the encode logic turns unicode into str, we need to understand how this gbk codec is implemented (that is, work through the macro definitions).
Reading the v2.7.15 code is still challenging; the v2.7.9 logic is much clearer. The change exists purely to add a Python 3 compatibility check:
/* Text encoding/decoding API */
PyObject *_PyCodec_LookupTextEncoding(const char *encoding,
                                      const char *alternate_command)
{
    ...
    if (Py_Py3kWarningFlag && !PyTuple_CheckExact(codec)) {
        attr = PyObject_GetAttrString(codec, "_is_text_encoding");
        if (attr == NULL) {
            if (!PyErr_ExceptionMatches(PyExc_AttributeError))
                goto onError;
            PyErr_Clear();
        }
        else {
            is_text_codec = PyObject_IsTrue(attr);
            Py_DECREF(attr);
            if (is_text_codec < 0)
                goto onError;
            if (!is_text_codec) {
                PyObject *msg = PyString_FromFormat(
                        "'%.400s' is not a text encoding; "
                        "use %s to handle arbitrary codecs",
                        encoding, alternate_command);
                if (msg == NULL)
                    goto onError;
                if (PyErr_WarnPy3k(PyString_AS_STRING(msg), 1) < 0) {
                    Py_DECREF(msg);
                    goto onError;
                }
                Py_DECREF(msg);
            }
        }
    }
    ...
For convenience, you may want to get familiar with these functions first:
[_codecsmodule.c]
ascii_encode/ascii_decode
The logic in ascii.py is written very clearly, and its function references point straight at these two!
[_codecs_iso2022.c]
So we have good reason to believe the encoder is a method of this MultibyteCodecObject, and the call ultimately lands in:
static PyObject *
multibytecodec_encode(MultibyteCodec *codec,
                      MultibyteCodec_State *state,
                      const Py_UNICODE **data, Py_ssize_t datalen,
                      PyObject *errors, int flags)
With the codec struct for the encoding protocol in one hand ({ "gbk", NULL, NULL, gbk_encode, NULL, NULL, gbk_decode, NULL, NULL }) and the unicode data we passed in the other, it uses the intermediate object MultibyteEncodeBuffer buf to churn the input buf.inbuf into the output buf.outobj.
In plain terms: it turns the unicode 0x6c49 into the GBK 0xbaba, which is then fwrite'd to <stdout>.
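That end-to-end claim (U+6C49 in, 0xBABA out) is easy to verify from Python directly (Python 3 shown; 'cp936' resolves to the same codec):

```python
s = '汉'
assert ord(s) == 0x6c49               # the unicode code point
encoded = s.encode('gbk')             # what print unicode does on a cp936 console
assert encoded == b'\xba\xba'         # the GBK bytes fwrite'n to <stdout>
assert encoded == s.encode('cp936')   # cp936 is an alias of gbk
print(hex(ord(s)), encoded)
```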
[multibytecodec.c]
static PyObject *
multibytecodec_encode(MultibyteCodec *codec,
                      MultibyteCodec_State *state,
                      const Py_UNICODE **data, Py_ssize_t datalen,
                      PyObject *errors, int flags)
{
    ...
    while (buf.inbuf < buf.inbuf_end) {
        Py_ssize_t inleft, outleft;

        /* we don't reuse inleft and outleft here.
         * error callbacks can relocate the cursor anywhere on buffer*/
        inleft = (Py_ssize_t)(buf.inbuf_end - buf.inbuf);
        outleft = (Py_ssize_t)(buf.outbuf_end - buf.outbuf);

        r = codec->encode(state, codec->config, &buf.inbuf, inleft,
                          &buf.outbuf, outleft, flags);
        if ((r == 0) || (r == MBERR_TOOFEW && !(flags & MBENC_FLUSH)))
            break;
        else if (multibytecodec_encerror(codec, state, &buf, errors, r))
            goto errorexit;
        else if (r == MBERR_TOOFEW)
            break;
    }
[_codecs_cn.c]
ENCODER(gbk)
{
    while (inleft > 0) {
        Py_UNICODE c = IN1;
        DBCHAR code;

        if (c < 0x80) {
            WRITE1((unsigned char)c)
            NEXT(1, 1)
            continue;
        }

        UCS4INVALID(c)
        REQUIRE_OUTBUF(2)

        GBK_ENCODE(c, code)
        else return 1;

        OUT1((code >> 8) | 0x80)
        if (code & 0x8000)
            OUT2((code & 0xFF))          /* MSB set: GBK */
        else
            OUT2((code & 0xFF) | 0x80)   /* MSB unset: GB2312 */
        NEXT(1, 2)
    }

    return 0;
}
/* GBK and GB2312 map differently in few codepoints that are listed below:
*
* gb2312 gbk
* A1A4 U+30FB KATAKANA MIDDLE DOT U+00B7 MIDDLE DOT
* A1AA U+2015 HORIZONTAL BAR U+2014 EM DASH
* A844 undefined U+2015 HORIZONTAL BAR
*/
#define GBK_DECODE(dc1, dc2, assi) \
if ((dc1) == 0xa1 && (dc2) == 0xaa) (assi) = 0x2014; \
else if ((dc1) == 0xa8 && (dc2) == 0x44) (assi) = 0x2015; \
else if ((dc1) == 0xa1 && (dc2) == 0xa4) (assi) = 0x00b7; \
else TRYMAP_DEC(gb2312, assi, dc1 ^ 0x80, dc2 ^ 0x80); \
else TRYMAP_DEC(gbkext, assi, dc1, dc2);
#define GBK_ENCODE(code, assi) \
if ((code) == 0x2014) (assi) = 0xa1aa; \
else if ((code) == 0x2015) (assi) = 0xa844; \
else if ((code) == 0x00b7) (assi) = 0xa1a4; \
else if ((code) != 0x30fb && TRYMAP_ENC_COND(gbcommon, assi, code));
#define _TRYMAP_ENC(m, assi, val) \
((m)->map != NULL && (val) >= (m)->bottom && \
(val)<= (m)->top && ((assi) = (m)->map[(val) - \
(m)->bottom]) != NOCHAR)
#define TRYMAP_ENC_COND(charset, assi, uni) \
_TRYMAP_ENC(&charset##_encmap[(uni) >> 8], assi, (uni) & 0xff)
It is all a series of map lookups, shifts, and arithmetic; we will not go deeper here.
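The final byte assembly in ENCODER(gbk) can be mirrored in Python. Note the table value 0x3a3a for 汉 below is inferred from the observed output (0xBA = 0x3A | 0x80), not read from the actual gbcommon map:

```python
def gbk_assemble(code: int) -> bytes:
    """Mirror the OUT1/OUT2 logic of ENCODER(gbk): turn the looked-up
    map value into the two output bytes."""
    b1 = (code >> 8) | 0x80
    if code & 0x8000:               # MSB set: a GBK-only code point
        b2 = code & 0xFF
    else:                           # MSB unset: a GB2312 code point
        b2 = (code & 0xFF) | 0x80
    return bytes([b1, b2])

# 汉 lives in GB2312; an assumed map value of 0x3a3a yields the observed bytes:
print(gbk_assemble(0x3a3a))        # b'\xba\xba'
```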
Interestingly, the console properties dialog reveals where cp936 and gbk come from; if you are curious, try changing that value, reopening the client, and seeing how Python handles Chinese then.
Summary:
1. fwrite and fputc/fprintf belong to two different groups of output interfaces
2. Printing via variable name + Enter (the repr path) asks for a processed string; print outputs the raw string (the str path)
3. Only print unicode needs transcoding (encode); print str does not
4. Transcoding uses the encoding name to find the corresponding .py file and, from there, the encoder and decoder; the core logic is table lookups and bit operations.
Open questions:
1. The history and rules of character encodings
   Where do ascii, cp936, gbk, gb2312, utf8, and utf16 come from?
2. Python's character-encoding detection and conversion
   When to decode, and when to encode?
   How is a file's character encoding detected?
   How is a string's character encoding detected?
3. The concrete differences between the file I/O interfaces
   fgetc and fputc
   fgets and fputs
   fread and fwrite
   fscanf and fprintf
4. Can writing text with chunked fwrite produce mojibake?