Python: how do I check whether a string is Unicode or ASCII?

Article link: https://blog.csdn.net/weixin_39522103/article/details/111450673

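In Python, what do I have to do to figure out which encoding a string has?

Related: stackoverflow.com/questions/196345/…

Related: stackoverflow.com/q/1303243/64633

Unicode is not an encoding.

More importantly, why do you care?

For a different view, a rebuttal of "Unicode is not an encoding": blog.reverberate.org/2009/01/is-not-encoding.html

@文斯 That blog post no longer exists.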

@约翰西韦布 That is because of [0].

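In Python 3, all strings are sequences of Unicode characters. There is a separate bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which with code along these lines: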

# Python 2 only: the name unicode does not exist in Python 3.
def whatisthis(s):
    if isinstance(s, str):
        print("ordinary string")
    elif isinstance(s, unicode):
        print("unicode string")
    else:
        print("not a string")

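This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a byte string may contain ASCII, encoded Unicode, or even non-textual data.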
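To make the point concrete, here is a minimal Python 3 sketch that reports the type and the content separately. The helper name describe is hypothetical and not part of the original answer; it only illustrates that "which type" and "is the content ASCII-only" are two independent questions.

```python
def describe(s):
    """Return (type label, True if every character or byte is below 128)."""
    if isinstance(s, str):
        # Text: compare each code point against the ASCII limit.
        return "str", all(ord(ch) < 128 for ch in s)
    if isinstance(s, bytes):
        # Raw bytes: iterating over bytes yields integers in Python 3.
        return "bytes", all(b < 128 for b in s)
    return "not a string", False


print(describe("abc"))        # text whose content is ASCII-only
print(describe("\u00fc"))     # text with a non-ASCII character
print(describe(b"abc"))       # bytes whose content is ASCII-only
print(describe(b"\xc3\xbc"))  # bytes holding UTF-8 encoded, non-ASCII data
```

All four combinations occur in practice, which is why a type check alone cannot answer the question as asked.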

I get: NameError: name 'unicode' is not defined

@ProsperousHeart: You are probably using Python 3.

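How to tell if an object is a Unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python does not know what its encoding is. The unicode type is the safer way to store text. If you want to understand this better, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode and is used to store text. What was called str in Python 2 is called bytes in Python 3.

How to tell if a byte string is valid UTF-8 or ASCII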

You can call decode. If it raises a UnicodeDecodeError exception, the byte string was not valid in that encoding.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
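The same check can be wrapped in a small helper. This is a sketch for Python 3; the function name can_decode is an assumption made for illustration, not something from the original answer.

```python
def can_decode(data, encoding):
    """Return True if the byte string decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 bytes for 'Ü'
print(can_decode(u_umlaut, "utf-8"))  # valid UTF-8
print(can_decode(u_umlaut, "ascii"))  # bytes >= 0x80 are never valid ASCII
print(can_decode(b"\xff", "utf-8"))   # 0xff cannot appear in UTF-8
```

A successful decode only shows that the bytes are valid in that encoding, not that they were written in it. Latin-1, for example, accepts every possible byte sequence.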

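Just for other people's reference: str.decode does not exist in Python 3. It looks like you have to use unicode(s, "ascii") or something similar.

@影子: unicode does not exist either.

Sorry, I meant str(s, "ascii").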
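For reference, a short Python 3 sketch of the two spellings mentioned in these comments. Both str(data, encoding) and bytes.decode(encoding) are standard calls; the variable names are only illustrative.

```python
data = b"abc"

# In Python 3 these are two equivalent ways to turn bytes into text.
via_constructor = str(data, "ascii")
via_method = data.decode("ascii")
print(via_constructor == via_method)

# Without an encoding argument, str() does not decode at all:
# it returns the repr of the bytes object as text.
print(str(data))
```

The second call is a common pitfall: it yields the text b'abc', including the prefix and quotes, rather than abc.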

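This is not accurate for Python 3.

@ProsperousHeart Updated to cover Python 3, and to try to explain the difference between byte strings and Unicode strings.

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which means a Unicode string by default) is sufficient.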

isinstance(x, str)

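For Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check with a single statement whether you have a "string-like" object, you can do the following: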

isinstance(x, basestring)

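This is wrong. In Python 2.7, isinstance(u"x", basestring) returns True.

I believe that is the point: using isinstance(x, basestring) is enough to replace the two separate tests above.

No, but isinstance(x, basestring) is True for both unicode and regular strings, which makes the test useless.

It is useful in many cases, but evidently not what the asker meant.

This is the answer to the question. All the others misunderstood what the OP said and gave generic answers about type checking in Python.

This does not answer the OP's question. The title of the question (alone) could be interpreted in a way that makes this answer correct. However, the OP explicitly says "figure out which" in the question's description, and this answer does not address that.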

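Unicode is not an encoding. To quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

...then Unicode is "text-ness";

it is the abstract form of text

Read McMillan's "Unicode In Python, Completely Demystified" talk from PyCon 2008. It explains things much better than most of the related answers on Stack Overflow.

Those slides are probably the best introduction to Unicode I have come across so far.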

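If your code needs to be compatible with both Python 2 and Python 3, you cannot use things like isinstance(s, bytes) or isinstance(s, unicode) directly without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3 and bytes is undefined in Python 2 releases before 2.6 (from 2.6 on, bytes is an alias for str).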
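The try/except wrapping mentioned here can be sketched as follows. The names text_type, binary_type and to_text are assumptions chosen for illustration (they mirror the naming used by six, but this block does not depend on it).

```python
try:
    text_type = unicode  # Python 2: the dedicated text type
except NameError:
    text_type = str      # Python 3: unicode is gone, str is the text type

# bytes exists in Python 3 and in Python 2.6+, where it is an alias for str.
binary_type = bytes


def to_text(s, encoding="ascii"):
    """Decode byte strings to text; return anything else unchanged."""
    if isinstance(s, binary_type) and not isinstance(s, text_type):
        return s.decode(encoding)
    return s


print(to_text(b"abc") == to_text(u"abc"))
```

Resolving the names once at import time keeps the isinstance checks themselves free of version logic.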

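There are some ugly workarounds. An extremely ugly one is to compare the name of the type instead of comparing the type itself. Here is an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)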

An arguably slightly less ugly workaround is to check the Python version number, for example:

import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

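Both of these are unpythonic, and most of the time there is probably a better way.

A better way is probably to use six and test against six.binary_type and six.text_type.

You can probe the type by its name, for example with type(s).__name__.

I am not quite sure of the use case for this code, unless there is a logic error. I think there should be a "not" in the Python 2 code. Otherwise you are converting everything to Unicode strings on Python 3 and doing the opposite on Python 2!

Yes, 利奥弗伦, that is what it does. The standard internal string is Unicode in Python 3 and ASCII in Python 2, so the snippets convert text to the standard internal string type (whether that is Unicode or ASCII).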

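Use:

import six

if isinstance(obj, six.text_type):
    print("obj is a text (Unicode) string")  # placeholder body for illustration

Inside the six library, these names are defined as:

if PY3:
    string_types = str,
    text_type = str
else:
    string_types = basestring,
    text_type = unicode

It should be if isinstance(obj, six.text_type). But yes, this is in my opinion the correct answer.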

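This does not answer the OP's question. The title of the question (alone) could be interpreted in a way that makes this answer correct. However, the OP explicitly says "figure out which" in the question's description, and this answer does not address that.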

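Note that on Python 3, it is not really fair to say any of:

- strs are UTFx for any x (e.g. UTF8)

- strs are Unicode

- strs are ordered collections of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.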
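The distinction between code points and characters can be seen in a short Python 3 sketch. It uses only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"     # 'é' stored as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# len() counts code points, not what a reader perceives as characters.
print(len(composed), len(decomposed))

# The two strings render the same but compare unequal as code point sequences.
print(composed == decomposed)

# Normalizing to NFC composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Both strings display as one accented letter, yet they have different lengths and only compare equal after normalization.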

Even on Python 3, answering this question is not as simple as you might imagine.

An obvious way to test for ASCII-compatible strings is to attempt an encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the two cases.
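Wrapped as a function, the attempted encode becomes a reusable test. This is a Python 3 sketch; the name is_ascii is hypothetical.

```python
def is_ascii(text):
    """Return True if the str can be encoded as ASCII without error."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("Hello there!"))
print(is_ascii("Hello there... \u2603!"))  # contains U+2603 SNOWMAN
```

On Python 3.7 and later, the built-in str.isascii() method answers the same question without the exception handling.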

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method is used to distinguish them.
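One common way such strings arise is the standard surrogateescape error handler, which smuggles undecodable bytes through a str as lone surrogates. The sketch below assumes Python 3; the helper name is_encodable is hypothetical.

```python
raw = b"\xc3"  # a lone lead byte, not valid UTF-8 on its own

# surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original byte exactly.
print(text.encode("utf-8", "surrogateescape") == raw)


def is_encodable(s, encoding="utf-8"):
    """Return True if the str encodes under the default strict handler."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(is_encodable(text))
print(is_encodable("Hello there!"))
```

File names returned by the os module use this handler, so lone surrogates can show up in ordinary programs.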

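This is the correct answer for Python 3.

This may help someone else. I started out testing the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python-version agnostic without a version test or importing six. Please comment with improvements to the sample code below to help other people.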

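def return_utf(s):
    # Plain strings are encoded directly.
    if isinstance(s, str):
        return s.encode('utf-8')
    # Numbers are converted to their string form first.
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    # Anything else: try encoding it, then fall back to its str() form.
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s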
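If only Python 3 has to be supported, the same idea can be written without exception handling. This is a sketch under that assumption, not the poster's code; the name to_utf8_bytes is hypothetical, and bytes input is assumed (not verified) to be UTF-8 already.

```python
def to_utf8_bytes(value):
    """Return a UTF-8 byte string for text, bytes, or any other object."""
    if isinstance(value, bytes):
        # Assumption: the caller's bytes are already UTF-8; passed through as-is.
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    # Numbers and other objects: encode their str() form.
    return str(value).encode("utf-8")


print(to_utf8_bytes("\u00dc"))
print(to_utf8_bytes(3.5))
print(to_utf8_bytes(b"raw"))
```

Checking for bytes first keeps already-encoded data from being converted through str(), which would produce the repr of the bytes object instead of its content.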