Java 8: UTF-16 isn't the default charset, but UTF-8 is

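Java and Unicode encodings
This article looks at how Java uses the UTF-16 encoding to handle Unicode characters, and discusses conversion between different encodings in practice and the problems that can arise.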

Let's back up a bit…

Java's text datatypes use the UTF-16 character encoding of the Unicode character set, as do VB4/5/6/A/Script, JavaScript, .NET, and others. You can see this in the operations the string API offers, such as indexing and length.
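To make the point about the string API concrete, here is a minimal sketch. The class and method names (`Utf16Units`, `codeUnits`, `codePoints`) are made up for illustration; only the `String` and `Character` calls are standard.

```java
class Utf16Units {
    // Counts UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Counts Unicode codepoints, which can be fewer than the code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String accented = "\u00C8";      // È (U+00C8): one UTF-16 code unit
        String emoji = "\uD83D\uDE00";   // U+1F600: a surrogate pair, two code units

        System.out.println(codeUnits(accented) + " " + codePoints(accented));
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji));

        // charAt indexes code units; codePointAt reassembles the surrogate pair.
        System.out.println(Character.isHighSurrogate(emoji.charAt(0)));
        System.out.println(Integer.toHexString(emoji.codePointAt(0)));
    }
}
```

For the second string, `length()` reports 2 while the codepoint count is 1, which is the UTF-16 encoding showing through the API.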

Libraries support converting between the text datatypes and byte arrays using various encodings. Some of them are categorized as "Extended ASCII", but stating that is a very poor substitute for naming the character encoding actually being used.
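A small sketch of what naming the encoding looks like in practice, using the standard `String.getBytes(Charset)` and `new String(byte[], Charset)` calls. The helper names (`NamedEncodings`, `encode`, `decode`, `hex`) are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NamedEncodings {
    // Text to bytes, with the encoding stated explicitly.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Bytes to text, with the encoding stated explicitly.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00C8"; // È
        System.out.println(hex(encode(text, StandardCharsets.ISO_8859_1)));
        System.out.println(hex(encode(text, StandardCharsets.UTF_8)));

        // Decoding with a different encoding than was used to encode
        // does not fail here; it silently yields different text.
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

The same character becomes the single byte `C8` in ISO-8859-1 and the two bytes `C3 88` in UTF-8, which is why "Extended ASCII" alone does not tell a reader how to decode the bytes.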

Some operating systems allow the user to designate a default character encoding. (Most users don't know or care, though.) Java attempts to pick this up. It is only useful when the program knows that input from the user is in that character encoding, or that output should be. This century, users dealing in text files prefer to pick a specific encoding and pass the files unchanged across systems; they don't appreciate lossy conversions, and therefore don't have any use for this concept. From a program's point of view, it is never what you want unless it is exactly what you want.
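As a rough illustration of how the default sneaks in, the sketch below compares the no-argument `getBytes()` with an explicit charset. `DefaultCharsetDemo` and `usesDefault` are hypothetical names; the value printed for the default depends on the platform, the JVM version, and how the JVM was started, so no particular output is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetDemo {
    // True when the no-argument getBytes() agrees with the default charset,
    // i.e. the default is being applied implicitly.
    static boolean usesDefault(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Whatever Java picked up from the environment.
        System.out.println("default: " + Charset.defaultCharset());
        System.out.println(usesDefault("\u00C8"));

        // Naming the encoding makes the result independent of the environment.
        byte[] explicit = "\u00C8".getBytes(StandardCharsets.UTF_8);
        System.out.println(explicit.length);
    }
}
```

Passing the charset explicitly is the usual way to avoid depending on the default at all.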

Where a conversion would be lossy, you have the choice of substituting a replacement character (such as '?'), omitting the character, or throwing an exception.
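In Java these three choices map onto `CodingErrorAction.REPLACE`, `IGNORE`, and `REPORT` on a `CharsetEncoder`. The following is a sketch only; `LossyChoices` and `toAscii` are illustrative names, and US-ASCII is chosen just because È does not exist in it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LossyChoices {
    // Encodes text as US-ASCII, handling unencodable characters per `action`.
    static byte[] toAscii(String text, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(action)
                .onMalformedInput(action);
        ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String text = "a\u00C8b"; // È has no US-ASCII representation
        CodingErrorAction[] actions = {
            CodingErrorAction.REPLACE,  // substitute the encoder's replacement
            CodingErrorAction.IGNORE,   // drop the character
            CodingErrorAction.REPORT    // throw an exception
        };
        for (CodingErrorAction action : actions) {
            try {
                byte[] bytes = toAscii(text, action);
                System.out.println(action + ": "
                        + new String(bytes, StandardCharsets.US_ASCII));
            } catch (CharacterCodingException e) {
                System.out.println(action + ": " + e);
            }
        }
    }
}
```

Note that `String.getBytes(Charset)` always replaces silently, so the encoder API is needed when an exception is the preferred outcome.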

A character encoding is a map between a codepoint (an integer) of a character set and one or more code units, according to the definition of the encoding. A code unit has a fixed size, and the number of code units needed for a codepoint might vary by codepoint.
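The variation in code units per codepoint can be observed directly. This sketch relies on the fact that a UTF-8 code unit is one byte and a UTF-16 code unit is one Java `char`; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitCounts {
    // UTF-8 code units are 8 bits wide, so the byte count is the unit count.
    static int utf8Units(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16 code units are 16 bits wide; Java's char is one such unit.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xC8, 0x20AC, 0x1F600}; // A, È, €, an emoji
        for (int cp : samples) {
            System.out.printf("U+%04X utf8=%d utf16=%d%n",
                    cp, utf8Units(cp), utf16Units(cp));
        }
    }
}
```

For these four samples UTF-8 needs 1, 2, 3, and 4 code units respectively, while UTF-16 needs 1, 1, 1, and 2.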

In libraries, it is not generally useful to have an array of code units, so they take the further step of converting to/from an array of bytes. byte values do range from -128 to 127; however, that's the Java interpretation as two's complement 8-bit integers. As the bytes are understood to be encoding text, the values are interpreted according to the rules of the character encoding.
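The signed view of a byte is only a matter of how Java prints it. A short sketch, with `UnsignedView` and `unsigned` as illustrative names:

```java
import java.nio.charset.StandardCharsets;

class UnsignedView {
    // Reinterprets a Java byte (-128..127) as the 0..255 value an
    // encoding table would list. Byte.toUnsignedInt(b) is equivalent.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] bytes = "\u00C8".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes[0]);                                 // Java's signed view
        System.out.println(unsigned(bytes[0]));                       // the same bits, unsigned
        System.out.println(Integer.toHexString(unsigned(bytes[0])));  // as hex
    }
}
```

The single ISO-8859-1 byte for È prints as -56 in Java's signed view and as 200 (hex c8) once masked; the bits are identical.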

Because some Unicode encodings have code units more than one byte long, byte order becomes important. So, at the byte array level, there is UTF-16 Big Endian and UTF-16 Little Endian. When communicating a text file or stream, you send the bytes as well as sharing knowledge of the encoding. This "metadata" is required for understanding: UTF-16BE or UTF-16LE, for example. To make that a bit easier, Unicode allows some metadata at the beginning of the file or stream to indicate the byte order. It is called the byte-order mark (BOM). So, the external metadata can share the encoding (say, UTF-16), while the internal metadata shares the byte order. Unicode allows the BOM to be present even when byte order is not relevant, such as with UTF-8. So, if the understanding is that the bytes are text encoded with any Unicode encoding and a BOM is present, then it's a very simple matter to figure out which Unicode encoding it is and what the byte order is, if relevant.
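The byte-order distinction and the BOM are both visible from Java. The sketch below shows the bytes produced by the `UTF-16`, `UTF-16BE`, and `UTF-16LE` charsets, plus a simplified BOM sniffer. `BomDemo`, `hex`, and `detect` are hypothetical helpers, and `detect` deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as UTF-16LE's.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simplified sniffing: names the encoding implied by a leading BOM,
    // or "none" when no BOM is present. UTF-32 is not handled.
    static String detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }

    public static void main(String[] args) {
        String text = "\u00C8"; // È
        byte[] withBom = text.getBytes(StandardCharsets.UTF_16);
        System.out.println(hex(withBom));                                   // BOM, then the code unit
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE)));  // no BOM, big endian
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE)));  // no BOM, little endian
        System.out.println(detect(withBom));

        // Decoding as "UTF-16" consumes the BOM; decoding as "UTF-16BE"
        // keeps it as a leading U+FEFF character.
        System.out.println(new String(withBom, StandardCharsets.UTF_16).length());
        System.out.println(new String(withBom, StandardCharsets.UTF_16BE).length());
    }
}
```

Java's `UTF-16` charset writes a big-endian BOM (`FE FF`) when encoding, while `UTF-16BE` and `UTF-16LE` write none, so the choice of charset name decides whether a BOM shows up in the output.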

1) You are seeing the BOM in some of your Unicode encoding outputs.

2) È is not in the ASCII character set. What would you want to happen in this case? I often prefer an exception.

3) The system you were using, for your account, at the time of your tests, may have had UTF-8 as the default character encoding. Is that important to the way you want to encode, and have encoded, your text files on that system?
