Unicode and UTF-8

万物皆字节

Tags: unicode, utf-8, encoding

Article link: https://blog.csdn.net/Aqu415/article/details/74011729

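Concepts

Encoding scheme: put simply, it maps each character to a specific code (for example, in the ASCII table the character a maps to the number 97).

Storage: these codes (such as the 97 above) are stored on the computer's storage medium as machine-level 0s and 1s. These are not the characters "0" and "1" we usually picture, but physical states of the medium. On a hard disk, for example, the head changes the magnetization of each storage cell, and those two states are what we call 0 and 1.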
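The two steps above (character to code, then code to bits) can be sketched in a few lines of Java. This is only an illustration: the class name `CodePointDemo` and the helpers `codeOf` and `toBits` are made up for this example and are not part of any library.

```java
class CodePointDemo {
    // Returns the numeric code that the first character of s maps to.
    static int codeOf(String s) {
        return s.codePointAt(0);
    }

    // Renders the low 8 bits of a code as an 8-digit binary string,
    // i.e. the bit pattern of the single byte that would be stored.
    static String toBits(int code) {
        String bits = Integer.toBinaryString(code & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        int code = codeOf("a");
        System.out.println(code);         // 97
        System.out.println(toBits(code)); // 01100001
        System.out.println((char) code);  // a
    }
}
```

The last line goes the other way round: casting 97 back to `char` recovers the character a.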

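Unicode and UTF-8

I once read online that Unicode is like an interface in Java, while UTF-8 and UTF-32 are like its implementing classes.

My first reaction was that this is wrong, because in Java "unicode" looks like a real, concrete encoding that can be passed to getBytes. See the example below: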

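package com.fzh.test;

import java.io.UnsupportedEncodingException;

public class U {
	public static void main(String[] args) throws UnsupportedEncodingException {
		String info = "a";
		// Print how many bytes each charset needs for the same one-character string.
		System.out.println("ascii " + info.getBytes("ascii").length);     // 1
		// "unicode" is Java's alias for UTF-16: 2-byte byte order mark + 2 bytes for "a".
		System.out.println("unicode " + info.getBytes("unicode").length); // 4
		System.out.println("utf-8 " + info.getBytes("utf-8").length);     // 1
		System.out.println("utf-32 " + info.getBytes("utf-32").length);   // 4
		// Going the other way: the code 97 maps back to the character a.
		System.out.println((char) 97);
	}
}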

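With the "unicode" charset, Java stores the character "a" in 4 bytes, and some pages I found through Baidu claim that Unicode stores every character in 4 bytes (4 x 8 = 32 bits). That claim is not quite accurate. Unicode itself only assigns each character a number (its code point). In Java, "unicode" is an alias for UTF-16, and the 4 bytes here are a 2-byte byte order mark followed by 2 bytes for "a". The encoding that really uses 4 bytes for every character is UTF-32.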
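To see where the 4 bytes come from, it helps to print the bytes themselves rather than just their count. The sketch below does that; `ByteDump` and `hex` are names invented for this example, and it assumes a JDK where "unicode" resolves to UTF-16, as it does in the example above.

```java
import java.nio.charset.Charset;

class ByteDump {
    // Encodes text with the named charset and returns the bytes as space-separated hex.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("a", "unicode"));  // FE FF 00 61 (byte order mark + "a")
        System.out.println(hex("a", "UTF-16BE")); // 00 61
        System.out.println(hex("a", "UTF-32"));   // 00 00 00 61
        System.out.println(hex("a", "UTF-8"));    // 61
    }
}
```

The first two bytes of the "unicode" output, FE FF, are the byte order mark; only the last two bytes carry the character.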

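This raises a problem: with a fixed 4-byte layout, a large share of characters are stored in 4 bytes even though they need far fewer.

For example:

00000000 00000000 00000000 00000001

could just as well be written as

00000001

which saves three bytes.

UTF-8 was created to solve exactly this waste of space.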
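As a rough illustration of how much of a 4-byte value is just leading zeros, the sketch below counts how many bytes remain once the all-zero leading bytes are dropped. `LeadingZeros` and `significantBytes` are hypothetical names, and this is not how UTF-8 computes its length (UTF-8 also spends some bits on markers); it only shows the waste in a fixed 4-byte layout.

```java
class LeadingZeros {
    // Number of bytes left after dropping all-zero leading bytes from a 32-bit value.
    // A value of 0 still needs one byte.
    static int significantBytes(int codePoint) {
        int bits = 32 - Integer.numberOfLeadingZeros(codePoint);
        return Math.max(1, (bits + 7) / 8);
    }

    public static void main(String[] args) {
        System.out.println(significantBytes(1));       // 1
        System.out.println(significantBytes(97));      // 1 (the letter a)
        System.out.println(significantBytes(0x4F60));  // 2 (a Chinese character)
        System.out.println(significantBytes(0x1F600)); // 3 (an emoji)
    }
}
```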

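First, what UTF-8 is:

UTF stands for Unicode Transformation Format. UTF-8 is built on the Unicode code table (where the character a maps to the number 97). The difference only shows up when bytes are written to or read from disk: instead of a fixed 4-byte layout like UTF-32, UTF-8 uses a variable length of 1 to 4 bytes. English letters take 1 byte and the vast majority of Chinese characters take 3 bytes (common Japanese kana and kanji and Korean Hangul also take 3 bytes). See the example: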

package com.fzh.test;

import java.io.UnsupportedEncodingException;

public class U {
	public static void main(String[] args) throws UnsupportedEncodingException {
		String info = "你";
		// ASCII cannot represent this character; in practice Java substitutes a single '?'.
		System.out.println("ascii " + info.getBytes("ascii").length);     // 1
		// "unicode" is Java's alias for UTF-16: 2-byte byte order mark + 2 bytes for the character.
		System.out.println("unicode " + info.getBytes("unicode").length); // 4
		System.out.println("utf-8 " + info.getBytes("utf-8").length);     // 3
		System.out.println("utf-32 " + info.getBytes("utf-32").length);   // 4
		System.out.println((char) 97);                                    // a
	}
}
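The 3 bytes that UTF-8 reports for this character are not arbitrary. The sketch below hand-encodes a code point using the standard UTF-8 bit layout and compares the result with what Java produces. `Utf8Sketch` and `encode` are names invented for this illustration; the method skips validation (surrogates, out-of-range values) that a real encoder performs, so treat it as a teaching aid only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one code point as UTF-8. No validation: assumes a valid,
    // non-surrogate code point in the range 0..0x10FFFF.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        String info = "\u4F60"; // the character from the example above, written as an escape
        byte[] manual = encode(info.codePointAt(0));
        byte[] builtIn = info.getBytes(StandardCharsets.UTF_8);
        System.out.println(manual.length);                  // 3
        System.out.println(Arrays.equals(manual, builtIn)); // true
    }
}
```

The character has code point U+4F60, which needs 15 bits, so it falls into the 3-byte form and comes out as E4 BD A0.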

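ASCII is a 7-bit code that defines only 128 characters (each is usually stored in one 8-bit byte, and various 8-bit extensions came later), so Chinese characters are simply outside what it can represent.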
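One way to check this from Java, sketched below, is to ask the ASCII encoder whether it can represent a string at all. `AsciiCheck` and `fitsInAscii` are names made up for this example; `CharsetEncoder.canEncode` and `String.getBytes(Charset)` are standard library calls.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {
    // True if every character of text can be represented in US-ASCII.
    static boolean fitsInAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String chinese = "\u4F60"; // the Chinese character from the example above
        System.out.println(fitsInAscii("a"));     // true
        System.out.println(fitsInAscii(chinese)); // false

        // getBytes(Charset) replaces an unmappable character with the
        // charset's replacement byte, which for US-ASCII is '?' (63).
        byte[] bytes = chinese.getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytes.length);         // 1
        System.out.println(bytes[0]);             // 63
    }
}
```

This is why the "ascii" line in the example still prints a length of 1: the original character is lost and a single question mark is stored in its place.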
