使用T-SQL管理数据中的Unicode字符

最新推荐文章于 2023-11-27 15:35:29 发布

culuo4781

最新推荐文章于 2023-11-27 15:35:29 发布

阅读量1.1k

点赞数

文章标签：字符串 python java 数据库机器学习

原文链接：https://www.sqlshack.com/manage-unicode-characters-in-data-using-t-sql/

版权

本文介绍如何在SQL Server中处理Unicode字符，包括Unicode数据类型、获取特殊字符列表以及使用T-SQL删除字符串中的特殊字符，以解决编码问题。

摘要由CSDN通过智能技术生成

In this article, I’ll provide some useful information to help you understand how to use Unicode in SQL Server and address various compilation problems that arise from the Unicode characters’ text with the help of T-SQL.

在本文中，我将提供一些有用的信息，以帮助您了解如何在SQL Server中使用Unicode，并借助T-SQL解决由于Unicode字符的文本而引起的各种编译问题。

什么是Unicode？ (What is Unicode?)

The American Standard Code for Information Interchange (ASCII) was the first extensive character encoding format. Originally developed in the US, and intended for English, ASCII could only accommodate encoding for 128 characters. Character encoding simply means assigning a unique number to every character being used. As an example, we show the letters ‘A’,’a’,’1′ and the symbol ‘+’ become numbers, as shown in the table:

美国信息交换标准码（ASCII）是第一种扩展的字符编码格式。 ASCII最初在美国开发，并且旨在英语，只能容纳128个字符的编码。字符编码只是意味着为正在使用的每个字符分配一个唯一的数字。例如，我们显示字母“ A”，“ a”，“ 1”和符号“ +”成为数字，如下表所示：

ASCII(‘A’)	ASCII(‘a’)	ASCII(‘1’)	ASCII(‘+’)
65	97	49	43

ASCII（'A'）	ASCII（'a'）	ASCII（'1'）	ASCII（'+'）
65	97	49	43

The T-SQL statement below can help us find the character from the ASCII value and vice-versa:

下面的T-SQL语句可以帮助我们从ASCII值中查找字符，反之亦然：

SELECT CHAR(193) as Character

Here is the result set of ASCII value to char:

这是char的ASCII值的结果集：

SELECT ASCII('Á') as ASCII_

Here is the result set of char to ASCII value:

这是char到ASCII值的结果集：

While ASCII encoding was acceptable for most common English language characters, numbers and punctuation, it was constraining for the rest of the world’s dialects. As a result, other languages required different encoding schemes and character definitions changed according to the language. Having encoding schemes of different lengths required programs to figure out which one to apply depending on the language being used.

尽管ASCII编码对于大多数常见的英语语言字符，数字和标点符号是可以接受的，但它限制了世界其他地方的方言。结果，其他语言需要不同的编码方案，并且字符定义根据语言而改变。具有不同长度的编码方案需要程序根据所使用的语言找出要应用的编码方案。

Here is where international standards become critical. When the entire world practices the same character encoding scheme, every computer can display the same characters. This is where the Unicode Standard comes in.

这是国际标准变得至关重要的地方。当整个世界都采用相同的字符编码方案时，每台计算机都可以显示相同的字符。这是 Unicode标准进来。

Encoding is always related to a charset, so the encoding process encodes characters to bytes and decodes bytes to characters. There are several Unicode formats: UTF-8, UTF-16 and UTF-32.

编码始终与字符集相关，因此编码过程会将字符编码为字节，然后将字节解码为字符。有几种Unicode格式： UTF-8 ， UTF-16和UTF-32 。

UTF-8 uses 1 byte to encode an English character. It uses between 1 and 4 bytes per character and it has no concept of byte-order. All European languages are encoded in two bytes or less per character UTF-8使用1个字节对英语字符进行编码。每个字符使用1到4个字节，并且没有字节顺序的概念。所有欧洲语言均以每个字符两个或更少的字节编码
UTF-16 uses 2 bytes to encode an English character and it is widely used with either 2 or 4 bytes per character UTF-16使用2个字节来编码英文字符，并且广泛使用每个字符2或4个字节
UTF-32 uses 4 bytes to encode an English character. It is best for random access by character offset into a byte-array UTF-32使用4个字节来编码英文字符。最好通过字符偏移到字节数组中进行随机访问

Special characters are often problematic. When working with different source frameworks, it would be preferable if every framework agreed as to which characters were acceptable. A lot of times, it happens that developers perform missteps to identify or troubleshoot the issue, and however, those issues are identified with the odd characters in the data, which caused the error.

特殊字符通常是有问题的。当使用不同的源框架时，最好是每个框架都就可接受的字符达成一致。很多时候，开发人员会执行错误的步骤来识别或排除问题，但是，这些问题是用数据中的奇数字符识别的，从而导致了错误。

SQL Server中的Unicode数据类型 (Unicode data types in SQL Server)

Microsoft SQL Server supports the below Unicode data types:

Microsoft SQL Server支持以下Unicode数据类型：

nchar nchar
nvarchar nvarchar
ntext 文字

The Unicode terms are expressed with a prefix “N”, originating from the SQL-92 standard. The utilization of nchar, nvarchar and ntext data types are equivalent to char, varchar and text. The Unicode supports a broad scope of characters and more space is expected to store Unicode characters. The most extreme size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar. For example:

Unicode术语以前缀“ N”表示，该前缀源自SQL-92标准。 nchar，nvarchar和ntext数据类型的使用等效于char，varchar和text。 Unicode支持广泛的字符范围，预计将存储更多空间来存储Unicode字符。 nchar和nvarchar列的最大大小为4,000个字符，而不是像char和varchar这样的8,000个字符。例如：

N’Mãrk sÿmónds’

N'Mãrksÿmónds'

All Unicode data practices the identical Unicode code page. Collations do not regulate the code page, which is being used for Unicode columns. Collations control only attributes such as comparison rules and case sensitivity.

所有Unicode数据都使用相同的Unicode代码页。排序规则不规范用于Unicode列的代码页。归类仅控制比较规则和区分大小写之类的属性。

This T-SQL statement prints the ASCII values and characters for the ASCII 193-200 range:

此T-SQL语句打印ASCII 193-200范围内的ASCII值和字符：

SELECT CHAR(193), CHAR(194), CHAR(195), CHAR(196), CHAR(197), CHAR(198), CHAR (199), CHAR (200)

CHAR(193)	CHAR(194)	CHAR(195)	CHAR(196)	CHAR(197)	CHAR(198)	CHAR(199)	CHAR(200)
Á	Â	Ã	Ä	Å	Æ	Ç	È

CHAR（193）	炭黑（194）	CHAR（195）	CHAR（196）	CHAR（197）	CHAR（198）	炭黑（199）	炭（200）
一个	一个	一个	一个	一个	Æ	C	È

获取SQL Server中的特殊字符列表 (Get a list of special characters in SQL Server)

Here are some of the Unicode character sets that can be represented in a single-byte coding scheme; however, the character sets require multi-byte encoding. For more information on character sets, check out the below function that returns the ASCII value and character with positions for each special character in the string with the help of T-SQL statements:

这是一些可以用单字节编码方案表示的Unicode字符集。但是，字符集需要多字节编码。有关字符集的更多信息，请查看以下函数，该函数借助T-SQL语句返回字符串中每个特殊字符的ASCII值和位置。

功能： (Function:)

CREATE FUNCTION [dbo].[Find_Unicode]
(
    @in_string nvarchar(max)
)
RETURNS @unicode_char TABLE(id INT IDENTITY(1,1), Char_ NVARCHAR(4), position BIGINT)
AS
BEGIN
    DECLARE @character nvarchar(1)
    DECLARE @index int
 
    SET @index = 1
    WHILE @index <= LEN(@in_string)
    BEGIN
        SET @character = SUBSTRING(@in_string, @index, 1)
        IF((UNICODE(@character) NOT BETWEEN 32 AND 127) AND UNICODE(@character) NOT IN (10,11))
        BEGIN
      INSERT INTO @unicode_char(Char_, position)
      VALUES(@character, @index)
    END
    SET @index = @index + 1
    END
    RETURN
END
GO

执行： (Execution:)

SELECT * 
FROM [Find_Unicode](N'Mãrk sÿmónds')

Here is the result set:

这是结果集：

从SQL Server中的字符串中删除特殊字符 (Remove special characters from string in SQL Server)

In the code below, we are defining logic to remove special characters from a string. We know that the basic ASCII values are 32 – 127. This includes capital letters in order from 65 to 90 and lower case letters in order from 97 to 122. Each character corresponds to its ASCII value using T-SQL. The “RemoveNonASCII” function excludes all the special characters from the string and sets up a blank of them:

在下面的代码中，我们定义了从字符串中删除特殊字符的逻辑。我们知道基本的ASCII值是32 –127。这包括从65到90的大写字母和从97到122的小写字母。每个字符对应于使用T-SQL的ASCII值。 “ RemoveNonASCII”函数从字符串中排除所有特殊字符并设置为空白：

CREATE FUNCTION [dbo].[RemoveNonASCII] 
(
    @in_string nvarchar(max)
)
RETURNS nvarchar(MAX)
AS
BEGIN
 
    DECLARE @Result nvarchar(MAX)
    SET @Result = ''
 
    DECLARE @character nvarchar(1)
    DECLARE @index int
 
    SET @index = 1
    WHILE @index <= LEN(@in_string)
    BEGIN
        SET @character = SUBSTRING(@in_string, @index, 1)
   
        IF (UNICODE(@character) between 32 and 127) or UNICODE(@character) in (10,11)
            SET @Result = @Result + @character
        SET @index = @index + 1
    END
 
    RETURN @Result
END

执行： (Execution:)

SELECT dbo.[RemoveNonASCII](N'Mãrk sÿmónds')

SQL Function to remove special characters

These SQL functions can be very useful if you’re working with large international character sets.

如果您使用大型国际字符集，那么这些SQL函数可能会非常有用。

翻译自: https://www.sqlshack.com/manage-unicode-characters-in-data-using-t-sql/

culuo4781

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
使用T-SQL管理数据中的Unicode字符

In this article, I’ll provide some useful information to help you understand how to use Unicode in SQL Server and address various compilation problems that arise from the Unicode characters’ text ...
复制链接

扫一扫