Consecutive data in MySQL: an invalid continuation byte in MySQL data causes an uncaught UnicodeDecodeError

Latest recommended article published on 2022-01-23 20:12:20

魔都财观


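I am moving several hundred million rows of text data from MySQL into a search engine, but I cannot get past a Unicode error in one of the retrieved strings. I tried explicitly encoding and decoding the retrieved strings so that Python would raise the Unicode exception and show me where the problem is.

The exception is thrown after tens of millions of rows have run on my laptop (sigh...), but I cannot catch it, skip that row, and move on, which is what I want. All text in the MySQL database should be UTF-8.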

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte
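For readers unfamiliar with this message, here is a minimal sketch that reproduces it outside MySQL. It assumes, purely for illustration, that the offending row holds Latin-1 text in a column declared as UTF-8; the sample string and the helper name `try_decode_utf8` are hypothetical. In Latin-1 the character 'í' is the single byte 0xED, which UTF-8 reads as the start of a three-byte sequence, so the ordinary letter that follows is rejected as an invalid continuation byte.

```python
# Hypothetical sample: "Día" encoded as Latin-1, so 'í' becomes the byte 0xED.
latin1_bytes = "Día".encode("latin-1")  # b'D\xeda'


def try_decode_utf8(raw):
    """Return (text, None) on success, or (None, error) if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError as err:
        return None, err


text, err = try_decode_utf8(latin1_bytes)
print(err)  # the message names byte 0xed and "invalid continuation byte"
```

The `start` attribute of the exception gives the offset of the bad byte, which helps when inspecting the raw column value.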

cnx = mysql.connector.connect(user='root', password='',
                              host='127.0.0.1',
                              database='bloggz',
                              charset='utf-8')
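One possible way to keep the connector from decoding text itself is to fetch raw bytes and decode each field in your own code, where a failure is easy to catch. The sketch below is only that: `decode_field` is a hypothetical helper, and the Latin-1 fallback is an assumption about what the bad rows contain. With mysql-connector-python, raw rows can be requested with `cnx.cursor(raw=True)`; the byte type returned may differ between versions, so the helper accepts both `bytes` and `bytearray`.

```python
def decode_field(raw, encoding="utf-8", fallback="latin-1"):
    """Decode one raw column value.

    Returns (text, used_fallback). None is passed through for SQL NULL.
    """
    if raw is None:
        return None, False
    data = bytes(raw)  # normalizes bytearray to bytes
    try:
        return data.decode(encoding, errors="strict"), False
    except UnicodeDecodeError:
        # Assumption: rows that are not valid UTF-8 are Latin-1.
        # Latin-1 maps every byte to a character, so this decode cannot fail.
        return data.decode(fallback), True


# Usage with hypothetical raw values, as a raw cursor might return them:
print(decode_field(b"D\xeda"))            # decoded through the fallback
print(decode_field(bytearray(b"plain")))  # valid UTF-8, no fallback
```

If silently repairing rows is not acceptable, `used_fallback` can be used to log the row id and skip it.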

Here are the database character set settings:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR
       Variable_name LIKE 'collation%';

+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
| collation_connection     |        |
| collation_database       |        |
| collation_server         |        |
+--------------------------+--------+

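What is wrong with my exception handling below? Note that the variable "last_feeds_id" is not printed either, but that is probably just further evidence that the except clauses are not being reached.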

last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:
    try:
        # to catch UnicodeErrors and see where the problem lies
        # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
        # also see https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error

        # feeds.URL is varchar(255) in mysql
        enc_url = url.encode(encoding='UTF-8', errors='strict')
        dec_url = enc_url.decode(encoding='UTF-8', errors='strict')

        # texts.title is varchar(600) in mysql
        enc_title = title.encode(encoding='UTF-8', errors='strict')
        dec_title = enc_title.decode(encoding='UTF-8', errors='strict')

        # texts.html is text in mysql
        enc_html = html.encode(encoding='UTF-8', errors='strict')
        dec_html = enc_html.decode(encoding='UTF-8', errors='strict')

        data = {"timestamp": ts,
                "url": dec_url,
                "bid": bid,
                "title": dec_title,
                "html": dec_html}

        es.index(index="blogposts",
                 doc_type="blogpost",
                 body=data)
    except UnicodeDecodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
    except UnicodeEncodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
    except UnicodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
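One likely reading of the symptoms, offered as an assumption and not a confirmed diagnosis: the connector decodes each row while the cursor is being iterated, so the error is raised by the `for ... in cursor` line itself, which sits outside the `try` block. The sketch below illustrates that idea with a hypothetical `FakeCursor` that decodes lazily, and a loop that calls `next()` inside the `try`. The names `FakeCursor` and `index_all` are made up for this example, and whether a real cursor can keep fetching after a failed row depends on the driver.

```python
class FakeCursor:
    """Stand-in for a DB cursor that decodes each row only when it is fetched."""

    def __init__(self, raw_rows):
        self._rows = iter(raw_rows)

    def __iter__(self):
        return self

    def __next__(self):
        feedsid, raw_title = next(self._rows)  # StopIteration ends the iteration
        return feedsid, raw_title.decode("utf-8")  # may raise UnicodeDecodeError


def index_all(cursor):
    """Fetch every row, skipping rows whose text cannot be decoded."""
    indexed, skipped = [], 0
    last_feeds_id = 0
    rows = iter(cursor)
    while True:
        try:
            feedsid, title = next(rows)  # the fetch now happens inside the try
        except StopIteration:
            break
        except UnicodeError as e:
            # UnicodeError is the parent of UnicodeDecodeError and
            # UnicodeEncodeError, so a single clause covers all three.
            skipped += 1
            print("Bad row after feeds id {}: {}".format(last_feeds_id, e))
            continue
        last_feeds_id = feedsid  # remember the last row that decoded cleanly
        indexed.append((feedsid, title))
    return indexed, skipped


sample = [(1, b"ok"), (2, b"D\xeda"), (3, "Día".encode("utf-8"))]
indexed, skipped = index_all(FakeCursor(sample))
print(indexed, skipped)
```

Note also that calling `encode` and then `decode` on a value that is already a Python `str` cannot raise `UnicodeDecodeError`, since the bytes being decoded were just produced by a valid UTF-8 encode.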
