Python formatted output xz: iterating over a large .xz file line by line in Python

I have a large .xz file (a few gigabytes) full of plain text. I want to process the text to create a custom dataset, and I want to read it line by line because it is too big to load at once. Does anyone have an idea how to do this?

I already tried the approach from "How to open and read LZMA file in-memory", but it's not working.

EDIT:

I got this error:

    'ascii' codec can't decode byte 0xfd in position 0: ordinal not in range(128)

It is raised on the line `for line in uncompressed:` from the linked answer.

EDIT2: My code (using Python 3.5):

    with open(filename) as compressed:
        with lzma.LZMAFile(compressed) as uncompressed:
            for line in uncompressed:
                print(line)

Solution

I faced the same question a few weeks ago. This snippet worked for me:

    import lzma

    # mode='rt' decompresses on the fly and yields decoded text lines
    with lzma.open('filename.xz', mode='rt') as file:
        for line in file:
            print(line)

This assumes that the text data in the compressed file is encoded in UTF-8 (which was the case for my data). lzma.open() takes an encoding argument that lets you set another encoding if needed.
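As a minimal sketch of how the encoding argument can be used, the example below writes a tiny Latin-1 encoded .xz file and reads it back line by line. The helper names iter_xz_lines and demo_latin1, the sample file name, and the choice of Latin-1 are assumptions made for illustration only; they are not part of the original answer.

```python
import lzma
import os
import tempfile


def iter_xz_lines(path, encoding="utf-8", errors="strict"):
    """Yield decoded lines from an .xz file one at a time (hypothetical helper)."""
    # Text mode ('rt') decompresses lazily, so the whole file is never in memory.
    with lzma.open(path, mode="rt", encoding=encoding, errors=errors) as handle:
        for line in handle:
            yield line.rstrip("\n")


def demo_latin1():
    """Write a small Latin-1 sample file, then read it back with the same encoding."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "sample.xz")  # illustrative file name
    with lzma.open(path, mode="wt", encoding="latin-1") as handle:
        # "\u00e9" and "\u00ef" are non-ASCII characters that Latin-1 can encode.
        handle.write("caf\u00e9\nna\u00efve\n")
    return list(iter_xz_lines(path, encoding="latin-1"))
```

If the encoding of the data is not known in advance, passing errors="replace" is one way to keep iterating past undecodable bytes, at the cost of losing those characters.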

EDIT (after your own edit): try forcing encoding='utf-8' in lzma.open().
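A likely cause of the 'ascii' codec error in the question's code is that open(filename) opens the compressed file in text mode, so Python tries to decode the raw compressed bytes before lzma ever sees them. An .xz file starts with the byte 0xFD, which matches the byte reported in the error. The sketch below keeps the two-step structure of the question's code but opens the file in binary mode and decodes only after decompression. The helper names first_byte, iter_lines_from_binary_handle and demo are assumptions for illustration.

```python
import io
import lzma
import os
import tempfile


def first_byte(path):
    """Return the first raw byte of a file, read in binary mode."""
    with open(path, "rb") as raw:
        return raw.read(1)


def iter_lines_from_binary_handle(path, encoding="utf-8"):
    """Yield text lines from an .xz file opened explicitly in binary mode."""
    with open(path, "rb") as compressed:  # 'rb': no decoding of compressed bytes
        with lzma.LZMAFile(compressed) as uncompressed:
            # Decode only the decompressed stream, with an explicit encoding.
            with io.TextIOWrapper(uncompressed, encoding=encoding) as text:
                for line in text:
                    yield line


def demo():
    """Create a small sample file and read it back through the binary handle."""
    path = os.path.join(tempfile.mkdtemp(), "sample.xz")  # illustrative name
    with lzma.open(path, mode="wt", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\n")
    return first_byte(path), list(iter_lines_from_binary_handle(path))
```

Calling lzma.open('filename.xz', mode='rt', encoding='utf-8') with the path directly, as in the answer, avoids the intermediate handle altogether and is usually the simpler choice.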
