Using BeautifulSoup with Python 3 - UnicodeDecodeError: 'utf-8' codec can't decode

anxiaogao9537

Tags: python, web scraping

Original link: http://www.cnblogs.com/cyn413/p/7940629.html

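I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3, but some very simple code kept failing. It opens a local file:

    from bs4 import BeautifulSoup

    # SATData and logger are defined elsewhere in the project.
    def load_data(file_path):
        # No encoding is given, so open() decodes with the platform default.
        with open(file_path, 'r') as pf:
            try:
                soup = BeautifulSoup(pf, "html.parser")
                table = soup.find('table')
                rownum = 0
                entry_list = []
                for row in table.findAll('tr'):
                    rownum += 1
                    if rownum != 1:  # skip the header row
                        col = row.findAll('td')
                        entry_list.append(SATData(hostname=col[0].getText().strip(),
                                                  db_instance=col[1].getText().strip(),
                                                  sat_type=col[2].getText().strip(),
                                                  os_version=col[3].getText().strip(),
                                                  signoff_date=col[4].getText().strip(),
                                                  comment=col[5].getText().strip()))
                    if rownum % 500 == 0:  # flush to the database in batches
                        SATData.objects.bulk_create(entry_list)
                        entry_list = []
                        logger.info('Insert Data %d' % rownum)
                SATData.objects.bulk_create(entry_list)
                logger.info('Insert Data %d' % (rownum-1))
            except Exception as e:
                logger.exception(str(e))

It raised the following error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2127: invalid start byte

The problem is in reading the file, not in BeautifulSoup's parsing. open(file_path, 'r') decodes the file with the default encoding (UTF-8 here), and the byte 0xa0 cannot start a UTF-8 sequence, so the file is not UTF-8. In ISO-8859-1, 0xa0 is the non-breaking space. Rather than walk through why the read fails, here is the fix, again in four lines:

    from bs4 import BeautifulSoup
    file = open('index.html', 'r', encoding='iso-8859-1')
    soup = BeautifulSoup(file, 'lxml')
    print(soup)

After that, soup.get_text() returns the text inside the tags.

The download code prints the response headers and any encoding declared in the content, then saves the raw bytes:

    import requests

    # logger is defined elsewhere in the project.
    def download_satdata(tz_today_str, target_file):
        url = 'http://....../Download_SAT_Inventory.asp'
        try:
            logger.info(tz_today_str + ': Start to download OAT_SATData.')
            r = requests.get(url, stream=True)
            test = r.headers
            # get_encodings_from_content() expects str; passing r.content (bytes)
            # raises the TypeError described in note 3 below.
            t1 = requests.utils.get_encodings_from_content(r.text)
            print(test, t1)
            with open(target_file, "wb") as code:
                code.write(r.content)
            logger.info(tz_today_str + ': Downloading OAT_SATData is completed.')
        except Exception as e:
            logger.exception(str(e))

Further reading on requests and encodings:

http://xiaorui.cc/2016/02/19/%E4%BB%A3%E7%A0%81%E5%88%86%E6%9E%90python-requests%E5%BA%93%E4%B8%AD%E6%96%87%E7%BC%96%E7%A0%81%E9%97%AE%E9%A2%98/ (Code analysis: the Chinese encoding problem in the Python requests library)
http://blog.csdn.net/a491057947/article/details/47292923 (Encoding problems when using requests in Python)

Note:

1. How do you convert between bytes and str in Python 3? Use decode() to turn bytes into str, and encode() to turn str into bytes.

2. TypeError: write() argument must be str, not bytes
   Change the open mode to 'wb+', which opens a binary file for reading and writing (and truncates it).

3. TypeError: cannot use a string pattern on a bytes-like object
   Open the file with 'rb+' and then decode the content with the right encoding (the cause is usually data that is not UTF-8):

       f = open(fileName, "rb+")
       content = f.read().decode('gbk')

4. Files are normally opened in text mode, which means reads and writes go through an encoding. When none is given, the default is platform-dependent; it is often UTF-8 on Linux and macOS, but not necessarily on Windows. Adding 'b' to the mode opens the file in binary mode, where reads and writes deal in bytes. In text mode, platform-specific line endings (\r\n on Windows, \n on Linux) are converted to \n on read, and \n is converted back to the platform line ending on write. This behind-the-scenes conversion is very useful for text, but it corrupts binary files such as JPEG or EXE files, so always open those in binary mode.

Links:

1. https://stackoverflow.com/questions/26612492/python-unicodedecodeerror-utf-8-codec-cant-decode-byte-invalid-continuati (Python: UnicodeDecodeError: 'utf-8' codec can't decode byte…invalid continuation byte)
2. http://blog.csdn.net/kelindame/article/details/75014485 (Python: converting Chinese text from iso-8859-1 to utf8)
3. http://blog.csdn.net/zm2714/article/details/8012474 (Reading and writing txt files in different encodings with Python)
4. http://outofmemory.cn/code-snippet/629/python-duxie-file-setting-file-charaeter-coding-biru-utf-8 (Reading and writing files in Python and setting the file encoding, e.g. utf-8)
5. https://www.cnblogs.com/dengyg200891/p/6059277.html (How to solve encoding problems when reading and writing files in Python 3)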
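
The fix in this post hard-codes iso-8859-1. When the encoding of a local HTML file is not known in advance, one option is to read the raw bytes and try a few candidate encodings in order. The sketch below only illustrates that idea: the helper name read_text_with_fallback, the candidate list, and the generated demo file are all assumptions, not part of the original code.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Hypothetical helper: return (text, encoding) for the first candidate
    encoding that decodes the whole file without error."""
    with open(path, "rb") as fh:      # binary mode: no implicit decoding
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue                  # try the next candidate
    # Only reached if no candidate worked (iso-8859-1 accepts every byte).
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Demo file containing byte 0xa0, the byte from the error message in the post.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(b"<table><tr><td>host\xa0a</td></tr></table>")
    demo_path = tmp.name

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)

print(used)        # the encoding that succeeded
print(repr(text))  # the decoded markup, ready to hand to an HTML parser
```

The returned string can be passed to BeautifulSoup in place of the file object. Note that iso-8859-1 maps every possible byte to a character, so it never fails; that is why it sits last in the list, and a successful decode does not prove the guess was right.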
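
Notes 1 to 3 can be reproduced in a few lines. This is a minimal sketch under assumed conditions: the sample string, the gbk encoding, and the temporary file are chosen purely for illustration.

```python
import os
import re
import tempfile

text = "编码 test\n"
data = text.encode("gbk")        # str -> bytes
roundtrip = data.decode("gbk")   # bytes -> str

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Note 2: a file opened in text mode rejects bytes.
try:
    with open(path, "w", encoding="gbk") as fh:
        fh.write(data)
except TypeError as exc:
    write_error = str(exc)

# A file opened in binary mode accepts bytes as-is.
with open(path, "wb") as fh:
    fh.write(data)

# Note 3: reading in binary mode yields bytes, and a str regex
# pattern cannot be applied to bytes.
with open(path, "rb") as fh:
    raw = fh.read()
try:
    re.findall(r"\w+", raw)
except TypeError as exc:
    pattern_error = str(exc)

# Decoding first turns the bytes back into str, so the pattern works.
words = re.findall(r"\w+", raw.decode("gbk"))

os.remove(path)
print(write_error)
print(pattern_error)
```

Both error messages printed at the end correspond to the TypeErrors listed in the notes; in each case the cure is to keep bytes and str on the correct side of an explicit encode() or decode() call.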

Reposted from: https://www.cnblogs.com/cyn413/p/7940629.html
