A problem encountered while writing a Python crawler:
Python code (the code itself is fine; the problem is in the HTML document)
from lxml import etree

# Load the local HTML document
html = etree.parse(r"测试.html")
result = etree.tostring(html, encoding='utf-8').decode('utf-8')
print(result)
HTML document (the version that triggers the error)
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>etree的parse方法</title>
</head>
<body>
    <ul>
        <li>1</li>
        <li>2</li>
        <li>3</li>
        <li>4</li>
        <li>5</li>
        <li>6</li>
        <li>7</li>
        <li>8</li>
        <li>9</li>
        <li>10</li>
    </ul>
</body>
</html>
Solution:
The part to change is the <meta> tags inside <head>.
After the change:
<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
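Add a "/" at the end of each <meta> tag so that it is self-closing. etree.parse uses an XML parser by default, so an unclosed <meta> tag is reported as a syntax error even though it is valid HTML.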
After this change, the script runs successfully.
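An alternative that leaves the HTML file untouched is to tell lxml to parse it as HTML. The sketch below assumes the failure comes from etree.parse defaulting to the XML parser, and passes an etree.HTMLParser instead. The in-memory document `data` and the variable names are illustrative and not part of the original script; a real script would pass the file path (for example r"测试.html") to etree.parse in place of the BytesIO object.

```python
import io

from lxml import etree

# Illustrative stand-in for the HTML file: the <meta> tag is not self-closed.
data = """<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <title>etree的parse方法</title>
</head>
<body>
    <ul>
        <li>1</li>
        <li>2</li>
        <li>3</li>
    </ul>
</body>
</html>""".encode("utf-8")

# Default behaviour: etree.parse uses an XML parser, which rejects the
# unclosed <meta> tag.
try:
    etree.parse(io.BytesIO(data))
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

# With an HTML parser, the same bytes are accepted without any edits.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(io.BytesIO(data), parser)

items = tree.xpath("//li/text()")
print(items)

result = etree.tostring(tree, encoding="utf-8").decode("utf-8")
print(result)
```

Which fix fits better depends on who controls the file. Editing the tags works for a local test page, while pages fetched by a crawler usually cannot be edited, so the HTML parser is likely the more practical choice there.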
End.