Web Scraping Diary
2021-1-19
1. Setting up the pip and Python environment
Error 1:
from bs4 import BeautifulSoup
with open('D:/Coding/pycharm/jike/2021-1-18/html1/Untitled-1.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'xlml')
    print(Soup)
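There is actually a second mistake in this code: I misspelled the parser name as 'xlml' when it should be 'lxml', so the next step fails as well.
Error message:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xab in position 83: illegal multibyte sequence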
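To see where this error comes from, here is a minimal sketch. It assumes the HTML file was saved as UTF-8 and that Windows is using GBK as its default text encoding, which is what the message above suggests. The temporary file and the helper name read_text are made up for the illustration; they are not part of my original script.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file using an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical sample: a tiny UTF-8 HTML file that contains Chinese text.
html = "<p>中</p>"
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(html)

try:
    # Roughly what open(path, 'r') does when the default encoding is GBK.
    read_text(path, "gbk")
    gbk_failed = False
except UnicodeDecodeError as err:
    gbk_failed = True
    print("gbk failed:", err)

# Naming the real encoding makes the same file readable.
utf8_text = read_text(path, "utf-8")
print(utf8_text)

os.remove(path)
```

The takeaway is that the file itself is fine; the error comes from decoding UTF-8 bytes as GBK, so passing encoding='utf-8' to open() is the relevant change.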
Error 2:
from bs4 import BeautifulSoup
with open('D:/Coding/pycharm/jike/2021-1-18/html1/Untitled-1.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'xlml')
    print(Soup)
Couldn't find a tree builder with the features you requested: xlml. Do you need to install a parser library?
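The message hints at a missing library, but a misspelled parser name produces exactly the same text. As a rough sketch, the check below tells the two cases apart using only the standard library. The helper parser_status and the PARSER_PACKAGES table are my own illustration, not part of Beautiful Soup, and the table lists only the commonly documented parser names, so it is not exhaustive.

```python
from importlib.util import find_spec

# Commonly documented BeautifulSoup parser names, mapped to the package each
# one needs (None means it ships with Python's standard library).
# Assumption: this list is illustrative, not exhaustive.
PARSER_PACKAGES = {
    "html.parser": None,
    "lxml": "lxml",
    "lxml-xml": "lxml",
    "xml": "lxml",
    "html5lib": "html5lib",
}


def parser_status(name):
    """Return 'unknown', 'missing', or 'ok' for a parser name (hypothetical helper)."""
    if name not in PARSER_PACKAGES:
        return "unknown"  # probably a typo such as 'xlml'
    package = PARSER_PACKAGES[name]
    if package is not None and find_spec(package) is None:
        return "missing"  # valid name, but the package is not installed
    return "ok"


for candidate in ("xlml", "lxml", "html.parser"):
    print(candidate, "->", parser_status(candidate))
```

With this check, 'xlml' comes back as "unknown", which points at the spelling rather than at the installation.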
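At this point I followed an online tutorial and tried to install lxml, but the command prompt could not find pip or python, so I kept searching until I found a fix.
Following a post on CSDN, I opened "Edit the system environment variables" and added the folders that contain pip.exe and python.exe to Path, and that solved it.
I am using the Python 3.7 that ships with Anaconda. In theory every version from 3.4 onward bundles pip, but I was still a little nervous about whether pip.exe was really there; a file search turned it up.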
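A quick way to confirm that the Path change worked, instead of searching by hand, is to ask Python where it finds the two commands. This is only a sketch; find_tools is a name I made up, and the result depends entirely on the machine it runs on.

```python
import shutil
import sys


def find_tools(names=("python", "pip")):
    """Map each command name to the executable found on PATH, or None."""
    return {name: shutil.which(name) for name in names}


for name, location in find_tools().items():
    print(name, "->", location or "not on PATH")

# The interpreter running this script. If pip is installed for it, pip can be
# started as "python -m pip" even when pip.exe itself is not on PATH.
print("current interpreter:", sys.executable)
```

If pip shows up as "not on PATH" while python is found, running python -m pip install lxml is a workable alternative to editing the environment variables.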
2. Installing lxml with pip
C:\Users\xxx>pip install lxml
Requirement already satisfied: lxml in d:\anacondanew\lib\site-packages (4.2.5)
You are using pip version 10.0.1, however version 20.3.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
Following the hint, I upgraded pip:
C:\Users\xxx>python -m pip install --upgrade pip
Collecting pip
Using cached https://files.pythonhosted.org/packages/54/eb/4a3642e971f404d69d4f6fa3885559d67562801b99d7592487f1ecc4e017/pip-20.3.3-py2.py3-none-any.whl
Installing collected packages: pip
Found existing installation: pip 10.0.1
Uninstalling pip-10.0.1:
Successfully uninstalled pip-10.0.1
Successfully installed pip-20.3.3
C:\Users\xxx>pip install lxml
Requirement already satisfied: lxml in d:\anacondanew\lib\site-packages (4.2.5)
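"Requirement already satisfied" only says that the pip on Path can see lxml; it does not prove that the interpreter running the script is the same one. The sketch below checks from inside Python. It assumes Python 3.8 or newer, because importlib.metadata is not available in the 3.7 I was using at the time, and installed_version is a hypothetical helper name.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Which interpreter is running, and does it see lxml?
print("interpreter:", sys.executable)
print("lxml:", installed_version("lxml") or "not installed for this interpreter")
```

If the interpreter path printed here is not the Anaconda one that pip reported, the script and pip are using different Python installations.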
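3. Scraping the page content (uncategorized)
After configuring the environment variables I could run pip and python in cmd and see their versions, paths, and other details, but the script still raised an error, so I could not scrape the page or pretty-print its content.
So I followed another blogger's approach and changed two things:
first, I set the encoding explicitly when opening the HTML file, with encoding='utf-8';
second, I used 'html.parser' as the parser instead of 'lxml'.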
from bs4 import BeautifulSoup
with open('D:/Coding/pycharm/jike/2021-1-18/html1/Untitled-1.html', 'r', encoding='utf-8') as wb_data:
    Soup = BeautifulSoup(wb_data, 'html.parser')
    print(Soup)