Crawler Diary 01: Editing the System Environment Variables

鸭脖没了

Last modified 2022-03-05 11:16:21

Category: Crawler Diary. Tags: python, pip, anaconda

First published 2021-01-19 23:50:53

Article link: https://blog.csdn.net/zty5556666/article/details/112855139

Crawler Diary

2021-1-19

1. Setting up the pip and python environment

Error 1:

from bs4 import BeautifulSoup

# No explicit encoding is given, so Python falls back to the system default.
with open('D:/Coding/pycharm/jike/2021-1-18/html1/Untitled-1.html','r') as wb_data:
    Soup = BeautifulSoup(wb_data,'xlml')
    print(Soup)

There is actually a second mistake here: I typed 'xlml' where it should be 'lxml', which causes its own problem in the next step.

Error message:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xab in position 83: illegal multibyte sequence
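
A likely cause, assuming a Chinese-locale Windows machine: when `open()` is called without `encoding`, Python uses the locale's preferred encoding (GBK here), while the HTML file was probably saved as UTF-8. The sketch below reproduces that mismatch on an in-memory byte string. The sample bytes and the helper name `try_decode` are illustrative and not taken from the original file.

```python
import locale

# What open() falls back to when no encoding is given
# (for example 'cp936', an alias of GBK, on Chinese Windows).
print("default text encoding:", locale.getpreferredencoding(False))

# Illustrative sample: the UTF-8 bytes of "<p>爬</p>" (E7 88 AC encodes 爬).
SAMPLE = b"<p>\xe7\x88\xac</p>"


def try_decode(data, encoding):
    """Return (True, text) if data decodes cleanly, else (False, error message)."""
    try:
        return True, data.decode(encoding)
    except UnicodeDecodeError as exc:
        return False, str(exc)


print(try_decode(SAMPLE, "utf-8"))  # decodes cleanly
print(try_decode(SAMPLE, "gbk"))    # fails: these bytes are not valid GBK
```

The same bytes are fine as UTF-8 and invalid as GBK, which is why naming the encoding explicitly when opening the file is the usual remedy.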

Error 2:

from bs4 import BeautifulSoup

with open('D:/Coding/pycharm/jike/2021-1-18/html1/Untitled-1.html','r') as wb_data:
    Soup = BeautifulSoup(wb_data,'xlml')
    print(Soup)

Couldn't find a tree builder with the features you requested: xlml. Do you need to install a parser library?
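
This message means Beautiful Soup was asked for a parser name it does not recognize, which is what the 'xlml' typo triggers. As a hedged sketch of choosing a parser name defensively: the helper `pick_parser` below is hypothetical, and it only checks whether a package with the given name can be imported, falling back to Python's built-in 'html.parser' otherwise.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if a package with that name is importable, else `fallback`.

    Hypothetical helper: it only checks that the package is installed;
    it does not validate the name against Beautiful Soup itself.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


parser = pick_parser()
print("parser name to pass to BeautifulSoup:", parser)
```

A misspelled name never resolves to an installed package, so it falls through to the built-in parser instead of raising.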

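At this point I followed an online tutorial and tried to install lxml, but I was told that pip and python could not be recognized, so I kept searching and found a fix.

Following a method described on CSDN, I opened "Edit the system environment variables" and added the locations of pip.exe and python.exe (that is, the folders containing them) to Path, and that solved it.

I am using the Python 3.7 that comes with Anaconda. In theory every version from 3.4 onward ships with pip.exe, but I was still a little nervous; in any case, the file can also be located with the search function.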
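
One way to confirm that the Path change took effect is to ask Python which executables a command prompt would resolve. This is a minimal sketch using only the standard library; the helper name `locate` is an assumption made for illustration.

```python
import shutil
import sys


def locate(commands=("python", "pip")):
    """Map each command name to the executable found on PATH, or None."""
    return {name: shutil.which(name) for name in commands}


print("interpreter running this script:", sys.executable)
for name, path in locate().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If `pip` shows as not found, the folder that holds pip.exe is probably still missing from Path, or the terminal was opened before the variable was changed.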

2. Installing lxml with pip

C:\Users\xxx>pip install lxml
Requirement already satisfied: lxml in d:\anacondanew\lib\site-packages (4.2.5)
You are using pip version 10.0.1, however version 20.3.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Following the hint, I upgraded pip:

C:\Users\xxx>python -m pip install --upgrade pip
  Collecting pip
    Using cached https://files.pythonhosted.org/packages/54/eb/4a3642e971f404d69d4f6fa3885559d67562801b99d7592487f1ecc4e017/pip-20.3.3-py2.py3-none-any.whl
  Installing collected packages: pip
    Found existing installation: pip 10.0.1
      Uninstalling pip-10.0.1:
        Successfully uninstalled pip-10.0.1
  Successfully installed pip-20.3.3

C:\Users\xxx>pip install lxml
Requirement already satisfied: lxml in d:\anacondanew\lib\site-packages (4.2.5)
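
The upgrade hint uses `python -m pip` rather than a bare `pip`, which ties pip to one specific interpreter. That matters when several Python installations exist, as is common with Anaconda. A small sketch of the same idea from inside a script follows; the helper name `pip_command` is hypothetical.

```python
import subprocess
import sys


def pip_command(*args):
    """Build a pip invocation tied to the interpreter running this script."""
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    # Prints pip's version and the location of the installation it belongs to.
    subprocess.run(pip_command("--version"), check=False)
```

Calling `pip_command("install", "lxml")` would build the equivalent of the install command shown above, bound to the current interpreter.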

3. Scraping the page content (uncategorized)

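After configuring the environment variables, I could indeed type pip and python in cmd and see their versions, paths, and other information. The script still raised an error, though, so I could not scrape the information or pretty-print the page content.

So I had to follow another blogger's approach and change two things:

First, specify the encoding when opening the HTML file: encoding='utf-8'.

Second, change 'lxml' to 'html.parser'.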

from bs4 import BeautifulSoup

with open('D:/Coding/pycharm/jike/2021-1-18/html1/Untitled-1.html','r',encoding='utf-8') as wb_data:
    Soup = BeautifulSoup(wb_data,'html.parser')
    print(Soup)
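
If the file's encoding is not known in advance, one option is to try a short list of encodings in order. This is a sketch rather than the method used in this post: the function name `read_text_with_fallback` and the default encoding list are assumptions. The returned string could then be passed to `BeautifulSoup(text, 'html.parser')`.

```python
def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next encoding
    raise last_error
```

Trying UTF-8 first is deliberate: it rejects most byte sequences that are not really UTF-8, whereas GBK accepts many byte pairs and could silently produce garbled text.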