Python Web Scraping Basics (1)

Latest recommended article published on 2024-10-02 10:53:34

Commencals

Category: Programming Study Notes | Tags: python

Article link: https://blog.csdn.net/Commencals/article/details/104183311

Python web_scraping Basic Knowledge （1）

I. Opening a Web Page with Python

Example: open the Bing search page.
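A minimal sketch of this step using the standard library's `urllib.request.urlopen`. The helper name `fetch_page` and the choice of UTF-8 decoding are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_page(url):
    """Open `url` and return the response body as text (hypothetical helper)."""
    # urlopen returns a file-like response object; read() gives raw bytes.
    with urlopen(url) as response:
        raw = response.read()
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")


# Example call (requires network access):
# print(fetch_page("https://www.bing.com")[:200])
```

The call is left commented out so the sketch can be loaded without a network connection; uncomment it to print the first 200 characters of the page.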

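II. A Handy Tool: Beautiful Soup

Create a virtual environment and install bs4

If you install libraries directly into the system environment, they all end up in one place. Different projects need different versions, which can leave projects in conflict, so create a virtual environment and install the required libraries there.
PowerShell's default security settings disable script execution; enabling it requires administrator privileges.

Enable: Set-ExecutionPolicy RemoteSigned
Disable: Set-ExecutionPolicy Restricted
Tip: cd .. moves up one folder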

PS C:\Users\Desktop> mkdir xy_web_scraping  # create a new folder on the desktop
PS C:\Users\Desktop> cd .\xy_web_scraping\  # enter the folder
PS C:\Users\Desktop\xy_web_scraping> python -m venv scraping_venv  # create a Python virtual environment in the folder
PS C:\Users\Desktop\xy_web_scraping\scraping_venv\Scripts> ./activate  # activate the virtual environment (enable script execution before this step and disable it again afterwards)
(scraping_venv) PS C:\Users\Desktop\xy_web_scraping\scraping_venv\Scripts>  # now inside the virtual environment
(scraping_venv) PS C:\Users\Desktop\xy_web_scraping> pip install beautifulsoup4  # install beautifulsoup4
(scraping_venv) PS C:\Users\微软\Desktop\xy_web_scraping\scraping_code> python -m idlelib  # open the bundled IDE (IDLE)
Create a new file:

from urllib.request import urlopen
from bs4 import BeautifulSoup  # note the capital B and S

html = urlopen("https://en.wikipedia.org/wiki/Second_United_Front")
bso = BeautifulSoup(html.read())  # create a BeautifulSoup object

print(bso.h1)  # find the 'h1' element in the page source

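Running this prints a warning (not a fatal error) because no parser was specified. That is fine: Beautiful Soup has already picked the most suitable parser for us, html.parser. (The last line of the output, shown in blue, is the result.)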

III. HTTP Status Codes

1xx informational response – the request was received, continuing process
2xx successful – the request was successfully received, understood and accepted
3xx redirection – further action needs to be taken in order to complete the request
4xx client error – the request contains bad syntax or cannot be fulfilled
5xx server error – the server failed to fulfil an apparently valid request
(See the Wikipedia list of HTTP status codes for quick reference.)
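As a small illustration of the five classes, the sketch below maps a numeric code to its class by its first digit and looks up the standard reason phrase with `http.HTTPStatus`. The names `STATUS_CLASSES` and `describe_status` are made up for this example.

```python
from http import HTTPStatus

# The five classes listed above, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "informational response",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase      # standard reason phrase, e.g. "Not Found"
    family = STATUS_CLASSES[code // 100]  # 404 // 100 == 4
    return f"{code} {phrase} ({family})"


print(describe_status(200))
print(describe_status(404))
print(describe_status(503))
```

Note that `HTTPStatus(code)` raises `ValueError` for a code it does not know, so this sketch only covers the registered codes.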

If a request fails with one of these errors, you can handle it as follows:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://en.wikipedia.org/wiki/Second_United_Front111")
    bso = BeautifulSoup(html.read())  # create a BeautifulSoup object
except HTTPError as e:  # the server answered with an error status
    print(e)  # print the error
else:
    print(bso.h1)
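Beyond printing the exception, an `HTTPError` also exposes the status code and reason phrase as attributes. The following sketch, with a hypothetical helper `summarize_http_error`, builds an `HTTPError` by hand so it can run without network access; in a real scraper the exception would come from `urlopen`.

```python
import io
from email.message import Message
from urllib.error import HTTPError


def summarize_http_error(err):
    """Turn an HTTPError into a one-line summary (hypothetical helper)."""
    # err.code is the numeric status, err.reason the reason phrase.
    if 400 <= err.code < 500:
        kind = "client error"
    elif 500 <= err.code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{err.code} {err.reason} ({kind})"


# Hand-built exception for illustration only (URL, headers and body are made up).
fake = HTTPError("https://example.com/missing", 404, "Not Found", Message(), io.BytesIO())
print(summarize_http_error(fake))
```

Branching on `err.code` lets a scraper treat a missing page (4xx) differently from a temporary server problem (5xx), for example by retrying only the latter.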

IV. Other Errors

URL error (the server cannot be reached at all, for example because the domain does not exist)
Solution:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://en.wikipedia.org/wiki/Second_United_Front111")
    bso = BeautifulSoup(html.read())  # create a BeautifulSoup object
except HTTPError as e:  # the server answered with an error status
    print(e)  # print the error
except URLError as e:  # the server could not be reached
    print(e)
else:
    print(bso.h1)
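The order of the two `except` clauses matters because `HTTPError` is a subclass of `URLError`. The sketch below shows this with a hypothetical helper `try_open`; it triggers a `URLError` offline by using a made-up URL scheme (`foo`), which fails before any network traffic is attempted.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def try_open(url):
    """Return a short string describing what happened when opening `url`."""
    try:
        with urlopen(url, timeout=10) as response:
            return f"OK: read {len(response.read())} bytes"
    except HTTPError as e:   # must come first: HTTPError subclasses URLError
        return f"HTTPError: {e.code}"
    except URLError as e:    # no HTTP response at all (bad host, unknown scheme, ...)
        return f"URLError: {e.reason}"


# HTTPError is a subclass of URLError, so a URLError clause placed first
# would also swallow every HTTPError.
print(issubclass(HTTPError, URLError))

# "foo" is not a scheme urllib knows, so this raises URLError immediately.
print(try_open("foo://example.invalid/page"))
```

Returning a string instead of printing inside the handler is only a convenience for this sketch; it makes the outcome easy to inspect or log.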

No tag error
The page does not contain the requested tag, such as the 'h1' above or 'title'.
In that case the lookup returns None.
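To make the `None` behaviour concrete, here is a small sketch that parses a hand-written HTML string (made up for this example) instead of a live page. It assumes `beautifulsoup4` is installed and passes `"html.parser"` explicitly.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page with a <title> but no <h1> (illustrative only).
page = "<html><head><title>Demo</title></head><body><p>text</p></body></html>"
bso = BeautifulSoup(page, "html.parser")

print(bso.title)   # the <title> tag exists
print(bso.h1)      # no <h1> on the page, so this is None

# Chaining a lookup onto a missing tag raises AttributeError,
# because None has no attributes of its own.
try:
    bso.h1.span
except AttributeError as e:
    print("AttributeError:", e)

# Checking for None first avoids the exception.
heading = bso.h1.get_text() if bso.h1 is not None else "(no h1 found)"
print(heading)
```

This is why code that digs into nested tags usually either checks for `None` at each step or wraps the lookup in a `try`/`except AttributeError`.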
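The site requires human verification (CAPTCHA)
1. Manual handling: complete the verification by hand in a browser, then run the scraper.
2. Workaround: use a method that gets around the verification.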

V. Wrapping the Steps Above in a Function

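from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def get_title(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError):  # the request failed
        return None
    try:
        bso = BeautifulSoup(html.read())
        title = bso.title  # None if the page has no <title>
    except AttributeError:  # a lookup was chained onto a missing tag
        return None
    return title

url = "https://sh.lianjia.com"
title = get_title(url)
if title is None:
    print("There is no title.")
else:
    print(title)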

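VI. Parsing Wikipedia with html.parser

The earlier code did not specify a parser, so every run printed a warning. The code still ran, but the output was cluttered.
When constructing the BeautifulSoup object, you can pass a specific parser: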

from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("https://www.wikipedia.org")
bso=BeautifulSoup(html,"html.parser")
print(bso)

VII. Scraping Wikipedia's Available Languages

In the page source of the Wikipedia home page, each available language is wrapped in <div>XXX</div>, so:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.wikipedia.org/")
bso = BeautifulSoup(html, "html.parser")

a_list = bso.find_all("div")  # find_all (findAll is the older spelling of the same method)

for item in a_list:
    print(item.get_text())  # get_text returns the text inside the tag

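VIII. Filtering by class

Following the steps above does not give us just the ten available languages we want; other items are scraped as well. Looking at the page source, we see that the div tags carry a class attribute, so we can add a filter condition to the find call: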

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.wikipedia.org/")
bso = BeautifulSoup(html, "html.parser")

a_list = bso.find_all("div", {"class": "central-featured-lang"})  # find_all with a filter on the class attribute

for item in a_list:
    print(item.get_text())
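The effect of the class filter can be checked offline on a small stand-in page. In the sketch below only the class name `central-featured-lang` comes from the text above; the HTML itself is made up, and `beautifulsoup4` is assumed to be installed. It also shows the keyword form `class_=`, which is equivalent to the dict form.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia home page (illustrative only).
page = """
<div class="central-featured">
  <div class="central-featured-lang lang1"><strong>English</strong></div>
  <div class="central-featured-lang lang2"><strong>Deutsch</strong></div>
</div>
<div class="footer">Other content</div>
"""
bso = BeautifulSoup(page, "html.parser")

# Without a filter, every div is returned, including wrappers and the footer.
all_divs = bso.find_all("div")

# Dict form, as in the code above.
by_dict = bso.find_all("div", {"class": "central-featured-lang"})
# Keyword form; the trailing underscore avoids the reserved word `class`.
by_keyword = bso.find_all("div", class_="central-featured-lang")

print(len(all_divs))
print([d.get_text(strip=True) for d in by_dict])
print(by_dict == by_keyword)
```

A div with several classes still matches as long as one of them equals the requested name, which is why the `lang1` and `lang2` entries are both found while the `central-featured` wrapper is not.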