BeautifulSoup简介
Beautiful Soup提供一些简单的、python式的函数用来处理导航、搜索、修改分析树等功能。它是一个工具箱,通过解析文档为用户提供需要抓取的数据,因为简单,所以不需要多少代码就可以写出一个完整的应用程序。
Beautiful Soup自动将输入文档转换为Unicode编码,输出文档转换为utf-8编码。你不需要考虑编码方式,除非文档没有指定一个编码方式,这时,Beautiful Soup就不能自动识别编码方式了。然后你仅仅需要说明一下原始编码方式就可以了。
Beautiful Soup已成为和lxml、html6lib一样出色的python解释器,为用户灵活地提供不同的解析策略或强劲的的速度。
BeautifulSoup findall()
find_all()方法搜索当前tag的所有tag子节点,并判断是否符合过滤器的条件:find_all(name,attrs,recursive,text,**kwargs)
name参数可以查找所有名字为name的tag,字符串对象会被自动忽略掉,它不仅可以传字符串,还可以传入 列表/正则表达式/方法/布尔值/keyword 参数等作为参数去搜索标签
例:
- 传入字符串:
soup.find_all(["a","b"])
- 传入正则表达式:
soup.find_all(re.comple("^b"))
- 传入布尔值:
soup.find_all(True)
- 传入方法:校验当前元素,如果包含了class属性却不包含id属性,那么将返回True
def hac_class_but_no_id(tag):
return tag.has_attr('class') and not tag.has_attr('id)
soup.find_all(has_class_but_no_id)
- 指定关键词:
soup.find_all(id='link2')
soup.find_all(href=re.compile("elsie") # 查找链接地址中带有elsie的标签
soup.find_all("a", class_="sister") # class_当作关键词
BeautifulSoup 对象
Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是python对象,所有对象可以归纳为4中:
- Tag:HTML中的标签
- NavigableString:标签内部非属性文字
- BeautifulSoup:对象标识的是一个文档的全部内容
- Comment:标签注释文字
对于Tag,他有两个重要属性,是 name 和 attrs :
print soup.name
| print soup.p.attrs
| print soup.head.name
等等会输出所有属性;
像单独获取某个属性可以使用 get,或者通过选取的方式:
print soup.title.get('class')
| print soup.title['class']
代码展示
免费代理ip网址:https://www.kuaidaili.com/free/inha/1/
代理ip存活检测:http://httpbin.org/get 或 http://icanhazip.com/
import requests
from bs4 import BeautifulSoup
import re
import signal
import sys
import os
import random
list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"
]
def handler(signal_num, frame): # 用于处理信号
Goduplicate()
print("\nDone,the available ip have been put in 'proxy_ips.txt'...")
print("\nSuccessed to exit.")
sys.exit(signal_num)
def proxy_spider():
headers = {"User-Agent": random.choice(list)} # 随机User-Agent
for i in range(20): # 爬取前20页
url = 'https://www.kuaidaili.com/free/inha/' + str(i + 1) + '/'
r = requests.get(url=url, headers=headers)
html = r.text
# print(r.status_code)
soup = BeautifulSoup(html, "html.parser")
datas = soup.find_all(name='tr')
for data in datas: # 根据页面特征来匹配内容
soup_proxy = BeautifulSoup(str(data), "html.parser")
proxy_contents = soup_proxy.find_all(name='td')
try:
ip_org = str(proxy_contents[0].string)
port = str(proxy_contents[1].string)
protocol = str(proxy_contents[3].string)
ip = protocol.lower() + '://' + ip_org
proxy_check(ip, port, protocol)
# print(ip)
except:
pass
def proxy_check(ip, port, protocol): # 代理存活检查
proxy = {}
proxy[protocol.lower()] = '%s:%s' % (ip, port)
# print(proxy)
headers = {"User-Agent": random.choice(list),
"Connection": "keep-alive"}
try:
r = requests.get(url='http://httpbin.org/get', headers=headers, proxies=proxy, timeout=5)
ip_available = re.findall(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", r.text)[0] # 匹配ip
ip_availables = protocol.lower() + '://' + ip_available
# print(ip_availables)
# print(ip)
if ip_availables == ip:
print(str(proxy) + 'is ok')
with open("proxy_ip.txt", "a", encoding="utf-8") as ip:
ip.write(ip_available + ':' + port + '\n')
# else:
# print('no')
except Exception as e:
# print e
pass
def Goduplicate():
with open("proxy_ip.txt", encoding="utf-8") as urls:
url = urls.readlines()
new_url = []
for id in url:
if id not in new_url:
new_url.append(id)
for i in range(len(new_url)):
with open("proxy_ips.txt", "a") as edu:
edu.write(new_url[i])
os.remove("proxy_ip.txt")
if __name__ == '__main__':
signal.signal(signal.SIGINT, handler)
proxy_spider()
免费代理还是不靠谱,这里爬取了20页也才捕捉到6个可用ip:
代码还需要进一步优化,虽然爬取了20页但其中有很多页由于访问速度过快而被禁止,还需学习如何修改为分布式爬虫。