Matplotlib Python plotting tutorial (莫烦Python)
《零基础入门学习Python》 (Learning Python from Scratch, by 小甲鱼), P54-64
HTML
from urllib.request import urlopen
html = urlopen(URL).read().decode('utf-8')  # decode() is needed for pages containing Chinese text
print(html)
Read the web page, then use regular expressions to select the content you want.
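As a minimal sketch of the selection step, the snippet below runs `re.search` and `re.findall` on a small hard-coded HTML string. The sample page, the helper names `extract_title` and `extract_links`, and the patterns are all assumptions made for illustration; patterns like these only suit simple, well-formed markup.

```python
import re

# Hypothetical sample page, standing in for the `html` string read above.
sample_html = """
<html>
<head><title>Sample page</title></head>
<body>
<p>First paragraph.</p>
<a href="https://example.com/a">A</a>
<a href="https://example.com/b">B</a>
</body>
</html>
"""


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title>(.*?)</title>", html, flags=re.DOTALL)
    return match.group(1) if match else None


def extract_links(html):
    """Return the value of every double-quoted href attribute."""
    return re.findall(r'href="(.*?)"', html)


print(extract_title(sample_html))
print(extract_links(sample_html))
```

The non-greedy `.*?` stops at the first closing delimiter, so each match stays inside a single tag or attribute.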
BeautifulSoup
sudo pip3 install beautifulsoup4
sudo pip3 install lxml
from bs4 import BeautifulSoup
# Parse the page with the lxml parser
soup = BeautifulSoup(html, features='lxml')
print(soup.h1)  # first <h1> element
# Collect the href of every <a> tag that has one
all_href = [a['href'] for a in soup.find_all('a', href=True)]
print('\n', all_href)
BeautifulSoup CSS
# Select every <li> whose class is "month"
month = soup.find_all('li', {"class": "month"})
for m in month:
    print(m.get_text())
BeautifulSoup with regular expressions
import re
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
Requests
sudo pip3 install requests
- get
- post
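To illustrate how the two methods differ, the sketch below builds one GET and one POST request with `requests.Request(...).prepare()` and prints them without sending anything. The endpoint and the field names are placeholders invented for this example.

```python
import requests

# Hypothetical endpoint used only for illustration; nothing is sent.
URL = "https://example.com/search"

# GET: parameters are encoded into the URL's query string.
get_req = requests.Request("GET", URL, params={"q": "python"}).prepare()
print(get_req.method, get_req.url)

# POST: form fields are encoded into the request body instead.
post_req = requests.Request("POST", URL, data={"user": "alice"}).prepare()
print(post_req.method, post_req.url, post_req.body)
```

To actually send such requests, the usual calls are `requests.get(URL, params=...)` and `requests.post(URL, data=...)`.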
Downloading
from urllib.request import urlretrieve
import requests
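A minimal sketch of the `urlretrieve` route, assuming a hypothetical helper named `download` and caller-supplied folder and file names. It saves one resource to disk and returns the path it wrote.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download(url, dest_dir, filename):
    """Save the resource at `url` as dest_dir/filename and return the path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = dest_dir / filename
    urlretrieve(url, str(target))  # fetch the resource and write it to target
    return target
```

A call might look like `download(image_url, "img", "picture.jpg")`, where `image_url` is whatever address was scraped from the page.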
Downloading large files
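For a large file it is usually better to write the response to disk piece by piece than to load it all into memory. The sketch below is one standard-library way to do that, under assumptions: the helper names `copy_in_chunks` and `download_large` and the 64 KiB chunk size are illustrative choices, not taken from the tutorial.

```python
import io
from urllib.request import urlopen


def copy_in_chunks(source, target, chunk_size=64 * 1024):
    """Copy a binary file-like `source` into `target` chunk by chunk.

    Returns the total number of bytes written.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        target.write(chunk)
        total += len(chunk)
    return total


def download_large(url, path, chunk_size=64 * 1024):
    """Stream `url` to `path` without holding the whole body in memory."""
    with urlopen(url) as response, open(path, "wb") as out:
        return copy_in_chunks(response, out, chunk_size)


# Offline demonstration with an in-memory stream instead of a real URL.
buffer = io.BytesIO()
print(copy_in_chunks(io.BytesIO(b"x" * 150_000), buffer))
```

With Requests, the comparable approach is `requests.get(url, stream=True)` followed by a loop over `iter_content(chunk_size=...)`.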
Speeding up crawlers
- Multiprocess distributed crawlers
- Asynchronous loading with Asyncio
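The asynchronous approach can be sketched as follows. This is a simplified illustration: `fetch` only simulates network latency with `asyncio.sleep`, the URLs are invented, and a real crawler would need an asynchronous HTTP client (for example aiohttp) in its place.

```python
import asyncio


async def fetch(url, delay=0.1):
    """Stand-in for a real download: waits `delay` seconds, then returns."""
    await asyncio.sleep(delay)  # the event loop runs other tasks meanwhile
    return f"content of {url}"


async def crawl(urls):
    """Start all fetches concurrently and collect results in input order."""
    return await asyncio.gather(*(fetch(u) for u in urls))


# Hypothetical URLs; nothing is requested over the network.
urls = [f"https://example.com/page/{i}" for i in range(3)]
results = asyncio.run(crawl(urls))
print(results)
```

Because the waits overlap, the three simulated fetches take roughly as long as one, which is where the speed-up for I/O-bound crawling comes from.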
Advanced crawlers
- Selenium
- Scrapy