3. Web Scraping
3.1. Introduction to Scrapers
3.1.1. Why use a scraper
A scraper can automatically collect large amounts of information from web pages.
3.1.2. Understanding page structure
A web page is built mostly out of HTML tags. At the most basic level, you can open a page with urlopen and then match its content with regular expressions.
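As a minimal sketch of this regex approach (the sample HTML below is a made-up stand-in; in practice the string would come from urlopen, as in the BeautifulSoup example later):

```python
import re

# In practice the HTML comes from the network, e.g.:
# from urllib.request import urlopen
# html = urlopen("https://mofanpy.com/static/scraping/basic-structure.html").read().decode('utf-8')
# A small inline sample keeps this sketch self-contained:
html = """
<html>
  <head><title>Scraping tutorial 1</title></head>
  <body>
    <h1>This is the title</h1>
    <p>A paragraph with a <a href="https://mofanpy.com/">link</a>.</p>
  </body>
</html>
"""

# Regular expressions treat the page as plain text and match on tag patterns
title = re.search(r'<title>(.*?)</title>', html).group(1)
links = re.findall(r'href="(.*?)"', html)
print(title)   # Scraping tutorial 1
print(links)   # ['https://mofanpy.com/']
```

This works for simple, regular pages, but the patterns become fragile as the markup grows, which is what motivates BeautifulSoup in the next section.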
3.2. Parsing pages with BeautifulSoup
BeautifulSoup provides higher-level ways to select page content, replacing hand-written regular expressions.
3.2.1. Installation
pip install beautifulsoup4
pip install lxml  # install the parser
Basic usage:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Skip HTTPS certificate verification (avoids SSL errors on some setups)
ssl._create_default_https_context = ssl._create_unverified_context

html = urlopen(
    "https://mofanpy.com/static/scraping/basic-structure.html"
).read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')
print(soup.h1)
print(soup.p)

# Collect the href of every <a> tag on the page
all_href = soup.find_all('a')
all_href = [l['href'] for l in all_href]
print('\n', all_href)
3.2.2. Selecting content by CSS class
month = soup.find_all('li', {'class': 'month'})
for m in month:
    print(m.get_text())
3.2.3. Combining with regular expressions
import re

img_links = soup.find_all('img', {'src': re.compile(r'.*?\.jpg')})
print(img_links)
3.3. Requests
requests is an alternative to the urllib module.
Installation: pip install requests
Basic usage:
import requests

# GET request (url and headers are defined by the caller)
r = requests.get(url, headers=headers)
r.encoding = 'utf-8'
print(r.text)

# POST request: upload a file
file = {'uploadFile': open('./test.gif', 'rb')}
r = requests.post('http://pythonscraping.com/pages/files/processing2.php', files=file)
print(r.text)

# Download: stream a large file to disk chunk by chunk
IMAGE_URL = "https://mofanpy.com/static/img/description/learning_step_flowchart.png"
r = requests.get(IMAGE_URL, stream=True)
with open('image2.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)
3.4. Speeding up the scraper
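Since downloads are I/O-bound, one common speed-up is fetching pages concurrently instead of one at a time. A minimal sketch using a thread pool (the URLs are placeholders, and fetch merely simulates a download so the sketch is self-contained):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    # Stand-in for requests.get(url).text; here we only simulate latency
    time.sleep(0.2)
    return f"content of {url}"

# Placeholder URLs for illustration
urls = [f"https://example.com/page{i}" for i in range(8)]

start = time.time()
# 4 workers overlap the waits, so 8 fetches take ~2 rounds, not 8
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))
elapsed = time.time() - start

print(len(pages))          # 8
print(elapsed < 8 * 0.2)   # True: much faster than fetching serially
```

The same pattern works with multiprocessing for CPU-bound parsing, or with asyncio for very large numbers of requests.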
3.5. Advanced scraping
3.5.1. Selenium
Selenium drives a real browser and can simulate human actions such as clicking and typing.
Installation:
pip install selenium
# Install a browser driver: download geckodriver (for Firefox)
cp geckodriver /usr/local/bin
chmod +x geckodriver
sudo spctl --master-disable  # macOS only: let Gatekeeper run the unsigned binary
Test code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
from bs4 import BeautifulSoup

driver = webdriver.Firefox()

# Example: set cookies before visiting a page, then click through it
# driver.get("http://www.baidu.com")
# cookie1 = {
#     'name': 'a',
#     'value': 'b'
# }
# driver.add_cookie(cookie_dict=cookie1)
# cookie2 = {
#     'name': 'aaaa',
#     'value': 'bb'
# }
# driver.add_cookie(cookie_dict=cookie2)
# driver.get("http://www.baidu.com")
# driver.find_element_by_link_text(u"医院").click()
# driver.find_element_by_link_text(u"检查配送单").click()
# time.sleep(10)
# Select(driver.find_element_by_name("hospitalId")).select_by_visible_text(u"医院A")
# time.sleep(10)
# driver.find_element_by_xpath("(//option[@value='1'])[2]").click()
# time.sleep(10)
# driver.find_element_by_id("btnSearch").click()
# time.sleep(10)
# driver.find_element_by_xpath(u"(//a[contains(text(),'详细')])[2]").click()
# time.sleep(10)

# Load a JavaScript-rendered product list, then parse the rendered HTML
driver.get('https://list.tmall.com/search_product.htm?spm=a221t.1710954.cat.12.3bac287aTblWOe&q=%E7%AF%AE%E7%90%83')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

productList = soup.find_all('div', {'class': 'product'})
# print(len(productList))
for product in productList:
    productId = product['data-id']
    productImg = product.find('a', {'class': 'productImg'})
    img = productImg.find('img')['src']
    productPrice = product.find('p', {'class': 'productPrice'})
    price = productPrice.find('em')['title']
    productTitle = product.find('p', {'class': 'productTitle'})
    title = productTitle.find('a')['title']
    print('id:', productId, ' price:', price, ' title:', title, ' img:', img)

time.sleep(100)
driver.close()