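1. Parse the web page with BeautifulSoup
soup = BeautifulSoup(html, "lxml") parses the entire HTML page into a tree that can be searched.
There are two ways to locate an element:
(1) CSS selector: for example, div.centering_wrapper > img or div.item.name > a
(2) XPath: BeautifulSoup does not evaluate XPath expressions itself; that requires a library such as lxml.
For example, to scrape the images and titles: right-click the page --> Inspect --> select the title and right-click --> Inspect --> look at the class name of the parent div to find a feature that identifies the element uniquely.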
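The locating step can be tried offline on a small fragment. This is a minimal sketch: the HTML below is made up to mirror the class names above, and "html.parser" (the parser bundled with Python) is used in place of "lxml" so the example runs without the lxml package.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mirrors the class names mentioned above.
html = """
<div class="item name"><a href="/a1">Victoria Peak</a></div>
<div class="centering_wrapper"><img src="https://example.com/peak.jpg"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.item.name > a": <a> tags that are direct children of a div
# carrying both the "item" and the "name" class.
links = soup.select("div.item.name > a")
images = soup.select("div.centering_wrapper > img")

print(links[0].get_text())
print(images[0].get("src"))
```

If a selector returns more elements than expected, tightening it with a parent class is usually the first thing to try.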
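2. Describe where the content you want to scrape is located
titles = soup.select('...') with the selector between the quotes. Assign the result to a new name rather than back to soup, so the parsed page stays available for further queries.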
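A short sketch of what select() gives back, assuming a made-up fragment and the built-in "html.parser". select() returns a list of every match (empty when nothing matches), while select_one() returns only the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the class names follow the ones used in these notes.
html = """
<div class="detail">
  <div class="item">Landmark</div>
  <div class="item">Lookout</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

items = soup.select("div.detail > div.item")      # every match, as a list
first = soup.select_one("div.detail > div.item")  # first match, or None
missing = soup.select("div.no_such_class")        # no match: empty list

print(len(items), first.get_text(), len(missing))
```

Because the results are stored under their own names, soup can still be queried again with a different selector.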
3. Extract the information you need from the tags and store it in a dictionary
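The extraction step might look like the sketch below. The HTML is again a made-up fragment, and the names records, title_tag and img_tag are illustrative. get_text() reads the text inside a tag and get('src') reads an attribute; zip() pairs the two lists by position, which assumes every title has a matching image in the same order.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two attractions.
html = """
<div class="item name"><a href="/a1"> Victoria Peak </a></div>
<div class="centering_wrapper"><img src="https://example.com/peak.jpg"></div>
<div class="item name"><a href="/a2">Star Ferry</a></div>
<div class="centering_wrapper"><img src="https://example.com/ferry.jpg"></div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = soup.select("div.item.name > a")
images = soup.select("div.centering_wrapper > img")

records = []
for title_tag, img_tag in zip(titles, images):
    records.append({
        "title": title_tag.get_text(strip=True),  # text with surrounding whitespace removed
        "img": img_tag.get("src"),                # attribute value, or None if absent
    })

for record in records:
    print(record)
```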
Full code:
from bs4 import BeautifulSoup
import requests

url = "https://www.tripadvisor.cn/Attractions-g294217-Activities-Hong_Kong.html"
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')

# Select the title links, the images, and the detail items.
titles = soup.select('div.item.name > a')
imgs = soup.select('div.centering_wrapper > img')
names = soup.select('div.detail > div.item')
# cates = soup.select('div.rating-widget > a')
# print(names)

# Pair each title with its image and store the pair in a dictionary.
for title, img in zip(titles, imgs):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
        # 'cate': cate.get_text(),
    }
    print(data)
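One possible way to harden the script, offered as a sketch rather than the original method: split downloading from parsing, give the request a timeout, and stop on HTTP errors. The function names fetch_html and parse_attractions, the 10-second timeout, and the sample fragment are assumptions for illustration; "html.parser" stands in for "lxml" so the parsing part runs without extra packages.

```python
from bs4 import BeautifulSoup
import requests


def fetch_html(url):
    """Download a page, failing loudly on timeouts and HTTP errors."""
    response = requests.get(url, timeout=10)  # 10 s is an arbitrary choice
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def parse_attractions(html):
    """Return one dict per (title, image) pair found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.select("div.item.name > a")
    images = soup.select("div.centering_wrapper > img")
    return [
        {"title": t.get_text(strip=True), "img": i.get("src")}
        for t, i in zip(titles, images)
    ]


# Offline check with a made-up fragment; no network access is needed.
sample = (
    '<div class="item name"><a href="/a1">Victoria Peak</a></div>'
    '<div class="centering_wrapper"><img src="https://example.com/peak.jpg"></div>'
)
print(parse_attractions(sample))
```

Against a live page the call would be parse_attractions(fetch_html(url)). Whether these selectors still match the current markup of the site is not something the sketch checks.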