A Simple Python Web Scraper with BeautifulSoup


luguanyou


Category: Python scraping. Tags: Python web scraper, BeautifulSoup, selector, locating elements

Article link: https://blog.csdn.net/luguanyou/article/details/80756543


1. Parse the page with BeautifulSoup

soup = BeautifulSoup(html, "lxml") parses the whole HTML page into a tree you can search.
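To make the parsing step concrete, here is a small sketch that parses an HTML string rather than a live page. The markup is invented for illustration, and "html.parser" is used only so the example runs without extra dependencies; the article's "lxml" works the same way when lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
html = """
<html>
  <head><title>Hong Kong Attractions</title></head>
  <body><p class="intro">Top things to do</p></body>
</html>
"""

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string          # text inside the <title> tag
intro_text = soup.find("p").get_text()  # text of the first <p> tag
print(page_title)
print(intro_text)
```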

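There are two common ways to locate elements:

(1) CSS selector, for example div.centering_wrapper > img or div.item.name > a

(2) XPath, which BeautifulSoup does not support directly; it requires a library such as lxml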
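The article leaves the XPath option without an example. As a rough sketch only, and assuming lxml is installed (the article already uses it as the parser backend), the same two elements could be located with XPath as shown here. The sample markup and example.com URL are invented for illustration.

```python
from lxml import html as lxml_html

# Hypothetical markup that mirrors the class names used in the article
sample = """
<html><body>
  <div class="centering_wrapper"><img src="https://example.com/peak.jpg"></div>
  <div class="item name"><a href="/peak">Victoria Peak</a></div>
</body></html>
"""

tree = lxml_html.fromstring(sample)

# @class="..." compares the whole attribute string, so "item name" must match exactly
xpath_img_urls = tree.xpath('//div[@class="centering_wrapper"]/img/@src')
xpath_titles = tree.xpath('//div[@class="item name"]/a/text()')

print(xpath_img_urls)
print(xpath_titles)
```

Note that CSS class selectors match individual classes, while the XPath above compares the full class attribute, so the two approaches can behave differently when a tag carries extra classes.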

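For example, to scrape the images and titles: right-click the page --> Inspect --> right-click the selected title --> Inspect --> look at the class name of the parent div to find a feature that identifies the element uniquely.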
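Once the parent div's class names are known from the browser inspector, they translate directly into CSS selectors. The sketch below uses invented markup (the "listing" wrapper and example.com URLs are assumptions) to show how the two selectors from the article behave.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like what the inspector might show
html = """
<div class="listing">
  <div class="centering_wrapper"><img src="https://example.com/peak.jpg"></div>
  <div class="item name"><a href="/peak">Victoria Peak</a></div>
</div>
<div class="listing">
  <div class="centering_wrapper"><img src="https://example.com/buddha.jpg"></div>
  <div class="item name"><a href="/buddha">Tian Tan Buddha</a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# div.item.name matches a <div> carrying BOTH classes "item" and "name";
# "> a" restricts the match to <a> tags that are direct children of that div
title_links = soup.select("div.item.name > a")
images = soup.select("div.centering_wrapper > img")

title_texts = [a.get_text() for a in title_links]
image_urls = [img["src"] for img in images]
print(title_texts)
print(image_urls)
```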

2. Describe where the content you want is located

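matches = soup.select(...) with the selector passed as a string inside the parentheses. Store the result under a new name rather than overwriting soup, so the parsed page stays available for further queries.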
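It may help to see what select() actually returns. This small sketch, using made-up markup, contrasts select() with select_one() and shows what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<ul><li class="spot">Victoria Peak</li><li class="spot">Star Ferry</li></ul>'
soup = BeautifulSoup(html, "html.parser")

spots = soup.select("li.spot")              # list-like result holding every match
first_spot = soup.select_one("li.spot")     # first matching tag only
missing = soup.select("li.museum")          # no match gives an empty list, not an error
missing_one = soup.select_one("li.museum")  # no match gives None

print(len(spots), first_spot.get_text(), missing, missing_one)
```

Because select() always returns a list, a loop over its result simply does nothing when the selector is wrong, which is worth checking for when a scraper prints no output.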

3. Extract the information you need from the tags and store it in a dictionary
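As a minimal sketch of this step, the snippet below pulls text and an attribute out of two tags and stores them in a dictionary. The markup is invented, and the "width" key is included only to show what a missing attribute looks like.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<div class="item name"><a href="/peak">  Victoria Peak </a></div>
<div class="centering_wrapper"><img src="https://example.com/peak.jpg" alt="Peak"></div>
"""
soup = BeautifulSoup(html, "html.parser")

link = soup.select_one("div.item.name > a")
image = soup.select_one("div.centering_wrapper > img")

record = {
    "title": link.get_text(strip=True),  # visible text with surrounding whitespace trimmed
    "img": image.get("src"),             # attribute value
    "width": image.get("width"),         # attribute not present, so .get() returns None
}
print(record)
```

Using .get("src") rather than image["src"] avoids a KeyError when a tag lacks the attribute.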

Full code:

from bs4 import BeautifulSoup
import requests

url = "https://www.tripadvisor.cn/Attractions-g294217-Activities-Hong_Kong.html"
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')

# Locate the title links and the images with CSS selectors
titles = soup.select('div.item.name > a')
imgs = soup.select('div.centering_wrapper > img')
# Optional extra selectors, left disabled as in the original:
# names = soup.select('div.detail > div.item')
# cates = soup.select('div.rating-widget > a')

# Pair each title with its image; distinct loop names avoid overwriting the lists
for title, img in zip(titles, imgs):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
    }
    # Print inside the loop so every item is shown, not only the last one
    print(data)
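One possible way to restructure the full script is to separate downloading from parsing, so the parsing logic can be checked against saved HTML without touching the network. This is only a sketch: the function names fetch_html and parse_attractions, the User-Agent string, and the 10-second timeout are all assumptions, and the selectors are taken from the article and may no longer match the live site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    # Assumed header; some sites reject requests without a browser-like User-Agent
    headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def parse_attractions(html):
    """Return a list of {'title', 'img'} dicts from the page HTML (hypothetical helper)."""
    # "html.parser" ships with Python; pass "lxml" instead if lxml is installed
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.select("div.item.name > a")
    imgs = soup.select("div.centering_wrapper > img")
    return [
        {"title": t.get_text(strip=True), "img": i.get("src")}
        for t, i in zip(titles, imgs)
    ]


# Offline check against invented markup shaped like the article's selectors
sample_html = """
<div class="centering_wrapper"><img src="https://example.com/peak.jpg"></div>
<div class="item name"><a href="/peak">Victoria Peak</a></div>
"""
rows = parse_attractions(sample_html)
print(rows)
```

To run it against the real page, call parse_attractions(fetch_html(url)) with the url from the full code above. An empty result most likely means the site's markup has changed or the content is loaded by JavaScript.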
