Python Web Scraping in Practice (Part 2)

Echo_HK

Published 2017-02-13 20:49:50


Category: Python Web Scraping Practice Summary. Tags: python, web scraping

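Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.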

Article link: https://blog.csdn.net/hk490871360/article/details/55053908


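Summary: Scraping Content from a Local Web Page

Introduction:

This experiment uses BeautifulSoup to do a simple scrape of a web page and gives a brief introduction to BeautifulSoup along the way. For details, see the BeautifulSoup documentation.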

The sample web page looks like this:

[Image: screenshot of the sample web page]

Task:

Scrape product information from a local web page: product name, price, star rating, and other related fields.

Code:


from bs4 import BeautifulSoup

# Path to the local copy of the sample page
path = './index.html'

with open(path, 'r') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

    # Each CSS selector returns one list of tags, in document order
    titles = soup.select("body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a")
    images = soup.select("body > div > div > div.col-md-9 > div > div > div > img")
    reviews = soup.select("body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right")
    prices = soup.select("body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right")
    stars = soup.select("div > div.ratings > p:nth-of-type(2)")

    # Sanity check: zip() stops at the shortest list, so the counts should match
    print(len(titles), len(images), len(reviews), len(prices), len(stars))
    for title, image, review, price, star in zip(titles, images, reviews, prices, stars):
        title_content = title.get_text()
        review_content = review.get_text()
        price_content = price.get_text()
        image_content = image.get("src")
        # Count the filled-star icons; the string must equal the class attribute exactly
        stars_count = len(star.find_all("span", "glyphicon glyphicon-star"))

        # One record per product
        data = {
            "title": title_content,
            "review": review_content,
            "image": image_content,
            "price": price_content,
            "star": stars_count
        }

        print(data)

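Summary

Main steps for scraping page content with BeautifulSoup

Open the file using a with open(...) statement.
Create a BeautifulSoup object (soup) and use its select() method to retrieve the matching nodes.
Filter the retrieved information and use zip() to combine the parallel result lists, one record per product.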
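
The three steps above can be sketched on a small inline HTML string instead of a file. This is only an illustration: the markup, the class names, and the extract_items helper are made up for this sketch and are not the sample page used in the experiment. Python's built-in html.parser is used here so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the local page; not the real index.html
HTML = """
<div class="item"><h4 class="price">$10</h4><h4><a href="#">Alpha</a></h4></div>
<div class="item"><h4 class="price">$20</h4><h4><a href="#">Beta</a></h4></div>
"""


def extract_items(html):
    """Parse the markup, select parallel lists of tags, and zip them into records."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.select("div.item > h4 > a")
    prices = soup.select("div.item > h4.price")
    return [
        {"title": t.get_text(), "price": p.get_text()}
        for t, p in zip(titles, prices)
    ]


items = extract_items(HTML)
print(items)
```

Returning a list of dictionaries, rather than printing inside the loop, makes the records easier to reuse later, for example when writing them to a file.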

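A brief introduction to Python's zip function:

zip accepts any number of iterables as arguments (including zero or one) and groups their elements by position into tuples. In Python 2 it returns a list of tuples; in Python 3 it returns an iterator, so the examples below wrap the result in list() before printing. This is easier to show than to describe: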

1. Example 1:

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
xyz = zip(x, y, z)
print(list(xyz))

The output is:

[(1, 4, 7), (2, 5, 8), (3, 6, 9)]

This shows how zip basically works: the i-th tuple collects the i-th element of each argument.
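
One detail worth keeping in mind in Python 3 is that zip returns a one-shot iterator rather than a list, so it can only be traversed once. A minimal sketch, with illustrative variable names:

```python
pairs = zip([1, 2, 3], [4, 5, 6])

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

If the paired data is needed more than once, converting it to a list up front, as first_pass does, avoids the empty second pass.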

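2. Example 2:

x = [1, 2, 3]
y = [4, 5, 6, 7]
xy = zip(x, y)
print(list(xy))

The output is:

[(1, 4), (2, 5), (3, 6)]

This shows how zip handles arguments of different lengths: it stops at the shortest one, so the extra 7 in y is dropped.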
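
In the scraping code this truncation means that if one selector matches fewer tags than the others, the extra items are dropped without any error. One possible way to make such a gap visible is the standard library's itertools.zip_longest. The sketch below uses made-up lists, and the None fill value is an assumption chosen for illustration:

```python
from itertools import zip_longest

titles = ["Alpha", "Beta", "Gamma"]
prices = ["$10", "$20"]  # one price is missing

# zip() silently stops at the shorter list
truncated = list(zip(titles, prices))

# zip_longest() pads the shorter list with the fill value instead
padded = list(zip_longest(titles, prices, fillvalue=None))

print(truncated)
print(padded)
```

Comparing the list lengths before zipping, as the experiment code does with its print(len(...)) line, is another way to catch the same problem.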

3. Example 3:

x = [1, 2, 3]
x = zip(x)
print(list(x))

The output is:

[(1,), (2,), (3,)]

This shows how zip behaves with a single argument: each element is wrapped in a one-element tuple.

4. Example 4:

x = zip()
print(list(x))

The output is:

[]

This shows how zip behaves with no arguments: the result is empty.

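5. Example 5:

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
xyz = zip(x, y, z)
u = zip(*xyz)
print(list(u))

The output is:

[(1, 2, 3), (4, 5, 6), (7, 8, 9)]

This is generally thought of as an unzip operation. It works like this:

Before zip(*xyz) runs, xyz produces the tuples (1, 4, 7), (2, 5, 8), (3, 6, 9).

So zip(*xyz) is equivalent to zip((1, 4, 7), (2, 5, 8), (3, 6, 9)).

The result is therefore [(1, 2, 3), (4, 5, 6), (7, 8, 9)].

Note: in a function call, *list or *tuple unpacks the list or tuple and passes its elements as separate positional arguments (provided the function accepts a variable number of positional arguments).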
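
The same unpacking can split a list of records back into separate columns, which can be handy once scraped items have been paired up. A small sketch with made-up data:

```python
records = [("Alpha", 10), ("Beta", 20), ("Gamma", 30)]

# zip(*records) is the same call as zip(("Alpha", 10), ("Beta", 20), ("Gamma", 30))
names, prices = zip(*records)

print(names)
print(prices)
```

Note that each column comes back as a tuple, not a list.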

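6. Example 6:

x = [1, 2, 3]
r = zip(*[x] * 3)
print(list(r))

The output is:

[(1, 1, 1), (2, 2, 2), (3, 3, 3)]

It works like this:

[x] creates a list of lists with a single element, x.

[x] * 3 creates a list of lists with three elements, [x, x, x], all referring to the same list x.

So zip(*[x] * 3) simply means zip(x, x, x).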
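
Because [x] * 3 repeats a reference to the same object rather than copying it, a related idiom puts a single iterator in the list instead, so that zip pulls consecutive elements from it and groups them into fixed-size chunks. This goes slightly beyond the example above and is shown only as a sketch; the chunk size of 3 and the data are arbitrary:

```python
x = [1, 2, 3, 4, 5, 6, 7]

# [x] * 3 holds three references to one list, not three copies
same_list = [x] * 3
print(same_list[0] is same_list[1])

# With one shared iterator, each tuple takes the next three elements in turn
it = iter(x)
chunks = list(zip(*[it] * 3))
print(chunks)
```

As in Example 2, zip stops as soon as any argument runs out, so the leftover 7 does not appear in the result.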
