Scraping "Python books" from Dangdang

Latest recommended article published on 2024-03-25 13:00:06

ytraister

Category: Web scraping. Tags: xpath, regular expressions, python

Article link: https://blog.csdn.net/ytraister/article/details/106002226

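Fetching pages with the requests module is familiar territory by now, so this post focuses on a few XPath extraction techniques instead, with a hands-on example to get comfortable with them.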

Here is the code 👇:

import requests
from lxml import etree
import re
import pymongo

url = "http://search.dangdang.com/?key=python&category_path=01.00.00.00.00.00&page_index=1"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3610.2 Safari/537.36"
}
# client = pymongo.MongoClient("mongodb://localhost:27017", username="admin", password="admin")
# (credentials go to MongoClient; client.admin.authenticate() was removed in PyMongo 4)
# db = client["db_dangdang"]
# coll = db["book"]

response = requests.get(url=url, headers=headers)
response.encoding = "gbk"

html = etree.HTML(response.text)
all_info = html.xpath("//ul[@class='bigimg']//li")

num_re = re.compile(r"(\d+)")
date_re = re.compile(r"\s/(.*)")

for item in all_info:
    info = {}
    info["title"] = item.xpath("./p[@class='name']/a/@title")[0]
    info["price"] = item.xpath("./p[@class='price']")[0].xpath("string(.)")
    info["author"] = item.xpath("./p[@class='search_book_author']/span[1]")[0].xpath("string(.)")

    info["date"] = date_re.findall(item.xpath("./p[@class='search_book_author']/span[2]/text()")[0])[0]
    # info["date"] = item.xpath("./p[@class='search_book_author']/span[2]/text()")[0].strip(" /")

    info["publisher"] = item.xpath("./p[@class='search_book_author']/span[3]/a/text()")[0]

    info["star"] = int(num_re.findall(item.xpath("./p[@class='search_star_line']/span/span/@style")[0])[0])/20
    # info["star"] = int((item.xpath("./p[@class='search_star_line']/span/span/@style")[0].strip("width: %;"))) / 20
    
    try:
        info["detail"] = item.xpath("./p[@class='detail']/text()")[0]
    except IndexError:  # this book has no description paragraph
        info["detail"] = ""
    print(info)
    # coll.insert_one(info)

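Walkthrough:
1️⃣: Line 28 of the code above (the author line) extracts the author. A plain XPath text query returns the text in separate fragments, one per text node, as the figure below shows:
[Figure: screenshot of the fragmented author result]
Calling .xpath("string(.)") on the element instead concatenates those fragments into one complete string.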
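
The difference can be reproduced offline. The snippet below is a minimal sketch: the HTML fragment is made up for illustration (it only imitates an author cell with nested links) and is not copied from Dangdang's real markup.

```python
from lxml import etree

# Hypothetical markup imitating an author cell that contains nested <a> tags
snippet = """
<p class="search_book_author">
  <span><a title="Author A">Author A</a>, <a title="Author B">Author B</a> (authors)</span>
</p>
"""
root = etree.HTML(snippet)
span = root.xpath("//p[@class='search_book_author']/span[1]")[0]

direct = span.xpath("text()")       # only text nodes that are direct children of <span>
fragments = span.xpath(".//text()") # every text node below <span>, still as a list
joined = span.xpath("string(.)")    # one string: all descendant text concatenated

print(direct)
print(fragments)
print(joined)
```

Both text() queries return lists that still have to be joined by hand, while string(.) already yields a single str.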

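2️⃣: Line 30 of the code above (the date line) extracts the publication date. The raw text is " /2016-07-01", so a regular expression captures only the date part. (Tip: when writing a regex, paste the target string into an editor such as Sublime Text and try the pattern with Ctrl+F regex search to make sure it matches as intended.)
[Alternatively, skip the regex and remove the leading " /" with strip(" /"), as in the commented-out line 31.]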
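
Both approaches can be checked against the sample value quoted above. This is a small sketch; the stricter pattern strict_re is a hypothetical addition, not part of the original script, and assumes the date always has the YYYY-MM-DD shape.

```python
import re

raw = " /2016-07-01"  # sample value quoted in the text above

# Approach 1: the regex from the script, capturing everything after " /"
date_re = re.compile(r"\s/(.*)")
by_regex = date_re.findall(raw)[0]

# Approach 2: strip leading/trailing spaces and slashes
by_strip = raw.strip(" /")

# Hypothetical stricter pattern: accept only a YYYY-MM-DD date
strict_re = re.compile(r"(\d{4}-\d{2}-\d{2})")
match = strict_re.search(raw)
by_strict = match.group(1) if match else None

print(by_regex, by_strip, by_strict)
```

The stricter pattern returns None instead of raising when a book has no date in that span, which may be easier to handle inside a loop.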

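3️⃣: Line 35 of the code above (the star line) extracts the rating. The style attribute reads "width: 95%;", so again a regular expression pulls out the number, which is then divided by 20 to map the percentage onto a five-star scale.
[Alternatively, skip the regex and remove the leading "width: " and the trailing "%;" with strip("width: %;"), as in the commented-out line 36.]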
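
As a quick check of the rating conversion, here is a sketch using the style value described above. The helper name stars_from_style is made up for this example.

```python
import re

num_re = re.compile(r"(\d+)")

def stars_from_style(style: str) -> float:
    """Convert a style value such as 'width: 95%;' to a 0-5 star rating."""
    percent = int(num_re.findall(style)[0])  # first run of digits in the string
    return percent / 20                      # 100% corresponds to 5 stars

style = "width: 95%;"
by_regex = stars_from_style(style)
by_strip = int(style.strip("width: %;")) / 20  # same result without a regex

print(by_regex, by_strip)
```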

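[Note: strip() removes characters from both ends of a string. Its argument is a set of characters, not a prefix or suffix: any leading or trailing character found in that set is removed. With no argument it strips whitespace.]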
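
Because the argument is treated as a set of characters, strip() can remove more than intended. The sketch below uses made-up strings to show the behaviour; str.removeprefix() is shown as an alternative and assumes Python 3.9 or newer.

```python
# Default: whitespace (spaces, tabs, newlines) is removed from both ends
default_strip = "  \n hello \t".strip()

# The argument is a character set; the order of the characters does not matter
char_set = "www.example.com".strip("cmowz.")

# Pitfall: every character of "eric" is also in the set, so nothing is left
too_much = "price: eric".strip("price: ")

# removeprefix() (Python 3.9+) removes the exact leading substring only
exact = "price: eric".removeprefix("price: ")

print(repr(default_strip), repr(char_set), repr(too_much), repr(exact))
```

For the date and star values in this post the character-set behaviour is harmless, since digits and hyphens are not in the stripped set.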
For scraping Dangdang books with Scrapy instead, see this article: https://www.cnblogs.com/CYHISTW/p/12377124.html
