Reviewing httpx + parsel, the Strongest Combination for Python Web Scraping

Latest recommended article published on 2024-05-09 09:40:15

不想敲代码的小码农

Article tags: python list django virtualenv flask

Article link: https://blog.csdn.net/mid56579/article/details/119211045

httpx and parsel are two of the newer tools that are currently popular in the Python web-scraping world.

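httpx bills itself as a next-generation HTTP client library. Its API is broadly compatible with requests, and it can also send asynchronous requests, which makes it convenient for writing asynchronous crawlers.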

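parsel was originally built into Scrapy, the well-known Python scraping framework, and was later split out into a standalone module. It supports several extraction methods, including XPath selectors, CSS selectors, and regular expressions, and it is reported to parse faster than BeautifulSoup.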

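Today we will evaluate httpx and parsel by scraping second-hand homes listed for sale on Lianjia (链家). To save time, we limit the crawl to homes in Shanghai's Pudong New Area priced between 5 million and 8 million yuan.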

The requests + BeautifulSoup combination

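First up is the requests + BeautifulSoup combination, which is also what most people use when they first learn web scraping in Python. In this example the crawler's entry URL is https://sh.lianjia.com/ershoufang/pudong/a3p5/. We first send a request to get the maximum page number, then loop over the pages, requesting and parsing each one to extract the information we want (such as the housing estate name, floor, orientation, total price, and unit price), and finally export a CSV file. If you are reading this article, you presumably already know the basics of Python web scraping, so we will not explain every line of code.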
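
The "get the maximum page number, then loop" step can be sketched without any network access. The URL pattern below comes from the article's code; the sample page-data string and the helper names total_pages and page_urls are assumptions made for this illustration.

```python
import ast

BASE_URL = "https://sh.lianjia.com/ershoufang/pudong/"


def total_pages(page_data: str) -> int:
    # page_data is the text of the pagination attribute; literal_eval parses
    # the dict literal without executing arbitrary code the way eval would
    return ast.literal_eval(page_data)["totalPage"]


def page_urls(max_page: int) -> list:
    # One URL per result page, numbered from 1 to max_page inclusive
    return [f"{BASE_URL}pg{i}a3p5/" for i in range(1, max_page + 1)]


# Assumed sample value of the page-data attribute
sample_page_data = '{"totalPage":3,"curPage":1}'
urls = page_urls(total_pages(sample_page_data))
```

Building the URL list up front also makes it easy to hand the same list to a different fetcher later.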

The full project code is as follows:

# homelink_requests.py
# Author: 大江狗
import ast
import csv
import re
import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        # Output file name: "Pudong, three bedrooms, 5-8 million"
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            a = soup.select('div[class="page-box house-lst-page-box"]')
            # Convert the attribute string into a dict; literal_eval is used
            # instead of eval so that remote content is never executed
            max_page = ast.literal_eval(a[0].attrs["page-data"])["totalPage"]
            return max_page
        else:
            # "请求失败" means "request failed"
            print("请求失败 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        if max_page is None:
            # The first request failed, so there is nothing to crawl
            return
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            ul = soup.find_all("ul", class_="sellListContent")
            li_list = ul[0].select("li")
            for li in li_list:
                detail = dict()
                detail['title'] = li.select('div[class="title"]')[0].get_text()

                # Example: 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.select('div[class="houseInfo"]')[0].get_text()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                # re.search matches anywhere in the string
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"  # "unknown"

                # Match the year built
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"  # "unknown"

                # Example: 文兰小区 - 塘桥; extract the estate name and the location
                position_info = li.select('div[class="positionInfo"]')[0].get_text().split(' - ')
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # Example: 650万; match 650
                price_pattern = re.compile(r'\d+')
                total_price = li.select('div[class="totalPrice"]')[0].get_text()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # Example: 单价64182元/平米; match 64182
                unit_price = li.select('div[class="unitPrice"]')[0].get_text()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        # Column headers: title, estate, rooms, area, orientation, floor,
        # year, location, total price (10,000 yuan), unit price (yuan/m²)
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location",
                "total_price", "unit_price"]

        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
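
The listing breaks off inside write_csv_file. The following is only a sketch of how the field parsing and the CSV export might fit together, not the author's actual code. The helper names parse_house_info and write_rows, the "unknown" fallback, and the in-memory buffer used in place of a real file are all assumptions; the sample string is the one quoted in the code comment.

```python
import csv
import io
import re


def parse_house_info(house_info: str) -> dict:
    # Split "2室1厅 | 74.14平米 | 南 | ..." into its fields
    parts = [part.strip() for part in house_info.split("|")]
    # Pad short rows so that the index lookups below never fail
    parts += [""] * (6 - len(parts))
    floor = re.search(r"\d{1,2}", parts[4])
    year = re.search(r"\d{4}", parts[5])
    return {
        "bedroom": parts[0],
        "area": parts[1],
        "direction": parts[2],
        "floor": floor.group() if floor else "unknown",
        "year": year.group() if year else "unknown",
    }


def write_rows(csv_file, head, keys, data):
    # Write an optional header row, then one row per record in key order
    writer = csv.writer(csv_file, dialect="excel")
    if head is not None:
        writer.writerow(head)
    for item in data:
        writer.writerow([item.get(key, "") for key in keys])


row = parse_house_info("2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼")
keys = ["bedroom", "area", "direction", "floor", "year"]

# An in-memory buffer stands in for open(path, "w", newline="", encoding="utf_8_sig")
buffer = io.StringIO()
write_rows(buffer, keys, keys, [row])
output = buffer.getvalue()
```

Passing the file object in, rather than opening it inside the function, keeps the export step easy to test.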
