Scraping Lianjia Data with Python Multithreading

This post walks through using Python's requests and lxml libraries to extract second-hand housing listings from Lianjia with XPath: first single-page extraction, then a thread pool to crawl pages 1 through 100, writing the results to a CSV file.

Approach

  1. Lianjia carries the page number in the URL as a pg segment: page 1 is pg1, page 2 is pg2, and so on (see the URL sketch after this list).
  2. The page data is not encrypted or obfuscated, so each page can be requested directly with requests.
  3. XPath is used to pull the fields out of each page.
  4. Once single-page extraction works, a thread pool runs the crawl across many pages in parallel.
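A quick sketch of that URL pattern, using the Nanchong search URL from the final script below (the rs%E5%8D%97%E5%85%85 suffix is just the URL-encoded keyword 南充):

# Build the paged list URLs: pg{n} selects the page; the rs… suffix is the
# URL-encoded search keyword "南充" (Nanchong) used throughout this post.
base = "https://nanchong.lianjia.com/ershoufang/pg{}rs%E5%8D%97%E5%85%85/"
urls = [base.format(n) for n in range(1, 101)]  # pages 1 through 100
print(urls[0])  # https://nanchong.lianjia.com/ershoufang/pg1rs%E5%8D%97%E5%85%85/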

Implementation

  • Fetch a single page and pull out the li tags
resp = requests.get(url)  # request the page
html = etree.HTML(resp.text)  # parse the HTML
li = html.xpath('//*[@id="content"]/div[1]/ul/li')  # select the listing li tags
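If li comes back empty even though the page loads in a browser, the site may be filtering clients without a browser User-Agent. A header like the one below is a common workaround (the header value is illustrative and not part of the original script):

import requests

# Hypothetical hardening: present a browser-like User-Agent in case the
# default python-requests client string is being filtered.
url = "https://nanchong.lianjia.com/ershoufang/pg1rs%E5%8D%97%E5%85%85/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
resp = requests.get(url, headers=headers, timeout=10)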
  • Extract the fields with XPath
    for i in li:
        div = i.xpath('./div[1]/div')
        title = str(div[0].xpath('./a/text()')[0])  # listing title
        address = ''.join(map(str, div[1].xpath('./div[1]/a/text()')))  # community and district
        houseInfo = str(div[2].xpath('./div/text()')[0]).replace('|', '_')  # layout, area, orientation, etc.
        totalPrice, unitPrice = div[5].xpath('./div//span/text()')  # total price and per-m² price
        get_list = [title, address, houseInfo, str(totalPrice), str(unitPrice)]
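The indexing above assumes every li is a regular listing card; an ad card or a layout change would raise an IndexError or a failed unpack. A defensive variant of the loop (my addition, not in the original) simply skips cards that don't match:

    for i in li:
        div = i.xpath('./div[1]/div')
        try:
            title = str(div[0].xpath('./a/text()')[0])
            address = ''.join(map(str, div[1].xpath('./div[1]/a/text()')))
            houseInfo = str(div[2].xpath('./div/text()')[0]).replace('|', '_')
            totalPrice, unitPrice = div[5].xpath('./div//span/text()')
        except (IndexError, ValueError):
            continue  # skip ad cards or cards with an unexpected layout
        get_list = [title, address, houseInfo, str(totalPrice), str(unitPrice)]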
  • Write each row to the CSV file
csv_writer.writerow(get_list)
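The script never writes a header row, so the CSV starts directly with data. If you want one, write it once right after creating the writer (these column labels are mine, not from the original):

csv_writer.writerow(['title', 'address', 'houseInfo', 'totalPrice', 'unitPrice'])  # optional header row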
  • Wrap the single-page extraction in a function

def get_page(url):
    resp = requests.get(url)  # request the page
    html = etree.HTML(resp.text)  # parse the HTML
    li = html.xpath('//*[@id="content"]/div[1]/ul/li')  # select the listing li tags
    for i in li:
        div = i.xpath('./div[1]/div')
        title = str(div[0].xpath('./a/text()')[0])
        address = ''.join(map(str, div[1].xpath('./div[1]/a/text()')))
        houseInfo = str(div[2].xpath('./div/text()')[0]).replace('|', '_')
        totalPrice, unitPrice = div[5].xpath('./div//span/text()')
        get_list = [title, address, houseInfo, str(totalPrice), str(unitPrice)]
        csv_writer.writerow(get_list)
    print(url, 'success')
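Note that get_page has no timeout and no status check, and ThreadPoolExecutor silently swallows exceptions raised inside submitted tasks unless the futures' results are read, so a failing page just disappears. A hardened opening for the function (my sketch; the parsing afterwards is unchanged):

import requests
from lxml import etree

def get_page(url):
    try:
        resp = requests.get(url, timeout=10)  # fail fast rather than hang a worker
        resp.raise_for_status()               # surface 403/404/5xx responses
    except requests.RequestException as exc:
        print(url, 'failed:', exc)
        return
    html = etree.HTML(resp.text)
    # ... the rest of the parsing is unchanged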
  • Add the thread pool and the CSV file

with open('data.csv', mode='w', encoding='utf-8', newline='') as f:
    csv_writer = csv.writer(f)
    with ThreadPoolExecutor(50) as t:
        for item in range(1, 101):
            t.submit(get_page, url=f"https://nanchong.lianjia.com/ershoufang/pg{item}rs%E5%8D%97%E5%85%85/")
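All 50 workers share a single csv.writer, and the csv module documents no thread-safety guarantee, so rows written from different threads can in principle interleave. Guarding the write with a lock is a cheap safeguard (my addition; get_page would then call write_row(get_list) instead of csv_writer.writerow(get_list)):

import threading

write_lock = threading.Lock()

def write_row(row):
    with write_lock:  # serialize CSV writes across worker threads
        csv_writer.writerow(row)

In CPython a single writerow call is often effectively atomic thanks to the GIL and buffered I/O, but that is an implementation detail rather than a guarantee.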
  • Full code

import requests
from lxml import etree
import csv
# thread pool
from concurrent.futures import ThreadPoolExecutor


def get_page(url):
    resp = requests.get(url)  # request the page
    html = etree.HTML(resp.text)  # parse the HTML
    li = html.xpath('//*[@id="content"]/div[1]/ul/li')  # select the listing li tags
    for i in li:
        div = i.xpath('./div[1]/div')
        title = str(div[0].xpath('./a/text()')[0])
        address = ''.join(map(str, div[1].xpath('./div[1]/a/text()')))
        houseInfo = str(div[2].xpath('./div/text()')[0]).replace('|', '_')
        totalPrice, unitPrice = div[5].xpath('./div//span/text()')
        get_list = [title, address, houseInfo, str(totalPrice), str(unitPrice)]
        csv_writer.writerow(get_list)
    print(url, 'success')


if __name__ == '__main__':
    with open('data.csv', mode='w', encoding='utf-8', newline='') as f:  # open the output CSV
        csv_writer = csv.writer(f)
        with ThreadPoolExecutor(50) as t:  # create a pool of 50 worker threads
            for item in range(1, 101):  # crawl pages 1 through 100
                t.submit(get_page,
                         url=f"https://nanchong.lianjia.com/ershoufang/pg{item}rs%E5%8D%97%E5%85%85/")  # Nanchong second-hand listings as the example