Scraping Tens of Thousands of Beijing Xinfadi Price Records with a Python Thread Pool

Article link: https://blog.csdn.net/m0_62812713/article/details/135737247

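I recently picked up web scraping again, and today I ran into something interesting while coding, so I am writing it down to share.

Target site: Xinfadi price quotes, http://www.xinfadi.com.cn/priceDetail.html

The listing runs to more than ten thousand pages and over 300,000 records, so I decided to crawl it with a thread pool.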

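Thread pools: A thread pool is a concurrency technique that manages and reuses a fixed set of threads, so that many tasks can run without creating a new thread for each one. Python's standard library provides the concurrent.futures module, whose ThreadPoolExecutor class creates and manages a thread pool. Thread pools help most when a program runs many independent I/O-bound tasks, such as network requests, because a thread that is waiting on I/O releases the GIL (global interpreter lock) and lets other threads run. For CPU-bound tasks the GIL means multithreading may bring little or no speedup, and multiprocessing is usually the better way to make use of multiple cores.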
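To make the I/O-bound case concrete, here is a minimal sketch that uses time.sleep as a stand-in for a network request. The names fake_io_task and run_in_pool are made up for this illustration and are not part of the scraper below.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(task_id):
    """Stand-in for a network request: wait briefly, then return a value."""
    time.sleep(0.2)  # simulates waiting on I/O; the GIL is released while sleeping
    return task_id * 2


def run_in_pool(n_tasks=10, workers=5):
    """Run n_tasks in a pool of `workers` threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, not in completion order
        results = list(pool.map(fake_io_task, range(n_tasks)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_in_pool()
    print(results)
    print(f"elapsed: {elapsed:.2f}s (ten 0.2s waits would take about 2s one after another)")
```

With 5 workers the ten waits overlap, so the total time should be well under the roughly 2 seconds a sequential loop would need. The exact figure depends on the machine.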

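Getting started:

The page source does not contain the price list I wanted to scrape. The browser's developer tools show that the list is loaded by a second request issued from JavaScript.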
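Data loaded this way usually arrives as JSON rather than HTML. The sketch below shows how such a payload could be turned into rows. Only the key names ("list" and the ten field names) come from the scraper code later in this post. The sample values are invented placeholders, and rows_from_payload is a hypothetical helper. Unlike the scraper, it also strips the space left behind after removing the time part.

```python
import json

# Invented sample payload: the values are placeholders, not real site data.
SAMPLE_RESPONSE = """
{"list": [{"prodCat": "cat-A", "prodPcat": "", "prodName": "item-1",
           "lowPrice": "1.0", "avgPrice": "1.5", "highPrice": "2.0",
           "specInfo": "", "place": "somewhere", "unitInfo": "unit",
           "pubDate": "2024-01-01 00:00:00"}]}
"""

# Field names in the order they are written to the CSV file
FIELDS = ["prodCat", "prodPcat", "prodName", "lowPrice", "avgPrice",
          "highPrice", "specInfo", "place", "unitInfo", "pubDate"]


def rows_from_payload(text):
    """Parse the JSON text and return one list of values per record."""
    records = json.loads(text)["list"]
    rows = []
    for rec in records:
        row = [rec[name] for name in FIELDS]
        # Drop the midnight time component and the space left in front of it
        row[-1] = row[-1].replace("00:00:00", "").strip()
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in rows_from_payload(SAMPLE_RESPONSE):
        print(row)
```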

Request URL: http://www.xinfadi.com.cn/getPriceData.html

Request method: POST

Anti-hotlinking Referer header: http://www.xinfadi.com.cn/priceDetail.html
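Putting these three pieces together, the request can be sketched as follows. This version uses the standard library's urllib only to build the request object without sending it, so it runs offline. The scraper itself uses requests. The form fields limit and current are taken from the code later in this post, while build_price_request and the shortened User-Agent are assumptions made for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

PAGE_URL = "http://www.xinfadi.com.cn/priceDetail.html"  # page opened in the browser
API_URL = "http://www.xinfadi.com.cn/getPriceData.html"  # endpoint that returns the data


def build_price_request(page, limit=20):
    """Build (but do not send) the POST request for one page of results."""
    form = {"limit": str(limit), "current": str(page)}
    body = urlencode(form).encode("utf-8")  # form-encoded POST body
    headers = {
        "Referer": PAGE_URL,          # tells the server which page issued the request
        "User-Agent": "Mozilla/5.0",  # shortened placeholder value
    }
    return Request(API_URL, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_price_request(3)
    print(req.get_method(), req.full_url)
    print(req.data)
    print(req.get_header("Referer"))
```

Note that the Referer points at the page URL, not at the data endpoint.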

I started with the first 200 pages. The code is as follows:

import requests
import csv
from concurrent.futures import ThreadPoolExecutor

f = open("12_菜市场价格.csv", mode='w', encoding="utf-8")
csvwriter = csv.writer(f)
def download_one_page(current):
    url = 'http://www.xinfadi.com.cn/priceDetail.html'  # page URL, sent as the Referer
    req_url = 'http://www.xinfadi.com.cn/getPriceData.html'  # endpoint that returns the data
    data = {
        "limit": "20",
        "current": current,
        "pubDateStartTime": '',
        "pubDateEndTime": '',
        "prodPcatid": '',
        "prodCatid": '',
        "prodName": '',
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 ",
        "Referer": url,
    }
    resp = requests.post(req_url, headers=headers, data=data)

    alist = resp.json()["list"]

    for i in alist:
        lis = [i['prodCat'], i['prodPcat'], i['prodName'], i['lowPrice'], i['avgPrice'], i['highPrice'],
               i['specInfo'], i['place'], i['unitInfo'], i['pubDate']]

        lis[-1] = lis[-1].replace('00:00:00', '')
        # Write the row to the CSV file
        csvwriter.writerow(lis)

    print(current, "爬取完毕")
    # resp.close()

if __name__ == '__main__':
    # Create a thread pool with 10 worker threads
    with ThreadPoolExecutor(10) as t:
        for i in range(1, 201):
            t.submit(download_one_page, f'{i}')

    f.close()
    print("全部下载完毕")

However, running it produced the following result: