Python Web Scraping: Collecting City Attraction Information

Latest recommended article published on 2024-04-30 20:45:04

影雀

Please credit the source when reposting!

Article link: https://blog.csdn.net/qq_42952437/article/details/98785520

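Scrape all the information on each attraction's detail page:

attraction name, address, description, type, opening hours, ticket prices, and so on.

Straight to the code: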

import requests
from lxml import etree
from multiprocessing.pool import Pool
headers = {
    'Referer': 'https://yancheng.cncn.com/jingdian/dazonghu/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3722.400 QQBrowser/10.5.3738.400'
}
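def save(content):  # Append one record to the output file
    # An explicit encoding keeps Chinese text intact regardless of the OS default
    with open('盐城景区.doc', 'a', encoding='utf-8') as f:
        f.write(content + '\n')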

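def get_detail(href):  # Fetch the HTML of a detail page
    response = requests.get(href, headers=headers, timeout=10)
    # Let requests detect the charset from the body in case the header is missing or wrong
    response.encoding = response.apparent_encoding
    return response.text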

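def paser_pages(resp):  # Parse a detail page and save its fields
    infos = []
    info = etree.HTML(resp)
    titles = info.xpath('//h1/text()')  # Page title (the attraction name)
    if not titles:
        return  # Skip pages without a title instead of raising IndexError
    infos.append(titles[0].strip())
    dls = info.xpath('//div[@class="type"]//dl')  # Blocks holding the detail fields
    for dl in dls:
        detail = dl.xpath('.//text()')
        # Join the text nodes and drop non-breaking spaces
        detail = ''.join(detail).replace('\xa0', '').strip()
        infos.append(detail)
    save('\n'.join(infos))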

def get_pages(url):  # Fetch a list page and process every attraction on it
    response = requests.get(url, headers=headers, timeout=10)
    selector = etree.HTML(response.text)
    items = selector.xpath('//div[@class="city_spots_list"]/ul//li')
    for item in items:
        hrefs = item.xpath('./a/@href')  # URL of the detail page
        if not hrefs:
            continue  # Skip list items that carry no link
        res = get_detail(hrefs[0])
        paser_pages(res)

if __name__ == '__main__':
    # Crawl the list pages in parallel. Note that multiprocessing.pool.Pool
    # starts worker processes, not threads.
    page_href = ['https://yancheng.cncn.com/jingdian/1-{}-0-0.html'.format(i) for i in range(1, 6)]
    pool = Pool()
    pool.map(get_pages, page_href)
    pool.close()
    pool.join()
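
The script above runs its workers as separate processes, and every worker appends to the same file. Since the work is mostly waiting on the network, threads are a common alternative. The following is only a minimal sketch of that variant, not the author's method: `write_lock`, `save_locked`, and `crawl_all` are hypothetical names introduced here, and the lock is there so that records written by different threads do not interleave.

```python
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical helper names; none of them appear in the original script.
write_lock = threading.Lock()  # serializes appends to the shared output file


def save_locked(content, path):
    """Append one record while holding the lock, so records stay contiguous."""
    with write_lock:
        with open(path, 'a', encoding='utf-8') as f:
            f.write(content + '\n')


def crawl_all(urls, worker, threads=4):
    """Call worker(url) for every URL on a pool of threads.

    pool.map blocks until all calls finish and returns results in input order.
    """
    with ThreadPool(threads) as pool:
        return pool.map(worker, urls)
```

Under these assumptions, `crawl_all(page_href, get_pages)` would stand in for the `Pool()` / `pool.map(...)` lines, with `save` replaced by a call to `save_locked`.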

Result:
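
The output file carries a `.doc` extension, but `save` writes plain text into it, so it is a text file rather than a real Word document. If the records are meant to be processed later, one option is to store each attraction as a JSON object on its own line. This is a sketch under assumptions: `to_record`, `save_jsonl`, and `load_jsonl` are hypothetical helpers, and the only structure taken from the script is that the first item of `infos` is the title and the remaining items are the detail strings.

```python
import json


def to_record(infos):
    """Turn the flat list built by the parser into a dict.

    Assumes infos[0] is the title and the rest are detail strings.
    """
    return {'title': infos[0], 'details': list(infos[1:])}


def save_jsonl(record, path):
    """Append one record as a single JSON line; ensure_ascii=False keeps Chinese readable."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def load_jsonl(path):
    """Read every non-empty line back into a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```

In `paser_pages`, the final `save('\n'.join(infos))` call could then become `save_jsonl(to_record(infos), 'spots.jsonl')`, where the file name is a placeholder.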
