Writing Web Crawlers in Python: A Multithreaded Crawler for Australia-Related Finance, Politics, and International News

暖仔会飞

Category: Python写网络爬虫 (Writing Web Crawlers in Python). Tags: python, xpath, multithreading

Article link: https://blog.csdn.net/qq_42902997/article/details/107763978

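Motivation:

I am going to study in Australia and want to keep up with what is happening there. Browsing the news site page by page is slow, so I scrape the articles and save them locally instead.

Code:

The crawler runs one thread per news section, so the three sections (finance, politics, international news) download concurrently rather than one after another.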

import os
import requests
from lxml import etree
import threading

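def request_page(url):
    # `headers` is the module-level dict defined in the __main__ block below.
    # A timeout keeps a stalled connection from blocking the thread forever.
    response = requests.get(url,headers=headers,timeout=10)
    txt = response.text
    return txt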


def txt_to_html(txt):
    html = etree.HTML(txt)
    return html


def html_to_hrefs(html,filter_condition=''):
    hrefs = html.xpath(filter_condition)
    return hrefs


def pages_url_construction(hrefs):
    pages_url = []
    for href in hrefs:
        href = 'https://www.xkb.com.au' + href
        pages_url.append(href)
    return pages_url


def html_to_pagetexts(html,filter_condition=''):
    text = html.xpath(filter_condition)
    return text


def html_to_page_title(html,filter_condition=''):
    titles = html.xpath(filter_condition)
    return titles


def request_every_page_data(pages_url):
    diction = {}
    for page_url in pages_url:
        txt = request_page(page_url)
        html = txt_to_html(txt)
        # list of strings; each item is one sentence of the article body
        page_texts = html_to_pagetexts(html,filter_condition='//div[@class="article-cont-val"]//p/text()')
        titles = html_to_page_title(html,filter_condition='//div[@class="article-cont-head"]//h1/text()')
        if not titles:
            continue  # no headline found; skip the page instead of raising IndexError
        page_title = titles[0]  # str
        # remove characters that are not allowed in Windows file names
        special_letter = '\\/:*?"<>|,？'
        for ch in special_letter:
            page_title = page_title.replace(ch,"")
        texts = ''
        for sentence in page_texts:
            texts += sentence+'\n'
        diction[page_title] = texts
    return diction


def write_texts_to_local(diction,save_dir):
    # create the output directory first; open() does not create it
    os.makedirs(save_dir,exist_ok=True)
    for title, texts in diction.items():
        file = os.path.join(save_dir,title+'.txt')
        with open(file,'w',encoding='utf-8') as f:
            f.write(title+'\n\n')
            f.write(texts)


def main(url):
    txt = request_page(url)
    html = txt_to_html(txt)
    hrefs = html_to_hrefs(html, filter_condition='//div[@class="second-headline-cont2-div1-pteam"]//a/@href')
    pages_url = pages_url_construction(hrefs)
    diction = request_every_page_data(pages_url)
    write_texts_to_local(diction, save_dir='./澳洲新闻')



if __name__ == '__main__':
    web_lst = ['https://www.xkb.com.au/index.php/news/caijing',
               'https://www.xkb.com.au/index.php/news/shizheng',
               'https://www.xkb.com.au/index.php/news/guojixinwen']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    # one thread per section so the three sections download concurrently
    threads = [threading.Thread(target=main,args=(url,)) for url in web_lst]
    for thread in threads:
        thread.start()
    # wait for every section to finish before the script exits
    for thread in threads:
        thread.join()
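
The script above starts one `threading.Thread` per section by hand. As a possible alternative, and not what the original code does, the same fan-out could be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which also makes it easy to see which section failed. The sketch below is a minimal illustration under stated assumptions: `crawl_sections` and `fake_crawl` are hypothetical names introduced here, and `fake_crawl` stands in for `main(url)` so that the example runs without any network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_sections(urls, crawl_one, max_workers=3):
    """Run crawl_one(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL, so that one
    failing section does not hide the outcome of the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the URL it belongs to.
        future_to_url = {pool.submit(crawl_one, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = exc
    return results, errors


def fake_crawl(url):
    """Hypothetical stand-in for main(url); performs no network access."""
    if url.endswith('broken'):
        raise ValueError('simulated failure')
    return len(url)


if __name__ == '__main__':
    demo_urls = ['https://example.com/a', 'https://example.com/broken']
    done, failed = crawl_sections(demo_urls, fake_crawl)
    print(done)
    print(failed)
```

To try this with the crawler itself, one would pass `web_lst` and `main` in place of `demo_urls` and `fake_crawl`. Since `main` returns nothing, every entry in `results` would be `None`, and the useful information would be in `errors`.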
