Python Distributed General-Purpose Crawler (3)

前词

Category: Notes | Tags: python, crawler

Article link: https://blog.csdn.net/qq_42026580/article/details/120665214

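Python Distributed General-Purpose Crawler (3): Extracting the Required Content

How the .py files are divided
- Fetching data via XPath information

If you see things differently or have other suggestions, you are welcome to discuss them.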

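How the .py files are divided

This distributed crawler is divided into five main parts: (1) the entry script (where the program starts), (2) fetching the XPath information, (3) extracting the required content, (4) time handling, and (5) writing data to the database.

There are several ways to get information from a website: downloading the pages, copying the text by hand, or fetching the content with a program. When we want only part of the content rather than everything, a program can pick out just the relevant parts. Candidate languages include C, C++, Java, and Python; we use Python here because it is convenient to work with. The tools used include PyCharm and Oracle 11g.

Fetching data via XPath information

This part consists of the main function, the first-level anchor function, the second-level anchor function, the content-extraction function, the pre-insert processing function, and the page-fetching function.

Main function

This code checks whether a site has a first-level anchor. If it does, we call the first-level anchor function; if not, we call the second-level anchor function directly. The purpose of the main function is to iterate over every site for which XPath expressions have been maintained.

Code

The main function is shown below.
All is an array (a table), and site holds one row of it. The row is described as a dictionary, but note that the code reads its fields by integer index.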

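def main(All):
    """Walk every maintained site and start crawling from the right level."""
    for site in All:  # iterate over all sites
        print(site[2])  # URL of the site's seed page
        if site[4] is not None:
            # A first-level anchor XPath exists: start from the first level.
            YJ_urls(site)
        else:
            # No first-level anchor: go straight to the second level.
            EJ_urls(site, None)
        print("----Site finished----")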
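
The functions in this post read each site row by integer position, so it can help to keep a map of which column holds what. The sketch below is inferred purely from how the indices are used in this post's code (for example, index 2 is passed to gethtml and index 4 to xpath). The names in Col, the make_row helper, and the example.com values are hypothetical and not part of the original project.

```python
from enum import IntEnum


class Col(IntEnum):
    """Column positions, inferred from how this post's code uses them."""
    SEED_URL = 2          # page where crawling starts
    BASE_URL = 3          # prefix joined onto relative links
    LEVEL1_XPATH = 4      # first-level anchor XPath, or None
    LEVEL2_XPATH = 5      # second-level (article link) XPath
    NEXT_PAGE_XPATH = 6   # next-page link XPath, '' if absent
    TITLE_XPATH = 7
    TIME_XPATH = 8
    CONTENT_XPATH = 11
    TIME_SET_ARG = 17     # second argument passed to time_set
    AUTHOR_XPATH = 18     # may be None


def make_row(width=20, **fields):
    """Build a row tuple with the named columns filled in (hypothetical helper)."""
    row = [None] * width
    for name, value in fields.items():
        row[Col[name]] = value
    return tuple(row)


example_site = make_row(
    SEED_URL="https://example.com/news/",
    BASE_URL="https://example.com",
    LEVEL2_XPATH="//ul[@class='list']/li/a/@href",
    NEXT_PAGE_XPATH="",
)

# Same test that main() performs to choose the starting level.
starts_at_first_level = example_site[Col.LEVEL1_XPATH] is not None
print(starts_at_first_level)
```

Because IntEnum members behave as integers, example_site[Col.SEED_URL] and example_site[2] read the same field, so the map can be adopted gradually without changing the row format.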

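First-level anchor function

This code looks for first-level anchor links on the page reached from the seed node, checks whether each link needs to be joined with the base URL, and passes the resulting link to the second-level anchor function.

Code

The first-level anchor function is shown below.
dic holds one row of the site table, that is, the data for one site.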

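def YJ_urls(dic):
    """Collect first-level anchor links from the seed page and crawl each one."""
    html = gethtml(dic[2])
    if html is None:  # gethtml returns None when every attempt failed
        print("---Cannot open the seed page, skipped---")
        return 1
    List_YJ = html.xpath(dic[4])
    for EJ_url in List_YJ:
        if "http" in EJ_url or ("www." in EJ_url and "//www." not in EJ_url):
            EJ_urls(dic, EJ_url)  # no join needed
        elif EJ_url == '/':
            continue
        else:
            EJ_urls(dic, dic[3] + EJ_url)  # join with the base URL in dic[3]
    return 0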
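
The join-or-not decision above relies on substring tests and string concatenation. As an alternative (not what the original code does), the standard library's urllib.parse.urljoin can resolve absolute, root-relative, and protocol-relative links in one call. The sketch below uses a hypothetical helper name, normalize_link, and assumes that bare www. links should get an http:// prefix; requests rejects URLs that have no scheme, so such links need one either way. Note also that it resolves links against the URL of the page they were found on rather than against the base URL column.

```python
from urllib.parse import urljoin


def normalize_link(page_url, href):
    """Return an absolute URL for href found on page_url, or None to skip it."""
    href = href.strip()
    if not href or href == "/":
        return None                  # nothing useful to follow
    if href.startswith("www."):
        href = "http://" + href      # assumption: default to http
    # urljoin leaves absolute URLs alone and resolves everything else
    # (including //host/path links) against page_url.
    return urljoin(page_url, href)


page = "https://example.com/news/"
for link in ["http://other.example/x", "//www.example.com/a.html",
             "/a/1.html", "2.html", "www.example.com/b", "/"]:
    print(link, "->", normalize_link(page, link))
```

With a helper like this, the three-way if/elif/else in the anchor functions would reduce to one call plus a None check.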

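Second-level anchor function

This code looks for second-level anchor links on the page reached from the seed node or from a link passed in by the first-level anchor function, checks whether each link needs to be joined, and passes the resulting link to the content-extraction function. Finally, it checks for a next-page anchor to decide whether to move on to the next page and repeat.

Code

The second-level anchor function is shown below.
t is the link passed in by the first-level anchor function (None when the function is called directly from the main function), and dic holds one row of the site table.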

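import random
import time


def EJ_urls(dic, t):
    """Crawl article links on a listing page, following next-page links.

    Returns 1 if a listing page could not be opened, otherwise 0.
    """
    try:
        # Start from the first-level link if one was passed in, else from the seed URL.
        html = gethtml(t if t is not None else dic[2])
        break_time = 0   # articles found to be older than the crawl window
        page_number = 0  # listing pages processed so far
        while True:
            page_number += 1
            time.sleep(random.uniform(1, 2))
            if html is None:
                print("---Cannot open the second-level link, skipped---")
                return 1
            List_EJ = html.xpath(dic[5])
            for url in List_EJ:
                if "http" in url or ("www." in url and "//www." not in url):  # no join needed
                    re_content = get_content(str(url), dic)
                elif "//www." in url and "http" not in url:
                    re_content = get_content(str(url.replace('//www.', 'www.')), dic)
                else:  # join with the base URL
                    if url == '/':
                        continue
                    re_content = get_content(str(dic[3]) + str(url), dic)
                if re_content == 0:  # article is outside the crawl window
                    break_time += 1
                    if break_time >= 5:
                        return 0
            if not dic[6] or page_number >= 3:  # no next-page XPath, or page limit reached
                return 0
            next_links = html.xpath(dic[6])
            if not next_links or next_links[0] == '/':  # no further page
                return 0
            next_page = next_links[0]
            if "http" in next_page or ("www." in next_page and "//www." not in next_page):
                html = gethtml(next_page)  # no join needed
            else:
                html = gethtml(dic[3] + str(next_page))  # join
    except Exception as e:
        print(e)
        return 0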
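
The paging logic above (process a page, look up the next-page link, stop after three pages) can be isolated so that it is testable without network access. The following is a minimal sketch of that idea, not the original implementation: iter_pages, fetch, next_url_of, and the FAKE_SITE data are all hypothetical names introduced for illustration.

```python
def iter_pages(start_url, fetch, next_url_of, max_pages=3):
    """Yield up to max_pages pages, following next-page links.

    fetch(url) returns a page object or None; next_url_of(page) returns the
    next page's URL or None. Both are injected so the loop can be exercised
    with fake data.
    """
    url = start_url
    for _ in range(max_pages):
        page = fetch(url)
        if page is None:       # page could not be opened
            return
        yield page
        url = next_url_of(page)
        if not url:            # no next-page link
            return


# Fake four-page site used only for demonstration.
FAKE_SITE = {
    "p1": {"links": ["a", "b"], "next": "p2"},
    "p2": {"links": ["c"], "next": "p3"},
    "p3": {"links": ["d"], "next": "p4"},
    "p4": {"links": ["e"], "next": None},
}

pages = list(iter_pages("p1", FAKE_SITE.get, lambda page: page["next"]))
visited_links = [link for page in pages for link in page["links"]]
print(visited_links)
```

In the crawler itself, fetch would be gethtml and next_url_of would apply the next-page XPath; the fourth page is never requested because the page limit is checked before fetching.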

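Content-extraction function

This code extracts the data we want from the page reached through a link sent by the second-level anchor function and stores it in a dictionary. The time value is processed and converted to datetime, and a cutoff (PQdate) limits how far back we crawl. Finally, the dictionary is passed to the pre-insert processing function.

Code

The content-extraction function is shown below.
url is the link sent by the second-level anchor function.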

import random
import time
from datetime import datetime, timedelta

# Crawl window: only articles published within the last two days are kept.
PQdate = (datetime.today() - timedelta(days=2)).replace(tzinfo=None)


def get_content(url, dic):
    """Extract one article from url.

    Returns 0 if the article is older than PQdate, 1 if it was skipped or
    failed, and 2 if it was passed on for insertion.
    """
    try:
        items = {}
        html = gethtml(url)
        time.sleep(random.uniform(1, 2))
        if html is None:
            print("---Cannot open this link, skipped---")
            return 1
        publish_time = ''.join(html.xpath(dic[8])).replace("\n", "").strip()
        # time_set is defined elsewhere and is not shown in this post.
        items["publish_time"] = time_set(publish_time, dic[17])
        if PQdate > items["publish_time"]:
            print("---Older than the crawl window---")
            return 0
        print(items["publish_time"])
        items["title"] = ''.join(html.xpath(dic[7])).replace("\n", "").strip()
        items["content"] = ''.join(html.xpath(dic[11])).replace("\n", "").strip()
        if items["title"] == '' or items["content"] == '':
            # print("---Title or content could not be extracted, skipped---")
            return 1
        if dic[18] is not None:
            items["author"] = ''.join(html.xpath(dic[18])).replace("\n", "").strip()
        else:
            items["author"] = ''
        items["url"] = str(url)
        insert_db(items, dic)
        return 2
    except Exception as e:
        print(e)
        return 1
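
PQdate is computed once, when the module is imported. If the crawler runs as a long-lived process, the two-day window would gradually drift. One possible variation, offered here as a sketch rather than as the original design, is to compute the cutoff at call time. The function name is_outside_window and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def is_outside_window(publish_time, days=2, now=None):
    """Return True if publish_time is older than `days` days before `now`."""
    if now is None:
        now = datetime.today()   # evaluated on every call, not at import
    return publish_time < now - timedelta(days=days)


# A fixed "now" keeps the demonstration deterministic.
fixed_now = datetime(2021, 10, 9, 12, 0, 0)
print(is_outside_window(datetime(2021, 10, 8, 9, 0, 0), now=fixed_now))
print(is_outside_window(datetime(2021, 10, 6, 9, 0, 0), now=fixed_now))
```

The optional now parameter exists mainly so the check can be tested with a known date.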

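Pre-insert processing function

In this function, we tidy up the dictionary sent by the content-extraction function once more, add some other fields of our own, and finally pass it to the database-insert function.

Code

The pre-insert processing function is shown below.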

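def insert_db(items, dic):
    """Add site-level fields from dic to items, then hand items to insert_database."""
    items["l"] = dic[10]
    items["d"] = dic[14]
    items["y"] = dic[15]
    items["g"] = dic[16]
    items['yb'] = dic[19]
    try:
        insert_database(items)  # defined elsewhere, not shown in this post
    except Exception as into:
        print(into)
        print("Database insert failed!")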
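
insert_database itself is not shown in this post (the series lists data storage as a separate part). For trying the crawler locally without Oracle, a stand-in could look like the sketch below. It uses SQLite from the standard library, not the Oracle database the original project uses, and the function name insert_database_sqlite, its conn parameter, the articles table, and its columns are all hypothetical.

```python
import sqlite3
from datetime import datetime


def insert_database_sqlite(items, conn):
    """Store one crawled article; rows with an already-seen url are ignored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, content TEXT, "
        "author TEXT, publish_time TEXT)"
    )
    # Placeholders keep scraped text from being interpreted as SQL.
    conn.execute(
        "INSERT OR IGNORE INTO articles "
        "(url, title, content, author, publish_time) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            items["url"],
            items["title"],
            items["content"],
            items["author"],
            items["publish_time"].isoformat(),  # store datetime as text
        ),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
article = {
    "url": "https://example.com/news/1.html",
    "title": "Example title",
    "content": "Example body",
    "author": "",
    "publish_time": datetime(2021, 10, 9, 8, 30),
}
insert_database_sqlite(article, conn)
insert_database_sqlite(article, conn)   # duplicate url, ignored
row_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(row_count)
```

Using the URL as the primary key is one simple way to keep repeated crawls from storing the same article twice.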

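Page-fetching function

Several of the functions above call gethtml. This is the function we use to disguise our requests. Because cookies and other parameters change from page to page, the only thing we can reliably fake is the User-Agent. I also added up to three attempts when a request times out or fails, plus UTF-8 decoding.

Code

The page-fetching function is shown below.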

import os
import random
import time

import requests
from lxml import etree


def gethtml(url):
    """Fetch url and return a parsed lxml tree, or None after three failed attempts."""
    os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'  # Oracle client character set
    # Pool of User-Agent values to choose from
    agentlist = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome"
        "/91.0.4472.114 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome"
        "/91.0.4472.124 Safari/537.36 Edg/91.0.864.64"
    ]
    for i in range(3):
        try:
            time.sleep(random.uniform(1, 2))
            # Pick a random User-Agent for each attempt
            headers = {"User-Agent": random.choice(agentlist)}
            response = requests.get(url=url, headers=headers, timeout=10)
            return etree.HTML(response.content.decode("utf-8"))
        except requests.exceptions.RequestException:
            print(url + " request failed (attempt %d of 3)" % (i + 1))
    return None
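
gethtml decodes every response as UTF-8, so a page served in another encoding raises UnicodeDecodeError, which is not caught inside gethtml. A possible extension, not part of the original code, is to try a short list of encodings in order. The helper name decode_html and the choice of gb18030 as the fallback are assumptions; gb18030 is picked here because it is a superset of GBK, which is common on Chinese sites.

```python
def decode_html(raw, encodings=("utf-8", "gb18030")):
    """Decode raw bytes with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode(encodings[0], errors="replace")


utf8_bytes = "中文标题".encode("utf-8")
gbk_bytes = "中文标题".encode("gbk")
print(decode_html(utf8_bytes) == decode_html(gbk_bytes))
```

Inside gethtml, this would replace the direct decode("utf-8") call on the response body, with the result passed to etree.HTML as before.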
