Some Uses of Text Parsing in Web Crawlers

xzhanxiang

Category: Crawlers (爬虫). Tags: python

Article link: https://blog.csdn.net/xzhanxiang/article/details/108003353

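When you scrape text from web pages, you may run into many different formats, and handling each one separately means writing a lot of extra code. The method below cuts that down: you describe each field with a dictionary of regular expressions, and it fills a Scrapy item from the text in a single pass.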

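    # Requires `import re` at module level. Written as a classmethod of a helper
    # class; the class itself and `cls.reg_one` are not shown in this post.
    @classmethod
    def gen_item_with_reg(cls, string_content, reg_dict, item,
                          need_reverse=None,
                          need_2list=None,
                          single_line_keys=None,
                          not_replace=None,
                          need_detail=None):
        """
        Match string_content against a dict of regular expressions and return item.
        Each dict value (the regex) may be a string or a list of strings. A list is
        tried left to right, and the first pattern that yields a result wins.
        Each of the five list parameters also accepts the string 'all', meaning
        every key in reg_dict.

        :param string_content: str, the text to match against
        :param reg_dict: dict mapping each field name to a regex or a list of regexes
        :param item: Scrapy Item object
        :param need_reverse: list of fields to take from the end: the last match is used, as a str. Suited to detail pages that produce a single item
        :param need_2list: list of fields that should return a list of results: the first non-empty result list is returned, as a list, and need_reverse is ignored for these fields. Suited to pages such as http://www.ygcgfw.com/gggs/001001/001001003/20200601/4ab57080-2407-44b6-9207-ff8e8872f0a7.html
        :param single_line_keys: list of fields whose patterns must match across multiple lines (they are matched with re.S, so "." also matches newlines)
        :param not_replace: list of fields that must not be overwritten: if the field already has a value, it is kept. Suited to items that have already been parsed
        :param need_detail: list of fields that should return both the matched pattern and the result, as a tuple (pattern, result); the result's type still depends on need_2list. Useful for debugging or for telling winning bidders from candidates
        :return: str:    ''                       (default)
                list:   ['']                     (when the field is in need_2list)
                tuple:  ('', '') or ('', [''])   (when the field is in need_detail; the inner type depends on need_2list)
        """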
        # Example reg_dict: each field name maps to a list of patterns.
        # For fields listed in need_reverse, the last match is taken.
        # d = {
        #     'com_name': ['(?:最|最后)[：:]?([\u4e00-\u9fa5()（）]{3,5}个)'],
        # }

        # Normalize each selector: 'all' selects every key in reg_dict,
        # and None (or any other falsy value) selects no keys.
        if need_reverse == 'all':
            need_reverse = list(reg_dict)
        elif not need_reverse:
            need_reverse = []

        if need_2list == 'all':
            need_2list = list(reg_dict)
        elif not need_2list:
            need_2list = []

        if single_line_keys == 'all':
            single_line_keys = list(reg_dict)
        elif not single_line_keys:
            single_line_keys = []

        if not_replace == 'all':
            not_replace = list(reg_dict)
        elif not not_replace:
            not_replace = []

        if need_detail == 'all':
            need_detail = list(reg_dict)
        elif not need_detail:
            need_detail = []

        for key, pattern_list in reg_dict.items():
            # Per-field switches derived from the selector lists.
            reverse = key in need_reverse
            tolist = key in need_2list
            flag = re.S if key in single_line_keys else 0
            replace = key not in not_replace
            detail = key in need_detail

            result = cls.reg_one(pattern_list, string_content, tolist, reverse, flag, detail)
            if isinstance(result, str):
                result = result.strip()  # trim surrounding whitespace
            if replace:
                # Prefer the new match; fall back to any existing value.
                item[key] = result or item.get(key, '')
            else:
                # Keep an existing value; use the match only if the field is empty.
                item[key] = item.get(key, '') or result

        return item
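
The method relies on a `reg_one` helper that this post does not include. The block below is only a rough guess at what such a helper might look like, reconstructed from the docstring: it is a hypothetical stand-in, not the original implementation. It assumes every pattern contains exactly one capture group, so that `re.findall` returns a flat list of strings. The sample text and patterns are made up for illustration.

```python
import re


def reg_one(pattern_list, content, tolist=False, reverse=False, flag=0, detail=False):
    """Hypothetical stand-in for the `reg_one` helper that the post does not show."""
    # Accept a single pattern string as well as a list of patterns.
    if isinstance(pattern_list, str):
        pattern_list = [pattern_list]

    for pattern in pattern_list:
        matches = re.findall(pattern, content, flag)
        if not matches:
            continue  # fall through to the next, lower-priority pattern
        if tolist:
            result = matches  # the whole list; `reverse` is ignored here
        else:
            result = matches[-1] if reverse else matches[0]
        return (pattern, result) if detail else result

    # Nothing matched: mirror the empty defaults listed in the docstring.
    empty = [''] if tolist else ''
    return ('', empty) if detail else empty


# Made-up sample text and patterns, tried left to right.
text = "Project: Bridge repair\nWinner: Acme Ltd\nWinner: Beta Co"
patterns = [r"Winning bidder: (.+)", r"Winner: (.+)"]

first = reg_one(patterns, text)                      # first match of the first pattern that hits
last = reg_one(patterns, text, reverse=True)         # last match instead of the first
all_hits = reg_one(patterns, text, tolist=True)      # every match as a list
with_pattern = reg_one(patterns, text, detail=True)  # (pattern, result) tuple
```

With a helper shaped like this, `need_reverse`, `need_2list`, `single_line_keys`, and `need_detail` map directly onto the `reverse`, `tolist`, `flag`, and `detail` arguments.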

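As written, this yields only a single result per field. If you need multiple results, you can write your own variant. When I have time I will post an update that handles multiple results and keeps the related fields matched up with one another. If you know a better approach, feel free to leave a comment.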
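
One possible way to get several results with their fields kept together is to match a whole record at once with named groups, so that values from the same match land in the same dict. This is a sketch of that idea only, not the author's planned update; the function name `gen_items_with_reg`, the `record_pattern` parameter, and the sample text are all hypothetical.

```python
import re


def gen_items_with_reg(string_content, record_pattern, flags=0):
    """Return one dict per match; each named group becomes a key of that dict."""
    return [
        # A group that did not participate in the match is None; store '' instead.
        {key: (value or '').strip() for key, value in match.groupdict().items()}
        for match in re.finditer(record_pattern, string_content, flags)
    ]


# Made-up sample: two records on one page, each with a name and a price.
text = (
    "Lot 1 winner: Acme Ltd, price: 120\n"
    "Lot 2 winner: Beta Co, price: 95\n"
)
pattern = r"winner: (?P<com_name>[^,\n]+), price: (?P<price>\d+)"
items = gen_items_with_reg(text, pattern)
```

Because each dict comes from a single match object, a company name can never be paired with the price from a different record, which is the alignment problem that separate per-field patterns would have.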
