[PaddleOCR] Training the end-to-end OCR model PGNet fails with a recursion error: maximum recursion depth exceeded while decoding a JSON array...

Problem

Environment:

  • AI Studio A100
  • Python 3.7.4
  • PaddlePaddle 2.4.0
  • PaddleOCR 2.6.0

Training a PGNet model in this environment fails with the following error:

......
[2023/07/06 14:31:40] ppocr INFO: epoch: [2/600], global_step: 950, lr: 0.001000, loss: 1.698581, score_loss: 0.074852, border_loss: 0.025947, direction_loss: 0.014492, ctc_loss: 0.318290, avg_reader_cost: 0.00038 s, avg_batch_cost: 0.55773 s, avg_samples: 14.0, ips: 25.10191 samples/s, eta: 2 days, 7:34:16
[2023/07/06 14:31:46] ppocr INFO: epoch: [2/600], global_step: 960, lr: 0.001000, loss: 1.529947, score_loss: 0.074622, border_loss: 0.025173, direction_loss: 0.014684, ctc_loss: 0.283169, avg_reader_cost: 0.00038 s, avg_batch_cost: 0.55826 s, avg_samples: 14.0, ips: 25.07770 samples/s, eta: 2 days, 7:32:33
[2023/07/06 14:31:51] ppocr INFO: epoch: [2/600], global_step: 970, lr: 0.001000, loss: 1.508448, score_loss: 0.073050, border_loss: 0.024964, direction_loss: 0.014839, ctc_loss: 0.278475, avg_reader_cost: 0.00037 s, avg_batch_cost: 0.55947 s, avg_samples: 14.0, ips: 25.02383 samples/s, eta: 2 days, 7:30:56
[2023/07/06 14:31:57] ppocr INFO: epoch: [2/600], global_step: 980, lr: 0.001000, loss: 1.370401, score_loss: 0.070708, border_loss: 0.024799, direction_loss: 0.014229, ctc_loss: 0.254010, avg_reader_cost: 0.00036 s, avg_batch_cost: 0.55940 s, avg_samples: 14.0, ips: 25.02692 samples/s, eta: 2 days, 7:29:21
[2023/07/06 14:32:03] ppocr INFO: epoch: [2/600], global_step: 990, lr: 0.001000, loss: 1.189384, score_loss: 0.070311, border_loss: 0.025831, direction_loss: 0.014544, ctc_loss: 0.215893, avg_reader_cost: 0.00036 s, avg_batch_cost: 0.55872 s, avg_samples: 14.0, ips: 25.05728 samples/s, eta: 2 days, 7:27:45
[2023/07/06 14:32:08] ppocr INFO: epoch: [2/600], global_step: 1000, lr: 0.001000, loss: 1.111022, score_loss: 0.072967, border_loss: 0.026057, direction_loss: 0.014936, ctc_loss: 0.200845, avg_reader_cost: 0.00036 s, avg_batch_cost: 0.55829 s, avg_samples: 14.0, ips: 25.07636 samples/s, eta: 2 days, 7:26:10
eval model::   0%|                                     | 0/2000 [00:00<?, ?it/s][2023/07/06 14:32:18] ppocr ERROR: When parsing line 1979, error happened with msg: maximum recursion depth exceeded while decoding a JSON array from a unicode string
[2023/07/06 14:32:18] ppocr ERROR: When parsing line 292, error happened with msg: maximum recursion depth exceeded while decoding a JSON array from a unicode string
Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007fa0d4681700 (most recent call first):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/json/decoder.py", line 353 in raw_decode
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/json/decoder.py", line 337 in decode
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/json/__init__.py", line 348 in loads
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/imaug/label_ops.py", line 208 in __call__
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/imaug/__init__.py", line 53 in transform
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 95 in __getitem__
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
  File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
  ......

The training config runs an eval every 1000 steps, and as the log shows, the error occurs during the eval:

maximum recursion depth exceeded while decoding a JSON array from a unicode string

The recursion limit was exceeded while decoding a JSON array. Combined with the long run of __getitem__ frames in the traceback, recursion is clearly the core problem: the pure-Python JSON decoder is itself recursive, so it merely happens to own the deepest frames when the already-deep stack finally hits the limit. So what is driving such deep recursion?

Investigation

The sys module has two recursion-related APIs:

sys.getrecursionlimit()
sys.setrecursionlimit(limit)

They get and set the current Python interpreter's recursion limit, respectively.

Simply raising the recursion limit is most likely treating the symptom rather than curing the cause, and the maximum itself is constrained by the platform:

sys.setrecursionlimit(limit)
Set the maximum depth of the Python interpreter stack to limit. This limit prevents infinite recursion from causing an overflow of the C stack and crashing Python.

The highest possible limit is platform-dependent. A user may need to set the limit higher when they have a program that requires deep recursion and a platform that supports a higher limit. This should be done with care, because a too-high limit can lead to a crash.

So what is the default recursion limit on the current platform? I tried both Windows and AI Studio, and got 1000 on both:

>>> sys.getrecursionlimit()
1000
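
For completeness, raising the limit would look like the sketch below; but since the recursion here turns out to be effectively unbounded, this only delays the crash (and, as the docs warn, a too-high value can crash the interpreter at the C level):

import sys

# a band-aid, not a fix: if the recursion is unbounded, a higher limit
# only postpones the failure, and a value too high for the platform can
# overflow the C stack instead of raising RecursionError
print(sys.getrecursionlimit())  # 1000 on both platforms tested above
sys.setrecursionlimit(3000)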

Analysis

When searching turns up nothing, walking through the code is the most direct approach. I read from the main method of train.py onward... but I could not find a forward path to the __getitem__ call in pgnet_dataset; I only knew it happened somewhere around the model call:

                # program.py, around line 525
                elif model_type in ['sr']:
                    preds = model(batch)
                    sr_img = preds["sr_img"]
                    lr_img = preds["lr_img"]
                else:
                    preds = model(images)

            batch_numpy = []
            for item in batch:
                if isinstance(item, paddle.Tensor):
                    batch_numpy.append(item.numpy())
                else:
                    batch_numpy.append(item)

So I had to work backwards instead. (In hindsight, __getitem__ is invoked indirectly by the DataLoader as the eval loop iterates the dataset, which is why no explicit call path shows up in train.py.) The key code lives at "ppocr/data/pgnet_dataset.py", line 102, in __getitem__:

    def __getitem__(self, idx):
        file_idx = self.data_idx_order_list[idx]
        data_line = self.data_lines[file_idx]
        img_id = 0
        try:
            data_line = data_line.decode('utf-8')
            substr = data_line.strip("\n").split(self.delimiter)
            file_name = substr[0]
            label = substr[1]
            img_path = os.path.join(self.data_dir, file_name)
            if self.mode.lower() == 'eval':
                try:
                    img_id = int(data_line.split(".")[0][7:])
                except:
                    img_id = 0
            data = {'img_path': img_path, 'label': label, 'img_id': img_id}
            if not os.path.exists(img_path):
                raise Exception("{} does not exist!".format(img_path))
            with open(data['img_path'], 'rb') as f:
                img = f.read()
                data['image'] = img
            outs = transform(data, self.ops)
        except Exception as e:
            self.logger.error(
                "When parsing line {}, error happened with msg: {}".format(
                    self.data_idx_order_list[idx], e))
            outs = None
        if outs is None:
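            # on any failure, substitute a random sample by recursing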
            return self.__getitem__(np.random.randint(self.__len__()))
        return outs

The recursive behavior:

return self.__getitem__(np.random.randint(self.__len__()))

The function first reads a sample, then runs it through transform to produce outs; if outs is None, it recurses. The intent: fetch one image from the dataset, and if that fails, substitute a randomly chosen one.
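
This retry strategy is harmless when failures are rare, but if every sample fails, the recursion never terminates. A minimal sketch (my own hypothetical reduction of __getitem__, not PaddleOCR code) reproduces the crash:

import numpy as np

class BrokenDataset:
    """Stand-in for PGDataSet where every sample fails to parse."""

    def __len__(self):
        return 2000

    def __getitem__(self, idx):
        outs = None  # pretend transform() returned None for this sample
        if outs is None:
            # each retry adds a stack frame; with the default limit of 1000,
            # roughly a thousand consecutive failures raise RecursionError
            return self.__getitem__(np.random.randint(self.__len__()))
        return outs

BrokenDataset()[0]  # RecursionError: maximum recursion depth exceeded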

This many levels of recursion mean outs was None every single time. Adding some logging showed that the transform method kept returning None. The transform method is not complicated:

def transform(data, ops=None):
    """ transform """
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        if data is None:
            return None
    return data

transform simply loops over the ops and applies each one to data. So what is an op? Tracing the code shows ops is built in __init__:

self.ops = create_operators(dataset_config['transforms'], global_config)

Digging further: ops is just a list of objects constructed from the transforms section of the config file. So what does that transforms config look like? Here is the Eval section:

Eval:
  dataset:
    name: PGDataSet
    data_dir: ./train_data/total_text/test
    label_file_list: [./train_data/total_text/test/test.txt]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - E2ELabelEncodeTest:
      - E2EResizeForTest:
          max_side_len: 768
      - NormalizeImage:
          scale: 1./255.
          mean: [ 0.485, 0.456, 0.406 ]
          std: [ 0.229, 0.224, 0.225 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'shape', 'polys', 'texts', 'ignore_tags', 'img_id']
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 2
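
Before going through the individual transforms, here is a simplified sketch of how create_operators turns this YAML list into op instances (my approximation of the idea, not the exact PaddleOCR source): each list entry is a one-key dict whose key names a transform class and whose value holds its constructor kwargs.

def create_operators(op_param_list, global_config=None):
    ops = []
    for operator in op_param_list:
        op_name = list(operator)[0]         # e.g. 'DecodeImage'
        param = operator[op_name] or {}     # e.g. {'img_mode': 'BGR', ...}
        if global_config is not None:
            param.update(global_config)     # global options reach every op
        ops.append(eval(op_name)(**param))  # instantiate the transform class
    return ops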

This is the Eval transforms pipeline: DecodeImage, E2ELabelEncodeTest, E2EResizeForTest, and so on. As the transform method above shows, the moment any op returns None, transform returns None, so the outer outs becomes None and the recursion kicks in. Which op was failing, then? Adding logging showed that every None came from E2ELabelEncodeTest; in other words, every sample was failing inside E2ELabelEncodeTest.
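
The instrumentation was along these lines (a hypothetical sketch of the temporary debug logging, not code that ships with PaddleOCR):

def transform(data, ops=None):
    """ transform, instrumented to report which op drops the sample """
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        if data is None:
            print("op returned None:", type(op).__name__)
            return None
    return data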

So the next step is to look closely at what E2ELabelEncodeTest actually does. Its code:

class E2ELabelEncodeTest(BaseRecLabelEncode):
    def __init__(self,
                 max_text_length,
                 character_dict_path=None,
                 use_space_char=False,
                 **kwargs):
        super(E2ELabelEncodeTest, self).__init__(
            max_text_length, character_dict_path, use_space_char)

    def __call__(self, data):
        import json
        padnum = len(self.dict)
        label = data['label']
        label = json.loads(label)
        nBox = len(label)
        boxes, txts, txt_tags = [], [], []
        for bno in range(0, nBox):
            box = label[bno]['points']
            txt = label[bno]['transcription']
            boxes.append(box)
            txts.append(txt)
            if txt in ['*', '###']:
                txt_tags.append(True)
            else:
                txt_tags.append(False)
        boxes = np.array(boxes, dtype=np.float32)
        txt_tags = np.array(txt_tags, dtype=np.bool)
        data['polys'] = boxes
        data['ignore_tags'] = txt_tags
        temp_texts = []
        for text in txts:
            text = text.lower()
            text = self.encode(text)
            if text is None:
                return None
            text = text + [padnum] * (self.max_text_len - len(text)
                                      )  # use 36 to pad
            temp_texts.append(text)
        data['texts'] = np.array(temp_texts)
        return data
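
For context, each line of the label file looks roughly like this (an illustrative example matching the fields the code reads, not a line from the actual dataset): the image path, a tab, then a JSON array of boxes, each carrying points and a transcription. The JSON part is what E2ELabelEncodeTest receives as data['label'] and feeds to json.loads:

line = 'test/img1.jpg\t[{"transcription": "中文文本", "points": [[0, 0], [100, 0], [100, 30], [0, 30]]}]'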

E2ELabelEncodeTest mostly just reorganizes the data; the key step is encoding each text with self.encode, which looks like this:

def encode(self, text):
    """convert text-label into text-index.
    input:
        text: text labels of each image. [batch_size]

    output:
        text: concatenated text index for CTCLoss.
                [sum(text_lengths)] = [text_index_0 + text_index_1 + ... + text_index_(n - 1)]
        length: length of each text. [batch_size]
    """
    if len(text) == 0 or len(text) > self.max_text_len:
        return None
    if self.lower:
        text = text.lower()
    text_list = []
    for char in text:
        if char not in self.dict:
            # logger = get_logger()
            # logger.warning('{} is not in dict'.format(char))
            continue
        text_list.append(self.dict[char])
    if len(text_list) == 0:
        return None
    return text_list

It maps each character of text through self.dict. So what is this dict? Time for the reveal, in BaseRecLabelEncode's __init__ method:

if character_dict_path is None:
    logger = get_logger()
    logger.warning(
        "The character_dict_path is None, model can only recognize number and lower letters"
    )
    self.character_str = "0123456789abcdefghijklmnopqrstuvwxyz"
    dict_character = list(self.character_str)
    self.lower = True
else:
    self.character_str = []
    with open(character_dict_path, "rb") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.decode('utf-8').strip("\n").strip("\r\n")
            self.character_str.append(line)
    if use_space_char:
        self.character_str.append(" ")
    dict_character = list(self.character_str)
dict_character = self.add_special_char(dict_character)
self.dict = {}
for i, char in enumerate(dict_character):
    self.dict[char] = i
self.character = dict_character

So self.dict is built from the dictionary file that character_dict_path points to. And what does character_dict_path default to?

character_dict_path: ppocr/utils/ic15_dict.txt

There is the answer: ic15_dict.txt is an English alphanumeric dictionary, but I was training on Chinese data. Not a single character of any label could be encoded in E2ELabelEncodeTest, so it returned None for every sample, which is what drove the runaway recursion.
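
The failure is easy to demonstrate in isolation. A standalone sketch mirroring the encode logic (using the digits-plus-lowercase character set for brevity, which is essentially what ic15_dict.txt contains):

# an English alphanumeric dict, like ic15_dict.txt
char_dict = {c: i for i, c in enumerate("0123456789abcdefghijklmnopqrstuvwxyz")}

def encode(text, d):
    # mirror of BaseRecLabelEncode.encode's core filtering
    idxs = [d[ch] for ch in text.lower() if ch in d]
    return idxs or None  # empty result -> None, the value that triggers recursion

print(encode("Hello", char_dict))     # [17, 14, 21, 21, 24]
print(encode("中文文本", char_dict))  # None: no Chinese character is in the dict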

Solution

After some searching on the official site, I switched the dictionary to a Chinese one:

character_dict_path: ppocr/utils/ppocr_keys_v1.txt

At least training runs normally now...
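
A sanity check worth running before training (my own sketch, not part of PaddleOCR): load the dictionary the same way BaseRecLabelEncode does and confirm the label text is covered, since any character missing from the dict is silently dropped:

# load the dictionary file the same way BaseRecLabelEncode.__init__ does
with open("ppocr/utils/ppocr_keys_v1.txt", "rb") as fin:
    chars = {line.decode("utf-8").strip("\n").strip("\r\n") for line in fin}

sample_label = "中文文本"  # hypothetical transcription from the training labels
missing = [c for c in sample_label if c not in chars]
print(missing or "all characters covered by the dict")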

The problem was brute-forced rather than truly understood. For instance: why does evaluation need to encode the text at all, why doesn't E2ELabelEncodeTrain encode characters during training, and what do the various transforms actually do... I hope to deepen my understanding of all this soon.
