pdfminer error report

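I ran into two errors while extracting text from PDFs with the pdfminer.six library. On some PDF files, an AttributeError is raised stating that the 'PDFStream' object has no attribute 'replace'. On others, a PDFTextExtractionNotAllowed error is raised, reporting that text extraction is not allowed for that PDF.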


https://github.com/pdfminer/pdfminer.six/issues


pdf: https://links.sgx.com/1.0.0/corporate-announcements/HOBG2B5Y0EVJ9PYQ/Manhattan%20Resources%20Limited%20-%20Offer%20Information%20Statement%20dated%2027%20November%202018.pdf

python ${pdfminer_path}/pdf2txt.py -M 99 -L 1 -o "/pdf/L02/HOBG2B5Y0EVJ9PYQ.txt" "/L02/HOBG2B5Y0EVJ9PYQ.pdf"

Traceback (most recent call last):
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts/pdf2txt.py", line 132, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts/pdf2txt.py", line 127, in main
    outfp = extract_text(**vars(A))
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts/pdf2txt.py", line 62, in extract_text
    pdfminer3.high_level.extract_text_to_fp(fp, **locals())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/high_level.py", line 79, in extract_text_to_fp
    interpreter.process_page(page)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 851, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 861, in render_contents
    self.init_resources(resources)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 361, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 211, in get_font
    font = self.get_font(None, subspec)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 202, in get_font
    font = PDFCIDFont(self, spec)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdffont.py", line 656, in __init__
    self.cmap = CMapDB.get_cmap(name)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/cmapdb.py", line 257, in get_cmap
    data = klass._load_data(name)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/cmapdb.py", line 231, in _load_data
    name = name.replace("\0", "")
AttributeError: 'PDFStream' object has no attribute 'replace'
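
The last frame shows `_load_data` calling `name.replace(...)` on a value that is a `PDFStream` rather than a string. This most likely means the font's encoding is an embedded CMap stream instead of a predefined CMap name. The sketch below is only an illustration of a defensive guard, not the library's actual fix: `FakeStream`, `cmap_name_from_encoding`, and the `Identity-H` fallback are all assumptions made for the example, and a real `PDFStream` would hold a PostScript literal rather than a plain string under `CMapName`.

```python
class FakeStream:
    """Hypothetical stand-in for a PDFStream: dict-like, but not a str."""

    def __init__(self, attrs):
        self.attrs = attrs

    def get(self, key, default=None):
        return self.attrs.get(key, default)


def cmap_name_from_encoding(encoding, fallback="Identity-H"):
    """Return a plain-string CMap name for a font's encoding value.

    The fallback value is an assumption for this sketch; the right
    default depends on the font and should be chosen by the caller.
    """
    if isinstance(encoding, str):
        # Normal case: a predefined CMap name, possibly with stray NULs.
        return encoding.replace("\0", "")

    # Stream case: try the name the embedded CMap declares for itself.
    getter = getattr(encoding, "get", None)
    if callable(getter):
        declared = getter("CMapName")
        if isinstance(declared, str):
            return declared.replace("\0", "")

    # Nothing usable was found, so use the caller-supplied fallback.
    return fallback


if __name__ == "__main__":
    print(cmap_name_from_encoding("Identity-H\0"))
    print(cmap_name_from_encoding(FakeStream({"CMapName": "Custom-H"})))
    print(cmap_name_from_encoding(FakeStream({})))
```

Calling `.replace` directly on the stand-in object reproduces the same kind of `AttributeError` as in the traceback, which is why the type check comes first.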

pdf: http://www3.hkexnews.hk/listedco/listconews/SEHK/2019/0121/LTN20190121455.pdf

python ${pdfminer_path}/pdf2txt.py  -o "/pdf/00137/LTN20190121455.txt" "/pdf/00137/LTN20190121455.pdf"

Traceback (most recent call last):
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts//pdf2txt.py", line 136, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts//pdf2txt.py", line 131, in main
    outfp = extract_text(**vars(A))
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts//pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer/high_level.py", line 80, in extract_text_to_fp
    check_extractable=True):
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer/pdfpage.py", line 132, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed: <_io.BufferedReader name='/appvol/selenium/hkex/pdf/00137/LTN20190121455.pdf'>
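
This exception comes from the `check_extractable=True` argument visible in the traceback: `PDFPage.get_pages` refuses to continue when the document's permission flags do not allow extraction. In the PDF specification those flags live in the `/P` entry of the encryption dictionary, where bit 5 (value 16) governs copying or extracting text and graphics. The sketch below decodes such a value to show why a file can be printable yet not extractable. The function name and the sample value `-1852` are assumptions for illustration and were not read from the PDF above.

```python
# Permission bits of the /P entry (bit numbers are 1-based in the
# PDF specification, so bit N has the value 1 << (N - 1)).
PRINT = 1 << 2      # bit 3: print the document
MODIFY = 1 << 3     # bit 4: modify the contents
EXTRACT = 1 << 4    # bit 5: copy or extract text and graphics
ANNOTATE = 1 << 5   # bit 6: add or modify annotations


def decode_permissions(p):
    """Decode a signed 32-bit /P value into named permission flags."""
    return {
        "print": bool(p & PRINT),
        "modify": bool(p & MODIFY),
        "extract": bool(p & EXTRACT),
        "annotate": bool(p & ANNOTATE),
    }


if __name__ == "__main__":
    # Hypothetical sample value: printing allowed, everything else denied.
    print(decode_permissions(-1852))
    # -4 has all of the bits above set, so every permission is granted.
    print(decode_permissions(-4))
```

When calling `PDFPage.get_pages` directly, passing `check_extractable=False` skips this check. Whether doing so is appropriate depends on your rights to the document.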
