Python Crawler + PDF Parsing + Data Classification

Author: Blank_Tt

Category: Computational Intelligence. Tags: crawler, PDF download, PDF parsing, classification

Article link: https://blog.csdn.net/Blank_Tt/article/details/102378871

Batch-downloading PDFs with a Python crawler

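The page at https://www.ml4aad.org/automl/literature-on-neural-architecture-search/ lists papers on neural architecture search. Entries whose titles are in bold have been published; entries whose titles are not in bold have not yet been published. Our first task is to download the PDF documents linked from this page.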

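A brief look at the page source shows that the links follow only a handful of URL formats, so they can be handled one by one with an if/else chain. A few problems also need attention: pages that return 404, entries with no link, and entries with several links.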
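As a rough illustration of this per-format dispatch, the sketch below rewrites a landing-page URL into a direct PDF URL for two of the formats found on the page. The name `to_pdf_url`, the `RULES` table, and the regular expressions are assumptions made for illustration; they are not taken from the original code, and only arXiv and OpenReview are covered.

```python
import re
from typing import Optional

# Hypothetical rewrite rules: (pattern for a landing page, replacement giving the PDF URL).
RULES = [
    (re.compile(r"^https://arxiv\.org/abs/(.+)$"), r"https://arxiv.org/pdf/\1"),
    (re.compile(r"^https://openreview\.net/forum\?id=(.+)$"), r"https://openreview.net/pdf?id=\1"),
]

def to_pdf_url(link: str) -> Optional[str]:
    """Return a direct PDF URL for a known link format, or None if the format is unknown."""
    if link.endswith(".pdf"):
        return link  # already a direct PDF link
    for pattern, replacement in RULES:
        if pattern.match(link):
            return pattern.sub(replacement, link)
    return None  # unknown format: leave for manual handling

print(to_pdf_url("https://arxiv.org/abs/1909.02453"))
print(to_pdf_url("https://openreview.net/forum?id=Syg3FDjntN"))
```

A rule table like this keeps each URL format on one line, and matching on the identifier with a capture group avoids depending on the identifier having a fixed length.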

First, the lists used to store the results:

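list_all = []  # all entries except those in list5
list1 = []  # published
list2 = []  # unpublished
list3 = []  # belongs to EC
list4 = []  # does not belong to EC
list5 = []  # problematic entries
list6 = []  # entries to be downloaded manually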

Next, extract each paper's title and link:

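import requests
from bs4 import BeautifulSoup

def get_page(url):
    page = requests.get(url)
    html = page.text
    return html

def get_list_all(html):
    soup = BeautifulSoup(html, "html.parser")
    data = soup.select('#post-722 > ul > li')
    for item in data:
        if item is None:
            continue
        var = str(item)
        sstr = item.get_text().strip()
        # Extract the paper's title and link.
        # - Every entry contains a "(...)" part after the title, but some titles
        #   contain nested parentheses, so search for '(' from the right.
        # - Some entries have no <a> tag, either because the paper was taken down
        #   or because it never had a link. All of these are treated as "no <a> tag"
        #   and stored in a separate list.
        # - Some URLs are dead and return 404.
        # - Some entries have several links; the last one is the newest, so it is
        #   used as the main link.
        if var.find('<a') < 0:  # no link, or link unusable (no <a> tag)
            entry = {
                "title": sstr[0:sstr.rfind('(')],
                "problem": "link is wrong"
            }
            list5.append(entry)
            continue

        entry = {
            "title": sstr[0:sstr.rfind('(')],  # search for '(' from the right
            # search for 'http' from the right (normally one would search for ')',
            # but there may be several links, so search for 'http' instead)
            "link": sstr[sstr.rfind('http'):]
        }

        # The <strong> tag separates published from unpublished papers
        # (published titles are shown in bold).
        if var.find('<strong>') != -1:
            list1.append(entry)
        else:
            list2.append(entry)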
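To make the slicing easier to follow, here is a small stand-alone sketch of the same `rfind`-based split applied to a plain string. The helper name `split_entry` and both sample entries are made up for illustration; the "Title (authors; venue) link" layout is an assumption about how the page text looks once the tags are removed.

```python
def split_entry(text: str) -> dict:
    """Split one list entry into title and link using rfind-based slicing."""
    text = text.strip()
    title = text[:text.rfind("(")].strip()  # everything before the last '('
    link = text[text.rfind("http"):]        # from the last 'http' to the end
    return {"title": title, "link": link}

# Made-up entry with nested parentheses in the title.
nested = ("A Study of (Nested) Titles (Doe et al. 2019; accepted at XYZ) "
          "https://arxiv.org/abs/1909.02453")
# Made-up entry with two links; the last one is kept.
two_links = ("Another Paper (Roe 2018) https://example.org/old "
             "https://arxiv.org/abs/1801.00001")

print(split_entry(nested))
print(split_entry(two_links))
```

Searching from the right is what keeps parentheses inside the title intact and picks the newest of several links.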

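Each link is then examined in turn to build the list of PDFs. (The 404 check made the run too slow, so it is commented out for now.)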

def get_urls(items):
    pdfs = []
    for item in items:
        link = item["link"]
        if link[-4:] == '.pdf':  # link already ends in .pdf and can be downloaded directly
            pdfs.append(item)

        elif "https://arxiv.org/abs" in link:  # e.g. https://arxiv.org/abs/1909.02453
            # takes the last 10 characters as the arXiv identifier
            item["link"] = "https://arxiv.org/pdf/" + link[-10:]
            pdfs.append(item)

        elif "https://link.springer.com/chapter" in link:  # e.g. https://link.springer.com/chapter/10.1007/978-3-030-13001-5_12
            # takes the last 28 characters as the DOI
            item["link"] = "https://link.springer.com/content/pdf/" + link[-28:] + ".pdf"
            pdfs.append(item)

        elif "https://ieeexplore.ieee.org" in link:  # e.g. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8791709
            # takes the last 7 characters as the article number
            item["link"] = "https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=" + link[-7:]
            pdfs.append(item)

        elif "https://openreview.net" in link:  # e.g. https://openreview.net/forum?id=Syg3FDjntN and https://openreview.net/pdf?id=BJ-MRKkwG
            if link[23:28] == 'forum':
                item["link"] = link.replace("forum", "pdf")
            pdfs.append(item)
            continue

        elif "https://www.mitpressjournals.org" in link:  # e.g. https://www.mitpressjournals.org/doi/abs/10.1162/evco_a_00253
            item["link"] = link.replace("abs", "pdf")
            pdfs.append(item)

        elif "https://www.nature.com" in link:  # e.g. https://www.nature.com/articles/s42256-018-0006-z
            item["link"] = link + '.pdf'
            pdfs.append(item)

        elif "https://www.worldscientific.com" in link:  # e.g. https://www.worldscientific.com/doi/abs/10.1142/S1469026818500086
            item["link"] = link.replace("abs", "pdf")
            pdfs.append(item)

        elif "https://papers.nips.cc" in link:  # e.g. https://papers.nips.cc/paper/207-the-cascade-correlation-learning-architecture
            item["link"] = link + ".pdf"
            pdfs.append(item)

        elif "https://hal.archives-ouvertes.fr" in link:  # can be downloaded directly
            pdfs.append(item)

        elif "http://www.complex-systems.com" in link:  # e.g. http://www.complex-systems.com/abstracts/v04_i04_a06/
            # cannot be downloaded automatically, so download it manually
            entry = {
                "title": str(item["title"]),
                "problem": "can not be downloaded automatically"
            }
            list5.append(entry)
            list6.append(item)

        elif "https://www.sciencedirect.com" in link:  # e.g. https://www.sciencedirect.com/science/article/pii/S1361841518307734
            # cannot be downloaded automatically, so download it manually
            entry = {
                "title": str(item["title"]),
                "problem": "can not be downloaded automatically"
            }
            list5.append(entry)
            list6.append(item)

        elif "https://dl.acm.org" in link:  # e.g. https://dl.acm.org/citation.cfm?id=2834896
            # cannot be downloaded automatically, so download it manually
            entry = {
                "title": item["title"],
                "problem": "can not be downloaded automatically"
            }
            if link == "https://dl.acm.org/citation.cfm?id=94034":
                entry["problem"] = "the pdf does not exist"
                list5.append(entry)
            else:
                list6.append(item)
        else:
            entry = {
                "title": item["title"],
                "problem": "Unknown"
            }
            list5.append(entry)

        # 404 check, disabled because it is too slow:
        #if requests.head(item["link"]).status_code == 404:
        #    entry = {
        #        "title": item["title"],
        #        "problem": "link is wrong"
        #    }
        #    list5.append(entry)
        #    continue

    return pdfs
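The 404 check is slow mainly because each request waits for the previous one to finish. One possible way to speed it up, sketched below, is to issue the HEAD requests from a thread pool. This is not the original author's code: `head_status`, `find_dead_links`, the `workers` count, and the choice to treat unreachable links like 404s are all assumptions, and the sketch uses the standard library's `urllib` so that it runs without extra packages.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def head_status(url, timeout=10.0):
    """Return the HTTP status of a HEAD request, or 0 if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404
    except (urllib.error.URLError, OSError, ValueError):
        return 0  # unreachable host, timeout, or malformed URL

def find_dead_links(items, status_of=head_status, workers=16):
    """Return the items whose link answers 404 or cannot be reached at all."""
    items = list(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so statuses line up with items
        statuses = list(pool.map(lambda it: status_of(it["link"]), items))
    return [it for it, status in zip(items, statuses) if status in (0, 404)]
```

Passing the status function as a parameter makes it possible to try the filtering logic with a stub before sending real requests, for example `find_dead_links(pdfs, status_of=lambda url: 200)`.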

Download the PDFs:

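def download_pdf(pdfs):
    # number the files 1.pdf, 2.pdf, ... in download order
    for i, pdf in enumerate(pdfs, start=1):
        path = "D:\\Download\\autoDocuments\\" + str(i) + ".pdf"
        r = requests.get(pdf["link"])
        with open(path, "wb") as f:  # each file is closed as soon as it is written
            f.write(r.content)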

Some of the PDFs are not free to download. I downloaded those directly over the campus network and put them in a separate folder.

Parsing PDF content

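Since the goal is classification, I parse the abstract directly out of each PDF. (A very small number of PDFs have no abstract section; these are likewise handled with if/else branches.)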

# These imports assume the pdfminer3k package, whose API the calls below follow.
from pdfminer.pdfparser import PDFParser, PDFDocument, PDFSyntaxError
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

def pdf_miner_word(pdf, path):  # return the text of the document's abstract
    try:
        with open(path, 'rb') as fp:
            # create a PDF parser from the file object
            parser = PDFParser(fp)
            # create a PDF document
            doc = PDFDocument()
            # connect the parser and the document object
            parser.set_document(doc)
            doc.set_parser(parser)

            # supply the initial password (an empty string if there is none)
            doc.initialize()

            # check whether the document allows text extraction
            if not doc.is_extractable:
                raise PDFTextExtractionNotAllowed
            else:
                # create a PDF resource manager to manage shared resources
                rsrcmgr = PDFResourceManager()
                # create a PDF device object
                laparams = LAParams()
                device = PDFPageAggregator(rsrcmgr, laparams=laparams)
                # create a PDF interpreter object
                interpreter = PDFPageInterpreter(rsrcmgr, device)

                # process one page at a time
                for page in doc.get_pages():
                    interpreter.process_page(page)
                    # layout is an LTPage object holding the objects parsed from this
                    # page: LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
                    layout = device.get_result()
                    blocks = []
                    for x in layout:
                        if isinstance(x, LTTextBox):
                            blocks.append(x.get_text().strip().lower())
                    for i in range(len(blocks)):
                        if blocks[i].replace(' ', '') == 'abstract':
                            # two specific files (names ending in h6 / h8) are special-cased
                            if path[-6:-4] == 'h6':
                                return blocks[i+3]
                            elif path[-6:-4] == 'h8':
                                return blocks[i+4]
                            else:
                                return blocks[i+1]
                        elif blocks[i][0:8] == 'abstract':
                            return blocks[i][9:]
                        elif blocks[i] == '1 introduction':
                            return blocks[i+1]
                        elif blocks[i] == 'summary':
                            return blocks[i+1]
    except PDFSyntaxError:
        entry = {
            "title": pdf['title'],
            "problem": "fail to open pdf"
        }
        list5.append(entry)

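However, some of the PDFs downloaded earlier are still broken and cannot be opened, which is why exception handling is used here. Files of only about 3 KB, for example, are clearly faulty.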
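Such files can also be screened out before parsing. The sketch below is a simple heuristic, not part of the original code: it checks that a file is not suspiciously small and that it starts with the `%PDF-` header bytes. The function name `looks_like_pdf` and the 10 KB default threshold are assumptions and may need tuning.

```python
import os

def looks_like_pdf(path: str, min_size: int = 10 * 1024) -> bool:
    """Heuristic check: the file is at least min_size bytes and starts with b'%PDF-'."""
    if os.path.getsize(path) < min_size:
        return False  # too small, probably an error page saved under a .pdf name
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

A file that fails this check is often an HTML error or login page saved with a `.pdf` extension, so it can be added to the manual-download list without invoking the parser.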

Classification with the RAKE algorithm

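The final task is to classify the papers. Published versus unpublished is easy to decide from the <strong> tag. But how do we tell whether a paper belongs to EC (evolutionary computing) or not? My idea is keyword matching (not very efficient). Since the abstracts have already been extracted, we can match against the keywords found in each abstract. I listed a few evolutionary-computing keywords (not exhaustive) to serve as the match list; matching is case-insensitive: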

#match list to classify EC
children
crossover
EC
evolutionary
fitness
gene
generation
genetic
iteration
GA
MA
MOEA
mutate
mutation
NSGA
reproduction
selection

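So how do we extract keywords from an abstract? Here I use the RAKE algorithm. For a detailed explanation, see the experts' GitHub repositories; I am only reusing their work. RAKE mainly extracts phrases, and all I need is the score of each keyword. After some experiments, to improve precision, I match only keywords with a score >= 3.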

The original RAKE repository on GitHub:
https://github.com/zelandiya/RAKE-tutorial

Another blog post on the topic:

https://blog.csdn.net/chinwuforwork/article/details/77993277
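For readers who want a feel for where the scores come from, here is a heavily simplified sketch of the RAKE idea: split the text into candidate phrases at stop words and punctuation, score each word as degree divided by frequency, and score each phrase as the sum of its word scores. It is not the `rake.py` from the repository above; the tiny `STOP_WORDS` set, the function name `rake_scores`, and the tokenization rules are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list; a real run would use a full list such as SmartStoplist.txt.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "we", "with"}

def rake_scores(text, stop_words=STOP_WORDS):
    """Return (phrase, score) pairs sorted by descending score."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9-]+", fragment):
            if word in stop_words:
                if current:
                    phrases.append(current)  # a stop word ends the current phrase
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases each word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scored = {}
    for phrase in phrases:
        scored[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

text = "We use a genetic algorithm with mutation and crossover for neural architecture search."
for phrase, score in rake_scores(text):
    print(phrase, score)
```

In this sketch a word that only ever appears as a stand-alone phrase scores 1, while words inside longer phrases score higher, so a score threshold favors multi-word phrases.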

from rake import Rake  # assumes rake.py from the RAKE-tutorial repository is importable

def load_match_words(match_word_file):
    match_words = []
    with open(match_word_file) as f:
        for line in f:
            if line.strip()[0:1] != "#":  # skip comment lines
                for word in line.split():  # in case more than one per line
                    match_words.append(word.lower())
    return match_words

def pre_process_abstract(abstract):  # pre-process the text of the abstract
    abstract = abstract.strip()
    abstract = abstract.replace('-', '')
    return abstract


def abstract_analyze(pdf, abstract):
    match_word_file = "Matchlist.txt"
    match = load_match_words(match_word_file)
    stop_words_path = "SmartStoplist.txt"
    r = Rake(stop_words_path)
    temp = r.run(abstract)
    matched = []
    for item in temp:
        if item[1] >= 3:  # keep only keywords with a score of at least 3
            matched.append(item)
    flag = False
    for item in matched:
        if item[0] in match:
            list3.append(pdf)  # belongs to EC
            flag = True
            break
    if not flag:
        list4.append(pdf)  # does not belong to EC
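The function above compares each whole keyword phrase with the match list, so a phrase such as "genetic algorithm" only counts if that exact phrase is listed. A possible variant, sketched below, checks the individual words of each high-scoring phrase instead. This is an assumption about what may work better, not the original author's method; the name `is_ec` and the sample keyword tuples are made up for illustration.

```python
def is_ec(keywords, match_words, min_score=3.0):
    """True if any word of a keyword phrase scoring >= min_score is in match_words."""
    wanted = {w.lower() for w in match_words}
    for phrase, score in keywords:
        if score < min_score:
            continue  # ignore low-scoring phrases
        if any(word in wanted for word in phrase.lower().split()):
            return True
    return False

# Made-up (phrase, score) pairs in the shape RAKE returns.
sample_keywords = [
    ("neural architecture search", 9.0),
    ("genetic algorithm", 4.0),
    ("mutation", 1.0),
]
print(is_ec(sample_keywords, ["genetic", "GA"]))
print(is_ec(sample_keywords, ["mutation"]))
```

Word-level matching is more permissive than phrase-level matching, so it may raise recall at the cost of some precision; the score threshold still filters out weak keywords.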

Results

In the code, the results are stored permanently in MySQL.

For easier viewing, I present them as an HTML5 page:

http://www.blanktt.top/

The full code is available on GitHub:

https://github.com/blankTt/Browser.git
