How to Query ArXiv Paper Metadata with Python 3 via the ArXiv API

诸神缄默不语 - Personal CSDN blog post index

ArXiv API documentation:
arXiv API Access - arXiv info
arXiv API Basics - arXiv info
arXiv API User’s Manual - arXiv info

1. Imports

import urllib.request
from urllib.parse import quote

from xml.dom.minidom import parseString

2. Fetching the data

Note: running this code on Windows fails with IncompleteRead: IncompleteRead(112176 bytes read).
I don't know why; the same code runs fine on a Linux server, so I can only suspect it's an operating-system issue.
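
If you hit this error, a minimal workaround sketch (my own guess, not verified against this exact failure) is to catch http.client.IncompleteRead and keep whatever bytes were already received:

import http.client
import urllib.request

def read_all(url):
    # Read a response body; if the server closes the connection early,
    # fall back to the partial bytes that were already received.
    response = urllib.request.urlopen(url)
    try:
        return response.read()
    except http.client.IncompleteRead as e:
        return e.partial  # bytes received before the connection dropped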

1. Quick start

Call the ArXiv API with a keyword to get search results back (mind the nested quotes; this is basic Python string handling, so I'll skip the explanation):

keyword = '"math word problem"'
url = (
    "http://export.arxiv.org/api/query?search_query=all:"
    + keyword
    + "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
data = urllib.request.urlopen(url)

doc = parseString(data.read().decode("utf-8"))

For background on URL escaping, see: Python 3 Miscellaneous API Quick Reference (continuously updated…)
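
As a quick illustration of what quote() does here (a minimal sketch): characters listed in safe are left alone, and everything else that is not URL-safe gets percent-encoded:

from urllib.parse import quote

# The spaces get percent-encoded; ':' and '"' are kept because they are
# listed in `safe`.
print(quote('all:"math word problem"', safe=':"'))
# -> all:"math%20word%20problem"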

The return value doc is an xml.dom.minidom Document object; you can dump it to a text file for inspection like this:
doc.writexml(open("arxiv.xml", "w", encoding="utf-8"), addindent=" ", newl="\n")

2. Sample XML response

I think the field meanings are fairly self-explanatory:

<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dall%3A%22math%20word%20problem%22%26id_list%3D%26start%3D0%26max_results%3D1000" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=all:&quot;math word problem&quot;&amp;id_list=&amp;start=0&amp;max_results=1000</title>
  <id>http://arxiv.org/api/omit</id>
  <updated>2024-04-01T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">121</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1000</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/abs/2402.17916v2</id>
    <updated>2024-03-30T04:16:20Z</updated>
    <published>2024-02-27T22:07:52Z</published>
    <title>LLM-Resistant Math Word Problem Generation via Adversarial Attacks</title>
    <summary>  Large language models (LLMs) have significantly transformed the educational
landscape. As current plagiarism detection tools struggle to keep pace with
LLMs' rapid advancements, the educational community faces the challenge of
assessing students' true problem-solving abilities in the presence of LLMs. In
this work, we explore a new paradigm for ensuring fair evaluation -- generating
adversarial examples which preserve the structure and difficulty of the
original questions aimed for assessment, but are unsolvable by LLMs. Focusing
on the domain of math word problems, we leverage abstract syntax trees to
structurally generate adversarial examples that cause LLMs to produce incorrect
answers by simply editing the numeric values in the problems. We conduct
experiments on various open- and closed-source LLMs, quantitatively and
qualitatively demonstrating that our method significantly degrades their math
problem-solving ability. We identify shared vulnerabilities among LLMs and
propose a cost-effective approach to attack high-cost models. Additionally, we
conduct automatic analysis on math problems and investigate the cause of
failure, offering a nuanced view into model's limitation.
</summary>
    <author>
      <name>Roy Xie</name>
    </author>
    <author>
      <name>Chengxuan Huang</name>
    </author>
    <author>
      <name>Junlin Wang</name>
    </author>
    <author>
      <name>Bhuwan Dhingra</name>
    </author>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">Code/data: https://github.com/ruoyuxie/adversarial_mwps_generation</arxiv:comment>
    <link href="http://arxiv.org/abs/2402.17916v2" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/2402.17916v2" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
  ...
</feed>
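
For example, the feed-level opensearch:totalResults element gives the total number of matches. A small sketch building on the doc object above (minidom matches the prefixed tag name as it appears in the document):

total_results = int(
    doc.getElementsByTagName("opensearch:totalResults")[0].childNodes[0].data
)
print(total_results)  # 121 in the sample above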

3. Adding category info to get the latest papers

The ArXiv API only serves the latest papers when you specify a category ("The current arXiv feeds only give you updates on new papers within the category you specify.").
(That said, in my own use I haven't noticed much difference in update speed; then again, I'm not a heavy user, so I'm not sure.)

For ArXiv category IDs, see: Category Taxonomy

Code:

keyword = '"math word problem"'
taxonomy = "cs.AI"
url = (
    "http://export.arxiv.org/api/query?search_query=all:"
    + keyword
    + "AND+cat:"
    + taxonomy
    + "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
data = urllib.request.urlopen(url)

doc = parseString(data.read().decode("utf-8"))
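
For result sets larger than one response, you can page through with the start offset. Below is a minimal sketch assuming the variables above are already defined; note the API user's manual also recommends a roughly 3-second delay between consecutive requests:

import time

page_size = 100
start = 0
all_entries = []
while True:
    page_url = (
        "http://export.arxiv.org/api/query?search_query=all:"
        + keyword
        + "+AND+cat:"
        + taxonomy
        + f"&start={start}&max_results={page_size}"
    )
    page_url = quote(page_url, safe='%/:=&?~#+!$,;@()*[]"')
    page_doc = parseString(urllib.request.urlopen(page_url).read().decode("utf-8"))
    entries = page_doc.getElementsByTagName("entry")
    if not entries:
        break  # past the last page
    all_entries.extend(entries)
    start += page_size
    time.sleep(3)  # be polite: ~3s between consecutive API calls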

4. ArXiv query parameters in detail

Omitted; to be filled in later.

3. Parsing the XML data

Each element of total_list is a paper-metadata object in dict form:

collection = doc.documentElement

total_list = []

for entry in collection.getElementsByTagName("entry"):
    now_list = {}  # metadata dict for one paper
    now_list["paper_url"] = entry.getElementsByTagName("id")[0].childNodes[0].data
    now_list["updated_date"] = (
        entry.getElementsByTagName("updated")[0].childNodes[0].data
    )
    now_list["publication_date"] = (
        entry.getElementsByTagName("published")[0].childNodes[0].data
    )
    now_list["title"] = (
        entry.getElementsByTagName("title")[0].childNodes[0].data.replace("\n", " ")
    )
    now_list["summary"] = (
        entry.getElementsByTagName("summary")[0].childNodes[0].data.replace("\n", " ").strip()
    )

    # Join all author names with "; ", then drop the trailing separator
    author_str = ""
    for author in entry.getElementsByTagName("author"):
        author_str += author.getElementsByTagName("name")[0].childNodes[0].data + "; "
    now_list["authors"] = author_str[:-2]

    # arxiv:comment is optional, so guard against its absence
    comments = entry.getElementsByTagName("arxiv:comment")
    if comments:
        now_list["comment"] = comments[0].childNodes[0].data
    else:
        now_list["comment"] = ""

    # rel="alternate" points at the abstract page; the PDF link is
    # rel="related" with type="application/pdf"
    links = entry.getElementsByTagName("link")
    for link in links:
        rel = link.getAttribute("rel")
        href = link.getAttribute("href")
        link_type = link.getAttribute("type")
        if rel == "alternate":
            now_list["alternate_link"] = href
        elif rel == "related" and link_type == "application/pdf":
            now_list["pdf_link"] = href

    primary_categories = entry.getElementsByTagName("arxiv:primary_category")
    if primary_categories:
        now_list["primary_category"] = primary_categories[0].getAttribute("term")
    else:
        now_list["primary_category"] = ""

    # All categories (including the primary one), joined with "; "
    categories = entry.getElementsByTagName("category")
    category_list = []
    for category in categories:
        category_term = category.getAttribute("term")
        category_list.append(category_term)
    now_list["categories"] = "; ".join(category_list)

    total_list.append(now_list)
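
A quick usage example (assuming the loop above has run): print a one-line overview per paper, or dump everything to CSV with the standard library. Note that alternate_link and pdf_link are only set when the corresponding <link> elements exist, so the CSV header below is built from the union of all keys:

import csv

# One-line overview per paper
for paper in total_list:
    print(paper["publication_date"], paper["title"], paper["paper_url"])

# Dump all fields to a CSV file; missing keys are written as empty cells
fieldnames = sorted({key for paper in total_list for key in paper})
with open("arxiv_papers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(total_list)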