arXiv API documentation:
arXiv API Access - arXiv info
arXiv API Basics - arXiv info
arXiv API User’s Manual - arXiv info
1. Imports
import urllib.request
from urllib.parse import quote
from xml.dom.minidom import parseString
2. Fetching the data
Note: running this code on Windows can fail with `IncompleteRead: IncompleteRead(112176 bytes read)`.
I don't know why, but the same code runs fine on a Linux server… so I can only suspect it's an OS issue.
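If you hit this error, one possible workaround (my own assumption, not something from the arXiv docs) is to catch `http.client.IncompleteRead` and fall back to the bytes received so far, which the exception exposes as `e.partial`. Whether those bytes still parse as valid XML depends on where the stream was cut off:

```python
import http.client
import urllib.request

def read_all(response):
    """Read an HTTP response body, tolerating a truncated stream.

    If the server closes the connection early, http.client raises
    IncompleteRead; e.partial still holds everything received so far.
    """
    try:
        return response.read()
    except http.client.IncompleteRead as e:
        return e.partial

# usage sketch:
# data = urllib.request.urlopen(url)
# raw = read_all(data)
```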
1. Getting started
Call the arXiv API with a keyword and get the search results back (mind the nested quotes; this is basic Python, so no further explanation):
keyword = '"math word problem"'
url = (
"http://export.arxiv.org/api/query?search_query=all:"
+ keyword
+ "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
data = urllib.request.urlopen(url)
doc = parseString(data.read().decode("utf-8"))
For URL-escaping background, see: Python 3 quick reference for other common APIs (continuously updated…)
The return value `doc` is an `xml.dom` Document object, which can be dumped to a text file like this:
doc.writexml(open("arxiv.xml", "w", encoding="utf-8"), addindent=" ", newl="\n")
2. Example XML response
I think the fields are fairly self-explanatory:
<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
<link href="http://arxiv.org/api/query?search_query%3Dall%3A%22math%20word%20problem%22%26id_list%3D%26start%3D0%26max_results%3D1000" rel="self" type="application/atom+xml"/>
<title type="html">ArXiv Query: search_query=all:"math word problem"&id_list=&start=0&max_results=1000</title>
<id>http://arxiv.org/api/omit</id>
<updated>2024-04-01T00:00:00-04:00</updated>
<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">121</opensearch:totalResults>
<opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
<opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1000</opensearch:itemsPerPage>
<entry>
<id>http://arxiv.org/abs/2402.17916v2</id>
<updated>2024-03-30T04:16:20Z</updated>
<published>2024-02-27T22:07:52Z</published>
<title>LLM-Resistant Math Word Problem Generation via Adversarial Attacks</title>
<summary> Large language models (LLMs) have significantly transformed the educational
landscape. As current plagiarism detection tools struggle to keep pace with
LLMs' rapid advancements, the educational community faces the challenge of
assessing students' true problem-solving abilities in the presence of LLMs. In
this work, we explore a new paradigm for ensuring fair evaluation -- generating
adversarial examples which preserve the structure and difficulty of the
original questions aimed for assessment, but are unsolvable by LLMs. Focusing
on the domain of math word problems, we leverage abstract syntax trees to
structurally generate adversarial examples that cause LLMs to produce incorrect
answers by simply editing the numeric values in the problems. We conduct
experiments on various open- and closed-source LLMs, quantitatively and
qualitatively demonstrating that our method significantly degrades their math
problem-solving ability. We identify shared vulnerabilities among LLMs and
propose a cost-effective approach to attack high-cost models. Additionally, we
conduct automatic analysis on math problems and investigate the cause of
failure, offering a nuanced view into model's limitation.
</summary>
<author>
<name>Roy Xie</name>
</author>
<author>
<name>Chengxuan Huang</name>
</author>
<author>
<name>Junlin Wang</name>
</author>
<author>
<name>Bhuwan Dhingra</name>
</author>
<arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">Code/data: https://github.com/ruoyuxie/adversarial_mwps_generation</arxiv:comment>
<link href="http://arxiv.org/abs/2402.17916v2" rel="alternate" type="text/html"/>
<link title="pdf" href="http://arxiv.org/pdf/2402.17916v2" rel="related" type="application/pdf"/>
<arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
</entry>
...
</feed>
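The feed-level `opensearch:*` elements tell you how many results matched in total, which is what you need to decide whether to page through further results. A minimal sketch of reading them with `minidom` (the inline sample feed below is trimmed down for illustration); note that `getElementsByTagName` matches the qualified tag name, prefix included:

```python
from xml.dom.minidom import parseString

sample = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">121</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
</feed>"""

doc = parseString(sample)

def text_of(doc, tag):
    # minidom matches on the qualified tag name, prefix and all
    return doc.getElementsByTagName(tag)[0].childNodes[0].data

total = int(text_of(doc, "opensearch:totalResults"))
start = int(text_of(doc, "opensearch:startIndex"))
```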
3. Adding a category to get the newest papers
The arXiv API only serves the newest papers when you specify a category ("The current arXiv feeds only give you updates on new papers within the category you specify.")
(That said, in my own usage I haven't noticed much difference in update speed… but I'm not a heavy user, so I can't be sure.)
arXiv category IDs are listed in the Category Taxonomy.
Code:
keyword = '"math word problem"'
taxonomy = "cs.AI"
url = (
"http://export.arxiv.org/api/query?search_query=all:"
+ keyword
+ "AND+cat:"
+ taxonomy
+ "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
data = urllib.request.urlopen(url)
doc = parseString(data.read().decode("utf-8"))
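Since `max_results` is capped per request, fetching everything when `totalResults` is large means paging via the `start` parameter. Here is a sketch using a hypothetical helper `page_urls` (not part of the API, just my own wrapper around the URL-building code above); when actually fetching, the arXiv API guidance asks for a pause of roughly three seconds between requests:

```python
from urllib.parse import quote

def page_urls(keyword, total, page_size=100):
    """Build one query URL per page of results (hypothetical helper).

    arXiv pages results via the start parameter, so `total` matches
    need ceil(total / page_size) requests.
    """
    base = "http://export.arxiv.org/api/query?search_query=all:" + keyword
    urls = []
    for start in range(0, total, page_size):
        url = base + f"&start={start}&max_results={page_size}"
        urls.append(quote(url, safe='%/:=&?~#+!$,;@()*[]"'))
    return urls

urls = page_urls('"math word problem"', total=121, page_size=100)
# 121 results at 100 per page -> two URLs, start=0 and start=100.
# When fetching, sleep between requests (e.g. time.sleep(3)).
```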
4. arXiv query parameters in detail
Omitted for now; to be filled in later.
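Even without the full parameter rundown, the parameters used so far (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) can be assembled more robustly with `urllib.parse.urlencode` instead of string concatenation. A sketch, under the assumption that the `+`-as-space convention shown in the arXiv examples should be preserved:

```python
from urllib.parse import urlencode, quote

params = {
    "search_query": 'all:"math word problem"+AND+cat:cs.AI',
    "start": 0,
    "max_results": 1000,
    "sortBy": "lastUpdatedDate",
    "sortOrder": "descending",
}
# quote_via=quote keeps the arXiv-style "+" separators intact;
# urlencode's default quote_plus would escape them.
url = "http://export.arxiv.org/api/query?" + urlencode(
    params, safe='+:"', quote_via=quote
)
```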
3. Parsing the XML
Each element of `total_list` is a paper-metadata dict:
collection = doc.documentElement
total_list = []
for entry in collection.getElementsByTagName("entry"):
    now_list = {}
    now_list["paper_url"] = entry.getElementsByTagName("id")[0].childNodes[0].data
    now_list["updated_date"] = entry.getElementsByTagName("updated")[0].childNodes[0].data
    now_list["publication_date"] = entry.getElementsByTagName("published")[0].childNodes[0].data
    now_list["title"] = (
        entry.getElementsByTagName("title")[0].childNodes[0].data.replace("\n", " ")
    )
    now_list["summary"] = (
        entry.getElementsByTagName("summary")[0].childNodes[0].data.replace("\n", " ").strip()
    )
    # Join all author names with "; "
    author_str = ""
    for author in entry.getElementsByTagName("author"):
        author_str += author.getElementsByTagName("name")[0].childNodes[0].data + "; "
    now_list["authors"] = author_str[:-2]
    # arxiv:comment is optional, so guard against its absence
    comments = entry.getElementsByTagName("arxiv:comment")
    now_list["comment"] = comments[0].childNodes[0].data if comments else ""
    # rel="alternate" is the abstract page; the PDF link is rel="related"
    # with type application/pdf
    for link in entry.getElementsByTagName("link"):
        rel = link.getAttribute("rel")
        href = link.getAttribute("href")
        link_type = link.getAttribute("type")
        if rel == "alternate":
            now_list["alternate_link"] = href
        elif rel == "related" and link_type == "application/pdf":
            now_list["pdf_link"] = href
    primary_categories = entry.getElementsByTagName("arxiv:primary_category")
    now_list["primary_category"] = (
        primary_categories[0].getAttribute("term") if primary_categories else ""
    )
    now_list["categories"] = "; ".join(
        category.getAttribute("term")
        for category in entry.getElementsByTagName("category")
    )
    total_list.append(now_list)
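Once `total_list` is built, a natural next step is persisting it. A minimal sketch using the standard `json` module; the file name and sample record here are just illustrations, in practice you would pass `total_list` itself:

```python
import json

def save_papers(papers, path):
    """Dump a list of paper-metadata dicts (like total_list above) to JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(papers, f, ensure_ascii=False, indent=2)

sample = [{"title": "Example", "authors": "A; B", "categories": "cs.CL; cs.AI"}]
save_papers(sample, "arxiv_papers.json")
```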