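Preface:
Recently I have been working on applying deep learning to materials, and the first step is downloading a dataset. OQMD offers a larger amount of data and high structural diversity, so I wanted to use it as the source of my training set. However, after searching around I found no code that could be run as-is to download the data. I summarized the material I collected and wrote my own Python script for downloading data from OQMD.
Code implementation:
Main driver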
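import csv
import os

import requests


def get_poscar_data_from_oqmd():
    # Starting URL: placeholder, replace it with the JSON URL of your filtered query
    initial_url_template = "starting URL"
    csv_name = "materials_ids.csv"
    links_name = "links"
    max_attempts = 5
    attempts = 0
    # Folder that receives the POSCAR files
    os.makedirs("dataset", exist_ok=True)
    # Write the column header if the CSV file is missing or empty
    if not os.path.exists(csv_name) or os.path.getsize(csv_name) == 0:
        with open(csv_name, "w", newline="") as f:
            csv.writer(f).writerow(["Material_ID"])
    # Resume from the cached URLs after an interruption
    while attempts < max_attempts:
        all_links = []
        if os.path.exists(links_name):
            with open(links_name, "r") as f:
                all_links = [line.strip() for line in f if line.strip()]
        rows = len(all_links)
        try:
            if rows == 0:
                # Nothing cached yet: start from the first page
                link = initial_url_template
            else:
                # The last cached page was already requested; continue with the page after it
                response = requests.get(all_links[-1])
                link = response.json()["links"]["next"]
            get_all_links_and_data(link, rows + 1, csv_name)
            print("All data has been downloaded")
            return
        except (requests.RequestException, ValueError, KeyError) as err:
            # Count the failure and retry from the cache
            attempts += 1
            print(f"Attempt {attempts} failed: {err}")
    print("Stopped after too many failed attempts; run the script again to resume")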
This is the top-level entry point that calls everything else.
Fetching the pages:
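The resume step can also be looked at in isolation. The sketch below assumes the cache is a plain text file with one URL per line, as in the driver above. The helper name `last_cached_link` is hypothetical and not part of the original script.

```python
import os


def last_cached_link(path="links"):
    """Return the last non-empty line of the link cache, or None if there is none.

    Hypothetical helper: the name and the default path are illustrative.
    """
    if not os.path.exists(path):
        return None
    with open(path, "r") as f:
        # Drop trailing newlines and skip blank lines
        links = [line.strip() for line in f if line.strip()]
    return links[-1] if links else None
```

Stripping each line matters here, because `readlines()` keeps the trailing newline and it would otherwise be passed along as part of the URL.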
def get_all_links_and_data(url, rows, csv_file_name):
    # Note: cubic_sp is not defined in this post; it is assumed to be a
    # module-level collection of the space groups you want to keep
    next_link = url
    i = rows
    while next_link:
        print(f"Link {i}")
        response = requests.get(next_link)
        data_json = response.json()
        # Cache the URL so that an interrupted run can resume from here
        with open("links", "a") as file_links:
            file_links.write(next_link + "\n")
        data = data_json["data"]
        count = 0
        with open(csv_file_name, "a", newline="") as csv_file:
            csv_writer = csv.writer(csv_file)
            for structure in data:
                if structure["attributes"]["_oqmd_spacegroup"] in cubic_sp:
                    count += 1
                    poscar, filename = get_poscar_from_optimade_structure(structure)
                    structure_id = structure["id"]
                    csv_writer.writerow([structure_id])
                    # One POSCAR file per matching structure
                    with open(os.path.join("dataset", "POSCAR_" + str(structure_id)), "w") as f:
                        f.write(poscar)
        print(f"Page {i}: {count} crystals match the filter")
        # Move on to the next page; the loop ends when no next link is returned
        next_link = data_json["links"].get("next")
        i += 1
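This part downloads the data from the JSON pages. Note that the structures are saved in POSCAR format. If you need CIF files, they have to be converted with pymatgen; I will add that later if it turns out to be needed.
Building the POSCAR file: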
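For the CIF conversion mentioned above, a minimal sketch with pymatgen could look like this. It assumes pymatgen is installed and that the POSCAR files were already written by the code above. The function name `poscar_to_cif`, the `cif` output folder, and the naming scheme are my own illustrative choices, not part of the original script.

```python
import os


def poscar_to_cif(poscar_path, cif_dir="cif"):
    """Convert one POSCAR file to CIF and return the path of the CIF file.

    Assumptions: pymatgen is installed; cif_dir and the file naming are illustrative.
    """
    # Imported inside the function so the module still loads where pymatgen is missing
    from pymatgen.io.cif import CifWriter
    from pymatgen.io.vasp import Poscar

    # Parse the POSCAR file into a pymatgen Structure
    structure = Poscar.from_file(poscar_path).structure
    os.makedirs(cif_dir, exist_ok=True)
    cif_path = os.path.join(cif_dir, os.path.basename(poscar_path) + ".cif")
    # Write the same structure in CIF format
    CifWriter(structure).write_file(cif_path)
    return cif_path
```

Looping this over every file in the dataset folder would give one CIF file per downloaded structure.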
def get_poscar_from_optimade_structure(structure):
    attributes = structure['attributes']
    # The first POSCAR line is a comment; include the OQMD entry ID when it is available
    if '_oqmd_entry_id' in attributes:
        poscar = ["REST API StructureID {}, OQMD Entry ID {}".format(
            structure['id'], attributes['_oqmd_entry_id']
        )]
        filename = "ID-{}_OQMD-EnID-{}.poscar".format(structure['id'], attributes['_oqmd_entry_id'])
    else:
        poscar = ["REST API StructureID {}".format(structure['id'])]
        filename = "ID-{}.poscar".format(structure['id'])
    # Scaling factor, followed by the three lattice vectors
    poscar.append("1.0")
    poscar += [" ".join(str(x) for x in vector) for vector in attributes['lattice_vectors']]
    # Element symbols and the number of sites per element
    elems = []
    counts = []
    for item in attributes['species_at_sites']:
        if item in elems:
            # A POSCAR lists the sites of each element as one contiguous group
            assert elems.index(item) == len(elems) - 1
            counts[-1] += 1
        else:
            elems.append(item)
            counts.append(1)
    poscar.append(" ".join(elems))
    poscar.append(" ".join(str(c) for c in counts))
    # Atomic positions in Cartesian coordinates
    poscar.append("Cartesian")
    poscar += [" ".join(str(x) for x in pos) for pos in attributes['cartesian_site_positions']]
    return "\n".join(poscar), filename
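This part of the code takes the crystal information read from the URL above and turns it into POSCAR file content.
Summary:
That is the overall flow of the code for downloading a dataset from OQMD; some details have not been polished yet. Before using it, first run a search with your filter conditions to obtain a JSON URL, and use that URL to initialize initial_url_template. Because failed requests were taken into account, after an interruption you can simply run the code again and it continues where it stopped. If a URL cannot be accessed at all, move the offset part of the URL forward by one page. Converting the files to CIF afterwards is a relatively easy problem.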
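The assert in the function above fails when the same element appears at non-adjacent sites, because a POSCAR lists all sites of one element together. One possible workaround, sketched below, is to reorder the sites before building the file. The helper name `group_sites_by_species` is hypothetical; it only relies on the `species_at_sites` and `cartesian_site_positions` attributes already used above.

```python
def group_sites_by_species(species_at_sites, cartesian_site_positions):
    """Reorder sites so that all sites of one element are contiguous.

    Hypothetical helper, not part of the original script. Elements keep the
    order of their first appearance, and sites keep their relative order
    within each element.
    """
    # Rank each element by the position of its first appearance
    first_seen = {}
    for symbol in species_at_sites:
        first_seen.setdefault(symbol, len(first_seen))
    # sorted() is stable, so sites of the same element keep their original order
    order = sorted(range(len(species_at_sites)),
                   key=lambda idx: first_seen[species_at_sites[idx]])
    species = [species_at_sites[idx] for idx in order]
    positions = [cartesian_site_positions[idx] for idx in order]
    return species, positions
```

The two returned lists can replace the corresponding attributes of a structure before it is passed to get_poscar_from_optimade_structure.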
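Skipping a page that cannot be fetched means increasing the offset in the query string. The sketch below does this with the standard library. It assumes the offset is carried in a query parameter named `page_offset`; that name is an assumption here, so check it against the URL you actually obtained. The function name `bump_offset` is hypothetical, and the step size (how many entries make up one page) has to be supplied by the caller.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def bump_offset(url, step, param="page_offset"):
    """Return url with the integer query parameter `param` increased by `step`.

    Assumption: the offset lives in a query parameter called `param`;
    a missing parameter is treated as 0.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    current = int(query.get(param, ["0"])[0])
    query[param] = [str(current + step)]
    # Rebuild the URL with the updated query string
    new_query = urlencode(query, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, parts.fragment))
```

The returned URL can be appended to the links cache by hand, so that the next run resumes from the page after the broken one. Note that `urlencode` may percent-encode characters in the other parameters, which leaves their meaning unchanged.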