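This example shows how to use a large language model to extract entities and relations according to a Schema definition and build them into a knowledge graph.
Step1: Enter the example directory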
cd python/knext/knext/examples/medicine/
Step2: Initialize the project
Run the following command to initialize the project:
knext project create --prj_path .
Step3: Knowledge modeling
The schema file already defines the medical SPG Schema model. Its contents are:
namespace Medicine
Disease(疾病): EntityType
    properties:
        complication(并发症): Disease
            constraint: MultiValue
        commonSymptom(常见症状): Symptom
            constraint: MultiValue
        applicableDrug(适用药品): Drug
            constraint: MultiValue
        treatmentMethod(治疗方法): TreatmentMethod
            constraint: MultiValue
        diseaseSite(发病部位): BodyPart
            constraint: MultiValue
        diseaseFrequency(发病率): Float
        deathRate(死亡率): Float
        highRiskGroup(高风险人群): Text
        department(就诊科室): Text
            constraint: MultiValue
    relations:
        abnormalIndicator(异常指标): BiologicalMarker
            properties:
                detectionMethod(检测方法): Text
                normalRange(正常范围): Text
                relatedDisease(相关疾病): Text

Symptom(症状): EntityType
    desc: 这是一个症状

Drug(药品): EntityType
    properties:
        dose(剂量): Float
        sideEffect(副作用): Text
        pharmaceuticalCompany(制药公司): Text

BiologicalMarker(生物标志物): EntityType
    properties:
        detectionMethod(检测方法): Text
        normalRange(正常范围): Text
        relatedDisease(相关疾病): Disease

BodyPart(人体部位): ConceptType
    hypernymPredicate: isA

TreatmentMethod(治疗方法): EntityType
    properties:
        successRate(成功率): Float
        sideEffect(副作用): Text
        applicableDisease(适用疾病): Disease
            constraint: MultiValue

MedicalDevice(医疗设备): EntityType
    properties:
        manufacturer(制造商): Text
        usage(使用方法): Text
        applicableDisease(适用疾病): Disease
            constraint: MultiValue

Gene(基因): EntityType
    properties:
        location(位置): Text
        function(功能): Text
        relatedDisease(相关疾病): Disease
            constraint: MultiValue

Protein(蛋白质): EntityType
    properties:
        function(功能): Text
        structure(结构): Text
        relatedDisease(相关疾病): Disease
            constraint: MultiValue

Patient(患者): EntityType
    properties:
        age(年龄): Integer
        gender(性别): Text
        diseaseHistory(疾病历史): Disease
            constraint: MultiValue
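To get a quick overview of which types a schema file declares, the type header lines can be scanned with a regular expression. The following is a minimal sketch and not OpenSPG's schema parser: the helper name `list_schema_types` is hypothetical, and the assumption that type headers look like `Name(alias): EntityType` or `Name(alias): ConceptType` is taken only from the listing above.

```python
import re

# Matches type headers such as "Drug(药品): EntityType".
# Property lines (e.g. "dose(剂量): Float") are skipped because their
# right-hand side is not one of the type kinds listed in the pattern.
TYPE_HEADER = re.compile(r"^\s*(\w+)\(([^)]*)\):\s*(EntityType|ConceptType)\s*$")


def list_schema_types(schema_text):
    """Return a (name, alias, kind) tuple for each type header in the text."""
    found = []
    for line in schema_text.splitlines():
        match = TYPE_HEADER.match(line)
        if match:
            found.append(match.groups())
    return found


# Small fragment copied from the schema above, used only for illustration.
SAMPLE = """\
Drug(药品): EntityType
    properties:
        dose(剂量): Float
BodyPart(人体部位): ConceptType
    hypernymPredicate: isA
"""

for name, alias, kind in list_schema_types(SAMPLE):
    print(name, alias, kind)
```

Because the pattern keys on the type kind rather than on indentation, it also works on a listing whose indentation has been lost.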
Commit the schema with the following command:
knext schema commit
# Submit the concept import jobs for body parts and hospital departments
knext builder execute BodyPart,HospitalDepartment
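Step4: Knowledge extraction and construction
In this graph, the "Disease" entity type has to be extracted from unstructured text data to obtain structured knowledge.
The raw input is the Disease source text, for example: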
左乳癌术后。两肺纹理清晰,两肺多发结节状密度增高影,边界清晰,较大者直径约6mm。右肺中叶及左肺上叶可见斑条状、条索状密度增高影,边界清。两肺门无增大,气管支气管通畅,纵隔未见明显肿大淋巴结。心影形态密度未见明显异常改变。所见骨质未见明显破坏征象。附见:胆囊结石。\r\n左乳癌术后;两肺多发结节灶,考虑增殖灶,较前片(2021.09.08)相仿,临床上有恶性肿瘤病史,建议随访。右肺中叶及左肺上叶纤维灶。附见:胆囊结石。
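The job code in this example reads its input from a CSV file with a single `input` column. As a rough illustration of how raw report text can be stored in such a file, the sketch below writes records with Python's standard `csv` module and reads them back. The helper names and the in-memory buffer are assumptions made for illustration, and the header-row layout is also assumed, so it may differ from the actual Disease.csv.

```python
import csv
import io


def write_reports(reports, handle):
    """Write each report as one row under a single 'input' header."""
    writer = csv.writer(handle)
    writer.writerow(["input"])
    for text in reports:
        writer.writerow([text])


def read_reports(handle):
    """Read the 'input' column back; DictReader consumes the header row."""
    return [row["input"] for row in csv.DictReader(handle)]


reports = [
    "左乳癌术后。附见:胆囊结石。",            # short excerpt of the sample above
    "hypothetical report, with a comma",  # placeholder text, not real data
]

buffer = io.StringIO(newline="")
write_reports(reports, buffer)
buffer.seek(0)
restored = read_reports(buffer)
print(restored == reports)
```

The `csv` module quotes any cell containing a comma, so each report stays in one cell even when the text itself contains commas.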
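In the extraction example below, an extraction operator calls the "gpt-3.5" model to perform the extraction task. The steps are:
First step: configure the model service. See builder/model/openai_infer.json for the configuration file. This example uses OpenAI's gpt-3.5 model, and the file has the following format.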
{
  "nn_name": "gpt-3.5-turbo",
  "openai_api_key": "EMPTY",
  "openai_api_base": "http://127.0.0.1:38080/v1",
  "openai_max_tokens": 2000
}
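Before starting a job, it can help to check that the configuration contains the expected fields. The following is a minimal sketch using only the standard library. `load_infer_config` is a hypothetical helper and not part of knext or nn4k, the required keys are simply those that appear in the file above, and the model name in the example is a placeholder.

```python
import json

# Keys taken from the configuration file shown above.
REQUIRED_KEYS = ("nn_name", "openai_api_key", "openai_api_base", "openai_max_tokens")


def load_infer_config(text):
    """Parse the JSON text and fail early if an expected key is missing."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("missing config keys: " + ", ".join(missing))
    if not isinstance(config["openai_max_tokens"], int):
        raise TypeError("openai_max_tokens should be an integer")
    return config


# Placeholder model name; the other values mirror the file above.
EXAMPLE = """
{
  "nn_name": "<model name>",
  "openai_api_key": "EMPTY",
  "openai_api_base": "http://127.0.0.1:38080/v1",
  "openai_max_tokens": 2000
}
"""

config = load_infer_config(EXAMPLE)
print(config["openai_api_base"])
```

In practice the text would come from reading builder/model/openai_infer.json rather than from an inline string.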
Second step: write the build job code for Disease. The code for this example is in builder/job/disease.py.
The Disease BuilderJob uses the LLMBasedExtractor operator:
from nn4k.invoker import NNInvoker
from knext.api.component import CSVReader, LLMBasedExtractor, SPGTypeMapping, KGWriter
from knext.api.auto_prompt import REPrompt
from knext.client.model.builder_job import BuilderJob
from schema.medicine_schema_helper import Medicine


class Disease(BuilderJob):
    def build(self):
        # Data source
        source = CSVReader(
            local_path="builder/job/data/Disease.csv",
            columns=["input"],
            start_row=1,
        )

        # Use the default LLMBasedExtractor extraction operator.
        # NNInvoker wraps the calls to the GPT model service.
        # REPrompt generates the prompt automatically from the schema.
        extract = LLMBasedExtractor(
            llm=NNInvoker.from_config("builder/model/openai_infer.json"),
            prompt_ops=[
                REPrompt(
                    spg_type_name=Medicine.Disease,
                    property_names=[
                        Medicine.Disease.complication,
                        Medicine.Disease.commonSymptom,
                        Medicine.Disease.applicableDrug,
                        Medicine.Disease.department,
                        Medicine.Disease.diseaseSite,
                    ],
                    relation_names=[(Medicine.Disease.abnormal, Medicine.Indicator)],
                )
            ],
        )

        # Map the extraction results to the schema types
        mappings = [
            SPGTypeMapping(spg_type_name=Medicine.Disease),
            SPGTypeMapping(spg_type_name=Medicine.BodyPart),
            SPGTypeMapping(spg_type_name=Medicine.Drug),
            SPGTypeMapping(spg_type_name=Medicine.HospitalDepartment),
            SPGTypeMapping(spg_type_name=Medicine.Symptom),
            SPGTypeMapping(spg_type_name=Medicine.Indicator),
        ]

        # Write the mapped results into the knowledge graph
        sink = KGWriter()

        return source >> extract >> mappings >> sink
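The `>>` chaining in the return statement can be read as wiring components into a directed pipeline. As a conceptual illustration only (this is not knext's implementation, and the `Node` class is hypothetical), the sketch below shows how Python's `__rshift__` and `__rrshift__` hooks allow both a single component and a list of components to appear in such a chain.

```python
class Node:
    """Hypothetical pipeline component that records its downstream nodes."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # Handles `node >> other`, where other is a Node or a list of Nodes.
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other

    def __rrshift__(self, other):
        # Handles `[a, b] >> node`: a list defines no `>>` of its own,
        # so Python falls back to the right operand's __rrshift__.
        for upstream in other:
            upstream.downstream.append(self)
        return self


source, extract, sink = Node("source"), Node("extract"), Node("sink")
mappings = [Node("map_disease"), Node("map_drug")]

# Same shape as the return statement in the job above.
last = source >> extract >> mappings >> sink

print([node.name for node in extract.downstream])
print([node.name for node in mappings[0].downstream])
```

In this sketch the extractor fans out to every mapping, and every mapping feeds the single sink, which is one plausible reading of a chain that contains a list in the middle.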
Third step: submit the knowledge extraction job
knext builder execute Disease,Symptom,Drug,BiologicalMarker,BodyPart,TreatmentMethod,MedicalDevice,Gene,Protein,Patient
Step5: Run graph query tasks
SPG supports ISO GQL syntax. Query tasks can be run from the command line as follows:
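View the knowledge graph contents
Common error:
Solution:
This is a known bug that will be fully fixed in the next release. For now, it can be resolved by upgrading knext in the local environment with pip install openspg-knext -U.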
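To confirm which knext version is installed before and after the upgrade, the package metadata can be queried from Python. This is a small sketch using the standard library. `installed_version` is a hypothetical helper name, and the distribution name is taken from the pip command above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("openspg-knext")
print("openspg-knext:", version if version else "not installed")
```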