1. Paper Overview
"LLaGA: Large Language and Graph Assistant"
The paper highlights three properties of the LLaGA model: versatility, generalizability, and interpretability.
The overall idea of the model:
1. Convert an arbitrary graph structure into a node sequence.
2. How the conversion works: take the center node as the root, expand its neighborhood into a tree, and read the tree out level by level (BFS). The number of neighbors sampled at each level is fixed in advance; if a node has fewer neighbors than the quota, the sequence is filled with a [pad] placeholder so every sequence has the same shape (a minimal sketch of this sampling-and-padding step follows this list).
3. In Figure 1, the graph-to-node-sequence part samples two hops of neighbors into the tree, three neighbors per hop; when fewer than three exist, the remaining slots are filled with [pad].
On top of this structural encoding:
1. Embed the graph's node text features with an off-the-shelf LM.
2. Align the two kinds of features (the node sequence and the text description): a trainable projector maps the node embeddings into the LLM's token space so both can be consumed by the model.
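To make the fixed-shape sampling concrete, here is a minimal sketch of the level-order encoding with padding. It is my own illustration of the idea described above, not the repository's code; the names (node_sequence, PAD, adj) are hypothetical.

import random

PAD = -1  # placeholder id standing in for the paper's [pad] node

def node_sequence(adj, center, num_hops=2, num_neighbors=3, seed=0):
    """Encode a fixed-shape neighborhood tree of `center` in level order (BFS).

    adj: dict mapping node id -> list of neighbor ids.
    Returns a flat list [center, hop-1 slots..., hop-2 slots...] in which every
    level has a fixed width (num_neighbors ** hop), padded with PAD.
    """
    rng = random.Random(seed)
    sequence = [center]
    frontier = [center]
    for _ in range(num_hops):
        next_frontier = []
        for node in frontier:
            if node == PAD:
                # a pad node only has pad children, which keeps the tree shape fixed
                children = [PAD] * num_neighbors
            else:
                neighbors = list(adj.get(node, []))
                rng.shuffle(neighbors)
                children = neighbors[:num_neighbors]
                children += [PAD] * (num_neighbors - len(children))
            next_frontier.extend(children)
        sequence.extend(next_frontier)
        frontier = next_frontier
    return sequence

# toy graph: with 2 hops and 3 neighbors per hop the sequence always has 1 + 3 + 9 = 13 slots
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(node_sequence(adj, center=0))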
2. Open-Source Code and Experiment Reproduction
Step 1: Environment Preparation
# create a new environment
conda create -n llaga python=3.10
conda activate llaga

# install pytorch. Modify the command to align with your own CUDA version.
pip3 install torch --index-url https://download.pytorch.org/whl/cu118

# install related libraries
pip install -r requirements.txt

# install flash-attn
pip install flash-attn --no-build-isolation

# install pyg
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
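After installation, a quick sanity check can confirm that PyTorch sees the GPU and that the graph libraries import cleanly. This is my own convenience snippet, not part of the repository:

import torch
import torch_geometric

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torch built with CUDA:", torch.version.cuda)
print("torch_geometric:", torch_geometric.__version__)

# flash-attn only runs on supported GPUs; an import failure here usually means a CUDA/arch mismatch
try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as err:
    print("flash_attn not importable:", err)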
Step 2: Data Preparation
Download our datasets from Box (link prediction data updated on 4/11/2024) and move the processed data to
./dataset
.
├── README.md
├── __init__.py
├── doc
├── dataset
│   ├── ogbn-arxiv
│   │   ├── sampled_2_10_test.jsonl
│   │   ├── sampled_2_10_train.jsonl
│   │   ├── edge_sampled_2_10_only_test.jsonl
│   │   ├── edge_sampled_2_10_only_train.jsonl
│   │   ├── processed_data.pt
│   │   ├── roberta_x.pt
│   │   ├── sbert_x.pt
│   │   ├── simteg_*_x.pt
│   ├── ogbn-products
│   │   ├── sampled_2_10_test.jsonl
│   │   ├── sampled_2_10_train.jsonl
│   │   ├── edge_sampled_2_10_only_test.jsonl
│   │   ├── edge_sampled_2_10_only_train.jsonl
│   │   ├── processed_data.pt
│   │   ├── roberta_x.pt
│   │   ├── sbert_x.pt
│   │   ├── simteg_*_x.pt
│   └── pubmed
│   │   ├── sampled_2_10_test.jsonl
│   │   ├── sampled_2_10_train.jsonl
│   │   ├── edge_sampled_2_10_only_test.jsonl
│   │   ├── edge_sampled_2_10_only_train.jsonl
│   │   ├── processed_data.pt
│   │   ├── roberta_x.pt
│   │   ├── sbert_x.pt
│   │   ├── simteg_*_x.pt
│   ├── cora
│   │   ├── sampled_2_10_test.jsonl
│   │   ├── sampled_2_10_train.jsonl
│   │   ├── edge_sampled_2_10_only_test.jsonl
│   │   ├── edge_sampled_2_10_only_train.jsonl
│   │   ├── processed_data.pt
│   │   ├── roberta_x.pt
│   │   ├── sbert_x.pt
│   │   ├── simteg_*_x.pt
│   ├── laplacian_2_10.pt
│   ├── laplacian_2_20.pt
│   ├── laplacian_2_5.pt
├── eval
├── model
├── requirements.txt
├── scripts
├── train
├── utils
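Before moving on, it can save time to verify that every file in the tree above actually landed under ./dataset. The following check script is my own (it only assumes the layout shown above):

from pathlib import Path

ROOT = Path("./dataset")
DATASETS = ["ogbn-arxiv", "ogbn-products", "pubmed", "cora"]
FILES = [
    "sampled_2_10_train.jsonl", "sampled_2_10_test.jsonl",
    "edge_sampled_2_10_only_train.jsonl", "edge_sampled_2_10_only_test.jsonl",
    "processed_data.pt", "roberta_x.pt", "sbert_x.pt",
]

missing = []
for ds in DATASETS:
    for name in FILES:
        if not (ROOT / ds / name).exists():
            missing.append(f"{ds}/{name}")
    # simteg embeddings are split across several simteg_*_x.pt files
    if not list((ROOT / ds).glob("simteg_*_x.pt")):
        missing.append(f"{ds}/simteg_*_x.pt")

for name in ["laplacian_2_5.pt", "laplacian_2_10.pt", "laplacian_2_20.pt"]:
    if not (ROOT / name).exists():
        missing.append(name)

print("all files present" if not missing else "missing:\n  " + "\n  ".join(missing))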
Step 3: Training
To execute the training process, you can run either
./scripts/train.sh
or ./scripts/train_deepspeed.sh. The usage instructions are as follows:

# Arguments
# $1 = model type, e.g. vicuna, vicuna_4hop
# $2 = training task. Use nc/lp/nd for a single task. For multiple tasks combined, connect the abbreviations with '-', e.g. nc-lp
# $3 = dataset: arxiv/products/pubmed/cora. For multiple datasets combined, connect the abbreviations with '-', and use '.n' to repeat a dataset n times, e.g. arxiv-products-pubmed-cora.3 means using arxiv+products+pubmed+cora and repeating cora 3 times
# $4 = batch size, default: 16
# $5 = embedding, e.g. simteg, sbert, roberta

# training on a single GPU
CUDA_VISIBLE_DEVICES=0 ./scripts/train.sh vicuna nc-lp arxiv-products-pubmed-cora.3 16 simteg

# training on multiple GPUs
./scripts/train_deepspeed.sh vicuna nc-lp arxiv-products-pubmed-cora.3 4 simteg
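The '.n' repetition syntax for $3 is easy to misread, so here is a tiny illustration of how a spec such as arxiv-products-pubmed-cora.3 expands. The helper below is my own, not the repository's actual argument parsing:

def expand_dataset_spec(spec: str):
    """Expand e.g. 'arxiv-products-pubmed-cora.3' into
    ['arxiv', 'products', 'pubmed', 'cora', 'cora', 'cora']."""
    datasets = []
    for part in spec.split("-"):
        if "." in part:
            name, repeat = part.rsplit(".", 1)
            datasets.extend([name] * int(repeat))
        else:
            datasets.append(part)
    return datasets

print(expand_dataset_spec("arxiv-products-pubmed-cora.3"))
# ['arxiv', 'products', 'pubmed', 'cora', 'cora', 'cora']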
Step 4: Evaluation
You can evaluate LLaGA with the command:
model_path="/path/to/projector"  # local path or huggingface repo
model_base="lmsys/vicuna-7b-v1.5-16k"  # meta-llama/Llama-2-7b-hf
mode="v1"  # use 'llaga_llama_2' for llama and "v1" for others
dataset="arxiv"  # test dataset
task="nc"  # test task
emb="simteg"
use_hop=2  # 2 for ND and 4 for HO
sample_size=10
template="ND"
output_path="/path/to/output"

python eval/eval_pretrain.py \
  --model_path ${model_path} \
  --model_base ${model_base} \
  --conv_mode ${mode} \
  --dataset ${dataset} \
  --pretrained_embedding_type ${emb} \
  --use_hop ${use_hop} \
  --sample_neighbor_size ${sample_size} \
  --answers_file ${output_path} \
  --task ${task} \
  --cache_dir ../../checkpoint \
  --template ${template}
To evaluate your predicted results, please run:
python eval/eval_res.py --dataset ${dataset} --task ${task} --res_path ${output_path}
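To sweep evaluation over several datasets with the same projector, a thin wrapper around the two commands above can be convenient. This is my own convenience script; it only uses the flags documented here, and the output file naming is hypothetical:

import subprocess

model_path = "/path/to/projector"        # same placeholder as above
model_base = "lmsys/vicuna-7b-v1.5-16k"

for dataset in ["arxiv", "products", "pubmed", "cora"]:
    output_path = f"./answers_{dataset}_nc.jsonl"  # hypothetical output naming
    # generate predictions for node classification (nc) with the ND template
    subprocess.run([
        "python", "eval/eval_pretrain.py",
        "--model_path", model_path,
        "--model_base", model_base,
        "--conv_mode", "v1",
        "--dataset", dataset,
        "--pretrained_embedding_type", "simteg",
        "--use_hop", "2",
        "--sample_neighbor_size", "10",
        "--answers_file", output_path,
        "--task", "nc",
        "--cache_dir", "../../checkpoint",
        "--template", "ND",
    ], check=True)
    # score the predictions
    subprocess.run([
        "python", "eval/eval_res.py",
        "--dataset", dataset, "--task", "nc", "--res_path", output_path,
    ], check=True)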