TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables
This repository contains source code for the TaBERT model, a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. TaBERT is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and can be used as a drop-in replacement for a semantic parser's original encoder to compute representations for utterances and table schemas (columns).
Installation
First, install the conda environment tabert with supporting libraries.
bash scripts/setup_env.sh
Once the conda environment is created, install TaBERT using the following command:
conda activate tabert
pip install --editable .
Integration with HuggingFace's pytorch-transformers library is still a work in progress. Although all the pre-trained models were developed with the older library, pytorch-pretrained-bert, they are compatible with the latest version, transformers. The conda environment installs both libraries, and TaBERT uses pytorch-pretrained-bert by default. You can uninstall pytorch-pretrained-bert if you prefer to use TaBERT with the latest version of transformers.
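To see which of the two libraries is importable in the active environment, a quick check along the following lines may help. This is a minimal sketch that relies only on the Python standard library. The helper name available_bert_backends is hypothetical and not part of TaBERT, and the check only reports what is installed; it does not change which library TaBERT picks.

```python
import importlib.util

# Import names of the two libraries mentioned above.
CANDIDATES = ("pytorch_pretrained_bert", "transformers")


def available_bert_backends(candidates=CANDIDATES):
    """Return {module_name: bool} telling whether each module can be found.

    Hypothetical helper for illustration; it is not part of TaBERT.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in candidates
    }


if __name__ == "__main__":
    for name, found in available_bert_backends().items():
        print(f"{name}: {'installed' if found else 'not found'}")
```

Run it inside the activated tabert environment, before and after uninstalling pytorch-pretrained-bert, to confirm the environment is in the state you expect.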
Pre-trained Models
To be released.
Using a Pre-trained Model
To load a pre-trained model from a checkpoint file:
from table_bert import TableBertModel
model = TableBertModel.from_pretrained(
    'path/to/pretrained/model/checkpoint.bin',
)
To produce representations of natural language text and its associated table:
from table_bert import Table, Column
table = Table(
    id='List of countries by GDP (PPP)',
    header=[
        Column('Nation', 'text', sample_value='United States'),
        Column('Gross Domestic Product', 'real', sample_value='21,439,453')
    ],
    data=[
        ['United States', '21,439,453'],
        ['China', '27,308,857'],
        ['European Union', '22,774,165'],
    ]
).tokenize(model.tokenizer)
# To visualize table in an IPython notebook:
# display(table.to_data_frame(detokenize=True))
context = 'show me countries ranked by GDP'
# model takes batched, tokenized inputs
context_encoding, column_encoding, info_dict = model.encode(
    contexts=[model.tokenizer.tokenize(context)],
    tables=[table]
)
For the returned tuple, context_encoding and column_encoding are PyTorch tensors representing utterances and table columns, respectively. info_dict contains useful meta information (e.g., context/table masks and the original input tensors to BERT) for downstream applications.
context_encoding.shape
>>> torch.Size([1, 7, 768])
column_encoding.shape
>>> torch.Size([1, 2, 768])
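As one illustration of downstream use, the sketch below mean-pools the context token vectors into a single utterance vector and scores each column against it with a dot product. This is only an assumed usage pattern, not something TaBERT prescribes. The helper names mean_pool and score_columns are hypothetical, plain Python lists stand in for one batch element of the tensors above (so the snippet runs without PyTorch), and toy 3-dimensional vectors replace the 768-dimensional encodings.

```python
def mean_pool(token_vectors, mask=None):
    """Average token vectors, skipping positions whose mask value is 0."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask removes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def score_columns(context_vectors, column_vectors, context_mask=None):
    """Dot product between the pooled utterance vector and each column vector."""
    pooled = mean_pool(context_vectors, context_mask)
    return [sum(p * c for p, c in zip(pooled, col)) for col in column_vectors]


# Toy stand-ins for context_encoding[0] and column_encoding[0] (assumed data).
context_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
context_mask = [1, 1, 0]  # last position is treated as padding
column_vectors = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = score_columns(context_vectors, column_vectors, context_mask)
best_column = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_column)
```

With real model outputs, the same computation would normally be written with tensor operations on context_encoding and column_encoding. Inspect the keys of info_dict to find the relevant mask rather than assuming a name.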
Use Vanilla BERT
To initialize a TaBERT model from the parameters of BERT:
from table_bert import VanillaTableBert, TableBertConfig
model = VanillaTableBert(
    TableBertConfig(base_model_name='bert-base-uncased')
)
Reference
If you plan to use TaBERT in your project, please consider citing our paper:
@inproceedings{yin20acl,
title = {Ta{BERT}: Pretraining for Joint Understanding of Textual and Tabular Data},
author = {Pengcheng Yin and Graham Neubig and Wen-tau Yih and Sebastian Riedel},
booktitle = {Annual Conference of the Association for Computational Linguistics (ACL)},
month = {July},
year = {2020}
}
License
TaBERT is CC-BY-NC 4.0 licensed as of now.