
TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables

This repository contains the source code for the TaBERT model, a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. TaBERT is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and can be used as a drop-in replacement for a semantic parser's original encoder to compute representations of utterances and table schemas (columns).

Installation

First, create the conda environment tabert with its supporting libraries:

bash scripts/setup_env.sh

Once the conda environment is created, install TaBERT using the following commands:

conda activate tabert

pip install --editable .

Integration with HuggingFace's pytorch-transformers library is still work in progress. While all the pre-trained models were developed with the older pytorch-pretrained-bert library, they are compatible with the latest version, transformers. The conda environment installs both versions of the library, and TaBERT uses pytorch-pretrained-bert by default. You can uninstall pytorch-pretrained-bert if you prefer to use TaBERT with the latest transformers.

Pre-trained Models

To be released.

Using a Pre-trained Model

To load a pre-trained model from a checkpoint file:

from table_bert import TableBertModel

model = TableBertModel.from_pretrained(
    'path/to/pretrained/model/checkpoint.bin',
)
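The checkpoint path above is a placeholder. Assuming the loaded model behaves like a standard PyTorch nn.Module (an assumption, not something stated by the TaBERT API), a minimal sketch for placing it on a GPU and switching to evaluation mode before computing representations:

import torch

# Assumption: TableBertModel behaves like a regular torch.nn.Module,
# so the usual device placement and eval-mode handling applies.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()  # disable dropout for deterministic encodings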

To produce representations of a natural language utterance and its associated table:

from table_bert import Table, Column

table = Table(
    id='List of countries by GDP (PPP)',
    header=[
        Column('Nation', 'text', sample_value='United States'),
        Column('Gross Domestic Product', 'real', sample_value='21,439,453')
    ],
    data=[
        ['United States', '21,439,453'],
        ['China', '27,308,857'],
        ['European Union', '22,774,165'],
    ]
).tokenize(model.tokenizer)

# To visualize table in an IPython notebook:

# display(table.to_data_frame(), detokenize=True)

context = 'show me countries ranked by GDP'

# model takes batched, tokenized inputs

context_encoding, column_encoding, info_dict = model.encode(
    contexts=[model.tokenizer.tokenize(context)],
    tables=[table]
)

For the returned tuple, context_encoding and column_encoding are PyTorch tensors representing the utterances and table columns, respectively. info_dict contains useful meta information (e.g., context/table masks, the original input tensors to BERT) for downstream applications.

context_encoding.shape
>>> torch.Size([1, 7, 768])

column_encoding.shape
>>> torch.Size([1, 2, 768])

Use Vanilla BERT

To initialize a TaBERT model from the parameters of BERT:

from table_bert import VanillaTableBert, TableBertConfig

model = VanillaTableBert(
    TableBertConfig(base_model_name='bert-base-uncased')
)
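Assuming VanillaTableBert exposes the same tokenizer and encode() interface demonstrated above (an assumption for illustration), a minimal sketch of reusing the earlier table and utterance with the BERT-initialized model:

# Sketch: encode the earlier table and utterance with the BERT-initialized model.
# Assumes `model.tokenizer` and `model.encode` work as shown above; only the
# weights differ (plain BERT, without table pre-training).
context_encoding, column_encoding, info_dict = model.encode(
    contexts=[model.tokenizer.tokenize('show me countries ranked by GDP')],
    tables=[table]
)

The table from the earlier example is reused here; if the base tokenizers of the two models differ, re-tokenize the Table with the new model's tokenizer first.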

Reference

If you plan to use TaBERT in your project, please consider citing our paper:

@inproceedings{yin20acl,
    title = {Ta{BERT}: Pretraining for Joint Understanding of Textual and Tabular Data},
    author = {Pengcheng Yin and Graham Neubig and Wen-tau Yih and Sebastian Riedel},
    booktitle = {Annual Conference of the Association for Computational Linguistics (ACL)},
    month = {July},
    year = {2020}
}

License

TaBERT is CC-BY-NC 4.0 licensed as of now.
