colab上用bert进行文本分类_colaboratory bert-CSDN博客

本文链接：https://blog.csdn.net/liguo_yao/article/details/103365343

Google colab上用bert进行文本分类

运用公共数据
用自己的数据

运用公共数据

Colaboratory 是一个免费的 Jupyter 笔记本环境，包括GPU和TPU。不需要进行任何设置就可以使用，并且完全在云端运行。

借助 Colaboratory，您可以编写和执行代码、保存和共享分析结果，以及利用强大的计算资源，所有这些都可通过浏览器免费使用。

要使用Google的chrome浏览器；
要有google账号；

第一步

#安装相关的套件
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

result

E: Package 'python-software-properties' has no installation candidate
Selecting previously unselected package google-drive-ocamlfuse.
(Reading database ... 145605 files and directories currently installed.)
Preparing to unpack .../google-drive-ocamlfuse_0.7.14-0ubuntu1~ubuntu18.04.1_amd64.deb ...
Unpacking google-drive-ocamlfuse (0.7.14-0ubuntu1~ubuntu18.04.1) ...
Setting up google-drive-ocamlfuse (0.7.14-0ubuntu1~ubuntu18.04.1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force
··········
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force
Please enter the verification code: Access token retrieved correctly.

第二步

#加载云端硬碟，每次加载都会更新
!mkdir -p drive
!google-drive-ocamlfuse drive  -o nonempty

#进入云端存有自己文件夹
import os
os.chdir('drive/bert-master')

#看一下文件夹里面的内容
!ls

result

CONTRIBUTING.md		    predicting_movie_reviews_with_bert_on_tf_hub.ipynb
create_pretraining_data.py  __pycache__
data_dir		    README.md
extract_features.py	    requirements.txt
__init__.py		    run_classifier.py
LICENSE			    run_classifier_with_tfhub.py
modeling.py		    run_pretraining.py
modeling_test.py	    run_squad.py
multilingual.md		    sample_text.txt
optimization.py		    tokenization.py
optimization_test.py	    tokenization_test.py

!python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=data_dir \
  --vocab_file=gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=output \

result

I1202 13:57:51.438973 140524128954240 run_classifier.py:923] ***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.85294116
I1202 13:57:51.439047 140524128954240 run_classifier.py:925]   eval_accuracy = 0.85294116
INFO:tensorflow:  eval_loss = 0.4763845
I1202 13:57:51.960015 140524128954240 run_classifier.py:925]   eval_loss = 0.4763845
INFO:tensorflow:  global_step = 343
I1202 13:57:51.960390 140524128954240 run_classifier.py:925]   global_step = 343
INFO:tensorflow:  loss = 0.4763845
I1202 13:57:51.960465 140524128954240 run_classifier.py:925]   loss = 0.4763845

用自己的数据

在文本分类问题中，需要先将数据集整理成可用的形式。然后对代码进行相应的修改。
主要修改以下文件：run_classifier.py
不同的格式对应了不同的DataProcessor类。可以将数据保存成如下格式：

class SafeProcessor(DataProcessor):
  """Processor for the move data set ."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.csv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.csv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.csv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1", "2", "3"]
    ##分类数，根据自己情况调整

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      if set_type == "test":
        text_a = tokenization.convert_to_unicode(line[0])
        label = "0"
      else:
        text_a = tokenization.convert_to_unicode(line[1])
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "safe": SafeProcessor
  }