13_Load Preproc4_[..., np.newaxis]_ExitStack_walk_file operations_timeit_regex_feature vector embed_Text Tokenization


13_Loading and Preprocessing Data from multiple CSV with TensorFlow_custom training loop_TFRecord
https://blog.csdn.net/Linli522362242/article/details/107704824

13_Loading and Preprocessing Data from multiple CSV with TensorFlow 2_Feature Columns_TF eXtended
https://blog.csdn.net/Linli522362242/article/details/107933572

13_Loading & Preprocessing Data with TF 3_TF Datasets_images[index, ...,0]_plt images_profiling data
https://blog.csdn.net/Linli522362242/article/details/108039793

7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?

Let’s look at the pros and cons of each preprocessing option:

  • If you preprocess the data when creating the data files, the training script will run faster, since it will not have to perform preprocessing on the fly. In some cases, the preprocessed data will also be much smaller than the original data, so you can save some space and speed up downloads. It may also be helpful to materialize the preprocessed data, for example to inspect it or archive it. However, this approach has a few cons. First, it’s not easy to experiment with various preprocessing logics if you need to generate a preprocessed dataset for each variant. Second, if you want to perform data augmentation, you have to materialize many variants of your dataset, which will use a large amount of disk space and take a lot of time to generate. Lastly, the trained model will expect preprocessed data, so you will have to add preprocessing code in your application before it calls the model.
     
  • If the data is preprocessed with the tf.data pipeline, it’s much easier to tweak the preprocessing logic and apply data augmentation. Also, tf.data makes it easy to build highly efficient preprocessing pipelines (e.g., with multithreading and prefetching). However, preprocessing the data this way will slow down training. Moreover, each training instance will be preprocessed once per epoch rather than just once if the data was preprocessed when creating the data files. Lastly, the trained model will still expect preprocessed data.
     
  • If you add preprocessing layers to your model, you will only have to write the preprocessing code once for both training and inference. If your model needs to be deployed to many different platforms, you will not need to write the preprocessing code multiple times. Plus, you will not run the risk of using the wrong preprocessing logic for your model, since it will be part of the model. On the downside, preprocessing the data will slow down training, and each training instance will be preprocessed once per epoch. Moreover, by default the preprocessing operations will run on the GPU for the current batch (you will not benefit from parallel preprocessing on the CPU, and prefetching). Fortunately, the upcoming Keras preprocessing layers should be able to lift the preprocessing operations from the preprocessing layers and run them as part of the tf.data pipeline, so you will benefit from multithreaded execution on the CPU and prefetching.
     
  • Lastly, using TF Transform for preprocessing gives you many of the benefits from the previous options: the preprocessed data is materialized, each instance is preprocessed just once (speeding up training), and preprocessing layers get generated automatically so you only need to write the preprocessing code once. The main drawback is the fact that you need to learn how to use this tool.
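
To make the trade-offs concrete, here is a minimal sketch (the dataset, scaling step, and layer names are only illustrative, not code from this chapter) contrasting preprocessing done in the tf.data pipeline with the same step baked into the model as a layer:

import tensorflow as tf
from tensorflow import keras

# Option A: preprocess inside the tf.data pipeline (multithreaded on the CPU, easy to tweak per experiment)
raw_ds = tf.data.Dataset.from_tensor_slices(tf.zeros([100, 28, 28]))  # stand-in for raw image data
pipeline_ds = (raw_ds
               .map(lambda x: x / 255., num_parallel_calls=tf.data.experimental.AUTOTUNE)
               .batch(32)
               .prefetch(1))

# Option B: the same scaling baked into the model, so training and serving share one code path
model = keras.models.Sequential([
    keras.layers.Lambda(lambda x: x / 255., input_shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])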

8. Name a few common techniques you can use to encode categorical features. What about text?

Let’s look at how to encode categorical features and text:

  • To encode a categorical feature that has a natural order, such as a movie rating (e.g., “bad,” “average,” “good”), the simplest option is to use ordinal encoding: sort the categories in their natural order and map each category to its rank (e.g., “bad” maps to 0, “average” maps to 1, and “good” maps to 2). However, most categorical features don’t have such a natural order. For example, there’s no natural order for professions or countries. In this case, you can use one-hot encoding or, if there are many categories, embeddings.
     
  • For text, one option is to use a bag-of-words representation: a sentence is represented by a vector counting the counts of each possible word. Since common words are usually not very important, you’ll want to use TF-IDF (Term Frequency × Inverse Document Frequency) to reduce their weight. Instead of counting words, it is also common to count n-grams, which are sequences of n consecutive words—nice and simple. Alternatively, you can encode each word using word embeddings, possibly pretrained. Rather than encoding words, it is also possible to encode each letter, or subword tokens (e.g., splitting “smartest” into “smart” and “est”). These last two options are discussed in Chapter 16.
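
As a quick illustration (the category names and mapping below are made up for this sketch), ordinal encoding, one-hot encoding, and embeddings can all be expressed with plain TensorFlow ops and layers:

import tensorflow as tf

# Ordinal encoding: map each ordered category to its rank
rating_to_rank = {"bad": 0, "average": 1, "good": 2}
ratings = ["good", "bad", "average"]
ordinal = tf.constant([rating_to_rank[r] for r in ratings])   # [2, 0, 1]

# One-hot encoding, suitable for unordered features with few categories
one_hot = tf.one_hot(ordinal, depth=len(rating_to_rank))      # shape (3, 3)

# For high-cardinality features, an Embedding layer learns a dense vector per category
embedding = tf.keras.layers.Embedding(input_dim=len(rating_to_rank), output_dim=2)
dense_vectors = embedding(ordinal)                            # shape (3, 2), trainable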

9. Load the Fashion MNIST dataset (introduced in 10_Introduction to Artificial Neural Networks w Keras_3_FashionMNIST_pydot_sparse_shift(0.)_plt_imgs https://blog.csdn.net/Linli522362242/article/details/106562190); split it into a training set, a validation set, and a test set;

from tensorflow import keras
import numpy as np
import tensorflow as tf

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
# len(y_train_full), len(y_test) ==> (60000, 10000)
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

shuffle the training set; and save each dataset to multiple TFRecord files (numpy.ndarray ==> tf.data.Dataset.from_tensor_slices() ==> serialized Example ==> example.SerializeToString()). Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image; stored in a BytesList), and the label (stored in an Int64List).

     Note: for large images, you could use tf.io.encode_jpeg() instead. This would save a lot of space, but it would lose a bit of image quality.
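
     For reference, here is a minimal sketch of that JPEG alternative (the variable names are only illustrative): tf.io.encode_jpeg() expects a uint8 tensor of shape [height, width, channels], which is why the grayscale image gets an extra channel axis, and tf.io.decode_jpeg() reverses it when the records are read back.

sample_image = tf.zeros([28, 28], dtype=tf.uint8)             # stand-in for one Fashion MNIST image
jpeg_bytes = tf.io.encode_jpeg(sample_image[..., tf.newaxis]) # lossy, but far smaller than serialize_tensor
decoded = tf.io.decode_jpeg(jpeg_bytes)                       # back to a (28, 28, 1) uint8 tensor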

# type(X_train) ==> numpy.ndarray

# from_tensors: combines the input and returns a dataset with a single element 
# t = tf.constant([[1, 2], [3, 4]])
# ds = tf.data.Dataset.from_tensors(t)   ==> [[1, 2], [3, 4]]

# from_tensor_slices: creates a dataset with a separate element for each row of the input tensor
# t = tf.constant([[1, 2], [3, 4]])
# ds = tf.data.Dataset.from_tensor_slices(t)   # [1, 2], [3, 4]
train_set = tf.data.Dataset.from_tensor_slices( (X_train, y_train) ).shuffle( buffer_size=len(X_train) )
valid_set = tf.data.Dataset.from_tensor_slices( (X_valid, y_valid) )
test_set = tf.data.Dataset.from_tensor_slices( (X_test, y_test) )
X_valid[0].shape   # (28, 28)

X_valid[0]         # a 28x28 array of uint8 pixel values (output omitted)

valid_set.take(1)  # a TakeDataset yielding (image, label) pairs

BytesList = tf.train.BytesList
Int64List = tf.train.Int64List

Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example


def create_example(image, label):
    image_data = tf.io.serialize_tensor(image)
    # encode an image using the JPEG format and put this binary data in a BytesList.
    # image_data = tf.io.encode_jpeg( image[..., np.newaxis] ) # OR image[:,:, np.newaxis].shape ==> (28, 28, 1)
    print('image_data:',image_data)    # can be removed
    print('------------------------')  # can be removed
    return Example(
        features = Features(
            feature = {
                'image': Feature( bytes_list=BytesList( value=[image_data.numpy()] ) ),
                'label': Feature( int64_list=Int64List( value=[label.numpy()] ) )
            }
        )
    )

for image, label in valid_set.take(1):
    print( create_example(image, label) )

After image_data = tf.io.serialize_tensor(image), image_data is a scalar tf.string tensor holding the serialized image bytes; after bytes_list=BytesList( value=[image_data.numpy()] ), those bytes appear in the Example's 'image' feature, next to the Int64List 'label' feature in the printed protobuf.

     The following function saves a given dataset to a set of TFRecord files. The examples are written to the files in a round-robin fashion. To do this, we enumerate all the examples using the dataset.enumerate() method, and we compute index % n_shards to decide which file to write to. We use the standard contextlib.ExitStack class to make sure that all writers are properly closed whether or not an I/O error occurs while writing.

from contextlib import ExitStack

def write_tfrecords(name, dataset, n_shards=10):
    paths=[ "{}.tfrecord-{:05d}-of-{:05d}".format(name, index, n_shards) 
           for index in range(n_shards) ] 
    
    with ExitStack() as stack:
        writers = [stack.enter_context(tf.io.TFRecordWriter(path))
                   for path in paths]
        
        for index, (image, label) in dataset.enumerate():
            shard = index % n_shards
            example = create_example(image, label)
            writers[shard].write(example.SerializeToString())
    return paths

train_filepaths = write_tfrecords("my_fashion_mnist.train", train_set)
valid_filepaths = write_tfrecords("my_fashion_mnist.valid", valid_set)
test_filepaths = write_tfrecords("my_fashion_mnist.test", test_set)

    

train_filepaths # ['my_fashion_mnist.train.tfrecord-00000-of-00010', ..., 'my_fashion_mnist.train.tfrecord-00009-of-00010']

Then use tf.data to create an efficient dataset for each set.

def preprocess(tfrecord):
    
    feature_descriptions = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""), #OR tf.io.VarLenFeature(tf.string)
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1)
    }
    example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    
    #out_type=tf.uint8 since RGB(0~255,0~255,0~255) and 2^8=256 is enough
    image = tf.io.parse_tensor(example["image"], out_type=tf.uint8) # since image_data = tf.io.serialize_tensor(image)
    # image = tf.io.decode_jpeg(example["image"]) # since image_data = tf.io.encode_jpeg( image[..., np.newaxis] )
    
    image = tf.reshape(image, shape=[28,28]) # <== image[:,:, np.newaxis].shape ==> (28, 28, 1)
    return image, example["label"]

#read data from tfrecords files    
def mnist_dataset(filepaths, 
                  n_read_threads=5, # num_parallel_reads
                  shuffle_buffer_size=None,
                  n_parse_threads=5, 
                  batch_size=5, 
                  cache=True):
    
    dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=n_read_threads)
    
    if cache:
        dataset = dataset.cache()
        
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads) #for each record
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

train_set = mnist_dataset(train_filepaths, shuffle_buffer_size=60000)
valid_set = mnist_dataset(valid_filepaths)
test_set = mnist_dataset(test_filepaths)
train_set.take(1) # a dataset of (images, labels) batches: images of shape (batch_size, 28, 28), dtype uint8

################################################################ VS tensorflow_datasets

import tensorflow_datasets as tfds

fashion_mnist_bldr=tfds.builder('fashion_mnist')

# Download the data, prepare it, and write it to disk
fashion_mnist_bldr.download_and_prepare()

# Load data from disk as tf.data.Datasets
fashion_mnist_datasets = fashion_mnist_bldr.as_dataset(shuffle_files=False)

fashion_mnist_datasets.keys() # dict_keys(['test', 'train'])



C:\Users\LlQ\tensorflow_datasets\fashion_mnist\3.0.1\dataset_info.json

{
  "citation": "@article{DBLP:journals/corr/abs-1708-07747,\n  author    = {Han Xiao and\n               Kashif Rasul and\n               Roland Vollgraf},\n  title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning\n               Algorithms},\n  journal   = {CoRR},\n  volume    = {abs/1708.07747},\n  year      = {2017},\n  url       = {http://arxiv.org/abs/1708.07747},\n  archivePrefix = {arXiv},\n  eprint    = {1708.07747},\n  timestamp = {Mon, 13 Aug 2018 16:47:27 +0200},\n  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-07747},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}",
  "description": "Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.",
  "downloadSize": "30878645",
  "location": {
    "urls": [
      "https://github.com/zalandoresearch/fashion-mnist"
    ]
  },
  "name": "fashion_mnist",
  "splits": [
    {
      "name": "test",
      "numBytes": "5470824",
      "shardLengths": [
        "10000"
      ]
    },
    {
      "name": "train",
      "numBytes": "32717636",
      "shardLengths": [
        "60000"
      ]
    }
  ],
  "supervisedKeys": {
    "input": "image",
    "output": "label"
  },
  "version": "3.0.1"
}

C:\Users\LlQ\tensorflow_datasets\fashion_mnist\3.0.1\image.image.json

{"encoding_format": "png", "shape": [28, 28, 1]}
import tensorflow as tf
fashion_mnist__train = fashion_mnist_datasets['train']
assert isinstance(fashion_mnist__train, tf.data.Dataset)

fashion_mnist__train.take(1) # each element is a dict: {'image': a (28, 28, 1) uint8 tensor, 'label': an int64 scalar}

fashion_mnist__train_set = mnist_dataset(r'C:\Users\LlQ\tensorflow_datasets\fashion_mnist\3.0.1\fashion_mnist-train.tfrecord-00000-of-00001')
fashion_mnist__train_set.take(1)
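
Note that the tfds shards use a different schema from the files we wrote ourselves: according to image.image.json above, the images are stored PNG-encoded rather than via tf.io.serialize_tensor(), so our preprocess() (which calls tf.io.parse_tensor()) would not decode these records. A simpler route, sketched here using tfds's standard as_supervised option, is to ask the builder for (image, label) tuples directly:

tfds_train = fashion_mnist_bldr.as_dataset(split="train", as_supervised=True) # yields (image, label) pairs
tfds_train = (tfds_train
              .map(lambda image, label: (tf.reshape(image, [28, 28]), label))
              .batch(32)
              .prefetch(1))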


################################################################

import matplotlib.pyplot as plt

for X,y in train_set.take(1):
    for i in range(5):
        plt.subplot(1,5, i+1)
        plt.imshow(X[i].numpy(), cmap="binary")
        plt.axis("off")
        plt.title(str(y[i].numpy()))

     Finally, use a Keras model to train these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data.

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
        
    def call(self, inputs):
        return (inputs-self.means_) / (self.stds_ + keras.backend.epsilon())
    
standardization = Standardization(input_shape=[28,28])
sample_image_batches = train_set.map(lambda image, label: image)
sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()), 
                               axis=0).astype(np.float32)
# len(sample_images) == 55000: all the training images, concatenated across batches of 5
standardization.adapt(sample_images)

model = keras.models.Sequential([
    standardization,
    keras.layers.Flatten(), #1D
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer='nadam', 
              metrics=["accuracy"])
from datetime import datetime
import os
import math

logs = os.path.join(os.curdir, "my_logs", "run_"+datetime.now().strftime("%Y%m%d_%H%M%S"))

tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                histogram_freq=1, 
                                                profile_batch=10)

batch_size = 5 # must match the batch size used in mnist_dataset() above (its default here)
model.fit(train_set, steps_per_epoch=np.ceil(len(X_train)/batch_size),
          epochs=5, 
          validation_data=valid_set, callbacks=[tensorboard_cb])

Warning: The profiling tab in TensorBoard works if you use TensorFlow 2.2+. You also need to make sure tensorboard_plugin_profile is installed (and restart Jupyter if necessary).
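
If the plugin is missing, one way to install it from a notebook cell (a sketch; restart the kernel afterwards so TensorBoard picks it up):

!pip install -U tensorboard_plugin_profile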

%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006


Reason the profiling data is limited here: no GPU is available, as the following check confirms.

device_name = tf.test.gpu_device_name()
if not device_name:
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

################################################################################

################################################################################
10. In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

  1. a. Download the Large Movie Review Dataset http://ai.stanford.edu/~amaas/data/sentiment/, which contains 50,000 movie reviews from the Internet Movie Database. The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise.
    from pathlib import Path
    from tensorflow import keras
    
    DOWNLOAD_ROOT = "http://ai.stanford.edu/~amaas/data/sentiment/"
    FILENAME = "aclImdb_v1.tar.gz"
    filepath = keras.utils.get_file(FILENAME, 
                                    DOWNLOAD_ROOT + FILENAME, extract=True)
    path = Path(filepath).parent / "aclImdb"
    path


     ==> WindowsPath('C:/Users/LlQ/.keras/datasets/aclImdb')

    https://www.geeksforgeeks.org/os-walk-python/

  2. Unzipped file directory tree
         os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames):
         root : the directory currently being walked (starting from the top you specified).
         dirs : the sub-directories of root.
         files : the files directly inside root.

    # Driver function
    import os
    if __name__ == "__main__":
        for (root, dirs, files) in os.walk('Test', topdown=True):
            print(root)
            print(dirs)
            print(files)
            print('--------------------------------')
    

       

    path.parts # ('C:\\', 'Users', 'LlQ', '.keras', 'datasets', 'aclImdb')

    import os
    
    #name: current directory OR folder
    for name, subdirs, files in os.walk(path):
        indent = len(Path(name).parts) - len(path.parts)
        print("    " *indent + Path(name).parts[-1] + os.sep)
        for index, filename in enumerate( sorted(files) ):
            if index == 3:
                print("    " *(indent+1) + "...")
                break # exit the for loop
            print("    " *(indent+1) + filename)

  3. Files Access path

    def review_paths(dirpath):
        return [str(path) for path in dirpath.glob("*.txt")]
    
    train_pos = review_paths(path / "train" / "pos") # WindowsPath('C:/Users/LlQ/.keras/datasets/aclImdb/train/pos')
    train_neg = review_paths(path / "train" / "neg")
    
    test_valid_pos = review_paths(path / "test" / "pos")
    test_valid_neg = review_paths(path / "test" / "neg")
    
    len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg) # (12500, 12500, 12500, 12500)

  • b. Split the test set into a validation set (15,000) and a test set (10,000).
    import numpy as np
    
    np.random.shuffle(test_valid_pos)
    
    test_pos = test_valid_pos[:5000]
    test_neg = test_valid_neg[:5000]
    
    valid_pos = test_valid_pos[5000:]
    valid_neg = test_valid_neg[5000:]
  • c. Use tf.data to create an efficient dataset for each set.
    Since the dataset fits in memory, we can just load all the data using pure Python code and use tf.data.Dataset.from_tensor_slices() (reading all the files up front):
    import tensorflow as tf
    
    def imdb_dataset(filepaths_positive, filepaths_negative):
        reviews = []
        labels = []
        
        for filepaths, label in ( (filepaths_negative, 0), 
                                  (filepaths_positive, 1) ):
            for filepath in filepaths:
                with open(filepath, encoding="utf8") as review_file:
                    reviews.append( review_file.read() )
                labels.append(label)
                
        return tf.data.Dataset.from_tensor_slices(
                                                    ( tf.constant(reviews), 
                                                      tf.constant(labels) 
                                                    )
                )
    
    for X,y in imdb_dataset(train_pos, train_neg).take(1):
        print(X)
        print(y)
        print()


    %timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass

    It takes about 1 minute 26 seconds to load the dataset and go through it 10 times.

  • TFRecords

    But let's pretend the dataset does not fit in memory, just to make things more interesting. Luckily, each review fits on just one line (they use <br /> to indicate line breaks), so we can read the reviews using a TextLineDataset. If they didn't, we would have to preprocess the input files first (e.g., converting them to TFRecords). For very large datasets, it would make sense to use a tool like Apache Beam for that. With this approach we read the data batch by batch, from several files in parallel.

    def imdb_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):
        dataset_neg = tf.data.TextLineDataset( filepaths_negative, num_parallel_reads=n_read_threads )
        dataset_neg = dataset_neg.map( lambda review: (review, 0) )
        
        dataset_pos = tf.data.TextLineDataset( filepaths_positive, num_parallel_reads=n_read_threads )
        dataset_pos = dataset_pos.map( lambda review: (review, 1) )
        
        return tf.data.Dataset.concatenate( dataset_pos, dataset_neg )
    
    %timeit -r1 for X,y in imdb_dataset(train_pos, train_neg).repeat(10): pass 

    Now it takes about 2 minutes 54 seconds to go through the dataset 10 times. That's much slower, essentially because the dataset is not cached in RAM, so it must be reloaded at each epoch. If you add .cache() just before .repeat(10), you will see that this implementation will be about as fast as the previous one.

    %timeit -r1 for X,y in imdb_dataset(train_pos, train_neg).cache().repeat(10): pass 

    batch_size = 32
    
    train_set = imdb_dataset(train_pos, train_neg).shuffle(25000).batch(batch_size).prefetch(1)
    valid_set = imdb_dataset(valid_pos, valid_neg).batch(batch_size).prefetch(1)
    test_set = imdb_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)
    for X,y in train_set.take(1):
        print(X)# print(len(X)) # 32
        print(y)# print(len(y)) # 32
        print()

  • d. Create a binary classification model, using a TextVectorization layer to preprocess each review. If the TextVectorization layer is not yet available (or if you like a challenge), try to create your own custom preprocessing layer: you can use the functions in the tf.strings package, for example lower() to make everything lowercase, regex_replace() to replace punctuation with spaces, and split() to split words on spaces. You should use a lookup table to output word indices, which must be prepared in the adapt() method.

    Let's first write a function to preprocess the reviews, cropping them to 300 characters, converting them to lowercase, then replacing <br /> and all non-letter characters with spaces, splitting the reviews into words, and finally padding or cropping each review so it ends up with exactly n_words tokens:
    #################################################

    https://blog.csdn.net/Linli522362242/article/details/87780336
    https://blog.csdn.net/Linli522362242/article/details/90139705
    https://blog.csdn.net/Linli522362242/article/details/103866244
    <br /> tags are matched by the regex b"<br\\s*/?>"
    all non-letter characters are matched by b"[^a-z]"
    X_example = tf.constant(["It's a great, great movie! I loved it.", "It was terrible, run away!!!"])
    tf.shape(X_example)                                           # [2]  (the batch size)
    tf.shape(X_example)*tf.constant([1,0])                        # [2, 0]
    tf.shape(X_example)*tf.constant([1,0])+ tf.constant([0, 50])  # [2, 50] == (batch_size, n_words)

    #################################################
  • tokenizer

    def preprocess(X_batch, n_words=50):
        shape = tf.shape(X_batch) * tf.constant([1,0]) + tf.constant([0, n_words])# ==> shape(batches, n_words)
        Z = tf.strings.substr(X_batch, 0, 300) # cropping them to 300 characters
        
        # in the regex b"<br\\s*/?>", \s* matches any whitespace after "br" and /? makes the closing slash optional
        Z = tf.strings.lower(Z)                              # "\" : An escape character
        Z = tf.strings.regex_replace(Z, b"<br\\s*/?>", b" ") # "\s": white space, "\S": non-white space OR ([^\s]) 
        Z = tf.strings.regex_replace(Z, b"[^a-z]", b" ") # replace all non-letter characters to spaces
        Z = tf.strings.split(Z) #splitting the reviews into words
        
        return Z.to_tensor(shape=shape, default_value=b"<pad>")
    
    X_example = tf.constant(["It's a great, great movie! I loved it.", "It was terrible, run away!!!"])
    preprocess(X_example)

    Now let's write a second utility function that will take a data sample in the same raw-string format as the input of the preprocess() function, and will output the list of the top max_size most frequent words, ensuring that the padding token is first:
     
  • get_vocabulary

    from collections import Counter
    
    def get_vocabulary( data_sample, max_size=1000):
        preprocess_reviews = preprocess(data_sample).numpy()
        counter = Counter()
        
        for words in preprocess_reviews:
            for word in words:
                if word != b"<pad>" :
                    counter[word] += 1
        return [b"<pad>"] + [word for word, count in counter.most_common(max_size)]
    
    get_vocabulary(X_example)


    Now we are ready to create the TextVectorization layer. Its constructor just saves the hyperparameters (max_vocabulary_size and n_oov_buckets). The adapt() method computes the vocabulary using the get_vocabulary() function, then it builds a StaticVocabularyTable (see Chapter 16 for more details). The call() method preprocesses the reviews to get a padded list of words for each review, then it uses the StaticVocabularyTable to lookup the index of each word in the vocabulary:

    self.n_oov_buckets specifies the number of out-of-vocabulary (oov) buckets: if we look up a category that does not exist in the vocabulary, the lookup table computes a hash of this category and uses it to assign the unknown category to one of the oov buckets.

  • TextVectorization

    class TextVectorization(keras.layers.Layer):
        def __init__(self, max_vocabulary_size=1000, n_oov_buckets=100, dtype=tf.string, **kwargs):
            super().__init__(dtype=dtype, **kwargs)
            self.max_vocabulary_size = max_vocabulary_size
            self.n_oov_buckets = n_oov_buckets
        
        def adapt(self, data_sample):
            self.vocab = get_vocabulary(data_sample, self.max_vocabulary_size)
            words = tf.constant(self.vocab)
            word_ids = tf.range( len(self.vocab), dtype=tf.int64 )
            
            # https://blog.csdn.net/Linli522362242/article/details/107933572
            vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
            self.table = tf.lookup.StaticVocabularyTable(vocab_init, self.n_oov_buckets) # builds a StaticVocabularyTable
            
        def call(self, inputs):
            preprocessed_inputs = preprocess(inputs) #reviews ==> words array
            return self.table.lookup(preprocessed_inputs)

    Let's try it on our small X_example we defined earlier:

    text_vectorization = TextVectorization()
    
    text_vectorization.adapt(X_example)
    text_vectorization(X_example)


         Looks good! As you can see, each review was cleaned up and tokenized, then each word was encoded as its index in the vocabulary (all the 0s correspond to the <pad> tokens)

         Now let's create another TextVectorization layer and let's adapt it to the full IMDB training set (if the training set did not fit in RAM, we could just use a smaller sample of the training set by calling train_set.take(500)):

    max_vocabulary_size = 1000
    n_oov_buckets = 100
    
    sample_review_batches = train_set.map(lambda review, label: review)               #tensor: [batch, batch, ...]
    sample_reviews = np.concatenate( list( sample_review_batches.as_numpy_iterator() ),
                                   axis=0)                                            # to one list
    
    text_vectorization = TextVectorization(max_vocabulary_size, n_oov_buckets, input_shape=[])
    text_vectorization.adapt(sample_reviews)
    
    text_vectorization(X_example)


    Good! Now let's take a look at the first 10 words in the vocabulary:

    text_vectorization.vocab[:10]

    These are the most common words in the reviews.

  • bags of words

         Now to build our model we will need to encode all these word IDs somehow. One approach is to create bags of words: for each review, and for each word in the vocabulary, we count the number of occurrences of that word in the review. For example:
    simple_example = tf.constant([ [1,3,1,0,0], 
                                   [2,2,0,0,0] ])
    tf.one_hot(simple_example, 4)

    simple_example = tf.constant([ [1,3,1,0,0], 
                                   [2,2,0,0,0] ])
    tf.reduce_sum( tf.one_hot(simple_example, 4), axis=1)

    <tf.Tensor: shape=(2, 4), dtype=float32, numpy=
    array([[2., 2., 0., 1.],                 # since two-0s, two-1s, no 2, one-3s
          [3., 0., 2., 0.]], dtype=float32)> # since three-0s, no 1, two-2s, no 3
    The first review has 2 times the word 0, 2 times the word 1, 0 times the word 2, and 1 time the word 3, so its bag-of-words representation is [2, 2, 0, 1]. Similarly, the second review has 3 times the word 0, 0 times the word 1, and so on.

    Let's wrap this logic in a small custom layer, and let's test it. We'll drop the counts for the word 0, since this corresponds to the <pad> token, which we don't care about.
     

    class BagOfWords(keras.layers.Layer):
        def __init__(self, n_tokens, dtype=tf.int32, **kwargs):
            super().__init__(dtype=dtype, **kwargs) # pass the dtype argument on to the base Layer
            self.n_tokens = n_tokens
            
        def call(self, inputs):
            one_hot = tf.one_hot(inputs, self.n_tokens)
            return tf.reduce_sum(one_hot, axis=1)[:, 1:] # drop column 0, the <pad> token count
        
    # Let's test it:
    bag_of_words = BagOfWords(n_tokens=4)
    bag_of_words( simple_example )

  • It works fine! Now let's create another BagOfWords layer with the right vocabulary size for our training set:
    n_tokens = max_vocabulary_size + n_oov_buckets + 1 # add 1 for <pad>
    bag_of_words = BagOfWords(n_tokens)
    We're ready to train the model!
    keras.backend.clear_session()
    tf.random.set_seed(42)
    np.random.seed(42)
    
    model = keras.models.Sequential([
        text_vectorization, 
        bag_of_words,
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    
    model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
    model.fit(train_set, epochs=5, validation_data=valid_set)


    We get about 73.5% accuracy on the validation set after just the first epoch, but after that the model makes no progress. We will do better in Chapter 16. For now the point is just to perform efficient preprocessing using tf.data and Keras preprocessing layers.
  • Embedding

  • e. Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.

  • To be precise, the sentence embedding (here, one sentence per review) is equal to the mean word embedding multiplied by the square root of the number of words in the sentence. This compensates for the fact that the mean of n vectors gets shorter as n grows.


    To compute the mean embedding for each review, and multiply it by the square root of the number of words in that review, we will need a little function:

    # another_example[review1[ 1_word_dimensions[], 1_word_dimensions[], 1_word_dimensions[] ], 
    #                 review2[ 1_word_dimensions[], 1_word_dimensions[], 1_word_dimensions[] ]
    #                ]                  #d1  d2  d3
    # another_example = tf.constant([ [ [1., 2., 3.], # word
    #                                   [4., 5., 0.], # word
    #                                   [0., 0., 0.]  # word
    #                                 ],# one sentence(1 line) represents one review
    #                                 [ [6., 0., 0.], 
    #                                   [0., 0., 0.], 
    #                                   [0., 0., 0.] 
    #                                 ]
    #                               ])
    # # not_pad = tf.math.count_nonzero(inputs, axis=-1)########################
    # tf.math.count_nonzero(another_example, axis=-1)
    # <tf.Tensor: shape=(2, 3), dtype=int64, numpy=
    # array([[3, 2, 0],
    #        [1, 0, 0]], dtype=int64)>
    
    # # n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)########
    # tf.math.count_nonzero(tf.math.count_nonzero(another_example, axis=-1), 
    #                       axis=-1, 
    #                       keepdims=True)
    # <tf.Tensor: shape=(2, 1), dtype=int64, numpy=
    # array([[2],
    #        [1]], dtype=int64)>
    #The first review contains 2 words (the last token is a zero vector,
    #which represents the <pad> token). The second review contains 1 word. 
    
    
    # # sqrt_n_words = tf.math.sqrt( tf.cast(n_words, tf.float32) )##############
    # tf.math.sqrt( tf.cast(tf.math.count_nonzero(tf.math.count_nonzero(another_example, axis=-1), 
    #                       axis=-1, 
    #                       keepdims=True), tf.float32) )
    # <tf.Tensor: shape=(2, 1), dtype=float32, numpy=
    # array([[1.4142135],
    #        [1.       ]], dtype=float32)>
    
    # # return tf.reduce_mean(inputs, axis=1) * sqrt_n_words ####################
    
    # tf.reduce_mean(another_example, axis=1)
    # <tf.Tensor: shape=(2, 3), dtype=float32, numpy=
    #          mean word embedding , mean word embedding , mean word embedding
    # array([[ 1.6666666==(1+4+0)/3, 2.3333333==(2+5+0)/3, 1.==(3+0+0)/3  ],
    #        [ 2.       ==(6+0+0)/3, 0.       ==(0+0+0)/3, 0.==(0+0+0)/3  ]
    #       ], dtype=float32)>
    
    # 1.4142135*1.6666666 = 2.3570224
    # 1.4142135*2.3333333 = 3.2998314
    
    # tf.reduce_mean(another_example, axis=1) * tf.math.sqrt( tf.cast(tf.math.count_nonzero(tf.math.count_nonzero(another_example, axis=-1), 
    #                                                                                       axis=-1, 
    #                                                                                       keepdims=True), 
    #                                                                 tf.float32) 
    #                                                       )
    
    # <tf.Tensor: shape=(2, 3), dtype=float32, numpy=
    # array([[2.3570225, 3.2998314, 1.4142135],
    #        [2.       , 0.       , 0.       ]], dtype=float32)>
    
    
    def compute_mean_embedding(inputs):
        not_pad = tf.math.count_nonzero(inputs, axis=-1)
        n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
        sqrt_n_words = tf.math.sqrt( tf.cast(n_words, tf.float32) )
        return tf.reduce_mean(inputs, axis=1) * sqrt_n_words
    
    another_example = tf.constant([ [ [1., 2., 3.], 
                                      [4., 5., 0.], 
                                      [0., 0., 0.] 
                                    ],
                                    [ [6., 0., 0.], 
                                      [0., 0., 0.], 
                                      [0., 0., 0.] 
                                    ]
                                  ])
    compute_mean_embedding(another_example)

    Let's check that this is correct. The first review contains 2 words (the last token is a zero vector, which represents the <pad> token). The second review contains 1 word. So we need to compute the mean embedding for each review, and multiply the first one by the square root of 2, and the second one by the square root of 1:
    tf.reduce_mean(another_example, axis=1) * tf.sqrt([[2.], [1.]])

  • Perfect. Now we're ready to train our final model. It's the same as before, except we replaced the BagOfWords layer with an Embedding layer followed by a Lambda layer that calls the compute_mean_embedding() function:

    embedding_size=20
    model = keras.models.Sequential([
        text_vectorization,
        keras.layers.Embedding(input_dim=n_tokens, #n_tokens = max_vocabulary_size + n_oov_buckets + 1 # add 1 for <pad>
                               output_dim=embedding_size,
                               mask_zero=True), # <pad> tokens => zero vectors
        keras.layers.Lambda(compute_mean_embedding),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid")
    ])

     

  • f. Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.
     
    model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
    model.fit(train_set, epochs=5, validation_data= valid_set)


    The model is not better using embeddings (but we will do better in Chapter 16). The pipeline looks fast enough (we optimized it earlier).

  • g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews"). 
    import tensorflow_datasets as tfds
    
    datasets = tfds.load(name="imdb_reviews")
    train_set, test_set = datasets["train"], datasets["test"]
    for example in train_set.take(1):
        print(example["text"])
        print(example["label"])
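
    By default tfds.load() yields dictionaries, as above. If you prefer (text, label) tuples, it also accepts as_supervised=True; a small sketch feeding the result into the same kind of pipeline as before:

    datasets = tfds.load(name="imdb_reviews", as_supervised=True)
    train_set = datasets["train"].shuffle(25000).batch(32).prefetch(1)
    test_set = datasets["test"].batch(32).prefetch(1)
    for X, y in train_set.take(1):
        print(X[0])
        print(y[0])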
