TensorFlow – parsing TFRecords with tf.io.VarLenFeature(tf.string), etc.

Sometimes each example carries a variable-length list of string labels, something like

['text1', 'text2']

for each example. Say it's an image dataset, and each image is labeled with every object that appears in it.

Creating dataset

The TensorFlow documentation shows that string features need to be stored with tf.train.BytesList().

import tensorflow as tf

def _bytes_feature(values):
    # values: list of byte strings, one per label
    values = [v for v in values if v is not None]  # drop missing labels (my addition)
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

Here, values is a list of UTF-8 encoded strings (bytes), built like this.

def item_to_example(item):
    values = item['image_labels']  # list of dicts, each with a 'label' string
    values = [d['label'].encode('utf-8') for d in values]
    features = {'label': _bytes_feature(values)}
    example = tf.train.Example(features=tf.train.Features(feature=features))
    return example.SerializeToString()

Training

Later, we need a parsing function for each example when constructing the tf.data.Dataset. Because the number of ground-truth labels varies from image to image in our example, we need to use tf.io.VarLenFeature.

def parse_example(example):
    parsed = tf.io.parse_single_example(
        example, {'label': tf.io.VarLenFeature(tf.string)}
    )
    return parsed['label']

Interestingly, this parsing returns not a list of strings or the like, but a sparse tensor (tf.sparse.SparseTensor). You will probably want to convert it into a dense tensor.

dense_labels = tf.sparse.to_dense(sparse_tensor_labels, default_value=b'')
# now a 1-D tf.string tensor of length N, one element per label
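
To make the sparse-to-dense step less mysterious, here is a minimal plain-Python sketch of what densifying does. It does not call TensorFlow; sparse_to_dense_2d is a hypothetical helper written only for illustration, and the (indices, values, dense_shape) triple mimics how a sparse tensor stores its entries. For a single parsed example there are no gaps to fill; the default value starts to matter once rows of different lengths share one tensor, for example in a batch.

```python
def sparse_to_dense_2d(indices, values, dense_shape, default_value=b""):
    """Illustrative stand-in for densifying a rank-2 sparse tensor (not a TensorFlow API)."""
    rows, cols = dense_shape
    # Start from a grid filled with the default value ...
    dense = [[default_value] * cols for _ in range(rows)]
    # ... then write each stored value at its (row, column) position.
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense


# Two examples: the first has two labels, the second has only one.
indices = [(0, 0), (0, 1), (1, 0)]
values = [b"cat", b"dog", b"car"]
dense = sparse_to_dense_2d(indices, values, dense_shape=(2, 2))
print(dense)  # [[b'cat', b'dog'], [b'car', b'']]
```

The empty byte string in the second row is the padding that default_value=b'' produces.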

You can then use a lookup table to convert these strings to indices, and the indices to one-hot vectors.

# before data loading (TF 1.x API; tf.contrib was removed in TF 2)
all_the_labels = ['cat', 'dog', 'car', 'all the possible objects are', 'here']
table = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(all_the_labels), num_oov_buckets=1, default_value=-1
)
depth = len(all_the_labels) + 1  # +1 for the out-of-vocabulary bucket

# during data loading, inside the parsing function
dense_labels = tf.sparse.to_dense(sparse_tensor_labels, default_value=b'')
indices = table.lookup(dense_labels)  # integer indices, shape (N,)
# if you want to sum up all the one-hot vectors into one N-hot vector,
out = tf.reduce_sum(tf.one_hot(indices, depth=depth), axis=0)
return tf.clip_by_value(out, 0, 1)  # repeated labels would otherwise count twice
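
If the lookup-and-sum logic is hard to picture, the following plain-Python sketch mirrors it without TensorFlow. The names build_index and n_hot are hypothetical helpers, and the shortened vocabulary is made up for the example; the only assumption carried over from the snippet above is a single out-of-vocabulary bucket placed right after the known labels.

```python
def build_index(vocabulary):
    """Map each known label to its position in the vocabulary."""
    return {label: i for i, label in enumerate(vocabulary)}


def n_hot(labels, index, depth):
    """Return an N-hot vector; unknown labels fall into the last (OOV) slot."""
    oov = depth - 1
    out = [0.0] * depth
    for label in labels:
        # Assigning 1.0 (instead of adding) plays the role of the final clipping.
        out[index.get(label, oov)] = 1.0
    return out


all_the_labels = [b"cat", b"dog", b"car"]  # assumed vocabulary for this sketch
index = build_index(all_the_labels)
depth = len(all_the_labels) + 1  # +1 for the out-of-vocabulary bucket
vec = n_hot([b"dog", b"cat", b"dog", b"unicorn"], index, depth)
print(vec)  # [1.0, 1.0, 0.0, 1.0]
```

Note that in this sketch a padding value such as b'' is not in the vocabulary either, so it would also land in the out-of-vocabulary slot.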

It was pretty hard to Google all of these, so writing up here for me and you 🙂

Reference – this book – page 428 etc.
