About tf.data.TextLineDataset() and common Dataset functions - python黑洞网

Published 2019-08-07 14:35


From the official source:
class TextLineDataset(dataset_ops.Dataset):
    """A `Dataset` comprising lines from one or more text files."""

    def __init__(self, filenames, compression_type=None, buffer_size=None):
        """Creates a `TextLineDataset`.

        Args:
          filenames: A `tf.string` tensor containing one or more filenames.
          compression_type: (Optional.) A `tf.string` scalar evaluating to one of
            `""` (no compression), `"ZLIB"`, or `"GZIP"`.
          buffer_size: (Optional.) A `tf.int64` scalar denoting the number of bytes
            to buffer. A value of 0 results in the default buffering values chosen
            based on the compression type.
        """

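In plain terms:
Creates a TextLineDataset object.
Arguments:
    filenames:          one or more file names (paths), given as strings
    compression_type:   optional; one of "" (no compression), "ZLIB", or "GZIP"
    buffer_size:        optional; the number of bytes to buffer while reading; 0 selects a default based on the compression type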
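To make the compression_type argument concrete, here is a minimal sketch that writes a small gzip file and reads it back. It assumes TensorFlow 2.x with eager execution (the rest of this post uses the 1.x API), and the file name demo.txt.gz and its contents are made up for the demonstration.

```python
import gzip
import os
import tempfile

import tensorflow as tf

# Hypothetical demo file: a gzip-compressed text file containing two lines.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "demo.txt.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# compression_type has to match the way the file was written.
gz_dataset = tf.data.TextLineDataset(gz_path, compression_type="GZIP")

# Assumes TF 2.x eager execution, where a dataset can be iterated like a Python iterable.
gz_lines = [line.numpy().decode("utf-8") for line in gz_dataset]
print(gz_lines)
```

Leaving compression_type at its default while pointing at a compressed file would hand the raw compressed bytes to the line reader, so the two should always be set together.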


Example:
import tensorflow as tf

# Several file paths can be passed together as a list
input_files = ['./input_file11', './input_file22']
dataset = tf.data.TextLineDataset(input_files)

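The tf.data.TextLineDataset interface reads data from text files. You only need to supply the file name or names, and it builds a dataset automatically. Each line of the files becomes one element of the dataset, and each element is a scalar tensor of type string.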
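The "one line, one element" behaviour can be checked with a small sketch. It assumes TensorFlow 2.x with eager execution, and the two files a.txt and b.txt are hypothetical files created only for this demonstration.

```python
import os
import tempfile

import tensorflow as tf

# Create two small hypothetical text files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "hello world\nfoo bar\n"), ("b.txt", "last line\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

# The files are read one after another, in the order given.
text_dataset = tf.data.TextLineDataset(paths)

# Assumes TF 2.x eager execution, so the dataset can be turned into a list.
elements = list(text_dataset)
first = elements[0]
print(first.dtype, first.shape)  # each element is a scalar string tensor

# The trailing newline of each line is not part of the element.
all_lines = [e.numpy() for e in elements]
print(all_lines)
```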

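Tip: a dataset supports many of the operations in the tf.data package. The four methods below are among the most commonly used in deep learning pipelines.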

1.map(): apply a function to every element


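Format:
map(function)

# The function passed to map decides how each element of the dataset is processed. For example:
dataset = dataset.map(lambda string: tf.string_split([string]).values)
# Each element is bound to the name `string` and split into tokens
'''
Note:
tf.string_split(
    source,
    delimiter=' ',
    skip_empty=True
)
source: the input to split, a 1-D string tensor such as a list of strings;
delimiter: the separator; the default is a single space, not the empty string
skip_empty: defaults to True, which drops empty tokens from the result
'''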

For more on the tf.string_split call used in the code, see this separate post: https://blog.csdn.net/xinjieyuan/article/details/90698352
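As a runnable sketch of map() with a splitting function, the example below assumes TensorFlow 2.x, where tf.strings.split takes the place of the 1.x tf.string_split used above. The sample sentences are invented, and from_tensor_slices stands in for a TextLineDataset so that no files are needed.

```python
import tensorflow as tf

# Stand-in for a TextLineDataset: each element is one line of text.
sentences = tf.data.Dataset.from_tensor_slices(["the quick fox", "hello world"])

# Split every line on whitespace; a scalar string becomes a 1-D tensor of tokens.
tokens_ds = sentences.map(lambda line: tf.strings.split(line))

# Assumes TF 2.x eager execution for iteration.
tokens = [[t.decode("utf-8") for t in elem.numpy()] for elem in tokens_ds]
print(tokens)
```

The elements of tokens_ds can have different lengths, which is why steps such as padded batching usually follow in a real pipeline.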

2.shuffle():


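Shuffles the order of the elements.
Elements are drawn at random from a buffer holding buffer_size elements, so a larger buffer gives a more thorough shuffle.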
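A short sketch of shuffle(), assuming TensorFlow 2.x with eager execution. The range dataset and the seed value are chosen only for illustration; the buffer size equals the dataset size here so that every element can land in any position.

```python
import tensorflow as tf

# Ten integers, 0 through 9, in order.
numbers = tf.data.Dataset.range(10)

# buffer_size is required; seed is optional and only fixes the random source.
shuffled = numbers.shuffle(buffer_size=10, seed=42)

# Assumes TF 2.x eager execution for iteration.
shuffled_values = [int(x.numpy()) for x in shuffled]
print(shuffled_values)  # same ten numbers, in a shuffled order
```

Shuffling only reorders the elements: none are added, dropped, or duplicated.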

3.zip()

Combines several datasets element by element into one dataset


# Create two different datasets
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))

dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
# Combine them
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
'''
Note:
When calling zip(), wrap the datasets in a tuple (the extra pair of parentheses).
Otherwise it raises:
TypeError: zip() takes 1 positional argument but 2 were given
'''
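To show what a zipped element looks like, here is a small sketch that assumes TensorFlow 2.x with eager execution. The names features and labels and their values are invented for the demonstration.

```python
import tensorflow as tf

# Two datasets of the same length: three feature rows and three labels.
features = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4], [5, 6]])
labels = tf.data.Dataset.from_tensor_slices([0, 1, 0])

# Pass the datasets as one tuple; each zipped element is a (feature, label) pair.
pairs = tf.data.Dataset.zip((features, labels))

# Assumes TF 2.x eager execution for iteration.
zipped = [(f.numpy().tolist(), int(l.numpy())) for f, l in pairs]
print(zipped)
```

The structure of the tuple passed to zip() is the structure of every resulting element, so nested tuples produce nested elements.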

4.filter()

Keeps only the elements that satisfy a condition


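# Define the filter predicate
# Assumes each dataset element is a (src_len, trg_len) pair and that MAX_LEN is defined elsewhere
def FilterLength(src_len, trg_len):
    len_ok = tf.logical_and(
        tf.greater(src_len, 1),            # True when src_len is greater than 1
        tf.less_equal(trg_len, MAX_LEN)    # True when trg_len is less than or equal to MAX_LEN
    )
    return len_ok
# Apply the filter; elements for which it returns False are dropped
dataset = dataset.filter(FilterLength)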
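The filter above can be tried end to end with the sketch below. It assumes TensorFlow 2.x with eager execution; the MAX_LEN value of 5 and the sample length pairs are hypothetical and chosen only so that some elements pass and some fail.

```python
import tensorflow as tf

MAX_LEN = 5  # hypothetical limit, chosen for this demonstration only

def filter_length(src_len, trg_len):
    # Keep an element only if src_len > 1 and trg_len <= MAX_LEN.
    return tf.logical_and(
        tf.greater(src_len, 1),
        tf.less_equal(trg_len, MAX_LEN),
    )

# Each element is a (src_len, trg_len) pair: (1, 2), (3, 5), (4, 9), (2, 1).
lengths = tf.data.Dataset.from_tensor_slices(([1, 3, 4, 2], [2, 5, 9, 1]))
kept = lengths.filter(filter_length)

# Assumes TF 2.x eager execution for iteration.
kept_pairs = [(int(s.numpy()), int(t.numpy())) for s, t in kept]
print(kept_pairs)
```

Here (1, 2) fails the src_len check and (4, 9) fails the trg_len check, so only the other two pairs remain.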