segmentseq2seq Code Walkthrough, Part 1

Article link: https://blog.csdn.net/RanMW1129/article/details/80889155

segmentseq2seq source code: https://github.com/pponnada/segmentseq2seq

The code runs OK with TensorFlow 0.12.0.

The raw data files under datasets are:

1.Financial1.sequences5ss-1024ws-64.bz2
2.Financial1.ss-1024.vocab-5ws-64

Starting from predict.py:

maybe_download(datadir='datasets', fname=fname, url=origin)

1. Check whether the directory exists, and create it if it does not:

if not os.path.exists(datadir):
    print("Creating directory %s" % datadir)
    os.mkdir(datadir)

2. Build the file path:

filepath = os.path.join(datadir, fname)

3. Download the data from the URL to local disk:

filepath, _ = urllib.request.urlretrieve(url, filepath)
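Steps 1 to 3 can be combined into a single helper. What follows is a minimal sketch, not the repository's actual maybe_download: the name maybe_download_sketch and the skip-if-already-present check are assumptions added for illustration.

```python
import os
import urllib.request


def maybe_download_sketch(datadir, fname, url):
    """Download url into datadir/fname unless the file is already there.

    Hypothetical helper for illustration. The skip-if-present check is an
    assumption, not something confirmed from predict.py.
    """
    # Step 1: create the directory if it does not exist yet.
    os.makedirs(datadir, exist_ok=True)
    # Step 2: build the local file path.
    filepath = os.path.join(datadir, fname)
    # Step 3: fetch the URL only when there is no local copy.
    if not os.path.exists(filepath):
        filepath, _ = urllib.request.urlretrieve(url, filepath)
    return filepath
```

os.makedirs(..., exist_ok=True) replaces the exists-then-mkdir pair and also creates missing parent directories.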

About urlretrieve():

def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """

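Parameters:

(1) url: a remote or local URL

(2) filename=None: the local path to save to (if omitted, urllib creates a temporary file to hold the data)

(3) reporthook=None: a callback that takes three arguments: blocknum, the number of blocks downloaded so far; bs, the block size; size, the total size of the file being downloaded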

reporthook(blocknum, bs, size)

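A sample reporthook:

def reporthook_sample(blocknum, bs, size):
    # size is -1 when the total size is unknown, so skip the percentage then.
    if size <= 0:
        return
    # Cap at 100%: the last block is usually only partly filled.
    ps = min(100.0, 100.0 * blocknum * bs / size)
    print('%.2f%%' % ps)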
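To see which arguments urlretrieve() actually hands to the callback, the hook can simply record them. This is a small sketch: fetch_with_log is a hypothetical name, and the demo copies a local file through a file:// URL so that it needs no network access.

```python
import os
import pathlib
import tempfile
import urllib.request


def fetch_with_log(url, filepath):
    """Download url to filepath and return every reporthook call as a tuple."""
    calls = []

    def hook(blocknum, bs, size):
        # urlretrieve calls the hook once before the first block and then
        # once after each block it has read.
        calls.append((blocknum, bs, size))

    urllib.request.urlretrieve(url, filepath, reporthook=hook)
    return calls


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / 'src.bin'
        src.write_bytes(b'x' * 20000)
        for call in fetch_with_log(src.as_uri(), os.path.join(tmp, 'dst.bin')):
            print(call)
```

The first recorded call has blocknum 0, which is why a progress hook prints 0% before any data arrives.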

(4) data=None: valid URL-encoded data that is passed through to urlopen() and carries extra information to send to the server with the request

urlopen(url, data)
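In Python 3 the data argument has to be bytes, so a dict of form fields is first URL-encoded and then encoded to bytes. The sketch below only builds the request object and sends nothing. The field names and the example.com URL are placeholders, not anything used by segmentseq2seq.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields and URL, used only for illustration.
fields = {'user': 'alice', 'page': 2}
data = urllib.parse.urlencode(fields).encode('ascii')

# Building a Request does not touch the network. Supplying data makes
# urllib send it as the body of a POST request instead of issuing a GET.
req = urllib.request.Request('http://example.com/form', data=data)

print(data)              # b'user=alice&page=2'
print(req.get_method())  # POST
```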

4. About os.stat(): it returns the file-system status information for the given file

statinfo = os.stat(filepath)

Example:

os.stat('plot.py')
os.stat_result(st_mode=33204, st_ino=1463660, st_dev=2056, st_nlink=1, st_uid=1000, st_gid=1000, st_size=1198, st_atime=1530530109, st_mtime=1492579364, st_ctime=1530530101)

Return value reference: https://docs.python.org/3/library/os.html#os.stat_result

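st_mode: file type and permission bits
st_ino: inode number (file index on Windows)
st_dev: identifier of the device the file resides on
st_nlink: number of hard links
st_uid: user ID of the owner
st_gid: group ID of the owner
st_size: file size in bytes
st_atime: time of most recent access, in seconds
st_mtime: time of most recent content modification, in seconds
st_ctime: time of most recent metadata change on Unix (creation time on Windows), in seconds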
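A few of these fields are usually all that is needed after a download, for instance to report the file size. The following sketch reads three of them. describe is a hypothetical helper, and the sample file is created on the fly so the example is self-contained.

```python
import datetime
import os
import stat
import tempfile


def describe(filepath):
    """Return size, directory flag and modification time from os.stat()."""
    statinfo = os.stat(filepath)
    return {
        'size': statinfo.st_size,                     # bytes
        'is_dir': stat.S_ISDIR(statinfo.st_mode),     # decode the type bits
        'modified': datetime.datetime.fromtimestamp(statinfo.st_mtime),
    }


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.bin')
        with open(path, 'wb') as f:
            f.write(b'0123456789')
        info = describe(path)
        print('Found %s: %d bytes, modified %s'
              % (os.path.basename(path), info['size'], info['modified']))
```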

5. Decompress the file

import bz2
with bz2.BZ2File(compressed, 'rb') as file:
    with open(uncompressed, 'wb') as new_file:
        for data in iter(lambda: file.read(100 * 1024), b''):
            new_file.write(data)

(1) About the read() method of class BZ2File:

def read(self, size=-1):
    """Read up to size uncompressed bytes from the file.

    If size is negative or omitted, read until EOF is reached.
    Returns b'' if the file is already at EOF.
    """

If size is negative or omitted, read() reads until EOF. Once the file is already at EOF, it returns b''.
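The b'' return value at EOF is what lets a chunked copy loop terminate. Here is a sketch of the same decompression written as an explicit loop. decompress_bz2 and its chunk_size parameter are hypothetical names chosen for this example.

```python
import bz2
import os
import tempfile


def decompress_bz2(compressed, uncompressed, chunk_size=100 * 1024):
    """Stream-decompress a .bz2 file in fixed-size chunks."""
    with bz2.BZ2File(compressed, 'rb') as src, open(uncompressed, 'wb') as dst:
        while True:
            data = src.read(chunk_size)
            if data == b'':  # read() returns b'' once EOF has been reached
                break
            dst.write(data)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        packed = os.path.join(tmp, 'sample.bz2')
        plain = os.path.join(tmp, 'sample.txt')
        with bz2.open(packed, 'wb') as f:
            f.write(b'segment ' * 1000)
        decompress_bz2(packed, plain, chunk_size=1024)
        print(os.path.getsize(plain), 'bytes written')
```

Reading in chunks keeps memory use bounded by chunk_size, whereas a single read() with no argument would load the whole uncompressed file at once.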

(2) About the built-in function iter():

def iter(source, sentinel=None):  # known special case of iter
    """
    iter(iterable) -> iterator
    iter(callable, sentinel) -> iterator

    Get an iterator from an object.  In the first form, the argument must
    supply its own iterator, or be a sequence.
    In the second form, the callable is called until it returns the sentinel.
    """

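With one argument, the argument is an iterable.
With two arguments, callable must be a callable object (a function, or an instance of a class that defines __call__()). It is called repeatedly, and iteration stops as soon as the returned value equals the sentinel.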
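Both forms can be tried without any files. In this short sketch, io.BytesIO stands in for the compressed file, and Countdown is a hypothetical class that exists only to show an instance made callable through __call__().

```python
import io

# First form: iter(iterable) returns an iterator over the items.
it = iter([1, 2, 3])
first = next(it)

# Second form: iter(callable, sentinel) calls the callable with no
# arguments until the returned value equals the sentinel.
stream = io.BytesIO(b'abcdefgh')
chunks = list(iter(lambda: stream.read(3), b''))


class Countdown:
    """Instances are callable because the class defines __call__()."""

    def __init__(self, start):
        self.n = start

    def __call__(self):
        self.n -= 1
        return self.n


values = list(iter(Countdown(4), 0))

print(first)   # 1
print(chunks)  # [b'abc', b'def', b'gh']
print(values)  # [3, 2, 1]
```

The sentinel itself is never yielded, which is why the trailing b'' and the final 0 do not appear in the results.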