Practical Python - Data Processing (2): Converting Data to a Specific Format

michaelnju

Category: hadoop. Tags: python, data processing

Article link: https://blog.csdn.net/michael_kong_nju/article/details/39482903

Here is the current situation:

1. One of my folders contains many data files with names like these:

part-m-0000

part-m-0001

part-m-0002

part-m-0003

...

2. The data in each file has the following format:

"460030730101160","3","0","0","0","2013/8/31 0:21:42"
"460036745672363","3","0","0","0","2013/8/31 0:21:31"
"460030250931114","3","1307","1","0","2013/8/31 0:21:40"
"460030250942643","3","0","0","0","2013/8/31 0:21:40"
"460036650411006","3","1021","1","0","2013/8/31 0:21:39"
"000000000009674","8","0","0","0","2013/8/31 0:12:28"
"000000000005661","8","0","0","0","2013/8/31 0:12:29"
"460030731390121","3","0","0","0","2013/8/31 21:54:00"
"460030256111396","3","0","0","0","2013/8/31 21:54:00"
"460030207447762","3","0","0","0","2013/8/31 21:53:58"
"460030250939916","3","0","0","0","2013/8/31 21:53:58"
"460030957972011","3","1613","0","0","2013/8/31 21:53:51"
"460030237206739","3","0","0","0","2013/8/31 21:53:59"
...

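The goal is to remove the quotation marks around the numbers and to extract the hour from the timestamp in the last column. Here is how I handled it in Python: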

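1. First, iterate over all files in the current folder whose names start with 'part';

2. For each file, read it line by line and split each line on the comma (",");

3. For every field, keep only the part between the quotation marks; for the last field (the timestamp), keep only the hour, taking care that the hour may have either one or two digits;

4. Write each line to the output as soon as it has been read and converted.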
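Steps 2 and 3 can be sketched on a single line as follows. This is only a minimal illustration: the function name `convert_line` is hypothetical, and it assumes the timestamp always has the form `date time` separated by one space, as in the sample data. Malformed timestamps fall back to "-1".

```python
def convert_line(line):
    """Strip the quotes from every field and replace the timestamp by its hour."""
    fields = line.rstrip("\n").split(",")
    # Every column except the last: drop the surrounding quotation marks
    values = [field.strip('"') for field in fields[:-1]]
    # Last column, e.g. 2013/8/31 0:21:42 -> the hour sits before the first colon
    parts = fields[-1].strip('"').split(" ")
    if len(parts) == 2 and ":" in parts[1]:
        hour = parts[1].split(":")[0]  # works for 1-digit and 2-digit hours
    else:
        hour = "-1"  # malformed or missing timestamp
    return ",".join(values + [hour])


print(convert_line('"460030250931114","3","1307","1","0","2013/8/31 0:21:40"\n'))
# 460030250931114,3,1307,1,0,0
print(convert_line('"460030731390121","3","0","0","0","2013/8/31 21:54:00"\n'))
# 460030731390121,3,0,0,0,21
```

Splitting on the space and the colon avoids depending on fixed character positions, which would shift if the month or day had a different number of digits.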

Here is the actual code:

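# coding: utf-8
import os

SRC_DIR = "./"
DST_DIR = "./data_handled/"
os.makedirs(DST_DIR, exist_ok=True)  # create the output folder if it is missing

# Only files directly inside the current folder are processed
for name in os.listdir(SRC_DIR):
    filepath = os.path.join(SRC_DIR, name)  # the current input file
    if not (name.startswith("part") and os.path.isfile(filepath)):
        continue
    print(filepath)
    # name[7:] drops the "part-m-" prefix, e.g. "part-m-0000" -> "0000"
    newfilepath = os.path.join(DST_DIR, name[7:])  # the file to write into
    with open(filepath) as src, open(newfilepath, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")
            # Delete the surrounding quotes from every column except the last
            out = [field[1:-1] for field in fields[:-1]]
            # The last column looks like "2013/8/31 0:21:42"; the hour sits
            # between the space and the first colon, so it has 1 or 2 digits
            timestamp = fields[-1].strip('"')
            if " " in timestamp and ":" in timestamp:
                hour = timestamp.split(" ")[1].split(":")[0]
            else:
                hour = "-1"  # malformed or missing timestamp
            out.append(hour)
            dst.write(",".join(out) + "\n")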
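As an alternative sketch, not the approach used in the script above, the standard library can do both jobs: `csv.reader` removes the quotation marks, and `datetime.strptime` parses the timestamp so the hour can be read as a number. The function name `extract_rows` and the in-memory sample are assumptions made for illustration, and the format string assumes every timestamp looks like the ones in the sample data.

```python
import csv
import io
from datetime import datetime


def extract_rows(lines):
    """Yield each row without quotes, with the timestamp replaced by its hour."""
    for fields in csv.reader(lines):
        if not fields:
            continue  # csv.reader yields [] for blank lines; skipped in this sketch
        try:
            hour = datetime.strptime(fields[-1], "%Y/%m/%d %H:%M:%S").hour
        except ValueError:
            hour = -1  # timestamp did not match the assumed format
        yield fields[:-1] + [str(hour)]


# Two lines taken from the sample data, held in memory instead of a file
sample = io.StringIO(
    '"460030730101160","3","0","0","0","2013/8/31 0:21:42"\n'
    '"460030731390121","3","0","0","0","2013/8/31 21:54:00"\n'
)
rows = list(extract_rows(sample))
for row in rows:
    print(",".join(row))
```

Compared with manual splitting, this also copes with fields that contain commas inside the quotation marks, at the cost of being stricter about the timestamp format.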