The current situation is:
1. I have a folder containing many data files with names like these:
part-m-0000
part-m-0001
part-m-0002
part-m-0003
...
2. The data in each file looks like this:
"460030730101160","3","0","0","0","2013/8/31 0:21:42"
"460036745672363","3","0","0","0","2013/8/31 0:21:31"
"460030250931114","3","1307","1","0","2013/8/31 0:21:40"
"460030250942643","3","0","0","0","2013/8/31 0:21:40"
"460036650411006","3","1021","1","0","2013/8/31 0:21:39"
"000000000009674","8","0","0","0","2013/8/31 0:12:28"
"000000000005661","8","0","0","0","2013/8/31 0:12:29"
"460030731390121","3","0","0","0","2013/8/31 21:54:00"
"460030256111396","3","0","0","0","2013/8/31 21:54:00"
"460030207447762","3","0","0","0","2013/8/31 21:53:58"
"460030250939916","3","0","0","0","2013/8/31 21:53:58"
"460030957972011","3","1613","0","0","2013/8/31 21:53:51"
"460030237206739","3","0","0","0","2013/8/31 21:53:59"
...
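Now I need to strip the quotation marks from the values and extract the hour from the time in the last column. This is how I handled it in Python:
1. First, walk through all files in the current folder whose names start with 'part';
2. For each file, read every line and split it on ",";
3. Then take the part between the quotes in each field, and take the hour from the time in the last field; this requires checking whether the hour has 1 or 2 digits;
4. Write one output line for every input line read.
Here is the actual code: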
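As a side note on step 3, the check for a 1-digit or 2-digit hour can be avoided by splitting the time field instead of indexing into it. The following is only a minimal sketch: the helper name extract_hour is hypothetical, and it assumes the field always has the "date time" shape shown in the sample data.

```python
def extract_hour(timestamp):
    """Return the hour of a 'Y/M/D H:M:S' field as text, or "-1" if malformed."""
    # Remove a trailing newline and the surrounding double quotes, if present
    parts = timestamp.strip().strip('"').split(" ")
    if len(parts) != 2:
        return "-1"  # not in the assumed "date time" shape
    hour = parts[1].split(":")[0]  # everything before the first colon
    return hour if hour.isdigit() else "-1"


print(extract_hour('"2013/8/31 0:21:42"'))   # 0
print(extract_hour('"2013/8/31 21:54:00"'))  # 21
```

Because the hour is taken as "everything before the first colon", the same code should work whether the hour, day, or month has one digit or two.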
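# coding: utf-8
import os

out_dir = "./data_handled"
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)  # the output folder must exist before files are written to it

for root, dirs, files in os.walk("./"):
    for name in files:
        if name.startswith("part"):
            filepath = os.path.join(root, name)  # the current input file
            print(filepath)
            newfilepath = os.path.join(out_dir, name[7:])  # the output file, e.g. "0000" for "part-m-0000"
            infile = open(filepath)
            newfile = open(newfilepath, 'w')
            for line in infile:
                string = ""
                line_ = line.split(',')
                for i in range(len(line_) - 1):
                    string += line_[i][1:-1] + ','  # drop the surrounding quotes
                last = line_[-1]  # e.g. "2013/8/31 0:21:42", opening quote included
                # Index 11 is the first hour digit only for dates shaped like 2013/8/31
                if len(last) > 12:
                    if last[12] == ':':
                        k = last[11:12]  # one-digit hour
                    else:
                        k = last[11:13]  # two-digit hour
                else:
                    k = "-1"  # time field missing or too short
                newfile.write(string + k + "\n")
            infile.close()
            newfile.close()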
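
An alternative that may be worth considering is to let the standard csv module remove the quotes and datetime.strptime parse the time, so no character positions need to be counted. This is a rough sketch for Python 3, not a drop-in replacement for the script above: the function name convert_rows and the constant TIME_FORMAT are made up for illustration, and the time format is assumed from the sample rows.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%Y/%m/%d %H:%M:%S"  # assumed from the sample rows


def convert_rows(lines):
    """Yield output lines: unquoted fields, with the time replaced by its hour."""
    for row in csv.reader(lines):  # csv.reader strips the double quotes
        if not row:
            continue  # skip blank lines
        try:
            hour = str(datetime.strptime(row[-1], TIME_FORMAT).hour)
        except ValueError:
            hour = "-1"  # same fallback value as in the script above
        yield ",".join(row[:-1] + [hour])


# Two rows taken from the sample data, fed in as an in-memory file
sample = io.StringIO(
    '"460030250931114","3","1307","1","0","2013/8/31 0:21:40"\n'
    '"460030731390121","3","0","0","0","2013/8/31 21:54:00"\n'
)
for out in convert_rows(sample):
    print(out)
```

For real files, an open file object can be passed in place of the in-memory sample, and each yielded line written to the output file followed by "\n".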