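A programmer provided a Python script that extracts content from HTML files and saves it as "partial" files. The script also generates a JSON manifest that lists the URL, title, and creation date of every file. The programmer wants to modify the script so that it sorts the files by creation date and includes the creation date both in the manifest and in the names of the files it creates.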
2. Solution:
To meet the programmer's requirements, the script can be modified as follows:
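- In the parse_article function, call os.path.getmtime(abs_path) to get the file's modification date.
- In the process_folder function, use the sorted function to order the list of files by modification date.
- Include each file's modification date in the manifest entries that the save_json function writes.
- Include the modification date in the name of each partial file written with the save_file function.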
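The sorting and naming steps can be sketched in isolation. This is a minimal sketch rather than the script's own code: the helper names sort_by_mtime and dated_name are hypothetical, and formatting the date as YYYY-MM-DD in UTC is an assumption about the desired file-name format.

```python
import os
import time


def sort_by_mtime(folder, names):
    """Return names ordered by modification time, oldest first (hypothetical helper)."""
    return sorted(names, key=lambda name: os.path.getmtime(os.path.join(folder, name)))


def dated_name(folder, name):
    """Build a '[YYYY-MM-DD]-name.partial' file name from the file's modification time.

    The UTC date format is an assumption, chosen because it also sorts correctly as text.
    """
    mtime = os.path.getmtime(os.path.join(folder, name))
    date_string = time.strftime('%Y-%m-%d', time.gmtime(mtime))
    return "[%s]-%s.partial" % (date_string, name)
```

Both helpers only read file metadata, so they can be tried on a folder without writing anything to it.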
Code example:
import os
import re
from collections import OrderedDict

from BeautifulSoup import BeautifulSoup
import simplejson as json


def parse_article(root, filename):
    # Extract the title and body of one HTML file and save the body as a partial
    path = os.path.join(root, filename)
    abs_path = os.path.abspath(path)
    try:
        article = open(abs_path, 'rU')
        html = article.read()
        article.close()
    except IOError:
        print "Cannot open article: %s" % path
        return None  # nothing to parse; the caller skips this file
    url = "/%s" % path
    soup = BeautifulSoup(html)
    # Use the first of these tags that exists as the title
    title = None
    for fallback in ['h1', 'h2', 'h3', 'title']:
        title = soup.find(fallback)
        if title is not None:
            break
    content = u"" if soup.body is None else soup.body.renderContents()
    title = u"" if title is None else title.renderContents()
    # Modification time in seconds since the epoch
    modification_date = os.path.getmtime(abs_path)
    # Include the modification date in the name of the partial file
    save_file(root, "[%s]-%s.partial" % (modification_date, filename), content)
    return unicode(url), title, unicode(modification_date)


def process_folder(path):
    # Parse every article in a folder and build its manifest, oldest file first
    files = os.listdir(path)
    articles = [name for name in files
                if not name.startswith('index.') and name.endswith(('.html', '.htm'))]
    # Sort the file list by modification date
    sorted_articles = sorted(articles, key=lambda name: os.path.getmtime(os.path.join(path, name)))
    # An OrderedDict keeps the manifest entries in the same sorted order
    manifest = OrderedDict()
    for article in sorted_articles:
        parsed = parse_article(path, article)
        if parsed is None:
            continue
        url, title, modification_date = parsed
        manifest[url] = {'title': title, 'modification_date': modification_date}
    return sorted_articles, manifest


def save_json(root, name, obj):
    # Write the manifest as JSON; skip folders without articles
    if len(obj) == 0:
        return
    path = os.path.join(root, name)
    manifest = open(path, 'w')
    json.dump(obj, manifest)
    manifest.close()
    print "Wrote %s" % path


def save_file(root, name, content):
    path = os.path.join(root, name)
    output = open(path, 'w')
    output.write(content)
    output.close()
    print "Wrote %s" % path


def process(root):
    root = os.path.abspath(root)
    # Escape the root so that special characters in the path are matched literally
    root_re = '^%s[/]*' % re.escape(root)
    for dirname, dirnames, filenames in os.walk(root):
        # Make dirname relative to root; this assumes the script is run from root,
        # as it is with process('.')
        dirname = re.sub(root_re, '', dirname)
        if len(dirname) > 0:  # skip the root folder itself
            sorted_articles, manifest = process_folder(dirname)
            abs_path = os.path.abspath(os.path.join(root, dirname))
            save_json(abs_path, "manifest.json", manifest)


if __name__ == "__main__":
    process('.')
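If the order of entries in manifest.json matters to whatever reads it, one option is to write a list sorted by date instead of an object keyed by URL, because the JSON specification treats objects as unordered collections. The sketch below assumes that layout; build_manifest and the 'url', 'title', and 'date' field names are illustrative and not part of the original script, and the readable UTC date format is likewise an assumption.

```python
import json
import time


def build_manifest(entries):
    """Turn (url, title, timestamp) tuples into a list of dicts, oldest first."""
    ordered = sorted(entries, key=lambda entry: entry[2])
    return [
        {
            'url': url,
            'title': title,
            # Readable UTC date instead of a raw timestamp (assumed format)
            'date': time.strftime('%Y-%m-%d', time.gmtime(timestamp)),
        }
        for url, title, timestamp in ordered
    ]


# Example input with made-up URLs, titles, and timestamps
entries = [
    ('/docs/b.html', 'Second', 1500000000),
    ('/docs/a.html', 'First', 1000000000),
]
manifest_json = json.dumps(build_manifest(entries), indent=2)
```

A list keeps its order in any JSON parser, at the cost of losing direct lookup by URL.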
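With these changes, the script sorts the files by modification date (used here in place of the creation date) and includes that date both in the manifest and in the names of the files it creates.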
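The request asks for the creation date, while os.path.getmtime reports the last modification, and a closer approximation is available on some platforms. The helper below is a hedged sketch (creation_time is a hypothetical name): it uses st_birthtime where os.stat exposes it, for example on macOS and FreeBSD, uses st_ctime on Windows, where that field holds the creation time, and otherwise falls back to the modification time, since on Linux st_ctime is the time of the last metadata change rather than the creation time.

```python
import os
import sys


def creation_time(path):
    """Best-effort creation timestamp for path (hypothetical helper)."""
    info = os.stat(path)
    birthtime = getattr(info, 'st_birthtime', None)
    if birthtime is not None:
        return birthtime  # exposed on macOS and FreeBSD, among others
    if sys.platform.startswith('win'):
        return info.st_ctime  # on Windows, st_ctime is the creation time
    return info.st_mtime  # elsewhere, fall back to the modification time
```

Swapping this in for os.path.getmtime would change which timestamp is used without affecting the sorting or naming logic.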