Format of the content to extract:
Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...
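As a quick check of the markup above, the sketch below runs two regular expressions over the three sample rows. The `SAMPLE` string and the names `YEAR_RE` and `ROW_RE` are illustrative assumptions, not part of the exercise files.

```python
import re

# Sample markup copied from the excerpt above (a stand-in for a real baby.html file).
SAMPLE = '''<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
'''

# Hypothetical pattern names; one capture group for the year,
# three for rank, male name and female name.
YEAR_RE = re.compile(r'Popularity in\s+(\d{4})')
ROW_RE = re.compile(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>')

year = YEAR_RE.search(SAMPLE).group(1)
rows = ROW_RE.findall(SAMPLE)  # list of (rank, male_name, female_name) tuples

print(year)
for rank, boy, girl in rows:
    print(rank, boy, girl)
```

Each table row carries one rank shared by two names, so every match yields two name-rank pairs.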
Output requirements:
"""
Given a file name for baby.html, returns a list starting with the year string
followed by the name-rank strings in alphabetical order.
['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
"""
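Approach:
Parse the command-line arguments in main(), read each file in turn by file name, match line by line, store the names in a dict, sort them by name, and write the result to a file.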
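The argument-handling and file-output steps of this approach can be sketched on their own. The helper names `parse_args` and `write_summary`, the file names `baby1990.html` and `baby1992.html`, and the temporary output path are all assumptions made for illustration. The sketch writes one entry per line, which is one reasonable layout rather than a required one.

```python
import os
import tempfile


def parse_args(argv):
    """Split the argument list into (summary_flag, filenames)."""
    args = list(argv)
    summary = bool(args) and args[0] == '--summaryfile'
    if summary:
        del args[0]
    return summary, args


def write_summary(path, names):
    """Append one year's [year, 'Name rank', ...] list to path, one entry per line."""
    with open(path, 'a') as out:
        out.write('\n'.join(names) + '\n')


# Hypothetical command line: script.py --summaryfile baby1990.html baby1992.html
summary, files = parse_args(['--summaryfile', 'baby1990.html', 'baby1992.html'])

# Write to a temporary directory so the demo leaves no files behind.
out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
write_summary(out_path, ['1990', 'Ashley 2', 'Jessica 1'])
with open(out_path) as f:
    written = f.read()
os.remove(out_path)
print(summary, files)
print(written)
```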
<span style="font-size:18px;">#!/usr/bin/python
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/
import sys
import re
"""Baby Names exercise
Define the extract_names() function below and change main()
to call it.
For writing regex, it's nice to include a copy of the target
text for inspiration.
Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...
Suggested milestones for incremental development:
-Extract the year and print it
-Extract the names and rank numbers and just print them
-Get the names data into a dict and print it
-Build the [year, 'name rank', ... ] list and print it
-Fix main() to use the extract_names list
"""
def extract_names(filename):
    """
    Given a file name for baby.html, returns a list starting with the year string
    followed by the name-rank strings in alphabetical order.
    ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
    """
    year = None
    ranks = {}  # name -> rank string
    file_raw = open('reTestFile/' + filename, 'r')  # single input file
    for a_line in file_raw:
        # extract year, e.g. <h3 align="center">Popularity in 1990</h3>
        match_year = re.search(r'>Popularity in\s(\w+)<', a_line)
        if match_year:
            year = match_year.group(1)  # '1990'
        # extract rank, male name and female name from one table row
        match_name_and_rank = re.search(r'<tr align="right"><td>(\w+)</td><td>(\w+)</td><td>(\w+)</td>', a_line)
        if match_name_and_rank:
            rank, boy, girl = match_name_and_rank.groups()
            # rows are in rank order, so keep the first rank seen for a repeated name
            for name in (boy, girl):
                if name not in ranks:
                    ranks[name] = rank
    file_raw.close()
    if year is None:
        sys.stderr.write('no year found in ' + filename + '\n')
        sys.exit(1)
    # year first, then 'name rank' strings in alphabetical order
    return [year] + [name + ' ' + ranks[name] for name in sorted(ranks)]

def main():
    # This command-line parsing code is provided.
    # Make a list of command line arguments, omitting the [0] element
    # which is the script itself.
    args = sys.argv[1:]
    if not args:
        print('usage: [--summaryfile] file [file ...]')
        sys.exit(1)
    # Notice the summary flag and remove it from args if it is present.
    summary = False
    if args[0] == '--summaryfile':
        summary = True
        del args[0]
    # For each filename, get the names, then either print the text output
    # or write it to a summary file
    for a_file in args:
        names = extract_names(a_file)
        if summary:
            file_output = open('reTestFile/output.txt', 'a')  # summary output file
            file_output.write(str(names) + '\n\n')
            file_output.close()
        else:
            print(names)

if __name__ == '__main__':
    main()
</span>