Google's Python Class: Extracting Data Fields from an HTML Page with Regular Expressions

Format of the content to extract:

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...
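
To make the target format concrete, here is a minimal sketch that runs two regular expressions over the sample lines shown above. The pattern names `YEAR_RE` and `ROW_RE` are illustrative choices rather than part of the exercise, and the sketch assumes each table row sits on a single line, as it does in the excerpt.

```python
import re

# Sample lines copied from the baby.html excerpt above.
sample_lines = [
    '<h3 align="center">Popularity in 1990</h3>',
    '<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>',
    '<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>',
    '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>',
]

# Hypothetical pattern names; the groups capture the fields we care about.
YEAR_RE = re.compile(r'Popularity in\s+(\d{4})')
ROW_RE = re.compile(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>')

year = None
rows = []
for line in sample_lines:
    year_match = YEAR_RE.search(line)
    if year_match:
        year = year_match.group(1)
    row_match = ROW_RE.search(line)
    if row_match:
        # groups() returns (rank, boy_name, girl_name) as strings.
        rows.append(row_match.groups())

print(year)
print(rows)
```

Each row yields one rank and two names, so a single table line contributes two entries to the final result.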

Required output:

"""
    Given a file name for baby.html, returns a list starting with the year string
    followed by the name-rank strings in alphabetical order.
    ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
    """
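
The required list shape can be sketched as follows, assuming the names and ranks have already been collected into a dict. The helper name `build_result` and the sample dict are hypothetical and only reuse the values from the docstring above.

```python
def build_result(year, ranks_by_name):
    """Return [year, 'name rank', ...] with entries sorted by name."""
    # sorted() on dict items orders the (name, rank) pairs by name.
    entries = ['%s %s' % (name, rank)
               for name, rank in sorted(ranks_by_name.items())]
    return [year] + entries


# Illustrative data taken from the docstring example.
example = {'Aaron': '57', 'Abagail': '895', 'Aaliyah': '91'}
result = build_result('2006', example)
print(result)
```

Keeping the ranks as strings avoids any conversion, since they are only concatenated into the output.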

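Approach:

     Parse the command-line arguments in main(), read each named file in turn, match it line by line, store the names in a dict, sort by name, and write the result to a file.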


#!/usr/bin/env python3
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

import sys
import re


"""Baby Names exercise

Define the extract_names() function below and change main()
to call it.

For writing regex, it's nice to include a copy of the target
text for inspiration.

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...

Suggested milestones for incremental development:
 -Extract the year and print it
 -Extract the names and rank numbers and just print them
 -Get the names data into a dict and print it
 -Build the [year, 'name rank', ... ] list and print it
 -Fix main() to use the extract_names list
"""


def extract_names(filename):
    """
    Given a file name for baby.html, returns a list starting with the year string
    followed by the name-rank strings in alphabetical order.
    ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
    """
    # +++your code here+++
    year = None
    ranks_by_name = {}   # name -> rank string

    # The input files are expected under the reTestFile/ directory.
    with open('reTestFile/' + filename, 'r') as file_raw:
        for a_line in file_raw:
            # <h3 align="center">Popularity in 1990</h3>
            match_year = re.search(r'>Popularity in\s(\w+)<', a_line)
            # <tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
            match_name_and_rank = re.search(
                r'<tr align="right"><td>(\w+)</td><td>(\w+)</td><td>(\w+)</td>',
                a_line)
            if match_year:
                year = match_year.group(1)                   # 1990
            if match_name_and_rank:
                rank, boy_name, girl_name = match_name_and_rank.groups()
                # Rows appear in ascending rank order, so keep the first rank
                # seen when a name occurs in both columns.
                for name in (boy_name, girl_name):
                    if name not in ranks_by_name:
                        ranks_by_name[name] = rank

    if year is None:
        raise ValueError('no "Popularity in <year>" heading found in ' + filename)

    # Build [year, 'name rank', ...] with the names in alphabetical order.
    result = [year]
    for name in sorted(ranks_by_name):
        result.append(name + ' ' + ranks_by_name[name])
    return result


def main():
    # This command-line parsing code is provided.
    # Make a list of command line arguments, omitting the [0] element
    # which is the script itself.
    args = sys.argv[1:]

    if not args:
        print('usage: [--summaryfile] file [file ...]')
        sys.exit(1)

    # Notice the summary flag and remove it from args if it is present.
    summary = False
    if args[0] == '--summaryfile':
        summary = True
        del args[0]

    # +++your code here+++
    # For each filename, get the names, then either print the text output
    # or write it to a summary file
    for a_file in args:
        names = extract_names(a_file)
        if summary:
            # Append each file's list to the shared summary file.
            with open('reTestFile/output.txt', 'a') as file_output:
                file_output.write(str(names) + '\n\n')
        else:
            print('\n'.join(names))


if __name__ == '__main__':
    main()
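
As a variation on the line-by-line matching in the script above, the same extraction can be sketched with `re.findall` over the whole file contents. The function name `extract_names_from_text` is hypothetical, and the choice to keep the first (best) rank for a name that appears in both columns is an assumption about the desired behavior, not something the exercise text above spells out.

```python
import re


def extract_names_from_text(html_text):
    """Return [year, 'name rank', ...] parsed from baby.html-style text."""
    year_match = re.search(r'Popularity in\s+(\d{4})', html_text)
    if not year_match:
        raise ValueError('no "Popularity in YEAR" heading found')

    ranks_by_name = {}
    # findall returns one (rank, boy_name, girl_name) tuple per table row.
    row_pattern = r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>'
    for rank, boy_name, girl_name in re.findall(row_pattern, html_text):
        for name in (boy_name, girl_name):
            # Assumption: rows are in ascending rank order, so the first
            # rank seen for a name is the one to keep.
            ranks_by_name.setdefault(name, rank)

    entries = ['%s %s' % (name, ranks_by_name[name])
               for name in sorted(ranks_by_name)]
    return [year_match.group(1)] + entries


# Sample text assembled from the excerpt in the exercise docstring.
sample = '''<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
'''
names = extract_names_from_text(sample)
print(names)
```

Working on a string rather than a file name also makes the parsing step easy to exercise without any files on disk.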

