DoCP Study Notes (3-2): Regular Expressions, Other Languages and Interpreters - Problem Set 3

Article link: https://blog.csdn.net/qq_33528613/article/details/79540430

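Note 1: Translating the English subtitles of every video into Chinese takes too much time. To speed up my progress, I am pausing that work and will only add a few annotations to the English subtitles.
Note 2: Put the .flv video file and the .srt subtitle file from the Subtitles folder into the same folder, then open the video in Xunlei Kankan; the subtitles load automatically.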

Regular Expressions, other languages and interpreters

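What you will learn:

Defining a language of regular expressions; interpreting that language.
Defining the set of strings matched by a regular expression.
Other languages.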

Problem Set 3

Video link:
Problem Set 3 - Udacity

Course Syllabus

Lesson 3: Regular Expressions, Other Languages and Interpreters

Lesson 3 Course Notes (mainly web pages with the English subtitles for the course videos.)
Lesson 3 Code
Help with Regular Expressions

1. Exercise: Json Parser

Json Parser - Design of Computer Programs - YouTube

Supplementary material below the video (start)

You can read about the JSON grammar here.

Supplementary material below the video (end)

In this homework you’re going to be asked to write a grammar that will allow us to write a parser for the JSON language. JSON is a data interchange language for JavaScript that allows JavaScript programs to pass values back and forth and pass them on to other programs. What I’m asking you to do is to write this grammar. You can look at JSON.org to see the definition. There is a little grammar there on the right-hand side. It’s not quite in the right format that we expect here, so you’re going to have to translate it into the right format. Then you should be able to parse. A top level is called a value in JSON. You should be able to parse that with your grammar. Here I’ve given you some examples that you can test out to see if they make sense. See if you can translate the JSON grammar into our format, and see how easy that is.

# ---------------
# User Instructions
#
# In this problem, you will be using many of the tools and techniques
# that you developed in unit 3 to write a grammar that will allow
# us to write a parser for the JSON language. 
#
# You will have to visit json.org to see the JSON grammar. It is not 
# presented in the correct format for our grammar function, so you 
# will need to translate it.

# ---------------
# Provided functions
#
# These are all functions that were built in unit 3. They will help
# you as you write the grammar.  Add your code at line 102.

from functools import update_wrapper
from string import split
import re

def grammar(description, whitespace=r'\s*'):
    """Convert a description to a grammar.  Each line is a rule for a
    non-terminal symbol; it looks like this:
        Symbol =>  A1 A2 ... | B1 B2 ... | C1 C2 ...
    where the right-hand side is one or more alternatives, separated by
    the '|' sign.  Each alternative is a sequence of atoms, separated by
    spaces.  An atom is either a symbol on some left-hand side, or it is
    a regular expression that will be passed to re.match to match a token.

    Notation for *, +, or ? not allowed in a rule alternative (but ok
    within a token). Use '\' to continue long lines.  You must include spaces
    or tabs around '=>' and '|'. That's within the grammar description itself.
    The grammar that gets defined allows whitespace between tokens by default;
    specify '' as the second argument to grammar() to disallow this (or supply
    any regular expression to describe allowable whitespace between tokens)."""
    G = {' ': whitespace}
    description = description.replace('\t', ' ') # no tabs!
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def memo(f):
    """Decorator that caches the return value for each call to f(args).
    Then when called again with same args, we can just look it up."""
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # some element of args can't be a dict key
            return f(*args)
    return _f

def parse(start_symbol, text, grammar):
    """Example call: parse('Exp', '3*x + b', G).
    Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
    string. Failure iff remainder is None. This is a deterministic PEG parser,
    so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
    longest parse first; don't do 'E => T | T op E'
    Also, no left recursion allowed: don't do 'E => E op T'"""

    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None: return Fail
            result.append(tree)
        return result, text

    @memo
    def parse_atom(atom, text):
        if atom in grammar:  # Non-Terminal: tuple of alternatives
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None: return [atom]+tree, rem  
            return Fail
        else:  # Terminal: match characters against start of text
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])

    # Body of parse:
    return parse_atom(start_symbol, text)

Fail = (None, None)

JSON = grammar("""your code here""", whitespace='\s*')

def json_parse(text):
    return parse('value', text, JSON)

def test():
    assert json_parse('["testing", 1, 2, 3]') == (                      
                       ['value', ['array', '[', ['elements', ['value', 
                       ['string', '"testing"']], ',', ['elements', ['value', ['number', 
                       ['int', '1']]], ',', ['elements', ['value', ['number', 
                       ['int', '2']]], ',', ['elements', ['value', ['number', 
                       ['int', '3']]]]]]], ']']], '')

    assert json_parse('-123.456e+789') == (
                       ['value', ['number', ['int', '-123'], ['frac', '.456'], ['exp', 'e+789']]], '')

    assert json_parse('{"age": 21, "state":"CO","occupation":"rides the rodeo"}') == (
                      ['value', ['object', '{', ['members', ['pair', ['string', '"age"'], 
                       ':', ['value', ['number', ['int', '21']]]], ',', ['members', 
                      ['pair', ['string', '"state"'], ':', ['value', ['string', '"CO"']]], 
                      ',', ['members', ['pair', ['string', '"occupation"'], ':', 
                      ['value', ['string', '"rides the rodeo"']]]]]], '}']], '')
    return 'tests pass'

print test()

1. Json Parser Solution

JSON = grammar("""object => { } | { members }
members => pair , members | pair
pair => string : value
array => [[] []] | [[] elements []]
elements => value , elements | value
value => string | number | object | array | true | false | null
string => "[^"]*"
number => int frac exp | int frac | int exp | int
int => -?[1-9][0-9]*
frac => [.][0-9]+
exp => [eE][-+]?[0-9]+""", whitespace='\s*')

Here is my answer. There is a lot of variation in exactly how you want the grammar to look. But I just went through the grammar on the json.org webpage, started to write it down, and made sure I wrote it in the correct order. They had things like members goes to pair or pair, members, and I had to make sure to put the longer element first, because that’s the way our parser works. Other than that, it was straightforward. I abbreviated (shortened) a little bit. I didn’t get into spelling out the individual digits. I kind of stuck them together into longer regular expressions to take advantage of that rather than break it out all the way. But either way you did it, if you get it to parse, you’ve done the job and congratulations.
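The point about putting the longer alternative first can be checked with a small experiment. The sketch below is a Python 3 port of the course's grammar and parse functions (the listings above are Python 2); the memo decorator is left out for brevity, and the two tiny grammars LONG_FIRST and SHORT_FIRST are made up for this illustration, not part of the problem set.

```python
import re

FAIL = (None, None)

def grammar(description, whitespace=r'\s*'):
    """Turn 'Symbol => A B | C' lines into {symbol: (alternatives...)}."""
    G = {' ': whitespace}
    for line in description.strip().split('\n'):
        lhs, rhs = line.split(' => ', 1)
        G[lhs.strip()] = tuple(alt.split() for alt in rhs.split(' | '))
    return G

def parse(start_symbol, text, G):
    """Return (tree, remainder); remainder is None on failure."""
    tokenizer = G[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None:
                return FAIL
            result.append(tree)
        return result, text

    def parse_atom(atom, text):
        if atom in G:  # non-terminal: try each alternative in order
            for alternative in G[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None:
                    return [atom] + tree, rem
            return FAIL
        m = re.match(tokenizer % atom, text)  # terminal: a regex token
        return FAIL if not m else (m.group(1), text[m.end():])

    return parse_atom(start_symbol, text)

# Hypothetical mini-grammars: same rules, different alternative order.
LONG_FIRST = grammar("""
elements => value , elements | value
value => [0-9]+
""")
SHORT_FIRST = grammar("""
elements => value | value , elements
value => [0-9]+
""")

long_tree, long_rest = parse('elements', '1, 2, 3', LONG_FIRST)
short_tree, short_rest = parse('elements', '1, 2, 3', SHORT_FIRST)
print(repr(long_rest))   # empty remainder: the whole input was parsed
print(repr(short_rest))  # text left over after the first value
```

With the short alternative first, the parser accepts a single value and stops, because it never goes back to try the longer alternative once one has succeeded.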

2. Exercise: Inverse Function

Inverse Function - Design of Computer Programs - YouTube

Supplementary material below the video (start)

You may want to read about binary search and Newton’s method to help you solve this problem.

Supplementary material below the video (end)

In this homework we’re going to show the power of functions as tools by computing inverse functions and doing the work just once, rather than having to do it for each function.

What do I mean by that?

The square function for squaring numbers is easy, right? We just return x times x.
def square(x): return x*x

But if we wanted to define the square root function, that’s a lot harder. Assuming we didn’t have the exponentiation (raising to a power) operator built in, if we wanted to define it in terms of elementary arithmetic (basic calculation), that’s a lot of work.

Isaac Newton came up with a way to do it, but he’s Newton.

For those of us who aren’t, wouldn’t it be great if instead of having to write all this code to define the square root, we could just say “sqrt” is the inverse of square and be done with it?
sqrt = inverse(square)

Well, in this homework we’re going to do just that. We’re going to do it in a slightly restricted (limited) sense. We’re only going to deal with functions that are defined on the non-negative numbers and are monotonically increasing (always going up). They have to keep on going up. That way they have a defined inverse. Functions that are non-monotonic don’t have a single inverse (a function that undoes them), because there’s a two-to-one mapping.

def inverse(f, delta=1/128.):
    """Given a function y = f(x) that is a monotonically increasing function on
    non-negative numbers, return the function x = f_1(y) that is an approximate
    inverse, picking the closest value to the inverse, within delta."""
    def f_1(y):
        x = 0
        while f(x) < y:
            x += delta
        # Now x is too big, x-delta is too small; pick the closest to y
        return x if (f(x)-y < y-f(x-delta)) else x-delta
    return f_1

def square(x): return x * x
sqrt = inverse(square)             ## Hint: "binary search" "Newton's method"
print sqrt(100)
10.0

print sqrt(99)
9.953125

print sqrt(100000000)
10000.0

Here’s my definition of inverse. I’m going to give you a simple version and then ask you to write a more efficient one.

What does inverse do? It takes a function f and returns a function f_1. The way it figures out what to do is it says let’s start at zero, because we said this is a function defined on the non-negative numbers, and ask: is this f(x) less than the y that’s being passed to f_1? If it is, let’s increment x by a little bit, a little bit being delta, which here I’ve defined as 1/128, but you can define it as what you want when you’re asking for the inverse. Keep on going until we find an f(x) which is not less than y. Now x is too big: f(x) is greater than or equal to y. And x minus delta is too small: f(x - delta) is less than y. We’re somewhere in between the two, and we want to pick the closest one. That’s what this expression does. It says we know the result is somewhere in there, and we want to choose which one is closer.

How does that work? Well, we can define square. We can ask for the square root of 100. I guess I missed a step in here where I have to say that sqrt is equal to inverse(square). Now when we ask for the square root of 100 we get exactly 10.0. That’s the right answer. When we ask for the square root of 99, we get 9.95-something. That’s pretty close, although there are more accurate representations that the computer could come up with. When we ask for the square root of 100 million, we get 10,000 exactly, which is exactly the right answer.

But it took a little bit too long. It took almost a second to come up with this result. I’d like it to go much faster.

So that’s what I’m going to ask you to do. I want you to modify inverse so that it has a run time closer to the logarithm of the input to f_1 rather than linear in the input to f_1.

I’ll give you two hints of things to consider. One is binary search, and the other is Newton’s method. So do some research on those, and then modify the definition of inverse so that when we say sqrt = inverse(square) the whole function runs faster.
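For the second hint, here is a minimal sketch of Newton's method applied to the square root only, written for Python 3. It is not the course solution: it needs the derivative of x*x, so unlike the generic inverse the exercise asks for, it does not work for an arbitrary function f. The name newton_sqrt and the tolerance parameter are assumptions made for this illustration.

```python
def newton_sqrt(y, tolerance=1e-12):
    """Approximate sqrt(y) for y >= 0 with Newton's method on g(x) = x*x - y."""
    if y == 0:
        return 0.0
    x = max(y, 1.0)  # starting guess at or above the true root
    while True:
        # Newton step x - g(x)/g'(x) with g'(x) = 2*x, simplified.
        next_x = (x + y / x) / 2.0
        if abs(next_x - x) <= tolerance * next_x:
            return next_x
        x = next_x

print(newton_sqrt(100))
print(newton_sqrt(2))
```

Each step roughly doubles the number of correct digits once the guess is close, which is why only a handful of iterations are needed even for large inputs.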

# --------------
# User Instructions
#
# Write a function, inverse, which takes as input a monotonically
# increasing (always increasing) function that is defined on the 
# non-negative numbers. The runtime of your program should be 
# proportional to the LOGARITHM of the input. You may want to 
# do some research into binary search and Newton's method to 
# help you out.
#
# This function should return another function which computes the
# inverse of the input function. 
#
# Your inverse function should also take an optional parameter, 
# delta, as input so that the computed value of the inverse will
# be within delta of the true value.

# -------------
# Grading Notes
#
# Your function will be called with three test cases. The 
# input numbers will be large enough that your submission
# will only terminate in the allotted time if it is 
# efficient enough. 

def slow_inverse(f, delta=1/128.):
    """Given a function y = f(x) that is a monotonically increasing function on
    non-negative numbers, return the function x = f_1(y) that is an approximate
    inverse, picking the closest value to the inverse, within delta."""
    def f_1(y):
        x = 0
        while f(x) < y:
            x += delta
        # Now x is too big, x-delta is too small; pick the closest to y
        return x if (f(x)-y < y-f(x-delta)) else x-delta
    return f_1 

def inverse(f, delta = 1/128.):
    """Given a function y = f(x) that is a monotonically increasing function on
    non-negative numbers, return the function x = f_1(y) that is an approximate
    inverse, picking the closest value to the inverse, within delta."""

def square(x): return x*x
sqrt = slow_inverse(square)

print sqrt(1000000000)

2. Inverse Function Solution

Here’s my approach to the problem. Here are our axes. Here’s my function, f(x). Now, the problem is I’m given some value of y, and I want to find the value of x such that f(x) equals y. The strategy we had before was to just start at the 0 point and go out step by step, but that’s going to be slow if there are lots of steps.

My approach is this: first I’m going to take one step forward, and then if the value of f(x) is still below y, I’m going to double how far out I go. I’m going to keep on doubling how far out I go and checking (1, 2, 4, 8, 16 units out) until I’ve got bounds. Here I have this value that I doubled. f(x) was less than the desired y. When I went all the way out to here, f(x) was greater than the desired y. I know that the right x has to be somewhere within this range. That gives me the low and the high that I get by doubling.

Now I’m going to find the exact value, or close to the exact value, within low and high by halving. First doubling, now halving. I look in here and ask what’s halfway between low and high. That’s at this point. f(x) there is still too high. Now I know I must be somewhere in this half. I go halfway there. That’s still too high. Now I know I must be in this half, and I keep on doing that process until I zero in on the right value.
That’s the strategy.

def inverse(f, delta=1/1024.):
    def f_1(y):
        lo, hi = find_bounds(f, y)
        return binary_search(f, y, lo, hi, delta)
    return f_1

def find_bounds(f, y):
    x = 1
    while f(x) < y:
        x = x * 2
    lo = 0 if (x == 1) else x/2.
    return lo, x

Now let’s see what the code looks like. Here’s my function “inverse.” I’m going to have my smallest delta–the smallest amount that I move out–be 1/1024. That’s going to get me to within three significant digits. I’m going to build up this f inverse function. I was given y = f(x). I’m going to build up x = f_1(y). I do that first by finding the low and high bounds– it’s got to be somewhere in there–and then doing a binary search somewhere in between that low and high to find a value that’s accurate to within delta.

Here’s how I find the bounds. I start off, and I just keep on doubling until I find a value that’s high enough.

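def binary_search(f, y, lo, hi, delta):
    # Narrow [lo, hi] around the x where f(x) == y until the interval is empty.
    while lo <= hi:
        x = (lo + hi) / 2.
        if f(x) < y:
            lo = x + delta    # x is too small: search the upper half
        elif f(x) > y:
            hi = x - delta    # x is too big: search the lower half
        else:
            return x          # exact hit
    # No exact hit: choose between the two bounds with the closeness test.
    return hi if (f(hi)-y < y-f(lo)) else lo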

What I keep on doing is narrowing down the interval between low and high until the interval has disappeared. If I’m too low, then I change the low. Otherwise, if I’m too high, I change the high value. That makes the interval smaller and smaller. If I hit it exactly, I go ahead and return the x value. If I haven’t hit it exactly, then I know I’m somewhere in between the two, and I just check which one to return.

def square(x): return x*x

def power10(x): return 10**x

log10 = inverse(power10)

sqrt = inverse(square)

Now, I’ve defined some functions here to help me test what I’ve done. I’ve defined the square and the 10^x functions. Now I define their inverses. The logarithm is just the inverse of the power of 10, and sqrt is the inverse of square using the function I defined.

cuberoot = inverse(lambda x: x*x*x)

I can also do a cube root as the inverse of the cube function.

def test():
    import math
    nums = [2, 4, 6, 8, 10, 99, 100, 101, 100, 1000, 10000, 20000, 40000, 100000000]
    for n in nums:
        test1(n, 'sqrt', sqrt(n), math.sqrt(n))
        test1(n, 'log ', log10(n), math.log10(n))
        test1(n, '3-rt', cuberoot(n), n**(1./3.))

def test1(n, name, value, expected):
    diff = abs(value-expected)
    print '%6g: %s = %13.7f (%13.7f actual); %.4f diff; %s' % (
        n, name, value, expected, diff,
        ('ok' if diff < .002 else '**** BAD ****'))

Then I’m defining some tests. For these sets of numbers, I’m going to test these functions–sqrt, log10, and cuberoot, and I’m going to test them against the correct mathematical functions as defined by Python. These are the ones I’ve defined with inverse. These are the ones Python defines, and here’s my individual test.

Here’s what I get when I run the tests. For each of the numbers and for each of the functions, here is the result I compute with my inverse function. Here are the actual results, and you can see the differences are all 0.001 or less. That’s true for small numbers, and it’s true even as we go up to bigger numbers like 10^8.
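To see the difference in running time without a stopwatch, one option is to count how often f gets called. The sketch below is written for Python 3 and is only an illustration: counted is a hypothetical helper, and fast_inverse is a compact variant of the doubling-then-halving strategy (it keeps f(lo) < y <= f(hi) and stops once the interval is narrower than delta), not a copy of the find_bounds and binary_search listings above.

```python
def counted(f):
    """Wrap f so that the number of calls is recorded in wrapper.calls."""
    def wrapper(x):
        wrapper.calls += 1
        return f(x)
    wrapper.calls = 0
    return wrapper

def slow_inverse(f, delta=1/128):
    def f_1(y):
        x = 0
        while f(x) < y:          # linear scan in steps of delta
            x += delta
        return x if (f(x) - y < y - f(x - delta)) else x - delta
    return f_1

def fast_inverse(f, delta=1/1024):
    def f_1(y):
        # Doubling phase: find hi with f(hi) >= y.
        hi = 1.0
        while f(hi) < y:
            hi *= 2
        lo = 0.0 if hi == 1.0 else hi / 2
        # Halving phase: shrink [lo, hi] until it is narrower than delta.
        while hi - lo > delta:
            mid = (lo + hi) / 2
            if f(mid) < y:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    return f_1

square_slow = counted(lambda x: x * x)
square_fast = counted(lambda x: x * x)
slow_root = slow_inverse(square_slow)(10000)
fast_root = fast_inverse(square_fast)(10000)
print(slow_root, square_slow.calls)   # thousands of calls to f
print(fast_root, square_fast.calls)   # a few dozen calls to f
```

The slow version needs about sqrt(y)/delta evaluations of f, while the fast one needs roughly log2 of the answer for the doubling phase plus log2 of (interval width / delta) for the halving phase.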

3. Exercise: Find Html Tags

Find Html Tags - Design of Computer Programs - YouTube

Now, if you took CS101, you had to find HTML tags within a document, and you did that mostly using the string.find method, which works okay, but it’s fragile (easily broken) and doesn’t get everything quite right.

The code you wrote was not completely robust (sturdy) to having spaces and line feeds and other things anywhere in the text, so we’d like to do something that’s a little bit more robust.

Your problem is to find these HTML tags, and in particular what I’m looking for are start tags. They begin with an opening angle bracket, and then a would be a tag, and table would be another one, and so on. I’m not looking for the end tags, so don’t worry about those. Just the start tags. Then something like a and then a set of attribute-value pairs. Then a closing angle bracket (the > character). We’re looking to find all the instances that look like that. This whole thing: angle bracket, tag, optional set of parameters, each one a parameter name, an equal sign, and a string. The values will have to be strings, and they have to be enclosed in double quotes. You can rely on that, and there won’t be any double quotes within the string. That simplifies it a little bit. Then the closing bracket. But there can be spaces anywhere and so on.

def findtags(text):
    [..., ..., ...]

I want you to write a function, call it “findtags”, which you pass a text, and it returns a list of the strings that look like that and come out of the text. You can use whatever tools you found. You can use the regular expression module “re.” You can use the context-free parser that we built. Whatever you think is appropriate (suitable).

# ---------------
# User Instructions
#
# Write a function, findtags(text), that takes a string of text
# as input and returns a list of all the html start tags in the 
# text. It may be helpful to use regular expressions to solve
# this problem.

import re

def findtags(text):
    # your code here
    pass

testtext1 = """
My favorite website in the world is probably 
<a href="www.udacity.com">Udacity</a>. If you want 
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

testtext2 = """
Okay, so you passed the first test case. <let's see> how you 
handle this one. Did you know that 2 < 3 should return True? 
So should 3 > 2. But 2 > 3 is always False.
"""

testtext3 = """
It's not common, but we can put a LOT of whitespace into 
our HTML tags. For example, we can make something bold by
doing <         b           > this <   /b    >, Though I 
don't know why you would ever want to.
"""

def test():
    assert findtags(testtext1) == ['<a href="www.udacity.com">', 
                                   '<b>', 
                                   '<a href="www.udacity.com"target="_blank">']
    assert findtags(testtext2) == []
    assert findtags(testtext3) == ['<         b           >']
    return 'tests pass'

print test()

3. Find Html Tags Solution

import re

def findtags(text):
    # (?:...) is non-capturing, so re.findall returns whole-tag strings.
    parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
    tags  = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    return re.findall(tags, text)

Here’s my solution to the findtags problem. I decided to do it just using regular expressions, so I make that explicit and import the regular expression module. Then I broke up my regular expression, because they can get long and complicated. I said there is a part where I’m parsing the parameters, and that’s one or more word characters, an equal sign, and then a quoted string: a quote character, 0 or more non-quote characters, followed by a quote character. Then for that whole thing we can have zero or more parameters. The thing that made it complicated is I’ve got to throw in optional spaces in multiple positions; “\s*” means zero or more whitespace characters. That defines a set of parameters. Then the tag is the angle bracket, maybe some spaces, word characters like the a or the table tag, then the parameters, then maybe some more spaces, and then the close. The close is an angle bracket. I allowed an optional slash before the angle bracket there. Now that I’ve defined this regular expression in tags, I just call re.findall with tags on the text.
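One detail of re.findall matters for a pattern built this way: when the pattern contains more than one capturing group, findall returns a list of tuples, one entry per group, instead of a list of strings. The sketch below (Python 3) compares a capturing and a non-capturing parameter group; the sample text and the helper tag_pattern are made up for this illustration.

```python
import re

text = '<a href="www.udacity.com"target="_blank">Udacity</a> and <b>bold</b>'

capturing = r'(\w+\s*=\s*"[^"]*"\s*)*'        # inner group captures
non_capturing = r'(?:\w+\s*=\s*"[^"]*"\s*)*'  # inner group does not capture

def tag_pattern(parms):
    """Build the start-tag pattern around a given parameter sub-pattern."""
    return r'(<\s*\w+\s*' + parms + r'\s*/?>)'

with_inner_group = re.findall(tag_pattern(capturing), text)
single_group = re.findall(tag_pattern(non_capturing), text)
print(with_inner_group)  # list of (whole tag, last parameter) tuples
print(single_group)      # list of whole-tag strings
```

Only the non-capturing form gives the plain list of tag strings that the test function for this problem compares against.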

4. Challenge Problem

Challenge Problem - Design of Computer Programs - YouTube

Supplementary material below the video (start)

This homework problem is difficult and completely optional. We won’t be grading this one, so feel free to discuss possible solutions in the forums.

Supplementary material below the video (end)

Now, we showed how useful it is to have an API for regular expressions, where we can say, for example, plus of an option of an alternative of literal A and literal B. These function calls are convenient to manipulate, but that’s an awful lot to type, especially when the string notation for regular expressions is so much simpler: we can represent this as a simple string. So what I want you to do in this homework is to write a grammar and a parser that maps from this string to this expression. You should first define a regular expression grammar using the tools that we’ve provided and build the parser for that. That’s going to give you a tree: when we parse some text with this grammar, with, let’s say, RE as our main symbol, that’s going to give you some sort of tree, but it’s not quite in this API form. So then I want you to write another function to convert from the tree to the API. Here’s what it looks like. You are going to define your grammar; RE is going to be the main left-hand side symbol. To handle the regular expression that’s given to you, you parse it and then convert the resulting tree into calls to the API.
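As a rough idea of what the string-to-API mapping involves, here is a minimal Python 3 sketch. It deliberately swaps in a hand-written recursive-descent parser instead of the course's grammar and parse tools, and it goes straight from the string to constructor calls rather than through an intermediate tree. The constructors lit, seq, alt, star, plus and opt are stand-ins that just build tuples; their names and two-argument signatures are assumptions, and only literals, parentheses, | and the postfix operators * + ? are supported.

```python
# Stand-in constructors (assumed names); each just records its arguments.
def lit(s): return ('lit', s)
def seq(x, y): return ('seq', x, y)
def alt(x, y): return ('alt', x, y)
def star(x): return ('star', x)
def plus(x): return ('plus', x)
def opt(x): return ('opt', x)

def convert(pattern):
    """Convert a pattern string into nested constructor calls."""
    tree, pos = parse_alt(pattern, 0)
    if pos != len(pattern):
        raise ValueError('unexpected %r at position %d' % (pattern[pos], pos))
    return tree

def parse_alt(p, i):
    # alt := seq ('|' alt)?
    left, i = parse_seq(p, i)
    if i < len(p) and p[i] == '|':
        right, i = parse_alt(p, i + 1)
        return alt(left, right), i
    return left, i

def parse_seq(p, i):
    # seq := postfix+, folded to the right as seq(a, seq(b, c))
    left, i = parse_postfix(p, i)
    if i < len(p) and p[i] not in '|)':
        right, i = parse_seq(p, i)
        return seq(left, right), i
    return left, i

def parse_postfix(p, i):
    # postfix := atom ('*' | '+' | '?')*
    tree, i = parse_atom(p, i)
    ops = {'*': star, '+': plus, '?': opt}
    while i < len(p) and p[i] in ops:
        tree = ops[p[i]](tree)
        i += 1
    return tree, i

def parse_atom(p, i):
    # atom := '(' alt ')' | a single literal character
    if i >= len(p) or p[i] in '|)*+?':
        raise ValueError('expected an atom at position %d' % i)
    if p[i] == '(':
        tree, i = parse_alt(p, i + 1)
        if i >= len(p) or p[i] != ')':
            raise ValueError('missing ) at position %d' % i)
        return tree, i + 1
    return lit(p[i]), i + 1

print(convert('(a|b)?+'))
print(convert('ab*c'))
```

The pattern '(a|b)?+' is one string that yields a plus of an option of an alternative of two literals. A solution in the spirit of the exercise would instead express these rules as a grammar description and then walk the resulting parse tree.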