1. Test document:
# test_input.txt
Welcome to World Wide Spam, Inc.

These are the corporate web pages of *World Wide Spam*, Inc. We hope you find your stay enjoyable, and that you will sample many of our products.

A short history of the company

World Wide Spam was started in the summer of 2000. The business concept was to ride the dot-com wave and to make money both through bulk email and by selling canned meat online.

After receiving several complaints from customers who weren't satisfied by their bulk email, World Wide Spam altered their profile, and focused 100% on canned goods. Today, they rank as the world's 13,892nd online supplier of SPAM.

Destinations

From this page you may visit several of our interesting web pages:

- What is SPAM? (http://wwspam.fu/whatisspam)

- How do they make it? (http://wwspam.fu/howtomakeit)

- Why should I eat it? (http://wwspam.fu/whyeatit)

How to get in touch with us

You can get in touch with us in *many* ways: By phone (555-1234), by email (wwspam@wwspam.fu) or by visiting our customer feedback page (http://wwspam.fu/feedback)
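2. Implementation
2.1 Finding blocks of text
Collect every line encountered until a blank line appears, then return the lines collected so far;
Do not collect blank lines, and do not return empty blocks (when several blank lines occur in a row);
Make sure the last line is blank; otherwise there is no way to tell where the last block ends.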
# Text block generator (util.py)
def lines(file):
    # Append a blank line to the end of the file
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():    # non-blank line
            block.append(line)
        elif block:         # blank line (end of a block) and block is non-empty: join the collected lines
            yield ' '.join(block).strip()
            block = []
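As a quick check of the three rules above, the generator can be run on an in-memory file. This is a minimal sketch: the sample text is made up for illustration, and `io.StringIO` stands in for a real file object.

```python
import io

def lines(file):
    # Yield every line, then one extra blank line so the last block is flushed.
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():      # non-blank line: keep collecting
            block.append(line)
        elif block:           # blank line after content: emit the block
            yield ' '.join(block).strip()
            block = []

# Hypothetical sample: two blank lines in a row, and no blank line at the end.
sample = io.StringIO("Title\n\n\nFirst line\nsecond line\n\nLast block")
result = list(blocks(sample))
print(result)
```

The consecutive blank lines produce no empty block, and the final block is still returned because `lines` appends a blank line of its own.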
2.2 Adding markup
Print some opening markup;
Print each block wrapped in paragraph tags;
Print some closing markup.
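# A simple markup program (simple_markup.py)
import sys, re
from util import *

print('<html><head><title>...</title><body>')

title = True
for block in blocks(sys.stdin):
    block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
    if title:
        # The first block is treated as the title
        print('<h1>')
        print(block)
        print('</h1>')
        title = False
    else:
        print('<p>')
        print(block)
        print('</p>')

print('</body></html>')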
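3. Modularization
Parser: reads the text and manages the objects of the other classes;
Rules: one rule per kind of block; a rule can detect the block type it applies to and format the block accordingly;
Filters: wrap the regular expressions that handle inline elements;
Handlers: the parser uses a handler to produce output. Each handler can produce a different kind of markup.
3.1 Handlers
Supplementary notes:
1) getattr() is one of the core functions of Python introspection. It returns an attribute or a method of an object, looked up by name, and accepts an optional default that is returned when the name is not found: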
class A:
    def __init__(self):
        self.name = 'zhangjing'
        # self.age = '24'

    def method(self):
        print("method print")

Instance = A()
print(getattr(Instance, 'name', 'not find'))      # prints self.name if Instance has a name attribute, otherwise 'not find'
print(getattr(Instance, 'age', 'not find'))       # prints self.age if Instance has an age attribute, otherwise 'not find'
print(getattr(Instance, 'method', 'default'))     # prints the bound method object if method exists, otherwise 'default'
print(getattr(Instance, 'method', 'default')())   # calls method (which prints "method print"), then prints its return value None
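2) callable(object)
Description: checks whether object is callable. If it returns True, calling object may still fail; if it returns False, calling object will never succeed.
Note: classes are callable; an instance is callable only if its class implements __call__().
Versions: the function is available in all Python 2.x releases. It was removed in Python 3.0 and added back in Python 3.2.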
>>> callable(0)
False
>>> callable("mystring")
False
>>> def add(a, b):
...     return a + b
...
>>> callable(add)
True
>>> class A:
...     def method(self):
...         return 0
...
>>> callable(A)
True
>>> a = A()
>>> callable(a)
False
>>> class B:
...     def __call__(self):
...         return 0
...
>>> callable(B)
True
>>> b = B()
>>> callable(b)
True
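Since callable() is missing from Python 3.0 and 3.1, code that must run there needs a substitute. The sketch below is one possible stand-in, not an official replacement: `has_call_method` is a hypothetical helper that looks for `__call__` on the object's type, and it is written for Python 3 (new-style classes).

```python
def has_call_method(obj):
    """Hypothetical stand-in for callable(): look for __call__ on the object's type."""
    return any('__call__' in vars(klass) for klass in type(obj).__mro__)

class A:
    def method(self):
        return 0

class B:
    def __call__(self):
        return 0

# The same kinds of objects as in the interactive session above.
samples = [0, "mystring", len, A, A(), B, B()]
checks = [(has_call_method(s), callable(s)) for s in samples]
print(checks)
```

The lookup goes through the type rather than the instance because that is where Python itself looks for `__call__` when an object is called.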
The handlers:
class Handler:
    '''
    Handles method calls from the parser. Methods are looked up by
    name, so a subclass only implements the ones it needs.
    '''
    def callback(self, prefix, name, *args):
        method = getattr(self, prefix + name, None)
        if callable(method): return method(*args)

    def start(self, name):
        self.callback('start_', name)

    def end(self, name):
        self.callback('end_', name)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            # No matching sub_ method: leave the matched text unchanged
            if result is None: return match.group(0)
            return result
        return substitution

class HTMLRenderer(Handler):
    '''
    A handler that renders HTML.
    '''
    def start_document(self):
        print('<html><head><title>...</title></head><body>')
    def end_document(self):
        print('</body></html>')
    def start_paragraph(self):
        print('<p>')
    def end_paragraph(self):
        print('</p>')
    def start_heading(self):
        print('<h2>')
    def end_heading(self):
        print('</h2>')
    def start_list(self):
        print('<ul>')
    def end_list(self):
        print('</ul>')
    def start_listitem(self):
        print('<li>')
    def end_listitem(self):
        print('</li>')
    def start_title(self):
        print('<h1>')
    def end_title(self):
        print('</h1>')
    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)
    def sub_url(self, match):
        return '<a href="%s">%s</a>' % (match.group(1), match.group(1))
    def sub_mail(self, match):
        return '<a href="mailto:%s">%s</a>' % (match.group(1), match.group(1))
    def feed(self, data):
        print(data)
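The Handler class:
1) The callback method is given a prefix (such as 'start_') and a name (such as 'paragraph') and looks up the matching method (such as start_paragraph), using getattr with None as the default. If the object returned by getattr is callable, it is called with any extra arguments that were supplied. For example, if the method exists, handler.callback('start_', 'paragraph') calls handler.start_paragraph with no arguments.
2) start and end are helper methods that call callback with the prefixes start_ and end_ respectively.
3) sub returns a new function, which is used as the replacement function in re.sub.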
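The three points can be tried out on a cut-down handler. This is a minimal sketch: `EmphasisOnly` is a hypothetical renderer that implements a single substitution, and the `<em>` tag in it is an assumption about the intended output.

```python
import re

class Handler:
    def callback(self, prefix, name, *args):
        # Look up e.g. 'sub_' + 'emphasis'; None if no such method exists.
        method = getattr(self, prefix + name, None)
        if callable(method):
            return method(*args)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            if result is None:
                return match.group(0)   # no handler method: keep the text as it was
            return result
        return substitution

class EmphasisOnly(Handler):   # hypothetical renderer, for illustration only
    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)

handler = EmphasisOnly()
text = 'in *many* ways'
emphasized = re.sub(r'\*(.+?)\*', handler.sub('emphasis'), text)
untouched = re.sub(r'\*(.+?)\*', handler.sub('url'), text)   # no sub_url defined
missing = handler.callback('start_', 'paragraph')            # no start_paragraph defined
print(emphasized, untouched, missing)
```

Because missing methods are silently skipped, a handler subclass can support only part of the markup without the parser having to know about it.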
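3.2 Rules
A rule can recognize which kind of block it applies to (the condition): the condition method
A rule can transform the block (the action): the action method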
class Rule:
    """Base class for all rules"""
    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block)
        handler.end(self.type)
        return True

class HeadingRule(Rule):
    """A heading is a single line of at most 70 characters that does not end with a colon"""
    type = 'heading'
    def condition(self, block):
        return not '\n' in block and len(block) <= 70 and not block[-1] == ':'

class TitleRule(HeadingRule):
    """The title is the first block of the document, provided that it is a heading"""
    type = 'title'
    first = True
    def condition(self, block):
        if not self.first: return False
        self.first = False
        return HeadingRule.condition(self, block)

class ListItemRule(Rule):
    """A list item is a paragraph that begins with a hyphen. The hyphen is removed as part of the formatting"""
    type = 'listitem'
    def condition(self, block):
        return block[0] == '-'
    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block[1:].strip())
        handler.end(self.type)
        return True

class ListRule(ListItemRule):
    """A list begins between a block that is not a list item and a following list item. It ends after the last consecutive list item"""
    type = 'list'
    inside = False
    def condition(self, block):
        return True
    def action(self, block, handler):
        if not self.inside and ListItemRule.condition(self, block):
            handler.start(self.type)
            self.inside = True
        elif self.inside and not ListItemRule.condition(self, block):
            handler.end(self.type)
            self.inside = False
        return False

class ParagraphRule(Rule):
    """A paragraph is simply a block that is not covered by any other rule"""
    type = 'paragraph'
    def condition(self, block):
        return True
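To see how the conditions behave on real blocks, the sketch below evaluates two of them against blocks taken from the test document. Only the condition methods are reproduced, and `startswith`/`endswith` are used in place of indexing, which is a small deviation chosen so that an empty string would not raise an error.

```python
class HeadingRule:
    """Condition only: one line, at most 70 characters, no trailing colon."""
    def condition(self, block):
        return '\n' not in block and len(block) <= 70 and not block.endswith(':')

class ListItemRule:
    """Condition only: the block starts with a hyphen."""
    def condition(self, block):
        return block.startswith('-')

blocks = [
    'Destinations',
    'From this page you may visit several of our interesting web pages:',
    '- What is SPAM? (http://wwspam.fu/whatisspam)',
]
heading_hits = [HeadingRule().condition(b) for b in blocks]
listitem_hits = [ListItemRule().condition(b) for b in blocks]
print(heading_hits)
print(listitem_hits)
```

The last block satisfies both conditions, so the order in which the rules are tried decides whether it is rendered as a list item or as a heading.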
3.3 Filters
There are three filters: one for emphasized text, one for URLs, and one for email addresses.
self.addFilter(r'\*(.+?)\*', 'emphasis')
self.addFilter(r'(http://[\.a-z0-9A-Z/]+)', 'url')
self.addFilter(r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)','mail')
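The three patterns can be tested on their own, without the parser. The following is a rough sketch: the replacement templates (`<em>`, `<a href=...>`, `mailto:`) are assumptions about the intended HTML, and `apply_filters` is a hypothetical helper that applies the patterns in the order they are registered.

```python
import re

# (pattern, replacement) pairs; the patterns are the ones registered above,
# the replacement templates are assumed for illustration.
FILTERS = [
    (r'\*(.+?)\*', r'<em>\1</em>'),
    (r'(http://[\.a-z0-9A-Z/]+)', r'<a href="\1">\1</a>'),
    (r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', r'<a href="mailto:\1">\1</a>'),
]

def apply_filters(block):
    # Each filter sees the output of the previous one.
    for pattern, replacement in FILTERS:
        block = re.sub(pattern, replacement, block)
    return block

sample = 'In *many* ways: by email (wwspam@wwspam.fu) or at http://wwspam.fu/feedback'
filtered = apply_filters(sample)
print(filtered)
```

Note that the URL pattern only recognizes addresses starting with http://, and neither pattern accepts characters such as hyphens or underscores, so they fit the test document rather than addresses in general.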
3.4 The parser
import sys, re
from handlers import *
from util import *
from rules import *

class Parser:
    def __init__(self, handler):
        self.handler = handler
        self.rules = []
        self.filters = []

    def addRule(self, rule):
        self.rules.append(rule)

    def addFilter(self, pattern, name):
        def filter(block, handler):
            return re.sub(pattern, handler.sub(name), block)
        self.filters.append(filter)

    def parse(self, file):
        self.handler.start('document')
        for block in blocks(file):
            # Apply every filter first, then the first rule whose action returns True
            for filter in self.filters:
                block = filter(block, self.handler)
            for rule in self.rules:
                if rule.condition(block):
                    last = rule.action(block, self.handler)
                    if last: break
        self.handler.end('document')

class BasicTextParser(Parser):
    def __init__(self, handler):
        Parser.__init__(self, handler)
        self.addRule(ListRule())
        self.addRule(ListItemRule())
        self.addRule(TitleRule())
        self.addRule(HeadingRule())
        self.addRule(ParagraphRule())

        self.addFilter(r'\*(.+?)\*', 'emphasis')
        self.addFilter(r'(http://[\.a-z0-9A-Z/]+)', 'url')
        self.addFilter(r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', 'mail')

handler = HTMLRenderer()
parser = BasicTextParser(handler)
parser.parse(sys.stdin)
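The rule loop inside parse is easier to follow when the events are recorded instead of printed. The sketch below is a simplified illustration, not the program above: `RecordingHandler` and `run_rules` are hypothetical names, filters and document events are left out, and only three rules are registered.

```python
class RecordingHandler:
    """Hypothetical handler that records events in a list instead of printing HTML."""
    def __init__(self):
        self.events = []
    def start(self, name):
        self.events.append('start_' + name)
    def end(self, name):
        self.events.append('end_' + name)
    def feed(self, data):
        self.events.append(data)

class Rule:
    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block)
        handler.end(self.type)
        return True          # True: stop trying further rules for this block

class ListItemRule(Rule):
    type = 'listitem'
    def condition(self, block):
        return block[0] == '-'
    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block[1:].strip())
        handler.end(self.type)
        return True

class ListRule(ListItemRule):
    type = 'list'
    inside = False
    def condition(self, block):
        return True
    def action(self, block, handler):
        if not self.inside and ListItemRule.condition(self, block):
            handler.start(self.type)
            self.inside = True
        elif self.inside and not ListItemRule.condition(self, block):
            handler.end(self.type)
            self.inside = False
        return False         # False: later rules still get to handle the block

class ParagraphRule(Rule):
    type = 'paragraph'
    def condition(self, block):
        return True

def run_rules(blocks, rules, handler):
    # Same loop shape as Parser.parse, without filters or document events.
    for block in blocks:
        for rule in rules:
            if rule.condition(block):
                if rule.action(block, handler):
                    break

handler = RecordingHandler()
run_rules(['Intro:', '- one', '- two', 'Outro.'],
          [ListRule(), ListItemRule(), ParagraphRule()], handler)
print(handler.events)
```

ListRule never consumes a block; it only opens or closes the list around the items, which is why it is registered first. In this loop a list is closed only when a non-item block follows it, so a list that runs to the end of the input would not receive its end event.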