Python challenge 3 - urllib & re

加藤蜀黍

Article link: https://blog.csdn.net/bearkino/article/details/40367237

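URL of the third challenge: http://www.pythonchallenge.com/pc/def/ocr.html
Hint 1: recognize the characters. maybe they are in the book, but MAYBE they are in the page source.
Hint 2: a comment in the page source says "find rare characters in the mess below", followed by a large block of characters.
Clearly the task is to pick out the characters that occur least often in that block. Ignore whitespace, and keep characters with the same count in their order of appearance.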

import re
import urllib

# urllib to open the website
response= urllib.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html")
source = response.read()
response.close()

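# The entire HTML source has now been fetched
print source

# Grab the contents of every HTML comment in the page
data = re.findall(r'<!--(.*?)-->', source, re.S)

# Keep only the letters found in the second comment (the character mess)
charList = re.findall(r'([a-zA-Z])', data[1], re.S)
print charList
print ''.join(charList)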

The final result is:

['e', 'q', 'u', 'a', 'l', 'i', 't', 'y']
equality
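
The hint talks about rare characters, so another way to check the answer is to count how often each character occurs instead of filtering for letters. The following is a minimal sketch of that idea, written for Python 3 (unlike the Python 2 listing above). The helper name rare_characters and the short sample_mess string are made up for illustration; the real data would be the second comment taken from the page source.

```python
from collections import Counter

def rare_characters(mess, max_count=1):
    """Return the characters occurring at most max_count times, in first-seen order."""
    # Ignore whitespace, such as the newlines that wrap the mess.
    counts = Counter(ch for ch in mess if not ch.isspace())
    # Counter remembers insertion order (Python 3.7+), so order of appearance is kept.
    return "".join(ch for ch in counts if counts[ch] <= max_count)

# Made-up stand-in for the comment block in the page source.
sample_mess = "@@#e#@\n@q##@u@#a#\n@l@##"

print(rare_characters(sample_mess))
```

Counting first and filtering afterwards also covers the case where the rare characters are not letters, which the letter-only regular expression would miss.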

####################################################################################################################################

Python's urllib library can fetch the data behind a given URL so that it can then be analyzed.

import urllib
google = urllib.urlopen('http://www.google.com')
print 'http header:\n', google.info()
print 'http status:', google.getcode()
print 'url:', google.geturl()

# result

http header:
Date: Tue, 21 Oct 2014 19:30:35 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=521bc5021bb6e976:FF=0:TM=1413919835:LM=1413919835:S=7cbCQWnhLCPJFOiw; expires=Thu, 20-Oct-2016 19:30:35 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=mzfYCxoBC3d9VaQC6-cXKIcbxt4eekorvE6lon1ZHQhLeVxasD2oeRKEG2In90zRAqNPQ1xLfzR_ha1ife0JqdJankdexWaFjZiQN2mLGjavWCfMBYETbFfIst08iNtR; expires=Wed, 22-Apr-2015 19:30:35 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic,p=0.01

http status: 200
url: http://www.google.com

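We can fetch a page with urlopen and then call its read method to get the whole content.

info returns the HTTP headers as an httplib.HTTPMessage object, which represents the header information sent back by the remote server.

getcode returns the HTTP status code. For an HTTP request, 200 means success and 404 means the page was not found.

geturl returns the URL the data actually came from, which may differ from the requested one after a redirect.

Note that getenv and putenv, which read and set environment variables, belong to the os module rather than to urllib.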
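
For readers on Python 3, where urlopen lives in urllib.request, the same read/headers/URL pattern might look like the sketch below. The helper name fetch_text is hypothetical, and a data: URL is used only so the example runs without network access; an http:// address would be used in practice.

```python
from urllib.request import urlopen

def fetch_text(url):
    """Open url and return (decoded body, content type, final URL)."""
    # The response works as a context manager, so it is closed automatically.
    with urlopen(url) as response:
        headers = response.headers      # the header object that info() returns
        charset = headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset)
        return body, headers.get_content_type(), response.url

# A data: URL keeps the sketch offline; this particular URL is an assumption.
text, content_type, final_url = fetch_text("data:text/plain;charset=utf-8,hello%20world")
print(text)
print(content_type)
```

In Python 3, read() returns bytes, which is why the body is decoded with the charset announced in the headers.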

print help(urllib.urlopen)
#result
Help on function urlopen in module urllib:

urlopen(url, data=None, proxies=None)
    Create a file-like object for the specified URL to read from.

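From the help text we can see that urlopen creates a file-like object for reading from the specified URL.

The url parameter is the location of the remote data, usually an http or ftp address.

The data parameter is the data to submit to the URL. If it is omitted the request is a GET; if it is given, the request is sent as a POST.

The proxies parameter configures the proxies to use.

urlopen returns a file-like object with read(), readline(), readlines(), fileno(), close() and other methods that behave like those of a file object.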
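
To illustrate how supplying data switches a request from GET to POST, here is a small Python 3 sketch that builds requests without sending them. The helper build_request, the example.com address, and the field names are placeholders, not part of the original post.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_request(url, fields=None):
    """Build (but do not send) a request; form fields turn it into a POST."""
    # Form data must be URL-encoded and passed as bytes.
    data = urlencode(fields).encode("ascii") if fields else None
    return Request(url, data=data)

# Placeholder URL and form fields, for illustration only.
get_req = build_request("http://example.com/search")
post_req = build_request("http://example.com/search", {"q": "python re", "page": 1})

print(get_req.get_method(), post_req.get_method())
print(post_req.data)
```

Passing either request object to urlopen would actually send it.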

####################################################################################################################################

The re regular expression module in Python

re.match: matching a pattern at the start of a string

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
   print "matchObj.group() : ", matchObj.group()
   print "matchObj.group(1) : ", matchObj.group(1)
   print "matchObj.group(2) : ", matchObj.group(2)
else:
   print "No match!!"

The code above prints:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

As shown, group() returns the entire match, while group(n) returns the n-th submatch. The pattern above has two capture groups.
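
A match object offers a few more ways to reach the captured groups. The sketch below uses Python 3 syntax and adds group names (subject, quality) that are not in the original pattern, purely for illustration.

```python
import re

line = "Cats are smarter than dogs"
match = re.match(r"(?P<subject>.*) are (?P<quality>.*?) .*", line, re.I)

if match:
    print(match.group(0))          # the whole match, same as group()
    print(match.groups())          # all captured groups as a tuple
    print(match.group("quality"))  # a group looked up by its name
    print(match.span(1))           # start and end index of group 1
```

Named groups make the code easier to read when a pattern has more than two or three captures.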

The main call is re.match(pattern, string, flags).

pattern is the regular expression to match.

string is the text to be matched against.

flags is optional; several flags can be combined with |.

re.I or re.IGNORECASE: ignore case when matching.

(Performs case-insensitive matching.)

re.S or re.DOTALL: changes the behavior of '.', so that it also matches \n.

(Makes a period (dot) match any character, including a newline.)

re.M or re.MULTILINE: multi-line mode, which changes the behavior of '^' and '$'.

(Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).)

re.L or re.LOCALE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the current locale.

(Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).)

re.U or re.UNICODE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the character properties defined by Unicode.

(Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.)

re.X or re.VERBOSE: verbose mode. The regular expression can span several lines, whitespace is ignored, and comments can be added.

(Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.)
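
The effect of the most common flags can be seen side by side in a short Python 3 sketch. The sample text and variable names are invented for this illustration.

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive, so both spellings of "line" are found.
lines_found = re.findall(r"line", text, re.I)

# Without re.S the dot stops at the newline; with re.S it crosses it.
dot_default = re.match(r".*", text).group()
dot_all = re.match(r".*", text, re.S).group()

# re.M: ^ also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.M)

# re.X: whitespace and comments inside the pattern are ignored.
number = re.compile(r"""
    (\d+)    # integer part
    \.       # literal dot
    (\d+)    # fractional part
""", re.X)
parts = number.search("pi is about 3.14").groups()

print(lines_found, line_starts, parts)
```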

re.search vs. re.match

import re

line = "Cats are smarter than dogs";

matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
   print "match --> matchObj.group() : ", matchObj.group()
else:
   print "No match!!"

searchObj = re.search( r'dogs', line, re.M|re.I)
if searchObj:
   print "search --> searchObj.group() : ", searchObj.group()
else:
   print "Nothing found!!"
   
# result

No match!!
search --> searchObj.group() :  dogs

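As we can see, match only checks whether the pattern matches at the very beginning of the string; if it does not match there, the result is no match. search scans through the whole string, from start to end, and returns the first place where the pattern matches.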
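
One detail that is easy to miss: passing re.M does not make match try the start of each line. A Python 3 sketch with a made-up two-line string shows the difference, along with re.fullmatch, which requires the pattern to cover the entire string.

```python
import re

text = "Cats are smarter\ndogs are loyal"

# match only ever tries position 0, even in multi-line mode.
at_start = re.match(r"dogs", text, re.M)

# search with ^ and re.M finds "dogs" at the start of the second line.
anchored = re.search(r"^dogs", text, re.M)

# fullmatch succeeds only if the pattern consumes the whole string.
whole = re.fullmatch(r"Cats.*", text, re.S)
partial = re.fullmatch(r"Cats", text)

print(at_start, anchored.start(), whole is not None, partial)
```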

re.sub

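The call looks like this:

re.sub(pattern, repl, string, count=0)

It replaces every part of string that matches pattern with repl. By default all matches are replaced; if count is non-zero, at most count matches are replaced. It returns the modified string.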

import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print "Phone Num : ", num

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print "Phone Num : ", num

# result

Phone Num :  2004-959-559 
Phone Num :  2004959559
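
Beyond plain string replacement, re.sub also accepts a count, backreferences in the replacement, and a function as the replacement. The Python 3 sketch below reuses the phone string from above; the mask helper is a hypothetical name.

```python
import re

phone = "2004-959-559 # This is Phone Number"

# count=1 replaces only the first dash.
first_only = re.sub(r"-", "", phone, count=1)

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3-\2-\1", phone)

def mask(match):
    """Replace each run of digits with the same number of asterisks."""
    return "*" * len(match.group())

masked = re.sub(r"\d+", mask, phone)

# subn also reports how many replacements were made.
cleaned, replaced = re.subn(r"\D", "", phone)

print(first_only)
print(reordered)
print(masked)
print(cleaned, replaced)
```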

re.split(pattern, string, maxsplit=0)

re.split splits a string by a regular expression. maxsplit is the maximum number of splits: maxsplit=1 splits once, and the default of 0 means no limit.

import re

print re.split('\W+', 'Words, words, words.')
print re.split('(\W+)', 'Words, words, words.')
print re.split('\W+', 'Words, words, words.', 1)

# result

['Words', 'words', 'words', '']
['Words', ', ', 'words', ', ', 'words', '.', '']
['Words', 'words, words.']

If the pattern matches at the very start or end of the string, the returned list begins or ends with an empty string.

import re

print re.split('(\W+)', '...words, words...')

# result

['', '...', 'words', ', ', 'words', '...', '']

If the pattern does not match anywhere, the result is a list containing the whole string.

import re

print re.split('a', '...words, words...')

# result

['...words, words...']

####

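str.split('\s') and re.split('\s', str) both split a string and return a list, but they behave differently.

1. str.split('\s') splits on the literal two-character text '\s'.

2. re.split('\s', str) splits on whitespace, because '\s' in a regular expression means a whitespace character.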
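
The difference is easy to see on a string that contains both real whitespace and a literal backslash followed by s. This is a Python 3 sketch with an invented sample string.

```python
import re

# Contains a space, a tab, and the two literal characters \ and s.
text = "one two\tthree\\sfour"

literal = text.split("\\s")     # splits on the two characters \ and s
regex = re.split(r"\s", text)   # splits on each whitespace character
plain = text.split()            # no argument: splits on runs of whitespace

print(literal)
print(regex)
print(plain)
```

When the goal is simply to split on whitespace, calling split() with no argument is usually enough and needs no regular expression.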

re.findall(pattern, string, flags=0)

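Finds all non-overlapping substrings that match the pattern and returns them as a list, in the order they are found scanning from left to right. If nothing matches, an empty list is returned.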

import re

print re.findall('a', 'bcdef')
print re.findall(r'\d+', '12a34b56c789e')

# result

[]
['12', '34', '56', '789']
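
What findall returns depends on how many capture groups the pattern has, which often surprises newcomers. The following Python 3 sketch uses a made-up input string to show the three cases, plus finditer for when match positions are needed.

```python
import re

text = "x=1, y=22, z=333"

whole = re.findall(r"\w=\d+", text)       # no groups: the whole matches
values = re.findall(r"\w=(\d+)", text)    # one group: only that group
pairs = re.findall(r"(\w)=(\d+)", text)   # several groups: a list of tuples

# finditer yields match objects, so positions are available too.
positions = [(m.group(1), m.start()) for m in re.finditer(r"(\w)=\d+", text)]

print(whole)
print(values)
print(pairs)
print(positions)
```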

re.compile(pattern, flags=0)

Compiles a regular expression into a RegexObject, whose match or search method can then be called.

prog = re.compile(pattern)

result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

The first form lets the compiled regular expression be reused.
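
As an illustration of reuse, a pattern can be compiled once at module level and then applied to many strings. The sketch below is Python 3; the DATE pattern, the extract_dates helper, and the sample lines are all invented for this example.

```python
import re

# Compiled once, reused for every line.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def extract_dates(lines):
    """Return (year, month, day) tuples for every line that starts with a date."""
    found = []
    for line in lines:
        m = DATE.match(line)
        if m:
            found.append(tuple(int(part) for part in m.groups()))
    return found

log_lines = ["2014-10-21 fetched page", "no date here", "2020-12-07 updated"]
dates = extract_dates(log_lines)
print(dates)
```

Giving the compiled pattern a name also documents what it is meant to match, which helps when the same expression is used in several places.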
