Python Study Notes (21): Common Built-in Modules: contextlib, urllib, HTMLParser

Latest recommended article published on 2020-05-09 06:07:15

焦下鹿

Article link: https://blog.csdn.net/qq_46105155/article/details/105942827

Built-in modules

contextlib

When reading or writing a file, you must close it properly once you are done with it. One way is to use try...finally; a more convenient way is with open(filename, 'r') as f:
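A minimal sketch contrasting the two approaches. The file name demo.txt and the temporary directory are assumptions, used only so the example can run anywhere.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('hello')

# Approach 1: try...finally closes the file even if read() raises.
f = open(path, 'r')
try:
    text_a = f.read()
finally:
    f.close()

# Approach 2: the with statement closes the file automatically.
with open(path, 'r') as g:
    text_b = g.read()

print(text_a == text_b, f.closed, g.closed)
```

Both file objects end up closed; the with form simply removes the need to write the finally clause by hand.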

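In fact, any object that correctly implements the context-management protocol can be used in a with statement. The protocol consists of two methods, __enter__ and __exit__.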

class Query:

    def __init__(self, name):
        self.name = name
    
    def __enter__(self):
        print('Begin')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type:
            print('Error')
        else:
            print('End')
        
    def query(self):
        print('Query info about %s...' % self.name)

with Query('Bob') as q:
    q.query()

Begin
Query info about Bob...
End
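The run above shows only the clean exit. The following sketch, reusing the same Query class, shows what __exit__ receives when the with block raises; the ValueError is an assumption made for illustration.

```python
class Query:
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        print('Begin')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # exc_type is None on a clean exit, otherwise the exception class.
        print('Error' if exc_type else 'End')
        # Returning None (falsy) lets the exception propagate to the caller.

caught = None
try:
    with Query('Bob') as q:
        raise ValueError('query failed')
except ValueError as e:
    caught = e

print('caught:', caught)
```

This prints Begin, then Error, and the exception still reaches the except clause because __exit__ does not return a true value.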

@contextmanager

Writing __enter__ and __exit__ is still tedious; contextlib offers a simpler way:

from contextlib import contextmanager

class Query:
    def __init__(self, name):
        self.name = name
    
    def query(self):
        print('Query info about %s...' % self.name)

@contextmanager
def create_query(name):
    print('Begin')
    q = Query(name)
    yield q
    print('End')

with create_query('Bob') as q:
    q.query()

Begin
Query info about Bob...
End

The @contextmanager decorator takes a generator function. The value produced by the yield statement is bound to var in with ... as var, and the with statement then works as usual.
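One caveat: in create_query above, the code after yield runs only if the with block finishes without an exception. A sketch of a variant that wraps the yield in try...finally so the cleanup always runs (the log list and the RuntimeError are assumptions for illustration):

```python
from contextlib import contextmanager

log = []  # records what ran, so the order can be inspected afterwards

@contextmanager
def create_query(name):
    log.append('Begin')
    try:
        yield name
    finally:
        # Runs whether or not the with block raised.
        log.append('End')

try:
    with create_query('Bob') as q:
        log.append('Query info about %s...' % q)
        raise RuntimeError('boom')
except RuntimeError:
    log.append('caught')

print(log)
```

Here 'End' is recorded before 'caught', which shows that the cleanup ran before the exception left the with statement.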

Often we want specific code to run automatically before and after a block of code; @contextmanager can do this too.

@contextmanager
def tag(name):
    print("<%s>" % name)
    yield
    print("</%s>" % name)

with tag("h1"):
    print("hello")
    print("world")

<h1>
hello
world
</h1>

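The code executes in this order:

The with statement first runs the code before yield, printing <h1>.
At yield, control passes to the body of the with block, which prints hello and world.
Finally the code after yield runs, printing </h1>.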
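The same ordering holds when context managers are nested. A small sketch that records the steps in a list instead of printing them (the events list and the inner b tag are assumptions for illustration):

```python
from contextlib import contextmanager

events = []

@contextmanager
def tag(name):
    events.append('<%s>' % name)   # step 1: code before yield
    yield
    events.append('</%s>' % name)  # step 3: code after yield

with tag('h1'):
    events.append('hello')         # step 2: body of the with block
    with tag('b'):
        events.append('world')

print(events)
```

The inner manager finishes its "after yield" step before the outer one does, so the closing tags come out in reverse order of the opening tags.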

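closing

from contextlib import contextmanager
from urllib.request import urlopen

# contextlib.closing is basically equivalent to the generator below, which is
# written out only to show how it works. In real code, use
# `from contextlib import closing` instead of redefining it.
@contextmanager
def closing(thing):
    try:
        yield thing
    finally:
        thing.close()

with closing(urlopen('https://www.python.org')) as page:
    for line in page:
        pass  # print(line) would print every line of the page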
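closing is most useful for objects that have a close() method but do not support the with statement themselves. A minimal sketch using the standard contextlib.closing; the Connection class is hypothetical and exists only for this example.

```python
from contextlib import closing

class Connection:
    """Hypothetical resource that has close() but no __enter__/__exit__."""
    def __init__(self):
        self.closed = False

    def send(self, msg):
        return 'sent: %s' % msg

    def close(self):
        self.closed = True

conn = Connection()
with closing(conn) as c:
    reply = c.send('ping')

print(reply, conn.closed)
```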

urllib

GET

The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.

from urllib import request
with request.urlopen('https://www.python.org') as f:
    data = f.read()
    print('Status:',f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    # print('Data', data.decode('utf-8'))

Status: 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Via: 1.1 vegur
Via: 1.1 varnish
Content-Length: 49123
Accept-Ranges: bytes
Date: Tue, 05 May 2020 18:46:26 GMT
Via: 1.1 varnish
Age: 769
Connection: close
X-Served-By: cache-bwi5122-BWI, cache-mdw17348-MDW
X-Cache: HIT, HIT
X-Cache-Hits: 4, 2
X-Timer: S1588704387.816773,VS0,VE0
Vary: Cookie
Strict-Transport-Security: max-age=63072000; includeSubDomains
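Real requests can fail, and the body arrives as bytes. The sketch below wraps urlopen in a helper that handles both; fetch_text is a hypothetical name, and falling back to UTF-8 when the server names no charset is an assumption.

```python
from urllib import request, error

def fetch_text(url, timeout=10):
    """Return the body of url as str, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            # Use the charset from Content-Type if present, else assume UTF-8.
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset)
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        print('HTTP error:', e.code, e.reason)
    except error.URLError as e:
        # No response at all (DNS failure, refused connection, ...).
        print('Connection problem:', e.reason)
    return None

# A data: URL is handled locally by urllib, so this call needs no network.
result = fetch_text('data:text/plain;charset=utf-8,hello')
print(result)
```

HTTPError is a subclass of URLError, so it has to be caught first.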

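To simulate a browser sending a GET request, use a Request object. By adding HTTP headers to the Request, we can disguise the request as one coming from a browser. For example, to request the Douban home page as an iPhone 6: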

from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Status: 200 OK
Date: Tue, 05 May 2020 18:49:12 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
X-Xss-Protection: 1; mode=block
X-Douban-Mobileapp: 0
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: bid=0uXd--_Fj8w; Expires=Wed, 05-May-21 18:49:12 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: 0uXd--_Fj8w
X-DAE-App: talion
X-DAE-Instance: default
Server: dae
Strict-Transport-Security: max-age=15552000
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Data: 

<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/WebPage" class="ua-safari ua-mobile ">
  <head>
      <meta charset="UTF-8">
      <title>豆瓣(手机版)</title>
      <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
      <meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
     ...
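A Request object can be inspected before anything is sent, which is a convenient way to check the headers offline. In this sketch the User-Agent string is shortened for brevity, which is an assumption rather than the value used above.

```python
from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)')

# Request stores header names via str.capitalize(), hence 'User-agent'.
method = req.get_method()              # 'GET', because no data is attached
has_ua = req.has_header('User-agent')
ua = req.get_header('User-agent')
print(method, has_ua, ua)
```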

POST

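To send a request with POST, just pass the data parameter as bytes.

The following simulates a Weibo login: read the login email and password, then pass them encoded as username=xxx&password=xxx, the format expected by the weibo.cn login page.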

from urllib import request, parse
print('Login to weibo.cn')

email = input('Email: ')
passwd = input('Password: ')
# Encode a dict or sequence of two-element tuples into a URL query string.
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', 1),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Login to weibo.cn
Status: 200 OK
Server: nginx/1.6.1
Date: Tue, 05 May 2020 18:57:36 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: https://passport.weibo.cn
Access-Control-Allow-Credentials: true
DPOOL_HEADER: localhost.localdomain
Data: {"retcode":50011007,"msg":"\u8bf7\u8f93\u5165\u7528\u6237\u540d","data":{"errline":320}}

The login failed (the msg field decodes to "please enter a username").
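The encoding step can be checked offline, without contacting the server. A minimal sketch with placeholder credentials (bob@example.com and s3cret are assumptions, not real values):

```python
from urllib import request, parse

# Placeholder credentials, not real ones.
form = parse.urlencode([
    ('username', 'bob@example.com'),
    ('password', 's3cret'),
    ('savestate', 1),
])
print(form)

body = form.encode('utf-8')  # the data argument must be bytes, not str
req = request.Request('https://passport.weibo.cn/sso/login', data=body)
method = req.get_method()    # 'POST', because data is attached
print(method)
```

urlencode percent-encodes reserved characters such as @, and attaching data to the Request is what switches the method from GET to POST.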

Handler

For more complex control, such as accessing a website through a proxy, we need to use ProxyHandler.
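A minimal sketch of how such an opener might be assembled. The proxy address, realm, and credentials are placeholders, and fetch_via_proxy is a hypothetical helper; nothing is sent over the network until it is called.

```python
from urllib import request

# Placeholder proxy address and credentials; replace with real values.
proxy_handler = request.ProxyHandler({'http': 'http://proxy.example.com:3128/'})
proxy_auth_handler = request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'proxy.example.com', 'username', 'password')

# build_opener chains the handlers; no request is made at this point.
opener = request.build_opener(proxy_handler, proxy_auth_handler)

def fetch_via_proxy(url):
    """Open url through the proxy configured above (performs network I/O)."""
    with opener.open(url) as f:
        return f.read()

print(proxy_handler in opener.handlers, proxy_auth_handler in opener.handlers)
```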

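Summary

urllib lets a program issue all kinds of HTTP requests. To imitate a browser for a particular task, the request has to be disguised as a browser request: first observe the requests the real browser sends, then copy their headers. The User-Agent header is the one that identifies the browser.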

XML

from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('sax:end_element: %s' % name)

    def char_data(self, text):
        print('sax:char_data: %s' % text)

xml = r'''<?xml version="1.0"?>
<ol>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
</ol>
'''

handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(xml)

sax:start_element: ol, attrs: {}
sax:char_data: 

sax:char_data:     
sax:start_element: li, attrs: {}
sax:start_element: a, attrs: {'href': '/python'}
sax:char_data: Python
sax:end_element: a
sax:end_element: li
sax:char_data: 

sax:char_data:     
sax:start_element: li, attrs: {}
sax:start_element: a, attrs: {'href': '/ruby'}
sax:char_data: Ruby
sax:end_element: a
sax:end_element: li
sax:char_data: 

sax:end_element: ol
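Printing every event is useful for seeing how the parser works; in practice the handlers usually keep some state and collect data. A sketch that gathers (href, text) pairs from the same XML, where LinkCollector is a hypothetical class written for this example:

```python
from xml.parsers.expat import ParserCreate

class LinkCollector:
    """Hypothetical handler that gathers (href, text) pairs from <a> elements."""
    def __init__(self):
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def start_element(self, name, attrs):
        if name == 'a':
            self._href = attrs.get('href')
            self._text = []

    def char_data(self, text):
        # expat may deliver one text node in several chunks, so accumulate.
        if self._href is not None:
            self._text.append(text)

    def end_element(self, name):
        if name == 'a':
            self.links.append((self._href, ''.join(self._text)))
            self._href = None

xml = '''<?xml version="1.0"?>
<ol>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
</ol>
'''

collector = LinkCollector()
parser = ParserCreate()
parser.StartElementHandler = collector.start_element
parser.EndElementHandler = collector.end_element
parser.CharacterDataHandler = collector.char_data
parser.Parse(xml, True)
print(collector.links)
```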

HTMLParser

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

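    # With convert_charrefs=True (the default since Python 3.5), references
    # such as &nbsp; are decoded and passed to handle_data, so the two
    # handlers below are never called in this example.
    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)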

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href=\"#\">html</a> HTML&nbsp;tutorial...<br>END</p>
</body></html>''')

<html>


<head>
</head>


<body>


<!--  test html parser  -->

    
<p>
Some 
<a>
html
</a>
 HTML tutorial...
<br>
END
</p>


</body>
</html>
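The parser above ignores the attrs argument. A short sketch of how attributes could be read, collecting the href of every a tag; LinkParser and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [('href', '#')].
        if tag == 'a':
            self.hrefs.extend(value for name, value in attrs if name == 'href')

parser = LinkParser()
parser.feed('<p>Some <a href="#">html</a> and <a href="/docs">docs</a></p>')
print(parser.hrefs)
```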

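Exercise

Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it, then try parsing the HTML to print the dates, names, and locations of the conferences listed on the Python website.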

from html.parser import HTMLParser
from urllib.request import urlopen

class EventHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag = None  # label for the element currently open, if any

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if ('class', 'event-title') in attrs:
            self.tag = 'Event-title'
        if tag == 'time':
            self.tag = 'Time'
        if ('class', 'say-no-more') in attrs:
            self.tag = 'Year'
        elif ('class', 'event-location') in attrs:
            self.tag = 'Event-location'

    def handle_data(self, data):
        if self.tag:
            print(self.tag, data)

    def handle_endtag(self, tag):
        # stop labelling text as soon as any element closes
        self.tag = None


with urlopen('https://www.python.org/events/python-events') as f:
    # decode the bytes; str(f.read()) would give the b'...' repr with \x escapes
    html_data = f.read().decode('utf-8')

parser = EventHTMLParser()
parser.feed(html_data)

Event-title PyConWeb 2020 (canceled)
Time 09 May – 10 May 
Year  2020
Event-location Munich, Germany
Event-title Django Girls Groningen
Time 16 May
Year  2020
Event-location Groningen, Netherlands
Event-title PyLondinium 2020 (postponed)
Time 05 June – 07 June 
Year  2020
Event-location London, UK
Event-title PyCon CZ 2020 (canceled)
Time 05 June – 07 June 
Year  2020
Event-location Ostrava, Czech Republic
Event-title PyCon Odessa 2020
Time 13 June – 14 June 
Year  2020
Event-location Odessa, Ukraine
Event-title Python Web Conference 2020 (Online-Worldwide)
Time 17 June – 19 June 
Year  2020
Event-location https://2020.pythonwebconf.com
Event-title Python Meeting Düsseldorf
Time 01 July
Year  2020
Event-location Düsseldorf, Germany
Event-title SciPy 2020
Time 06 July – 12 July 
Year  2020
Event-location Online
Event-title Python Nordeste 2020
Time 17 July – 19 July 
Year  2020
Event-location Fortaleza, Ceará, Brasil
Event-title EuroPython 2020 (in-person: canceled, considering going virtual)
Time 20 July – 26 July 
Year  2020
Event-location https://blog.europython.eu/post/612826526375919616/europython-2020-going-virtual-europython-2021
Event-title EuroPython 2020 Online
Time 23 July – 26 July 
Year  2020
Event-location Online Event
Event-title EuroSciPy 2020 (canceled)
Time 27 July – 31 July 
Year  2020
Event-location Bilbao, Spain
Event-title PyCon JP 2020
Time 28 Aug. – 29 Aug. 
Year  2020
Event-location Tokyo, Japan
Event-title PyCon TW 2020
Time 05 Sept. – 06 Sept. 
Year  2020
Event-location International Conference Hall ,No.1, University Road, Tainan City 701, Taiwan
Event-title PyCon SK 2020
Time 11 Sept. – 13 Sept. 
Year  2020
Event-location Bratislava, Slovakia
Event-title DjangoCon Europe 2020
Time 16 Sept. – 20 Sept. 
Year  2020
Event-location Porto, Portugal
Event-title DragonPy 2020
Time 19 Sept. – 20 Sept. 
Year  2020
Event-location Ljubljana, Slovenia
Event-title PyCon APAC 2020
Time 19 Sept. – 20 Sept. 
Year  2020
Event-location Kota Kinabalu, Sabah, Malaysia
Event-title Django Day Copenhagen
Time 25 Sept.
Year  2020
Event-location Copenhagen, Denmark
Event-title PyCon Turkey
Time 26 Sept. – 27 Sept. 
Year  2020
Event-location Albert Long Hall, at Bogazici University Istanbul
Event-title Python Meeting Düsseldorf
Time 30 Sept.
Year  2020
Event-location Düsseldorf, Germany
Event-title PyCon India 2020
Time 02 Oct. – 05 Oct. 
Year  2020
Event-location Bangalore, India
Event-title PyConDE & PyData Berlin 2020
Time 14 Oct. – 16 Oct. 
Year  2020
Event-location Berlin, Germany
Event-title Swiss Python Summit
Time 23 Oct.
Year  2020
Event-location Rapperswil, Switzerland
Event-title PyCC Meetup'19 (Python Cape Coast User Group)
Time 26 Oct.
Year  2020
Event-location Cape coast, Ghana
Event-title Python Brasil 2020
Time 28 Oct. – 02 Nov. 
Year  2020
Event-location Caxias do Sul, RS, Brazil
Event-title PyData London 2020
Time 30 Oct. – 01 Nov. 
Year  2020
Event-location London, UK
Event-title PyCon Italia 2020
Time 05 Nov. – 08 Nov. 
Year  2020
Event-location Florence, Italy
Event-title enterPy
Time 23 Nov. – 24 Nov. 
Year  2020
Event-location Mannheim, Germany
Event-title PyCon US 2021
Time 12 May – 20 May 
Year  2021
Event-location Pittsburgh, PA, USA
Event-title SciPy 2021
Time 12 July – 18 July 
Year  2021
Event-location Austin, TX, US
Event-title EuroPython 2021
Time 26 July – 01 Aug. 
Year  2021
Event-location Dublin, Ireland
Year General
Year Initiatives
