Python 3.5.1
I installed the latest release, Python 3.6.1
Python
urllib
from urllib.request import urlopen
BeautifulSoup4
from bs4 import BeautifulSoup
Installing BeautifulSoup4
Linux (python3-bs4 is the Python 3 package; python-bs4 installs the Python 2 build):
sudo apt-get install python3-bs4
Mac:
sudo easy_install pip
pip install beautifulsoup4
Windows:
pip install beautifulsoup4
pip3 install beautifulsoup4
3
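3.1 Basic urllib usage
urllib is the package in Python 3.x that groups several modules for working with URLs. With it you can easily simulate a user visiting a web page in a browser.
Simulating a real browser:
Send a User-Agent header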
from urllib import request
req = request.Request(url)
req.add_header(key, value)  # e.g. key = "User-Agent"
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))
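The steps above can be put together as a minimal sketch. The User-Agent string, the function names build_request and fetch, and the URL http://example.com/ are assumptions made for illustration; they are not part of the original notes.

```python
from urllib import request

# Hypothetical User-Agent value; substitute the string your own browser sends.
USER_AGENT = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/58.0 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a browser-like User-Agent header."""
    req = request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def fetch(url, timeout=10):
    """Send the request and return the body decoded as UTF-8 (needs network access)."""
    req = build_request(url)
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("http://example.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Calling fetch("http://example.com/") would then perform the actual download. The assumption here is that the page is UTF-8 encoded; other pages may need a different codec.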
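Using POST:
Import parse from the urllib package
from urllib import parse
Use urlencode to build the POST data
postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])
Send the POST request with postData (the data must be bytes, hence the encode call)
request.urlopen(req, data=postData.encode('utf-8'))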
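As a small sketch of the POST steps, the snippet below builds the encoded body and attaches it to a Request without sending it. The helper name build_post_request, the field names, and the URL http://example.com/form are hypothetical placeholders.

```python
from urllib import parse, request


def build_post_request(url, fields):
    """URL-encode the (key, value) pairs and attach them as the request body."""
    post_data = parse.urlencode(fields).encode("utf-8")  # urlopen expects bytes
    return request.Request(url, data=post_data)


# Hypothetical form fields, used only to show how values get escaped.
fields = [("name", "Elsie"), ("q", "a b&c")]
req = build_post_request("http://example.com/form", fields)

print(req.data)          # the encoded body
print(req.get_method())  # a Request that has data defaults to POST
```

Passing req to request.urlopen would send it. Supplying the body through Request(url, data=...) is equivalent to passing data= to urlopen, as the notes do.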
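Get the HTTP status code of the response
resp.status
Get the reason phrase sent with the status (for example OK); this is not the server type, which is reported, if at all, by resp.getheader("Server")
resp.reason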
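To try these attributes without depending on an external site, the sketch below starts a throwaway HTTP server on localhost and reads status, reason, and the Server header from the response. HelloHandler and inspect_response are hypothetical names introduced only for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the example quiet


def inspect_response(url):
    """Return (status code, reason phrase, Server header) for the given URL."""
    with request.urlopen(url) as resp:
        return resp.status, resp.reason, resp.getheader("Server")


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status, reason, server_header = inspect_response(
        "http://127.0.0.1:%d/" % server.server_port)
finally:
    server.shutdown()
    server.server_close()

print(status, reason)
print(server_header)
```

The status is the numeric code and the reason is the short text that accompanies it. The server software appears only in the Server header, and a real site may omit or alter that header.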
www.thsrc.com.tw/tw/TimeTable/SearchResult
3.3 BeautifulSoup
https://www.crummy.com/software/BeautifulSoup/#Download
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id4
beautiful.py
from bs4 import BeautifulSoup as bs
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = bs(html_doc,'html.parser')
#print(soup.prettify())
#print(soup.title)
# print(soup.find(id="link3").string)
# print(soup.find(id="link3").get_text())
# for link in soup.findAll("a"):
#     print(link.string)
# print(soup.find("p",{"class":"story"}).string)  # does not work: prints None because this <p> has several children
# print(soup.find("p",{"class":"story"}).get_text())
data = soup.findAll("a", href=re.compile(r"^http://example\.com"))
print(data)
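The difference between .string and get_text(), and the regular-expression filter on href, can be seen in a shorter sketch. It assumes beautifulsoup4 is installed; the HTML fragment and the second link (https://other.org/x) are made up for the demonstration.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment: one <p> that mixes plain text with two <a> tags.
html = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and '
    '<a href="https://other.org/x" id="link2">Other</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p", class_="story")
# .string is None when a tag has more than one child node.
print(paragraph.string)
# get_text() joins the text of every descendant instead.
print(paragraph.get_text())

# A tag with a single text child does have a usable .string.
link_text = soup.find(id="link1").string
print(link_text)

# Keep only the links whose href starts with http://example.com.
matching = soup.find_all("a", href=re.compile(r"^http://example\.com"))
hrefs = [a["href"] for a in matching]
print(hrefs)
```

find_all is the current spelling of findAll. Both names call the same method, so the script above behaves the same with either.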