Installing Firebug Lite
Three Web Scraping Approaches
1. Regular expressions
Official Python regex HOWTO: https://docs.python.org/3/howto/regex.html
>>> import re
>>> import urllib.request
>>> url = 'http://example.webscraping.com/places/default/view/Afghanistan-1'
>>> p = re.compile('<td class="w2p_fw">(.*?)</td>')
>>> html = urllib.request.urlopen(url).read()
>>> p.findall(html)
Traceback (most recent call last):
File "<pyshell#79>", line 1, in <module>
p.findall(html)
TypeError: cannot use a string pattern on a bytes-like object
>>> p.findall(html.decode('utf-8'))
['<img src="/places/static/images/flags/af.png" />', '647,500 square kilometres', '29,121,286', 'AF', 'Afghanistan', 'Kabul', '<a href="/places/default/continent/AS">AS</a>', '.af', 'AFN', 'Afghani', '93', '', '', 'fa-AF,ps,uz-AF,tk', '<div><a href="/places/default/iso/TM">TM </a><a href="/places/default/iso/CN">CN </a><a href="/places/default/iso/IR">IR </a><a href="/places/default/iso/TJ">TJ </a><a href="/places/default/iso/PK">PK </a><a href="/places/default/iso/UZ">UZ </a></div>']
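Because findall() returns the captured groups in document order, an individual field has to be picked out by its position on the page. A minimal sketch on a static snippet (sample_html is a stand-in for the downloaded page, keeping only two rows):

```python
import re

# sample_html is a stand-in for the downloaded page (only two rows kept)
sample_html = ('<tr id="places_area__row">'
               '<td class="w2p_fw">647,500 square kilometres</td></tr>'
               '<tr id="places_population__row">'
               '<td class="w2p_fw">29,121,286</td></tr>')

# findall() returns every captured group in document order,
# so each field is addressed by its position on the page
p = re.compile('<td class="w2p_fw">(.*?)</td>')
fields = p.findall(sample_html)
print(fields[0])  # 647,500 square kilometres
```

This is what makes regex scraping brittle: if the site inserts a new row, every index shifts.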
2. BeautifulSoup
Installation:
pip install beautifulsoup4
Install lxml:
pip install lxml
Beautiful Soup correctly parses missing attribute quotes and closes unclosed tags:
>>> import lxml
>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> soup = BeautifulSoup(broken_html, 'lxml')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<html>
<body>
<ul class="country">
<li>
Area
</li>
<li>
Population
</li>
</ul>
</body>
</html>
Finding data:
>>> ul = soup.find('ul',attrs={'class':'country'})
>>> ul.find('li')
<li>Area</li>
>>> ul.find_all('li')
[<li>Area</li>, <li>Population</li>]
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Extracting a country's area:
>>> from urllib import request
>>> html = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> soup = BeautifulSoup(html)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP)
to this:
BeautifulSoup(YOUR_MARKUP, "lxml")
>>> soup = BeautifulSoup(html,'lxml')
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> td = tr.find(attrs={'class':'w2p_fw'})
>>> area = td.text
>>> print(area)
647,500 square kilometres
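The chained find() lookup above can be wrapped in a small helper. This is only a sketch: scrape_field is a hypothetical name, and sample_html is a cut-down stand-in for the downloaded country page:

```python
from bs4 import BeautifulSoup

# sample_html is a cut-down stand-in for the downloaded country page
sample_html = ('<table><tr id="places_area__row">'
               '<td class="w2p_fw">647,500 square kilometres</td></tr></table>')

def scrape_field(html, row_id):
    """Return the text of the w2p_fw cell in the row with the given id."""
    soup = BeautifulSoup(html, 'lxml')
    tr = soup.find(attrs={'id': row_id})
    td = tr.find(attrs={'class': 'w2p_fw'})
    return td.text

print(scrape_field(sample_html, 'places_area__row'))  # 647,500 square kilometres
```

Unlike the regex version, this lookup keys on the row's id attribute, so it survives cosmetic changes to the surrounding markup.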
3. lxml
lxml is a Python wrapper around the libxml2 XML parsing library. It is written in C and parses faster than Beautiful Soup.
Documentation: http://lxml.de/installation.html#source-builds-on-ms-windows
>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print(fixed_html)
b'<ul class="country">\n<li>Area</li>\n<li>Population</li>\n</ul>\n'
lxml likewise parses the missing attribute quotes correctly and closes the tags, although it does not add the extra <html> and <body> tags.
XPath selectors work similarly to BeautifulSoup's find() method.
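As a rough XPath equivalent of the BeautifulSoup lookup above, the same cell can be selected in one expression; sample_html below is a cut-down stand-in for the downloaded country page:

```python
import lxml.html

# sample_html is a cut-down stand-in for the downloaded country page
sample_html = ('<table><tr id="places_area__row">'
               '<td class="w2p_fw">647,500 square kilometres</td></tr></table>')

tree = lxml.html.fromstring(sample_html)
# One XPath expression replaces the two chained find() calls
area = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')[0]
print(area)  # 647,500 square kilometres
```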
CSS selectors:
Installation:
pip install cssselect
>>> from lxml.cssselect import CSSSelector
>>> from urllib import request
>>> html1 = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> tree = lxml.html.fromstring(html1)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print(area)
647,500 square kilometres
This code first finds the table row element with the ID places_area__row, then selects its table data child tag whose class is w2p_fw.
A CSS selector is a pattern used to select elements. Examples of common selectors:
Select all tags                                               | *
Select <a> tags                                               | a
Select all elements with class="link"                         | .link
Select <a> tags with class="link"                             | a.link
Select the <a> tag with id="home"                             | a#home
Select all <span> tags that are direct children of an <a> tag | a > span
Select all <span> tags anywhere inside an <a> tag             | a span
Select all <a> tags whose title attribute is "Home"           | a[title=Home]
The extracted data can then be stored in a CSV file.
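A minimal sketch of that last step with the standard library's csv module (field names and values copied from the session above; the file name country.csv is arbitrary):

```python
import csv

# Field names and values copied from the session above;
# the output file name country.csv is arbitrary
rows = [('area', '647,500 square kilometres'),
        ('population', '29,121,286'),
        ('capital', 'Kabul')]

with open('country.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(('field', 'value'))  # header row
    writer.writerows(rows)
```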