Web Scraping --- 2. Data Analysis

Install FireBug Lite

Three Ways to Scrape a Web Page

1. Regular Expressions

Official regex HOWTO: https://docs.python.org/3/howto/regex.html

>>> import re
>>> import urllib.request
>>> url = 'http://example.webscraping.com/places/default/view/Afghanistan-1'
>>> p = re.compile('<td class="w2p_fw">(.*?)</td>')
>>> html = urllib.request.urlopen(url).read()
>>> p.findall(html)
Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    p.findall(html)
TypeError: cannot use a string pattern on a bytes-like object
>>> p.findall(html.decode('utf-8'))
['<img src="/places/static/images/flags/af.png" />', '647,500 square kilometres', '29,121,286', 'AF', 'Afghanistan', 'Kabul', '<a href="/places/default/continent/AS">AS</a>', '.af', 'AFN', 'Afghani', '93', '', '', 'fa-AF,ps,uz-AF,tk', '<div><a href="/places/default/iso/TM">TM </a><a href="/places/default/iso/CN">CN </a><a href="/places/default/iso/IR">IR </a><a href="/places/default/iso/TJ">TJ </a><a href="/places/default/iso/PK">PK </a><a href="/places/default/iso/UZ">UZ </a></div>']
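The values above come back unlabeled. A minimal sketch of pairing each value with its row label by capturing two groups at once; it runs on an inline snippet (mimicking the page's assumed `w2p_fl`/`w2p_fw` cell layout) so it works offline:

```python
import re

# Inline snippet shaped like the country page's table rows (assumed layout)
html = ('<tr id="places_area__row"><td class="w2p_fl">Area</td>'
        '<td class="w2p_fw">647,500 square kilometres</td></tr>')

# Capture the label cell and the value cell together
pairs = re.findall(r'<td class="w2p_fl">(.*?)</td>'
                   r'<td class="w2p_fw">(.*?)</td>', html)
print(pairs)  # [('Area', '647,500 square kilometres')]
```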


2. BeautifulSoup

Install:

pip install beautifulsoup4


Install lxml:

pip install lxml


BeautifulSoup correctly parses the missing attribute quotes and closes the tags:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> soup = BeautifulSoup(broken_html, 'lxml')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<html>
 <body>
  <ul class="country">
   <li>
    Area
   </li>
   <li>
    Population
   </li>
  </ul>
 </body>
</html>

Finding data:

>>> ul = soup.find('ul',attrs={'class':'country'})
>>> ul.find('li')
<li>Area</li>
>>> ul.find_all('li')
[<li>Area</li>, <li>Population</li>]


Chinese docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
English docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

Getting the country's area:

>>> from urllib import request
>>> html = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> soup = BeautifulSoup(html)

Warning (from warnings module):
  File "C:\Users\zhuangyy\AppData\Local\Programs\Python\Python35\lib\site-packages\bs4\__init__.py", line 181
    markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

>>> soup = BeautifulSoup(html,'lxml')
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> td = tr.find(attrs={'class':'w2p_fw'})
>>> area = td.text
>>> print(area)
647,500 square kilometres
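The same find() chain can be wrapped in a small helper so any field on the page is fetched by name. This is a sketch that assumes the `places_<field>__row` id pattern seen above, and uses the stdlib `html.parser` here so the snippet itself has no lxml dependency:

```python
from bs4 import BeautifulSoup

def scrape_field(html, field):
    """Return the value cell of the row whose id is 'places_<field>__row'."""
    soup = BeautifulSoup(html, 'html.parser')
    tr = soup.find(attrs={'id': 'places_%s__row' % field})
    td = tr.find(attrs={'class': 'w2p_fw'})
    return td.text

# Inline snippet shaped like the country page's area row (assumed layout)
sample = ('<table><tr id="places_area__row">'
          '<td class="w2p_fl">Area</td>'
          '<td class="w2p_fw">647,500 square kilometres</td>'
          '</tr></table>')
print(scrape_field(sample, 'area'))  # 647,500 square kilometres
```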


3. lxml

lxml is a Python wrapper around the libxml2 XML parsing library. Written in C, it parses markup faster than Beautiful Soup.

Documentation: http://lxml.de/installation.html#source-builds-on-ms-windows

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print(fixed_html)
b'<ul class="country">\n<li>Area</li>\n<li>Population</li>\n</ul>\n'
lxml also correctly parses the missing attribute quotes and closes the tags, though the module does not add the extra <html> and <body> tags.

XPath selectors work much like BeautifulSoup's find().
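As a sketch of that similarity, the area lookup can be written as a single XPath expression; the row markup is inlined here (assumed from the example page) so the snippet runs offline:

```python
import lxml.html

# Inline snippet shaped like the country page's area row (assumed layout)
tree = lxml.html.fromstring(
    '<table><tr id="places_area__row">'
    '<td class="w2p_fw">647,500 square kilometres</td></tr></table>')

# One XPath expression replaces the chained find() calls
td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
print(td.text_content())  # 647,500 square kilometres
```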

CSS selectors:

Install:
pip install cssselect

>>> from lxml.cssselect import CSSSelector
>>> import lxml.html
>>> from urllib import request
>>> html1 = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> tree = lxml.html.fromstring(html1)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print(area)
647,500 square kilometres

This code first finds the table-row element with id places_area__row, then selects its table-data child with class w2p_fw.

CSS selectors are patterns used to select elements. Common examples:

Select all tags: *
Select the <a> tag: a
Select all elements with class="link": .link
Select the <a> tag with class="link": a.link
Select the <a> tag with id="home": a#home
Select all <span> tags that are children of an <a> tag: a > span
Select all <span> tags inside an <a> tag: a span
Select all <a> tags whose title attribute is "Home": a[title=Home]

The scraped data can be saved to a CSV file.
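A minimal sketch with the standard csv module, assuming the scraped fields have been collected into a dict (the field names and filename here are illustrative):

```python
import csv

# Hypothetical scraped record (field names are illustrative)
row = {'country': 'Afghanistan',
       'area': '647,500 square kilometres',
       'population': '29,121,286'}

with open('countries.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['country', 'area', 'population'])
    writer.writeheader()   # column names first
    writer.writerow(row)   # then one row per country
```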
