Installing Firebug Lite
Three Web Scraping Approaches
1. Regular expressions
Official Python regex HOWTO: https://docs.python.org/3/howto/regex.html
>>> import re
>>> import urllib.request
>>> url = 'http://example.webscraping.com/places/default/view/Afghanistan-1'
>>> p = re.compile('<td class="w2p_fw">(.*?)</td>')
>>> html = urllib.request.urlopen(url).read()
>>> p.findall(html)
Traceback (most recent call last):
File "<pyshell#79>", line 1, in <module>
p.findall(html)
TypeError: cannot use a string pattern on a bytes-like object
>>> p.findall(html.decode('utf-8'))
['<img src="/places/static/images/flags/af.png" />', '647,500 square kilometres', '29,121,286', 'AF', 'Afghanistan', 'Kabul', '<a href="/places/default/continent/AS">AS</a>', '.af', 'AFN', 'Afghani', '93', '', '', 'fa-AF,ps,uz-AF,tk', '<div><a href="/places/default/iso/TM">TM </a><a href="/places/default/iso/CN">CN </a><a href="/places/default/iso/IR">IR </a><a href="/places/default/iso/TJ">TJ </a><a href="/places/default/iso/PK">PK </a><a href="/places/default/iso/UZ">UZ </a></div>']
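Because findall() returns the captured groups in document order, an individual field has to be picked out by its position on the page. A minimal sketch on a static snippet (sample_html is a stand-in for the downloaded page, keeping only two rows):

```python
import re

# sample_html is a stand-in for the downloaded page (only two rows kept)
sample_html = ('<tr id="places_area__row">'
               '<td class="w2p_fw">647,500 square kilometres</td></tr>'
               '<tr id="places_population__row">'
               '<td class="w2p_fw">29,121,286</td></tr>')

# findall() returns every captured group in document order,
# so each field is addressed by its position on the page
p = re.compile('<td class="w2p_fw">(.*?)</td>')
fields = p.findall(sample_html)
print(fields[0])  # 647,500 square kilometres
```

This is what makes regex scraping brittle: if the site inserts a new row, every index shifts.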
2. BeautifulSoup
Installation:
pip install beautifulsoup4
Install lxml:
pip install lxml
Beautiful Soup correctly parses missing attribute quotes and closes unclosed tags:
>>> import lxml
>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> soup = BeautifulSoup(broken_html, 'lxml')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<html>
<body>
<ul class="country">
<li>
Area
</li>
<li>
Population
</li>
</ul>
</body>
</html>
Finding data:
>>> ul = soup.find('ul',attrs={'class':'country'})
>>> ul.find('li')
<li>Area</li>
>>> ul.find_all('li')
[<li>Area</li>, <li>Population</li>]
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Extracting a country's area:
>>> from urllib import request
>>> html = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> soup = BeautifulSoup(html)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP)
to this:
BeautifulSoup(YOUR_MARKUP, "lxml")
>>> soup = BeautifulSoup(html,'lxml')
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> td = tr.find(attrs={'class':'w2p_fw'})
>>> area = td.text
>>> print(area)
647,500 square kilometres
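The chained find() lookup above can be wrapped in a small helper. This is only a sketch: scrape_field is a hypothetical name, and sample_html is a cut-down stand-in for the downloaded country page:

```python
from bs4 import BeautifulSoup

# sample_html is a cut-down stand-in for the downloaded country page
sample_html = ('<table><tr id="places_area__row">'
               '<td class="w2p_fw">647,500 square kilometres</td></tr></table>')

def scrape_field(html, row_id):
    """Return the text of the w2p_fw cell in the row with the given id."""
    soup = BeautifulSoup(html, 'lxml')
    tr = soup.find(attrs={'id': row_id})
    td = tr.find(attrs={'class': 'w2p_fw'})
    return td.text

print(scrape_field(sample_html, 'places_area__row'))  # 647,500 square kilometres
```

Unlike the regex version, this lookup keys on the row's id attribute, so it survives cosmetic changes to the surrounding markup.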
3. lxml
lxml is a Python wrapper around the libxml2 XML parsing library. It is written in C and parses faster than Beautiful Soup.
Documentation: http://lxml.de/installation.html#source-builds-on-ms-windows
>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html)
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print(fixed_html)
b'<ul class="country">\n<li>Area</li>\n<li>Population</li>\n</ul>\n'
lxml likewise parses the missing attribute quotes correctly and closes the tags, although it does not add the extra <html> and <body> tags.
XPath selectors work similarly to BeautifulSoup's find() method.
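As a rough XPath equivalent of the BeautifulSoup lookup above, the same cell can be selected in one expression; sample_html below is a cut-down stand-in for the downloaded country page:

```python
import lxml.html

# sample_html is a cut-down stand-in for the downloaded country page
sample_html = ('<table><tr id="places_area__row">'
               '<td class="w2p_fw">647,500 square kilometres</td></tr></table>')

tree = lxml.html.fromstring(sample_html)
# One XPath expression replaces the two chained find() calls
area = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')[0]
print(area)  # 647,500 square kilometres
```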
CSS selectors:
Installation:
pip install cssselect
>>> from lxml.cssselect import CSSSelector
>>> from urllib import request
>>> html1 = request.urlopen('http://example.webscraping.com/places/default/view/Afghanistan-1').read()
>>> tree = lxml.html.fromstring(html1)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print(area)
647,500 square kilometres
This code first finds the table row element with the ID places_area__row, then selects its table data child tag whose class is w2p_fw.
A CSS selector is a pattern used to select elements. Examples of common selectors:
Select all tags                                               | *
Select <a> tags                                               | a
Select all elements with class="link"                         | .link
Select <a> tags with class="link"                             | a.link
Select the <a> tag with id="home"                             | a#home
Select all <span> tags that are direct children of an <a> tag | a > span
Select all <span> tags anywhere inside an <a> tag             | a span
Select all <a> tags whose title attribute is "Home"           | a[title=Home]
The extracted data can then be stored in a CSV file.
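A minimal sketch of that last step with the standard library's csv module (field names and values copied from the session above; the file name country.csv is arbitrary):

```python
import csv

# Field names and values copied from the session above;
# the output file name country.csv is arbitrary
rows = [('area', '647,500 square kilometres'),
        ('population', '29,121,286'),
        ('capital', 'Kabul')]

with open('country.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(('field', 'value'))  # header row
    writer.writerows(rows)
```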