[Python] Overview of Web Data Collection (1): Page Access and Page Element Parsing

Vi_NSN

Category: Web crawlers. Tags: python, crawler, data

Article link: https://blog.csdn.net/Vi_NSN/article/details/77725442

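urllib
- Main uses
BeautifulSoup
Data collection
- Recursively traversing a single site (internal links)
- Crawling across the Internet (external links)
Parsing JSON data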

urllib

Python 3.x merges Python 2's urllib2 and urllib into a single urllib package, organized into four main modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.

Main uses:

urllib.request: opens and reads URLs;
urllib.error: contains the exceptions raised by urllib.request;
urllib.parse: parses URLs;
urllib.robotparser: parses robots.txt files.
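
As a minimal sketch of the modules that can be tried without a network connection, the example below splits a URL with urllib.parse, resolves a relative link, and checks a rule set with urllib.robotparser. The robots.txt rules and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

# Split a URL into its components
parts = urlparse("http://www.pythonscraping.com/pages/page1.html?lang=en")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Resolve a relative link against the page it was found on
absolute = urljoin("http://www.pythonscraping.com/pages/page3.html",
                   "../img/gifts/img1.jpg")
print(absolute)

# Parse robots.txt rules supplied as lines (hypothetical rules, not a real site's)
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed_public = robots.can_fetch("*", "http://example.com/index.html")
allowed_private = robots.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

In a real crawler the rules would normally be loaded from the site itself with set_url() and read() before calling can_fetch().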

BeautifulSoup

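BeautifulSoup plays a role similar to regular expressions: it locates HTML tags in order to format and organize complex web content, and it exposes the XML structure as Python objects. Tags can be located directly by their attributes and their contents retrieved. Tags are related hierarchically (parent, child, sibling).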

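from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly
print(bsObj.h1)

# The span example needs a page containing green/red spans. warandpeace.html is
# the page used in the reference book (an assumption; the original reuses page1.html).
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
nameList = bsObj.findAll("span", {"class": {"green", "red"}})
# findAll(tag, attributes, recursive, text, limit, keywords)
# find(tag, attributes, recursive, text, keywords)
for name in nameList:
    print(name.get_text())

# The giftList table and the gift images are on page3.html
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
for child in bsObj.find("table", {"id": "giftList"}).children:
    # children: direct child tags; descendants: all nested tags
    print(child)

for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    # sibling tags
    print(sibling)

# parent tag
print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())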
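
The navigation properties used above can also be tried offline. The following is a minimal sketch, assuming a small made-up table that only loosely imitates the giftList page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# A small, made-up table modeled loosely on the giftList example
html = """
<table id="giftList">
<tr><th>Item</th><th>Price</th></tr>
<tr class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# children: direct children only; whitespace strings are filtered out here
child_rows = [c for c in table.children if isinstance(c, Tag) and c.name == "tr"]

# descendants: every nested node, at any depth
cell_count = len([d for d in table.descendants if isinstance(d, Tag) and d.name == "td"])

# next_siblings: everything after the header row, excluding the header itself
data_rows = [s for s in table.tr.next_siblings if isinstance(s, Tag)]

# previous_sibling: step from the price cell back to the item-name cell
price_cell = soup.find("td", string="$15.00")
item_name = price_cell.previous_sibling.get_text()

print(len(child_rows), cell_count, len(data_rows), item_name)
```

The isinstance checks matter because children and next_siblings also yield the newline strings between tags, not only Tag objects.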

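Regular expressions

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# A regular expression can be passed as any argument of a BeautifulSoup search
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])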
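
To see which tags such a pattern keeps and which it rejects, here is a minimal offline sketch. The HTML fragment is made up; only the pattern comes from the example above.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment: two gift images, one logo in the same folder, one unrelated image
html = """
<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
<img src="../img/banner.png">
"""
soup = BeautifulSoup(html, "html.parser")

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
gift_images = soup.find_all("img", {"src": pattern})
sources = [img["src"] for img in gift_images]
print(sources)
```

The logo and the banner are skipped because their paths do not contain "gifts/img" followed by a ".jpg" ending.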

Accessing attributes

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
myImgTags = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for myImgTag in myImgTags:
    print(myImgTag)
    print(myImgTag.attrs["src"])  # .attrs returns a Python dictionary
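
A short offline sketch of the three common ways to read an attribute, assuming a made-up img tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<img src="../img/gifts/img1.jpg" alt="Vegetable Basket">', "html.parser"
)
img = soup.find("img")

all_attrs = img.attrs        # dictionary of every attribute on the tag
src = img["src"]             # raises KeyError if the attribute is missing
title = img.get("title")     # returns None if the attribute is missing

print(all_attrs, src, title)
```

Using get() is the safer choice when an attribute may be absent on some tags.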

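Lambda expressions

BeautifulSoup lets us pass certain functions as arguments to findAll. The only restriction is that the function must take a tag object as its argument and return a Boolean. BeautifulSoup evaluates every tag it encounters with this function, keeps the tags for which it returns True, and discards the rest. For example: soup.findAll(lambda tag: len(tag.attrs) == 2)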
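
The filter described above can be sketched offline as follows; the HTML fragment is made up so that exactly two tags carry two attributes.

```python
from bs4 import BeautifulSoup

html = """
<div class="a" id="x">two attributes</div>
<div class="b">one attribute</div>
<span id="y" title="t">two attributes</span>
<p>no attributes</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the tags that have exactly two attributes
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attr_tags]
print(names)
```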

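Other parsing libraries

lxml

Written largely in C; parses both HTML and XML documents, and parses quickly.

HTML parser

Python's built-in parsing library (the html.parser module).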
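
As a rough illustration of the built-in parser used on its own, the sketch below subclasses HTMLParser to collect link targets. The class name LinkCollector and the HTML fragment are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(
    '<p><a href="/pages/page1.html">one</a> '
    '<a name="x">no href</a> '
    '<a href="/pages/page3.html">three</a></p>'
)
links = collector.links
print(links)
```

With BeautifulSoup the parser is chosen by the second constructor argument, for example "html.parser" for the built-in one or "lxml" when lxml is installed.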

Data collection

Recursively traversing a single site (internal links)

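from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # URLs that have already been seen

def getLinks(pageUrl):
    # Find internal links automatically and follow them recursively
    global pages
    html = urlopen(pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(https://baike.baidu.com/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have found a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

# The original leaves the start URL empty; pass a page under https://baike.baidu.com/ here
getLinks("")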
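
The visited-set logic can be checked without touching the network. The sketch below is an assumption-laden stand-in: FAKE_SITE is a made-up in-memory site that replaces urlopen, a simple regex replaces BeautifulSoup, and the max_depth guard is an addition meant to stay clear of Python's recursion limit.

```python
import re

# Hypothetical in-memory "site": path -> HTML. Stands in for urlopen().
FAKE_SITE = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a> <a href="/c">c</a>',
    "/c": '<a href="/a">a</a> <a href="http://other.example/x">external</a>',
}

# Internal links only: href values that start with "/"
HREF = re.compile(r'href="(/[^"]*)"')


def crawl(page_url, pages, depth=0, max_depth=50):
    """Recursively follow internal links, recording each page once."""
    if depth > max_depth:
        return
    for new_page in HREF.findall(FAKE_SITE.get(page_url, "")):
        if new_page not in pages:
            pages.add(new_page)
            crawl(new_page, pages, depth + 1, max_depth)


visited = set()
crawl("/a", visited)
print(sorted(visited))
```

Passing the set as an argument instead of using a global keeps the function reusable across several crawls.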

Crawling across the Internet (external links)
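
A first step for this kind of crawl is telling internal links from external ones. The following is only a sketch under assumptions: the helper split_links and the sample hrefs are made up for illustration, and links are compared by host name alone.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Return (internal, external) sets of absolute URLs found on page_url."""
    site = urlparse(page_url).netloc
    internal, external = set(), set()
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc == site:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external


internal, external = split_links(
    "http://www.pythonscraping.com/pages/page3.html",
    [
        "/pages/page1.html",
        "../img/gifts/img1.jpg",
        "http://example.com/about",
        "mailto:someone@example.com",
    ],
)
print(sorted(internal))
print(sorted(external))
```

A crawler could then follow one of the external links to hop to another site and repeat the process there.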


Parsing JSON data

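from urllib.request import urlopen
import json

def getCountry(ipAddress):
    # Note: the freegeoip.net service may no longer be available, so this request
    # can fail; the parsing steps are the same for any API that returns JSON
    response = urlopen("http://freegeoip.net/json/" + ipAddress).read().decode('utf-8')
    responseJson = json.loads(response)  # JSON text -> Python dict
    return responseJson.get("country_code")

print(getCountry("50.78.253.58"))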
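
The parsing step can be tried without the web service. In this sketch the response string is made up; only the country_code key is taken from the example above, and the other fields are hypothetical.

```python
import json

# Hypothetical response body standing in for the HTTP response
response = (
    '{"ip": "50.78.253.58", "country_code": "US", '
    '"city": null, "coords": [42.5, -71.2]}'
)

data = json.loads(response)                 # JSON object -> Python dict
country = data.get("country_code")
city = data.get("city")                     # JSON null becomes None
latitude = data["coords"][0]                # JSON array -> Python list
missing = data.get("zip_code", "unknown")   # default for an absent key

print(country, city, latitude, missing)
```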

Reference book:
《Python网络数据采集》 (the Chinese edition of Web Scraping with Python)
