抓取网页html,Web Scraping爬取HTML网页基础版-CSDN博客

在网络上访问数据有不同方式?：

-爬取HTML网页

-直接下载数据文件，例如csv，txt，pdf文件

-通过应用程序编程接口(API)访问数据，例如电影数据库，Twitter

选择网页爬取，当然了解HTML网页的基本结构，可以参考这个网页：

HTML的基本结构

HTML标记：head，body，p，a，form，table等等

标签会具有属性。例如，标记a具有属性(或属性)href的链接的目标。

class和id是html用来通过级联样式表(CSS)控制每个元素的样式的特殊属性。 id是元素的唯一标识符，而class用于将元素分组以进行样式设置。

一个元素可以与多个类相关联。这些类别之间用空格隔开，例如

伦敦 h2>

下图是来自W3SCHOOL的例子，city的包括三个属性，main包括一个属性，London运用了两个city和main，这两个类，呈现出来的是下图的样子。

可以通过标签相对于彼此的位置来引用标签

child-child是另一个标签内的标签，例如这两个p标签是div标签的子标签。

parent-parent是一个标签，另一个标签在其中，例如 html标签是body标签的parent标签。

siblings-siblings是与另一个标签具有相同parent标签的标签，例如在html示例中，head和body标签是同级标签，因为它们都在html内。两个p标签都是sibling，因为它们都在body里面。

四步爬取网页：

第一步：安装模块

安装requests,beautifulsoup4,用来爬取网页信息

Install modules requests, BeautifulSoup4/scrapy/selenium/....requests: allow you to send HTTP/1.1 requests using Python. To install:Open terminal (Mac) or Anaconda Command Prompt (Windows)code: BeautifulSoup: web page parsing library, to install, use:

第二步：利用安装包来读取网页源码

第三步：浏览网页源码找到需要读取信息的位置

这里不同的浏览器读取源码有差异，下面介绍几个，有相关网页查询详细信息。

Firefox: right click on the web page and select "view page source"Safari: please instruction here to see page source ()Ineternet Explorer: see instruction at

第四步：开始读取

Beautifulsoup: 简单那，支持CSS Selector, 但不支持 XPathscrapy (): 支持 CSS Selector 和XPathSelenium: 可以爬取动态网页 (例如下拉不断更新的)lxml等BeautifulSoup里Tag: an xml or HTML tag 标签Name: every tag has a name 每个标签的名字Attributes: a tag may have any number of attributes. 每个标签有一个到多个属性 A tag is shown as a dictionary in the form of {attribute1_name:attribute1_value, attribute2_name:attribute2_value, ...}. If an attribute has multiple values, the value is stored as a listNavigableString: the text within a tag

上代码：

#Import requests and beautifulsoup packages

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity="all"

# import requests package

import requests

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)

from bs4 import BeautifulSoup

Get web page content

# send a get request to the web page

page=requests.get("A simple example page")

# status_code 200 indicates success.

# a status code >200 indicates a failure

if page.status_code==200:

# content property gives the content returned in bytes

print(page.content) # text in bytes

print(page.text) # text in unicode

#Parse web page content

# Process the returned content using beautifulsoup module

# initiate a beautifulsoup object using the html source and Python’s html.parser

soup=BeautifulSoup(page.content, 'html.parser')

# soup object stands for the **root**

# node of the html document tree

print("Soup object:")

# print soup object nicely

print(soup.prettify())

# soup.children returns an iterator of all children nodes

print("\soup children nodes:")

soup_children=soup.children

print(soup_children)

# convert to list

soup_children=list(soup.children)

print("\nlist of children of root:")

print(len(soup_children))

# html is the only child of the root node

html=soup_children[0]

html

# Get head and body tag

html_children=list(html.children)

print("how many children under html? ", len(html_children))

for idx, child in enumerate(html_children):

print("Child {} is: {}\n".format(idx, child))

# head is the second child of html

head=html_children[1]

# extract all text inside head

print("\nhead text:")

print(head.get_text())

# body is the fourth child of html

body=html_children[3]

# Get details of a tag

# get the first p tag in the div of body

div=list(body.children)[1]

p=list(div.children)[1]

# get the details of p tag

# first, get the data type of p

print("\ndata type:")

print(type(p))

# get tag name (property of p object)

print ("\ntag name: ")

print(p.name)

# a tag object with attributes has a dictionary

# use .attrs to get the dictionary

# each attribute name of the tag is a key

# get all attributes

p.attrs

# get "class" attribute

print ("\ntag class: ")

print(p["class"])

# how to determine if 'id' is an attribute of p?

# get text of p tag

p.get_text()