How to Write a Web Crawler in Python

C_Creator

Category: python  Tags: crawler

Article link: https://blog.csdn.net/c_creator/article/details/52254163

How to Write a Crawler

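I. Understand the tags of a web page: once you know which tags a page is built from, you can use those tags to filter out the content you want.

1. Image tag: http://www.w3school.com.cn/tags/tag_img.asp

Example:

<img src="/i/eg_tulip.jpg" alt="上海鲜花港 - 郁金香" />

An image tag must have the src and alt attributes: src is the URL of the image, and alt is the text displayed in place of the image.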

2. Hyperlink tag: http://www.w3school.com.cn/tags/tag_a.asp

Example:

<a href="http://www.w3school.com.cn">W3School</a>

Attributes of the <a> tag:

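Attribute	Value	Description
charset	char_encoding	Not supported in HTML5. Specifies the character set of the linked document.
coords	coordinates	Not supported in HTML5. Specifies the coordinates of the link.
href	URL	Specifies the URL of the page the link points to.
hreflang	language_code	Specifies the language of the linked document.
media	media_query	Specifies the media/device the linked document is optimized for.
name	section_name	Not supported in HTML5. Specifies the name of an anchor.
rel	text	Specifies the relationship between the current document and the linked document.
rev	text	Not supported in HTML5. Specifies the relationship between the linked document and the current document.
shape	default/rect/circle/poly	Not supported in HTML5. Specifies the shape of the link.
target	_blank/_parent/_self/_top/framename	Specifies where to open the linked document.
type	MIME type	Specifies the MIME type of the linked document.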

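Generally speaking, these two tags are what a crawler mainly scrapes. Other tags will be covered later; this article focuses on writing a crawler for these two.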
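
To make the idea of filtering by tag concrete, here is a minimal sketch that uses only the standard library's html.parser module. The class name TagCollector and the sample string are made up for illustration; the sample simply reuses the two example tags shown above.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the src of every <img> tag and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


# Hypothetical sample built from the two example tags in this article
sample = ('<a href="http://www.w3school.com.cn">W3School</a>'
          '<img src="/i/eg_tulip.jpg" alt="上海鲜花港 - 郁金香" />')

collector = TagCollector()
collector.feed(sample)
print(collector.images)
print(collector.links)
```

The parser calls handle_starttag once per opening tag, so everything other than img and a is simply ignored.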

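II. How to filter these two tags with Python

1. Learn the libraries used below: urllib (Python 2's urllib and urllib2 were reorganized into urllib.request and related modules in Python 3) and BeautifulSoup:

1.1 urllib library: English documentation: http://docspy3zh.readthedocs.io/en/latest/library/urllib.request.html, Chinese documentation: http://python.usyiyi.cn/python_278/library/urllib.html

Fetches the content of a web page.

1.2 BeautifulSoup library: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Parses the content of a web page.

III. Scraping web pages with Python

1. Scraping images with Python:

1.1 Fetching the page content: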

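#!/usr/bin/env python3
# coding=utf-8
import urllib.request as url
# Open the page; urlopen returns a file-like response object
web_file = url.urlopen('http://www.baidu.com')
# Read the response line by line; each line is a bytes object
for line in web_file:
    print(line)
# This prints the response object itself, not the page content
print(web_file)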
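
Because urlopen returns bytes, the content usually has to be decoded before it can be treated as text. The sketch below shows one way to do that. So that it runs without a network connection, it assumes a data: URL (supported by urlopen since Python 3.4) as a stand-in for a real page; the sample HTML and the UTF-8 fallback are illustrative choices.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical page content, wrapped in a data: URL so no network is needed
html = '<html><body><img src="/i/eg_tulip.jpg" alt="tulip"></body></html>'
page_url = "data:text/html;charset=utf-8," + quote(html)

with urllib.request.urlopen(page_url) as resp:
    raw = resp.read()  # bytes, not str
    # Use the charset declared in the Content-Type header, falling back to UTF-8
    charset = resp.headers.get_content_charset() or "utf-8"

text = raw.decode(charset, errors="replace")
print(type(raw).__name__, type(text).__name__)
print(text)
```

With a real site, page_url would be an http or https address; the reading and decoding steps stay the same.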

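1.2 Filtering image URLs with regular expressions

import re
import urllib.request as url

# Fetch the page and decode the bytes to text so that str patterns can be applied
context = url.urlopen("http://www.baidu.com").read().decode("utf-8", errors="replace")
# Match every <img ...> tag that carries a src attribute
imgs = re.findall(r"<img[^>]*\bsrc=[^>]*>", context)
for img in imgs:
    # Extract the value of the src attribute from each matched tag
    print(re.findall(r'src=["\']?([^"\'\s>]+)', img))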
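1.3 Filtering image tags with BeautifulSoup

from bs4 import BeautifulSoup
import urllib.request as urlrq

def openurl_and_getsoup(url):
    web = urlrq.urlopen(url)
    # Pass the raw bytes and let BeautifulSoup detect the encoding;
    # str(web.read()) would parse the bytes repr (b'...') instead of the HTML.
    # The "lxml" parser requires the lxml package to be installed.
    return BeautifulSoup(web.read(), "lxml")

url = "http://www.baidu.com"
soup = openurl_and_getsoup(url)
# Get all img tags and store them in a list
all_img = soup.find_all('img')
# soup.img['src'] returns the URL of the first image
print(soup.img)
print(all_img)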
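
The code above stops at printing the tags. One possible next step is saving the images to disk, sketched below under a few assumptions: the helper names fetch_bytes and save_images are hypothetical, file names are taken from the last path component of each URL, and the download function is passed in as a parameter so that it can be swapped out.

```python
import os
import urllib.request
from urllib.parse import urljoin, urlsplit


def fetch_bytes(url):
    """Download url and return the raw bytes of the response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def save_images(page_url, srcs, dest_dir, fetch=fetch_bytes):
    """Resolve each src against page_url, download it, and save it in dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, src in enumerate(srcs):
        img_url = urljoin(page_url, src)
        # Use the last path component as the file name, with a numbered fallback
        name = os.path.basename(urlsplit(img_url).path) or "image_%d" % index
        path = os.path.join(dest_dir, name)
        with open(path, "wb") as out:
            out.write(fetch(img_url))
        saved.append(path)
    return saved
```

With the variables from the code above, a call might look like save_images(url, [img["src"] for img in all_img if img.get("src")], "images"). Images that share a file name would overwrite one another in this sketch.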