Getting Started with Python Web Scraping

Latest recommended article published on 2024-07-16 19:26:54

Dream__TT

Category: Python. Tags: python, crawler, url

Article link: https://blog.csdn.net/Dream__TT/article/details/76609903

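I recently taught myself web scraping with Python, and I would like to share how to use a Python crawler to download all the images in a Tieba thread as well as the source code of a web page.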

First, a quick primer on HTTP status codes and page encodings:

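HTTP status codes: 200 (OK, the request succeeded), 301 (permanent redirect), 403 (forbidden), 404 (page not found), 500 (internal server error). The errors we run into most often when visiting restricted overseas sites or pages that do not exist are 403 and 404.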
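
To look up what a status code means without memorizing the list, the Python standard library carries a table of the official reason phrases. The following is a minimal sketch; the helper name describe_status is made up for illustration, and the try/except import is only there so that the snippet runs on both Python 2.7 and Python 3.

```python
# Works on Python 2.7 and Python 3: only the import location differs.
try:
    from http.client import responses  # Python 3
except ImportError:
    from httplib import responses  # Python 2.7


def describe_status(code):
    """Return the standard reason phrase for an HTTP status code.

    describe_status is a hypothetical helper name, not part of urllib.
    """
    return responses.get(code, 'Unknown status code')


# Print the five codes discussed in the text with their official phrases.
for code in (200, 301, 403, 404, 500):
    print('%d %s' % (code, describe_status(code)))
```

The value returned by getcode() on a response can be passed straight to such a helper.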

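Page encoding: the character encoding a web page uses, for example the common UTF-8, GBK, and GB2312. Different encodings represent the same characters as different bytes. Because UTF-8 can represent text in different languages in one uniform way, it is the most widely used encoding today.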
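
The difference between encodings is easy to see by encoding the same two Chinese characters both ways. This is a small sketch rather than a full encoding detector: the helper decode_page and its candidate list are assumptions made for illustration, and a real page may declare its encoding in its headers or in a meta tag instead.

```python
# The same text becomes different bytes under different encodings.
text = u'\u4e2d\u6587'  # the two characters of the word "Chinese"

utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')
print(len(utf8_bytes))  # 3 bytes per character in UTF-8
print(len(gbk_bytes))   # 2 bytes per character in GBK


def decode_page(raw, candidates=('utf-8', 'gbk')):
    """Try each candidate encoding in order and return the decoded text.

    decode_page is a hypothetical helper. UTF-8 is tried first because it
    rejects most byte sequences that are not really UTF-8, whereas GBK
    would often accept them and silently produce garbled text.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')
```

Decoding the bytes returned by read() with the wrong encoding is the usual cause of garbled Chinese text in scraped pages.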

Next, let's see how to access a website and download its page.

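Environment: Python 2.7. Editor: Sublime Text 3. Libraries: urllib, urllib2, re (regular expressions), and BeautifulSoup. A library can be installed automatically by running pip install followed by the package name (if pip is not installed, look up a tutorial on Baidu).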

Part 1: Fetch a page's source code and save it to a specified path

# -*- coding: utf-8 -*-
# Import the library and open the URL
import urllib

url = "http://www.baidu.com/"

html = urllib.urlopen(url)
# Read and print the source code of the Baidu home page
print html.read()
# Print the HTTP status code of the response
print html.getcode()
html.close()

# Save the page to the desktop

urllib.urlretrieve(url, "C:\\Users\\Administrator\\Desktop\\baidu.html")
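
urllib.urlretrieve() fetches and saves in a single call. To make the two steps visible, the sketch below opens the URL, reads the bytes, and writes them to a file by hand. The function name save_page is hypothetical, and the import fallback is an assumption added so that the same code runs on Python 2.7 and Python 3 (where urlopen lives in urllib.request).

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2.7


def save_page(url, path):
    """Download the resource at url, write it to path, return its size.

    save_page is a hypothetical helper that spells out what a one-call
    download does: open the URL, read the bytes, write them to disk.
    """
    response = urlopen(url)
    try:
        content = response.read()
    finally:
        # Always release the connection, even if reading fails.
        response.close()
    # Write in binary mode so the bytes are saved exactly as received.
    with open(path, 'wb') as output:
        output.write(content)
    return len(content)
```

For example, save_page("http://www.baidu.com/", "baidu.html") would write the home page into the current directory.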

Part 2: Download the images from a Tieba thread

Before scraping images from a Tieba thread, first look at the page with the browser's Inspect Element tool.

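As Inspect Element shows, the images in the page source are marked with class="BDE_Image", followed by a src attribute whose value ends in .jpg. We can use these features to write a program that saves all the images in the thread: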

1: Using a regular expression

# -*- coding: utf-8 -*-
import re  # regular expressions
import urllib

def get_content(url):
    """Fetch the page at url and return its source code."""
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content

def get_images(info):
    """Download every .jpg image marked class="BDE_Image" in the page source.

    Example of another <img> tag from the page (class j_retract):
    <img class="j_retract" id="big_img_1501662059644" src="http://imgsrc.baidu.com/forum/w%3D580%3B/sign=0cf15dc417178a82ce3c7fa8c638728d/f3d3572c11dfa9ec04c4f11f6bd0f703908fc1d4.jpg" onerror="this.src='//tb2.bdstatic.com/tb/static-frs/img/v2/picerr.gif';this.width=82;this.height=75;" style="width: 534px; height: 534px; visibility: visible;">
    """
    regex = r'class="BDE_Image" src="(.+?\.jpg)"'
    # Compile the regular expression
    pat = re.compile(regex)

    image_code = re.findall(pat, info)
    #print image_code
    i = 0
    for image_url in image_code:
        print image_url
        # Save the images as 0.jpg, 1.jpg, ... in the current directory
        urllib.urlretrieve(image_url, '%s.jpg' % i)
        i += 1

info = get_content('https://tieba.baidu.com/p/3823765471')
print info
# Extract and download the images
get_images(info)
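
To see what the pattern captures without touching the network, it can be run against a small hand-written snippet. The sample HTML and the example.com URLs below are made up for illustration; only the regular expression itself comes from the listing above.

```python
import re

# The same pattern as in the listing above.
IMAGE_PATTERN = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')


def extract_image_urls(page_source):
    """Return the src of every tag matching the BDE_Image pattern."""
    return IMAGE_PATTERN.findall(page_source)


# Hypothetical page fragment: two wanted images and one with another class.
sample = (
    '<img class="BDE_Image" src="http://example.com/a.jpg" width="560">'
    '<img class="j_retract" src="http://example.com/skip.jpg">'
    '<img class="BDE_Image" src="http://example.com/b.jpg">'
)

print(extract_image_urls(sample))
```

The question mark in .+? makes the match non-greedy, so each capture stops at the first .jpg" it reaches. With a greedy .+ the single capture would run from the first URL to the last one on the line. Note also that the pattern only matches when src directly follows the class attribute.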

2: Using BeautifulSoup

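# -*- coding: utf-8 -*-
import urllib
from bs4 import BeautifulSoup

def get_content(url):
    """Fetch the page at url and return its source code."""
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content

# Example of a matching tag:
# <img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=f9cf09409c25bc312b5d01906ede8de7/8f0ede0735fae6cdafb377ef0ab30f2443a70fda.jpg" pic_ext="jpeg" changedsize="true" width="560" height="497">
def get_images(info):
    """Download every <img> tag with class BDE_Image found in the page source."""
    # Name the parser explicitly; bs4 warns when it is left out
    soup = BeautifulSoup(info, 'html.parser')
    # The class_ argument filters by attribute, so only images of that class are returned
    all_img = soup.find_all('img', class_="BDE_Image")

    x = 1
    # Save the images as 1.jpg, 2.jpg, ... in the current directory
    for img in all_img:
        #print img['src']
        image_name = '%s.jpg' % x
        urllib.urlretrieve(img['src'], image_name)
        x += 1

info = get_content('https://tieba.baidu.com/p/3823765471')
get_images(info)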
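
What a parser adds over a regular expression is that it looks at tags and attributes rather than raw text, so attribute order no longer matters. The sketch below illustrates that idea with the standard library's HTMLParser instead of BeautifulSoup, so it runs without installing anything; it is a swapped-in technique, not how bs4 works internally. The names ImageCollector and find_image_urls and the example.com URLs are hypothetical.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag that carries a given class."""

    def __init__(self, wanted_class):
        HTMLParser.__init__(self)
        self.wanted_class = wanted_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order.
        if tag != 'img':
            return
        attributes = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attributes.get('class') or '').split()
        if self.wanted_class in classes and attributes.get('src'):
            self.urls.append(attributes['src'])


def find_image_urls(page_source, wanted_class='BDE_Image'):
    """Return the src of each matching <img> tag, in document order."""
    collector = ImageCollector(wanted_class)
    collector.feed(page_source)
    collector.close()
    return collector.urls


# Hypothetical fragment: src comes before class in the first tag.
sample = (
    '<img src="http://example.com/a.jpg" class="BDE_Image">'
    '<img class="BDE_Image big" src="http://example.com/b.jpg" />'
    '<img class="j_retract" src="http://example.com/skip.jpg">'
)
print(find_image_urls(sample))
```

The first sample tag, with src written before class, would be missed by the regular expression in method 1 but is found here.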

With either method, the images in the thread are saved to a local folder.

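By now you should be familiar with what urllib.urlopen(), read(), and urllib.urlretrieve() do: they open a URL, read that page's source code, and save the file at a given URL, respectively.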

All of the code has been uploaded to: http://download.csdn.net/detail/dream__tt/9919725
