Python 3 web crawler (basics of fetching web pages and images)

Latest recommended article published on 2024-04-16 22:07:34

liujun-st

Category: Python3

Article link: https://blog.csdn.net/Ben_Ben_Niao/article/details/40677869

These are my notes from learning to write web crawlers in Python 3, recording the whole learning process.

When learning Python 3, pay special attention to the places where Python 2 statements no longer work. The programs below run under Python 3.4.2.

1. To write a complete crawler, you first have to read the web page and get its data, as follows:

import urllib.request

response = urllib.request.urlopen("http://acm.hit.edu.cn")
html = response.read()  # raw bytes
z_data = html.decode("UTF-8")  # decode to see the original characters (e.g. Chinese); if that fails, try "GBK"
print(z_data)

# read() returns bytes in Python 3, so the file must be written in binary mode
with open("txt.html", "wb") as f:
    f.write(html)
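
The "try GBK if UTF-8 fails" advice above can be automated. Here is a minimal sketch; the helper name decode_html and the encoding list are my own assumptions, not part of urllib:

```python
def decode_html(raw, encodings=("utf-8", "gbk")):
    """Try each encoding in turn and return the first successful decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding, replacing bad bytes
    return raw.decode(encodings[0], errors="replace")


# GBK-encoded Chinese text is not valid UTF-8, so the fallback kicks in
sample = "汉字".encode("gbk")
print(decode_html(sample))
```

In the program above, z_data = decode_html(html) would take the place of the hard-coded html.decode("UTF-8").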

The URL can also be assembled from separate parts, as follows:

import urllib.parse

data = {}
data['key'] = 'python3'
url_values = urllib.parse.urlencode(data)  # result: key=python3

url = 'http://www.baidu.com/s?'
full_url = url + url_values  # resulting URL: http://www.baidu.com/s?key=python3
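
To reuse this for several parameters, the concatenation can be wrapped in a small function. This is only a sketch: build_url is a hypothetical helper of mine, and the "page" parameter is made up for illustration.

```python
import urllib.parse


def build_url(base, params):
    """Join a base URL and a dict of query parameters."""
    query = urllib.parse.urlencode(params)
    if base.endswith(("?", "&")):
        return base + query
    # Use "&" if the base already has a query string, otherwise "?"
    return base + ("&" if "?" in base else "?") + query


print(build_url("http://www.baidu.com/s?", {"key": "python3"}))
print(build_url("http://www.baidu.com/s", {"key": "python 3", "page": 2}))
```

urlencode also escapes spaces and non-ASCII characters, which is why it is safer than gluing the strings together by hand.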

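2. If we already know the URLs of all the pages we want to crawl, crawling is easy: loop over the URLs, calling urllib.request.urlopen() and file.write() to save each page as an .html file. That kind of crawling is not very interesting, though, so below we use regular-expression matching to crawl images.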

Idea. Input: a URL.

Output: the crawled images, saved in a local folder.

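Process: open the URL with urllib.request.urlopen() and read its data; build a pattern object with re.compile(); get all matches with re.findall(), i.e. object.findall(data); then download each match to a local folder with urllib.request.urlretrieve(). This involves the os module for paths, e.g. creating the directory with os.makedirs(); you can also create the folder by hand beforehand. An example follows: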

import os
import re
import urllib.request

def getHtml(url):
    # Fetch the page and decode the bytes into a str
    page = urllib.request.urlopen(url)
    html = page.read()
    return html.decode('UTF-8')

def getImg(html):
    # The parentheses form a capture group, so findall returns only the image URL.
    # To grab Taobao's png images (check the image paths in the page source first), use:
    # reg = r'data-lazy="(.+?\.png)" '
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = imgre.findall(html)
    x = 0
    path = 'D:\\test'
    if not os.path.isdir(path):
        os.makedirs(path)
    paths = path + '\\'  # save under the test directory
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl, '{}{}.jpg'.format(paths, x))
        x = x + 1

html = getHtml("http://tieba.baidu.com/p/2460150866")  # for Taobao: html = getHtml(r"http://www.taobao.com/")
getImg(html)
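
The matching step can be tried offline on a string before pointing it at a real page. A small sketch, assuming made-up example.com image tags in place of the real page source; the stricter pattern is my own variant, not something the original program uses:

```python
import re

# Hypothetical page source: a png sits between two jpg images
sample_html = (
    '<img src="http://example.com/a.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/c.png" pic_ext="png">'
    '<img src="http://example.com/b.jpg" pic_ext="jpeg">'
)

# Pattern from the program above: ".+?" may run across quote characters
loose = re.compile(r'src="(.+?\.jpg)" pic_ext').findall(sample_html)

# Stricter variant: [^"] keeps the match inside one src attribute
strict = re.compile(r'src="([^"]+?\.jpg)" pic_ext').findall(sample_html)

print(loose)
print(strict)
```

With this sample, the loose pattern starts at the png's src and swallows everything up to the next .jpg, so its second result is not a usable URL, while the strict one returns just the two jpg addresses. Which pattern fits depends on the real page source.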

The next program crawls every page linked from the Taobao home page and saves them in test\1, creating that path if it does not exist. The code:

#coding: utf-8
import os
import re
import urllib.request as request

def firstReptile(url):
    urlData = request.urlopen(url).read()
    # In Python 3, read() returns bytes, but the regex below needs a str, so decode first:
    data = urlData.decode("GBK")
    objec = re.compile(r'<a href="(http://.+?)" ')
    dir_path = 'D:\\test'
    path = '\\1'
    image_path = dir_path + path
    if not os.path.isdir(image_path):
        os.makedirs(image_path)
    count = 1
    for item in objec.findall(data):
        image_dir = image_path + '\\' + '{}.html'.format(count)  # use Python 3 syntax in places like this; Python 2 syntax raises errors
        html = request.urlopen(item).read()
        with open(image_dir, 'wb') as file:
            file.write(html)  # the with block closes the file, so no explicit close() is needed
        count += 1

if __name__ == '__main__':
    URL = r'http://www.taobao.com/'
    firstReptile(URL)
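
One dead link stops the whole loop above, and the same link may appear several times on a page. The sketch below is one possible way to deal with both; extract_links and save_pages are names I made up, and the 10-second timeout is an arbitrary assumption:

```python
import os
import re
import urllib.error
import urllib.request

LINK_PATTERN = re.compile(r'<a href="(http://.+?)" ')


def extract_links(page_text):
    """Return the matched links in page order, without duplicates."""
    seen = set()
    links = []
    for link in LINK_PATTERN.findall(page_text):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


def save_pages(links, target_dir, timeout=10):
    """Save each link as 1.html, 2.html, ... and skip links that fail."""
    os.makedirs(target_dir, exist_ok=True)
    saved = 0
    for count, link in enumerate(links, start=1):
        try:
            with urllib.request.urlopen(link, timeout=timeout) as resp:
                body = resp.read()
        except (urllib.error.URLError, OSError) as exc:
            print("skipped {}: {}".format(link, exc))
            continue
        with open(os.path.join(target_dir, "{}.html".format(count)), "wb") as f:
            f.write(body)
        saved += 1
    return saved
```

Usage would be along the lines of save_pages(extract_links(data), 'D:\\test\\1'), where data is the decoded home page from firstReptile.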

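That is all for the basics of crawlers. I kept running into walls before because my Python 3 code contained Python 2-style formatted output statements. It finally works now, so onward to the next step.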