Python Web Scraping for Beginners

九层台

Categories: python, web scraping

Article link: https://blog.csdn.net/qq_38204481/article/details/93649458

0x01 Environment setup

import os
import requests
from lxml import etree
from urllib.parse import urljoin
import urllib.request

Install any missing third-party package (requests and lxml here) with:
pip install <package-name>

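0x02 Introduction
This post walks through a crawler script that downloads images from a website.

The first problem I ran into was garbled Chinese text.
response.encoding=response.apparent_encoding  # fix garbled Chinese text
If that does not help, set the encoding by hand. The page declares its encoding in a meta tag under head.
response.encoding='gb2312'
What this really does is make the encoding used to decode the response match the encoding the site uses.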
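To see why the encoding has to match, here is a small offline sketch. It does not touch the network: the sample string and the choice of latin-1 as the "wrong" codec are my own assumptions for illustration, not something taken from the target site.

```python
# Simulate the raw bytes a GB2312-encoded page would send.
original = "中文网页"
raw_bytes = original.encode("gb2312")

# Decoding with the wrong codec produces garbled characters.
garbled = raw_bytes.decode("latin-1")

# Decoding with the codec the page was written in recovers the text.
recovered = raw_bytes.decode("gb2312")

print(repr(garbled))
print(recovered)
```

Setting response.encoding plays the same role for requests: it chooses the codec used when response.text decodes the raw bytes.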
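The next part follows a fixed pattern. // tells XPath to search the whole document, starting from the root.
div[@class='feilei_a'] matches div elements whose class attribute is 'feilei_a', and the trailing /a selects the a elements directly under them. The matches (strictly speaking, references to the elements) are stored in a list and narrowed down further below.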

response=requests.get(url)
response.encoding=response.apparent_encoding  # fix garbled Chinese text

root=etree.HTML(response.text)
categorys=root.xpath("//div[@class='feilei_a']/a")

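Starting from each element found above, search further to pull out the content.
text() returns the text content.
@href returns the value of the href attribute.
xpath() always returns a list, so add [0] to get the first item.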

category_name=category.xpath("text()")[0]
category_href=category.xpath("@href")[0]
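The two XPath steps can be tried without hitting the site. The sketch below runs them against a made-up HTML fragment; the link names and paths in it are hypothetical, and only the div class and the XPath expressions come from the script.

```python
from lxml import etree

# A made-up HTML fragment that mimics the structure the XPath expects.
html = """
<html><body>
  <div class="feilei_a">
    <a href="/tupian/all.html">All</a>
    <a href="/tupian/fengjing.html">Scenery</a>
  </div>
  <div class="other"><a href="/x.html">Ignored</a></div>
</body></html>
"""

root = etree.HTML(html)

# Step 1: collect the <a> elements directly under the matching div.
links = root.xpath("//div[@class='feilei_a']/a")

# Step 2: query each element again, relative to itself.
results = []
for link in links:
    name = link.xpath("text()")[0]   # xpath() returns a list
    href = link.xpath("@href")[0]
    results.append((name, href))

print(results)
```

The link inside the second div is not matched, because the predicate [@class='feilei_a'] filters on the class attribute.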

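Save the full-size image and the thumbnail.
The first argument is the URL to download, and the second is the local file path to save to.
The full-size and thumbnail URLs differ only slightly.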

urllib.request.urlretrieve(img_src,path+"/"+"缩略"+img_name)
urllib.request.urlretrieve(img_src.replace("Files","files").replace("_s.jpg",".jpg"),path+"高清"+"/"+img_name)
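The URL rewrite is easier to check in isolation. In this sketch, thumbnail_to_full is a hypothetical helper name of mine. It applies the same two replace calls as the line above to the sample thumbnail URL quoted in the full script. Whether the resulting URL is actually served by the site is not something this snippet verifies.

```python
def thumbnail_to_full(url):
    """Hypothetical helper: apply the same two replacements as the script."""
    return url.replace("Files", "files").replace("_s.jpg", ".jpg")


thumb = "http://pic2.sc.chinaz.com/Files/pic/pic9/201906/zzpic18667_s.jpg"
full = thumbnail_to_full(thumb)
print(full)
```

Note that str.replace changes every occurrence, so a URL containing "Files" elsewhere in its path would be altered there too.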

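Create the directory. The first argument is the path. The second, exist_ok, is an option: when it is True, no error is raised if the directory already exists, so the call can safely be repeated.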

os.makedirs(path,exist_ok=True)
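A quick way to see what exist_ok changes is to call os.makedirs twice on the same path. This sketch works inside a temporary directory so it leaves nothing behind. The "img/demo" path is just an example name.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "img", "demo")

    os.makedirs(path, exist_ok=True)  # creates img/ and img/demo
    os.makedirs(path, exist_ok=True)  # already exists: nothing happens
    created = os.path.isdir(path)

    # Without exist_ok, a second call on an existing directory fails.
    try:
        os.makedirs(path)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```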

0x03 Full code
The script downloads full-size images and thumbnails from http://sc.chinaz.com/tupian/ and saves them locally.

# -*- coding:utf-8 -*-
import os
import requests
from lxml import etree
from urllib.parse import urljoin
import urllib.request


# Avoid garbled text by decoding with the detected encoding

url="http://sc.chinaz.com/tupian/"
response=requests.get(url)
response.encoding=response.apparent_encoding  # fix garbled Chinese text

root=etree.HTML(response.text)
categorys=root.xpath("//div[@class='feilei_a']/a")
categorys.pop(0)  # drop the first link in the list
print(categorys)
for category in categorys:
    category_name=category.xpath("text()")[0]
    category_href=category.xpath("@href")[0]

    category_href=urljoin(url,category_href)
    print(category_name, category_href)
    path="img/"+category_name
    os.makedirs(path,exist_ok=True)
    os.makedirs(path+"高清",exist_ok=True)

    page=0
    while True:
        # Build each page URL from the unchanged first-page URL so that
        # suffixes do not pile up (e.g. _1_2.html) on later pages.
        if page==0:
            page_href=category_href
        else:
            page_href=category_href.replace(".html","_%s.html"%(page))
        response=requests.get(page_href)
        response.encoding=response.apparent_encoding
        root=etree.HTML(response.text)
        imgs=root.xpath("//div[@id='container']/div/div/a")
        # Example URL pair (full-size, then thumbnail):
        # http://pics.sc.chinaz.com/files/pic/pic9/201906/zzpic18667.jpg
        # http://pic2.sc.chinaz.com/Files/pic/pic9/201906/zzpic18667_s.jpg
        for img in imgs:
            img_name=img.xpath("img/@alt")[0]
            img_src = img.xpath("img/@src2")[0]

            print("\t",img_name,img_src)
            urllib.request.urlretrieve(img_src,path+"/"+"缩略"+img_name)
            urllib.request.urlretrieve(img_src.replace("Files","files").replace("_s.jpg",".jpg"),path+"高清"+"/"+img_name)
        if not imgs:
            break
        page+=1
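
The URL handling in the script can be checked offline. The sketch below shows how urljoin resolves a relative category link and how page URLs can be derived from the first-page URL each time, rather than from the previous page's URL. page_url is a hypothetical helper name and the category path is made up. The _N.html suffix pattern is taken from the script, not confirmed against the site.

```python
from urllib.parse import urljoin


def page_url(first_page_url, page):
    """Hypothetical helper: URL for page `page`, where 0 is the first page."""
    if page == 0:
        return first_page_url
    return first_page_url.replace(".html", "_%s.html" % page)


base = "http://sc.chinaz.com/tupian/"

# Resolve a relative href (made-up path) against the index page URL.
first = urljoin(base, "/tupian/fengjing.html")

# Every page URL is computed from `first`, so suffixes never accumulate.
urls = [page_url(first, n) for n in range(3)]
for u in urls:
    print(u)
```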
