Web Scraping Notes

CodeJames

Category: Python Learning. Tags: web scraping, requests.text

Article link: https://blog.csdn.net/qq_31900497/article/details/81359493

I. General steps of a web crawler:

1. Download the data; 2. Parse the data; 3. Save the data.
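The three steps can be sketched as three small functions chained together. This is only an illustration: the function names, the URL, and the canned HTML are assumptions made up for the example, and the download step is stubbed so the sketch runs without a network connection.

```python
import re

def download(url):
    """Step 1: fetch the page. Stubbed with canned HTML so the sketch runs offline."""
    return '<a href="/item/1">one</a><a href="/item/2">two</a>'

def parse(html):
    """Step 2: pull the wanted fields out of the raw page."""
    return re.findall(r'href="(/item/\d+)"', html)

def save(items, store):
    """Step 3: persist the results (here, simply append them to a list)."""
    store.extend(items)

store = []
save(parse(download("http://example.com/list")), store)
print(store)
```

In a real crawler the stub would be replaced by an HTTP request and the list by files or a database.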

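II. Cookies: small pieces of data that a site stores on the user's local device and that the client sends back with later requests to that site.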
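To make the idea concrete, here is a small sketch using the standard library's http.cookies module rather than requests. The header value is invented for the example; it stands in for a Set-Cookie header a server might send after login.

```python
from http.cookies import SimpleCookie

# Made-up Set-Cookie header value, as a server might send it
header = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(header)

# Each cookie is a name/value pair plus attributes such as Path
name = "sessionid"
print(cookie[name].value, cookie[name]["path"])

# What the client sends back in the Cookie header of later requests
cookie_header = "; ".join("%s=%s" % (key, morsel.value) for key, morsel in cookie.items())
print(cookie_header)
```

A requests.Session does this bookkeeping automatically, which is one reason the downloader in these notes goes through self.session.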

III. Using try/except:

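    # Downloader (a method of the crawler class; needs `import requests` at module level)
    def download(self, url):
        try:
            # Return the response
            return self.session.get(url)
        except requests.RequestException as e:
            # Report the failure and return None explicitly so callers can test `if response:`
            print(e)
            return None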
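For context, the method above presumably lives in a class that owns a requests.Session. The following is a minimal sketch under that assumption; the class name Downloader, the timeout default, and the raise_for_status() call are additions for illustration and are not part of the original code.

```python
import requests

class Downloader:
    """Minimal sketch of a class that owns a session and a guarded download method."""

    def __init__(self):
        # A Session reuses connections and keeps cookies between requests
        self.session = requests.Session()

    def download(self, url, timeout=10):
        try:
            response = self.session.get(url, timeout=timeout)
            # Turn 4xx/5xx answers into exceptions so they are handled below
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print("download failed:", e)
            return None

# A malformed URL is rejected before any network traffic, so this runs offline
print(Downloader().download("not-a-url"))
```

Catching requests.RequestException instead of a bare Exception keeps programming errors (a typo, a wrong attribute) visible while still absorbing network failures.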

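IV. The difference between response.text and response.content

response.text

Type: str

Decoding: requests makes an educated guess at the encoding from the HTTP headers and decodes the body with that guessed encoding

To change the encoding: response.encoding="gbk" (set it before reading response.text)

response.content

Type: bytes

Decoding: none; these are the raw bytes of the response body

To decode manually: response.content.decode("utf8")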
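The difference can be seen without making a request. In the sketch below, a plain bytes object stands in for response.content, and the two decode calls stand in for what response.text yields under a wrong and a right encoding. The sample string and the choice of iso-8859-1 as the wrong guess are assumptions for the example.

```python
# Stand-in for response.content: the raw bytes a server might send for a GBK page
raw = "爬虫".encode("gbk")

# Roughly what response.text looks like when the guessed encoding is wrong
wrong = raw.decode("iso-8859-1")

# Equivalent of setting response.encoding = "gbk" before reading response.text
right = raw.decode("gbk")

print(type(raw).__name__, type(right).__name__)
print(right == "爬虫", wrong == "爬虫")
```

For binary data such as images there is nothing to decode, so response.content is the one to write to disk.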

V. Regular expressions: their main use is to search a given string for content that matches a specific pattern.

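            html = response.text
            # Extract the gallery IDs from the HTML (dots are escaped so they match literally)
            ids = re.findall(r'http://tu\.duowan\.com/gallery/(\d+)\.html', html)
            # Return the IDs with duplicates removed
            return set(ids)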
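The fragment above can be tried on its own with a hand-written page. In this sketch the HTML snippet and the function name extract_gallery_ids are made up for illustration; the pattern is the one from the notes, with the dots escaped.

```python
import re

# Hypothetical HTML snippet standing in for response.text
html = '''
<a href="http://tu.duowan.com/gallery/1001.html">A</a>
<a href="http://tu.duowan.com/gallery/1002.html">B</a>
<a href="http://tu.duowan.com/gallery/1001.html">A again</a>
'''

GALLERY_RE = re.compile(r'http://tu\.duowan\.com/gallery/(\d+)\.html')

def extract_gallery_ids(page):
    """Return the unique gallery IDs found in the page."""
    # findall returns only the captured group (\d+) for each match
    return set(GALLERY_RE.findall(page))

ids = extract_gallery_ids(html)
print(sorted(ids))
```

Because the pattern has a single capture group, re.findall returns just the digits, and set() collapses the repeated link.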

VI. How to save the images of a gallery

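    # Persist a gallery to disk from its metadata (needs `import os` at module level)
    def save_img(self, img_item_info):
        # `strip` is a helper not shown in these notes; presumably it cleans the title for use as a name
        dir_name = strip(img_item_info['gallery_title'].strip())
        print(dir_name)
        # Create the gallery folder if it does not exist yet
        if not os.path.exists(dir_name):
            os.makedirs(dir_name)
        for img_info in img_item_info['picInfo']:
            # Image name, taken from its title
            img_name = strip(img_info['title'].strip())
            # Image URL
            img_url = img_info['url']
            # File extension: the text after the last dot in the URL's final path segment
            pix = img_url.split('/')[-1].split('.')[-1]
            # Full path of the image file: <dir_name>/<img_name>.<extension>
            img_path = os.path.join(dir_name, "%s.%s" % (img_name, pix))
            # Skip images that have already been downloaded
            if not os.path.exists(img_path):
                response = self.download(img_url)
                if response:
                    # Write the raw bytes; image data must not be decoded as text
                    img_data = response.content
                    with open(img_path, 'wb') as f:
                        f.write(img_data)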
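The path-building and write-once logic can be exercised without downloading anything. The sketch below is an assumption-laden illustration: sanitize_name is a hypothetical stand-in for the strip helper that the notes do not show, and the title, URL, and byte string are invented.

```python
import os
import re
import tempfile

def sanitize_name(name):
    """Hypothetical helper: trim whitespace and replace characters that are invalid in file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', name.strip())

def build_image_path(dir_name, title, url):
    """Join the folder, the cleaned title, and the extension taken from the URL."""
    ext = url.split('/')[-1].split('.')[-1]
    return os.path.join(dir_name, "%s.%s" % (sanitize_name(title), ext))

def save_bytes(path, data):
    """Write raw bytes to path, creating the folder if needed; skip files that already exist."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):
        return False
    with open(path, 'wb') as f:
        f.write(data)
    return True

# Usage with made-up values, written to a temporary folder
root = tempfile.mkdtemp()
path = build_image_path(os.path.join(root, "gallery"), " cat: 1/2 ", "http://example.com/img/abc.jpg")
print(os.path.basename(path))
print(save_bytes(path, b"fake image bytes"), save_bytes(path, b"fake image bytes"))
```

Checking for an existing file before writing means an interrupted crawl can be restarted without fetching the same images again.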

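VII. Whether to use requests.post() or requests.get() depends on the request the browser itself sends: open the page in the browser, inspect the request in the developer tools, and use the same method.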
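The practical difference between the two is where the parameters travel. The sketch below only prepares the requests and sends nothing, so it runs offline; the URL and the parameter are placeholders, not taken from any real site.

```python
import requests

url = "http://example.com/search"  # placeholder URL

# GET: parameters travel in the query string
get_req = requests.Request("GET", url, params={"q": "cat"}).prepare()
print(get_req.method, get_req.url, get_req.body)

# POST: form fields travel in the request body
post_req = requests.Request("POST", url, data={"q": "cat"}).prepare()
print(post_req.method, post_req.url, post_req.body)
```

If the browser's request shows form data in its body, requests.post(url, data=...) is the likely match; if the values appear in the address itself, requests.get(url, params=...) is.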
