Python Web Scraping in Practice

Latest recommended article published on 2024-08-30 11:57:05

C爬爬

Article link: https://blog.csdn.net/cflcgw/article/details/85303131

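No matter how much theory you study, you still won't be able to write code unless you actually practice. I found some spare time today and wrote two common examples.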

1. Scraping images from Baidu Tieba

import requests
from lxml import etree
import json

class Tieba():
    def __init__(self,name):
        self.name = name
        self.header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko)"}

    def get_url_list(self):
        url = "https://tieba.baidu.com/f?kw="+self.name+"&ie=utf-8&pn={}&"
        url_list = []
        for i in range(5):
            url_list.append(url.format(i*50))
        return url_list

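    def parse_url(self):
        # Collect thread links from every list page, not just the last one.
        links = []
        for url in self.get_url_list():
            response = requests.get(url, headers=self.header, timeout=10)
            html = etree.HTML(response.text)
            # The class name 't_con cleafix' is kept exactly as in the original selector.
            links.extend(html.xpath("//div[@class = 't_con cleafix']//div[@class = 'threadlist_lz clearfix']/div/a/@href"))
        return links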


    def get_img(self):
        # Gather image URLs from every thread instead of keeping only the last one.
        img_urls = []
        for link in self.parse_url():
            url = "https://tieba.baidu.com" + link
            html = etree.HTML(requests.get(url, headers=self.header, timeout=10).text)
            img_urls.extend(html.xpath('//div/img[@class="BDE_Image"]/@src'))
        print(img_urls)
        return img_urls


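    def save(self):
        img_urls = self.get_img()
        # Write as UTF-8 explicitly so the output does not depend on the platform default.
        with open('test.txt', 'w', encoding='utf-8') as f:
            f.write(json.dumps(img_urls, ensure_ascii=False, indent=2))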


if __name__ == '__main__':
    tieba=Tieba("六学")
    tieba.save()
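
The paging in get_url_list can also be written with the standard library's urllib.parse.urlencode, which percent-encodes the forum name explicitly. This is only a minimal sketch: build_page_urls is a hypothetical helper, not part of the code above, and it assumes the pn parameter advances in steps of 50 per page, as in get_url_list.

```python
from urllib.parse import urlencode


def build_page_urls(name, pages=5, page_size=50):
    """Build list-page URLs for a forum name (hypothetical helper).

    Assumes `pn` is an offset that grows by `page_size` per page,
    mirroring get_url_list in the example above.
    """
    base = "https://tieba.baidu.com/f"
    urls = []
    for i in range(pages):
        # urlencode percent-encodes non-ASCII values as UTF-8.
        query = urlencode({"kw": name, "ie": "utf-8", "pn": i * page_size})
        urls.append(base + "?" + query)
    return urls


if __name__ == "__main__":
    for url in build_page_urls("六学"):
        print(url)
```

Encoding the keyword up front makes the final URL visible in logs, which can help when debugging an unexpected response.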

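I ran into a few problems along the way. At first the scraper kept returning an empty list. A Baidu search turned up the cause, and switching to a different User-Agent fixed it.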
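One way to handle the empty-list problem is to try several User-Agent strings and stop at the first one that returns any links. The sketch below is only an illustration: first_non_empty is a hypothetical helper, and fetch is assumed to be a function you supply that takes a headers dict, performs the request and XPath query, and returns a list.

```python
def first_non_empty(fetch, user_agents):
    """Return (user_agent, result) for the first non-empty fetch result.

    `fetch` is an assumed callable: it receives a headers dict and
    returns a list (for example, the links found on a page).
    Returns (None, []) when every User-Agent yields an empty list.
    """
    for ua in user_agents:
        result = fetch({"User-Agent": ua})
        if result:
            return ua, result
    return None, []


if __name__ == "__main__":
    # Stand-in for a real request: only the "browser-like" agent gets links.
    def fake_fetch(headers):
        return ["/p/123"] if "Mozilla" in headers["User-Agent"] else []

    print(first_non_empty(fake_fetch, ["python-requests", "Mozilla/5.0"]))
```

Keeping the network call behind fetch means the retry logic can be checked without hitting the site.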
2. Scraping jokes from Qiushibaike

import requests
from lxml import etree
import json


class Qiushi():
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko)"}
        self.url = "https://www.qiushibaike.com/text/page/{}/"

    def parse_url(self, url):
        # Fetch the page and parse it into an lxml element tree.
        response = requests.get(url, headers=self.headers, timeout=10)
        html = response.content.decode('utf-8')
        return etree.HTML(html)

    def parse_content(self, url):
        html = self.parse_url(url)
        contents = html.xpath("//div[@class='content']/span/text()")
        # Open the file once; encoding='utf-8' keeps the Chinese text from being garbled.
        with open('test1.txt', 'a', encoding='utf-8') as f:
            for content in contents:
                print(content)
                f.write(content)
                f.write('\n')

    def run(self):
        url = self.url.format(1)
        self.parse_content(url)


if __name__ == '__main__':
    qiushi = Qiushi()
    qiushi.run()

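Here the saved output kept coming out garbled: print(content) displayed the text correctly, but the text written to the file was garbled. Passing encoding='utf-8' to open() solved it.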
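A likely explanation is that open() without an encoding argument uses the platform's locale encoding, which is not always UTF-8. The sketch below shows the same fix in isolation, writing and reading back Chinese text with an explicit encoding. save_lines and load_lines are hypothetical helper names used only for this illustration.

```python
import os
import tempfile


def save_lines(path, lines):
    """Write each string on its own line, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def load_lines(path):
    """Read the file back with the same encoding it was written in."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    save_lines(path, ["糗事百科", "hello"])
    print(load_lines(path))
```

Using the same explicit encoding for both writing and reading avoids depending on whatever default the machine happens to have.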
I'm still a beginner, so if you spot any problems, please point them out.