Python project: a crawler for the "妹子图" (girl pictures) on jandan.net, part 2

The functions are as follows (a small sanity check of the merged URL file is sketched right after this list):

  1. Read every individual per-page txt file, merge them into one TXT file, and collect the URLs into all_imag_urls
    read_write_txt_to_main()
  2. Read the image URLs back from that single TXT file
    get_url()
  3. Download each image and save it locally
    get_imags(all_imag_urls)
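For reference, the merged all_imgs.txt produced in step 1 is simply one image URL per line. The snippet below is my own small sanity check (not part of the original script, and it assumes that one-URL-per-line layout) to confirm the merged file looks right before starting the downloads:

```
# Count and peek at the URLs collected in all_imgs.txt
# (assumes one image URL per line, as written by read_write_txt_to_main()).
with open("all_imgs.txt", "r") as f:
    urls = [line.strip() for line in f if line.strip()]
print(len(urls), "image URLs ready to download")
print(urls[:3])  # peek at the first few entries
```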

The final result is shown below.

(result screenshot omitted)

The full source code follows:

# coding:utf-8
####################################################
# coding by 刘云飞
####################################################

import requests
import os
import time
import random
from bs4 import BeautifulSoup
import threading

ips = []
all_imag_urls = []

# Load the proxy list; ip2.txt is expected to contain one "ip:port" entry per line
with open('ip2.txt', 'r') as f:
    lines = f.readlines()
    for line in lines:
        ip_one = "http://" + line.strip()
        ips.append(ip_one)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/42.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Referer': 'http://jandan.net/ooxx/',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
}


def read_write_txt_to_main():
    # Merge every per-page txt file (1520.txt .. 1880.txt) into all_imgs.txt
    for i in range(1520, 1881):
        filename = str(i) + ".txt"
        if os.path.exists(filename):
            print(filename + " OK")
            with open(filename, 'r') as f:
                urls = f.readlines()
                for url in urls:
                    all_imag_urls.append(url.strip())
    with open("all_imgs.txt", 'w+') as fw:
        for url in all_imag_urls:
            fw.write(url + "\n")
    print("write file ok!!!!!")


def get_url():
    # Read the image URLs back from the merged file into all_imag_urls
    with open("all_imgs.txt", 'r') as fw:
        urls = fw.readlines()
        for url in urls:
            all_imag_urls.append(url.strip())


def get_imags(urls):
    for url in urls:
        url_a = url.strip()
        # The URLs are assumed to look like http://wwX.sinaimg.cn/mw600/<name>:
        # characters 7..21 are the host, everything after position 28 is the file name
        filename = url_a[28:]
        if not os.path.exists(filename):
            host = url_a[7:21]
            headers['Host'] = host
            single_ip_addr = random.choice(ips)
            proxies = {'http': single_ip_addr}
            try:
                res = requests.get(url_a, headers=headers, proxies=proxies, stream=True)
                print(res.status_code)
                if res.status_code == 200:
                    with open(filename, 'wb') as jpg:
                        jpg.write(res.content)
                    print(filename + "  OK")
                else:
                    print(filename + "  not ok")
            except requests.RequestException:
                print(filename + "  not ok")


''' Merge all of the individual txt files into one TXT file and collect the URLs in all_imag_urls. '''
# read_write_txt_to_main()

''' Read the image URLs from the merged TXT file '''
get_url()

''' Download every image and save it locally '''
get_imags(all_imag_urls)

print("All images have been saved. Enjoy!!")
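The script imports threading but never uses it. A natural extension is to split all_imag_urls into chunks and run get_imags on each chunk in its own thread. The sketch below is only an illustration of that idea (my addition, not the author's code); note that get_imags mutates the shared headers dict, so a real multi-threaded version should give each thread its own copy of the headers:

```
def download_in_threads(urls, num_threads=4):
    # Split the URL list into num_threads roughly equal chunks and
    # download each chunk on its own thread via get_imags().
    if not urls:
        return
    chunk_size = (len(urls) + num_threads - 1) // num_threads
    threads = []
    for i in range(0, len(urls), chunk_size):
        t = threading.Thread(target=get_imags, args=(urls[i:i + chunk_size],))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

# download_in_threads(all_imag_urls)
```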