Scraping jandan.net's 无聊图 (boring pictures) and 妹子图 (girl pictures) with a Python crawler

This post shows how to scrape the boring-picture and girl-picture sections of jandan.net with a Python crawler; it may be useful as a reference.

I am new to Python and wrote this program as a practice exercise.

The crawler fetches the boring-picture and girl-picture sections of jandan.net and saves the images to the local disk.


HTML parsing is done with the pyquery package, which needs to be installed separately (e.g. with pip install pyquery).
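
A minimal sketch of how pyquery is used here: load a page by URL and select elements with CSS selectors. The URL below comes from the program; the 'li img' selector is only illustrative.

# pyquery usage sketch (the selector is illustrative)
from pyquery import PyQuery as pq

d = pq(url='http://jandan.net/pic/page-1')   # fetch and parse the page
for img in d('li img').items():              # iterate over matched <img> elements
    print img.attr('src')                    # read an attribute such as the image URL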

Images are downloaded under D:/Download/Python/ by default; boring pictures go into the pic directory and girl pictures into the ooxx directory.

At startup the program asks for three inputs: the start page number, the end page number, and whether to fetch boring pictures or girl pictures. For example, entering 1, 3 and 0 crawls http://jandan.net/pic/page-1 through page-3 and saves the images under D:/Download/Python/pic/1-3/.

The program:

# -*- coding: utf-8 -*-
"""
Created on Mon Dec 29 13:36:37 2014

@author: Gavin
"""

# Python 2 workaround: set the default string encoding to UTF-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from pyquery import PyQuery as pq
from time import ctime
import time
import re
import os
import urllib

def main(page_start, page_end, flag):
    # flag: 1 -> girl pictures (ooxx), 0 -> boring pictures (pic)
    file_path_pre = 'D:/Download/Python/'
    folder_name = 'ooxx' if flag else 'pic'
    page_url = 'http://jandan.net/' + folder_name + '/page-'
    # images for this run go into e.g. D:/Download/Python/pic/1-3/
    folder_name = file_path_pre + folder_name + '/' + str(page_start) + '-' + str(page_end) + '/'
    for page_num in range(page_start, page_end + 1):
        crawl_page(page_url, page_num, folder_name)
            
def crawl_page(page_url, page_num, folder_name):
    page_url = page_url + str(page_num)
    print 'start handle', page_url
    print '', 'starting at', ctime()
    t0 = time.time()
    page_html = pq(url=page_url)  # fetch and parse the page HTML
    # each post sits in an <li id="comment-xxxx"> element
    comment_id_patt = r'<li id="comment-(.+?)">'
    comment_ids = re.findall(comment_id_patt, page_html.html())
    name_urls = {}
    for comment_id in comment_ids:
        name_url = dispose_comment(page_html, comment_id)
        if name_url:
            name_urls.update(name_url)
    if not os.path.exists(folder_name):
        print '', 'new folder', folder_name
        os.makedirs(folder_name)
    for name_url in name_urls.items():
        file_path = folder_name + 'page-' + str(page_num) + name_url[0]
        img_url = name_url[1]
        if not os.path.exists(file_path):
            print '', 'start download', file_path
            #print '', 'img_url is', img_url
            urllib.urlretrieve(img_url, file_path)
        else:
            print '', file_path, 'is already downloaded'
    print 'finished at', ctime(), ', total time', time.time() - t0, 's'
        
def dispose_comment(page_html, comment_id):
    name_url_dict = {}
    id = '#comment-' + comment_id
    comment_html = page_html(id)
    # oo = "like" votes, xx = "dislike" votes for this post
    oo_num = int(comment_html(id + ' #cos_support-' + comment_id).text())
    xx_num = int(comment_html(id + ' #cos_unsupport-' + comment_id).text())
    # Python 2 integer division: the ratio is > 0 only when oo_num >= xx_num
    oo_to_xx = oo_num / xx_num if xx_num != 0 else oo_num
    # keep only posts with more than one like and at least as many likes as dislikes
    if oo_num > 1 and oo_to_xx > 0:
        imgs = comment_html(id + ' img')
        for i in range(0, len(imgs)):
            # prefer the original image URL (org_src) over the displayed src
            org_src = imgs.eq(i).attr('org_src')
            src = imgs.eq(i).attr('src')
            img_url = org_src if org_src else src
            if img_url:
                img_suffix = img_url[-4:]
                if not img_suffix.startswith('.'):
                    img_suffix = '.jpg'
                # file name encodes the comment id, vote counts and image index
                img_name = id + '_oo' + str(oo_num) + '_xx' + str(xx_num) + (('_' + str(i)) if i != 0 else '') + img_suffix
                name_url_dict[img_name] = img_url
            else:
                print '***url not exist'
    return name_url_dict
              
if __name__ == '__main__':
    page_start = int(raw_input('Input start page number: '))
    page_end   = int(raw_input('Input end   page number: '))
    is_ooxx    = int(raw_input('Select 0: boring pictures  1: girl pictures '))
    main(page_start, page_end, is_ooxx)
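
Note that the listing above is Python 2 code (print statements, raw_input, reload(sys)/setdefaultencoding, urllib.urlretrieve) and will not run unmodified under Python 3, where raw_input becomes input() and print becomes a function. As a rough, assumed sketch (not part of the original program), the download step would look like this when ported to Python 3:

# Python 3 sketch of the download step (assumed port, not from the original program)
import os
import urllib.request

def download_image(img_url, file_path):
    # mirror the original check: only download files that are not on disk yet
    if not os.path.exists(file_path):
        urllib.request.urlretrieve(img_url, file_path)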


from: http://www.aichengxu.com/view/40724
