Using Python to scrape download links from a movie website

现实很丰满

Category: python. Tags: python, 爬虫 (web scraping), requests

Article link: https://blog.csdn.net/u013329107/article/details/47333811

First, let me recommend 极客学院 to everyone. Whether it is any good, you will only know once you have tried it.

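This blog is a summary of what I have learned across various IT topics. The blogs of the experts on CSDN have taught me a great deal, and there are many other good sites as well, such as 博客园 and 脚本之家, although 脚本之家 does carry rather a lot of ads. 极客学院 is a site I only got to know in the past month. At first I registered an account myself, and it ended badly: I got only a 3-day trial even though I had bound my phone number. Three days, what a rip-off. Then someone posted a link in a chat group, and clicking it gave me a month. Only later did I learn that invitations earn extra time, and eventually that one month became a year, haha. In that time I learned a lot of things, and Python is one of them.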

That was a lot of rambling. On to the main topic.

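I. Tools:

1. A basic Python environment.

2. The requests library, which must be installed.

3. pycharm as the development environment.

4. To be clear, everything here was done on Windows. I can't afford a fancy Mac (if you want to send me money, leave a comment ^_^).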
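To check that the setup above is complete before running the script, a small sketch like the following may help. The helper name `is_installed` is made up for illustration, and the snippet assumes Python 3, whereas the script in this article targets Python 2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# "re" ships with Python; "requests" has to be installed separately
# (for example with: pip install requests).
for name in ("re", "requests"):
    print(name, "available" if is_installed(name) else "missing")
```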

II. What you need to know

1. Python basics. Go learn them at 极客学院.

2. Regular expression basics. If you are not familiar with them, go learn them.
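For readers new to regular expressions, here is a small sketch of the pattern style the script in this article relies on: a non-greedy capture group `(.*?)` combined with `re.S`, so that `.` also matches newlines. The HTML in `SAMPLE_HTML` is made up for illustration and is not copied from the real site. The snippet is written for Python 3.

```python
import re

# Made-up HTML for illustration only; not copied from the real site.
SAMPLE_HTML = """<div class="title_all">
  <a href='/i/1.html'>Movie A</a>
</div>
<div class="title_all"><a href='/i/2.html'>Movie B</a></div>"""

# (.*?) is non-greedy: it stops at the first following </div>.
# re.S lets "." match newlines, so a div spanning several lines is still captured.
blocks = re.findall(r'<div class="title_all">(.*?)</div>', SAMPLE_HTML, re.S)

# Pull the single-quoted href values out of each captured block.
links = [href for block in blocks for href in re.findall(r"<a href='(.*?)'", block)]

# A greedy (.*) instead swallows everything up to the last </div>.
greedy = re.findall(r'<div class="title_all">(.*)</div>', SAMPLE_HTML, re.S)

print(links)
print(len(blocks), len(greedy))
```

Note that the inner search runs on each captured `block`, not on the whole page, which is what keeps the results tied to the div they came from.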

III. Code analysis

# -*- coding: utf-8 -*-
# Note: this is Python 2 code (print statement, reload(sys)).
import re
import sys

import requests

# Python 2 only: make utf-8 the default codec so unicode text can be
# printed and written to files without encoding errors.
reload(sys)
sys.setdefaultencoding('utf8')

BASE_URL = 'http://www.dy2018.com/'
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'}

# Fetch the home page once and send the User-Agent header with the request.
html = requests.get(BASE_URL, headers=header)
html.encoding = 'gbk'  # the site is served as gbk
# print html.text

i = 0
# Every single-quoted href on the home page is treated as a candidate detail page.
# (The original wrapped this in two outer loops whose variables were never used,
# so the same pages were requested over and over.)
for href in re.findall('<a href=\'(.*?)\'', html.text, re.S):
    url = BASE_URL + href
    htmlChild = requests.get(url, headers=header)
    htmlChild.encoding = 'gbk'
    # Download addresses sit in a cell with bgcolor="#fdfddf"; the link text starts with "ftp".
    lianjie = re.findall('bgcolor="#fdfddf"><a href="(.*?)">ftp', htmlChild.text, re.S)
    if lianjie:
        # One output file per detail page; the with-block closes the file.
        with open('F:/document/python/' + str(i) + '.txt', 'wb+') as f:
            for link in lianjie:
                print link
                f.write(link + '\n')
    i += 1

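That is roughly the code. Please forgive the rough edges.
1. Import the required libraries:
requests
re: regular expressions
sys: used here to prevent garbled text
2. requests.get(url) fetches the source of the page at that address.
3. html.encoding='gbk': the site uses gbk, so the encoding must match, otherwise the text comes out garbled again.
4. open() is used to work with files.
5. re.findall() uses a regular expression to search for the information you need.

6. print. Used sensibly, it makes development much more convenient. Breakpoint debugging in pycharm works too.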
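The decoding, searching, and file-writing steps listed above can be tried out without touching the network. What follows is a minimal Python 3 sketch, not the listing from this article (which is Python 2). The function names `decode_page`, `extract_ftp_links`, and `save_links`, the sample page, and the `example.invalid` address are all made up for illustration; only the `gbk` encoding and the `bgcolor="#fdfddf"` pattern come from the script.

```python
import os
import re
import tempfile

# Pattern taken from the article's script: the link sits in a cell with this
# bgcolor and its visible text starts with "ftp".
FTP_LINK_RE = re.compile(r'bgcolor="#fdfddf"><a href="(.*?)">ftp', re.S)


def decode_page(raw_bytes, encoding="gbk"):
    """Decode raw page bytes with the encoding the site actually uses."""
    return raw_bytes.decode(encoding, errors="replace")


def extract_ftp_links(page_text):
    """Return every download address found in one detail page."""
    return FTP_LINK_RE.findall(page_text)


def save_links(links, path):
    """Write one link per line as UTF-8 text; the with-block closes the file."""
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")


# Made-up detail page, encoded as gbk to mimic how the site serves its pages.
sample_bytes = (
    '<td bgcolor="#fdfddf"><a href="ftp://example.invalid/电影.mkv">'
    'ftp://example.invalid/电影.mkv</a></td>'
).encode("gbk")

text = decode_page(sample_bytes)
links = extract_ftp_links(text)

out_path = os.path.join(tempfile.mkdtemp(), "0.txt")
save_links(links, out_path)
print(links)

# Decoding the same bytes with the wrong codec garbles the Chinese characters,
# which is why the encoding has to match the site.
wrong = sample_bytes.decode("utf-8", errors="replace")
print(extract_ftp_links(wrong) == links)
```

Keeping the parsing in small functions that take text rather than a URL makes it possible to test the regular expression against saved pages before pointing the script at the live site.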

Screenshot of the result: