Python crawler code for De下载站 (dexiazai.cc)

Crawler code for pulling movie download links from De下载站. Because no proxy was configured, the crawler got blocked after roughly 800 pages; proxy support still needs to be added.
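
The blocking problem above is the main thing still to fix. A minimal sketch of how a proxy could be wired into `urllib.request` is shown here, assuming a placeholder proxy address (`127.0.0.1:8080` is illustrative, not a working endpoint); installing the opener before the script below runs makes every later `urlopen()` call go through it. Adding a short `time.sleep()` between pages in `main()` would also keep the request rate polite.

```python
import urllib.request

# Placeholder proxy address for illustration only; replace with a real HTTP proxy.
proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # subsequent urlopen() calls now use the proxy
```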
```python
import urllib.request
import bs4


class getLink(object):
    def __init__(self, url):
        self.url = url

    def main(self):
        # "w" mode already truncates the file, so no separate truncate() call is needed
        with open("down.txt", "w", encoding='utf-8') as downFile:
            i = 0
            page = 1
            for urlSingle in self.url:
                result = self.getResult(urlSingle)
                print("Page %d" % page)
                downFile.write("Page %d\n" % page)
                page += 1
                for rs in result:
                    pid, Name = self.getInfo(rs)
                    DownUrl0, DownUrl1 = self.getDownUrl(pid)
                    i += 1
                    print("*******************************************")
                    print("Crawling item %d, movie title: %s" % (i, Name))
                    downFile.write("--------")
                    downFile.write("No. %d %s\n" % (i, Name))
                    downFile.write("English audio, Chinese subs: " + DownUrl0 + "\n")
                    downFile.write("Chinese/English dual subs: " + DownUrl1 + "\n")

    def getResult(self, url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request).read().decode('utf-8')
        # Parse the list page and collect every entry block (pid and movie name live inside it)
        bs = bs4.BeautifulSoup(response, "lxml")
        result = bs.find_all(class_="main_top")
        return result

    def getInfo(self, result):
        # Movie title
        Name = result.find('a').getText()
        # Link to the detail page
        href = result.find('a').get('href')
        # Extract the pid from the detail-page URL
        str1 = href.split('.')
        str2 = str1[2].split('/')
        pid = str2[4]
        return pid, Name

    def getDownUrl(self, pid):
        # The two download links only differ in the linkn parameter
        DownUrl0 = "http://www.dexiazai.cc/newdown/?pid=" + pid + "&linkn=0"
        DownUrl1 = "http://www.dexiazai.cc/newdown/?pid=" + pid + "&linkn=1"
        return DownUrl0, DownUrl1


if __name__ == '__main__':
    url = []
    for i in range(1467):
        url.append("http://www.dexiazai.cc" + "/plus/list.php?tid=50&PageNo=" + str(i))
    Link = getLink(url)
    # The original pool.map_async(Link.main()) ran main() at once and passed its return
    # value (None) to the pool, so the Pool never did any work; run the crawl serially.
    Link.main()
```
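
The original ending, `pool.map_async(Link.main())`, called `main()` immediately and handed its return value (`None`) to the pool, so the four worker processes never did anything; that is why the script above just calls `Link.main()` directly. If per-page parallelism is wanted, one possible restructuring is sketched below; `crawl_page` is an illustrative helper, not part of the original code, and writing the file in the parent process avoids four workers truncating the same `down.txt`.

```python
from multiprocessing import Pool

def crawl_page(args):
    """Worker: fetch one list page and return its formatted output lines."""
    page_no, page_url = args
    link = getLink([page_url])            # reuse the parsing helpers from the class above
    lines = ["Page %d" % page_no]
    for rs in link.getResult(page_url):
        pid, name = link.getInfo(rs)
        down0, down1 = link.getDownUrl(pid)
        lines.append("%s | EN audio + CN subs: %s | CN/EN dual subs: %s" % (name, down0, down1))
    return lines

if __name__ == '__main__':
    urls = ["http://www.dexiazai.cc/plus/list.php?tid=50&PageNo=%d" % i
            for i in range(1467)]
    with Pool(4) as pool:                 # 4 worker processes, matching the original Pool(4)
        pages = pool.map(crawl_page, enumerate(urls, start=1))
    # Write everything from the parent process so workers never share a file handle.
    with open("down.txt", "w", encoding="utf-8") as f:
        for lines in pages:
            f.write("\n".join(lines) + "\n")
```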

Python basics learning path: click to open the link
