Python Notebook Web Scraping Case Study 2 / Dapps

哈老师的区块链

Last modified 2022-10-31 09:42:48

Tags: python, web scraping, programming languages

First published 2022-10-31 09:42:43

Article link: https://blog.csdn.net/m0_56574080/article/details/127609032

Along with the first task, Professor Shen Donghua (申东华) gave me a second one.

Top Blockchain Dapps | DappRadar

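On this site, we need the name and categories of every Dapp listed under BNB, and we need to open the link behind each Dapp's icon (the original post shows a screenshot here).

First, check whether each Dapp page has a "Buy CAKE" style button. If it does, scrape the link behind that button; if it does not, record that there is none.

Then click the Smart Contract button further down the page. It shows how many smart contracts are tracked in total, and from there we need to scrape every smart contract that is displayed.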
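One way to check that nothing was missed is to read the advertised total from the button text and compare it with the number of rows actually scraped. The sketch below assumes the text has the shape it had when this post was written ("For this data we are tracking: 10 Alien Worlds Smart Contracts"); `parse_contract_count` is a hypothetical helper, not part of the original script.

```python
import re

def parse_contract_count(text):
    """Return N from '... N <Dapp name> Smart Contracts', or None if absent."""
    # \s also matches the non-breaking spaces (&nbsp;) used in the page text
    match = re.search(r"(\d[\d,]*)\s.*?Smart\s+Contracts?", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

# Sample text assumed from the page markup at the time of writing
label = "For this data we are tracking:10\xa0Alien Worlds\xa0Smart Contracts"
print(parse_contract_count(label))  # prints 10
```

If the parsed total is larger than the number of scraped rows for a Dapp, that page is worth re-running.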

from bs4 import BeautifulSoup
import requests
import pandas as pd
# Import webdriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time

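options = webdriver.ChromeOptions()
# Headless mode only takes effect if this object is passed as webdriver.Chrome(options=options)
options.add_argument('--headless')

url_basic = 'https://dappradar.com'
# Spreadsheet with one relative Dapp link per row in the 'href' column
df = pd.read_excel('with_href_1_171.xlsx')
href_list = df['href'].tolist()
len(href_list)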

column_list = ['Blockchain_Dapps_link','smart_contract_name','smart_contract_link']
all_tokens = pd.DataFrame(columns=column_list)
all_tokens.loc[0] = [0,0,0]  # placeholder first row; filter it out before analysis

# This notebook handles indices 700-999; the bound is clamped to the list length
for i in range(700, min(1000, len(href_list))):
    driver = webdriver.Chrome()
    Blockchain_Dapps_link = url_basic + href_list[i]
    driver.get(Blockchain_Dapps_link)

    driver.implicitly_wait(10)
    time.sleep(2)

    # The "For this data we are tracking: N ... Smart Contracts" element opens the contract list
    xpath = "//*[contains(text(),'we are tracking')]"
    original_t = time.time()
    # Note: this retries indefinitely if the page never shows the element
    while True:
        try:
            # Re-locate on every attempt, because old references go stale after a refresh
            driver.find_elements(by=By.XPATH, value=xpath)[0].click()
            print(i)
            break
        except Exception:
            # Not found or not clickable yet: refresh after 5 s, otherwise wait and retry
            if time.time() - original_t >= 5:
                driver.refresh()
                original_t = time.time()
            else:
                time.sleep(2)

    time.sleep(2)  # give the contract list time to render

    html = driver.page_source
    driver.quit()
    bs = BeautifulSoup(html, 'lxml')
    # These generated class names seem to change daily; update the selector before each run.
    # The clicked element looked like: <div class="sc-bMthRQ crCteR">For this data we are tracking: ...
    subjects = bs.select('li div[class="sc-jxgYAQ eOGoAs"]')

    tokens = pd.DataFrame(columns=column_list)
    for j in range(len(subjects)):
        smart_contract_name = subjects[j].get_text()
        smart_contract_link = subjects[j].find_all('a', href=True)[0]['href']
        tokens.loc[j] = [Blockchain_Dapps_link, smart_contract_name, smart_contract_link]

    all_tokens = pd.concat([all_tokens, tokens])

    # Checkpoint after every page so an interrupted run keeps its results
    all_tokens.to_csv('smart_contract_700.csv')
all_tokens.to_excel('smart_contract_.xlsx')

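The hardest part of this task is clicking that button. In Jupyter Notebook, the first click sometimes produces no response, and a second click is needed before the result appears.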
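The "click again if nothing happened" idea can be sketched as a small bounded retry helper. This is only an illustration: `click_until`, its parameters, and the simulated button are hypothetical, and with Selenium you would pass in a function that clicks the element and another that checks whether the contract list has appeared.

```python
import time

def click_until(click, succeeded, attempts=5, delay=2.0):
    """Call click() until succeeded() is true.

    Returns the number of clicks used, or 0 if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            click()
        except Exception:
            pass  # element missing or not clickable yet; try again
        time.sleep(delay)  # give the page time to react before checking
        if succeeded():
            return attempt
    return 0

# Simulated button that only responds to the second click
state = {"clicks": 0}

def fake_click():
    state["clicks"] += 1

print(click_until(fake_click, lambda: state["clicks"] >= 2, delay=0))  # prints 2
```

Unlike an unbounded `while True` loop, this gives up after a fixed number of attempts, so a page that never shows the button cannot stall the whole run.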

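As for locating the button: current versions of Selenium no longer provide the old find_element_by_class_name style helpers, but find_element and find_elements accept several locator strategies (By.CLASS_NAME, By.CSS_SELECTOR, By.XPATH and so on). When adapting this code, you have to choose the strategy and supply the class name or expression yourself.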
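As a rough illustration of choosing the locator yourself, the sketch below turns a multi-class attribute value into a CSS selector that can be handed to Selenium's generic finder. `css_for_classes` is a hypothetical helper, and the class names are the ones seen on the page at the time, which will likely have changed since.

```python
def css_for_classes(class_attr, tag=""):
    """Turn a class attribute such as 'sc-bMthRQ crCteR' into a CSS selector."""
    return tag + "".join("." + name for name in class_attr.split())

selector = css_for_classes("sc-bMthRQ crCteR", tag="div")
print(selector)  # prints div.sc-bMthRQ.crCteR

# With Selenium 4 the selector would be passed to the generic finder, e.g.:
#   from selenium.webdriver.common.by import By
#   elements = driver.find_elements(By.CSS_SELECTOR, selector)
# By.CLASS_NAME is meant for a single class name, while By.XPATH (used in
# this post) can match on visible text, which survives class-name changes.
```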

Next, we try to click the BUY TOKEN button and, if the page has one, record the URL it leads to.

all_tokens_2 = pd.DataFrame(columns=['Blockchain_Dapps_link','Buy_token','address'])
all_tokens_2.loc[0] = [0,0,0]  # placeholder first row; filter it out before analysis

for i in range(3000, len(href_list)):
    driver = webdriver.Chrome()
    # Request the target URL
    Blockchain_Dapps_link = url_basic + href_list[i]
    driver.get(Blockchain_Dapps_link)
    driver.implicitly_wait(10)
    time.sleep(2)

    # Any element whose text contains "Buy " (for example "Buy CAKE")
    BuyButton = driver.find_elements(by=By.XPATH,
                                    value="//*[contains(text(),'Buy ')]")

    tokens_2 = pd.DataFrame(columns=['Blockchain_Dapps_link','Buy_token','address'])
    try:
        BuyButton[0].click()  # raises IndexError when the page has no Buy button
        time.sleep(2)
        # current_url only reflects the target if the link opens in the same tab
        linking = driver.current_url
        buy_token = 1
    except Exception:
        print('Oof!')
        buy_token = 0
        linking = 0
    tokens_2.loc[i] = [Blockchain_Dapps_link, buy_token, linking]

    driver.quit()

    all_tokens_2 = pd.concat([all_tokens_2, tokens_2])
    # Checkpoint after every page so an interrupted run keeps its results
    all_tokens_2.to_csv('buy_token_3000_.csv')
all_tokens_2.to_excel('buy_token_3000_.xlsx')

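Because of the click problem mentioned above, I went with a rather crude approach here: just attempt to click the button on every page, and if that fails, record that there is no button and move on to the next page.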
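If the Buy button happens to be an ordinary `<a>` element, its target could also be read straight from the `href` attribute of the page source, with no click at all. That is an assumption about the page markup, not something the script above relies on; `BuyLinkFinder` and `find_buy_links` are hypothetical names, and the sample HTML is made up for the demonstration.

```python
from html.parser import HTMLParser

class BuyLinkFinder(HTMLParser):
    """Collect (label, href) for every <a> whose text starts with 'Buy '."""

    def __init__(self):
        super().__init__()
        self._in_a = False   # True while inside an <a> element
        self._href = None    # href of the <a> currently open
        self._text = []      # text fragments seen inside that <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            label = "".join(self._text).strip()
            if label.startswith("Buy ") and self._href:
                self.links.append((label, self._href))
            self._in_a = False

def find_buy_links(html):
    finder = BuyLinkFinder()
    finder.feed(html)
    return finder.links

# Made-up markup; in the scraper the input would be driver.page_source
sample = '<div><a href="https://example.com/swap"><span>Buy CAKE</span></a><a href="/x">Other</a></div>'
print(find_buy_links(sample))
```

An empty result would be recorded the same way as a failed click: no Buy button on that page.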

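While running this, I found that most Dapps have no Buy token button, and many of the pages without one take a long time to process, since find_elements waits out the full 10-second implicit wait before returning an empty list. Still, when the crawl is spread across multiple cores from the notebook, the overall run time is acceptable.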
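One simple way to spread the work is to give each notebook or process its own contiguous slice of `href_list`, in the same spirit as the hard-coded ranges used in the loops above. The helper below is a sketch; `split_ranges` is a hypothetical name and the worker count is arbitrary.

```python
def split_ranges(start, stop, workers):
    """Split range(start, stop) into `workers` contiguous (lo, hi) chunks."""
    base, extra = divmod(stop - start, workers)
    ranges = []
    lo = start
    for k in range(workers):
        # The first `extra` chunks take one additional item each
        hi = lo + base + (1 if k < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

print(split_ranges(0, 10, 3))  # prints [(0, 4), (4, 7), (7, 10)]
```

Each worker would then loop over `range(lo, hi)` and write to its own output file (for example, one named after `lo`), so that parallel runs do not overwrite each other's checkpoints.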
