Along with the first task, Professor Shen Donghua gave me a second one.
Top Blockchain Dapps | DappRadar
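On this site, we need the name and categories of every Dapp listed under BNB, and we need to open the link behind each Dapp's image (see the figure below).
First, check whether each Dapp page has a "Buy CAKE" button. If it does, we scrape the link behind that button; if not, we record that there is none.
Then click the Smart Contract button further down the page. It shows how many smart contracts there are in total, and from there we need to scrape every smart contract that is displayed.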
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Import webdriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
options = webdriver.ChromeOptions()
options.add_argument('headless')
url_basic = 'https://dappradar.com'
df = pd.read_excel('with_href_1_171.xlsx')
href_list = df['href'].tolist()
len(href_list)
column_list = ['Blockchain_Dapps_link','smart_contract_name','smart_contract_link']
all_tokens = pd.DataFrame(columns=column_list)
# Placeholder first row of zeros; filter it out of the final output
all_tokens.loc[0] = [0,0,0]
for i in range(700, 1000):
    driver = webdriver.Chrome()
    Blockchain_Dapps_link = url_basic + href_list[i]
    driver.get(Blockchain_Dapps_link)
    driver.implicitly_wait(10)
    time.sleep(2)
    # The element reads "For this data we are tracking: 10 Alien Worlds Smart Contracts",
    # so match on its text rather than on its generated class name (sc-bMthRQ crCteR).
    xpath = "//*[contains(text(),'we are tracking')]"
    SeeMore = driver.find_elements(by=By.XPATH, value=xpath)
    original_t = time.time()
    # Keep retrying until the click succeeds; this loops forever if the element never appears.
    while True:
        try:
            SeeMore[0].click()
            print(i)
            break
        except Exception:  # not BaseException, so Ctrl-C can still interrupt the run
            if time.time() - original_t >= 5:
                # No luck after 5 seconds: reload the page and restart the timer
                driver.refresh()
                original_t = time.time()
            else:
                time.sleep(2)
            # Look the element up again; the old reference is stale after a refresh
            SeeMore = driver.find_elements(by=By.XPATH, value=xpath)
    time.sleep(2)
    html = driver.page_source
    driver.quit()
    bs = BeautifulSoup(html, 'lxml')
    # I think these generated class names change every day, so check them before each run.
    subjects = bs.select('li div[class="sc-jxgYAQ eOGoAs"]')
    tokens = pd.DataFrame(columns=['Blockchain_Dapps_link','smart_contract_name','smart_contract_link'])
    # Use j here so the inner loop does not shadow the outer index i
    for j, subject in enumerate(subjects):
        smart_contract_name = subject.get_text()
        smart_contract_link = subject.find_all('a', href=True)[0]['href']
        tokens.loc[j] = [Blockchain_Dapps_link, smart_contract_name, smart_contract_link]
    all_tokens = pd.concat([all_tokens, tokens])
all_tokens.to_csv('smart_contract_700.csv')
all_tokens.to_excel('smart_contract_.xlsx')
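The hard part of this problem is clicking that button. In Jupyter Notebook, the first click sometimes produces no proper response, and the button has to be clicked a second time before the result appears.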
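The retry-until-it-works idea can be wrapped in a small helper with a bounded number of attempts. This is only a sketch: `click_with_retry`, `attempts`, and `delay` are names made up here, and the helper works with any callable that returns clickable objects, so it does not depend on Selenium itself.

```python
import time


def click_with_retry(find_elements, attempts=5, delay=2.0):
    """Try to click the first element returned by find_elements().

    find_elements is a zero-argument callable returning a list of objects
    that have a .click() method (hypothetical helper, not a Selenium API).
    Returns True once a click succeeds, False after `attempts` failures.
    """
    for _ in range(attempts):
        try:
            # Run the lookup again on every attempt, because references
            # obtained before a re-render or refresh can go stale.
            find_elements()[0].click()
            return True
        except Exception:
            # IndexError (nothing found yet) or a click error: wait, then retry.
            time.sleep(delay)
    return False
```

With a live driver it could be called as `click_with_retry(lambda: driver.find_elements(By.XPATH, "//*[contains(text(),'we are tracking')]"))`, and a `False` result would mean the page can be logged and skipped instead of blocking the whole run.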
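As for locating the button: current versions of Selenium no longer provide the old helper function for selecting by class, but `find_element` / `find_elements` with a `By` locator offers several selection strategies. With this code, though, we have to work out the exact form of the class selector ourselves.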
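One way to handle a multi-class element is to turn its class attribute into a CSS selector and pass that to `By.CSS_SELECTOR`. The sketch below assumes Selenium 4; `class_selector` and `find_by_classes` are hypothetical helper names, and the class names are simply the ones seen on the page at the time, which may have changed since.

```python
def class_selector(class_attr, tag=""):
    """Turn a class attribute like "sc-bMthRQ crCteR" into ".sc-bMthRQ.crCteR"."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    return tag + "".join("." + name for name in classes)


def find_by_classes(driver, class_attr, tag=""):
    """Return all elements that carry every class in class_attr."""
    # Imported lazily so class_selector can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    return driver.find_elements(By.CSS_SELECTOR, class_selector(class_attr, tag))
```

For example, `find_by_classes(driver, "sc-bMthRQ crCteR", tag="div")` would search for `div.sc-bMthRQ.crCteR`. Matching on the visible text, as the XPath in the scraping loop does, is less sensitive to renamed classes.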
Next, we try to click the BUY TOKEN button and, if it exists, record the URL that it leads to.
column_list_2 = ['Blockchain_Dapps_link','Buy_token','address']
all_tokens_2 = pd.DataFrame(columns=column_list_2)
# Placeholder first row of zeros; filter it out of the final output
all_tokens_2.loc[0] = [0,0,0]
for i in range(3000, len(href_list)):
    driver = webdriver.Chrome()
    # Send a request to the Dapp page
    Blockchain_Dapps_link = url_basic + href_list[i]
    driver.get(Blockchain_Dapps_link)
    driver.implicitly_wait(10)
    time.sleep(2)
    BuyButton = driver.find_elements(by=By.XPATH,
                                     value="//*[contains(text(),'Buy ')]")
    tokens_2 = pd.DataFrame(columns=column_list_2)
    try:
        # Raises IndexError when the page has no "Buy ..." element
        BuyButton[0].click()
        time.sleep(2)
        linking = driver.current_url
        buy_token = 1
    except Exception:  # not BaseException, so Ctrl-C can still interrupt the run
        print('Oof!')
        buy_token = 0
        linking = 0
    tokens_2.loc[i] = [Blockchain_Dapps_link, buy_token, linking]
    driver.quit()
    all_tokens_2 = pd.concat([all_tokens_2, tokens_2])
all_tokens_2.to_csv('buy_token_3000_.csv')
all_tokens_2.to_excel('buy_token_3000_.xlsx')
# all_tokens_2.to_csv('buy_token.csv')
# all_tokens_2.to_excel('buy_token.xlsx')
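Because of the clicking problem mentioned above, I went with a rather clumsy approach here: try to click the button, and if it is not there, move on to the next page.
While running this, I found that most Dapps have no buy-token button, and many of the pages without one take a long time to process. Still, running several notebooks in parallel on multiple cores keeps the overall time acceptable.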
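One likely reason the pages without a button are slow: with `driver.implicitly_wait(10)` set, `find_elements` waits up to the full 10 seconds before returning an empty list. To spread the work over several notebooks, the index range can be cut into contiguous chunks, one per notebook. The helper below is a sketch; `split_range` and `n_parts` are names introduced here for illustration.

```python
def split_range(start, stop, n_parts):
    """Split range(start, stop) into n_parts contiguous (lo, hi) chunks."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    total = max(stop - start, 0)
    # Every chunk gets `base` items; the first `extra` chunks get one more.
    base, extra = divmod(total, n_parts)
    chunks = []
    lo = start
    for k in range(n_parts):
        hi = lo + base + (1 if k < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks
```

Each notebook would then run `for i in range(lo, hi)` over its own chunk and write to its own output file, for example one named after `lo`, so that the partial results can be concatenated afterwards.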