Simulate a human clicking through the browser, then scrape the data
Main Python packages used: selenium, pandas
Imports used (the standard-library time module is needed for the forced sleeps later):
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
Open Chrome via webdriver and load the page:
driver = webdriver.Chrome(options=chrome_options)
driver.get(MAIN_PAGE_URL)
Logging in
def login(driver):
    # Click "continue with email"; the button is located via its XPath
    email_btn = driver.find_element(By.XPATH, XPATH_email_btn)
    email_btn.click()
    # A forced sleep is required here, or the site detects the scripted action
    time.sleep(3)
    # Type the user's email address
    email_input = driver.find_element(By.NAME, "user[email]")
    email_input.send_keys(USER_EMAIL)
    # Find the login form and submit the email
    login_form = driver.find_element(By.XPATH, XPATH_login_form)
    login_form.submit()
    # TODO: the password field is not shown right after submitting (might need to sleep)
    time.sleep(17)
    # Find the password input, type the password, and submit
    pw_input = driver.find_element(By.NAME, "user[password]")
    pw_input.send_keys(PASSWORD)
    pw_input.submit()
    time.sleep(2)
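The hard-coded sleeps above (especially time.sleep(17)) are brittle: too short and the element is missing, too long and the run crawls. A minimal polling helper of my own devising, wait_for (not part of the original script), retries a lookup until it succeeds; Selenium's built-in WebDriverWait serves the same purpose:

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within {:.1f}s".format(timeout))

# Usage sketch: retry the password-field lookup instead of sleeping 17s.
# find_elements returns [] while the field is absent, which is falsy:
# pw_input = wait_for(lambda: driver.find_elements(By.NAME, "user[password]"))[0]
```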
Scraping the data
The most critical step: grab the page_source
html_source = driver.page_source
Then split page_source repeatedly until you isolate the data you want.
Example:
First create an empty dict:
df_dict = {
    "confirmation": [],
    "status": [],
}
Split page_source:
all_entries = html_source.split("Booking details")[1].split('18px;">')[1:]
confirmation = all_entries[4].split("<div ")[0]
status = html_source.split('<div class="_smggas">')[1].split(">")[1].split("<")[0]
Store the values into the dict:
df_dict["confirmation"].append(confirmation)
df_dict["status"].append(status)
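The chained-split approach can be demonstrated on a toy snippet (the HTML below is invented for illustration; the real page supplies markers like "Booking details"):

```python
# Toy page fragment standing in for html_source.
toy_html = '<h1>Booking details</h1><span style="font-size:18px;">HMABC123</span><div class="x">'

# Split on a fixed marker, keep what follows it, then trim at the next tag.
confirmation = toy_html.split("Booking details")[1].split('18px;">')[1].split("<")[0]
print(confirmation)  # → HMABC123
```

The technique is fragile by nature: any change to the page's markup breaks the markers, so the split strings have to be re-derived from the live page_source whenever the site updates.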
Save the data to CSV
df_ = pd.DataFrame(df_dict)
df_.to_csv(PATH, index=False)
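Putting the dict-to-CSV step together end to end (the bookings.csv path and the row values are made up for illustration):

```python
import pandas as pd

# After the scraping loop, each list holds one value per booking.
df_dict = {
    "confirmation": ["HMABC123", "HMDEF456"],
    "status": ["Confirmed", "Canceled"],
}
df_ = pd.DataFrame(df_dict)
df_.to_csv("bookings.csv", index=False)  # index=False drops the row-number column

# Reading it back shows the same two rows and two columns.
print(pd.read_csv("bookings.csv").shape)  # → (2, 2)
```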
Anti-detection techniques:

- Add a USER_AGENT:
  Find your local USER_AGENT by typing chrome://version into Chrome's address bar
  chrome_options = Options()
  chrome_options.add_argument("--user-agent=" + USER_AGENT)

- Spoof the USER_AGENT:
  Install a User-Agent switcher extension in Chrome;
  once installed, you can masquerade as different USER_AGENTs
- Switch the IP address
  Example:
  proxy_host = '127.0.0.1'
  proxy_port = '8888'
  chrome_options = webdriver.ChromeOptions()
  chrome_options.add_argument('--proxy-server=http://{}:{}'.format(proxy_host, proxy_port))
- Reset cookies
  Create a first driver0, load the page, and save its cookies:
  driver0 = webdriver.Chrome(options=chrome_options)
  driver0.get(MAIN_PAGE_URL)
  saved_cookies = driver0.get_cookies()
  Create a second driver, load the page, delete all cookies first (the key anti-detection step), then add the saved cookies back:
  driver = webdriver.Chrome(options=chrome_options)
  driver.get(MAIN_PAGE_URL)
  driver.delete_all_cookies()
  for cookie in saved_cookies:
      driver.add_cookie(cookie)
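A possible extension of the cookie reset above: driver.get_cookies() returns a plain list of dicts, so the cookies can be persisted to disk and re-added in a later session to skip the login flow entirely. A sketch (the cookies.json filename and the cookie fields shown are illustrative; real cookies carry more fields such as domain, path, and expiry):

```python
import json

# Shape of what driver.get_cookies() returns after a successful login.
saved_cookies = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com"},
]

# Persist the cookies to disk.
with open("cookies.json", "w") as f:
    json.dump(saved_cookies, f)

# In a later run: load them and re-add before refreshing the page.
with open("cookies.json") as f:
    restored = json.load(f)
# for cookie in restored:
#     driver.add_cookie(cookie)
print(restored == saved_cookies)  # → True
```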