Web Scraping Study Notes, Part 3

Latest recommended article published 2023-01-31 00:40:20

lulin1991

Category: Python web scraping

Article link: https://blog.csdn.net/lulin1991/article/details/105758853

This part covers sessions and cookies, IP proxies, and using selenium. Stretch goal: log in to DXY (丁香园) with a simulated login and scrape its message board.
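For the IP proxy topic, here is a minimal sketch of picking a proxy from a pool before each request. The addresses, `PROXY_POOL`, `build_proxies`, and `pick_proxies` are illustrative assumptions and do not come from the original post; the addresses sit in a documentation-only IP range and must be replaced with working proxies. The returned mapping has the shape that the `proxies` argument of `requests.get` accepts.

```python
import random

# Hypothetical proxy addresses (documentation-only IP range); replace with real ones.
PROXY_POOL = [
    "203.0.113.10:8080",
    "203.0.113.11:3128",
    "203.0.113.12:8000",
]


def build_proxies(address):
    """Return a mapping in the shape expected by requests' `proxies` argument."""
    proxy_url = f"http://{address}"
    return {"http": proxy_url, "https": proxy_url}


def pick_proxies(pool=PROXY_POOL):
    """Choose one proxy at random so requests are spread across several IPs."""
    return build_proxies(random.choice(pool))


if __name__ == "__main__":
    # Example: pass the result as requests.get(url, proxies=pick_proxies())
    print(pick_proxies())
```

Choosing at random is the simplest policy; a real crawler would usually also drop proxies that time out.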

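Challenge project: log in to DXY with a simulated login, then scrape every poster's basic information and reply content from the forum page.
DXY forum: http://www.dxy.cn/bbs/thread/626626#626626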

The code below uses selenium to log in and prints each poster's basic information together with the reply content.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

"""
Simulated login with selenium:
1. Initialize ChromeDriver
2. Open the login page
3. Find the username input box and enter the username
4. Find the password box and enter the password
5. Submit the user information
"""
username = ''
userpasswd = ''
def login(name, passwd):
    # Selenium 4 API. On Selenium 3, pass the driver path directly to
    # webdriver.Chrome() and use the find_element_by_* methods instead.
    # The raw string keeps the backslashes in the Windows path literal.
    driver = webdriver.Chrome(service=Service(r'D:\Anaconda3\chromedriver\chromedriver.exe'))
    driver.get('http://www.dxy.cn/bbs/thread/626626#626626')
    # Maximize the window
    driver.maximize_window()
    # Wait 5 s for the page to load
    time.sleep(5)
    current_window_1 = driver.current_window_handle
    print(current_window_1)
    time.sleep(5)
    # Go to the login page
    button = driver.find_element(By.XPATH, '//*[@id="headerwarp"]/div[2]/div[1]/div/a[1]')
    button.click()
    print("Clicked the login link")
    current_window_2 = driver.current_window_handle
    print(current_window_2)
    print("Switching to the username/password form")
    time.sleep(3)
    button = driver.find_element(By.XPATH, '/html/body/div[2]/div[2]/div[1]/a[2]')
    button.click()
    uname = driver.find_element(By.NAME, 'username')
    #email = driver.find_element(By.XPATH, '//input[@name="email"]')
    uname.send_keys(name)
    password = driver.find_element(By.NAME, 'password')
    #password = driver.find_element(By.XPATH, "//input[@name='password']")
    password.send_keys(passwd)
    submit = driver.find_element(By.XPATH, '//*[@id="user"]/div[1]/div[3]/button')
    time.sleep(10)
    submit.click()
    # Wait for the redirect back to the thread after login
    time.sleep(10)
    text = driver.page_source
    print(text)
    # Close the browser once the page source has been captured
    driver.quit()
    return text

def getdata(html):
    soup = BeautifulSoup(html, 'html.parser')
    # td.tbs holds each poster's profile, td.tbc holds the matching post
    userinfo = soup.find_all('td', {'class': 'tbs'})
    continfo = soup.find_all('td', {'class': 'tbc'})
    print(len(userinfo))
    for user_cell, post_cell in zip(userinfo, continfo):
        user = user_cell.find('div', {'class': 'auth'}).find('a').get_text()
        num = user_cell.find_all('div', {'class': 'num'})
        score = num[0].find('a').get_text()     # score
        depiao = num[1].find('a').get_text()    # votes received (得票)
        dingdang = num[2].find('a').get_text()  # 丁当 points
        # Separate name so the post text does not shadow the function argument
        body = post_cell.find('td', {'class': 'postbody'}).get_text()
        print(user, score, depiao, dingdang, body)


getdata(login(username, userpasswd))
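
Since this part also covers sessions and cookies, here is a minimal sketch of carrying a browser login over to a plain HTTP client, so that further pages can be fetched without driving a browser. It assumes the cookie list has the shape returned by Selenium's `driver.get_cookies()` (a list of dicts with `name` and `value` keys). The helper `cookies_to_header` and the sample cookie names and values are made up for illustration and are not taken from the original post or from DXY.

```python
def cookies_to_header(selenium_cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


# Sample in the shape returned by driver.get_cookies(); names and values are invented.
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123", "domain": ".dxy.cn", "path": "/"},
    {"name": "token", "value": "xyz", "domain": ".dxy.cn", "path": "/"},
]

# These headers can then be sent with any HTTP client, e.g. requests.get(url, headers=headers)
headers = {"Cookie": cookies_to_header(sample_cookies)}
print(headers)
```

To use this with the login flow, the cookies would have to be read from the driver before the browser is closed.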
