[Python] Highly Recommended Web Scraping Primer, Part 3

Latest recommended article published on 2024-07-16 21:57:11

wujiekd

Category: Web scraping

Article link: https://blog.csdn.net/weixin_43999137/article/details/105755931

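In Part 1 we covered the basic structure of a web page and simple analysis of page source, and practiced with the requests library. requests is one of the simplest and easiest-to-use HTTP libraries for Python, so it is strongly recommended for scraping.
Part 1 link: [Python] Highly Recommended Web Scraping Primer, Part 1

In Part 2 we covered three tools for parsing and extracting data from HTML: re, XPath, and Beautiful Soup. The one most worth studying in depth is regular expressions (re).
Part 2 link: [Python] Highly Recommended Web Scraping Primer, Part 2
Recommended for this part: the basic usage and common syntax of Selenium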

Selenium

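What is Selenium?
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as if a real user were operating it.
Where Selenium fits: it drives a browser from code (opening the browser, typing into an input box, pressing Enter, and so on), which is a great help for scraping.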

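Put simply: it can take over tedious steps such as typing a username and password when logging in to a site. Because a real browser is being driven, the traffic looks far more like a human visitor than bare HTTP requests do, although sites can still detect automated browsers.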

Installing selenium

pip install selenium

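Installing chromedriver

(a driver program used to launch the Chrome browser)

1. First check your Chrome version by entering chrome://version/ in the browser's address bar.
You will see something like: Google Chrome 81.0.4044.113 (Official Build) (64-bit)
2. Go to the chromedriver download site and pick the driver that matches your Chrome version.

3. The download is a single small executable. Here we put it on the desktop and note its path.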
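The driver has to match the installed browser, and comparing major version numbers is a reasonable first check. The following is a minimal sketch of that comparison, assuming both version strings are copied by hand (one from chrome://version/, one from the download page). `major_version` and `versions_match` are hypothetical helper names, not part of Selenium, and the driver versions below are for illustration only.

```python
def major_version(version: str) -> int:
    """Return the major version number, e.g. 81 for '81.0.4044.113'."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Version string taken from the chrome://version/ output quoted above
chrome = "81.0.4044.113"

# Hypothetical driver versions, used only to illustrate the check
print(versions_match(chrome, "81.0.4044.69"))
print(versions_match(chrome, "80.0.3987.106"))
```

If the check fails, download a different driver build before going further; a mismatched driver usually refuses to start the browser.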

Accessing Chrome

1. Import the modules:

from selenium import webdriver  # needed to launch the browser
from selenium.webdriver.common.keys import Keys  # keyboard key support (note the capital K in the class name)

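2. Create a WebDriver instance:

from selenium.webdriver.chrome.service import Service  # Selenium 4 takes the driver path through a Service object

driver = webdriver.Chrome(service=Service("/Users/lukeda/Desktop/chromedriver"))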

3. Open a page:

driver.get("https://competition.huaweicloud.com/information/1000037843/introduction")  # chromedriver opens a Chrome window showing the page at this URL

4. Close the page:

driver.close()  # closes the current tab; driver.quit() ends the whole browser session
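If the script raises an error between opening and closing the page, the browser window and the chromedriver process can be left running. One way to guard against that is a small context manager that always calls quit(). This is only a sketch: `managed` is a hypothetical helper, and `FakeDriver` is a stand-in so the example runs without a browser. In real use you would pass in the object returned by webdriver.Chrome.

```python
from contextlib import contextmanager


@contextmanager
def managed(driver):
    """Yield the driver, then always call quit() on it, even after an error."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for webdriver.Chrome so the sketch runs without a browser."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


fake = FakeDriver()
with managed(fake) as driver:
    driver.get("https://competition.huaweicloud.com/information/1000037843/introduction")

print(fake.closed)  # True
```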

Hands-on: simulating a login to DXY (丁香园)

import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class getUrl(object):
    """Log in to DXY with Selenium, then scrape one forum thread."""

    def __init__(self):
        # Request headers kept for reuse with an HTTP client; Selenium itself does not use them
        self.headers = {
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, sdch",
            "Accept-Language": "zh-CN,zh;q=0.8"
        }

    def run(self):
        # Selenium 4 style: the chromedriver path is passed through a Service object
        browser = webdriver.Chrome(service=Service('/Users/lukeda/Desktop/chromedriver'))
        browser.get('https://auth.dxy.cn/accounts/login?service=http://www.dxy.cn/bbs/index.html')
        time.sleep(1)
        # Switch to the username/password login form
        js1 = 'document.querySelector("#j_loginTab1").style.display="none";'
        browser.execute_script(js1)
        time.sleep(1)
        js2 = 'document.querySelector("#j_loginTab2").style.display="block";'
        browser.execute_script(js2)
        # Enter the username and password
        input_name = browser.find_element(By.NAME, 'username')
        input_name.clear()
        input_name.send_keys('*')  # replace with your own username and password
        input_pass = browser.find_element(By.NAME, 'password')
        input_pass.clear()
        input_pass.send_keys('*')
        browser.find_element(By.XPATH, '//*[@class="form__button"]/button').click()
        # A CAPTCHA is expected at this step; skipped for now
        time.sleep(10)
        cookie = browser.get_cookies()
        cookie_dict = {i['name']: i['value'] for i in cookie}  # not used below; kept for later reuse
        # Go to the page to scrape
        browser.get("http://www.dxy.cn/bbs/thread/626626#626626")
        html = browser.page_source
        browser.quit()  # the page source is saved, so the browser can be shut down
        tree = etree.HTML(html)
        user = tree.xpath('//div[@id="postcontainer"]//div[@class="auth"]/a/text()')
        content = tree.xpath('//td[@class="postbody"]')
        # Write to file: open once, then append one record per post
        with open("DXY_records.txt", 'a', encoding="utf-8") as dir_file:
            for name, body in zip(user, content):
                result = name.strip() + ":" + body.xpath('string(.)').strip()
                dir_file.write(result + "\n")
                dir_file.write('*' * 80 + "\n")
        print('*' * 5 + "Scraping finished" + '*' * 5)


if __name__ == '__main__':
    geturl = getUrl()
    geturl.run()
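The script builds cookie_dict from browser.get_cookies() but never uses it. A common next step is to hand the logged-in cookies to a plain HTTP client so later pages can be fetched without the browser. Below is a minimal sketch of the conversion, assuming the cookie list has the shape Selenium returns (a list of dicts with at least "name" and "value" keys). `cookies_to_dict` and `cookie_header` are hypothetical helper names, and the sample cookies are made up.

```python
def cookies_to_dict(cookies):
    """Collapse a list of cookie dicts into a simple name -> value mapping."""
    return {c["name"]: c["value"] for c in cookies}


def cookie_header(cookie_dict):
    """Format the mapping as the value of an HTTP Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookie_dict.items())


# Made-up cookies in the same shape as browser.get_cookies() output
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com"},
    {"name": "token", "value": "xyz"},
]

headers = {"Cookie": cookie_header(cookies_to_dict(sample))}
print(headers)  # {'Cookie': 'sessionid=abc123; token=xyz'}
```

With requests, the dictionary itself can be passed through the cookies argument, for example requests.get(url, headers=self.headers, cookies=cookie_dict), so building the header by hand is only needed for clients that lack such a parameter.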