Using Python's Web Scraping Modules

Latest recommended article published on 2023-07-25 17:09:12

ShanWu__


Article link: https://blog.csdn.net/wushan1992/article/details/80791824


Using BeautifulSoup

First, install BeautifulSoup:
pip install beautifulsoup4
By default BeautifulSoup works with the HTML parser in Python's standard library, and it also supports several third-party parsers:

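No.	Parser	Usage	Advantages	Disadvantages
1	Python standard library	BeautifulSoup(html, 'html.parser')	Built into Python; reasonably fast	Less tolerant of malformed markup
2	lxml HTML parser	BeautifulSoup(html, 'lxml')	Fast; tolerant of malformed markup	Must be installed; depends on C libraries
3	lxml XML parser	BeautifulSoup(html, ['lxml', 'xml'])	Fast; tolerant of malformed markup; supports XML	Depends on C libraries
4	html5lib parser	BeautifulSoup(html, 'html5lib')	Parses the way a browser does; best tolerance of malformed markup	Slow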
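Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch under that assumption; the helper name make_soup and its preference order are illustrative choices, not part of the BeautifulSoup API. It relies on BeautifulSoup raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html5lib", "html.parser")):
    """Parse html with the first parser in `preferred` that is installed.

    make_soup and the default order are illustrative assumptions,
    not part of the BeautifulSoup API.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately sloppy markup: neither tag is closed.
soup = make_soup("<p>hello<b>world")
print(soup.get_text())
```

Since html.parser ships with Python, keeping it last in the tuple means the helper should always find something to use.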

- Import the library:
from bs4 import BeautifulSoup

Here is a simple example:

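import requests
from bs4 import BeautifulSoup

url = "http://www.baidu.com"

# Fetch the page through a session so cookies persist across requests
session = requests.Session()
res = session.get(url=url)
# Let requests guess the encoding from the body so Chinese text decodes correctly
res.encoding = res.apparent_encoding
html_doc = res.text
# print(html_doc)
# print(type(html_doc))

soup = BeautifulSoup(html_doc, 'html.parser')

# Pretty-print the parsed document:
print(soup.prettify())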

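BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. These objects fall into a few kinds (Tag, NavigableString, BeautifulSoup, and Comment):
Tag
print(soup.title)
Output:
<title>百度一下，你就知道</title>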
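To make the Tag object more concrete without depending on a live site, here is a small sketch that parses an inline document and reads a tag's name, text, and attributes. The HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration; no network access needed.
html = (
    '<html><head><title>Demo page</title></head>'
    '<body><a id="home" class="nav main" href="/">Home</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

title = soup.title   # first <title> Tag in the tree
link = soup.a        # first <a> Tag in the tree

print(title)              # the whole tag, markup included
print(title.name)         # the tag's name
print(title.string)       # the NavigableString inside the tag
print(link["href"])       # attributes are read like dictionary keys
print(link.get("class"))  # class can hold several values, so it is a list
print(link.attrs)         # all attributes as a dict
```

Printing a Tag gives its markup, while .string gives only the text inside it, which is why print(soup.title) shows the surrounding title tags.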
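The find_all() method:
find_all(name, attrs, recursive, text, **kwargs)
The name argument
The name argument finds every Tag whose tag name is name; string objects are skipped automatically.
print(soup.find_all('a'))
Keyword arguments
Find the tags whose id is css:
print(soup.find_all(id='css'))
The text argument
Searches the string content of the document instead of its tags. Like name, the text argument accepts a string, a regular expression, a list, or True. Beautiful Soup 4.4.0 and later name this argument string; text is the older name.
import re
print(soup.find_all(text=re.compile('^abc')))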
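The find_all() variants above can be tried on a small inline document. This is a sketch with made-up HTML; it uses the string argument, which assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Made-up document for illustration.
html = """<html><head><title>Demo</title></head><body>
<a id="css" href="/css">abc styles</a>
<a id="js" href="/js">scripts</a>
<p class="note">abc paragraph</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                       # by tag name
by_id = soup.find_all(id="css")                  # by keyword argument
texts = soup.find_all(string=re.compile("^abc"))  # by string content
first_only = soup.find_all("a", limit=1)         # stop after one match
either = soup.find_all(["a", "p"])               # a list matches any of its names

print(len(links), len(by_id), len(texts), len(first_only), len(either))
```

Note that a string search returns the matching strings themselves rather than the tags that contain them.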

Using Selenium

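Selenium is a complete testing system for web applications. It covers recording tests (Selenium IDE), writing and running them (Selenium Remote Control), and running them in parallel (Selenium Grid). Its original core, Selenium Core, is based on JsUnit and written entirely in JavaScript, so it can run in any browser that supports JavaScript. Selenium is an automated testing tool that drives a real browser and supports multiple browsers. In web scraping it is mainly used to handle pages whose content is rendered by JavaScript.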

A simple program that opens a given page automatically:

import os
import time

from selenium import webdriver

# Raw string so the backslashes in the Windows path are not treated as escapes
chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
# Selenium 3 style: the driver path is passed positionally
driver = webdriver.Chrome(chromedriver)
driver.get("http://www.python.org")
time.sleep(10)  # keep the window open for 10 seconds
driver.quit()

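Common methods for locating elements (Selenium 3 API; Selenium 4.3 removed these names in favor of find_element(by, value)):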
find_element_by_name
find_element_by_id
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
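For readers on a newer Selenium release, each method in this list corresponds to a locator strategy passed to find_element(by, value). The sketch below shows that mapping. The names LEGACY_TO_BY, find, and FakeDriver are hypothetical helpers introduced only for illustration, and FakeDriver stands in for a real WebDriver so the example runs without a browser.

```python
# Locator strategy strings; these mirror the constants on
# selenium.webdriver.common.by.By (for example By.ID == "id").
LEGACY_TO_BY = {
    "find_element_by_name": "name",
    "find_element_by_id": "id",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}


def find(driver, legacy_name, value):
    """Call driver.find_element(by, value) for a legacy method name.

    `find` and LEGACY_TO_BY are hypothetical helpers for illustration.
    """
    return driver.find_element(LEGACY_TO_BY[legacy_name], value)


class FakeDriver:
    """Stand-in for a WebDriver so the sketch runs without a browser."""

    def find_element(self, by, value):
        # A real driver would return a WebElement; echo the arguments instead.
        return (by, value)


print(find(FakeDriver(), "find_element_by_id", "loginname"))
```

With a real driver the same call would be written directly, for example driver.find_element(By.ID, 'loginname') after importing By from selenium.webdriver.common.by.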
Visit a site, enter a username and password, and log in to a shopping site:

import os
import random
import time

from selenium import webdriver


def randomSleep(minS, maxS):
    # Pause for a random duration between minS and maxS seconds
    time.sleep((maxS-minS)*random.random() + minS)


url = 'https://passport.jd.com/new/login.aspx?ReturnUrl=https%3A%2F%2Fwww.jd.com%2F'
# Raw string so the backslashes in the Windows path are not treated as escapes
chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
driver.get(url)
randomSleep(1, 3)
# Click the first link that carries this clstag attribute
driver.find_elements_by_xpath('//a[@clstag="pageclick|keycount|login_pc_201804112|10"]')[0].click()

# Fill in the username, then the password, pausing between steps
randomSleep(1, 2)
driver.find_element_by_id('loginname').clear()
randomSleep(1, 3)
driver.find_element_by_id('loginname').send_keys("**************")
randomSleep(1, 2)
driver.find_element_by_id('nloginpwd').send_keys("*****")

# Submit the form and give the page time to finish logging in
randomSleep(3, 5)
driver.find_element_by_id('loginsubmit').click()
randomSleep(5, 10)

print(driver.get_cookies())

# quit() closes the browser and also ends the chromedriver process
driver.quit()
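The script ends by printing driver.get_cookies(), which returns a list of dictionaries, one per cookie. One possible next step is to reduce that list to a plain name-to-value mapping. The sketch below assumes a hypothetical helper cookies_to_dict and uses a made-up sample in place of a real login.

```python
def cookies_to_dict(selenium_cookies):
    """Collapse Selenium cookie records into a {name: value} mapping.

    cookies_to_dict is a hypothetical helper for illustration.
    """
    return {c["name"]: c["value"] for c in selenium_cookies}


# Made-up sample in the shape returned by driver.get_cookies()
sample = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "zh", "domain": ".example.com", "path": "/"},
]
cookies = cookies_to_dict(sample)
print(cookies)
```

A mapping like this can be passed to requests, for example requests.get(url, cookies=cookies), so later pages are fetched without keeping the browser open. The domain and path fields are dropped in this simplified form.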
