Web Scraping Notes

Latest recommended article published 2023-05-19 22:56:58

影子浅笑

Category: python. Tags: python, git

Article link: https://blog.csdn.net/u014199409/article/details/103732079

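Key points

BeautifulSoup

soup = BeautifulSoup(demo, "html.parser")
# demo is the HTML content to be parsed
# "html.parser" names the parser to use
print("href attribute of the a tag:", soup.a.attrs["href"])  # attrs is a dict, so the href attribute is read dict-style

soup.find(name="div", attrs={"class": "xxxxx"})  # locate a single tag (the first match)
soup.find_all(name="div", attrs={"class": "xxxxx"})  # locate every matching tag; returns a list-like result
img.get("src")  # read an attribute of a tag
div.text  # read the text content of a tag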

find_all examples:
print("All a tags:", soup.find_all("a"))  # find_all() looks tags up by name and returns a list-like result
print("All a and b tags:", soup.find_all(["a", "b"]))  # pass a list of names to find a tags and b tags in one call
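
The calls above can be tried without any network access. The following is a minimal sketch, assuming the `bs4` package is installed; the HTML snippet, the `gallery` class name, and the example.com URLs are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
demo = """
<html><body>
  <div class="gallery">
    <a href="https://example.com/page1" title="First">First link</a>
    <a href="https://example.com/page2" title="Second">Second link</a>
    <b>Bold text</b>
    <img src="https://example.com/thumb.jpg">
  </div>
</body></html>
"""

soup = BeautifulSoup(demo, "html.parser")

# soup.a is the first <a> tag in the document; attrs is a plain dict.
first_href = soup.a.attrs["href"]

# find() returns the first matching tag, find_all() returns every match.
gallery = soup.find(name="div", attrs={"class": "gallery"})
links = gallery.find_all(name="a")
hrefs = [a.get("href") for a in links]

# get() reads one attribute, .text reads the text inside a tag.
img_src = gallery.find(name="img").get("src")
bold_text = soup.find(name="b").text

# A list of names matches any of them, in document order.
mixed = [tag.name for tag in soup.find_all(["a", "b"])]

print(first_href, hrefs, img_src, bold_text, mixed)
```

Note that `find()` returns `None` when nothing matches, so a real script should check the result before calling `find_all()` or `get()` on it.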

Sample script:
Scraping Yesky (pic.yesky.com):

import os
import re
import requests
from bs4 import BeautifulSoup

file_dir = os.path.dirname(os.path.abspath(__file__))

respon = requests.get(url='http://pic.yesky.com/c/6_3655_6.shtml')
respon.encoding = 'gbk'  # decode the response body as GBK
text = respon.text
soup = BeautifulSoup(text, 'html.parser')
div_obj = soup.find(name='div', attrs={"class": "lb_box"})  # container of the album list
li_list = div_obj.find_all(name='dd')  # one <dd> per album

for i in li_list:
    img = i.find(name='a')
    alt = img.get('title')
    alt = re.sub(r'[\\/:*?"<>|]', '-', alt)  # replace characters that are not allowed in file names
    base_dir = os.path.join(file_dir, '7160', alt)
    big_image = img.get('href')
    big_respon = requests.get(url=big_image)
    big_respon.encoding = 'gbk'
    big_text = big_respon.text
    big_soup = BeautifulSoup(big_text, 'html.parser')
    big_obj = big_soup.find(name='div', attrs={"class": "overview"})
    big_list = big_obj.find_all(name='img')
    for thumb in big_list:
        # thumbnail and full-size URLs differ only in the size segment, so swap it
        big_src = thumb.get('src').replace('113x113', '740x-')
        file_path = os.path.join(base_dir, big_src.rsplit('/', 1)[-1])
        print(file_path)

        # create the album folder if it does not exist yet, then download the image
        os.makedirs(base_dir, exist_ok=True)
        res = requests.get(url=big_src)
        with open(file_path, 'wb') as f:
            f.write(res.content)
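
The string handling in the script (cleaning the album title, switching the thumbnail URL to the full-size one, and building the target path) can be checked on its own before any download is attempted. Below is a minimal sketch using only the standard library; the helper names `safe_name`, `full_size_url`, and `target_path` are hypothetical, and the sample URLs are invented.

```python
import os
import re


def safe_name(title):
    """Replace characters that are not allowed in file names with '-'."""
    return re.sub(r'[\\/:*?"<>|]', '-', title)


def full_size_url(thumb_url):
    """Swap the thumbnail size segment for the full-size one, as the script does."""
    return thumb_url.replace('113x113', '740x-')


def target_path(root, title, image_url):
    """Build root/<cleaned title>/<file name taken from the URL>."""
    file_name = image_url.rsplit('/', 1)[-1]
    return os.path.join(root, safe_name(title), file_name)


# Invented inputs, for illustration only.
print(safe_name('a/b:c?'))
print(full_size_url('http://example.com/img/113x113/pic.jpg'))
print(target_path('out', 'a:b', 'http://example.com/x/pic.jpg'))
```

Keeping these steps in small functions makes it easy to confirm the naming rules without hitting the site.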
## Selenium


## 1. Browser initialization (using Chrome as an example)

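from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--window-size=1920,3000')  # set the browser window size (width,height)
chrome_options.add_argument('--disable-gpu')  # Google's documentation mentions adding this flag to work around a bug
chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, useful for some unusual pages
chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images, which speeds things up
chrome_options.add_argument('--headless')  # run without a visible window; on Linux without a display, startup fails unless this is set
chrome_options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # point to the browser binary manually

driver = webdriver.Chrome(options=chrome_options)  # recent Selenium versions take options= instead of chrome_options=
driver.get('https://www.baidu.com')
print('hao123' in driver.page_source)

driver.quit()  # always shut the browser down to release resources; close() only closes the current window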



## 2.
