Crawler notes

'''

Crawling approaches:
1. requests(url)
2. requests + json
3. requests + XPath
4. requests + BeautifulSoup
5. selenium
6. scrapy framework
7. scrapy-redis and distributed crawling

===============================================
OS:
import os
os.system("C: && p.txt")      # switch to drive C:, then open p.txt via its file association
os.system("ping 127.0.0.1")
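A hedged alternative sketch: the standard-library subprocess.run is usually preferred over os.system when you need the command's output (capture_output requires Python 3.7+):

import subprocess
result = subprocess.run(['ping', '127.0.0.1'], capture_output=True, text=True)
print(result.stdout)   # captured instead of printed straight to the console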

===============================================
requests:
requests.get(url, headers=headers, params={'': ''}, proxies=proxies)   # params builds a GET query string; data= belongs to POST
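A hedged sketch of that distinction (url and headers are placeholders):

import requests
url = 'https://example.com/search'          # placeholder
headers = {'User-Agent': 'Mozilla/5.0'}     # placeholder
resp = requests.get(url, headers=headers, params={'q': 'python'}, timeout=10)   # GET query string
resp = requests.post(url, headers=headers, data={'key': 'value'}, timeout=10)  # POST form body
print(resp.status_code)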

===============================================
Proxies:
proxies = {'http': '124.207.82.166:8008'}   # 47.98.129.198
response = requests.get(request_url, proxies=proxies)   # send the request
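A hedged sketch with a fallback when the proxy is down (request_url is a placeholder; the proxy address is the example above, availability not guaranteed):

import requests
proxies = {'http': 'http://124.207.82.166:8008'}        # example proxy from above
try:
    response = requests.get(request_url, proxies=proxies, timeout=5)
except requests.exceptions.ProxyError:
    response = requests.get(request_url, timeout=5)     # fall back to a direct request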

===============================================
File:
with open(path, 'w') as f:
    f.write(text)
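A hedged sketch: pass an explicit encoding when writing scraped text, and use mode 'a' to append (more_text is a placeholder):

with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with open(path, 'a', encoding='utf-8') as f:   # 'a' appends instead of overwriting
    f.write(more_text)                         # more_text is a placeholder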

===============================================
Threading:
import threading
threading.Thread(target=fun, kwargs={'list_url': list_url, 'path_order': path_order1}).start()
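A hedged sketch with join() so the main thread waits for the workers (fun, urls and path_order1 are placeholders matching the call above):

import threading

def fun(list_url, path_order):                     # hypothetical worker
    print('fetching', list_url, 'with', path_order)

urls = ['https://example.com/page1', 'https://example.com/page2']   # placeholder URLs
path_order1 = '//div'                                               # placeholder XPath
threads = [threading.Thread(target=fun, kwargs={'list_url': u, 'path_order': path_order1}) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()   # block until every thread has finished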

===============================================
requests + json:
1. data = json.load(open("package1.json", encoding="utf-8"))
   response = requests.get(url, headers=headers)
   print(response.text)

2. response = requests.get(url)
   data = response.text
   obj = json.loads(data)
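A hedged shortcut sketch: response.json() parses the body directly ("data" and "items" are hypothetical keys, the URL is a placeholder):

import requests
response = requests.get('https://example.com/api', headers={'User-Agent': 'Mozilla/5.0'})
obj = response.json()          # equivalent to json.loads(response.text)
for item in obj.get('data', {}).get('items', []):   # navigate nested fields defensively
    print(item)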

===============================================
requests + XPath:
from lxml import etree
response = requests.get(list_url, headers=headers)
content = response.content
selector = etree.HTML(content)        # load the page into an etree tree
items = selector.xpath(path_order)    # query the tree with an XPath; returns a list of nodes
for item in items:                    # each node can be queried further with XPath
    title = item.xpath("./div/p[1]/a/text()")[0].strip()
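A self-contained hedged sketch (the URL and XPath expressions are hypothetical) showing text vs. attribute extraction:

import requests
from lxml import etree

resp = requests.get('https://example.com/list', headers={'User-Agent': 'Mozilla/5.0'})
tree = etree.HTML(resp.content)
for row in tree.xpath('//div[@class="item"]'):   # hypothetical container nodes
    title = row.xpath('./a/text()')              # text() returns a list of strings
    href = row.xpath('./a/@href')                # @attr extracts attribute values
    if title and href:                           # guard against missing nodes
        print(title[0].strip(), href[0])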

===============================================
requests + BeautifulSoup:
from bs4 import BeautifulSoup
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
soup_str = soup.prettify()   # normalize / pretty-print the HTML
tag = soup.b
followed by a series of operations on tag (see the sketch below)
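A hedged sketch of a few of those typical tag operations ('item' is a hypothetical class name):

print(tag.name)      # the tag's name, e.g. "b"
print(tag.string)    # the tag's text content
print(tag.attrs)     # attribute dictionary
for a in soup.find_all('a'):              # iterate over all links
    print(a.get('href'))
first = soup.find('div', class_='item')   # 'item' is a hypothetical class name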

===============================================
selenium: install the Selenium driver matching your Chrome version: https://www.cnblogs.com/JHblogs/p/7699951.html
and install the dependency: pip install selenium
from selenium import webdriver
chromedriver = "G:/4Anaconda/chromedriver.exe"   # this step can be skipped if the driver is on the Python path
browser = webdriver.Chrome(chromedriver)
# open a page
browser.get("http://www.baidu.com")
browser.find_element_by_id("kw").send_keys("selenium")
browser.find_element_by_id("su").click()
browser.title
browser.set_window_size(480, 800)   # the arguments are in pixels
browser.back()
browser.forward()
# quit and close every window of the associated driver

browser.quit()

# close only the current window
# browser.close()
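A hedged variant, assuming a selenium version that accepts the options argument, running Chrome headless so no window appears:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')        # run without opening a browser window
browser = webdriver.Chrome(options=options)
browser.get('http://www.baidu.com')
print(browser.title)
browser.quit()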

Implicit wait

from selenium import webdriver

browser = webdriver.Chrome()

# implicit waiting is set up here with implicitly_wait()

browser.implicitly_wait(10)
browser.get('https://www.zhihu.com/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)

Explicit wait
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
wait = WebDriverWait(browser, 10)
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)

'''

'''
# 1. Create a scrapy project (cmd):
scrapy startproject weibospider
cd weibospider

# 2. Create a spider (cmd): scrapy genspider WeiboSpider image.baidu.com

Ignore the robots.txt protocol: ROBOTSTXT_OBEY = False

Run the spider: scrapy crawl baiduimg

# 3. Define the data structure (in items.py)
name = scrapy.Field()

# 4. Import the item: from hotnewsSpider.items import WeiboSpiderItem
Use it:
weiboitem = WeiboSpiderItem()
weiboitem['name'] = '123'
Return it: yield weiboitem

# 5. Send a request (inside parse)
yield scrapy.Request(url=url, headers=self.headers, cookies=self.cookies, callback=self.clickFindMore)
# send a request and hand off to another method via the callback argument
yield scrapy.Request(link, callback=self.parse_detail)

# 6. Override the initial requests
def start_requests(self):
    for url in self.urls:
        yield scrapy.Request(url=url, headers=self.headers, cookies=self.cookies, callback=self.parse)

# 7. Receive the response
def parse(self, response):
    pass
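
Putting steps 3-7 together, a minimal sketch of a complete spider (the start URL is a placeholder; class and item names follow the notes above):

import scrapy
from hotnewsSpider.items import WeiboSpiderItem   # the item from steps 3 and 4

class WeiboSpider(scrapy.Spider):
    name = 'weibospider'
    urls = ['https://weibo.com/']                 # placeholder start URLs

    def start_requests(self):                     # step 6: initial requests
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):                    # step 7: receive the response
        weiboitem = WeiboSpiderItem()             # step 4: fill and yield the item
        weiboitem['name'] = response.xpath('//title/text()').get()
        yield weiboitem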

'''

Installing Chromedriver for use with Chrome (steps summarized from the reference below):
1. Open Chrome and go to the settings panel.
2. Check the current Chrome version number.
3. On the driver download page, find the version closest to your Chrome version (pick the latest version near "icons/").
4. On that version's page, open "notes.txt" to confirm it matches your Chrome version.
5. Go back and download the zip for your OS (the Windows build is a single package covering both 32-bit and 64-bit).
6. Unzip it and move "chromedriver.exe" into your Python installation directory (or into the project folder). Putting it on the Python path means you never have to point to it again; keeping it in the project folder is also convenient.

Reference: Python爬虫常用:谷歌浏览器驱动——Chromedriver 插件安装教程, https://blog.csdn.net/m0_67575344/article/details/123433729
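A quick hedged check that the driver install worked (the path is the example from the notes above; omit it if chromedriver is on the Python path):

from selenium import webdriver
browser = webdriver.Chrome('G:/4Anaconda/chromedriver.exe')
browser.get('http://www.baidu.com')
print(browser.title)   # a printed title means the driver and browser versions match
browser.quit()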