![](https://img-blog.csdnimg.cn/20190918140213434.png?x-oss-process=image/resize,m_fixed,h_224,w_224)
Web Scraping in Practice
Basics fully updated; the anti-crawler techniques module is ready to paste and use.
小蜗笔记
Passionate about modeling and computers
Network time verification code
[Code] Network time verification. Original · 2024-03-17 09:13:43
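The post's actual code sits behind the [Code] link; as a hedged sketch only, one common way to verify the local clock over the network is to compare it against the `Date` header of a well-known site (the URL and tolerance below are my assumptions, not the post's choices):

```python
import requests
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def check_clock(url="https://www.baidu.com", tolerance_s=5):
    # A HEAD request is enough; the Date header carries the server's UTC time.
    server_time = parsedate_to_datetime(requests.head(url, timeout=5).headers["Date"])
    drift = (datetime.now(timezone.utc) - server_time).total_seconds()
    print(f"local clock drift: {drift:+.1f}s")
    return abs(drift) <= tolerance_s

if __name__ == "__main__":
    print("clock OK" if check_clock() else "clock drifts too far")
```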
Batch querying coordinates (latitude/longitude)
[Code] Batch querying latitude and longitude. Original · 2022-09-07 17:15:45
百度爬取经纬度(百度地图的经纬度是存在偏移加密的)
【代码】百度爬取经纬度(百度地图的经纬度是存在偏移加密的)原创 2022-08-21 19:45:04 · 604 阅读 · 0 评论 -
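Baidu serves BD-09 coordinates, which are deliberately offset from the GCJ-02 grid most other Chinese map services use. A hedged sketch of the widely circulated BD-09 → GCJ-02 de-offset formula (my own addition, not code from the post):

```python
import math

X_PI = math.pi * 3000.0 / 180.0  # constant used in the common BD-09 conversion

def bd09_to_gcj02(bd_lon, bd_lat):
    """Approximate inverse of Baidu's BD-09 offset, back to GCJ-02."""
    x = bd_lon - 0.0065
    y = bd_lat - 0.006
    z = math.sqrt(x * x + y * y) - 0.00002 * math.sin(y * X_PI)
    theta = math.atan2(y, x) - 0.000003 * math.cos(x * X_PI)
    return z * math.cos(theta), z * math.sin(theta)

print(bd09_to_gcj02(116.404, 39.915))  # quick sanity check near Tiananmen
```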
Scraping listed-company financials and senior-management information (board of directors, supervisory board, etc.)
Scrapes stock financial data plus board, supervisory-board and other executive information. Leave a comment to request access; the repository is private. Original · 2022-07-16 20:44:13
Downloading the chromedriver version that matches your browser
How do you find the driver version that matches your own Chrome install? 1. Look up your Chrome version number, e.g. 74.0.3729.169, and copy only the 74.0.3729 part. 2. Append the copied number to the lookup URL, e.g. https://chromedriver.storage.googleapis.com/LATEST_RELEASE_74.0.3729. Visiting that URL returns the matching chromedriver release number, which gives you the download link: … Original · 2022-07-01 10:21:54
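The same lookup is easy to script; a minimal sketch of the steps above (the Chrome version is the post's example, and this legacy endpoint covers chromedriver releases up to version 114):

```python
import requests

chrome_version = "74.0.3729.169"                 # your local Chrome version (post's example)
major = ".".join(chrome_version.split(".")[:3])  # keep 74.0.3729, drop the build number
url = f"https://chromedriver.storage.googleapis.com/LATEST_RELEASE_{major}"
driver_version = requests.get(url, timeout=10).text.strip()
print("matching chromedriver:", driver_version)
# download e.g. https://chromedriver.storage.googleapis.com/<driver_version>/chromedriver_win32.zip
```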
Registered addresses of listed companies
Disclaimer: this code is for learning and exchange only. Neither the sharer nor the author accepts liability for malicious use by others. Do not modify the rate-limiting parameters without authorization, do not attack websites maliciously, and please observe public morals and the law. Losses such as site crashes caused by crawling are entirely the responsibility of the computer operator; serious consequences may carry criminal liability.
Crawler commissions: email leon_leon@yeah.net

```python
import requests
from fake_useragent import UserAgent
from lxml import etree
from time import sleep
from r…
```
Original · 2021-03-30 15:54:45
Scraping listed-company financial data from NetEase Finance
```python
import requests
from lxml import etree
from time import sleep
from fake_useragent import UserAgent
import pandas as pd
from random import randint

place_name = pd.read_csv(r'C:\Users\Admin\PycharmProjects\untitled\股票数据\上市公司代码.csv', encoding='utf-8')
length…
```
Original · 2021-01-08 18:31:31
Scraping Xinfadi market prices
Disclaimer: this code is for learning and exchange only. Neither the sharer nor the author accepts liability for malicious use by others. Do not modify the rate-limiting parameters without authorization, do not attack websites maliciously, and please observe public morals and the law. Losses such as site crashes caused by crawling are entirely the responsibility of the computer operator; serious consequences may carry criminal liability.

```python
import requests
from lxml import etree
from time import sleep
from fake_useragent import UserAgent
import pandas as pd

name_all = []
…
```
Original · 2020-12-28 10:52:15
Scraping market prices from the National Agricultural Products Business Information Public Service Platform
Disclaimer: this code is for learning and exchange only. Neither the sharer nor the author accepts liability for malicious use by others. Do not modify the rate-limiting parameters without authorization, do not attack websites maliciously, and please observe public morals and the law. Losses such as site crashes caused by crawling are entirely the responsibility of the computer operator; serious consequences may carry criminal liability.
Scrape of the National Agricultural Products Business Information Public Service Platform:

```python
import requests
from fake_useragent import UserAgent
from lxml import etree
from time import sleep
from random impor…
```
Original · 2020-12-27 18:49:15
Customs crawler, generation 7 (Sheng-Fo edition)
Help this poor kid. Original · 2020-12-20 12:48:48
Customs crawler, generation 3 (mature, extreme edition)
```python
import requests
from fake_useragent import UserAgent
from lxml import etree
from time import sleep
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
#from selenium.webdriver.support.wa…
```
Original · 2020-11-17 19:48:35
Scraping the China Economic and Social Development Statistical Database with selenium
```python
from selenium import webdriver
from time import sleep
from lxml import etree
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import pandas as pd
from random import randint

edge = webdriver.Edge()
edge.get("htt…
```
Original · 2020-11-10 15:39:27
python crawler: closing and switching browser tabs with selenium
Closing and switching browser tabs with python selenium.
1. Close all tabs: driver.quit()
2. Close the current tab (tab A opened a new tab B; this closes A): driver.close()
3. Close the current tab (tab A opened a new tab B; this closes B): use the browser's built-in shortcuts to close the open tab.
Firefox's own shortcuts: Ctrl+t opens a new tab; Ctrl+w closes the tab; Ctrl+Tab / Ctrl+Page_Up moves to the next tab; Ctrl+…
Original · 2020-11-10 12:38:02
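To close tab B without keyboard shortcuts, selenium can also switch focus by window handle and close only the focused tab; a hedged sketch (handle order is not guaranteed, so real code should remember the original handle):

```python
from selenium import webdriver

driver = webdriver.Edge()
driver.get("https://www.baidu.com")
driver.execute_script("window.open('https://www.huya.com/g/lol')")  # open tab B
tab_a, tab_b = driver.window_handles  # assumes two tabs in opening order
driver.switch_to.window(tab_b)        # move focus to tab B
driver.close()                        # closes only the focused tab (B)
driver.switch_to.window(tab_a)        # focus must be moved back explicitly
driver.quit()                         # ends the session and closes everything
```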
Quality scraping notes
selenium usage guide: https://selenium-python.readthedocs.io/index.html Original · 2020-09-11 08:40:07
Batch-fetching coordinates from Baidu and Gaode and computing distances
```python
# Fetch coordinates from Baidu -- network-intensive
import requests
from fake_useragent import UserAgent
import pandas as pd
from urllib.parse import quote
import re
from time import sleep
from random import randint
import random

# file-reading class, URL-management class
place_name = pd.read_csv(r'C:\name.csv', encodin…
```
Original · 2020-09-02 10:30:04
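The distance step is cut off in the preview; a hedged sketch of the usual great-circle (haversine) distance between two coordinate pairs, which may or may not match the post's own method:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in kilometres."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # mean Earth radius in km

print(f"{haversine_km(116.40, 39.90, 121.47, 31.23):.0f} km")  # Beijing -> Shanghai, ~1068 km
```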
Scraping: closing remarks
Basics wrap-up: the crawler-basics material has now been fully posted and worked through. Anyone with questions can leave a comment or message me directly. The next plan is to learn and post about tensorflow. Original · 2020-08-20 10:48:46
(32) scrapy login
```python
import scrapy

class LogSpider(scrapy.Spider):
    name = 'log'
    allowed_domains = ['sxt.cn']
    # start_urls = ['http://sxt.cn/']

    def start_requests(self):
        url = ''
        form_data = {
            'user': ,      # value elided in the preview
            'password': ,  # value elided in the preview
…
```
Original · 2020-08-15 11:24:47
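A hedged sketch of where the snippet is heading, using scrapy's FormRequest to POST the credentials and continue in the logged-in session (URL and field values are placeholders, not the post's):

```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_demo'

    def start_requests(self):
        # POST the login form once; scrapy keeps the session cookies afterwards
        yield scrapy.FormRequest(
            url='https://example.com/login',            # placeholder
            formdata={'user': 'me', 'password': 'pw'},  # placeholders
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('logged in, page length: %d', len(response.text))
```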
(31) scraping: dynamic UA and IP in scrapy
```python
# http_ua.py
import scrapy

class HttpUaSpider(scrapy.Spider):
    name = 'http_ua'
    allowed_domains = ['http://httpbin.org/get']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        print(response.text)

# setting.py
# Scrap…
```
Original · 2020-08-11 14:09:30
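The dynamic part lives in a downloader middleware; a hedged sketch of one that rotates a random User-Agent and proxy per request (the class name and proxy pool are my assumptions; wire it up via DOWNLOADER_MIDDLEWARES in settings.py):

```python
# middlewares.py
import random
from fake_useragent import UserAgent

PROXIES = ['http://35.236.158.232:8080']  # illustrative pool, reusing the proxy seen in other posts

class RandomUaProxyMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random  # fresh UA for every request
        request.meta['proxy'] = random.choice(PROXIES)  # fresh proxy for every request
```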
(30) scraping: CrawlSpider fetches crawl links automatically
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ZwrSpider(CrawlSpider):
    name = 'zwr'
    allowed_domains = ['zwdu.com']
    start_urls = ['https://www.zwdu.com/book/10304/']
    rules…
```
Original · 2020-08-11 11:22:44
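The rules tuple is truncated above; a hedged completion showing the usual shape (the allow pattern and callback are guesses, not the post's values):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ZwrDemoSpider(CrawlSpider):
    name = 'zwr_demo'
    allowed_domains = ['zwdu.com']
    start_urls = ['https://www.zwdu.com/book/10304/']
    rules = (
        # follow chapter links; follow=True keeps extracting links from each new page
        Rule(LinkExtractor(allow=r'/book/10304/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//h1/text()').get()}
```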
(29) scraping a novel: worked example
```python
# main
from scrapy.cmdline import execute
execute('scrapy crawl zw'.split())

# zw.py
import scrapy

class ZwSpider(scrapy.Spider):
    name = 'zw'
    allowed_domains = ['zwdu.com']
    start_urls = ['https://www.zwdu.com/book/32934/13022999.html']

    def…
```
Original · 2020-08-10 15:39:06
(28) scrapy in practice
```python
# maoyan.py
import scrapy

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/films?showType=3']

    def parse(self, response):
        names = response.xpath('''//div[@class='c…
```
Original · 2020-08-09 11:53:47
Downloading twisted for the crawler (miscellaneous note)
Unofficial Windows builds of python packages can be downloaded from https://www.lfd.uci.edu/~gohlke/pythonlibs/ Original · 2020-08-08 10:55:40
(27) crawler classes: combined example
```python
import requests
from fake_useragent import UserAgent
from lxml import etree

# URL management
class URLManger(object):
    def __init__(self):
        self.new_url = []
        self.old_url = []

    def get_new_url(self):
        url = self.new_url.pop()
        self.ol…
```
Original · 2020-08-08 10:38:32
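A hedged completion of the URL-manager idea from the preview (everything beyond the visible lines is my guess; sets are used instead of lists so duplicate checks are cheap):

```python
class URLManager:
    """Tracks URLs still to crawl and URLs already crawled, avoiding repeats."""
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)  # remember it so it is never queued twice
        return url

    def has_new_url(self):
        return len(self.new_urls) > 0
```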
(26) scraping: downloading images
```python
import requests
from fake_useragent import UserAgent
from lxml import etree

url = 'https://ronnie.tuchong.com/69224531/#image497472016'
header = {
    'User-Agent': UserAgent().Chrome
}
response = requests.get(url=url, headers=header)
e = etree.HTML(response…
```
Original · 2020-08-08 10:37:15
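The preview stops before the download itself; a hedged sketch of the usual finish, pulling the img src values with xpath and writing the raw bytes to disk (the selector and filenames are my assumptions):

```python
import requests
from fake_useragent import UserAgent
from lxml import etree

url = 'https://ronnie.tuchong.com/69224531/#image497472016'
header = {'User-Agent': UserAgent().chrome}
e = etree.HTML(requests.get(url, headers=header).text)
for i, src in enumerate(e.xpath('//article//img/@src')):  # assumed selector
    if src.startswith('//'):
        src = 'https:' + src                              # fix protocol-relative links
    with open(f'img_{i}.jpg', 'wb') as f:                 # binary mode is essential
        f.write(requests.get(src, headers=header).content)
```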
(25) scraping: using the scrollbar in selenium
```python
from selenium import webdriver
from lxml import etree
from time import sleep
from random import randint

url = 'https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=%E6%89%8B%E6%9C%BA&pvid=4e03dd16c6b24f729048a28bcd2363f3'
edge =…
```
Original · 2020-08-02 18:12:46
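The scrolling happens past the cut-off; the standard trick is to run a scroll script from the driver so lazy-loaded results render. A minimal sketch (pause length and repeat count are arbitrary):

```python
from selenium import webdriver
from time import sleep

edge = webdriver.Edge()
edge.get('https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8')
for _ in range(5):  # scroll a few screens down
    edge.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    sleep(2)        # give the newly revealed items time to load
html = edge.page_source
edge.quit()
```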
(24) scraping Huya with selenium: exercise
```python
from selenium import webdriver
from time import sleep
from lxml import etree
from random import randint

edge = webdriver.Edge()
edge.get("https://www.huya.com/g/lol")
sleep(10)
n = 1
while True:
    html = edge.page_source
    e = etree.HTML(html)
    name…
```
Original · 2020-08-02 14:13:50
(25) selenium3 + Edge + win10 configuration-error fix
Reference article. Original · 2020-08-01 09:36:36
(23) scraping: pyquery, combined practice
```python
import requests
from fake_useragent import UserAgent
from time import sleep
from random import randint
from pyquery import PyQuery

def get_html(url):
    headers = {
        'User-Agent': UserAgent().firefox
    }
    proxies = {
        "http": "http://35.…
```
Original · 2020-07-19 15:20:08
(22) scraping: the re library, combined practice
```python
import requests
from fake_useragent import UserAgent
from time import sleep
from random import randint
import re

def get_html(url):
    headers = {
        'User-Agent': UserAgent().firefox
    }
    proxies = {
        "http": "http://35.236.158.232:8080"
…
```
Original · 2020-07-19 12:50:24
(21) scraping: the beautiful soup library in practice
```python
import requests
from fake_useragent import UserAgent
from time import sleep
from random import randint
from bs4 import BeautifulSoup

def get_html(url):
    headers = {
        'User-Agent': UserAgent().firefox
    }
    proxies = {
        "http": "http://35…
```
Original · 2020-07-19 11:40:42
(20) scraping: xpath, combined practice
```python
import requests
from fake_useragent import UserAgent
from lxml import etree
from time import sleep
from random import randint

def get_html(url):
    headers = {
        'User-Agent': UserAgent().firefox
    }
    proxies = {
        "http": "http://35.236.15…
```
Original · 2020-07-18 19:22:39
(19) scraping: selenium
```python
from selenium import webdriver

d = webdriver.Edge()
d.get('https://www.baidu.com/')
```
Original · 2020-07-17 21:15:34
(18) scraping: xpath in practice
```python
import requests
from lxml import etree
from fake_useragent import UserAgent

url = 'https://tech.163.com/20/0716/07/FHL0LPK300097U7T.html'
headers = {
    'User-Agent': UserAgent().chrome
}
response = requests.get(url, headers=headers)
e = etree.HTML(response.t…
```
Original · 2020-07-17 12:49:04
(17) scraping with multiple threads
```python
from threading import Thread
from queue import Queue
from fake_useragent import UserAgent
import requests
from lxml import etree

# crawler class
class CrawlInfo(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_…
```
Original · 2020-07-16 11:01:40
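A hedged sketch of where the class is heading, with worker threads draining a URL queue and pushing page HTML onto a second queue (the run loop and driver code are my completion, not the post's):

```python
from threading import Thread
from queue import Queue, Empty
from fake_useragent import UserAgent
import requests

class CrawlInfo(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        headers = {'User-Agent': UserAgent().chrome}
        while True:
            try:
                url = self.url_queue.get_nowait()  # take a URL, stop when drained
            except Empty:
                break
            resp = requests.get(url, headers=headers, timeout=10)
            self.html_queue.put(resp.text)

url_q, html_q = Queue(), Queue()
for page in range(1, 4):
    url_q.put(f'https://httpbin.org/get?page={page}')  # placeholder URLs
workers = [CrawlInfo(url_q, html_q) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(html_q.qsize(), 'pages fetched')
```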
(16) scraping: jsonpath
```python
from jsonpath import jsonpath
import requests
import json  # needed for json.loads below; missing from the original preview

url = ''
headers = {
    'User-Agent': 'Mozilla/5.0(Windows;U;WindowsNT6.1;en-us)AppleWebKit/534.50(KHTML,likeGecko)Version/5.1Safari/534.50'
}
response = requests.get(url, headers=headers)
name = jsonpath(json.loads(r…
```
Original · 2020-07-15 16:45:42
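For reference, a self-contained jsonpath call on an in-memory dict (toy data of my own, showing the `$..key` recursive-descent form):

```python
from jsonpath import jsonpath

data = {'proxies': [{'ip': '1.2.3.4', 'port': 80}, {'ip': '5.6.7.8', 'port': 8080}]}
ips = jsonpath(data, '$..ip')  # '$..ip' matches every 'ip' key at any depth
print(ips)                     # ['1.2.3.4', '5.6.7.8']
```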
(15) scraping: pyquery
```python
from pyquery import PyQuery as pq
import requests

url = 'https://ip.jiangxianli.com/?page=1'
headers = {
    'User-Agent': 'Mozilla/5.0(Windows;U;WindowsNT6.1;en-us)AppleWebKit/534.50(KHTML,likeGecko)Version/5.1Safari/534.50'
}
response = requests.get(url, headers…
```
Original · 2020-07-15 12:14:20
(14) scraping: using xpath
```python
from lxml import etree
import requests

url = 'https://www.qidian.com/all'
headers = {
    'User-Agent': 'Mozilla/5.0(Windows;U;WindowsNT6.1;en-us)AppleWebKit/534.50(KHTML,likeGecko)Version/5.1Safari/534.50'
}
response = requests.get(url, headers=headers)
e = e…
```
Original · 2020-07-14 21:53:40
(13) scraping: re
```python
import re

str = 'I Study python3.7 everyday'  # shadows the built-in str; kept as in the original
print('--' * 50)
m1 = re.match(r'.', str)        # match only looks at the start of the string
print(m1.group())
#m2 = re.search(r'S\w+', str)
#print(m2.group())
m3 = re.search(r'p\w+.\w', str)  # search scans the whole string
print(m3.group())
print('--' * 50)
f1 = re.findall(r'y', str)      # findall returns every match as a list
print(f1)
print…
```
Original · 2020-07-14 18:25:36
(12) scraping: using beautifulsoup
```python
from bs4 import BeautifulSoup  # import missing from the preview

soup = BeautifulSoup(str, 'lxml')  # `str` should hold the HTML text being parsed
print(soup.title)
print(soup.div.attrs)
print(soup.div.text)
print(soup.div.string)
print(soup.strong.prettify)  # note: prettify is a method; call it as prettify()
print('__' * 100)
print(soup.find_all(class_='info'))
print(soup.find_all('div', attrs={'float': 'left'}))
prin…
```
Original · 2020-07-14 18:21:01
(11) scraping: requests and cookies
```python
import requests

session = requests.Session()
header = {
    'User-Agent': 'Mozilla/5.0(Windows;U;WindowsNT6.1;en-us)AppleWebKit/534.50(KHTML,likeGecko)Version/5.1Safari/534.50'
}
login_url = 'https://cnpassport.youku.com/newlogin/login.do?appName=youku&f…
```
Original · 2020-07-09 12:40:41
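A hedged sketch of the Session pattern the snippet sets up: POST the login once, then let the session object carry the cookies into later requests (the endpoint and form fields are placeholders, not Youku's real login contract):

```python
import requests

session = requests.Session()  # keeps cookies across requests automatically
header = {'User-Agent': 'Mozilla/5.0 ...'}

# placeholder endpoint and fields
session.post('https://example.com/login.do',
             data={'user': 'me', 'password': 'pw'}, headers=header)
print(session.cookies.get_dict())  # cookies set by the login response

# later requests reuse those cookies with no extra work
profile = session.get('https://example.com/profile', headers=header)
print(profile.status_code)
```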