Web Crawlers
小雷君
If you do not work hard while you are young, you will regret it in vain when you are old.
Downloading a video with the Python requests library
    import requests
    # Download the video
    def download(url):
        with requests.get(url, stream=True) as r:
            print('Starting download...')
            with open('v.mp4', 'wb') as f:
                for i in r.iter_content(chunk_size=...
Original · 2019-03-27 15:26:41 · 2690 views · 0 comments
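The excerpt above is cut off inside the chunk loop, so here is a minimal sketch of how a streamed download of this kind is commonly finished. The chunk size of 8192 bytes, the `timeout` value, and the `write_chunks` helper are assumptions for illustration and do not come from the original post.

```python
import requests


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download(url, path="v.mp4", chunk_size=8192, timeout=30):
    """Stream `url` to `path` without holding the whole body in memory."""
    # chunk_size and timeout are assumed values, not taken from the post.
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()  # fail early on 4xx/5xx instead of saving an error page
        return write_chunks(r.iter_content(chunk_size=chunk_size), path)
```

Keeping the file-writing step in its own helper makes it possible to exercise it with in-memory chunks, without touching the network.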
Cookie pool
    from selenium import webdriver
    from fake_useragent import UserAgent
    import time
    import requests
    import hashlib
    import pymysql
    import random
    import json

    class CookiesPool:
        def __init__(self):
            ...
Original · 2019-04-01 17:22:30 · 506 views · 0 comments
Simulated login with selenium
    from selenium import webdriver
    import time

    def login():
        driver = webdriver.Chrome()
        try:
            driver.maximize_window()
            driver.get('http://www.weibo.com/login.php')
            time.sl...
Original · 2019-04-01 17:37:04 · 607 views · 0 comments
Multithreaded crawler
    import requests
    from lxml import etree
    from queue import Queue
    import threading
    import json

    class thread_crawl(threading.Thread):
        ''' Crawling thread class '''
        def __init__(self, threadID):
            ...
Original · 2019-04-11 11:47:13 · 203 views · 0 comments
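Only the imports and the start of the thread class are visible above. As a rough sketch of the same pattern (worker threads draining a shared `Queue`), something like the following would work. The names `CrawlThread` and `crawl`, and the injected `fetch` callable, are hypothetical and not taken from the post.

```python
import threading
from queue import Queue, Empty


class CrawlThread(threading.Thread):
    """Worker that pulls URLs from a shared queue until it is empty."""

    def __init__(self, thread_id, url_queue, fetch, results):
        super().__init__()
        self.thread_id = thread_id
        self.url_queue = url_queue
        self.fetch = fetch      # callable: url -> page content (assumed interface)
        self.results = results  # shared Queue of (url, content or exception)

    def run(self):
        while True:
            try:
                url = self.url_queue.get_nowait()
            except Empty:
                break  # no work left for this worker
            try:
                self.results.put((url, self.fetch(url)))
            except Exception as exc:  # record the failure, keep the worker alive
                self.results.put((url, exc))
            finally:
                self.url_queue.task_done()


def crawl(urls, fetch, num_threads=4):
    """Fetch all `urls` with `num_threads` workers; return {url: content}."""
    url_queue, results = Queue(), Queue()
    for url in urls:
        url_queue.put(url)
    workers = [CrawlThread(i, url_queue, fetch, results) for i in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    collected = {}
    while not results.empty():
        url, content = results.get()
        collected[url] = content
    return collected
```

With requests installed, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, and the returned text could then be parsed with `lxml.etree` as the imports above suggest.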
Simulating a slider CAPTCHA with Python selenium
This post tests a simulated login on Bilibili, including its slider CAPTCHA.
    import random
    import time
    import requests
    from selenium.webdriver import ActionChains
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.supp...
Original · 2019-04-03 09:16:27 · 1205 views · 0 comments
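The post's own drag logic is not visible in the excerpt. One common approach to slider CAPTCHAs is to split the drag distance into small steps that accelerate and then slow down, so the movement looks less mechanical. The sketch below shows only that step generation; `build_track` and all of its parameter values are assumptions, not the author's code.

```python
def build_track(distance, accel_ratio=0.7, accel=20.0, decel=30.0, dt=0.3):
    """Split `distance` pixels into integer steps that speed up, then slow down."""
    distance = int(round(distance))
    track, moved, velocity = [], 0, 0.0
    switch_point = distance * accel_ratio  # where acceleration turns into braking
    while moved < distance:
        a = accel if moved < switch_point else -decel
        velocity = max(velocity + a * dt, 0.0)
        step = max(1, round(velocity * dt))  # always advance at least 1 px
        step = min(step, distance - moved)   # never overshoot the target
        track.append(step)
        moved += step
    return track
```

Each step could then be replayed with selenium's `ActionChains`: `click_and_hold(slider)`, one `move_by_offset(step, 0)` per entry, then `release()` and `perform()`. Adding a small random jitter to each step with the `random` module is a natural extension, left out here to keep the function deterministic.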
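Implementing a proxy IP pool
This post collects free proxy IPs from the Xici site (西刺) and stores them in MySQL to implement a proxy IP pool.
    from selenium import webdriver
    from bs4 import BeautifulSoup
    from fake_useragent import UserAgent
    import time
    import requests
    import hashlib
    import pymysql
    import rand...
Original · 2019-04-03 16:31:20 · 1110 views · 0 comments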
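To illustrate the storage side of a proxy pool like the one described, here is a small sketch. It deliberately swaps the post's MySQL backend for the standard-library `sqlite3` module so that it runs without a database server. The `ProxyPool` class, its table layout, and the `is_alive` callback are all hypothetical.

```python
import random
import sqlite3


class ProxyPool:
    """Tiny proxy store backed by SQLite (a stand-in for the post's MySQL table)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS proxies (addr TEXT PRIMARY KEY)"
        )

    def add(self, addr):
        # INSERT OR IGNORE silently skips addresses already in the pool.
        with self.conn:
            self.conn.execute(
                "INSERT OR IGNORE INTO proxies (addr) VALUES (?)", (addr,)
            )

    def remove(self, addr):
        with self.conn:
            self.conn.execute("DELETE FROM proxies WHERE addr = ?", (addr,))

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]

    def get_random(self):
        """Return a random 'ip:port' string, or None if the pool is empty."""
        rows = self.conn.execute("SELECT addr FROM proxies").fetchall()
        return random.choice(rows)[0] if rows else None

    def prune(self, is_alive):
        """Drop every proxy for which the `is_alive(addr)` callback returns False."""
        for (addr,) in self.conn.execute("SELECT addr FROM proxies").fetchall():
            if not is_alive(addr):
                self.remove(addr)
```

In practice `is_alive` would probably send a test request through the proxy (for example with the `proxies=` argument of `requests.get`) and return whether it succeeded within a timeout.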
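Extracting text from traditional image CAPTCHAs with neat fonts
Extracts text from black-and-white images; the characters must be neat and regular.
    from PIL import Image
    import pytesseract
    image = Image.open('./images/tesseracttest.jpg')
    # image.show()
    text = pytesseract.image_to_string(image)
    print(text)
    ...
Original · 2019-04-03 16:53:20 · 850 views · 0 comments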
Scrapy middleware: configuring selenium, an IP pool, and random User-Agents
    from scrapy import signals
    import random
    from selenium import webdriver
    from scrapy.http import HtmlResponse
    from fake_useragent import UserAgent
    import base64
    from bilibili.proxyIPPool import ProxyIp...
Original · 2019-04-09 19:07:28 · 4687 views · 0 comments
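The middleware body is truncated above, so the following is only a sketch of the random User-Agent and proxy parts, written as a plain class that follows Scrapy's downloader-middleware interface (`process_request(request, spider)`). The class name and the `USER_AGENT_LIST` and `PROXY_LIST` setting names are hypothetical. The original appears to use `fake_useragent` and its own `ProxyIp...` helper instead, and its selenium handling is not covered here.

```python
import random


class RandomUserAgentProxyMiddleware:
    """Downloader middleware sketch: random User-Agent and proxy per request."""

    def __init__(self, user_agents, proxies=()):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST and PROXY_LIST are assumed setting names for this sketch.
        settings = crawler.settings
        return cls(settings.getlist("USER_AGENT_LIST"), settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"].
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request normally
```

To take effect, a class like this would have to be registered under `DOWNLOADER_MIDDLEWARES` in the project's `settings.py`.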