
[Repost] SQL kaggle learn: WHERE AND

WHERE trip_start_timestamp BETWEEN '2017-01-01' AND '2017-07-01' AND trip_seconds > 0 AND trip_miles > 0
WHERE trip_start_timestamp > '2017-01-01' AND trip_start_timestamp <...

2019-04-04 16:49:00 140
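The truncated filter above can be tried end-to-end. A minimal sketch with sqlite3; the trip rows are invented for illustration:

```python
import sqlite3

# BETWEEN is inclusive on both ends; the extra AND clauses drop degenerate trips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_start_timestamp TEXT, trip_seconds INT, trip_miles REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2017-03-15", 600, 2.5),   # in range, valid
        ("2017-03-16", 0, 1.0),     # zero seconds: filtered out
        ("2018-01-01", 900, 4.0),   # outside the date range
    ],
)
rows = conn.execute(
    """SELECT * FROM trips
       WHERE trip_start_timestamp BETWEEN '2017-01-01' AND '2017-07-01'
         AND trip_seconds > 0 AND trip_miles > 0"""
).fetchall()
print(len(rows))  # only the first trip survives all three conditions
```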

[Repost] SQL kaggle learn WITH AS exercise

rides_per_year_query = """
    SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year,
           COUNT(unique_key) AS num_trips
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    GROUP BY year
    ORD...

2019-04-04 11:37:00 171
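The query above targets BigQuery; the same shape can be sketched in sqlite, substituting strftime for EXTRACT(YEAR FROM ...). The taxi rows here are made up:

```python
import sqlite3

# Group trips by year and count them, sqlite-style.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_trips (unique_key TEXT, trip_start_timestamp TEXT)")
conn.executemany(
    "INSERT INTO taxi_trips VALUES (?, ?)",
    [("a", "2016-05-01"), ("b", "2016-07-09"), ("c", "2017-01-03")],
)
rows = conn.execute(
    """SELECT strftime('%Y', trip_start_timestamp) AS year,
              COUNT(unique_key) AS num_trips
       FROM taxi_trips
       GROUP BY year
       ORDER BY year"""
).fetchall()
print(rows)  # [('2016', 2), ('2017', 1)]
```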

[Repost] SQL COUNT(1)

If you are ever unsure what to put inside a COUNT() aggregation, you can do COUNT(1) to count the rows in each group. Most people find it especially readable, because we know it's not focusing ...

2019-04-02 20:58:00 174
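A quick check of the point above, on a throwaway table with invented rows: COUNT(1) and COUNT(*) both count every row per group, while COUNT(col) skips NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("a", None), ("b", 3)])
rows = conn.execute(
    "SELECT grp, COUNT(1), COUNT(*), COUNT(val) FROM t GROUP BY grp ORDER BY grp"
).fetchall()
# group 'a' has a NULL val, so COUNT(val) sees only 1 of its 2 rows
print(rows)  # [('a', 2, 2, 1), ('b', 1, 1, 1)]
```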

[Repost] SQL: string values in a WHERE clause need quotes ' '

WHERE year >= 2010 AND year <= 2017 AND indicator_code = 'SE.XPD.TOTL.GD.ZS'
Reposted from: https://www.cnblogs.com/bamboozone/p/10644973.html

2019-04-02 20:09:00 166
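The rule above in action: string literals need single quotes, numbers do not. A sketch against an invented indicators table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ind (year INT, indicator_code TEXT)")
conn.executemany(
    "INSERT INTO ind VALUES (?, ?)",
    [(2009, "SE.XPD.TOTL.GD.ZS"),
     (2012, "SE.XPD.TOTL.GD.ZS"),
     (2012, "OTHER")],
)
rows = conn.execute(
    """SELECT * FROM ind
       WHERE year >= 2010 AND year <= 2017
         AND indicator_code = 'SE.XPD.TOTL.GD.ZS'"""
).fetchall()
print(rows)  # [(2012, 'SE.XPD.TOTL.GD.ZS')]
```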

[Repost] kaggle learn python

def has_lucky_number(nums):
    return any([num % 7 == 0 for num in nums])

def menu_is_boring(meals):
    """Given a list of meals served over some period of time, return True if the...

2019-03-24 21:47:00 123
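A runnable version of the snippet above. Note any() needs no list at all; a generator expression is enough. The body of menu_is_boring is my guess at the truncated exercise (consecutive duplicate meals), not taken from the source:

```python
def has_lucky_number(nums):
    """Return True if any number in nums is divisible by 7."""
    return any(num % 7 == 0 for num in nums)

def menu_is_boring(meals):
    """Return True if the same meal is ever served on two consecutive days.
    (Assumed body: the original excerpt is cut off.)"""
    return any(meals[i] == meals[i + 1] for i in range(len(meals) - 1))

print(has_lucky_number([5, 14, 9]))           # True (14 is divisible by 7)
print(menu_is_boring(["egg", "egg", "ham"]))  # True
print(menu_is_boring(["egg", "ham", "egg"]))  # False
```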

[Repost] pandas

df = reviews.loc[:99, ['country','variety']] or df = reviews.loc[[1,2,3,4], ['country','variety']]
df = reviews.loc[[0,1,10,100], ['country','province','region_1','region_2']]
The two arguments cannot be swapped: the row index must come first. iloc...

2019-03-24 19:17:00 56
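A minimal sketch of the loc/iloc distinction above, with a stand-in reviews frame (the original wine-reviews data is not shown here). Rows come first, columns second, and a loc label slice is inclusive of its endpoint while iloc is exclusive, which is why loc[:99, ...] yields 100 rows:

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["IT", "FR", "US", "ES"],
    "variety": ["Red", "White", "Rose", "Red"],
    "points": [87, 90, 85, 88],
})
df = reviews.loc[:2, ["country", "variety"]]  # label slice: rows 0..2 INCLUSIVE
print(df.shape)   # (3, 2)
df2 = reviews.iloc[:2, [0, 1]]                # positional slice: rows 0..1
print(df2.shape)  # (2, 2)
```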

[Repost] Cracking the 实习僧 font obfuscation

1. https://www.hitoy.org/tool/file_base64.php — generate a file from the base64 string, choosing ttf as the output format.
2. https://fontdrop.info/ — upload the ttf; hovering over a glyph shows every obfuscated character and its mapping.
Reposted from: https://www.cnblogs.com/bamboozone/p/10555027.html...

2019-03-18 21:13:00 362

[Repost] cookiejar

Referer: https://www.cnblogs.com/why957/p/9297779.html — the article covers four ways to simulate login. yield Request() hands a new request back to the crawler to execute. For cookie handling when sending requests, meta={'cookiejar': 1} turns on cookie recording; write it in Request() on the first request, afterwards meta={'cookiejar': response...

2019-03-09 11:54:00 690

[Repost] 煎蛋 ooxx

pipeline.py
class Jiandanline(FilesPipeline):
    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:
            yield scrapy.Request(file_url)
    de...

2019-03-08 20:04:00 342

[Repost] Notes on the pitfalls from learning to write a pipeline

Used FilesPipeline, but with the storage path meant for images. While testing the 煎蛋 ooxx front page, the shell returned a long list, yet the real crawl kept returning a single item. Very frustrating: test after test failed, until I realized the front page had refreshed and genuinely had only one item... If def file_path is written badly, the files get treated as invalid and filtered out by def item_completed. file_path only writes a path name, just a path name...

2019-03-08 15:58:00 176

[Repost] scrapy flow diagram

Refer: https://blog.yongli1992.com/2015/02/08/python-scrapy-module/ — it shows a diagram of the Scrapy architecture. The Scrapy Engine runs the whole program. The Scheduler decides which URLs to visit. The Downloader fetches responses from the network. The Spider parses each response, extracting the data we want and finding the follow-up URLs to visit...

2019-03-08 13:36:00 101

[Repost] Overriding the pipeline

Why override the method get_media_requests, and what is the difference:
def get_media_requests(self, item, info):  # original
    return [Request(x) for x in item.get(self.images_urls_field, [])]
def get_media_requests(self, ...

2019-03-08 13:30:00 111

[Repost] super()

From https://mozillazg.com/2016/12/python-super-is-not-as-simple-as-you-thought.html  # this author is seriously good
In single inheritance, super works just as everyone expects: it is mainly used to call the parent class's methods.
class A:
    def __init__(self):
        self.n = 2
...

2019-03-07 10:15:00 131
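The single-inheritance case above can be completed into a runnable sketch, in the spirit of the linked article: B's __init__ calls super() to run A's __init__ first, then layers its own change on top.

```python
class A:
    def __init__(self):
        self.n = 2

class B(A):
    def __init__(self):
        super().__init__()  # runs A.__init__, so self.n starts at 2
        self.n += 3

b = B()
print(b.n)  # 5
```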

[Repost] os.path.join

os.path.join(): an argument that begins with "/" restarts the join, and every argument before it is discarded; this rule takes precedence. Subject to that, the repost claims an argument beginning with "./" makes joining start from the argument before it.
import os
print("1:", os.path.join('aaaa', '/bbbb', 'ccccc.txt'))
print("2:...

2019-03-06 21:51:00 89
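The absolute-path rule above is easy to verify; note that a "./" component is not absolute and does not restart the join, it is simply appended. Using posixpath explicitly so the output is identical on any OS:

```python
import posixpath  # os.path with POSIX rules, regardless of platform

# An absolute component ("/bbbb") discards everything before it.
print(posixpath.join('aaaa', '/bbbb', 'ccccc.txt'))   # /bbbb/ccccc.txt
# A "./" component is NOT absolute: nothing is discarded.
print(posixpath.join('aaaa', './bbbb', 'ccccc.txt'))  # aaaa/./bbbb/ccccc.txt
```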

[Repost] scrapy item pipeline

item pipeline
process_item(self, item, spider)  # the method every pipeline must have; the actual per-item handling goes in here. Other methods can be added:
open_spider(self, spider)  # This method is called when the spider is opened.
close...

2019-03-05 21:05:00 198
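A pipeline is just a class with those hook methods; nothing below needs the scrapy package itself, and the 'seen'-flag handling is a made-up example, not Scrapy's own behavior:

```python
class MyPipeline:
    def open_spider(self, spider):
        # called once when the spider opens, e.g. to open a file or DB
        self.count = 0

    def process_item(self, item, spider):
        # the one required method: transform the item (or raise DropItem)
        self.count += 1
        item["seen"] = True
        return item

    def close_spider(self, spider):
        # called when the spider closes, e.g. to flush or close resources
        print(f"processed {self.count} items")

pipe = MyPipeline()
pipe.open_spider(spider=None)
out = pipe.process_item({"title": "t"}, spider=None)
pipe.close_spider(spider=None)
print(out)  # {'title': 't', 'seen': True}
```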

[Repost] Learning to use scrapy item pipelines

At first it made no sense. From https://www.jianshu.com/p/18ec820fe706 I found a fairly complete example to borrow from, then wrote my own 煎蛋 pipeline. First, create in items:
image_urls = scrapy.Field()  #
images = scrapy.Field()  # these two are required
image_paths = sc...

2019-03-05 20:16:00 77

[Repost] dygod.net

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DgSpider(CrawlSpider):
    name = 'dg'
    # a...

2019-03-03 10:08:00 2697

[Repost] https://scrapingclub.com/exercise/detail_sign/

def parse(self, response):
    # pattern1 = re.compile('token=(.*?);')
    # token = pattern1.findall(response.headers.getlist("set-cookie")[1].decode("utf-8"))[0]
    patt...

2019-03-02 11:21:00 137
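The commented-out regex above pulls a token out of a Set-Cookie header. A standalone sketch, where the header bytes are a fabricated stand-in for response.headers.getlist("set-cookie")[1]:

```python
import re

set_cookie = b"token=abc123; expires=Wed, 01 Jan 2020 00:00:00 GMT; Path=/"
pattern1 = re.compile(r"token=(.*?);")
# decode the raw header bytes, then take the first capture group
token = pattern1.findall(set_cookie.decode("utf-8"))[0]
print(token)  # abc123
```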

[Repost] https://scrapingclub.com/exercise/basic_captcha/

def parse(self, response):
    # set_cookies = response.headers.getlist("set-cookie").decode("utf-8")
    pattern1 = re.compile('csrftoken=(.*?);')
    pattern2 = re.compil...

2019-03-01 16:52:00 524

[Repost] https://scrapingclub.com/exercise/basic_login/

Problems hit: csrftoken and cfduid live in request.headers, and I spent ages looking for how to get request.headers inside scrapy. From scrapy shell, fetch, then request.headers returns the right content, but in a scrapy project I didn't know what code to write. Online I found response.request.headers, and written that way it...

2019-03-01 11:21:00 270

[Repost] Python scrapy - Login Authenication Issue

https://stackoverflow.com/questions/37841409/python-scrapy-login-authenication-issue
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.http import Request

class FirstS...

2019-03-01 10:44:00 125

[Repost] https://scrapingclub.com/exercise/detail_cookie/

def parse(self, response):
    pattern = re.compile('token=(.*?);')
    token = pattern.findall(response.headers.get("set-cookie").decode("utf-8"))[0]
    cookie = {
        ...

2019-02-27 14:47:00 305

[Repost] scrapy: get cookie from response

scrapy shell
fetch('your_url')
response.headers.getlist("Set-Cookie")
https://stackoverflow.com/questions/46543143/scrapy-get-cookies-from-response-request-headers
response.headers returns...

2019-02-27 10:04:00 440

[Repost] css selectors tips

From https://saucelabs.com/resources/articles/selenium-tips-css-selectors ...

2019-02-24 18:30:00 345

[Repost] A CSS selection question

<div class="col-lg-4 col-md-6 mb-4"><div class="card"><a href="/exercise/list_basic_detail/90008-E/"><img class="card-img-top img-fluid" src="/static/img/90008-E.jpg"...

2019-02-23 19:32:00 116

[Repost] Extracting data from js

<script language="JavaScript" type="text/javascript+gk-onload">
    SKART = (SKART) ? SKART : {};
    SKART.analytics = SKART.analytics || {};
    SKART.analytics["category"] = "tele...

2019-02-21 12:35:00 1022
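One common way to extract such values is a regex over the page source. A sketch where the script text is a shortened, made-up stand-in for the block above (including the value, which is truncated in the original):

```python
import re

html = '''<script language="JavaScript" type="text/javascript+gk-onload">
    SKART = (SKART) ? SKART : {};
    SKART.analytics = SKART.analytics || {};
    SKART.analytics["category"] = "televisions";
</script>'''
# escape the JS punctuation, capture whatever sits inside the quotes
match = re.search(r'SKART\.analytics\["category"\]\s*=\s*"(.*?)"', html)
print(match.group(1))  # televisions
```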

[Repost] Searching JSON content with F12

Reposted from: https://www.cnblogs.com/bamboozone/p/10411256.html

2019-02-21 11:19:00 2069

[Repost] materials

http://interactivepython.org/runestone/static/pythonds/index.html
https://blog.michaelyin.info/scrapy-exercises-make-you-prepared-for-web-scraping-challenge/
https://scrapingclub.com/
https://...

2019-02-21 09:00:00 162

[Repost] xpath, css

https://docs.scrapy.org/en/latest/intro/tutorial.html
xpath: @ selects an attribute; . selects relative to the current node; // selects at any depth.
/bookstore/book[position()<3] selects the first two book elements that are children of bookstore.
css: span.text::text
response.css("...

2019-02-13 20:32:00 61
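The XPath bits above can be tried with the stdlib's limited XPath support (a full position()<3 predicate needs lxml; plain list slicing does the job here). The bookstore XML is invented for illustration:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<bookstore>"
    "<book><title lang='en'>A</title></book>"
    "<book><title lang='en'>B</title></book>"
    "<book><title lang='fr'>C</title></book>"
    "</bookstore>"
)
first_two = doc.findall("./book")[:2]             # the first two book children
print([b.find("title").text for b in first_two])  # ['A', 'B']
print(doc.find(".//title").get("lang"))           # @lang: read an attribute
```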

[Repost] chromedriver: fullscreen, paging, errors

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.sup...

2019-01-29 14:51:00 192

[Repost] Learning Python with PyCharm

A module shown grayed out after import has not been referenced yet. If lxml cannot be found, use the anaconda prompt: pip uninstall lxml, then reinstall. When using requests, if your regex fails to parse the page, print the page first and then write the regex. pyquery's attr() may return nothing because it only reads the first match; see https://www.cnblogs.com/airnew/p/10056551...

2019-01-29 14:01:00 290
