Python, Web Crawlers - 悦来客栈的老板's blog - CSDN Blog

Python, Web Crawlers

Articles: 57 · Article views: 433770 · Article favorites: 576

Author: 悦来客栈的老板

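JS Reverse Engineering | A Beginner's Guide to Patching the Browser Environment

More and more JavaScript code now depends on browser-specific features. If you run extracted JavaScript source under node, it may throw errors, or it may produce results that differ from those in the browser, so the output fails the server's parameter validation. Patching in the browser environment is therefore essential. However the code probes the browser environment, every check comes down to one of two cases: (1) the value is used in a condition and changes the control flow; (2) the value takes part in the encryption computation. In either case, add the missing definitions at the very top of the source code. Avoid editing the source code that reads the browser feature (hard-coding the value) unless there is really no alternative. When you don't know how to patch the environment...

Original 2020-06-07 14:24:53 · 10253 views · 0 comments
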
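The "add it at the very top of the source" advice can be sketched in Python, which is what ends up calling the JS in this series. This is only an illustration: `ENV_STUB`, `patch_source`, and the stubbed property values are hypothetical placeholders, not taken from the post or from any real site.

```python
# Hypothetical environment stub: the names and values below are placeholders
# chosen for illustration, not values observed on a real site.
ENV_STUB = """\
var window = globalThis;
var navigator = { userAgent: "Mozilla/5.0 (placeholder)" };
var document = { cookie: "" };
"""


def patch_source(js_source: str, env_stub: str = ENV_STUB) -> str:
    """Return the script with the environment stub placed before the original code.

    The extracted source itself is left untouched; only a prefix is added.
    """
    return env_stub + "\n" + js_source


# A stand-in for code extracted from a page that reads a browser property.
extracted = "function sign() { return navigator.userAgent.length; }"
patched = patch_source(extracted)
```

The patched string can then be written to a file and run under node. Keeping the stub separate from the extracted code makes it easy to extend when node reports the next missing object.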
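JS Practice Analysis: Two Lines of Code to Fix Errors in Extracted RSA Code

RSA is an encryption algorithm that websites use all the time. I watched people in the group chat spend ages on RSA code they had extracted: it kept throwing errors and they did not know how to resolve them, so here are a few pointers. Take the GitHub address mentioned last time: https://github.com/travist/jsencrypt/blob/master/bin/jsencrypt.js. Copy the entire file and save it to your computer; I saved it to the F drive (rsa.js), ...

Original 2019-12-28 17:59:57 · 1957 views · 1 comment
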
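Debugging Tip: Use the Browser to Debug Your Extracted JS Code

Why debug in the browser? As more and more sites check browser fingerprints, debugging in the browser has become especially important. If the extracted code cannot even run in the browser, it certainly will not run under node. When I debug JS, I almost always work in this order: get it working in the browser ---> under node, patch whatever is missing ---> call the JS code from Python. Seeing how much trouble people in the group chat had extracting RSA code, let's use it as the example...

Original 2019-12-28 17:57:16 · 2369 views · 0 comments
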
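Extracted Code Won't Run? A Step-by-Step Guide to Patching Whatever Is Missing

Original title: Getting Started with Crawlers: Finding the JS Entry Point (10), Part 2. I left an exercise for everyone last time, but it seems few people downloaded it, so today I'll explain how to solve it. Make sure node is installed on your computer. Download the file and save it to your computer; I saved it to the E drive. Link: https://pan.baidu.com/s/1agS_1ytojgXyGms_ZfwLPw Extraction code: 753u ...

Original 2019-10-13 10:00:41 · 2026 views · 0 comments
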
Python Crawler in Practice (3): Simple Download of Web Page Images

Code first:

    #coding=utf-8
    import urllib.request

    for i in range(1,41):
        imgurl = "http://mtl.ttsqgs.com/images/img/11552/"
        imgurl += str(i) + ".jpg"
        urllib.request.urlretrieve(imgurl,'%s.jpg' % i)
    ...

Original 2017-09-04 13:00:00 · 5170 views · 0 comments

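The loop in this excerpt builds one URL per image number and saves each file under the same number. A minimal sketch of just the URL and file-name step, kept separate from the download so it can be checked without network access (`image_urls` is a hypothetical helper name; the base URL and the 1 to 40 range come from the excerpt):

```python
def image_urls(base: str, first: int, last: int) -> list[tuple[str, str]]:
    """Pair each numbered image URL with the local file name to save it under."""
    return [(f"{base}{i}.jpg", f"{i}.jpg") for i in range(first, last + 1)]


pairs = image_urls("http://mtl.ttsqgs.com/images/img/11552/", 1, 40)

# Each pair could then be handed to urllib.request.urlretrieve(url, filename),
# which is the call the excerpt uses for the actual download.
```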
Python Crawler in Practice (2): Scraping a Tianya Thread (Original Poster Only)

Code first:

    #coding=utf-8
    import requests
    from bs4 import BeautifulSoup

    def getHtml(url):
        page = requests.get(url)
        html = page.text
        return html

    def getText(html):
        global i
        listautho...

Original 2017-09-04 12:52:11 · 34443 views · 0 comments

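Python Crawler in Practice (4): Downloading All the Girl Photos from Jandan

Jandan (煎蛋网) is a site known for racy content, with all kinds of photos of girls. Site: http://jandan.net/ooxx. It serves images in two formats, gif and jpg. Let's write a program that downloads the girl photos from every page.

    #coding=utf-8
    import requests
    import urllib.request
    from bs4 import BeautifulSoup

    def getHtml(url):
    ...

Original 2017-09-04 22:03:47 · 17383 views · 1 comment
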
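Since the page mixes gif and jpg images, a downloader has to decide which links to keep and which extension to save under. One possible way to do that, sketched with the standard library only (`image_kind` is a hypothetical helper and the sample URLs are placeholders, not links from the site):

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse


def image_kind(url: str) -> Optional[str]:
    """Return 'gif' or 'jpg' based on the URL path's extension, or None otherwise."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if ext == "jpeg":
        ext = "jpg"  # treat .jpeg as the same format as .jpg
    return ext if ext in ("gif", "jpg") else None


# Placeholder links standing in for the src attributes found on a page.
links = [
    "http://example.com/large/a1.jpg",
    "http://example.com/large/b2.GIF?size=big",
    "http://example.com/static/logo.png",
]
kinds = [image_kind(link) for link in links]
```

Parsing the URL first means a query string after the file name does not hide the extension.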
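Python Crawler in Practice (5): Downloading All Images in a Baidu Tieba Thread

Preparation:
Target URL: https://tieba.baidu.com/p/5113603072
Goal: download the photos from every floor (reply) on that page.
Step 1: analyze the page source. In Firefox ---> right-click on the page and choose "View Page Source", which opens a new tab.
Step 2: find the image source URL. In the new tab press Ctrl + F, type jpg, and locate the source URL of the first image.
BTW, how do we know whether this link is the...

Original 2017-09-05 12:31:16 · 49513 views · 2 comments
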
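Searching the page source for jpg by hand, as in step 2, translates naturally into a regular expression over the HTML. The sketch below is an assumption about how that could look rather than the post's own code; the pattern, the helper name `find_jpg_sources`, and the sample markup are all made up for illustration.

```python
import re

# Capture the src attribute of <img> tags whose URL ends in .jpg.
IMG_SRC = re.compile(r'<img[^>]+src="([^"]+\.jpg)"', re.IGNORECASE)


def find_jpg_sources(html: str) -> list[str]:
    """Return the src URLs of all jpg images in the given HTML text."""
    return IMG_SRC.findall(html)


# Placeholder markup standing in for the real page source.
sample = (
    '<div><img class="pic" src="https://example.com/a/1.jpg">'
    '<img src="https://example.com/logo.png">'
    '<img src="https://example.com/a/2.jpg" width="560"></div>'
)
sources = find_jpg_sources(sample)
```

On a real thread the pattern usually needs narrowing, for example by a class attribute, so that avatars and icons are not picked up along with the photos.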
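Python Crawler in Practice (7): Scraping Batch, Engine Displacement, and Other Data from the Vehicle Announcement Site

URL: http://www.cn357.com/notice/
Straight to the code.

    #coding=utf-8
    import re
    import requests

    def getHtml(url):
        try:
            page = requests.get(url)
            html = page.text
            return html
    ...

Original 2017-09-08 12:48:43 · 30425 views · 1 comment
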
Python Crawler in Practice (9): Scraping Dynamic Web Pages

    #coding=utf-8
    import re
    import json
    import requests
    from prettytable import PrettyTable

    def getHtml(url):
        data = {
            'page':1,
            'num':40,
            'sort':'symbol',
            'asc':1,
    ...

Original 2017-10-29 22:50:15 · 2345 views · 0 comments

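Python Crawler in Practice (11): Two Simple Ways to Scrape Dynamic Web Pages

    # 1. Sending a POST request to the page
    #coding=utf-8
    import requests
    from bs4 import Tag
    from bs4 import BeautifulSoup
    from prettytable import PrettyTable

    def getHtml(url,pageNo):
        data = {  # repeated analysis showed that only these two parameters need to be submitted
    ...

Original 2017-11-07 19:21:57 · 2876 views · 0 comments
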
Python Crawler in Practice (6): Scraping Jokes from Qiushibaike

Straight to the code:

    #coding=utf-8
    import requests
    import urllib.request
    from bs4 import BeautifulSoup

    def getHtml(url):
        page = requests.get(url)
        html = page.text
        return html

    def getImg(h...

Original 2017-09-05 12:33:56 · 34778 views · 1 comment

Python Crawler in Practice (1): Scraping the Douban Movie Top 250 Ranking

Code first:

    #coding=utf-8
    import re
    import urllib.request

    def getHtml(url):
        page = urllib.request.urlopen(url)
        html = page.read()
        html = html.decode('utf-8')
        return html

    def getItem(ht...

Original 2017-09-04 12:34:18 · 5725 views · 0 comments

XPath in Practice 1: Parsing and Scraping Jokes from Qiushibaike

    #coding=utf-8
    import requests
    from lxml import etree

    def getHtml(url):
        page = requests.get(url)
        html = page.text
        return html

    def getImg(html):
        texts = []
        html = etree.HTML(...

Original 2017-11-13 19:43:58 · 2431 views · 0 comments

XPath in Practice 2: Downloading Photos from Baidu Tieba

    #coding=utf-8
    import requests
    import urllib.request
    from lxml import etree

    def getHtml(url):
        page = requests.get(url)
        html = page.text
        return html

    def getImg(html):
        html = etr...

Original 2017-11-13 20:14:47 · 1699 views · 0 comments

XPath in Practice 3: Downloading Girl Photos from Jandan

    #coding=utf-8
    import requests
    import urllib.request
    from lxml import etree

    def getHtml(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/...

Original 2017-11-13 20:40:41 · 2221 views · 0 comments

Python Crawler in Practice (10): Scraping All Electronic Resources from the Linux公社 Resource Site

    #coding=utf-8
    import re
    import requests
    from tenacity import retry, stop_after_attempt

    @retry(stop=stop_after_attempt(3))
    def get_html(url):
        '''Fetch the page source.'''
        headers = {'User-Agent': 'Mozilla/5....

Original 2017-11-04 15:16:50 · 3396 views · 0 comments

Python Crawler in Practice (8): Scraping Movie Download Links from 电影天堂

    #coding=utf-8
    import re
    import requests
    import xlsxwriter
    from bs4 import BeautifulSoup

    def getHtml(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100...

Original 2017-10-29 18:15:01 · 11342 views · 4 comments

XPath in Practice 4: Examples from the W3S Site

    #coding=utf-8
    import requests
    from lxml import etree

    def getHtml(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
        page = ...

Original 2017-11-14 10:25:37 · 1839 views · 0 comments

Using selenium to Download Jandan's Encrypted Girl Photos

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import requests
    import urllib.request
    from bs4 import BeautifulSoup
    from selenium import webdriver

    urls = ('http://jandan.net/ooxx/pa...

Original 2017-11-26 16:36:40 · 2766 views · 0 comments

selenium in Practice 1: Playing HD MVs on 音悦台 (YinYueTai)

    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys

    PostUrl = "http://www.yinyuetai.com/"
    driver=webdriv...

Original 2017-11-14 22:36:17 · 1724 views · 0 comments

selenium in Practice 2: Logging In to QQ空间 (Qzone)

    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys

    PostUrl = "https://qzone.qq.com/index.html"
    driver=w...

Original 2017-11-14 22:38:29 · 2707 views · 0 comments

Scraping All Photos from a Baidu Tieba Thread

    #coding=utf-8
    import random
    import requests
    import urllib.request as urllib
    from lxml import etree
    from bs4 import BeautifulSoup

    user_agent_list = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/53...

Original 2017-11-16 14:07:01 · 1945 views · 0 comments

Scraping Photos from All Featured Threads on Baidu Tieba

    #coding=utf-8
    import os
    import random
    import requests
    from lxml import etree
    from urllib.parse import urlparse
    import urllib.request as urllib
    from bs4 import BeautifulSoup

    user_agent_list = ["Mozil...

Original 2017-11-16 19:12:58 · 1931 views · 0 comments

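Implementing 12306 Captcha Verification and Login in Python

Original article: http://blog.csdn.net/sinat_36772813/article/details/76804799

1. Fetching the captcha
Analysis: here we can see the URL used to fetch the captcha. It is not clear what the last parameter means, so we simply remove it, and find that the browser can still request the captcha without it. Captcha URL: https://kyfw.12306.cn/passport/captcha/capt...

Repost 2017-12-01 16:43:42 · 5511 views · 3 comments
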
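Removing a trailing query parameter to see whether the server still answers, as described in this excerpt, can be done with `urllib.parse` instead of editing the string by hand. This is a sketch under assumptions: `drop_last_param` is a hypothetical helper, and the example URL and its parameter names are placeholders, since the real captcha URL is cut off in the excerpt.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_last_param(url: str) -> str:
    """Return the URL with its last query parameter removed."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Re-encode every parameter except the last one.
    return urlunsplit(parts._replace(query=urlencode(params[:-1])))


# Placeholder URL: the host and parameter names are invented for illustration.
full_url = "https://example.com/captcha?kind=login&t=0.123456"
trimmed = drop_last_param(full_url)
```

Requesting both the full and the trimmed URL and comparing the responses is then enough to tell whether the parameter matters.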
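Two Simple Ways to Scrape Data Loaded Dynamically by JS

Reference: http://www.cnblogs.com/buzhizhitong/p/5697683.html
Site data to scrape: http://gkcx.eol.cn/soudaxue/queryProvince.html?page=1
There are 165 pages in total; change page=1 to another number to reach each of them.
Get all the URLs: urls = ('http://gkcx.eol.cn/s...

Original 2017-12-02 11:51:18 · 89618 views · 7 comments
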
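Because the pages differ only in the page number, the full list of URLs can be generated up front. A minimal sketch, assuming the 165-page count stated in the excerpt still holds (`page_urls` is a hypothetical helper name):

```python
BASE = "http://gkcx.eol.cn/soudaxue/queryProvince.html?page={}"


def page_urls(total_pages: int = 165) -> list[str]:
    """Build the URL of every result page, numbered from 1 to total_pages."""
    return [BASE.format(n) for n in range(1, total_pages + 1)]


urls = page_urls()
```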
Python3 CSS Selectors in Practice (2): Scraping 猫眼电影 (Maoyan Movies)

    #coding=utf-8
    import re
    import time
    import requests
    from requests.exceptions import RequestException
    from bs4 import BeautifulSoup
    from prettytable import PrettyTable

    def getHtml(url):
        try:
            ...

Original 2018-10-01 16:07:44 · 2099 views · 0 comments

Python3: Scraping the Douban Movie Top 250 and Showing It in a Colorized, Nicely Formatted Table

    #coding=utf-8
    import requests
    import re
    from bs4 import BeautifulSoup
    from prettytable import PrettyTable
    from colorama import Fore,Style

    def getHtml(url):
        headers = {'User-Agent': 'Mozill...

Original 2018-09-25 21:28:06 · 2010 views · 4 comments

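Python3: Scraping Pages Loaded via Ajax

URL: 今日头条 (Toutiao); search for "街拍" (street snaps) and open https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D
Browser: Firefox
Analysis: open the page, right-click on a blank area, and choose "Inspect Element". In the panel that opens at the bottom, select Network, then select XHR in the box on the right. Scroll the page down until requests with data show up in the panel.

    #coding=utf-8
    import r...

Original 2018-10-02 08:53:34 · 2389 views · 0 comments
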
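Once an XHR request has been spotted in the Network panel, its response is typically JSON, so the scraping step becomes decoding that JSON rather than parsing HTML. The sketch below shows only that decoding step; the payload shape (a top-level "data" list whose items may carry a "title") is an assumption made up for illustration, not the actual response format of the site.

```python
import json


def extract_titles(body: str) -> list[str]:
    """Decode a JSON response body and collect the titles of its items."""
    payload = json.loads(body)
    # Items without a title (ads, separators and the like) are skipped.
    return [item["title"] for item in payload.get("data", []) if "title" in item]


# Placeholder response body standing in for what the XHR request returns.
sample = '{"data": [{"title": "first"}, {"id": 2}, {"title": "second"}]}'
titles = extract_titles(sample)
```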
Python3: Scraping Information from Pages Loaded via Ajax

URL: http://www.kfc.com.cn/kfccda/storelist/index.aspx

    #coding=utf-8
    import re
    import time
    import requests
    from requests.exceptions import RequestException

    def getHtml(url,page):
        try:
            he...

Original 2018-10-02 09:33:04 · 2529 views · 0 comments

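Python3 Scrapy Framework Study 1: Scraping the 猫眼 (Maoyan) Top 100 Chart

The following steps are based on Windows. Open a CMD command prompt and enter the following command: Open the items.py file in the project and define the following fields for storing the data.

    class MaoyanItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        movie = scrap...

Original 2018-10-03 10:26:11 · 3287 views · 1 comment
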
Python3: Scraping the Douban Books Top 250 and Saving It to Excel

    #coding=utf-8
    import re
    import xlwt
    import requests
    from bs4 import BeautifulSoup

    def getHtml(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/201001...

Original 2018-09-26 21:34:21 · 3983 views · 0 comments

Python3 CSS Selectors in Practice (1)

First install cssselect:

    pip install cssselect

Then install lxml:

    pip install lxml

    #coding=utf-8
    import requests
    from lxml import etree

    def getHtml(url):
        page = requests.get(url)
        html = page.text
        ...

Original 2018-09-26 22:03:50 · 3188 views · 0 comments

Python3 CssSelector Locating Methods Explained with Examples

Example:

    html = """
    <div id='content'>
        <ul class='list'>
            <li class='one'>哈哈</li>
            <li class='two'>Two</li>
            ...

Original 2018-09-26 23:07:17 · 8430 views · 4 comments

Python3 Scrapy Framework Study 2: Scraping the Douban Movie Top 250

Open the items.py file in the project and define the following fields:

    import scrapy
    from scrapy import Item,Field

    class DoubanItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        movie = Field()
        ...

Original 2018-10-04 08:15:43 · 2263 views · 0 comments

Python3: 黑板客 (Heibanke) Crawler Challenge, Level 1

    #coding=utf-8
    import re
    import requests
    from requests.exceptions import RequestException
    from bs4 import BeautifulSoup

    def getHtml(url):
        try:
            headers = {'User-Agent': 'Mozilla/5.0 (Windo...

Original 2018-10-11 21:26:16 · 1915 views · 0 comments

Python3: 黑板客 (Heibanke) Crawler Challenge, Level 2

    #coding=utf-8
    import requests
    from requests.exceptions import RequestException
    from bs4 import BeautifulSoup

    def getHtml(url,i):
        data = {"username":"admin", "password":i,}
        try:
            ...

Original 2018-10-11 21:36:41 · 1730 views · 0 comments

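The excerpt posts a form with a fixed username and a numeric password `i`, which suggests trying candidate numbers one by one until the practice page accepts one. A sketch of that loop with the network call factored out, so the logic can be checked offline. Everything beyond the two form fields is an assumption: `find_password`, `fake_attempt`, and the candidate range are hypothetical, and a real `attempt` function would wrap `requests.post` and inspect the response.

```python
from typing import Callable, Iterable, Optional


def find_password(
    attempt: Callable[[dict], bool], candidates: Iterable[int]
) -> Optional[int]:
    """Try each candidate as the password and return the first one accepted."""
    for i in candidates:
        data = {"username": "admin", "password": i}  # form fields from the excerpt
        if attempt(data):
            return i
    return None


def fake_attempt(data: dict) -> bool:
    """Stand-in for the real POST request: pretends the password is 7."""
    return data["password"] == 7


found = find_password(fake_attempt, range(31))
```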
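Python3: 黑板客 (Heibanke) Crawler Challenge, Level 3

The page shown after clearing level 2 of the 黑板客 crawler challenge: http://www.heibanke.com/accounts/login/?next=/lesson/crawler_ex02/
Registration is required. After registering and logging in, you land on this page: http://www.heibanke.com/lesson/crawler_ex02/

    #coding=utf-8
    import requests

    if __name__==...

Original 2018-10-12 00:05:55 · 1783 views · 0 comments
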
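Python3 Scrapy Framework Study 4: Storing Scraped Data in MongoDB

1. Create a new scrapy project:
2. Open the project in PyCharm
3. Add the following code to the settings.py file:

    # Imitate a browser to get past anti-scraping checks
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.3...

Original 2018-10-07 08:03:31 · 3269 views · 0 comments
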
Python3: Working with a MongoDB Database

Using the data from the previous post as the example.

    In [1]: import pymongo  # import the pymongo module
    In [2]: client = pymongo.MongoClient(host = 'localhost',port = 27017)  # connect
    In [3]: db = client.maoyan  # select the database
    In [4]: collection = db.MaoyanI...

Original 2018-10-07 09:19:03 · 5895 views · 3 comments
