Python crawlers
castingA3T
Scraping Taobao reviews
Fetches Tmall review data with requests, spoofing a browser User-Agent header; the endpoint is https://rate.tmall.com/l… (truncated in the preview). (Original post · 2017-12-21 20:11:04 · 1162 views · 0 comments)
Scraping the top 100 Chinese universities with Python
requests plus BeautifulSoup, with a browser User-Agent header. (Original post · 2018-01-05 23:04:22 · 892 views · 0 comments)
Scraping JD.com product reviews with selenium + Firefox
Drives Firefox through selenium's webdriver, pointing FirefoxBinary at a local install (path truncated in the preview: r'D:\Progr…') and enabling the marionette capability. (Original post · 2018-01-18 18:28:06 · 1274 views · 0 comments)
Scraping JD.com product reviews (packet-capture approach)
Uses the browser's network panel to find the real comment endpoint, https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv13&productId=5725260&score=0&sortType=5&page=0&pageSize… (truncated), then fetches it with requests and parses the JSON. (Original post · 2018-01-19 14:30:46 · 4504 views · 0 comments)
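The captured productPageComments URL returns JSONP — the JSON body wrapped in a fetchJSON_comment98vv13(...) callback — so it has to be unwrapped before json.loads. A minimal sketch; the callback name comes from the URL above, but the sample payload here is illustrative, not a real API response:

```python
import json
import re

def strip_jsonp(text):
    """Remove a JSONP callback wrapper, e.g. 'cb({...});' -> '{...}'."""
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return match.group(1)

# Illustrative payload in the shape JD's comment API returns.
raw = 'fetchJSON_comment98vv13({"comments": [{"content": "不错"}], "maxPage": 10});'
data = json.loads(strip_jsonp(raw))
print(data["maxPage"])                    # -> 10
print(data["comments"][0]["content"])     # -> 不错
```

In the real crawler, `raw` would be `requests.get(url, headers=headers).text`.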
Timing comparison of three Python scraping methods on Qiushibaike
Times three approaches to scraping Qiushibaike, using requests and the time module. (Original post · 2018-01-19 23:23:31 · 581 views · 0 comments)
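The preview doesn't show which three methods the post compares (re, BeautifulSoup and lxml are the usual trio), but the timing harness itself is simple. A stdlib-only sketch with two stand-in parsers and an illustrative document:

```python
import re
import time
from html.parser import HTMLParser

# Illustrative document: 1000 repeated joke blocks.
html = '<div class="content"><span>段子一</span></div>' * 1000

def parse_with_re(doc):
    """Extract span contents with a regular expression."""
    return re.findall(r'<span>(.*?)</span>', doc)

class SpanParser(HTMLParser):
    """Collect text inside <span> tags using the stdlib HTML parser."""
    def __init__(self):
        super().__init__()
        self.in_span = False
        self.texts = []
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span = True
    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False
    def handle_data(self, data):
        if self.in_span:
            self.texts.append(data)

def parse_with_htmlparser(doc):
    p = SpanParser()
    p.feed(doc)
    return p.texts

# Time each method on the same document.
for name, fn in [('re', parse_with_re), ('html.parser', parse_with_htmlparser)]:
    start = time.time()
    result = fn(html)
    print(f'{name}: {len(result)} items in {time.time() - start:.4f}s')
```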
Fetching spot-mall listings from the agricultural-products group-buying site (农产品集购网)
requests and BeautifulSoup, with multiprocessing for parallel fetching. (Original post · 2018-02-09 18:34:42 · 621 views · 0 comments)
Scraping Anjuke with BeautifulSoup
requests plus BeautifulSoup, with a browser User-Agent header. (Original post · 2018-01-20 21:52:15 · 6353 views · 3 comments)
Scraping Anjuke with multiprocessing
The same requests + BeautifulSoup crawl, parallelised with multiprocessing.Pool. (Original post · 2018-01-20 22:17:24 · 562 views · 0 comments)
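A minimal sketch of the Pool pattern the preview shows. The URL template and page count are placeholders, and the worker only builds the URL instead of fetching, so the parallel machinery can be demonstrated without hitting the site:

```python
from multiprocessing import Pool

# Illustrative URL template, not a confirmed Anjuke path.
BASE = 'https://shanghai.anjuke.com/sale/p{}/'

def make_url(page):
    return BASE.format(page)

def crawl(page):
    url = make_url(page)
    # In the real crawler:
    #   resp = requests.get(url, headers=headers)
    #   ...parse resp.text with BeautifulSoup...
    return url

if __name__ == '__main__':
    with Pool(4) as pool:                  # 4 worker processes
        urls = pool.map(crawl, range(1, 11))
    print(len(urls))                       # -> 10
```

pool.map distributes the page numbers across the workers and returns results in order, which is why page-numbered listing sites fit this pattern so well.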
Serial crawler
Reads a list of links from alexa.txt with readlines, splits each line to recover the link, and fetches them one at a time, timing the run. (Original post · 2018-02-06 13:19:32 · 437 views · 0 comments)
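The preview splits each line of alexa.txt to recover the link. Alexa rank files are commonly one 'rank<TAB>domain' per line, though that format is an assumption here — a sketch of just the parsing step:

```python
def parse_link_line(line):
    """Turn one 'rank\tdomain' line into a fetchable URL (assumed file format)."""
    domain = line.strip().split('\t')[-1]
    return 'http://' + domain

# Illustrative file contents.
lines = ['1\tgoogle.com\n', '2\tyoutube.com\n']
link_list = [parse_link_line(l) for l in lines]
print(link_list)   # -> ['http://google.com', 'http://youtube.com']
```

The serial crawler then loops over link_list, calling requests.get on each URL in turn.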
Scraping stock information
requests, re and BeautifulSoup with a browser User-Agent header. (Original post · 2018-04-08 13:05:15 · 460 views · 0 comments)
Scraping the Douban Top 250 movies
Send the request with requests.get (use headers to mimic a browser); check that res.status_code returns 200 and that res.encoding matches the page's charset, then inspect res.text. Parse with BeautifulSoup: since everything needed is on one page, first find the large tag that covers the information, then extract the smaller tags inside it. String cleanup mainly uses strip, split and join. (Original post · 2018-07-26 20:25:35 · 759 views · 0 comments)
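The big-tag-then-small-tags workflow above can be sketched offline. The class names below follow Douban's Top 250 markup, but the HTML literal is an illustrative fragment, not a fetched page:

```python
from bs4 import BeautifulSoup

# Illustrative fragment in the shape of a Douban Top 250 list item.
html = '''
<ol class="grid_view">
  <li><div class="item">
    <span class="title"> 肖申克的救赎 </span>
    <span class="rating_num">9.7</span>
  </div></li>
</ol>
'''
soup = BeautifulSoup(html, 'html.parser')

# First grab the big tag that covers each movie's information...
for item in soup.find_all('div', class_='item'):
    # ...then pull the small tags out of it, cleaning with strip.
    title = item.find('span', class_='title').get_text().strip()
    rating = item.find('span', class_='rating_num').get_text().strip()
    print(title, rating)
```

On the real page you would first confirm `res.status_code == 200` and set `res.encoding` to match the charset before handing `res.text` to BeautifulSoup.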
Detailed scraping of 51job (前程无忧) job postings
requests plus BeautifulSoup. (Original post · 2018-08-01 23:20:05 · 2686 views · 2 comments)
Scraping JD.com laptop titles, brands, prices and review counts
requests, re, json, time and BeautifulSoup. (Original post · 2018-01-14 23:46:37 · 1606 views · 0 comments)
Scraping 40 pages of second-hand housing listings from Lianjia
requests, BeautifulSoup, re and time. (Original post · 2017-12-28 22:38:57 · 700 views · 0 comments)
Scraping the Douban Top 250 movies with XPath
requests plus lxml.etree. (Original post · 2018-01-01 12:18:40 · 2658 views · 0 comments)
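The XPath version does the same extraction through lxml.etree instead of BeautifulSoup — a sketch against an illustrative fragment (the class names mirror Douban's markup):

```python
from lxml import etree

# Illustrative fragment, not a fetched page.
html = '''
<div class="item">
  <span class="title">霸王别姬</span>
  <span class="rating_num">9.6</span>
</div>
'''
tree = etree.HTML(html)

# XPath expressions select the text nodes directly.
titles = tree.xpath('//div[@class="item"]/span[@class="title"]/text()')
ratings = tree.xpath('//div[@class="item"]/span[@class="rating_num"]/text()')
print(titles[0], ratings[0])   # -> 霸王别姬 9.6
```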
Scraping Xiaozhu short-term rental listings with XPath
requests plus lxml.etree. (Original post · 2018-01-01 18:21:34 · 1322 views · 0 comments)
Fetching Sina news
Fetches http://news.sina.com.cn/china/ with requests, sets res.encoding to utf-8, parses with BeautifulSoup's html.parser, and iterates over soup.select('.news-item'). (Original post · 2017-12-10 11:39:48 · 470 views · 0 comments)
Python crawler: fetching the first two pages of Sina news
requests, json, BeautifulSoup, re and datetime; also hits the comment API at http://comment5.news.sina.com.cn/page/info?version=1&format=j… (truncated in the preview). (Original post · 2017-12-10 20:33:51 · 703 views · 0 comments)
Scraping a blog post body with Python
requests plus BeautifulSoup against a santostang.com post whose URL slug is percent-encoded UTF-8 (truncated in the preview). (Original post · 2017-12-10 21:43:16 · 570 views · 0 comments)
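Percent-encoded slugs like the one in that post URL decode back to readable Chinese with urllib.parse.unquote — for example, on a short excerpt of the encoding:

```python
from urllib.parse import unquote

# A short percent-encoded UTF-8 excerpt like the one in the post URL.
encoded = '%E5%9B%BD%E5%86%85%E4%B8%8B%E8%BD%BDanaconda%E9%80%9F%E5%BA%A6%E6%85%A2'
print(unquote(encoded))   # -> 国内下载anaconda速度慢
```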
Scraping the Kugou Top 500
requests plus BeautifulSoup. (Note: the headers dict in this and several later posts uses the key 'UserAgent'; the real header name is 'User-Agent', with a hyphen, so the spoofed agent is never actually sent.) (Original post · 2017-12-13 12:19:24 · 1773 views · 0 comments)
Scraping Qidian's original-fiction rankings
requests plus BeautifulSoup. (Original post · 2017-12-13 18:24:10 · 3963 views · 0 comments)
Lianjia crawler
requests plus BeautifulSoup, with a get_detai… helper (truncated in the preview). (Original post · 2017-12-13 22:41:40 · 1241 views · 0 comments)
Scraping the complete novel Battle Through the Heavens (斗破苍穹)
requests plus BeautifulSoup. (Original post · 2017-12-15 12:30:30 · 1564 views · 0 comments)
Scraping and analysing Python job salaries on 51job
Collects the location, salary, company and position of Python-related jobs on 51job (前程无忧) with requests and BeautifulSoup. (Original post · 2017-12-17 21:33:50 · 6656 views · 1 comment)
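For the analysis half, 51job salaries arrive as strings such as '1-1.5万/月' that must be converted to numbers before any statistics. The range-万/千 format below is the common one on the site, but treat this parser as an assumption:

```python
import re

def parse_salary(text):
    """'1-1.5万/月' -> average monthly salary in yuan (assumed 51job format)."""
    m = re.match(r'([\d.]+)-([\d.]+)(万|千)/月', text)
    if m is None:
        return None          # e.g. '面议' (negotiable) or other formats
    low, high, unit = float(m.group(1)), float(m.group(2)), m.group(3)
    scale = 10000 if unit == '万' else 1000
    return (low + high) / 2 * scale

print(parse_salary('1-1.5万/月'))   # -> 12500.0
print(parse_salary('6-8千/月'))     # -> 7000.0
print(parse_salary('面议'))         # -> None
```

Returning None for unparseable entries lets the analysis step filter them out instead of crashing mid-crawl.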
Scraping listings from Zhuge Zhaofang
requests, BeautifulSoup, datetime and re, with pymongo imported — presumably to store results in MongoDB; the preview also strips non-digit characters from a string with re.sub("\D", "", lg). (Original post · 2017-12-20 22:44:16 · 1268 views · 0 comments)
Scraping Anjuke (BeautifulSoup: find the big container first, then the boxes inside it)
requests, BeautifulSoup, time, re and multiprocessing.Pool. (Original post · 2018-01-21 13:27:48 · 977 views · 0 comments)
Recognising Maoyan's movie-rating font from screenshots
Drives Chrome through selenium, takes a screenshot of the film page (https://maoyan.com/films/1206875?_v_=yes), loads it with PIL via BytesIO, and OCRs the rating with pytesseract — a workaround for Maoyan rendering scores in an obfuscated custom font. (Original post · 2019-01-20 11:41:32 · 948 views · 0 comments)
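The screenshot route needs the rating element's pixel box so it can be cropped out of the full screenshot before OCR. Selenium reports location and size in CSS pixels, which must be scaled when the screenshot is taken at a device pixel ratio above 1. A sketch of the crop-box arithmetic; the selenium/pytesseract calls are left as comments since they need a live browser, and the element selector is hypothetical:

```python
def crop_box(location, size, dpr=1):
    """(left, upper, right, lower) box for PIL.Image.crop, scaled by device pixel ratio."""
    left = int(location['x'] * dpr)
    upper = int(location['y'] * dpr)
    right = int((location['x'] + size['width']) * dpr)
    lower = int((location['y'] + size['height']) * dpr)
    return (left, upper, right, lower)

# In the real script (hypothetical selector, needs a live browser):
#   browser = webdriver.Chrome('./chromedriver')
#   browser.get('https://maoyan.com/films/1206875?_v_=yes')
#   elem = browser.find_element_by_class_name('index-left')
#   img = Image.open(BytesIO(browser.get_screenshot_as_png()))
#   score = pytesseract.image_to_string(img.crop(crop_box(elem.location, elem.size)))

print(crop_box({'x': 100, 'y': 50}, {'width': 60, 'height': 20}))   # -> (100, 50, 160, 70)
```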