python
Python learning
Xiaoweidumpb
Pyppeteer in practice (original post 2020-11-18)

import logging
from pyppeteer import launch
from pyppeteer.errors import TimeoutError
import asyncio
import json
from os import makedirs
from os.path import exists

RESULTS_DIR = 'results'
exists(RESULTS_DIR) or makedirs(RESULTS_DIR)

async def save_data(dat…
bs4: extracting the text of a p tag and an href value (original post 2020-10-31)

Case 1: extracting the text inside a tag

time = soup.find('p', attrs={'real-wea-info'}).string
# 实时天气:22:45发布 (real-time weather, published at 22:45)

<td data-title="IP">171.35.170.230</td>
IP = item.find('td', attrs={'data-title': "IP"}).string
# 171.35.170.230

Case 2: extracting the value of href

<a href="/guilin1d/57957.htm" o…
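The preview is cut off before the second case is finished. A minimal sketch of pulling an href out of an a tag with BeautifulSoup (the HTML snippet below is made up for illustration):

from bs4 import BeautifulSoup

html = '<p><a href="/guilin1d/57957.htm" target="_blank">Guilin weather</a></p>'
soup = BeautifulSoup(html, 'lxml')

link = soup.find('a')      # locate the a tag
print(link['href'])        # attribute access -> /guilin1d/57957.htm
print(link.get('href'))    # same thing, but returns None if the attribute is missing
print(link.string)         # the link text -> Guilin weather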
Generating a map with pyecharts (original post 2020-08-21)

import tkinter as tk
import jieba
import re
import pandas as pd
import PIL.Image as image
import PIL
import numpy as np
from wordcloud import WordCloud
from GUI.Viewdata import Viewdata
from pyecharts import options as opts
from pyecharts.charts import Map
…
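Only the imports survive in the preview. A minimal pyecharts map, assuming a 1.x-style API (the province/value pairs are made up):

from pyecharts import options as opts
from pyecharts.charts import Map

data = [('广东', 120), ('广西', 80), ('湖南', 60)]   # illustrative (province, value) pairs

m = (
    Map()
    .add('Sample series', data, 'china')                 # plot the pairs on the China map
    .set_global_opts(
        title_opts=opts.TitleOpts(title='Demo map'),
        visualmap_opts=opts.VisualMapOpts(max_=150),     # colour scale for the values
    )
)
m.render('map.html')   # writes an interactive HTML file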
Scraping girl photos with bs4 (original post 2020-10-31)

Practice only; please do not mass-download the images. You can set a timeout and a proxy for requests yourself.

import time
import requests
from bs4 import BeautifulSoup

headers = {
    "referer": "https://www.vmgirls.com/13344.html",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like G…
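The preview stops inside the headers. A small sketch of the timeout idea the note mentions, not the author's code (the image URL is a placeholder):

import requests

headers = {'user-agent': 'Mozilla/5.0', 'referer': 'https://www.vmgirls.com/'}
img_url = 'https://example.com/sample.jpg'   # placeholder URL

resp = requests.get(img_url, headers=headers, timeout=10)   # timeout is an optional keyword of requests.get
with open('sample.jpg', 'wb') as f:
    f.write(resp.content)                                    # write the raw image bytes to disk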
Crawling word, txt, csv, pdf, and docx documents (original post 2020-10-31)

Read the csv, pdf, and word documents hosted at the following URLs and parse the relevant data:

http://www.pythonscraping.com/files/MontyPythonAlbums.csv
http://www.pythonscraping.com/pages/warandpeace/chapter1.pdf
http://www.pythonscraping.com/pages/AWordDocument.docx

txt:
from urllib.request im…
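The code is cut off after the first import. One common way to read the remote CSV above with only the standard library (not necessarily the exact code from the post):

import csv
from io import StringIO
from urllib.request import urlopen

url = 'http://www.pythonscraping.com/files/MontyPythonAlbums.csv'
data = urlopen(url).read().decode('ascii', 'ignore')   # fetch the file and decode the bytes
reader = csv.reader(StringIO(data))                    # wrap the text so csv can iterate over lines
for row in reader:
    print(row)                                         # each row is a list of column values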
PyCharm: error installing third-party packages from a mirror (original post 2020-09-05)

Looking in indexes: http://mirrors.aliyun.com/pypi/simple/
Collecting panda
The repository located at mirrors.aliyun.com is not a trusted or secure host and is being ignored. If this repository is available via HTTPS we recommend you use HTTPS instead, oth…
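The log is cut off mid-sentence. The usual fixes are either to use the HTTPS form of the mirror or to mark the HTTP host as trusted; both use standard pip options (the package name is assumed to be pandas, since "panda" in the log looks like a typo):

pip install pandas -i https://mirrors.aliyun.com/pypi/simple/
pip install pandas -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com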
Scraping Weibo comments by emulating the mobile site (original post 2020-11-02)

import requests
from pyquery import PyQuery as pq

def get_html():
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=2970452952&containerid=1076032970452952&since_id=4462856005495672'
    response = requests.get(url=url)
    res = re…
Scraping news, videos, and users from Toutiao (original post 2020-12-10)

import requests
import time
import random
import pymongo

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
base_url = 'https://www.toutiao.com/api/search/co…
Common Scrapy configuration and commands (original post 2020-11-11)

pip install scrapy

1. Create a Scrapy project:
   scrapy startproject <project_name>
   scrapy startproject spider_zhihu
2. Generate a spider:
   scrapy genspider <spider_name> <allowed_domain>
   The allowed scope is the domain the spider may crawl.
3. Extract the data: flesh out the spider, using XPath and similar methods.
4. Save the data: store it in a pipeline.
5. Start the spider:
   scrapy crawl <spider_name>

Logging can be configured through the following settings in settings.py:
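(The preview cuts off at this point; the lines below show the standard Scrapy logging settings, with example values.)

# settings.py
LOG_ENABLED = True        # turn logging on or off
LOG_LEVEL = 'WARNING'     # DEBUG / INFO / WARNING / ERROR / CRITICAL
LOG_FILE = './log.log'    # send log output to a file instead of the console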
Scraping a blogger's Weibo posts by emulating the mobile site (original post 2020-11-03)

'https://m.weibo.cn/api/container/getIndex?type=uid&value=2970452952&containerid=1076032970452952'
'https://m.weibo.cn/api/container/getIndex?type=uid&value=2970452952&containerid=1076032970452952&since_id=4519809071656587'

import reque…
Scraping the Maoyan movie chart with XPath and lxml (original post 2020-10-01)

import requests
from lxml import etree

def get_source(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
        'Referer': 'https://maoyan.com/board/4?offset=0'
    }
    …
Using the basic urllib library (original post 2020-10-01)

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))   # fetch the Python homepage
# Result: the HTML of the page

print(type(response))                    # inspect the return type
# Result: <class 'http.client.HTTPResponse'>

'''HTTPRes…
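The preview breaks off inside a docstring about HTTPResponse. Besides read(), the response object exposes the status code and headers; a short sketch of those calls:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)                # status code, e.g. 200
print(response.getheaders())          # list of (name, value) header tuples
print(response.getheader('Server'))   # a single header, e.g. nginx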
Scraping weather data with bs4 and writing it to MongoDB (original post 2020-10-31)

import time
import requests
from bs4 import BeautifulSoup
import pymongo

'''You must send the Origin header, otherwise the page cannot be requested!'''
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
    "Host"…
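The MongoDB part of the post is not visible in the preview. A minimal pymongo write looks like this (the database, collection, and document fields are made up for illustration):

import pymongo

client = pymongo.MongoClient('localhost', 27017)   # connect to a local MongoDB instance
db = client['weather']                             # hypothetical database name
collection = db['forecast']                        # hypothetical collection name

collection.insert_one({'city': 'Guilin', 'date': '2020-10-31', 'high': 25, 'low': 18})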
Scraping blog posts and saving them locally (failed) (original post 2020-10-31)

Automating this with Selenium would be a better approach…

import requests
from bs4 import BeautifulSoup
import re

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}
article_info = {}

def get_ht…
Learning Pyppeteer (original post 2020-11-18)

Getting started:

import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://dynamic5.scrape.cuiqingcai.com/')
    await page.waitForSele…
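The snippet stops mid-call. A complete minimal version of this kind of script might look like the sketch below; the '.item .name' selector is an assumption about the demo page, not taken from the post:

import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://dynamic5.scrape.cuiqingcai.com/')
    await page.waitForSelector('.item .name')   # wait until the JS-rendered list is present (assumed selector)
    doc = pq(await page.content())              # hand the rendered HTML to pyquery
    print([item.text() for item in doc('.item .name').items()])
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())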
Scraping and downloading Bilibili videos (original post 2020-11-09)

https://www.ku6.com/video/detail?id=udfY7DjsSXbg8ghbDnhUwNTinOY

import time
import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safa…
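The download step is missing from the preview. A common pattern for saving a large video file is to stream it in chunks with requests (the video URL below is a placeholder; the real one has to be extracted from the page):

import requests

headers = {'user-agent': 'Mozilla/5.0'}
video_url = 'https://example.com/video.mp4'   # placeholder URL

resp = requests.get(video_url, headers=headers, stream=True)
with open('video.mp4', 'wb') as f:
    for chunk in resp.iter_content(chunk_size=1024 * 1024):   # write roughly 1 MB at a time
        if chunk:
            f.write(chunk)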
A crawler proxy pool (original post 2020-10-27)

import requests
import time
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36 Edg/81.0.416.64'
}
ip_dict = {}

# return the parsed bs4 object for the requested page
def get…
The robots protocol (original post 2020-10-01)

User-agent: baiduspider   # Baidu's crawler
Disallow: /               # no page may be crawled
Allow: /public/           # the public section may be crawled

Allow is generally used together with Disallow to carve out exceptions to a restriction. With /public/ set as above, no page may be crawled except the public directory.

1. User-agent: *
   Disallow: /
   No directory may be crawled.
2. User-agent: *
   Disallow:
   Every directory may be crawled.
Alternatively, the robots.txt file can simply be left empty…
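These rules can be checked from Python with the standard library's robotparser; a small sketch against a made-up site:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')   # placeholder site
rp.read()                                          # fetch and parse the rules

print(rp.can_fetch('baiduspider', 'https://www.example.com/public/index.html'))   # True if allowed
print(rp.can_fetch('baiduspider', 'https://www.example.com/admin/'))              # False if disallowed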
Scraping a novel with regular expressions (original post 2020-08-21)

import requests
import re
import os
from multiprocessing.dummy import Pool

start_url = 'https://www.kanunu8.com/files/dushi/201106/3099.html'

def get_source(url):
    html = requests.get(url)
    return html.content.decode('gbk')

def get_toc…
Taobao + Selenium (original post 2020-08-21)

categories: [crawler notes, code]

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.sup…
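Only the imports survive in the preview. The explicit-wait pattern these imports point to looks roughly like this; the '#q' selector for Taobao's search box is an assumption, not taken from the post:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)   # wait up to 10 seconds for each condition

driver.get('https://www.taobao.com')
box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#q')))   # block until the search box exists
box.send_keys('python')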
Python: checking whether a string is full-width (original post 2022-09-20)

[Code] Python: check whether a string is full-width.
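The listing shows no code for this post. One common way to make the check is the standard library's unicodedata.east_asian_width; this is a sketch, not necessarily the author's approach:

import unicodedata

def is_fullwidth(s):
    """True if every character in s is a full-width ('F') or wide ('W') character."""
    return all(unicodedata.east_asian_width(c) in ('F', 'W') for c in s)

print(is_fullwidth('ABC'))   # True: full-width Latin letters
print(is_fullwidth('ABC'))     # False: ordinary half-width ASCII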
Denoising captcha images with PIL (nine-grid method) (original post 2020-12-22)

Thanks to teacher Haitao for giving me a new perspective; some of the variable names in this code are not great…

Captcha images: a colour image is represented by three 2D matrices.

from PIL import Image
import numpy as np

imag = Image.open(path)
data = np.array(imag)
print(data.shape)

The code above shows the shape of the colour image matrix: (25, 80, 3)

Grayscale conversion:
im = Image.open(path).conver…
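The post is cut off at the grayscale step. A sketch of the nine-grid (3x3 neighbourhood) denoising idea the title refers to; the binarization cutoff of 128 and the neighbour threshold are illustrative choices, not values from the post:

from PIL import Image
import numpy as np

def denoise(path, threshold=4):
    """Turn a black pixel white if fewer than `threshold` of its 8 neighbours are black."""
    im = Image.open(path).convert('L')        # grayscale
    data = np.array(im)
    binary = (data < 128).astype(int)         # 1 = black (foreground), 0 = white (background)
    h, w = binary.shape
    cleaned = binary.copy()
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if binary[y, x] == 1:
                neighbours = binary[y - 1:y + 2, x - 1:x + 2].sum() - 1   # the other 8 cells of the 3x3 grid
                if neighbours < threshold:
                    cleaned[y, x] = 0         # isolated point -> background
    return Image.fromarray((255 * (1 - cleaned)).astype('uint8'))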
Running a script in the background on Linux (original post 2020-11-06)

Running the command normally means the script is killed when the session ends:
python3 weibo.py

Run the script in the background on Linux:
nohup python /path/to/python/file.py &
After running the command above, press any key to return to the shell. Do not simply close your terminal; type exit to leave the SSH session so that the script keeps running in the background.

List the Python programs currently running on Linux:
ps -ef | grep python