Model card for CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
Table of Contents
Model Details
Model Description
A series of CLIP ConvNeXt-XXLarge (a custom timm ConvNeXt size) models trained on LAION-2B (English), a subset of LAION-5B, using OpenCLIP.
Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
---|---|---|---|---|
convnext_xxlarge.laion2b_s34b_b82k-augreg | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1) | 79.1 |
convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
convnext_xxlarge.laion2b_s34b_b82k-augreg-soup | LAION-2B | 256x256 | N/A | 79.4 |
RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only |
The core training run was performed in pieces over a period of ~2 months. The global batch size for the core run was 81920. The last ~10% of training was re-done at a 95744 global batch size with a higher LR and stronger aug than the original finish. The two were averaged together in a 'soup'. See more details in Training Details.
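Conceptually, the 'soup' is a parameter-wise average of the two finished checkpoints. A minimal PyTorch sketch, assuming checkpoints stored as dicts with a "state_dict" key (the file names below are placeholders, not the released artifact names):

import torch

# Placeholder file names; substitute the actual checkpoint paths.
ckpt_a = torch.load("convnext_xxlarge_augreg_final.pt", map_location="cpu")
ckpt_b = torch.load("convnext_xxlarge_augreg_rewind.pt", map_location="cpu")
sd_a, sd_b = ckpt_a["state_dict"], ckpt_b["state_dict"]

# Average floating-point tensors; copy any non-float buffers unchanged.
soup = {
    k: (v + sd_b[k]) / 2 if v.is_floating_point() else v
    for k, v in sd_a.items()
}
torch.save({"state_dict": soup}, "convnext_xxlarge_augreg_soup.pt")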
Goals:
- Push the size of largest convolutional CLIP image tower into the performance range of ViT-g to ViT-G w/ improved image size scaling for downstream use.
Firsts:
- Largest released ConvNeXt model pretrained (847M params w/ 198 GMAC and 125 MActs @ 256x256 for image)
- A non-ViT image tower CLIP model (with no previous image tower pretrain) achieving > 79% ImageNet top-1 zero-shot
The models utilize:
- the timm ConvNeXt-XXLarge model (convnext_xxlarge) as the image tower
- a standard projection at the end of the image tower
- a text tower with the same width (1024), heads (16), and depth (24) as the ViT-H-14 and ViT-g-14 models
The models are trained at 256x256 image resolution. The combined image + text CLIP model has 1.2B params, 222 GMAC, and 146 MActs. At 256x256, the ConvNeXt-XXLarge sits just above a ViT-H-14 CLIP configuration in FLOPS and params while being lower in activation counts. It is well under both g-14 and G-14 while sitting between them in capability.
model | image_size | embed_dim | gmacs | macts | mparams | image_gmacs | image_macts | image_mparams | text_gmacs | text_macts | text_mparams |
---|---|---|---|---|---|---|---|---|---|---|---|
ViT-H-16 | 224 | 1024 | 150.96 | 122.01 | 986.26 | 127.4 | 100.81 | 632.23 | 23.57 | 21.2 | 354.03 |
ViT-H-14 | 224 | 1024 | 190.97 | 160.61 | 986.11 | 167.4 | 139.41 | 632.08 | 23.57 | 21.2 | 354.03 |
ViT-L-14-336 | 336 | 768 | 197.76 | 278.19 | 427.94 | 191.1 | 270.24 | 304.29 | 6.66 | 7.95 | 123.65 |
convnext_xxlarge | 256 | 1024 | 221.66 | 145.66 | 1200.58 | 198.09 | 124.45 | 846.54 | 23.57 | 21.2 | 354.03 |
RN50x64 | 448 | 1024 | 276.8 | 249.73 | 623.26 | 265.02 | 239.13 | 420.38 | 11.78 | 10.6 | 202.88 |
ViT-g-14 | 224 | 1024 | 290.74 | 213.84 | 1366.68 | 267.18 | 192.64 | 1012.65 | 23.57 | 21.2 | 354.03 |
convnext_xxlarge_320 | 320 | 1024 | 333.08 | 215.66 | 1200.58 | 309.52 | 194.46 | 846.54 | 23.57 | 21.2 | 354.03 |
ViT-H-14-336 | 336 | 1024 | 414.53 | 428.74 | 986.52 | 390.97 | 407.54 | 632.49 | 23.57 | 21.2 | 354.03 |
ViT-bigG-14 | 224 | 1280 | 532.92 | 310.71 | 2539.57 | 483.96 | 275.37 | 1844.91 | 48.96 | 35.34 | 694.66 |
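For reference, the architecture described above corresponds roughly to an OpenCLIP-style model config like the following (shown as a Python dict; embed dim, image size, and text width/heads/depth come from this card, while context_length/vocab_size are the standard CLIP text-tower defaults and are an assumption here, as is the exact field layout):

# A sketch of the model configuration, not necessarily the exact shipped config file.
convnext_xxlarge_clip_cfg = {
    "embed_dim": 1024,
    "vision_cfg": {
        "timm_model_name": "convnext_xxlarge",  # timm image tower
        "timm_model_pretrained": False,         # no prior image-tower pretrain
        "timm_proj": "linear",                  # standard projection at end of image tower
        "timm_drop_path": 0.1,                  # SD (0.1) stochastic depth from the AugReg table
        "image_size": 256,
    },
    "text_cfg": {
        "context_length": 77,                   # assumed standard CLIP text defaults
        "vocab_size": 49408,
        "width": 1024,
        "heads": 16,
        "layers": 24,
    },
}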
Model training was done by Ross Wightman across both the stability.ai cluster and the JUWELS Booster supercomputer. See acknowledgements below.
Uses
As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.
The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (LAION-5B: A NEW ERA OF OPEN LARGE-SCALE MULTI-MODAL DATASETS | LAION) and upcoming paper include additional discussion as it relates specifically to the training dataset.
Direct Use
Zero-shot image classification, image and text retrieval, among others.
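A sketch of zero-shot classification with OpenCLIP, assuming the model is loadable from the Hugging Face Hub under an id matching this card's name (adjust the id and image path as needed):

import torch
from PIL import Image
import open_clip

# Hub id assumed from the model card title; substitute the actual repo id if it differs.
MODEL_ID = "hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup"
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # any local image
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)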
Downstream Use
Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.
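As one example of downstream use, a linear probe can be fit on frozen image features. A minimal sketch using scikit-learn, reusing the model and preprocess from the snippet above; train_loader and test_loader are assumed DataLoaders yielding (image batch, label batch):

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(loader):
    # Encode images with the frozen CLIP image tower and L2-normalize the features.
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = model.encode_image(images)
            f = f / f.norm(dim=-1, keepdim=True)
            feats.append(f.cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

train_x, train_y = extract_features(train_loader)
test_x, test_y = extract_features(test_loader)

probe = LogisticRegression(max_iter=1000)  # the linear probe itself
probe.fit(train_x, train_y)
print("Linear-probe accuracy:", probe.score(test_x, test_y))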
Out-of-Scope Use
As per the OpenAI models,
Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful.
Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.
Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.
Beyond the above notice, the LAION-5B dataset used to train these models has additional considerations; see below.
Training Details
Training Data
This model was trained with LAION-2B -- A 2 billion sample English subset of LAION-5B (LAION-5B: A NEW ERA OF OPEN LARGE-SCALE MULTI-MODAL DATASETS | LAION).
IMPORTANT NOTE: The motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a “safe” subset by filtering out samples based on the safety tags (using a customized trained NSFW classifier that we built). While this strongly reduces the chance for encountering potentially harmful content when viewing, we cannot entirely exclude the possibility for harmful content being still present in safe mode, so that the warning holds also there. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of benefits that come along with training large-scale models as well as pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets that remain restricted to a small community. Providing our dataset openly, we however do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.
Training Procedure
The main training run was done at a global batch size of 81920 for 256 checkpoint intervals of 135.6M samples each, for a total of ~34B samples seen over training.
Many difficulties with both model numerical stability and cluster stability/performance were encountered while training this model. Initial attempts to train with float16 AMP and the default adam beta2 resulted in loss spikes and eventually NaN blow-ups. beta2 was reduced to 0.97, which helped, but the loss / zero-shot curves were not tracking as expected. After switching to PyTorch nightlies, it was possible to use bfloat16 + AMP for training (as with recent H/14, g/14, and G/14 models); beta2 was returned to 0.98 and metrics improved.
Checkpoint Interval | Cluster | # GPUs | # Nodes | GPU | local BS | sample/s | sample/s/gpu | precision | adam beta2 |
---|---|---|---|---|---|---|---|---|---|
1 - 2 | Stability | 1024 | 128 | A100 40GB | 80 | 37-40k | 36-39 | amp + fp16 | 0.97 |
3 - 32 | Stability | 512 | 64 | A100 80GB | 160 | 27-32k | 52-62 | amp + fp16 | 0.97 |
33 - 75 | Booster | 1024 | 256 | A100 40GB | 80 | 48k | 47 | amp + fp16 | 0.97 |
76 - 165 | Booster | 1024 | 256 | A100 40GB | 80 | 51k | 50 | amp + bf16 | 0.98 |
166 - 232 | Stability | 320 | 40 | A100 80GB | 256 | 18-19k | 56-59 | amp + bf16 | 0.98 |
233 - 249 | Booster | 1024 | 256 | A100 40GB | 80 | 51k | 50 | amp + bf16 | 0.98 |
250 - 256 | Stability | 1024 | 128 | A100 40GB | 80 | 27-31k | 26-30 | amp + bf16 | 0.98 |
JUWELS Booster has 4x A100 GPUs per node with 4x HDR-200 IB adapters per node (200 Gbit/sec per GPU). The Stability setup used 8x A100 GPUs per node with 400 Gbit/sec EFA networking per node (50 Gbit/sec per GPU). Significant variation in training efficiency (throughput per GPU) was observed across the various configurations. The 1024-GPU configurations on both clusters were particularly prone to crashing (or very difficult to get running with a 'good' set of GPUs).
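For reference, the bfloat16 + AMP precision and adam beta2 settings described above correspond roughly to the following PyTorch pattern; this is a simplified sketch, not the open_clip training loop invoked by the srun command below, and model, loader, and clip_loss are placeholders:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.98))  # beta2 = 0.98

for images, texts in loader:
    optimizer.zero_grad()
    # --precision amp_bfloat16: autocast to bfloat16; no GradScaler is needed for bf16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        image_features = model.encode_image(images)
        text_features = model.encode_text(texts)
        loss = clip_loss(image_features, text_features)  # contrastive loss (placeholder)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # --grad-clip-norm 5.0
    optimizer.step()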
A slurm srun command line for 128 8-GPU (40GB A100) nodes:
srun --cpu_bind=v --accel-bind=gn python -m training.main \
--save-frequency 1 \
--name "xxlarge-2b-81920-bf16" \
--resume "latest" \
--logs "/runs" \
--log-every-n-steps 50 \
--train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
--train-num-samples 135646078 \
--dataset-type webdataset \
--warmup 10000 \
--batch-size=80 \
--epochs=256 \
--dataset-resampled \
--aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
--precision amp_bfloat16 \
--grad-clip-norm 5.0 \
--lr 1e-3 \
--workers=6 \
--beta2 0.98 \
--model "convnext_xxlarge" \
--seed 0 \
--ddp-static-graph \
--local-loss \
--gather-with-grad \
--grad-checkpointing \
--report-to "tensorboard"
For the rewind of the last 10%, a higher global batch size of 95744 was used, with a higher LR and slightly increased augmentation strength.
Checkpoint Interval | Cluster | # GPUs | # Nodes | GPU | local BS | sample/s | sample/s/gpu | precision | adam beta2 |
---|---|---|---|---|---|---|---|---|---|
231 - 256 | Stability | 1088 | 136 | A100 40GB | 88 | 32-35k | 29-32 | amp + bf16 | 0.98 |
The slurm srun command line for 136 8-GPU (40GB A100) nodes:
srun --cpu_bind=v --accel-bind=gn python -m training.main \
--save-frequency 1 \
--name "xxlarge-2b-81920-r-bf16" \
--resume "latest" \
--logs "/runs" \
--log-every-n-steps 50 \
--train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
--train-num-samples 135646078 \
--dataset-type webdataset \
--warmup 10000 \
--batch-size=88 \
--epochs=256 \
--dataset-resampled \
--aug-cfg use_timm=True scale='(0.3, 1.0)' re_prob=0.4 \
--precision amp_bfloat16 \
--grad-clip-norm 5.0 \
--lr 2e-3 \
--workers=6 \
--beta2 0.98 \
--model "convnext_xxlarge" \
--seed 0 \
--ddp-static-graph \
--local-loss \
--gather-with-grad \
--grad-checkpointing \
--report-to "tensorboard"
Evaluation
Evaluation was done with the code in the LAION CLIP Benchmark suite.
Testing Data, Factors & Metrics
Testing Data
Testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and with COCO and Flickr for retrieval.
Results
These models achieve top-1 zero-shot accuracy between 79.1% and 79.4% on ImageNet-1k.
(Figure: zoom-in on the final 10% of training with the rewind run.)
An initial round of benchmarks has been performed on a wider range of datasets, viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb.
Acknowledgements
Acknowledging stability.ai and the Gauss Centre for Supercomputing e.V. (http://gauss-centre.eu) for funding this part of the work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).
Citation
BibTeX:
LAION-5B
@inproceedings{schuhmann2022laionb,
title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
author={Christoph Schuhmann and
Romain Beaumont and
Richard Vencu and
Cade W Gordon and
Ross Wightman and
Mehdi Cherti and
Theo Coombes and
Aarush Katta and
Clayton Mullis and
Mitchell Wortsman and
Patrick Schramowski and
Srivatsa R Kundurthy and
Katherine Crowson and
Ludwig Schmidt and
Robert Kaczmarczyk and
Jenia Jitsev},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2022},
url={https://openreview.net/forum?id=M3Y74vmsMcY}
}
OpenCLIP software
@software{ilharco_gabriel_2021_5143773,
author = {Ilharco, Gabriel and
Wortsman, Mitchell and
Wightman, Ross and
Gordon, Cade and
Carlini, Nicholas and
Taori, Rohan and
Dave, Achal and
Shankar, Vaishaal and
Namkoong, Hongseok and
Miller, John and
Hajishirzi, Hannaneh and
Farhadi, Ali and
Schmidt, Ludwig},
title = {OpenCLIP},
month = jul,
year = 2021,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {0.1},
doi = {10.5281/zenodo.5143773},
url = {https://doi.org/10.5281/zenodo.5143773}
}
OpenAI CLIP paper
@inproceedings{Radford2021LearningTV,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
booktitle={ICML},
year={2021}
}
@Article{liu2022convnet,
author = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
title = {A ConvNet for the 2020s},
journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022},
}
@misc{rw2019timm,
author = {Ross Wightman},
title = {PyTorch Image Models},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
doi = {10.5281/zenodo.4414861},
howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
@InProceedings{pmlr-v162-wortsman22a,
title = {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
author = {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
pages = {23965--23998},
year = {2022},
editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
volume = {162},
series = {Proceedings of Machine Learning Research},
month = {17--23 Jul},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
url = {https://proceedings.mlr.press/v162/wortsman22a.html}
}