Paper Walkthrough (4): UrbanCross

Article link: https://blog.csdn.net/weixin_63767221/article/details/140609834

UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation
[2404.14241] UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation (arxiv.org)
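This paper builds on earlier work by introducing the idea of cross-domain retrieval, where the domains are different countries.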

Abstract

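Earlier satellite image datasets each came from a single country, so models trained on them do not generalize across countries. The paper therefore introduces a new dataset covering three countries.
It also uses:
an LLM for textual refinement
SAM (a segmentation model) for visual augmentation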

1. Introduction

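The introduction first reviews the basic existing approaches to satellite image retrieval. One is content-based: generate a caption directly from the image, then compare that caption against the query as text. This approach loses information.
CLIP-style methods, which compare image and text directly in a shared embedding space, avoid this problem.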
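
To illustrate the embedding-style comparison that CLIP-like methods rely on, here is a minimal sketch: image and text are represented as vectors in one space and ranked by cosine similarity. The vectors are made-up toy values rather than outputs of a real encoder, and `rank_images` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_vec, image_vecs):
    """Return image ids sorted from most to least similar to the text query."""
    scores = {img_id: cosine(text_vec, vec) for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings (assumed values, for illustration only).
image_vecs = {
    "harbor": [0.9, 0.1, 0.0],
    "farmland": [0.1, 0.9, 0.1],
    "airport": [0.2, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of a text query
ranking = rank_images(query, image_vecs)
```

No intermediate caption is generated here, so nothing is lost in an image-to-text step; the quality of the ranking depends entirely on the encoders that produce the vectors.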

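Data perspective: images are described using the geo-tags attached to the satellite imagery, rather than free-form hand-written descriptions.
Model perspective: (I did not fully understand this part; it seems to say)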

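Main contributions:
1) Data augmentation: the part described above.
2) Cross-Domain Adaptation:
The paper introduces an Adaptive Curriculum-based Source Sampler, which manages the source data according to its similarity to the target domain.
Original text: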

Adaptive Curriculum-based Source Sampler, which initially samples source data based on similarity to the target domain

Fine-tuning then follows, using an image-text module:

the Adversarial Cross-Domain Image-Text Fine-tuning Module for subsequent fine-tuning. This integrated strategy ensures a seamless transition from simpler to complex samples, applying weighting to align with domain-specific traits, thus effectively addressing the challenges posed by diverse data distributions across domains.

3) Extensive experiments:
This covers the final performance results.

2. Preliminaries

A basic background section; nothing novel here.


3. Method

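The method has three steps:
1) Image Caption and Segmentation:
The captioning part uses geo-tags: the satellite image and its geo-tags are given to the LLM as the prompt, which returns a description.
At the same time, the satellite image is run through image segmentation, yielding segments at different scales, and each segment is compared with the text just obtained via a similarity calculation. (This similarity step is particularly interesting.)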
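
The segment-to-text similarity step can be sketched as follows. This is only an illustration under assumptions: the segments and the caption are taken to be already embedded in a shared space, keeping the top-k segments is an assumed selection rule, and `select_segments` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_segments(caption_vec, segment_vecs, k=2):
    """Return indices of the k segments most similar to the caption embedding."""
    scored = [(cosine(caption_vec, vec), idx) for idx, vec in enumerate(segment_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy embeddings (assumed values, for illustration only).
caption_vec = [1.0, 0.0]
segment_vecs = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
kept = select_segments(caption_vec, segment_vecs, k=2)
```

Segments that score poorly against the caption are dropped, so only the regions the text actually talks about are carried forward.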

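2) Multi-modal pre-training:
The image, the segments, and the text are each encoded independently, then brought together through a pairwise contrastive loss: matching items are pulled close, and non-matching items are pushed apart.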
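
As a rough sketch of what a pairwise contrastive loss does, the code below implements a symmetric InfoNCE-style loss over a similarity matrix. The exact loss form and the temperature value are assumptions, not taken from the paper, and `info_nce` is a hypothetical name.

```python
import math

def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between item i of one modality and item j
    of the other; matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)

    def mean_row_cross_entropy(matrix):
        total = 0.0
        for i in range(n):
            logits = [matrix[i][j] / temperature for j in range(n)]
            peak = max(logits)  # subtract the max for numerical stability
            log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_denominator - logits[i]
        return total / n

    transposed = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (mean_row_cross_entropy(sim) + mean_row_cross_entropy(transposed))

# Toy similarity matrices (assumed values, for illustration only).
aligned = [[1.0, 0.1], [0.2, 1.0]]      # matched pairs score highest
misaligned = [[0.1, 1.0], [1.0, 0.2]]   # mismatched pairs score highest
loss_aligned = info_nce(aligned)
loss_misaligned = info_nce(misaligned)
```

The loss is small when each matched pair is the most similar entry in its row and column, and large otherwise. One plausible reading of "pairwise" is that the same function is applied to each pairing, such as image-text and segment-text, but that is an assumption.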

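3) Adaptive adversarial domain adaptation:
An adaptive curriculum-based sampler handles the source and target domains here.
Fine-tuning then proceeds gradually, starting from the source samples most similar to the target domain and working through the rest.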
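
The easy-to-hard ordering can be sketched as a staged schedule. This is a simplified illustration, not the paper's sampler: the linear growth of the admitted fraction and the name `curriculum_stages` are assumptions, and the adaptive weighting and the adversarial component are left out.

```python
def curriculum_stages(similarities, num_stages=3):
    """Return, for each stage, the indices of source samples admitted so far.

    similarities[i] is the similarity of source sample i to the target domain.
    The most target-like samples are admitted first; each later stage adds
    less similar ones until the whole source set is in use.
    """
    order = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    stages = []
    for stage in range(1, num_stages + 1):
        cutoff = max(1, round(len(order) * stage / num_stages))
        stages.append(order[:cutoff])
    return stages

# Toy similarity scores (assumed values, for illustration only).
sims = [0.2, 0.9, 0.5, 0.7, 0.1, 0.8]
stages = curriculum_stages(sims, num_stages=3)
```

Each stage would feed one round of fine-tuning, so the model sees target-like source data before it sees the samples that differ most from the target domain.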