AAAI 2024: 20 Papers on Satellite Remote Sensing Imagery

MDFL: Multi-Domain Diffusion-Driven Feature Learning

Paper analysis: http://www.studyai.com/xueshu/paper/detail/2deb4c9ca3

Paper link (DOI): 10.1609/aaai.v38i8.28710

Abstract

High-dimensional images, known for their rich semantic information, are widely applied in remote sensing and other fields.
The spatial information in these images reflects the object’s texture features, while the spectral information reveals the potential spectral representations across different bands.
Currently, however, the understanding of high-dimensional images remains confined to a single-domain perspective, which degrades performance.
Motivated by the masking texture effect observed in the human visual system, we present a multi-domain diffusion-driven feature learning network (MDFL), a scheme that redefines the effective information domain on which the model actually focuses.
This method employs diffusion-based posterior sampling to explicitly consider joint information interactions between the high-dimensional manifold structures in the spectral, spatial, and frequency domains, thereby eliminating the influence of masking texture effects in visual models.
Additionally, we introduce a feature reuse mechanism to gather deep and raw features of high-dimensional data.
We demonstrate that MDFL significantly improves the feature extraction performance of high-dimensional data, thereby providing a powerful aid for revealing the intrinsic patterns and structures of such data.
The experimental results on three multi-modal remote sensing datasets show that MDFL reaches an average overall accuracy of 98.25%, outperforming various state-of-the-art baseline schemes.
Code available at https://github.com/LDXDU/MDFL-AAAI-24…
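
To make the three information domains concrete, here is a minimal, illustrative sketch of extracting spatial, spectral, and frequency-domain descriptors from a hyperspectral patch. It is not the paper's learned model; the patch size, band count, and choice of statistics are assumptions for illustration only.

```python
import numpy as np
from scipy.fft import fft2

def multi_domain_features(cube):
    """Toy multi-domain descriptor for a hyperspectral patch.

    cube: (H, W, B) array of H x W pixels and B spectral bands.
    Returns a 1-D vector concatenating spatial, spectral, and
    frequency-domain statistics. Illustrative only: MDFL itself
    learns these joint interactions with a diffusion model.
    """
    spatial = cube.mean(axis=2)               # (H, W) texture map
    spectral = cube.mean(axis=(0, 1))         # (B,) mean spectrum
    freq = np.abs(fft2(spatial))              # (H, W) magnitude spectrum
    return np.concatenate([
        spatial.ravel(), spectral, np.log1p(freq).ravel()
    ])

patch = np.random.rand(8, 8, 32)              # fake 8x8 patch, 32 bands
print(multi_domain_features(patch).shape)     # (64 + 32 + 64,) = (160,)
```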

LDS2AE: Local Diffusion Shared-Specific Autoencoder for Multimodal Remote Sensing Image Classification with Arbitrary Missing Modalities

Paper analysis: http://www.studyai.com/xueshu/paper/detail/340c8f39fd

Paper link (DOI): 10.1609/aaai.v38i13.29391

Abstract

Recent research on the joint classification of multimodal remote sensing data has achieved great success.
However, due to the limitations imposed by imaging conditions, the case of missing modalities often occurs in practice.
Most previous researchers regard the classification in case of different missing modalities as independent tasks.
They train a specific classification model for each fixed missing modality by extracting multimodal joint representation, which cannot handle the classification of arbitrary (including multiple and random) missing modalities.
In this work, we propose a local diffusion shared-specific autoencoder (LDS2AE), which solves the classification of arbitrary missing modalities with a single model.
LDS2AE captures the data distribution of the different modalities to learn multimodal shared features for classification, using a novel local diffusion autoencoder that consists of a modality-shared encoder and several modality-specific decoders.
The modality-shared encoder extracts multimodal shared features by using the same parameters to map multimodal data into a shared subspace.
The modality-specific decoders take the multimodal shared features and reconstruct the image of each modality, which encourages the shared features to learn the unique information of the different modalities.
In addition, we incorporate masked training into the diffusion autoencoder to achieve local diffusion, which significantly reduces the model's training cost.
The approach is tested on widely-used multimodal remote sensing datasets, demonstrating the effectiveness of the proposed LDS2AE in addressing the classification of arbitrary missing modalities.
The code is available at https://github.com/Jiahuiqu/LDS2AE…
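
The shared-encoder / modality-specific-decoder split is the architectural core. Below is a minimal PyTorch sketch of that split under assumed toy dimensions; the local diffusion process and masked training are omitted, and the per-modality input stems are my addition to handle differing input widths, so this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class SharedSpecificAE(nn.Module):
    """Sketch: one shared encoder, one decoder per modality.

    The shared encoder maps any modality into a common subspace;
    each modality-specific decoder reconstructs its own input from
    the shared feature. (The real LDS2AE adds a local diffusion
    process with masked training, omitted here.)
    """
    def __init__(self, in_dims, latent_dim=64):
        super().__init__()
        # project each modality to a common width, then share the encoder
        self.stems = nn.ModuleList(nn.Linear(d, 128) for d in in_dims)
        self.encoder = nn.Sequential(nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, d))
            for d in in_dims
        )

    def forward(self, xs):
        # xs: one tensor per modality; None marks a missing modality
        losses = []
        for m, x in enumerate(xs):
            if x is None:
                continue                        # arbitrary modalities may be absent
            z = self.encoder(self.stems[m](x))  # shared feature
            recon = self.decoders[m](z)         # modality-specific reconstruction
            losses.append(((recon - x) ** 2).mean())
        return torch.stack(losses).sum()

model = SharedSpecificAE(in_dims=[144, 21])     # e.g. HSI bands, LiDAR channels
loss = model([torch.randn(4, 144), None])       # second modality missing
loss.backward()
```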

Quantile-Based Maximum Likelihood Training for Outlier Detection

Paper analysis: http://www.studyai.com/xueshu/paper/detail/4cb6dca9fa

Paper link (DOI): 10.1609/aaai.v38i19.30159

Abstract

Discriminative learning effectively predicts the true object class in image classification.
However, it often results in false positives for outliers, posing critical concerns in applications like autonomous driving and video surveillance systems.
Previous attempts to address this challenge involved training image classifiers through contrastive learning using actual outlier data or synthesizing outliers for self-supervised learning.
Furthermore, unsupervised generative modeling of inliers in pixel space has shown limited success for outlier detection.
In this work, we introduce a quantile-based maximum likelihood objective for learning the inlier distribution to improve the outlier separation during inference.
Our approach fits a normalizing flow to pre-trained discriminative features and detects the outliers according to the evaluated log-likelihood.
The experimental evaluation demonstrates the effectiveness of our method as it surpasses the performance of the state-of-the-art unsupervised methods for outlier detection.
The results are also competitive compared with a recent self-supervised approach for outlier detection.
Our work reduces the dependency on well-sampled negative training data, which is especially important for domains like medical diagnostics or remote sensing…
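
As a worked toy version of the inference rule, the sketch below replaces the normalizing flow with a single Gaussian fitted to inlier features (a simplifying assumption for brevity) and flags test points whose log-likelihood falls below a training-set quantile. Feature dimension, quantile, and the synthetic data are all made up.

```python
import numpy as np

# Fit a Gaussian density to pre-trained inlier features as a stand-in
# for the normalizing flow used in the paper (assumption: roughly
# unimodal features; a flow handles the general case).
rng = np.random.default_rng(0)
train_feats = rng.normal(0.0, 1.0, size=(5000, 16))       # inlier features

mu = train_feats.mean(axis=0)
cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(16)
cov_inv = np.linalg.inv(cov)
_, logdet = np.linalg.slogdet(cov)

def log_likelihood(x):
    d = x - mu
    return -0.5 * (np.einsum("ij,jk,ik->i", d, cov_inv, d)
                   + logdet + 16 * np.log(2 * np.pi))

# Quantile-based decision rule: flag anything whose log-likelihood
# falls below the q-th quantile of the training log-likelihoods.
threshold = np.quantile(log_likelihood(train_feats), 0.05)

test = np.vstack([rng.normal(0, 1, (5, 16)),               # inliers
                  rng.normal(4, 1, (5, 16))])              # outliers
print(log_likelihood(test) < threshold)                    # outlier flags
```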

DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling

Paper analysis: http://www.studyai.com/xueshu/paper/detail/5560fa294f

Paper link (DOI): 10.1609/aaai.v38i20.30235

Abstract

Many applications use computer vision to detect and count objects in massive image collections.
However, automated methods may fail to deliver accurate counts, especially when the task is very difficult or requires a fast response time.
For example, during disaster response, aid organizations aim to quickly count damaged buildings in satellite images to plan relief missions, but pre-trained building and damage detectors often perform poorly due to domain shifts.
In such cases, there is a need for human-in-the-loop approaches to accurately count with minimal human effort.
We propose DISCount, a detector-based importance sampling framework for counting in large image collections.
DISCount uses an imperfect detector and human screening to estimate low-variance unbiased counts.
We propose techniques for counting over multiple spatial or temporal regions using a small amount of screening and for estimating confidence intervals.
This enables end-users to stop screening when estimates are sufficiently accurate, which is often the goal in real-world applications.
We demonstrate our method with two applications: counting birds in radar imagery to understand responses to climate change, and counting damaged buildings in satellite imagery for damage assessment in regions struck by a natural disaster.
On the technical side we develop variance reduction techniques based on control variates and prove the (conditional) unbiasedness of the estimators.
DISCount leads to a 9-12x reduction in the labeling costs needed to obtain the same error rates as naive screening for the tasks we consider, and surpasses alternative covariate-based screening approaches…
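
The core estimator is easy to state in code. The self-contained sketch below simulates a detector whose counts define the sampling proposal, then forms the unbiased importance-sampling estimate of the total count. The Poisson/Gaussian data and the screening budget are made up, and the paper's control-variate refinement is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 1000                                    # number of regions (e.g. image tiles)
true = rng.poisson(5.0, size=R)             # ground-truth counts g(i), unknown in practice
det = np.maximum(true + rng.normal(0, 2, R), 0.1)  # noisy detector counts f(i)

# Proposal proportional to detector counts: q(i) = f(i) / sum_j f(j).
q = det / det.sum()

n = 100                                     # human screening budget
idx = rng.choice(R, size=n, p=q)            # regions a human actually screens

# Unbiased importance-sampling estimate of the total count:
#   F_hat = (1/n) * sum_s g(s) / q(s),  E[F_hat] = sum_i g(i).
# Here true[idx] stands for the human-provided counts of screened regions.
F_hat = np.mean(true[idx] / q[idx])
print(F_hat, true.sum())                    # estimate vs. ground truth
```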

Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability

Paper analysis: http://www.studyai.com/xueshu/paper/detail/56cc8e6be4

Paper link (DOI): 10.1609/aaai.v38i14.29518

Abstract

With the increasing customization of spectrometers, spectral unmixing has become a widely used technique in fields such as remote sensing, textiles, and environmental protection.
However, endmember variability is a common issue for unmixing: changes in lighting, atmospheric or temporal conditions, or the intrinsic spectral characteristics of materials can all cause variations in the measured spectrum.
Recent studies have employed deep neural networks to tackle endmember variability.
However, these approaches rely on generic networks to resolve the issue implicitly, and they struggle with its ill-posed nature and the lack of effective convergence constraints for endmember variability.
This paper proposes a streamlined multi-task learning model to rectify this problem, incorporating abundance regression and multi-label classification with Unmixing as a Bayesian Inverse Problem, denoted as BIPU.
To address the issue of the ill-posed nature, the uncertainty of unmixing is quantified and minimized through the Laplace approximation in a Bayesian inverse solver.
In addition, to improve convergence under the influence of endmember variability, the paper introduces two types of constraints.
The first separates background factors of variants from the initial factors for each endmember, while the second identifies and eliminates the influence of non-existent endmembers via multi-label classification during convergence.
The effectiveness of this model is demonstrated not only on a self-collected near-infrared spectral textile dataset (FENIR), but also on three commonly used remote sensing hyperspectral image datasets, where it achieves state-of-the-art unmixing performance and exhibits strong generalization capabilities…
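
For readers new to unmixing, the underlying forward model is linear mixing, y = Ea + noise, with nonnegative abundances summing to one. The sketch below solves this classical fully constrained least-squares problem (not BIPU itself) using the standard NNLS augmentation trick; all sizes and values are synthetic.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
B, M = 50, 3                                # bands, endmembers
E = rng.random((B, M))                      # endmember signatures (columns)
a_true = np.array([0.6, 0.3, 0.1])          # abundances: a >= 0, sum(a) = 1
y = E @ a_true + rng.normal(0, 0.005, B)    # observed mixed spectrum

# Fully constrained least squares via the standard augmentation trick:
# append a heavily weighted row enforcing sum(a) ~= 1, then run NNLS.
delta = 100.0
E_aug = np.vstack([E, delta * np.ones((1, M))])
y_aug = np.append(y, delta)
a_hat, _ = nnls(E_aug, y_aug)
print(a_hat)                                # close to [0.6, 0.3, 0.1]
```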

Semi-supervised Open-World Object Detection

Paper analysis: http://www.studyai.com/xueshu/paper/detail/58af2a7eef

Paper link (DOI): 10.1609/aaai.v38i5.28227

Abstract

The conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then incrementally learns the unknown objects as labels are introduced in subsequent tasks.
However, the current OWOD formulation relies heavily on an external human oracle for knowledge input during the incremental learning stages.
Such run-time reliance on an oracle makes this formulation less realistic for real-world deployment.
To address this, we introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD), that reduces the annotation cost by casting the incremental learning stages of OWOD in a semi-supervised manner.
We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting.
Therefore, we introduce a novel SS-OWOD detector, named SS-OWFormer, that utilizes a feature-alignment scheme to better align the object query representations between the original and augmented images to leverage the large unlabeled and few labeled data.
We further introduce a pseudo-labeling scheme for unknown detection that exploits the inherent capability of decoder object queries to capture object-specific information.
On the COCO dataset, our SS-OWFormer using only 50% of the labeled data achieves detection performance on par with the state-of-the-art (SOTA) OWOD detector using 100% of the labeled data.
Further, our SS-OWFormer achieves an absolute gain of 4.8% in unknown recall over the SOTA OWOD detector.
Lastly, we demonstrate the effectiveness of our SS-OWOD problem setting and approach for remote sensing object detection, proposing carefully curated splits and baseline performance evaluations.
Our experiments on 4 datasets including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of our approach.
Our source code, models and splits are available here https://github.com/sahalshajim/SS-OWFormer.
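
One ingredient, the query feature-alignment idea, can be sketched minimally as follows. The one-to-one pairing of queries by index and the cosine-similarity loss form are my assumptions, not the authors' exact scheme.

```python
import torch
import torch.nn.functional as F

def query_alignment_loss(q_orig, q_aug):
    """Hedged sketch: pull each object query from the augmented image
    toward its counterpart from the original image (cosine similarity).
    Assumes queries are already matched one-to-one by index."""
    q_orig = F.normalize(q_orig, dim=-1)
    q_aug = F.normalize(q_aug, dim=-1)
    return (1.0 - (q_orig * q_aug).sum(dim=-1)).mean()

# 100 object queries of width 256 from two views of the same image
loss = query_alignment_loss(torch.randn(100, 256), torch.randn(100, 256))
print(loss)
```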

Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types

Paper analysis: http://www.studyai.com/xueshu/paper/detail/5c1a46d297

Paper link (DOI): 10.1609/aaai.v38i20.30225

Abstract

Accurate crop type maps are an essential source of information for monitoring yield progress at scale, projecting global crop production, and planning effective policies.
To date, however, crop type maps remain challenging to create in low- and middle-income countries due to a lack of ground truth labels for training machine learning models.
Field surveys are the gold standard in terms of accuracy but require an often-prohibitively large amount of time, money, and statistical capacity.
In recent years, street-level imagery, such as Google Street View, KartaView, and Mapillary, has become available around the world.
Such imagery contains rich information about crop types grown at particular locations and times.
In this work, we develop an automated system to generate crop type ground references using deep learning and Google Street View imagery.
The method efficiently curates a set of street-view images containing crop fields, trains a model to predict crop types using either weakly-labeled images from disparate out-of-domain sources or zero-shot labeled street view images with GPT-4V, and combines the predicted labels with remote sensing time series to create a wall-to-wall crop type map.
We show that, in Thailand, the resulting country-wide map of rice, cassava, maize, and sugarcane achieves an accuracy of 93%.
We publicly release the first-ever crop type map for all of Thailand 2022 at 10m-resolution with no gaps.
To our knowledge, this is the first time a 10m-resolution, multi-crop map has been created for any smallholder country.
As the availability of roadside imagery expands, our pipeline provides a way to map crop types at scale around the globe, especially in underserved smallholder regions…

End-to-End RGB-D Image Compression via Exploiting Channel-Modality Redundancy

Paper analysis: http://www.studyai.com/xueshu/paper/detail/6e6539eb21

Paper link (DOI): 10.1609/aaai.v38i7.28588

Abstract

As a kind of 3D data, RGB-D images have been extensively used in object tracking, 3D reconstruction, remote sensing mapping, and other tasks.
In the realm of computer vision, the significance of RGB-D images is progressively growing.
However, the existing learning-based image compression methods usually process RGB images and depth images separately, which cannot entirely exploit the redundant information between the modalities, limiting the further improvement of the Rate-Distortion performance.
To overcome this defect, we propose a learning-based dual-branch RGB-D image compression framework.
In place of the traditional RGB-domain compression scheme, a YUV-domain compression scheme is presented for spatial redundancy removal.
In addition, Intra-Modality Attention (IMA) and Cross-Modality Attention (CMA) are introduced for modal redundancy removal.
To benefit from cross-modal prior information, a Context Prediction Module (CPM) and a Context Fusion Module (CFM) are introduced into the conditional entropy model, making the context probability prediction more accurate.
The experimental results demonstrate that our method outperforms existing image compression methods on two RGB-D image datasets.
Compared with BPG, our proposed framework can achieve up to 15% bit rate saving for RGB images…

Enhancing Hyperspectral Images via Diffusion Model and Group-Autoencoder Super-resolution Network

Paper analysis: http://www.studyai.com/xueshu/paper/detail/7071b40f37

Paper link (DOI): 10.1609/aaai.v38i6.28392

Abstract

Existing hyperspectral image (HSI) super-resolution (SR) methods struggle to effectively capture complex spectral-spatial relationships and low-level details, while diffusion models are a promising class of generative models known for their exceptional performance in modeling complex relations and learning both high- and low-level visual features.
The direct application of diffusion models to HSI SR is hampered by challenges such as difficulties in model convergence and protracted inference time.
In this work, we introduce a novel Group-Autoencoder (GAE) framework that synergistically combines with the diffusion model to construct a highly effective HSI SR model (DMGASR).
Our proposed GAE framework encodes high-dimensional HSI data into low-dimensional latent space where the diffusion model works, thereby alleviating the difficulty of training the diffusion model while maintaining band correlation and considerably reducing inference time.
Experimental results on both natural and remote sensing hyperspectral datasets demonstrate that the proposed method is superior to other state-of-the-art methods both visually and metrically…

Early Detection of Extreme Storm Tide Events Using Multimodal Data Processing

Paper analysis: http://www.studyai.com/xueshu/paper/detail/7208e6507d

Paper link (DOI): 10.1609/aaai.v38i20.30194

Abstract

Sea-level rise is a well-known consequence of climate change.
Several studies have estimated the social and economic impact of the increase in extreme flooding.
An efficient way to mitigate its consequences is the development of a flood alert and prediction system, based on high-resolution numerical models and robust sensing networks.
However, current models use various simplifying assumptions that compromise accuracy to ensure solvability within a reasonable timeframe, hindering more regular and cost-effective forecasts for various locations along the shoreline.
To address these issues, this work proposes a hybrid model for multimodal data processing that combines physics-based numerical simulations, data obtained from a network of sensors, and satellite images to provide refined wave and sea-surface height forecasts, with real results obtained in a critical location within the Port of Santos (the largest port in Latin America).
Our approach exhibits faster convergence than data-driven models while achieving more accurate predictions.
Moreover, the model handles irregularly sampled time series and missing data without the need for complex preprocessing mechanisms or data imputation while keeping low computational costs through a combination of time encoding, recurrent and graph neural networks.
Enabling raw sensor data to be easily combined with existing physics-based models opens up new possibilities for accurate extreme storm tide events forecast systems that enhance community safety and aid policymakers in their decision-making processes…

Decoupled Training: Return of Frustratingly Easy Multi-Domain Learning

Paper analysis: http://www.studyai.com/xueshu/paper/detail/7511cd17c0

Paper link (DOI): 10.1609/aaai.v38i14.29492

Abstract

Multi-domain learning (MDL) aims to train a model with minimal average risk across multiple overlapping but non-identical domains.
To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce domain gap or reserving differences by implementing domain-specific towers, gates, and even experts.
MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs.
In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training (D-Train).
D-Train is a tri-phase general-to-specific training strategy: it first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting the model into multiple heads, and finally fine-tunes the heads with the backbone fixed, enabling decoupled training to achieve domain independence.
Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations on various datasets, from standard benchmarks to applications in satellite imagery and recommender systems…
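
The tri-phase recipe is simple enough to spell out directly. Below is a toy PyTorch sketch in which random tensors stand in for per-domain data loaders; the layer sizes, optimizer, and single step per phase are all illustrative assumptions rather than the paper's training setup.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())      # shared root model
heads = nn.ModuleList(nn.Linear(64, 10) for _ in range(3))  # one head per domain

def step(params, x, y, head):
    # one illustrative gradient step over the given parameter set
    opt = torch.optim.SGD(params, lr=1e-2)
    loss = nn.functional.cross_entropy(head(backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

x = torch.randn(8, 32)                   # stand-in for a domain batch
y = torch.randint(0, 10, (8,))

# Phase 1: pre-train the backbone (plus a warm-up head) on pooled data.
step(list(backbone.parameters()) + list(heads[0].parameters()), x, y, heads[0])

# Phase 2: post-train per domain, splitting into domain-specific heads.
for h in heads:
    step(list(backbone.parameters()) + list(h.parameters()), x, y, h)

# Phase 3: freeze the backbone, fine-tune only each domain's head.
for p in backbone.parameters():
    p.requires_grad_(False)
for h in heads:
    step(list(h.parameters()), x, y, h)
```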

UV-SAM: Adapting Segment Anything Model for Urban Village Identification

Paper analysis: http://www.studyai.com/xueshu/paper/detail/7850aa3e69

Paper link (DOI): 10.1609/aaai.v38i20.30260

Abstract

Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructures and poor living conditions, closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities.
Traditionally, governments depend heavily on field surveys to monitor urban villages; these surveys, however, are time-consuming, labor-intensive, and often delayed.
Thanks to widely available and timely updated satellite images, recent studies develop computer vision techniques to detect urban villages efficiently.
However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information.
To accurately identify urban village boundaries from satellite images, we harness the power of the vision foundation model and adapt the Segment Anything Model (SAM) to urban village segmentation, named UV-SAM.
Specifically, UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification.
Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines, and identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and shedding light on vision foundation models for sustainable cities.
The dataset and codes of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM…

HarvestNet: A Dataset for Detecting Smallholder Farming Activity Using Harvest Piles and Remote Sensing

Paper analysis: http://www.studyai.com/xueshu/paper/detail/a22e439a50

Paper link (DOI): 10.1609/aaai.v38i20.30251

Abstract

Small farms contribute to a large share of the productive land in developing countries.
In regions such as sub-Saharan Africa, where 80% of farms are small (under 2 ha in size), the task of mapping smallholder cropland is an important part of tracking sustainability measures such as crop productivity.
However, the visually diverse and nuanced appearance of small farms has limited the effectiveness of traditional approaches to cropland mapping.
Here we introduce a new approach based on the detection of harvest piles characteristic of many smallholder systems throughout the world.
We present HarvestNet, a dataset for mapping the presence of farms in the Ethiopian regions of Tigray and Amhara during 2020-2023, collected using expert knowledge and satellite images, totalling 7k hand-labeled images and 2k ground-collected labels.
We also benchmark a set of baselines, including SOTA remote sensing models, with our best models achieving around 80% classification performance on hand-labeled data and 90% and 98% accuracy on ground-truth data for Tigray and Amhara, respectively.
We also perform a visual comparison with a widely used pre-existing coverage map and show that our model detects an extra 56,621 hectares of cropland in Tigray.
We conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food insecure regions.
The dataset can be accessed through https://figshare.com/s/45a7b45556b90a9a11d2, while the code for the dataset and benchmarks is publicly available at https://github.com/jonxuxu/harvest-piles.

Deep Linear Array Pushbroom Image Restoration: A Degradation Pipeline and Jitter-Aware Restoration Network

Paper analysis: http://www.studyai.com/xueshu/paper/detail/a7e2ad4485

Paper link (DOI): 10.1609/aaai.v38i2.27892

Abstract

Linear Array Pushbroom (LAP) imaging technology is widely used in the realm of remote sensing.
However, images acquired through LAP always suffer from distortion and blur because of camera jitter.
Traditional methods for restoring LAP images, such as algorithms estimating the point spread function (PSF), exhibit limited performance.
To tackle this issue, we propose a Jitter-Aware Restoration Network (JARNet), to remove the distortion and blur in two stages.
In the first stage, we formulate an Optical Flow Correction (OFC) block to refine the optical flow of the degraded LAP images, resulting in pre-corrected images where most of the distortions are alleviated.
In the second stage, to further enhance the pre-corrected images, we integrate two jitter-aware techniques within the Spatial and Frequency Residual (SFRes) block: 1) introducing Coordinate Attention (CoA) into the SFRes block to capture the jitter state in orthogonal directions; 2) manipulating image features in both the spatial and frequency domains to leverage local and global priors.
Additionally, we develop a data synthesis pipeline that applies a Continuous Dynamic Shooting Model (CDSM) to simulate realistic degradation in LAP images.
Both the proposed JARNet and LAP image synthesis pipeline establish a foundation for addressing this intricate challenge.
Extensive experiments demonstrate that the proposed two-stage method outperforms state-of-the-art image restoration models.
Code is available at https://github.com/JHW2000/JARNet…

Frequency-Adaptive Pan-Sharpening with Mixture of Experts

Paper analysis: http://www.studyai.com/xueshu/paper/detail/acbf21ebeb

Paper link (DOI): 10.1609/aaai.v38i3.27984

Abstract

Pan-sharpening involves reconstructing missing high-frequency information in multi-spectral images with low spatial resolution, using a higher-resolution panchromatic image as guidance.
Despite the innate connection to the frequency domain, existing pan-sharpening research has barely investigated potential solutions in the frequency domain.
To this end, we propose a novel Frequency Adaptive Mixture of Experts (FAME) learning framework for pan-sharpening, which consists of three key components: the Adaptive Frequency Separation Prediction Module, the Sub-Frequency Learning Expert Module, and the Expert Mixture Module.
In detail, the first module leverages the discrete cosine transform to perform frequency separation by predicting a frequency mask.
Based on the generated mask, the second module uses a low-frequency MoE and a high-frequency MoE to enable effective reconstruction of the low- and high-frequency information.
Finally, the fusion module dynamically weights the high- and low-frequency MoE knowledge to adapt to remote sensing images with significant content variation.
Quantitative and qualitative experiments over multiple datasets demonstrate that our method performs best against other state-of-the-art methods and generalizes strongly to real-world scenes.
Code will be made publicly available at https://github.com/alexhe101/FAME-Net…
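
To make the frequency-separation step concrete, here is a toy DCT-based decomposition in which a fixed low-pass mask stands in for FAME's predicted adaptive mask; the cutoff and image size are arbitrary assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_frequencies(img, cutoff=8):
    """Toy DCT frequency separation. A fixed low-pass mask stands in
    for FAME's *predicted* adaptive mask. img: (H, W) single band."""
    coeffs = dctn(img, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:cutoff, :cutoff] = 1.0                      # keep low-frequency block
    low = idctn(coeffs * mask, norm="ortho")          # low-frequency expert input
    high = idctn(coeffs * (1 - mask), norm="ortho")   # high-frequency expert input
    return low, high

img = np.random.rand(64, 64)
low, high = split_frequencies(img)
print(np.allclose(low + high, img))                   # True: exact decomposition
```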

Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications

Paper analysis: http://www.studyai.com/xueshu/paper/detail/ba4716764a

Paper link (DOI): 10.1609/aaai.v38i8.28686

Abstract

Satellite data, bolstered by its increasing accessibility, is fueling many efforts to automatically monitor the earth's surface for various applications.
Such applications demand high spatial resolution images at a temporal resolution of a few days which entails the challenge of processing a huge volume of image time series data.
To overcome this computing bottleneck, we present PatchNet, a bespoke adaptation of beam search and the attention mechanism.
PatchNet is an automated patch selection neural network that requires only a partial spatial traversal of an image time series and yet achieves impressive results.
Satellite systems face a trade-off between spatial and temporal resolution due to budget and technical constraints; e.g., Landsat-8/9 and Sentinel-2 have high spatial resolution, whereas MODIS has high temporal resolution.
To deal with the limitation of coarse temporal resolution, we propose FuSITSNet, a twofold feature-based generic fusion model with multimodal learning in a contrastive setting.
It produces a learned representation after fusion of two satellite image time series leveraging finer spatial resolution of Landsat and finer temporal resolution of MODIS.
The patch alignment module of FuSITSNet aligns the PatchNet processed patches of Landsat-8 with the corresponding MODIS regions to incorporate its finer resolution temporal features.
The untraversed patches are handled by the cross-modality attention which highlights additional hot spot features from the two modalities.
We conduct extensive experiments on more than 2,000 US counties for crop yield, snow cover, and solar energy prediction and show that even one-fourth spatial processing of an image time series produces state-of-the-art results.
FuSITSNet outperforms predictions from a single modality and from data obtained with existing generative fusion models, and it allows dynamic phenomena to be monitored using freely accessible images, thereby unlocking new opportunities…

Patched Line Segment Learning for Vector Road Mapping

Paper analysis: http://www.studyai.com/xueshu/paper/detail/bdc8eed5b0

Paper link (DOI): 10.1609/aaai.v38i6.28447

Abstract

This paper presents a novel approach to computing vector road maps from satellite remotely sensed images, building upon a well-defined Patched Line Segment (PaLiS) representation for road graphs that holds geometric significance.
Unlike prevailing methods that derive road vector representations from satellite images using binary masks or keypoints, our method employs line segments.
These segments not only convey road locations but also capture their orientations, making them a robust choice for representation.
More precisely, given an input image, we divide it into non-overlapping patches and predict a suitable line segment within each patch.
This strategy enables us to capture spatial and structural cues from these patch-based line segments, simplifying the process of constructing the road network graph without the necessity of additional neural networks for connectivity.
In our experiments, we demonstrate how an effective representation of a road graph significantly enhances the performance of vector road mapping on established benchmarks, without requiring extensive modifications to the neural network architecture.
Furthermore, our method achieves state-of-the-art performance with just 6 GPU hours of training, leading to a substantial 32-fold reduction in training costs in terms of GPU hours…
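
The representation is easy to picture in code: one line segment per patch, with a road graph recovered by snapping nearby endpoints together. The sketch below is a hand-rolled illustration; the coordinates, snap tolerance, and merging rule are assumptions, not the paper's learned pipeline.

```python
import numpy as np

# Toy PaLiS-style data: each non-overlapping patch carries at most one
# line segment (two endpoints in image coordinates).
patch_segments = {
    (0, 0): ((4.0, 12.0), (15.9, 13.1)),   # (patch row, col) -> segment
    (0, 1): ((16.1, 13.0), (28.0, 14.2)),
    (1, 1): ((28.0, 14.2), (30.0, 30.0)),
}

def build_graph(segments, snap=1.5):
    """Assemble a road graph by snapping endpoints within `snap` pixels."""
    nodes, edges = [], []
    def node_id(p):
        for i, q in enumerate(nodes):       # reuse a nearby existing endpoint
            if np.hypot(p[0] - q[0], p[1] - q[1]) <= snap:
                return i
        nodes.append(p)
        return len(nodes) - 1
    for a, b in segments.values():
        edges.append((node_id(a), node_id(b)))
    return nodes, edges

nodes, edges = build_graph(patch_segments)
print(len(nodes), edges)                    # endpoints merged across patches
```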

EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering

Paper analysis: http://www.studyai.com/xueshu/paper/detail/c0fa249403

Paper link (DOI): 10.1609/aaai.v38i6.28357

Abstract

Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning.
Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis.
The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded.
As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way.
To preserve refined spatial locations and semantics, SOBA leverages a segmentation network for object semantics generation.
The object-guided attention aggregates object interior features via pseudo masks, and bidirectional cross-attention further models object external relations hierarchically.
To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks.
Experimental results show that SOBA outperforms both advanced general and remote sensing methods.
We believe this dataset and framework provide a strong benchmark for complex analysis in Earth vision.
The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA…
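
The numerical difference loss is described only at a high level, so the sketch below is one plausible reading: cross-entropy over count classes plus a penalty on the numerical distance between the soft predicted count and the true count. The weighting and exact form are my assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def numerical_difference_loss(logits, target, weight=0.5):
    """Hedged sketch of a difference-aware counting loss: standard
    cross-entropy over count classes plus a penalty that grows with
    the numerical distance between predicted and true counts."""
    ce = F.cross_entropy(logits, target)
    counts = torch.arange(logits.size(1), dtype=torch.float32)
    expected = (logits.softmax(dim=1) * counts).sum(dim=1)  # soft count
    diff = (expected - target.float()).abs().mean()         # regression term
    return ce + weight * diff

logits = torch.randn(4, 20, requires_grad=True)  # counts 0..19
target = torch.tensor([3, 0, 12, 7])
numerical_difference_loss(logits, target).backward()
```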

GLH-Water: A Large-Scale Dataset for Global Surface Water Detection in Large-Size Very-High-Resolution Satellite Imagery

Paper analysis: http://www.studyai.com/xueshu/paper/detail/d99fe3efbf

Paper link (DOI): 10.1609/aaai.v38i20.30226

Abstract

Global surface water detection in very-high-resolution (VHR) satellite imagery can directly serve major applications such as refined flood mapping and water resource assessment.
Although achievements have been made in detecting surface water in small-size satellite images corresponding to local geographic scales, datasets and methods suitable for mapping and analyzing global surface water have yet to be explored.
To encourage the development of this task and facilitate the implementation of relevant applications, we propose the GLH-water dataset, which consists of 250 satellite images with 40.96 billion pixels of labeled surface water annotations, distributed globally and containing water bodies of a wide variety of types (e.g., rivers, lakes, and ponds in forests, irrigated fields, bare areas, and urban areas).
Each image is 12,800 × 12,800 pixels at 0.3 m spatial resolution.
To build a benchmark for GLH-water, we perform extensive experiments employing representative surface water detection models, popular semantic segmentation models, and ultra-high resolution segmentation models.
Furthermore, we also design a strong baseline with the novel pyramid consistency loss (PCL) to initially explore this challenge, increasing IoU by 2.4% over the next best baseline.
Finally, we conduct cross-dataset generalization and pilot-area application experiments; the superior performance illustrates the strong generalization ability and practical value of the GLH-water dataset.
Project page: https://jack-bo1220.github.io/project/GLH-water.html.
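
The pyramid consistency idea can be sketched as agreement between predictions across scales. The toy version below (a stand-in one-layer "segmenter", a bilinear pyramid, and an MSE agreement term) is an assumed reading of PCL, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pyramid_consistency(model, img, scales=(0.5, 0.25)):
    """Hedged sketch of a pyramid consistency term: the prediction on a
    downsampled image should agree with the downsampled full-resolution
    prediction. The paper's exact PCL formulation may differ."""
    full = model(img)                                    # (N, 1, H, W) logits
    loss = 0.0
    for s in scales:
        small_pred = model(F.interpolate(img, scale_factor=s, mode="bilinear"))
        full_down = F.interpolate(full, scale_factor=s, mode="bilinear")
        loss = loss + F.mse_loss(small_pred.sigmoid(), full_down.sigmoid())
    return loss / len(scales)

model = torch.nn.Conv2d(3, 1, 3, padding=1)              # stand-in segmenter
img = torch.randn(2, 3, 64, 64)
print(pyramid_consistency(model, img))
```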

SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

Paper analysis: http://www.studyai.com/xueshu/paper/detail/dfb7340694

Paper link (DOI): 10.1609/aaai.v38i6.28393

Abstract

Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs).
A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images.
Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale.
In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets.
It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval.
We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis…
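
Zero-shot scene classification with a CLIP-style model, the evaluation protocol used above, looks roughly like the sketch below. The OpenCLIP checkpoint, prompt template, class list, and image path are placeholders; the SkyScript-pretrained weights would be substituted where available.

```python
import torch
import open_clip
from PIL import Image

# Generic OpenCLIP checkpoint as a stand-in for SkyScript weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

classes = ["airport", "forest", "harbor", "farmland"]    # placeholder classes
prompts = tokenizer([f"a satellite image of a {c}" for c in classes])

image = preprocess(Image.open("scene.jpg")).unsqueeze(0) # hypothetical file
with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(prompts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)     # cosine similarity
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)
print(classes[probs.argmax().item()])                    # predicted scene class
```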

Domain-Controlled Prompt Learning

Paper analysis: http://www.studyai.com/xueshu/paper/detail/f7af3a8795

Paper link (DOI): 10.1609/aaai.v38i2.27853

Abstract

Large pre-trained vision-language models, such as CLIP, have shown remarkable generalization capabilities across various tasks when appropriate text prompts are provided.
However, adapting these models to specific domains, such as remote sensing images (RSIs) and medical images, remains underexplored and challenging.
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms, leading to suboptimal performance due to the misinterpretation of specific images in natural image patterns.
To tackle this dilemma, we propose Domain-Controlled Prompt Learning (DCPL) for specific domains.
Specifically, the large-scale specific domain foundation model (LSDM) is first introduced to provide essential specific domain knowledge.
Using lightweight neural networks, we transfer this knowledge into domain biases, which control both the visual and language branches to obtain domain-adaptive prompts by direct incorporation.
Simultaneously, to overcome the overfitting challenge, we propose a novel noise-adding strategy, with no extra trainable parameters, that helps the model escape suboptimal solutions through global domain oscillation.
Experimental results show our method achieves state-of-the-art performance in specific domain image recognition datasets.
Our code is available at https://github.com/caoql98/DCPL…
