Available online, doi: 10.11999/JEIT260189
Abstract:
Significance Remote sensing image–text retrieval (RS-TIR) connects massive Earth observation imagery with natural-language queries and has become an important interface for geospatial intelligence systems. Compared with conventional content-based retrieval, RS-TIR enables users to search for scenes, objects, spatial layouts, and functional regions through semantic descriptions instead of handcrafted visual cues. This capability is increasingly needed in natural resource monitoring, urban governance, disaster response, environmental assessment, and on-demand retrieval from rapidly growing satellite archives. The task nevertheless remains fundamentally challenging: remote sensing imagery is captured from a nadir or near-nadir perspective, exhibits strong rotation invariance, contains extreme scale variation ranging from tiny vehicles to large airports, and often involves domain-specific semantic descriptions such as land-use attributes, spatial distributions, and geoscientific relations. Meanwhile, high-quality image–text annotations remain scarce relative to the scale of remote sensing data. These properties widen the semantic gap between images and language and constrain the generalization ability of traditional cross-modal retrieval methods. Against this background, the review focuses on how vision–language foundation models (VLMs) reshape RS-TIR by introducing large-scale contrastive pre-training, more transferable representations, and more flexible multimodal interaction mechanisms. The review also clarifies why remote sensing adaptation is necessary and why a dedicated synthesis of architectures, datasets, alignment mechanisms, and future directions is timely for the field.

Progress The technical development of RS-TIR is organized from three complementary perspectives.
First, the review summarizes the domain-specific challenges that shape the task, including visually isotropic topology with extreme scale variation, professionalized and fine-grained textual semantics, and the compounded semantic gap between overhead imagery and natural-language descriptions (Fig.3). The overall survey structure is then outlined to show the logical progression from task formulation to future challenges (Fig.1). Along the methodological timeline, RS-TIR evolves from handcrafted visual descriptors and shallow semantic mapping, through deep representation learning, to VLM-driven paradigms with broader generalization and zero-shot transfer ability (Fig.4, Table 2). Early methods rely on color, texture, shape, and hash-based retrieval, but they struggle to model high-level geospatial semantics and complex scene composition. Deep learning methods improve retrieval by learning joint embedding spaces, adopting dual-encoder or interaction-based architectures, and introducing multi-scale feature fusion and region-aware matching. These methods substantially enhance semantic consistency, yet they still depend heavily on labeled data and often lack robustness in open or cross-sensor scenarios.

Second, the review summarizes the benchmark ecosystem used to evaluate these methods. Representative datasets span small-scale test sets such as Sydney-Caption and UCM-Caption, mainstream benchmarks such as RSICD and RSITMD, and recent large-scale training resources such as RS5M and SkyScript (Table 1). These datasets reveal a clear transition from small, manually annotated corpora to web-scale or automatically generated image–text pairs, which in turn supports domain pre-training and the adaptation of larger models.

Third, the review analyzes the core VLM techniques now driving progress in RS-TIR.
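The dual-encoder matching scheme mentioned above reduces retrieval to a similarity ranking in one shared embedding space. The sketch below illustrates that scoring step with NumPy; the random vectors merely stand in for real encoder outputs (for example, a CLIP-style image tower and text tower) and are not any specific model's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Unit-normalize embeddings so that dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_images(text_emb, image_embs, k=5):
    """Dual-encoder retrieval: score every image against one text query with a
    single matrix-vector product, then return the top-k image indices."""
    sims = image_embs @ text_emb          # (N,) cosine similarities
    order = np.argsort(-sims)             # descending by similarity
    return order[:k], sims[order[:k]]

# Stand-ins for encoder outputs; in a deployed system image_embs would be
# precomputed and indexed offline, which is why dual encoders scale well.
rng = np.random.default_rng(0)
image_embs = l2_normalize(rng.normal(size=(1000, 64)))
text_emb = l2_normalize(rng.normal(size=64))
top_idx, top_sims = rank_images(text_emb, image_embs)
```

Because the two towers never interact until the final dot product, image embeddings can be indexed ahead of time; interaction-based architectures give up this property in exchange for finer-grained alignment.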
The model spectrum and representative architecture families, including contrastive dual-encoder models, multimodal interaction models, and remote sensing foundation models integrated with large language models, are summarized systematically (Fig.5, Fig.6, Table 3). Domain adaptation routes are further grouped into continued remote sensing pre-training, parameter-efficient transfer learning, adapter-based tuning, prompt learning, and instruction tuning. At the semantic alignment level, the review emphasizes contrastive joint embedding, fine-grained multi-scale alignment, and the incorporation of remote sensing priors such as spatial topology and geolocation. Performance comparisons on RSICD and RSITMD show that the introduction of remote sensing VLMs, especially RemoteCLIP, GeoRSCLIP, iEBAKER, and LRSCLIP, leads to consistent gains in mean Recall and overall retrieval robustness (Table 4). In parallel, the review tracks the extension of retrieval capability into unified multi-task remote sensing models, where retrieval, grounding, segmentation, and reasoning begin to share a common multimodal representation space.

Conclusions Several conclusions are drawn from the comparative analysis. First, VLMs establish a new dominant paradigm for RS-TIR because they significantly narrow the cross-modal semantic gap while improving transferability across datasets and scenes. Second, there is no universally optimal architecture: dual-encoder models remain attractive for large-scale retrieval because of their efficiency, whereas interaction-based or instruction-enhanced models offer finer semantic alignment at higher computational cost. Third, domain adaptation is indispensable.
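Adapter-based tuning, one of the adaptation routes grouped above, can be made concrete with a short sketch. The code shows a generic residual bottleneck adapter (in the spirit of Houlsby-style adapters) applied to frozen encoder features; the dimensions and the zero-initialized up-projection are illustrative assumptions, not a specific published configuration.

```python
import numpy as np

class BottleneckAdapter:
    """Residual bottleneck adapter: a small down-project -> ReLU -> up-project
    branch added on top of a frozen backbone. Only these weights would be
    trained during remote sensing adaptation; the VLM backbone stays frozen."""

    def __init__(self, dim, bottleneck, rng):
        self.W_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        # Zero-init the up-projection so the adapter starts as an identity
        # map and cannot disrupt the pre-trained representation at step 0.
        self.W_up = np.zeros((bottleneck, dim))

    def __call__(self, h):
        return h + np.maximum(h @ self.W_down, 0.0) @ self.W_up

    def num_params(self):
        return self.W_down.size + self.W_up.size

rng = np.random.default_rng(0)
adapter = BottleneckAdapter(dim=512, bottleneck=32, rng=rng)
features = rng.normal(size=(4, 512))   # frozen-encoder outputs for 4 images
adapted = adapter(features)            # identical to the input before training
```

With dim=512 and bottleneck=32 the adapter holds 32,768 trainable parameters, a tiny fraction of a full encoder, which is why this route suits the limited annotation budgets noted earlier.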
Continued pre-training on remote sensing image–text corpora, parameter-efficient tuning, and prompt-based adaptation consistently outperform direct reuse of Internet-trained VLMs, indicating that remote sensing imagery differs too strongly from natural-image distributions to rely on generic pre-training alone. Fourth, the most effective recent methods do not improve performance through scale alone; they also exploit remote sensing-specific information, including multi-scale structures, foreground entities, explicit keyword reasoning, and spatial priors. Finally, the review shows that the field is shifting from isolated retrieval models toward more general geospatial multimodal systems. Retrieval is no longer treated only as a matching task, but also as a key capability supporting question answering, instruction following, knowledge augmentation, and coordinated reasoning in remote sensing applications.

Prospects Future research is expected to move in four closely related directions. One direction is the unified representation of multi-source heterogeneous data, especially the integration of optical imagery with synthetic aperture radar, hyperspectral data, thermal infrared observations, and multi-temporal acquisitions. Another direction is knowledge-enhanced retrieval, where geospatial priors, land-use rules, remote sensing terminology, and external knowledge bases are incorporated into multimodal alignment and retrieval-augmented reasoning. A third direction is lifelong and open-world learning: real deployment requires models to remain reliable under seasonal changes, sensor updates, regional domain shifts, cloud contamination, and newly emerging categories, without catastrophic forgetting. The fourth direction concerns efficiency and deployability.
Because practical remote sensing systems often operate under tight computational budgets, lightweight tuning, sparse computation, token reduction, model compression, and on-orbit or edge inference will become increasingly important. Interactive and explainable retrieval is also likely to grow in importance, allowing analysts to refine queries through dialogue and inspect the image regions or semantic cues that support retrieval decisions. Overall, continued progress in data construction, domain adaptation, semantic alignment, and efficient multimodal modeling is expected to make RS-TIR a more robust infrastructure capability for Earth observation applications.
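Token reduction, one of the efficiency levers listed above, can be illustrated with a minimal sketch: the least salient visual tokens are dropped before the expensive cross-modal interaction stages. The norm-based saliency criterion below is a deliberately simple stand-in for the attention- or score-based criteria used by real token-pruning methods.

```python
import numpy as np

def reduce_tokens(tokens, keep_ratio=0.25):
    """Keep only the most salient visual tokens (here: largest L2 norm),
    preserving their original order, so that later attention layers process
    a shorter sequence at quadratic-cost savings."""
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    saliency = np.linalg.norm(tokens, axis=1)
    keep = np.sort(np.argsort(-saliency)[:n_keep])  # survivors, in order
    return tokens[keep]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))   # e.g. a 14x14 ViT patch-token grid
pruned = reduce_tokens(tokens)         # 49 tokens remain at keep_ratio=0.25
```

Since self-attention cost grows quadratically with sequence length, keeping a quarter of the tokens cuts the attention cost of subsequent layers by roughly a factor of sixteen, at the price of whatever information the dropped tokens carried.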