# Generalizable Entity Grounding via Assistance of Large Language Model

Lu Qi<sup>1\*</sup>, Yi-Wen Chen<sup>1\*</sup>, Lehan Yang<sup>2\*</sup>, Tiancheng Shen<sup>3</sup>, Xiangtai Li<sup>4</sup>,  
Weidong Guo<sup>5†</sup>, Yu Xu<sup>5</sup>, Ming-Hsuan Yang<sup>1,6</sup>

<sup>1</sup>The University of California, Merced <sup>2</sup>The University of Sydney

<sup>3</sup>The Chinese University of Hong Kong <sup>4</sup>Nanyang Technological University

<sup>5</sup>QQ Browser Lab, Tencent, <sup>6</sup>Google Research

The diagram illustrates the GELLA model's architecture and training data.   
**Top: Training data of GELLA.** It shows a sample image of giraffes in a zoo enclosure. To its right is the COCO Panoptic Segmentation, where various objects are color-coded and numbered (1-13). Further right is the COCO Panoptic Narrative Grounding, which provides a textual description of the scene with corresponding bounding boxes and labels for giraffes, rocks, persons, a railing, a tree, and a building.   
**Bottom: Inference pipeline of GELLA.** This section shows how the model works during inference. An input image is processed by GPT-4 to generate a 'Long Caption' describing the giraffes. Simultaneously, the image is processed by EntitySeg to produce 'High-Quality Entity Segmentation', which is a more detailed and accurate segmentation map. Finally, the GELLA model takes both the long caption and the high-quality segmentation as input to produce 'Panoptic Narrative Grounding Results', which lists entities like 'giraffes', 'giant giraffe', 'three smaller giraffes', 'family', 'grass', 'rocks', and 'wooden fence' with their corresponding segmentation masks.

Figure 1. **Top: Training data of GELLA.** The training data includes the image, panoptic segmentation, and panoptic narrative grounding from the COCO dataset. **Bottom: Inference pipeline of GELLA.** During inference, we can use any LMM (*e.g.*, GPT-4) to generate the caption and any class-agnostic segmentation model (*e.g.*, EntitySeg) to generate the entity segmentation. The GELLA model can associate the two outputs and produce panoptic narrative grounding results.

## Abstract

In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.

\*indicates equal contribution. † is the corresponding author.

## 1. Introduction

In recent years, large multimodal models (LMMs) [4, 20, 21, 50] have successfully bridged language and vision, ushering in a new era of artificial intelligence. These models have demonstrated unprecedented capabilities in comprehending and generating textual and visual content across various domains, including human-computer interaction and multimedia analysis.

Built upon large language models (LLMs), various multimodal models [1, 29, 45, 49] have demonstrated impressive capabilities in panoptic narrative grounding, which aims to correlate image captions with specific entity locations. Nevertheless, these methods predominantly ground only one entity in an image at a time, failing to align long captions with multiple entities. While this objective is attainable through repeated interaction, approaches such as LISA [14] substantially increase computational costs due to the recurrent use of LLMs.

The intuitive solution is to revise existing single-entity referring frameworks by instructing the LLM to output a $\langle\text{SEG}\rangle$ token after each semantic noun. Subsequently, the embeddings of the $\langle\text{SEG}\rangle$ tokens can be used for panoptic narrative grounding. However, this pipeline is neither flexible nor lightweight enough. *First*, it overly relies on users to provide each semantic noun rather than utilizing the LLM’s reasoning ability to parse long captions automatically. *Second*, it is tightly coupled to a specific segmentation method, which limits the pipeline’s adaptability to masks generated from other sources. *Last*, using two image encoders that process the same image at different resolutions is inefficient: the LLM needs a CLIP [34] vision encoder that receives low-resolution images as input, while the segmentation target requires another vision encoder to extract fine-grained features from high-resolution images.

To address the issues of flexibility and computation, we introduce the GELLA framework for **G**eneralizable **E**ntity grounding with **L**arge **L**anguage **A**ssistance. Unlike other methods that rely on a high-resolution image encoder, GELLA employs a colormap encoder. In particular, we assign a unique random color to each entity on a colormap, allowing encoded features to provide a robust mask prior with a fixed lightweight structure. The colormap encoder significantly reduces the computational workload required for pixel-level prediction and enhances the flexibility of our framework for various input masks. As shown in Figure 1, GELLA can receive better caption and segmentation results as input from another model and make more elaborate associations. This means our framework can be further improved with the development of each pre-trained model.

Furthermore, we propose a ResoBlend module to merge the image features extracted by the low-resolution CLIP vision encoder with the mask features extracted by the colormap encoder. The fused features are then used to accurately reconstruct the original segmentation masks in conjunction with the entity embeddings. These entity embeddings harmonize more effectively with the language embeddings, as both are derived from the CLIP vision encoder, ensuring a more consistent association between the visual and linguistic elements.

We conduct extensive experiments on three tasks, including referring expression segmentation, panoptic narrative grounding, and panoptic segmentation. With our proposed colormap and CLIP vision encoder, the GELLA framework outperforms or is comparable to state-of-the-art methods while retaining the ability to interact with users.

The main contributions of this work are as follows:

- We propose GELLA, a framework for generalizable entity grounding with a long caption. We leverage an LLM to extract semantic nouns and a class-agnostic segmentation model to generate entity masks. Then we correlate these results with the proposed ResoBlend and association modules.
- We adopt a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from high-resolution masks. This strategy allows us to extract visual features from low-resolution images using a CLIP vision encoder, significantly reducing computational costs.
- Extensive experiments on panoptic narrative grounding, referring expression segmentation, and panoptic segmentation demonstrate the effectiveness of the proposed method. Given its flexibility to accept multi-modal inputs, GELLA can consistently leverage state-of-the-art single-modal methods.

## 2. Related Work

**Image Segmentation.** Significant advancements [2, 3, 9, 11, 16, 22, 24, 37, 39, 41, 48] have been made in image segmentation [8, 12, 15, 30, 36, 46] in recent years, with numerous methods and techniques emerging to address the various challenges of class-aware prediction. Another line of image segmentation focuses on the generalization ability to unseen categories or image domains in a class-agnostic manner, as seen in methods like SAM [13] or entity segmentation [31–33]. While these methods exhibit good generalization ability, they cannot associate images with natural language prompts. To enhance versatility, this paper aims to facilitate generalized entity segmentation from complex text prompts with the assistance of large language models.

**Large Multimodal Model.** Large language models (LLMs) have demonstrated remarkable versatility and capabilities across various tasks. Built upon the strengths of LLMs, large multimodal models (LMMs) aim to integrate multimodal skills to achieve versatility across diverse domains, including language, vision, and other modalities. Notable contributions from models such as LLaVA [20, 21], InstructBLIP [4], and MiniGPT-4 [50] enable the generation of textual responses from both image and text inputs by instruction tuning pre-trained LLMs. However, these methods are not capable of region-specific pixel-level visual grounding. Recently, there have been studies examining grounded text response generation with LMMs. Approaches such as KOSMOS-2 [29], Shikra [1], and Ferret [45] encode bounding box coordinates into a sequence of location tokens to enable bounding box generation. These approaches rely on language models to generate grounding outputs and cannot perform fine-grained segmentation. BuboGPT [49] utilizes an off-the-shelf visual grounding module to explore the fine-grained relation between different visual objects and modalities. LISA [14] incorporates a segmentation token into the vocabulary, which is then decoded into a segmentation mask for fine-grained reasoning. Unlike existing approaches that only generate bounding box-level grounding or results for a single object, our method can produce fine-grained segmentation for multiple generalized entities.

## 3. Method

Given an image $\mathbf{I} \in \mathbb{R}^{h \times w \times 3}$ and an associated long caption $T$, we first use a class-agnostic segmentation model to obtain entity-level binary segmentation masks $\mathbf{M} \in \{0, 1\}^{n \times h \times w}$ with $n$ entities. Our GELLA framework then parses the caption $T$ into $m$ semantic nouns $\mathbf{X} = \{x_1, \dots, x_m\}$ and assigns them to the masks $\mathbf{M}$. We note that the assignment target is a one-to-many association because some semantic nouns might correspond to several masks, such as “three persons”.

In the following sections, we first provide an overview of the entire GELLA framework. Then, we introduce two essential modules in our framework: a ResoBlend module and a simple association module. The ResoBlend module can efficiently fuse the image information and the mask prior, whereas the association module grounds the semantic nouns to the corresponding masks. Lastly, we propose multi-task training to enhance GELLA’s flexibility for image captioning, entity recognition, and referring expression segmentation.

### 3.1. Framework Overview

As illustrated in Figure 2, the GELLA framework has two single-modal encoders and two dual-modal decoders. Firstly, the single-modal encoders process the image and the segmentation results to generate corresponding features. Then, these features are fused to create dual-modal features, integrating image/colormap and image/language information, respectively. Finally, the dual-modal decoders interpret these composite features to reconstruct segmentation masks and identify semantic nouns. An association module is designed to link the embeddings of the two output results, facilitating coherent output generation.

**Colormap Encoder.** We encode the entity-level binary segmentation masks $\mathbf{M}$ into a colormap $\mathbf{M}_c \in [0, 255]^{h \times w \times 3}$ using a colormap encoder: $\mathbf{M}_c = f_\Psi(\mathbf{M})$, where $f_\Psi$ indicates the random color assignment for each entity. Alternative assignment methods, such as the location-aware assignment used in Painter [42], do not result in performance gains. We discuss this comparison in the ablation study.
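The random color assignment $f_\Psi$ can be illustrated with a short NumPy sketch. This is not the released implementation: the function name `masks_to_colormap`, the black background, the fixed seed, and the check that no two entities share a color are all assumptions made for the example, and the masks are assumed not to overlap.

```python
import numpy as np

def masks_to_colormap(masks, seed=0):
    """Paint n binary masks of shape (n, h, w) onto one (h, w, 3) uint8 colormap."""
    rng = np.random.default_rng(seed)
    n, h, w = masks.shape
    colormap = np.zeros((h, w, 3), dtype=np.uint8)  # background stays black (assumption)
    used = {(0, 0, 0)}                              # reserve black for the background
    for i in range(n):
        # Redraw until the color is unused, so each entity gets a unique color.
        while True:
            color = tuple(int(c) for c in rng.integers(0, 256, size=3))
            if color not in used:
                used.add(color)
                break
        colormap[masks[i].astype(bool)] = color
    return colormap

# Toy input: two non-overlapping entities on a 4x4 image.
masks = np.zeros((2, 4, 4), dtype=np.uint8)
masks[0, :2, :] = 1   # entity 0 covers the top half
masks[1, 2:, :2] = 1  # entity 1 covers the bottom-left quarter
colormap = masks_to_colormap(masks)
```

Because every entity becomes a flat color block, entity boundaries turn into sharp color edges, which is the property the text relies on when it feeds the colormap to a lightweight encoder.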

Similar to the conventional image encoding process, we encode a colormap to the pyramid mask features as

$$\mathbf{Z}_c = f_\Upsilon(\mathbf{M}_c), \quad (1)$$

where $\mathbf{Z}_c = \{z^s\}$, with $s \in \{2, 3, 4, 5\}$ indicating the index of pyramid features and $z^s \in \mathbb{R}^{\frac{h}{2^s} \times \frac{w}{2^s} \times c^s}$. Here, $c^s$ is the channel dimension of $z^s$.

For $f_\Upsilon$, we find that existing lightweight structures, such as MobileNet [35] and Swin-Tiny [23], designed for image classification, can effectively encode the colormap. The reason behind this lies in the explicit boundaries among entities in color blocks, providing our framework with a robust shape prior. This design eliminates the need for high-resolution images for pixel-level prediction, thereby allowing the usage of a CLIP [34] visual backbone with low-resolution images.

**Image Encoder.** We use a CLIP vision encoder with a ViT architecture. Following the image size used in training CLIP, our image input keeps the encoder's pre-trained resolution (e.g., 224 or 336),

$$\mathbf{Z}_v = f_\Omega(\mathbf{I}), \quad (2)$$

where $\mathbf{Z}_v \in \mathbb{R}^{\frac{h}{64} \times \frac{w}{64} \times C_v}$ denotes the grid features after the last transformer layer, and $C_v$ is the dimension of visual features. $\mathbf{Z}_v$ can be combined with other modal features $\mathbf{Z}_c$ or language token embedding to create multimodal features $\mathbf{Z}_{vc}$ (visual and mask features) or $\mathbf{Z}_{vl}$ (visual and language features).

**Language Decoder.** Inspired by the LLaVA [21] design, we employ a single linear layer to embed image patch features into the word embedding space. Specifically, a trainable projection matrix  $W$  is used to convert  $\mathbf{Z}_v$  into the language embedding space as follows:

$$\mathbf{H}_v = W \cdot \mathbf{Z}_v. \quad (3)$$

We prepend the instruction “Please help me extract semantic nouns of this sentence:” to the caption  $T$  and proceed with the tokenizer. Then, we concatenate the visual tokens  $\mathbf{H}_v$  with the textual tokens  $\mathbf{H}_t$  to form multi-modal tokens  $\mathbf{H}_q$ .
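The projection in Eq. (3) and the token concatenation can be sketched as follows. All sizes are toy values chosen for readability, the random arrays stand in for learned weights and real embeddings, and placing the visual tokens before the text tokens is an assumption. The code uses row-vector tokens, so the product is written as `Z_v @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, all hypothetical; real CLIP and LLaMA dimensions are much larger.
num_patches, c_v, d_model, num_text = 16, 8, 12, 5

Z_v = rng.standard_normal((num_patches, c_v))   # grid features from the image encoder
W = rng.standard_normal((c_v, d_model)) * 0.1   # stand-in for the trainable projection
H_t = rng.standard_normal((num_text, d_model))  # stand-in for embedded instruction + caption

# Eq. (3) for row-vector tokens: one projected token per image patch.
H_v = Z_v @ W

# Multi-modal sequence: visual tokens first, then text tokens (ordering assumed).
H_q = np.concatenate([H_v, H_t], axis=0)
```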

We choose LLaMA [38] as the language decoder  $f_\phi$ , which has been demonstrated to be effective in several multi-modal models. Here,  $\phi$  represents the parameters of LLaMA. The language features are extracted as:

$$\mathbf{Z}_{vl} = f_\phi(\mathbf{H}_q). \quad (4)$$

Figure 2. **Left: Overview of the GELLA framework.** Given an image $I$ and its corresponding caption $T$, we aim to generate a panoptic segmentation that densely grounds the semantic nouns in the caption. We first obtain an entity-level mask colormap $M_c$ using a class-agnostic segmentation model. The mask and image are encoded by a mask encoder $f_\Upsilon$ and a CLIP vision encoder $f_\Omega$, respectively. The two extracted features are then fused by the ResoBlend module and fed into a mask decoder to reconstruct the mask. For the language part, we prepend an instruction to the caption $T$ for extracting semantic nouns and proceed with the tokenizer. The visual tokens $H_v$ and textual tokens $H_t$ are concatenated and fed into the language decoder $f_\phi$ to generate $\langle \text{SEG} \rangle$ tokens as features of each semantic noun. The association module then computes the similarity between the embeddings of semantic nouns and visual entities. **Right: Illustration of the ResoBlend Module.**

$\mathbf{Z}_{vl}$ is processed with a fully-connected layer to predict the output sentence’s tokens $U^t$, which should include $\langle \text{SEG} \rangle$ tokens for the semantic nouns. We extract the last-layer embedding corresponding to the $\langle \text{SEG} \rangle$ tokens as $E_{vl}$. The language loss is defined as:

$$\mathcal{L}_{ans} = \mathcal{L}^{ce}(U^t, G^t), \quad (5)$$

where  $\mathcal{L}_{ans}$  is the auto-regressive cross-entropy loss for text generation.  $G^t$  is the ground truth of the text.

**Mask Decoder.** We adopt the decoder design of Mask2Former [3] without any modification, which includes the pixel and transformer decoders. The outputs of the mask decoder are presented in a class-agnostic entity segmentation manner. Specifically, we obtain a group of embeddings $E_{vc} \in \mathbb{R}^{q \times 256}$ to generate entity predictions $U^e \in \mathbb{R}^{q \times 2}$ and pixel-level mask outputs $U^m \in \mathbb{R}^{q \times \frac{h}{4} \times \frac{w}{4}}$, where $q$ is the number of queries in the mask decoder. The loss for this part is

$$\mathcal{L}_{seg} = \mathcal{L}^{bce}(U^e, G^e) + \mathcal{L}^{bce}(U^m, G^m) + \mathcal{L}^{dice}(U^m, G^m), \quad (6)$$

where $\mathcal{L}^{bce}$ and $\mathcal{L}^{dice}$ are the binary cross-entropy and dice losses, respectively. The first term is for entity prediction, while the second and third terms are for mask prediction. $G^e$ and $G^m$ are the corresponding ground truths.
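A minimal NumPy sketch of the three terms in Eq. (6) is given below. It makes several simplifying assumptions that the text does not state: predictions are already matched to ground-truth entities, the entity prediction is reduced to one probability per query, all inputs are probabilities rather than logits, and the three terms are summed with unit weights instead of the Mask2Former defaults. The names `bce`, `dice_loss`, and `seg_loss` are illustrative.

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(np.mean(-(target * np.log(prob) + (1 - target) * np.log(1 - prob))))

def dice_loss(prob, target, smooth=1.0):
    """Mean over queries of 1 - dice, computed on flattened (q, h, w) masks."""
    p = prob.reshape(prob.shape[0], -1)
    t = target.reshape(target.shape[0], -1)
    inter = (p * t).sum(axis=1)
    dice = (2 * inter + smooth) / (p.sum(axis=1) + t.sum(axis=1) + smooth)
    return float(np.mean(1 - dice))

def seg_loss(ent_prob, ent_gt, mask_prob, mask_gt):
    """Entity BCE + mask BCE + mask dice, with unit weights (assumption)."""
    return bce(ent_prob, ent_gt) + bce(mask_prob, mask_gt) + dice_loss(mask_prob, mask_gt)

# Toy check: two queries on a 4x4 grid.
mask_gt = np.zeros((2, 4, 4))
mask_gt[0, :2] = 1
mask_gt[1, 2:] = 1
ent_gt = np.ones(2)

good = seg_loss(ent_gt, ent_gt, mask_gt, mask_gt)          # perfect prediction
bad = seg_loss(1 - ent_gt, ent_gt, 1 - mask_gt, mask_gt)   # fully inverted prediction
```

The dice term complements the per-pixel BCE because it is normalized by mask area, so small entities are not dominated by large ones.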

### 3.2. ResoBlend Module

This module is designed to augment pyramid mask features  $Z_c$  with the single-layer image feature  $Z_v$ . Despite being a single-layer image feature,  $Z_v$  contains both low- and high-level image information, as demonstrated by ViTDet [17].

The right part of Figure 2 illustrates the fusion process in our ResoBlend module using several standard attention modules. Specifically, we employ a self-attention block to process  $Z_v$  into  $Z'_v$ . Subsequently,  $Z'_v$  and the mask feature  $z^s$  are fused through cross-attention  $f_{XAtt}$ , with  $Z'_v$  as the query and  $z^s$  as the key and value:

$$z_d^s = (\text{FFN}(f_{XAtt}(f_q(Z'_v), f_k(z^s), f_v(z^s)))) + Z_v. \quad (7)$$

Finally, we use a simple convolution with a kernel size of 3 and an interpolation operation to transform  $z_d^s$  into  $z_w^s$ :

$$z_w^s = \text{BIINTERP}(\text{CONV}(z_d^s)). \quad (8)$$

Thus, we can input $Z_{vc} = \{z_w^s\}$ to the mask decoder for further decoding.
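To make the data flow of Eqs. (7) and (8) concrete, the sketch below fuses one pyramid level with NumPy. It is a simplification under stated assumptions: attention is single-head, layer normalization is omitted, every weight is random rather than learned, the bilinear resize uses corner-aligned sampling, and all sizes are toy values. The helper names (`attention`, `linear`, `conv3x3`, `bilinear_resize`, `resoblend_level`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention: q is (Nq, d), k and v are (Nk, d)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def linear(d_in, d_out):
    """Random matrix standing in for a learned linear layer."""
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

def conv3x3(x, w):
    """Zero-padded 3x3 convolution; x is (H, W, C_in), w is (3, 3, C_in, C_out)."""
    H, W = x.shape[:2]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W] @ w[dy, dx]
    return out

def bilinear_resize(x, out_h, out_w):
    """Corner-aligned bilinear interpolation of an (H, W, C) map."""
    H, W = x.shape[:2]
    ys, xs = np.linspace(0, H - 1, out_h), np.linspace(0, W - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (ys - y0)[:, None, None], (xs - x0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def resoblend_level(Z_v, z_s, grid, out_hw):
    """Fuse image tokens Z_v (grid*grid, d) with one mask-feature level z_s (N_s, c_s)."""
    d, c_s = Z_v.shape[-1], z_s.shape[-1]
    # Self-attention over the image tokens gives Z'_v.
    Zp = attention(Z_v @ linear(d, d), Z_v @ linear(d, d), Z_v @ linear(d, d))
    # Eq. (7): Z'_v is the query, the mask feature supplies keys and values.
    x = attention(Zp @ linear(d, d), z_s @ linear(c_s, d), z_s @ linear(c_s, d))
    ffn = np.maximum(x @ linear(d, 4 * d), 0) @ linear(4 * d, d)
    z_d = (ffn + Z_v).reshape(grid, grid, d)   # residual connection with Z_v
    # Eq. (8): 3x3 convolution, then bilinear interpolation to this level's size.
    w = rng.standard_normal((3, 3, d, d)) / np.sqrt(9 * d)
    return bilinear_resize(conv3x3(z_d, w), *out_hw)

grid, d = 6, 32
Z_v = rng.standard_normal((grid * grid, d))   # toy CLIP grid features
z_3 = rng.standard_normal((12 * 12, 16))      # toy flattened mask feature for one level
z_w = resoblend_level(Z_v, z_3, grid, out_hw=(12, 12))
```

Since the image tokens act as queries, the fused map starts at the coarse CLIP grid resolution; the interpolation in Eq. (8) is what brings it back to the spatial size expected at each pyramid level.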

### 3.3. Association Module

We introduce a simple association module to learn cross-modal associations between $\langle \text{SEG} \rangle$ token embeddings $E_{vl}$ from the language decoder and entity embeddings $E_{vc}$ from the mask decoder. The association loss $\mathcal{L}_{asso}$ is defined as:

$$\mathcal{L}_{asso} = \mathcal{L}^{bce}(\text{FC}(E_{vl}) \cdot \text{FC}(E_{vc})^T, G^A), \quad (9)$$

where FC indicates a fully-connected layer and $\cdot$ denotes the dot product. $G^A \in \{0, 1\}^{m \times q}$ is the binary association ground truth between semantic nouns and entities.
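Eq. (9) can be sketched in NumPy as follows. The embedding sizes, the bias-free random projections standing in for the two FC layers, and the 0.5 threshold used to read off one-to-many assignments at inference are assumptions for illustration; only the dot-product similarity and the BCE objective come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): m semantic nouns, q entity queries.
m, q, d_lang, d_mask, d_joint = 3, 5, 64, 256, 32

E_vl = rng.standard_normal((m, d_lang))   # <SEG> token embeddings
E_vc = rng.standard_normal((q, d_mask))   # entity embeddings from the mask decoder
W_l = rng.standard_normal((d_lang, d_joint)) / np.sqrt(d_lang)  # stand-in for FC on E_vl
W_c = rng.standard_normal((d_mask, d_joint)) / np.sqrt(d_mask)  # stand-in for FC on E_vc

# Similarity between every noun and every entity: shape (m, q).
logits = (E_vl @ W_l) @ (E_vc @ W_c).T
prob = 1.0 / (1.0 + np.exp(-logits))

# Binary ground truth G^A; noun 0 maps to two entities (one-to-many).
G_A = np.zeros((m, q))
G_A[0, 1] = G_A[0, 3] = 1
G_A[2, 0] = 1

# Binary cross-entropy over all noun-entity pairs.
p = np.clip(prob, 1e-7, 1 - 1e-7)
loss_asso = float(np.mean(-(G_A * np.log(p) + (1 - G_A) * np.log(1 - p))))

# Inference: each noun keeps every entity above the threshold (threshold assumed).
assigned = prob > 0.5
```

Scoring each pair independently with a sigmoid, rather than a softmax over entities, is what lets a noun such as “three persons” attach to several masks at once.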

The overall training loss for GELLA is

$$\mathcal{L} = \mathcal{L}_{seg} + \mathcal{L}_{asso} + \mathcal{L}_{ans}, \quad (10)$$

where we keep the default loss weights designed in LLaVA [20, 21] and Mask2Former [3].

#### User Type 1: Image Description Generation

User: <IMAGE> Please help me describe the image.

Assistant: This image depicts an outdoor scene at what appears to be a petting zoo or farm. On the left, there's a light brown cow with a shaggy coat and long horns peering over a yellow barrier, possibly looking for food or attention from the visitors. To the right, a woman wearing a cap and sunglasses is holding a baby. The baby is wearing a pink hat with a striped pattern and looks curiously at the cow. In the background, more animal enclosures and several visitors are enjoying the sunny day. The setting suggests a family-friendly environment where people can interact with farm animals.

#### User Type 2: Semantic Noun Extraction

User: <IMAGE> Please help me extract each semantic noun of the sentence. The sentence is <EXPRESSION>.

Assistant: There are twenty one semantic nouns, including petting zoo <SEG>, light brown cow <SEG>, shaggy coat <SEG>, horns <SEG>, barrier <SEG>, food <SEG>, visitors <SEG>, woman <SEG>, sunglasses <SEG>, baby <SEG>... and animals <SEG>.

Table 1. **One example to illustrate our instruction-based training data.** We prompt GPT-4 for image description generation and semantic noun extraction.

### 3.4. Data Preparation

Table 1 presents an example of the visual instructions for training. Specifically, two types of user instructions are considered: image description and semantic noun extraction. Because GELLA is also trained on image description, it can generate the caption itself rather than relying on one supplied by the user. When sampling the image description task, we only train the language part without the segmentation output.

We prompt GPT-4 [28] and curate a question list for image description and semantic noun extraction, as shown in Table 1. For the association between semantic nouns and entities, we directly use the COCO Panoptic Narrative Grounding dataset [6].

## 4. Experiments

We quantitatively evaluate GELLA on three tasks: panoptic segmentation, referring expression segmentation, and panoptic narrative grounding. We first present experimental results for referring expression segmentation and panoptic narrative grounding to show the effectiveness of our method. Then, we mainly ablate our colormap design without a language branch on panoptic segmentation. Finally, we provide qualitative visualization results of our GELLA framework.

We initialize the language branch of our framework with the fully fine-tuned LLaVA-v1.5 [20] pre-trained model, coupled with a CLIP [34] vision encoder. The CLIP component specifically utilizes the ViT-Large-336 architecture as its backbone. Additionally, our image encoder and mask decoder design is based on the Mask2Former [3] model, which incorporates a Swin-Tiny [23] architecture for the backbone of the image encoder.

For the data part, we choose the Panoptic Narrative Grounding (PNG) dataset [6] as the foundational data for our experiments. To ensure fair comparisons across the three tasks, we train our model using images from the respective training sets of each task, totaling 10,813 images. In addition to the PNG dataset for training the segmentation branch, we use GPT-4 [28] to generate two descriptive captions for each image in the EntitySeg training dataset [33], which contains approximately 31,000 images. Unlike the COCO dataset [18], EntitySeg encompasses various image domains, including indoor and outdoor environments, street scenes, cartoons, and aerial imagery.

A crucial data augmentation protocol applied across all experiments involves resizing input images to the target output dimensions. The same shortest and longest sizes constrain the output dimensions, and we pad the resized image to a square with zero values. This method aligns with the image dimensions used during the pretraining phase of the CLIP vision encoder, typically 224 or 336 pixels. To preserve entity integrity, we have opted against employing cropping techniques that risk truncating entities.
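One reading of this protocol is sketched below: scale the image so that its longer side equals the target size, then zero-pad to a square. The nearest-neighbor resize is only a dependency-free stand-in for proper interpolation, and padding at the bottom and right is an assumption, since the text does not say where the padding goes. The name `resize_and_pad` is hypothetical.

```python
import numpy as np

def resize_and_pad(image, target=336):
    """Resize so the longer side equals `target`, then zero-pad to (target, target)."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # Nearest-neighbor sampling, used here only to avoid extra dependencies.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]
    # Zero padding on the bottom and right (placement assumed).
    canvas = np.zeros((target, target) + image.shape[2:], dtype=image.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas, (new_h, new_w)

image = np.full((480, 640, 3), 255, dtype=np.uint8)   # toy all-white image
padded, (new_h, new_w) = resize_and_pad(image)
```

No cropping is involved, so every entity in the original image remains fully visible in the padded square, which matches the stated goal of preserving entity integrity.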

We train our method on 8 A100 GPUs for 50 epochs with an initial learning rate of  $10^{-5}$  and AdamW optimizer with standard parameters. The learning rate is decayed by a factor of 0.1 after 46 and 48 epochs, respectively. During each training iteration, the sample ratios for image description, semantic noun extraction, and narrative grounding are 0.2, 0.2, and 0.6. The batch size is 16. The constraints for the shortest and longest sizes for the CLIP vision encoder are 336. The colormaps used in the inference stage are from the class-agnostic entity segmentation results of Mask2Former with a Swin-Large backbone, which is the state-of-the-art entity segmentation method.
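The schedule and task sampling described above can be written out in plain Python. The sketch assumes 0-indexed epochs, so the reduced rate applies from epoch index 46 and again from 48, and it draws one task per iteration; the names `learning_rate` and `sample_task` are illustrative.

```python
import random

BASE_LR = 1e-5
MILESTONES = (46, 48)   # epochs after which the rate is decayed
GAMMA = 0.1

def learning_rate(epoch):
    """Learning rate used during `epoch` (0-indexed); each passed milestone multiplies by GAMMA."""
    passed = sum(epoch >= m for m in MILESTONES)
    return BASE_LR * GAMMA ** passed

TASKS = ("image_description", "semantic_noun_extraction", "narrative_grounding")
RATIOS = (0.2, 0.2, 0.6)

def sample_task(rng):
    """Draw the task for one training iteration according to the sample ratios."""
    return rng.choices(TASKS, weights=RATIOS, k=1)[0]

rng = random.Random(0)
counts = {task: 0 for task in TASKS}
for _ in range(1000):
    counts[sample_task(rng)] += 1
```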

### 4.1. Referring Expression Segmentation

We evaluate our model on the RefCOCO [47], RefCOCO+ [47], and RefCOCOg [25, 27] benchmarks for referring expression segmentation. The prompt is structured as follows: “Please help me extract each semantic noun from the sentence. The sentence is <EXPRESSION>”, with <EXPRESSION> filled by the corresponding text from the validation set. As shown in Table 2, our model demonstrates considerable performance across all datasets despite not being explicitly trained on the captions of the datasets. This suggests that the underlying large language model possesses a robust ability to comprehend text. In the case of referring

<table border="1">
<thead>
<tr>
<th rowspan="2">Method</th>
<th rowspan="2">Venue</th>
<th colspan="3">RefCOCO</th>
<th colspan="3">RefCOCO+</th>
<th colspan="2">RefCOCOg</th>
</tr>
<tr>
<th>val</th>
<th>testA</th>
<th>testB</th>
<th>val</th>
<th>testA</th>
<th>testB</th>
<th>val(U)</th>
<th>test(U)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CRIS [43]</td>
<td>CVPR2022</td>
<td>70.5</td>
<td>73.2</td>
<td>66.1</td>
<td>65.3</td>
<td>68.1</td>
<td>53.7</td>
<td>59.9</td>
<td>60.4</td>
</tr>
<tr>
<td>LAVT [44]</td>
<td>CVPR2022</td>
<td>72.7</td>
<td>75.8</td>
<td>68.8</td>
<td>62.1</td>
<td>68.4</td>
<td>55.1</td>
<td>61.2</td>
<td>62.1</td>
</tr>
<tr>
<td>GRES [19]</td>
<td>CVPR2023</td>
<td>73.8</td>
<td>76.5</td>
<td>70.2</td>
<td>66.0</td>
<td>71.0</td>
<td>57.7</td>
<td>65.0</td>
<td>66.0</td>
</tr>
<tr>
<td>X-Decoder [51]</td>
<td>CVPR2023</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>64.6</td>
<td>-</td>
</tr>
<tr>
<td>SEEM [13]</td>
<td>NeurIPS2023</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>65.7</td>
<td>-</td>
</tr>
<tr>
<td>LISA-7B [14]</td>
<td>arXiv2023</td>
<td>74.9</td>
<td>79.1</td>
<td>72.3</td>
<td>65.1</td>
<td>70.8</td>
<td>58.1</td>
<td>67.9</td>
<td>70.6</td>
</tr>
<tr>
<td>GELLA-7B</td>
<td>-</td>
<td>76.1</td>
<td>79.9</td>
<td>73.1</td>
<td>66.4</td>
<td>72.7</td>
<td>60.2</td>
<td>69.6</td>
<td>71.3</td>
</tr>
<tr>
<td>GELLA-13B</td>
<td>-</td>
<td><b>76.7</b></td>
<td><b>80.5</b></td>
<td><b>73.6</b></td>
<td><b>67.0</b></td>
<td><b>73.2</b></td>
<td><b>60.6</b></td>
<td><b>70.4</b></td>
<td><b>71.5</b></td>
</tr>
</tbody>
</table>

Table 2. **Referring Expression Segmentation Results.** Our performance across RefCOCO, RefCOCO+, and RefCOCOg in generating accurate segmentation masks based on referring expressions surpasses that of closely related work.

<table border="1">
<thead>
<tr>
<th>Method</th>
<th>Venue</th>
<th>AR</th>
<th>AR<sub>Th</sub></th>
<th>AR<sub>St</sub></th>
<th>AR<sub>Sing</sub></th>
<th>AR<sub>Pl</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td>MCN [26]</td>
<td>CVPR2020</td>
<td>-</td>
<td>48.2</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>PNGb [6]</td>
<td>ICCV2021</td>
<td>55.4</td>
<td>56.2</td>
<td>54.3</td>
<td>56.2</td>
<td>48.4</td>
</tr>
<tr>
<td>PPMN [5]</td>
<td>MM2022</td>
<td>59.4</td>
<td>57.2</td>
<td>62.5</td>
<td>60.0</td>
<td>50.4</td>
</tr>
<tr>
<td>NICE [40]</td>
<td>MM2023</td>
<td>62.3</td>
<td>60.2</td>
<td>65.3</td>
<td>63.1</td>
<td>55.2</td>
</tr>
<tr>
<td>PiGLET [7]</td>
<td>TPAMI2023</td>
<td>65.9</td>
<td>64.0</td>
<td>68.6</td>
<td>67.2</td>
<td>54.5</td>
</tr>
<tr>
<td>PPO-TD [10]</td>
<td>IJCAI2023</td>
<td>66.1</td>
<td>64.0</td>
<td>70.7</td>
<td>68.1</td>
<td>58.3</td>
</tr>
<tr>
<td>GELLA-7B</td>
<td>-</td>
<td>69.8</td>
<td>66.2</td>
<td>72.3</td>
<td>69.9</td>
<td>58.4</td>
</tr>
<tr>
<td>GELLA-13B</td>
<td>-</td>
<td><b>71.3</b></td>
<td><b>67.8</b></td>
<td><b>73.1</b></td>
<td><b>71.0</b></td>
<td><b>59.5</b></td>
</tr>
</tbody>
</table>

Table 3. **Panoptic Narrative Grounding Results.** Performance on PNG dataset in generating segmentation masks based on a long caption.

expression segmentation tasks, where each sentence typically references a single main entity, our model can effectively ignore additional attributive nouns because our association module assigns lower scores to irrelevant segmentation masks. It is noteworthy that scaling up the language model from 7B to 13B parameters yields only marginal gains (e.g., 76.1 to 76.7 on RefCOCO val). This implies that the bottleneck may lie in the segmentation branch rather than in language understanding.

### 4.2. Panoptic Narrative Grounding

Table 3 presents the quantitative performance of the GELLA framework on panoptic narrative grounding (PNG). This task is more complex than referring expression segmentation: it requires segmenting every semantic noun in a long caption, which demands a comprehensive grasp of the entire narrative. The large language model (LLM) is pivotal in this scenario, offering substantial improvements when using the same prompts as in referring expression segmentation. With a 7B-parameter language branch, GELLA improves the Average Recall (AR) by 3.7 points over the leading method, PPO-TD (69.8 vs. 66.1). Scaling the language model to 13B parameters yields a further consistent gain (71.3 AR). This outcome underscores the importance of an LLM’s capacity for parsing lengthy paragraphs, a skill that referring expression segmentation with brief sentences fails to leverage fully.

<table border="1">
<thead>
<tr>
<th>Method</th>
<th>Venue</th>
<th>Epoch</th>
<th>PQ</th>
<th>SQ</th>
<th>RQ</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mask2Former [3]</td>
<td>CVPR2022</td>
<td>50</td>
<td>53.4</td>
<td>83.1</td>
<td>63.4</td>
</tr>
<tr>
<td></td>
<td></td>
<td>12</td>
<td>49.2</td>
<td>82.5</td>
<td>58.8</td>
</tr>
<tr>
<td>GELLA-7B</td>
<td>-</td>
<td>12</td>
<td>53.1</td>
<td>83.3</td>
<td>63.8</td>
</tr>
<tr>
<td></td>
<td></td>
<td>50</td>
<td>56.4</td>
<td><b>83.4</b></td>
<td>66.8</td>
</tr>
<tr>
<td>GELLA-13B</td>
<td>-</td>
<td>12</td>
<td>53.2</td>
<td>83.2</td>
<td>63.9</td>
</tr>
<tr>
<td></td>
<td></td>
<td>50</td>
<td><b>56.5</b></td>
<td>83.3</td>
<td><b>67.3</b></td>
</tr>
</tbody>
</table>

Table 4. **Panoptic Segmentation Results.** Performance on the COCO panoptic segmentation dataset, where segmentation masks are generated from a prompt listing all category names.

### 4.3. Panoptic Segmentation

Although the GELLA framework is initially designed for narrative grounding tasks, its architecture also generalizes to conventional panoptic segmentation. During inference, we employ a comprehensive prompt that enumerates all 133 category names, structured as: “There might exist a person, dog, cat, ... and window.” Similar to referring expression segmentation, entities within the image corresponding to the listed category nouns are identified. In contrast, categories not present in the image are assigned a low score, which effectively filters them out of the final segmentation output.
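As an illustration of this inference setup, the sketch below builds the category prompt and drops entities without a confident category. The helper names, the score dictionary format, and the 0.5 threshold are hypothetical; the text only specifies the prompt wording and that absent categories receive low scores.

```python
def build_panoptic_prompt(category_names):
    """Join category names into one sentence in the style of the prompt above."""
    names = list(category_names)
    if not names:
        raise ValueError("at least one category name is required")
    if len(names) == 1:
        return f"There might exist a {names[0]}."
    return f"There might exist a {', '.join(names[:-1])} and {names[-1]}."

def keep_confident(entity_scores, threshold=0.5):
    """Label each entity with its best category, dropping entities below the threshold."""
    kept = {}
    for entity_id, scores in entity_scores.items():
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            kept[entity_id] = best
    return kept

prompt = build_panoptic_prompt(["person", "dog", "cat", "window"])
labels = keep_confident({
    0: {"person": 0.92, "dog": 0.03},
    1: {"person": 0.10, "dog": 0.07},   # no listed category matches, so it is dropped
})
```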

In Table 4, we present the performance of our GELLA framework on the panoptic segmentation task within the COCO dataset. Our framework, which deviates from the conventional approach by not using a traditional image encoder with a MobileNet(V2) backbone, achieves improved results. It should be noted, however, that this comparison may not be entirely equitable since the colormaps utilized by our framework are derived from a Swin-Large backbone. Despite this, we posit that the results still validate the sufficiency of low-resolution images for the CLIP vision encoder regarding classification accuracy. Furthermore, scaling up the language branch does not meaningfully enhance segmentation performance (56.5 vs. 56.4 PQ at 50 epochs). This suggests that extracting semantic nouns is not a bottleneck in this context, and additional capacity in the language branch does not contribute to significant gains.

<table border="1">
<thead>
<tr>
<th>Resolution</th>
<th>PQ</th>
<th>SQ</th>
<th>RQ</th>
</tr>
</thead>
<tbody>
<tr>
<td>224</td>
<td>51.5</td>
<td>82.6</td>
<td>61.6</td>
</tr>
<tr>
<td>336</td>
<td>53.1</td>
<td>83.3</td>
<td>63.8</td>
</tr>
</tbody>
</table>

(a)

<table border="1">
<thead>
<tr>
<th>Color</th>
<th>PQ</th>
<th>SQ</th>
<th>RQ</th>
</tr>
</thead>
<tbody>
<tr>
<td>Location</td>
<td>52.9</td>
<td>83.4</td>
<td>63.6</td>
</tr>
<tr>
<td>Random</td>
<td>53.1</td>
<td>83.3</td>
<td>63.8</td>
</tr>
</tbody>
</table>

(b)

Table 5. Comparison of (a) the image resolution used in the CLIP vision encoder and (b) the color assignment criterion used to generate colormaps.
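For readers less familiar with the PQ, SQ, and RQ columns reported in these tables, the sketch below computes the three quantities for a single category using the standard panoptic quality definition, where a predicted and a ground truth segment are matched when their IoU exceeds 0.5. The function name and the toy inputs are illustrative only.

```python
from typing import List, Tuple


def panoptic_quality(
    matched_ious: List[float], num_fp: int, num_fn: int
) -> Tuple[float, float, float]:
    """Return (PQ, SQ, RQ) for one category.

    matched_ious: IoU of every true-positive match (each IoU > 0.5).
    num_fp: predicted segments without a match.
    num_fn: ground truth segments without a match.
    """
    num_tp = len(matched_ious)
    denominator = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if num_tp == 0 or denominator == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_tp   # segmentation quality: mean IoU of matches
    rq = num_tp / denominator         # recognition quality: an F1-style score
    return sq * rq, sq, rq


if __name__ == "__main__":
    pq, sq, rq = panoptic_quality([0.8, 0.9], num_fp=1, num_fn=1)
    print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
```

Dataset-level numbers average each quantity over categories, so a reported PQ is generally close to, but not exactly equal to, the product of the reported SQ and RQ.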

<table border="1">
<thead>
<tr>
<th>Backbone</th>
<th>AP</th>
<th>AP<sub>50</sub></th>
<th>AP<sub>75</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td>MobileNet(V2)</td>
<td>95.2</td>
<td>96.9</td>
<td>96.1</td>
</tr>
<tr>
<td>Swin-Tiny</td>
<td>95.8</td>
<td>98.9</td>
<td>97.3</td>
</tr>
<tr>
<td>Swin-Large</td>
<td>96.3</td>
<td>99.0</td>
<td>97.8</td>
</tr>
</tbody>
</table>

(a)

<table border="1">
<thead>
<tr>
<th>Masks</th>
<th>AP</th>
<th>AP<sub>50</sub></th>
<th>AP<sub>75</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td>COCO<sup>P</sup></td>
<td>95.2</td>
<td>98.7</td>
<td>97.1</td>
</tr>
<tr>
<td>Entity<sup>P</sup></td>
<td>94.9</td>
<td>98.5</td>
<td>97.0</td>
</tr>
<tr>
<td>COCO<sup>G</sup></td>
<td>95.2</td>
<td>96.9</td>
<td>96.1</td>
</tr>
</tbody>
</table>

(b)

Table 6. Ablation study of (a) various backbones in the colormap encoder and (b) different segmentation masks in training. We evaluate mask quality in AP because the masks are class-agnostic.

## 4.4. Ablation Studies

Considering the experimental results across three tasks, we analyze the specific effect of the ResoBlend module on panoptic segmentation with a colormap design and a CLIP vision encoder. This design allows us to understand the influence of ResoBlend on mask quality. Furthermore, we conduct experiments on panoptic narrative grounding to gain a better understanding of the influence of the sampling strategy and the association module.

**CLIP Vision Encoder.** Table 5(a) presents the results of our ablation study on the input resolution for the ViT backbone. Using a ViT model trained on lower image resolutions, such as 224 pixels, leads to diminished classification performance. This decrease in accuracy is attributed to the model’s challenge in effectively recognizing small objects, which are prevalent in the COCO dataset.
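One way to see why input resolution matters for small objects is to count the patch tokens a ViT produces at each resolution. The short sketch below assumes a patch size of 14 pixels, which is common for CLIP ViT models but is not stated in this section, so treat it as an assumption.

```python
def num_patches(resolution: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches for a square input image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    return side * side


if __name__ == "__main__":
    for resolution in (224, 336):
        # Assumed patch size of 14; other ViT variants use 16 or 32.
        print(resolution, num_patches(resolution))
```

Under this assumption the 336-pixel input yields more than twice as many tokens as the 224-pixel input, so a small object is spread over more patches.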

**Colormap Design.** Table 5(b) details the ablation study of color assignment strategies for encoding entity segmentation masks. The first row assigns colors based on each entity’s center of gravity (centroid), as described in Painter [42]. There is no significant performance difference between this centroid-based assignment and random color assignment (52.9 vs. 53.1 PQ). This is likely because the network can easily group pixels that share the same color, regardless of the specific color used. Consequently, we use random color assignment throughout our experiments for its simplicity and effectiveness.
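The two color assignment strategies can be sketched on a toy entity ID map as follows. This is an illustration under stated assumptions: the ID map is a nested list of integer entity IDs, and the centroid-to-color mapping in `location_colors` is a hypothetical scheme chosen for simplicity, not the exact rule used by Painter or by the authors.

```python
import random
from typing import Dict, Iterable, List, Tuple

Color = Tuple[int, int, int]
IdMap = List[List[int]]


def entity_centroids(id_map: IdMap) -> Dict[int, Tuple[float, float]]:
    """Mean (x, y) pixel position of every entity ID in the map."""
    sums: Dict[int, Tuple[int, int, int]] = {}
    for y, row in enumerate(id_map):
        for x, entity_id in enumerate(row):
            sx, sy, n = sums.get(entity_id, (0, 0, 0))
            sums[entity_id] = (sx + x, sy + y, n + 1)
    return {eid: (sx / n, sy / n) for eid, (sx, sy, n) in sums.items()}


def random_colors(entity_ids: Iterable[int], seed: int = 0) -> Dict[int, Color]:
    """Give every entity a distinct random RGB color."""
    rng = random.Random(seed)
    colors: Dict[int, Color] = {}
    used = set()
    for entity_id in sorted(entity_ids):
        color = tuple(rng.randrange(256) for _ in range(3))
        while color in used:  # redraw on the rare collision
            color = tuple(rng.randrange(256) for _ in range(3))
        used.add(color)
        colors[entity_id] = color
    return colors


def location_colors(id_map: IdMap) -> Dict[int, Color]:
    """Hypothetical centroid-based scheme: x drives red, y drives green."""
    height, width = len(id_map), len(id_map[0])
    colors: Dict[int, Color] = {}
    for entity_id, (cx, cy) in entity_centroids(id_map).items():
        red = round(255 * cx / max(width - 1, 1))
        green = round(255 * cy / max(height - 1, 1))
        colors[entity_id] = (red, green, 128)
    return colors


def render_colormap(id_map: IdMap, colors: Dict[int, Color]) -> List[List[Color]]:
    """Replace every entity ID with its assigned color."""
    return [[colors[entity_id] for entity_id in row] for row in id_map]


if __name__ == "__main__":
    toy_map = [[1, 1, 2], [1, 1, 2]]
    print(render_colormap(toy_map, random_colors({1, 2})))
    print(render_colormap(toy_map, location_colors(toy_map)))
```

Note that the centroid-based scheme can give two entities the same color when their centroids coincide, whereas the random scheme redraws until colors are distinct.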

Table 6 details our exploration of the effectiveness of solely utilizing colormap encoding and decoding across various backbones and colormap sources. These sources encompass the segmentation results generated from models trained on either the COCO or EntitySeg datasets and the ground truth mask annotations from COCO. Our observations confirm that the colormap design maintains its robustness irrespective of the backbone architecture and variability in mask patterns. This resilience suggests that our approach to encoding and decoding colormaps is fundamentally sound and adaptable to different underlying structures and sources of segmentation information.

<table border="1">
<thead>
<tr>
<th>SA</th>
<th>CA</th>
<th>FF</th>
<th>RES</th>
<th>CONV</th>
<th>PQ</th>
<th>SQ</th>
<th>RQ</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>51.8</td>
<td>82.7</td>
<td>61.3</td>
</tr>
<tr>
<td>✓</td>
<td></td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>43.1</td>
<td>78.1</td>
<td>58.1</td>
</tr>
<tr>
<td>✓</td>
<td>✓</td>
<td></td>
<td>✓</td>
<td>✓</td>
<td>52.8</td>
<td>83.0</td>
<td>63.3</td>
</tr>
<tr>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td></td>
<td>✓</td>
<td>50.9</td>
<td>81.9</td>
<td>60.5</td>
</tr>
<tr>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td></td>
<td>52.4</td>
<td>82.5</td>
<td>61.8</td>
</tr>
<tr>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>53.1</td>
<td>83.3</td>
<td>63.8</td>
</tr>
</tbody>
</table>

Table 7. Ablation study of the components of the proposed ResoBlend module. “SA”, “CA”, “FF”, “RES”, and “CONV” denote the self-attention, cross-attention, feed-forward, residual connection, and convolution modules, as illustrated in the right part of Figure 2.

<table border="1">
<thead>
<tr>
<th>Train</th>
<th>PQ</th>
<th>SQ</th>
<th>RQ</th>
</tr>
</thead>
<tbody>
<tr>
<td>COCO</td>
<td>53.0</td>
<td>83.1</td>
<td>63.8</td>
</tr>
<tr>
<td>Entity</td>
<td>52.9</td>
<td>83.2</td>
<td>63.7</td>
</tr>
<tr>
<td>Ground truth</td>
<td>53.1</td>
<td>83.3</td>
<td>63.8</td>
</tr>
</tbody>
</table>

(a)

<table border="1">
<thead>
<tr>
<th>Test</th>
<th>PQ</th>
<th>SQ</th>
<th>RQ</th>
</tr>
</thead>
<tbody>
<tr>
<td>COCO</td>
<td>53.1</td>
<td>83.3</td>
<td>63.8</td>
</tr>
<tr>
<td>Entity</td>
<td>49.2</td>
<td>83.7</td>
<td>58.6</td>
</tr>
<tr>
<td>Ground truth</td>
<td>79.6</td>
<td>97.8</td>
<td>91.4</td>
</tr>
</tbody>
</table>

(b)

Table 8. Comparison of different segmentation masks used in (a) training and (b) testing.

**ResoBlend Module.** Table 7 reports the ablation study on the components of the proposed ResoBlend module. The full configuration, shown in the last row, achieves the best performance, and omitting any module degrades mask prediction quality. Notably, segmentation accuracy drops significantly (from 53.1 to 43.1 PQ) when mask features are isolated from image features, as shown in the second row. This effect underscores the importance of interaction between the image and mask features. Nevertheless, even with this separation, the network retains a limited capacity for entity classification. We attribute this to the language embeddings, which contain visual information linked to the CLIP feature and function as image tokens within a sentence. This design enhances the language branch’s ability to discern semantic nouns from the image context.

Table 8 ablates the effect of using segmentation masks from different sources during training and testing. Table 8(a) shows that the source of the training masks does not significantly alter the results. This can be attributed to the loss design of our segmentation branch, which uses the input as the training target. Consequently, the network learns an identity function and does not refine mask quality. Table 8(b) shows the classification potential of our CLIP vision encoder when ground truth masks are provided during testing, which indicates the upper bound on performance. However, when we apply segmentation predictions from a model trained on the EntitySeg dataset, a noticeable domain gap with respect to the COCO annotations remains.

Figure 3. **Illustration of our user interface.** The GELLA framework can perform three tasks: image description, semantic noun extraction, and narrative grounding. We provide more visualization results in our appendix.

<table border="1">
<thead>
<tr>
<th rowspan="2">IMG<br/>DES</th>
<th rowspan="2">NOUN<br/>EXT</th>
<th rowspan="2">ENT<br/>GRO</th>
<th colspan="5">Segmentation</th>
<th rowspan="2">Text<br/>AR<sub>text</sub></th>
</tr>
<tr>
<th>AR</th>
<th>AR<sub>Th</sub></th>
<th>AR<sub>St</sub></th>
<th>AR<sub>Sing</sub></th>
<th>AR<sub>Pl</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>66.4</td>
<td>64.2</td>
<td>70.5</td>
<td>68.0</td>
<td>58.3</td>
<td>79.4</td>
</tr>
<tr>
<td>0.0</td>
<td>0.2</td>
<td>0.8</td>
<td>69.4</td>
<td>65.8</td>
<td>71.8</td>
<td>69.5</td>
<td>58.0</td>
<td>95.4</td>
</tr>
<tr>
<td>0.1</td>
<td>0.1</td>
<td>0.8</td>
<td>69.2</td>
<td>65.5</td>
<td>71.5</td>
<td>69.2</td>
<td>57.8</td>
<td>95.7</td>
</tr>
<tr>
<td>0.2</td>
<td>0.2</td>
<td>0.6</td>
<td>69.8</td>
<td>66.2</td>
<td>72.3</td>
<td>69.9</td>
<td>58.4</td>
<td>96.1</td>
</tr>
</tbody>
</table>

Table 9. **Ablation study of the sampling strategy in panoptic narrative grounding.** ‘IMG DES’, ‘NOUN EXT’, and ‘ENT GRO’ indicate three sampled subtasks, including image description, semantic noun extraction, and entity grounding in our training.

**Sampling Strategy.** Table 9 analyzes how varying the sample ratio of three distinct subtasks (image description, semantic noun extraction, and entity grounding) affects our framework’s performance. Notably, the entity grounding subtask includes a component that evaluates the effectiveness of noun extraction within the COCO panoptic narrative grounding dataset. Our initial approach, shown in the first row, uses only PNG data for training, which yields commendable results in entity grounding. Nevertheless, the Average Recall (AR) for semantic noun extraction under this setup is suboptimal (79.4 AR<sub>text</sub>), ultimately constraining the overall segmentation efficacy. Incorporating captions generated by GPT-4 from the EntitySeg dataset enhances the language model’s precision in noun extraction. The results demonstrate that once the noun extraction subtask is included, the exact sample ratio becomes less critical, as no marked performance discrepancy is observed across different ratios. Additionally, integrating the image description task endows our framework with the capability to produce image captions autonomously, further enriching its functionality. With the sample ratios 0.2, 0.2, and 0.6 for image description, semantic noun extraction, and entity grounding, our GELLA framework obtains the best performance for both entity recognition and panoptic narrative grounding.
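A minimal sketch of ratio-based subtask sampling, using the best-performing ratios from Table 9 (0.2, 0.2, 0.6), is shown below. The subtask names, the function name, and the choice to draw one subtask per training sample are assumptions for illustration; the source does not describe how the sampling is implemented.

```python
import random
from collections import Counter
from typing import Dict, List

# Ratios taken from the last row of Table 9; the key names are hypothetical.
SUBTASK_RATIOS: Dict[str, float] = {
    "image_description": 0.2,
    "noun_extraction": 0.2,
    "entity_grounding": 0.6,
}


def sample_subtasks(
    num_samples: int, ratios: Dict[str, float], seed: int = 0
) -> List[str]:
    """Draw one subtask per training sample, proportionally to `ratios`."""
    rng = random.Random(seed)
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


if __name__ == "__main__":
    draws = sample_subtasks(10_000, SUBTASK_RATIOS)
    for name, count in Counter(draws).most_common():
        print(f"{name}: {count / len(draws):.3f}")
```

Setting a ratio to 0.0, as in the first rows of Table 9, simply removes that subtask from the draw.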

<table border="1">
<thead>
<tr>
<th rowspan="2">Structure</th>
<th colspan="5">Segmentation</th>
</tr>
<tr>
<th>AR</th>
<th>AR<sub>Th</sub></th>
<th>AR<sub>St</sub></th>
<th>AR<sub>Sing</sub></th>
<th>AR<sub>Pl</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td>FC+ReLU+FC</td>
<td>69.7</td>
<td>66.1</td>
<td>72.3</td>
<td>69.7</td>
<td>58.2</td>
</tr>
<tr>
<td>FC</td>
<td>69.8</td>
<td>66.2</td>
<td>72.3</td>
<td>69.9</td>
<td>58.4</td>
</tr>
</tbody>
</table>

Table 10. **Ablation study on the association module design.**

<table border="1">
<thead>
<tr>
<th>Method</th>
<th>Backbone</th>
<th>AP<sup>e</sup></th>
<th>AR</th>
</tr>
</thead>
<tbody>
<tr>
<td>CondInst</td>
<td>Swin-Tiny</td>
<td>34.6</td>
<td>63.5</td>
</tr>
<tr>
<td>SOLOv2</td>
<td>Swin-Tiny</td>
<td>34.3</td>
<td>63.7</td>
</tr>
<tr>
<td rowspan="2">Mask2Former</td>
<td>Swin-Tiny</td>
<td>38.5</td>
<td>64.7</td>
</tr>
<tr>
<td>Swin-Large</td>
<td>43.1</td>
<td>69.8</td>
</tr>
</tbody>
</table>

(a)

<table border="1">
<thead>
<tr>
<th>Dataset</th>
<th>Task</th>
<th>Per</th>
<th>Zero-Shot</th>
<th>AR</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">COCO</td>
<td>Instance</td>
<td>50.1 AP</td>
<td rowspan="3">×</td>
<td>62.3</td>
</tr>
<tr>
<td>Panoptic</td>
<td>57.8 PQ</td>
<td>68.9</td>
</tr>
<tr>
<td>Entity</td>
<td>43.1 AP<sup>e</sup></td>
<td>69.8</td>
</tr>
<tr>
<td>EntitySeg</td>
<td>Entity</td>
<td>46.2 AP<sup>e</sup></td>
<td>✓</td>
<td>69.2</td>
</tr>
</tbody>
</table>

(b)

Table 11. (a) Using different models on entity segmentation and COCO dataset. AP<sup>e</sup> is the performance on COCO Entity. (b) Using Mask2Former on various segmentation tasks. ‘Per’ indicates the original performance.

**Association Module.** Table 10 presents the ablation study on the structural design of our association module. Adding further fully connected (FC) layers does not improve performance: a single FC layer reaches 69.8 AR, compared with 69.7 AR for FC+ReLU+FC. We attribute this to the fact that the final embeddings of both the <SEG> token and the entities already reside in the same CLIP image space. Consequently, no complex operation is needed to align them within a shared space.
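To make the single-FC design concrete, the sketch below projects a <SEG> token embedding with one linear layer and matches it against entity embeddings. The matching rule (dot-product similarity with a threshold), the function names, and the toy dimensions are assumptions introduced for illustration; the source only states that a single FC layer suffices.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> List[float]:
    """Single fully connected layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def associate(
    seg_embedding: Vector,
    entity_embeddings: Sequence[Vector],
    weight: Matrix,
    bias: Vector,
    threshold: float = 0.0,
) -> List[int]:
    """Indices of entities whose similarity to the projected <SEG>
    embedding exceeds `threshold` (a hypothetical matching rule)."""
    query = linear(seg_embedding, weight, bias)
    return [
        index
        for index, entity in enumerate(entity_embeddings)
        if dot(query, entity) > threshold
    ]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    entities = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
    print(associate([1.0, 0.0], entities, identity, [0.0, 0.0]))
```

Because one noun can refer to several entities (for example, "zebras"), the sketch returns a list of indices and not a single best match.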

**Various Segmentation Inputs.** Table 11 shows the flexibility of GELLA with various segmentation models and tasks. In other words, because GELLA takes colormaps as input, it does not require an explicit segmentation branch.

<table border="1">
<thead>
<tr>
<th>Method</th>
<th>LLM</th>
<th>IME</th>
<th>MMD</th>
<th>ALL</th>
</tr>
</thead>
<tbody>
<tr>
<td>LISA</td>
<td>92607</td>
<td>3114</td>
<td>226</td>
<td>95947</td>
</tr>
<tr>
<td>Ours</td>
<td>11987</td>
<td>204</td>
<td>113</td>
<td>12304</td>
</tr>
</tbody>
</table>

(a) GFLOPS.

<table border="1">
<thead>
<tr>
<th>Method</th>
<th>LLM</th>
<th>IME</th>
<th>MMD</th>
<th>ALL</th>
<th>RESULTS</th>
</tr>
</thead>
<tbody>
<tr>
<td>LISA</td>
<td></td>
<td>1.9s</td>
<td>0.2s</td>
<td>5.5s</td>
<td>43.6</td>
</tr>
<tr>
<td>Ours</td>
<td>3.2s</td>
<td>0.7s</td>
<td>0.3s</td>
<td>4.2s</td>
<td>69.8</td>
</tr>
</tbody>
</table>

(b) Inference time.

Table 12. Comparison of GFLOPS and average inference time per image with another LLM-based method. “LLM”, “IME”, and “MMD” indicate the large language model, image/mask encoder, and decoder, respectively.

**Computation Cost Comparison.** As Table 12 shows, our method outperforms LISA at a lower computational cost, even when accounting for the cost of the segmentation masks generated by Mask2Former with a Swin-Large backbone (an additional 218M parameters, 541 GFLOPS, and 0.3s of inference time).
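The comparison can be checked with simple arithmetic on the numbers reported in Table 12 and in the paragraph above. The sketch below only sums the reported values; the variable names are illustrative.

```python
# GFLOPS per component, copied from Table 12(a).
GFLOPS = {
    "LISA": {"LLM": 92607, "IME": 3114, "MMD": 226},
    "Ours": {"LLM": 11987, "IME": 204, "MMD": 113},
}
# Extra cost of generating masks with Mask2Former (Swin-Large), from the text.
MASK2FORMER_GFLOPS = 541
MASK2FORMER_SECONDS = 0.3
# Total inference time per image in seconds, from Table 12(b).
TOTAL_SECONDS = {"LISA": 5.5, "Ours": 4.2}

lisa_total = sum(GFLOPS["LISA"].values())            # 95947
ours_total = sum(GFLOPS["Ours"].values())            # 12304
ours_with_masks = ours_total + MASK2FORMER_GFLOPS    # 12845
gflops_ratio = lisa_total / ours_with_masks          # roughly 7.5x fewer GFLOPS
ours_seconds_with_masks = TOTAL_SECONDS["Ours"] + MASK2FORMER_SECONDS

if __name__ == "__main__":
    print(f"LISA: {lisa_total} GFLOPS, {TOTAL_SECONDS['LISA']:.1f}s")
    print(f"Ours + Mask2Former: {ours_with_masks} GFLOPS, "
          f"{ours_seconds_with_masks:.1f}s ({gflops_ratio:.1f}x fewer GFLOPS)")
```

Even with the mask generator included, the totals stay below those reported for LISA in both GFLOPS and inference time.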

## 4.5. Visualization

Figure 3 displays the user interface of our GELLA framework, showcasing its capabilities in image captioning, entity recognition, and panoptic narrative grounding. In this interactive interface, GELLA first describes the image and outputs a long caption. Then, it extracts each semantic noun and associates it with the provided entity segmentation masks. Additionally, the GELLA framework can incorporate existing segmentation results as input.

Figures 4, 5, and 6 show additional visualization results that pair long captions generated by GPT-4V with entity segmentation results from CropFormer.

## 5. Conclusion

This paper introduces GELLA, a framework that leverages a large language model to ground entities with long captions. The GELLA framework consists of colormap, image, and language encoders to encode the segmentation masks, image, and caption. Compared to other works, the colormap encoder enables the network to prioritize masks without the computational burden of high-resolution image processing. Thus, it allows us to use the CLIP vision encoder for handling low-resolution images in pixel-level prediction and semantic noun extraction with the language decoder. Furthermore, the proposed ResoBlend and association modules align the multi-modal features and facilitate coherent narrative construction. We hope GELLA can provide a more flexible pipeline to ground entities by receiving offline masks generated by state-of-the-art segmentation methods.

The image shows a group of zebras and giraffes, likely in a zoo or wildlife park setting. In the foreground, you can see three zebras with distinctive black-and-white striped patterns on their bodies, grazing or walking on a sandy surface. Behind the zebras, several giraffes are visible. These tall animals have long necks and legs and are spotted with brown patches separated by lighter lines. They are gathered around some trees and artificial-looking rocks, probably part of a constructed habitat for them.

zebras: [1, 2, 3]  
giraffes: [4, 5, 6, 7]  
three zebras: [1, 2, 3]  
sandy surface: [8]  
zebras: [1, 2, 3]  
giraffes: [4, 5, 6, 7]  
tall animals: [4, 5, 6, 7]  
trees: [10]  
rocks: [9]

The image shows an outdoor setting with several individuals engaging in different activities. In the foreground, an older gentleman wearing glasses is seated on a green and black bench, attentively reading a newspaper. Behind him, there is another bench where a man in a red shirt is sitting. He is looking to the side, and appears to be resting or waiting. Further in the background, two other men are visible; one is standing with sunglasses on and looking towards the camera, while the other is seated, immersed in his phone.

several individuals: [1, 2, 3, 4, 5]  
gentleman: [1]  
bench: [6]  
newspaper: [10]  
bench: [8]  
man: [2]  
two other men: [3, 4]  
one: [3]  
the other: [4]

Figure 4. **Sample results of panoptic narrative grounding.** The left part shows the input image and the long caption generated by our GELLA model, while the right part displays the panoptic narrative grounding results.

The image shows a well-organized and cozy living space. The room features a brown sofa on the right, adorned with a long, white pillow. In front of the sofa, there is a small, colorful rug with orange tones that complements the decor. On the left, a television set is visible, placed on a low, wooden entertainment unit. A small wooden coffee table sits in the center of the room. On the windowsill, behind a sheer white curtain stand more potted plants of different sizes and shapes. Hanging planters add a vertical element to the plant decor.

brown sofa: [1]  
long, white pillow: [2]  
small, colorful rug: [3]  
television set: [8, 9]  
low, wooden entertainment unit: [7]  
small wooden coffee table: [4]  
sheer white curtain: [17]  
potted plants: [12, 13, 14, 15, 16]  
hanging planters: [20, 21]

The image displays a cozy breakfast setting on a bed. The breakfast is arranged on a wooden tray placed atop a pink bedsheet. The tray holds a cup of tea or coffee with a considerable amount of milk, as the liquid is very light in color. Next to it is a small metal pitcher, perhaps containing more milk or cream. In front of these items is a cup of cappuccino with a frothy top, placed on a saucer next to a teaspoon. To the right of the tray is a pastry that looks like a croissant, sitting on a white plate.

wooden tray: [14]  
pink bedsheet: [15]  
tea or coffee: [1]  
small metal pitcher: [9]  
cappuccino: [6]  
saucer: [7]  
teaspoon: [8]  
pastry: [10]  
white plate: [11]

Figure 5. **Sample results of panoptic narrative grounding.** The left part shows the input image and the long caption generated by our GELLA model, while the right part displays the panoptic narrative grounding results.

The image depicts an outdoor gathering, presumably a celebration or a picnic. In the center, a man with a beard is sitting at a picnic table holding a young child on his lap. There's a colorful birthday cake on the table in front of them. To the left, there is another man standing by the table, looking downward, and several other people are seated or standing around the area. One woman in sunglasses is standing behind the table, looking at the camera with a slight smile. Further in the background, there's a man with a beard and glasses who seems to be observing the scene.

man: [1]  
picnic table: [10]  
young child: [2]  
colorful birthday cake: [8]  
table: [10]  
man: [3]  
table: [10, 12]  
woman: [4]  
table: [10, 12]  
man: [5]

The image shows three people, likely at a celebratory event. The person on the left is smiling at the camera and wearing a vibrant red hat, a black cardigan, and a light scarf. In the center, there's someone concentrating on cutting a cake decorated with white frosting and colorful flowers, wearing a patterned top and a purple hat. To the right, another person appears to be either cutting a different cake or helping out, wearing a dark top and a black hat. In the background, there's a fire extinguisher and a potted plant.

three people: [1, 2, 3]  
person: [1]  
cake: [7]  
person: [3]  
cake: [8]  
fire extinguisher: [18]  
potted plant: [19]

Figure 6. **Sample results of panoptic narrative grounding.** The left part shows the input image and the long caption generated by our GELLA model, while the right part displays the panoptic narrative grounding results.

## References

- [1] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM’s referential dialogue magic. *arXiv:2306.15195*, 2023. 2, 3
- [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. *TPAMI*, 2017. 2
- [3] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In *CVPR*, 2022. 2, 4, 5, 6
- [4] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. *arXiv:2305.06500*, 2023. 2, 3
- [5] Zihan Ding, Zi-han Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Xiaolin Wei, and Si Liu. PPMN: Pixel-phrase matching network for one-stage panoptic narrative grounding. In *ACM MM*, 2022. 6
- [6] Cristina González, Nicolás Ayobi, Isabela Hernández, José Hernández, Jordi Pont-Tuset, and Pablo Arbeláez. Panoptic narrative grounding. In *ICCV*, 2021. 5, 6
- [7] Cristina González, Nicolás Ayobi, Isabela Hernández, Jordi Pont-Tuset, and Pablo Arbeláez. PiGLET: Pixel-level grounding of language expressions with transformers. *TPAMI*, 2023. 6
- [8] Abdul Mueed Hafiz and Ghulam Mohiuddin Bhat. A survey on instance segmentation: state of the art. *International Journal of Multimedia Information Retrieval*, 2020. 2
- [9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In *ICCV*, 2017. 2
- [10] Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, and Si Liu. Enriching phrases with coupled pixel and object contexts for panoptic narrative grounding. *arXiv:2311.01091*, 2023. 6
- [11] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In *CVPR*, 2019. 2
- [12] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In *CVPR*, 2019. 2
- [13] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. *arXiv:2304.02643*, 2023. 2, 6
- [14] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. *arXiv:2308.00692*, 2023. 2, 3, 6
- [15] Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. Deep hierarchical semantic segmentation. In *CVPR*, 2022. 2
- [16] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation. In *CVPR*, 2021. 2
- [17] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In *ECCV*, 2022. 4
- [18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In *ECCV*, 2014. 5
- [19] Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In *CVPR*, 2023. 6
- [20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. *arXiv:2310.03744*, 2023. 2, 4, 5
- [21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In *NeurIPS*, 2023. 2, 3, 4
- [22] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In *CVPR*, 2018. 2
- [23] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In *ICCV*, 2021. 3, 5
- [24] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In *CVPR*, 2015. 2
- [25] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In *CVPR*, 2016. 5
- [26] Edgar Margffoy-Tuay, Juan C Pérez, Emilio Botero, and Pablo Arbeláez. Dynamic multimodal instance segmentation guided by natural language queries. In *ECCV*, 2018. 6
- [27] Varun K. Nagaraja, Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. In *ECCV*, 2016. 5
- [28] OpenAI. GPT-4 technical report. *arXiv:2303.08774*, 2023. 5
- [29] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. *arXiv:2306.14824*, 2023. 2, 3
- [30] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with KINS dataset. In *CVPR*, 2019. 2
- [31] Lu Qi, Jason Kuen, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Zhe Lin, Philip Torr, and Jiaya Jia. Open-world entity segmentation. *TPAMI*, 2022. 2
- [32] Lu Qi, Jason Kuen, Weidong Guo, Jiuxiang Gu, Zhe Lin, Bo Du, Yu Xu, and Ming-Hsuan Yang. AIMS: All-inclusive multi-level segmentation. In *NeurIPS*, 2023.
- [33] Lu Qi, Jason Kuen, Tiancheng Shen, Jiuxiang Gu, Wenbo Li, Weidong Guo, Jiaya Jia, Zhe Lin, and Ming-Hsuan Yang. High quality entity segmentation. In *ICCV*, 2023. 2, 5
- [34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. *arXiv:2103.00020*, 2021. 2, 3, 5
- [35] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In *CVPR*, 2018. 3
- [36] Tiancheng Shen, Yuechen Zhang, Lu Qi, Jason Kuen, Xingyu Xie, Jianlong Wu, Zhe Lin, and Jiaya Jia. High quality segmentation for ultra high-resolution images. In *CVPR*, 2022. 2
- [37] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In *ECCV*, 2020. 2
- [38] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. *arXiv:2302.13971*, 2023. 3
- [39] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In *CVPR*, 2021. 2
- [40] Haowei Wang, Jiayi Ji, Tianyu Guo, Yilong Yang, Yiyi Zhou, Xiaoshuai Sun, and Rongrong Ji. NICE: Improving panoptic narrative detection and segmentation with cascading collaborative learning. *arXiv:2310.10975*, 2023. 6
- [41] Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. SOLO: Segmenting objects by locations. In *ECCV*, 2020. 2
- [42] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In *CVPR*, 2023. 3, 7
- [43] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In *CVPR*, 2022. 6
- [44] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. LAVT: Language-aware vision transformer for referring image segmentation. In *CVPR*, 2022. 6
- [45] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. *arXiv:2310.07704*, 2023. 2, 3
- [46] Hongshan Yu, Zhengeng Yang, Lei Tan, Yaonan Wang, Wei Sun, Mingui Sun, and Yandong Tang. Methods and datasets on semantic segmentation: A review. *Neurocomputing*, 2018. 2
- [47] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In *ECCV*, 2016. 5
- [48] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In *CVPR*, 2017. 2
- [49] Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, and Bingyi Kang. BuboGPT: Enabling visual grounding in multi-modal LLMs. *arXiv:2307.08581*, 2023. 2, 3
- [50] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. *arXiv:2304.10592*, 2023. 2, 3
- [51] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In *CVPR*, 2023. 6
