Continuously updated

Your paper GRSL-01600-2023 Multi-Stage Synergistic Aggregation Network for Remote Sensing Visual Grounding has been carefully reviewed by the GRSL review panel and found to be unacceptable in its present form. The reviewers did suggest, however, that if completely revised the paper might be found acceptable. We encourage you to revise and resubmit this manuscript as a new paper to GRSL.

If you do decide to resubmit the paper, please include a point-by-point response to the comments of the reviewers along with the new paper in order to expedite its review. Use the “Submit a Resubmission” link in your author center when you submit the new manuscript. Your resubmission is due by 05-Jan-2024.

Below you will find comments from the review panel. Any attached files that may be referenced with these comments can be accessed in a copy of this decision letter located in your Author Center on ScholarOne Manuscripts.

Sincerely,
Dr. Avik Bhattacharya
Editor-in-Chief
IEEE Geoscience and Remote Sensing Letters


Associate Editor Comments:
Associate Editor
Comments to the Author:
According to the reviewers' comments, the novelty of the proposed method is doubtful, and the introduction and detailed description of the proposed method are not clear. Besides, this manuscript needs to be further improved in terms of writing.

Reviewer(s) Comments:
Reviewer: 1

Comments to the Author
The authors propose a multi-stage synergistic aggregation network for remote sensing visual grounding to address the correlations between textual semantics and visual information, as well as the dependencies between features and bounding box representations. My major comments are as follows:

(1) The literature review on remote sensing visual grounding methods is inadequate. I recommend that the authors provide a more comprehensive literature review on remote sensing visual grounding methods.
(2) The authors state that existing methods overlooked the crucial dependencies between features and bounding box representations. I cannot understand what these dependencies are. Please add more descriptions to explain them.
(3) The authors do not introduce the objective function of the proposed method. Please add the definition of the objective function.
(4) The proposed method is complex and not easy to understand. I suggest that the authors make their code publicly available and release their best model to allow for result reproduction. This will greatly influence my decision.
(5) There is some inconsistent expression. For example, in Fig. 1, “Query channel Attention” is denoted as “QCA,” and “Cross Attention” is denoted as “CA Layer.” Additionally, the authors usually use “QCB” to refer to “Query Channel Attention” in the context, but in Fig. 1, “QCA” is used, which is confusing.
(6) I suggest the authors add a comma or a period at the end of different equations.

x (3) Response: In the textual description of the experiments section, we state that the prediction head is a two-layer MLP and that the loss function is the cross-entropy loss: “The prediction head is simply designed as a two-layer MLP. The utilized loss function is the cross-entropy (CE) loss.” We have now defined the loss function explicitly as Equation (x) and referenced it wherever it is used.
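For reference, a standard token-level cross-entropy formulation over the predicted coordinate sequence, which the revised definition presumably resembles (the symbols $L$, $\hat{c}_j$, $V$, $T$ here are illustrative, not the manuscript's actual notation), would be:

```latex
% Token-level cross-entropy over the discrete coordinate sequence.
% L: sequence length; \hat{c}_j: j-th target coordinate token;
% V, T: visual and textual features conditioning the decoder.
\mathcal{L}_{\mathrm{CE}}
  = -\frac{1}{L}\sum_{j=1}^{L}
    \log p_{\theta}\!\left(\hat{c}_j \,\middle|\, \hat{c}_{<j},\, V,\, T\right)
```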
x (4) Response: We will first release the source code, the best model, and the attention heat-map visualization code, and will subsequently organize and release, without reservation, the branch code for the ablation experiments together with the corresponding model weights. Please bear with us: https://github.com/waynamigo/AGT-MSVG.
x (5) Response: We have corrected the errors in Fig. 1. Thank you for pointing them out, and we apologize for the rough first version of the manuscript.
x (6) Response: The revised equations have been updated in the latest submission.

Reviewer: 2

Comments to the Author
Overall, this is a well-structured paper, but there are several writing issues. The paper focuses on the latest RSVG task and proposes a novel multi-stage synergistic aggregation module that effectively aggregates visual and textual contexts to facilitate the learning of multi-scale multimodal features. However, the introduction of the generative paradigm, which is the innovative aspect, may not be appropriate. The reviewer thinks this manuscript is not acceptable without major revisions.


standardize writing; use precise formula symbols

  1. Is the ‘Auto-Regressive Transformer’ in Figure 1(a) the same module as the ‘Generative Transformer’ in (b)? Please ensure consistent expression. Section II.C is referred to as the ‘Auto-regressive Generative Transformer’.

  2. In Section II, the first appearance of the symbol “i” needs to be explained clearly; it seems inconsistent with its meaning in formulas (2), (3), (4), and (5). In the section “Feature Aggregation,” it is stated that “i presents the i-th stage of aggregation.”

  3. It is mentioned in the letter that “if there are n queries associated with the same image, then the image forms n samples with these n queries.” And in “The embedding set can be presented as ={[CLS], t1, t2, t3, …, tn},” the symbol “n” is used, which should not represent the same meaning. Formulas (6) and (7) also use “n,” please carefully check and differentiate symbols with different meanings.

  4. Section “Transformer Decoder” is poorly written. This section references formulas (9) and (12), but it is clear that formula (12) does not exist. So there is an error, please modify it carefully. What does “Cxy” mean in formula (8)? Does it have the same meaning as “Cxy” in Figure 1(a)? What is the meaning of “kwh”? What is the difference between “Cbins” and “Cˆbins”? What is the difference between “Cxy” and “Cˆxy”? What is “Sc”? What are s1, s2, s3, s4? They are not explained clearly.

  5. There is a grammar error: “The result sequence Cˆbins predicted by the decoder can be inverse quantized with (12) to obtain the floating-point result, we use Cbins.” This sentence is incorrect as there are complete sentences before and after the comma. To clarify, this does NOT mean that expressions elsewhere in the text are correct. Please thoroughly check the grammar in the paper.

  6. Several symbols, such as “n,” “k,” etc., are not in italic font. Please carefully check and modify.
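As background for comments 4 and 5, the quantization and inverse-quantization of box coordinates in such generative grounding methods can be sketched as below; the function names and the values `n_bins=1000` and `img_size=640` are illustrative assumptions, not the manuscript's actual settings:

```python
def quantize(coord, n_bins=1000, img_size=640):
    """Map a continuous coordinate in [0, img_size] to a discrete bin index."""
    bin_idx = int(coord / img_size * (n_bins - 1) + 0.5)  # round to nearest bin
    return max(0, min(n_bins - 1, bin_idx))  # clamp to the valid bin range

def dequantize(bin_idx, n_bins=1000, img_size=640):
    """Inverse quantization: recover a floating-point coordinate from a bin index."""
    return bin_idx / (n_bins - 1) * img_size
```

The round trip loses at most half a bin width, which is why the decoder's discrete output sequence can still be mapped back to usable floating-point box coordinates.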
    x 1. Response: Thank you for pointing this out. Our auto-regressive Transformer indeed serves as the generator; the inconsistent naming was a writing oversight. To avoid confusion, we have renamed “AGT” in Fig. 1 to “AGGT,” which removes the ambiguity with the text.

x 2. Response: In Section II we have re-described the correspondence between image I and text T as a set of sample pairs, which avoids confusion with formulas (2)-(5).


novelty

  1. Contribution 2) mentions introducing the generation paradigm into the RSVG field. Please explain what exactly the generation paradigm means. In the abstract, it is mentioned as “generate discrete coordinates sequence in an auto-regressive manner,” but the MGVLF proposed in reference [17] also generates discrete coordinates through feature regression. This contribution point is not sufficient.
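For context, the “generation paradigm” at issue (as in Pix2Seq-style detectors) predicts the box as a sequence of discrete coordinate tokens, one conditional step at a time, rather than regressing floats in one shot. A minimal greedy-decoding sketch, with a stand-in `step_fn` replacing the actual Transformer decoder (all names here are illustrative, not the authors' implementation):

```python
def greedy_decode(step_fn, seq_len=4, bos=0):
    """Auto-regressively generate a sequence of discrete coordinate tokens.

    step_fn(prefix) -> list of scores over the token vocabulary; here it is a
    stand-in for the conditional distribution produced by a Transformer decoder.
    """
    tokens = [bos]
    for _ in range(seq_len):
        scores = step_fn(tokens)  # condition on everything generated so far
        tokens.append(max(range(len(scores)), key=scores.__getitem__))  # argmax
    return tokens[1:]  # drop BOS; 4 tokens = quantized (x1, y1, x2, y2)
```

The defining property is the conditioning of each token on the previously generated prefix, which is what distinguishes this paradigm from one-shot coordinate regression.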

experiments

  1. In Section A, it is stated that “our framework achieves improved visual representation learning for small-scale objects.” How can you prove this through qualitative or quantitative results? Otherwise, it is difficult to draw this conclusion.

other

  1. Reference [17] constructs the DIOR-RSVG dataset, and the proposed method is named MGVLF. Please modify the entire letter and TABLE 1 accordingly. The Venue should be TGRS, not IEEE.

  2. Is the ‘Prediction Head’ in Figure 1(a) the same as ‘MLP’ in (b)?

  3. Refer to recent articles [1] on remote sensing vision-language tasks.
    [1] Y. Yuan, Y. Zhan and Z. Xiong, “Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1-14, 2023, Art no. 5619014, doi: 10.1109/TGRS.2023.3308969.

