TextShield-R1: Reinforced Reasoning for Tampered Text Detection

Published: 2026/7/5 0:49:44

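Authors: Chenfan Qu, Yiwu Zhong, Jian Liu, Xuekang Zhu, Bohan Yu, Lianwen Jin

Deep-Dive Summary:

Abstract

The growing prevalence of tampered images poses serious security threats and creates an urgent need for reliable detection methods. Multimodal large language models (MLLMs) show strong potential for analyzing tampered images and generating explanations. However, they still struggle to identify micro-level forgery traces and to localize tampered regions precisely, and they rely heavily on expensive forgery-explanation annotations. To address this, the authors present TextShield-R1, the first reinforcement-learning-based MLLM solution for tampered text detection and reasoning. Specifically, the method introduces Forensic Continual Pre-training, which uses large-scale, cheap data from natural image forensics and OCR tasks to build up the model's ability to detect tampered text in an easy-to-hard order. During fine-tuning, it applies **Group Relative Policy Optimization (GRPO)** with novel reward functions to reduce annotation dependency and strengthen reasoning. At inference time, **OCR Rectification** improves localization accuracy. In addition, to support rigorous evaluation, the authors release the Text Forensics Reasoning (TFR) benchmark, which contains over 45k real and tampered images covering 16 languages and 10 tampering techniques. Extensive experiments show that TextShield-R1 reaches state-of-the-art performance in interpretable tampered text detection.

Text Forensics Reasoning Benchmark (TFR Benchmark)

Motivation

TFR was built to address seven shortcomings of existing benchmarks: restricted domains (documents only or scene text only), narrow scope (no fully generated images), imbalanced proportions (no real samples), insufficient diversity, outdated techniques, no OOD evaluation, and incomplete annotation (no textual explanations).

Construction and Highlights

TFR contains 45,971 forged images and 45,514 corresponding real images. It is the first benchmark to cover documents, scene text, and ID-style cards together. It also supports robust evaluation across image styles (CIS), across tampering methods (CTM), and across languages (CL), and it includes high-quality forged samples generated by GPT-4o.

Table 1: Comparison of public text forensics benchmarks.

| Dataset | Image domain | Forged region | # Real | # Forged | # Methods | # Languages | Latest AIGC method | OOD evaluation | Forgery explanation |
|---|---|---|---|---|---|---|---|---|---|
| T-SROIE | Document | Local | 0 | 920 | 1 | 1 | SR-Net (2019) | × | × |
| DocTamper | Document | Local | 0 | 170,000 | 3 | 2 | - | Partial | × |
| Ours | Document, scene, card | Local, global | 45,514 | 45,971 | 10 | 16 | GPT-4o (2025) | ✓ | ✓ |

Figure 2: Representative samples (left) and data statistics (right) of the TFR benchmark.

Experimental Results

Qwen2.5-VL-7B serves as the base model. On the TFR benchmark, TextShield-R1 significantly outperforms mainstream models such as MiniCPM, Qwen2.5-VL, and InternVL3 on image-level classification, tampered text recognition, localization, and forgery reasoning, setting a new state of the art (SOTA) for this area.

Table 2: Comparison experiments. 'Cls.' denotes the real/generated/tampered classification task, measured by accuracy. 'OCR' denotes the tampered text recognition task, measured by OCR accuracy. 'Loc.' denotes the tampered text localization task, measured by IoU. 'Res.' denotes the forgery reasoning task, measured by the average of cosine similarity, Rouge-L, and BLEU. 'FakeShield*' and 'SIDA*' denote FakeShield (Xu et al. 2024) and SIDA (Huang et al. 2024) with Qwen2.5-VL-7B as the MLLM. Columns are grouped by evaluation set: Test, CIS, CTM, and CL.

| Method | Test Cls. | Test OCR | Test Loc. | Test Res. | CIS Cls. | CIS OCR | CIS Loc. | CIS Res. | CTM Cls. | CTM OCR | CTM Loc. | CTM Res. | CL Cls. | CL OCR | CL Loc. | CL Res. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Official pre-trained base MLLMs without fine-tuning | | | | | | | | | | | | | | | | |
| GPT-4o | 51.7 | 5.6 | 0.5 | 19.4 | 53.4 | 22.0 | 1.9 | 24.3 | 37.8 | 27.2 | 3.1 | 9.7 | 48.3 | 8.6 | 3.1 | 14.2 |
| MiniCPM-V.2.6 | 30.4 | 1.6 | 0.0 | 3.2 | 31.4 | 4.8 | 0.0 | 3.2 | 26.2 | 7.0 | 0.0 | 2.5 | 30.9 | 0.4 | 0.0 | 1.9 |
| InternVL3-2B | 33.2 | 5.4 | 0.0 | 8.5 | 40.0 | 14.0 | 0.1 | 9.9 | 20.3 | 18.2 | 0.4 | 4.0 | 34.5 | 6.7 | 0.0 | 6.9 |
| InternVL3-8B | 40.4 | 9.3 | 0.2 | 17.9 | 47.0 | 20.9 | 0.8 | 20.3 | 25.2 | 31.5 | 1.7 | 8.8 | 45.1 | 10.3 | 0.5 | 18.1 |
| Qwen2.5-VL-3B | 46.2 | 1.8 | 0.1 | 9.5 | 48.4 | 5.9 | 0.3 | 15.3 | 42.2 | 7.7 | 0.2 | 5.6 | 47.1 | 1.5 | 0.2 | 7.9 |
| Qwen2.5-VL-7B | 42.6 | 6.4 | 0.1 | 9.5 | 49.9 | 19.4 | 0.4 | 17.6 | 34.0 | 21.0 | 0.6 | 4.5 | 50.1 | 11.1 | 0.2 | 10.4 |
| MLLMs fine-tuned with full training set images | | | | | | | | | | | | | | | | |
| MiniCPM-V.2.6 | 76.2 | 17.6 | 11.2 | 41.1 | 71.7 | 24.5 | 22.8 | 32.3 | 64.8 | 20.7 | 24.9 | 27.3 | 81.5 | 33.6 | 20.5 | 40.3 |
| InternVL3-2B | 75.4 | 18.1 | 10.3 | 40.6 | 68.5 | 23.1 | 21.4 | 32.0 | 62.5 | 18.7 | 26.2 | 25.0 | 80.2 | 31.7 | 21.0 | 39.5 |
| InternVL3-8B | 78.6 | 21.9 | 15.4 | 41.7 | 70.8 | 27.6 | 25.2 | 33.8 | 67.7 | 23.6 | 31.4 | 32.3 | 84.3 | 38.0 | 24.8 | 42.0 |
| Qwen2.5-VL-3B | 77.5 | 18.6 | 11.6 | 42.9 | 72.3 | 25.0 | 20.9 | 33.6 | 63.0 | 18.8 | 25.6 | 24.6 | 80.9 | 32.4 | 21.4 | 39.7 |
| Qwen2.5-VL-7B | 79.1 | 24.3 | 18.2 | 42.9 | 71.1 | 30.7 | 26.5 | 35.7 | 73.6 | 26.3 | 34.2 | 36.2 | 85.1 | 38.2 | 25.5 | 43.1 |
| FakeShield | 70.5 | 9.2 | 5.4 | 35.6 | 62.0 | 14.8 | 10.6 | 29.2 | 57.5 | 15.1 | 17.3 | 21.4 | 70.3 | 23.9 | 11.3 | 34.8 |
| FakeShield* | 79.1 | 24.3 | 7.6 | 42.8 | 71.1 | 30.5 | 15.0 | 35.6 | 73.6 | 26.3 | 21.8 | 36.2 | 85.1 | 38.1 | 15.6 | 42.9 |
| SIDA | 71.2 | 9.2 | 5.6 | 35.7 | 62.2 | 14.9 | 10.8 | 29.4 | 57.5 | 15.1 | 17.3 | 21.5 | 70.4 | 23.8 | 11.5 | 25.0 |
| SIDA* | 79.2 | 24.3 | 7.7 | 42.9 | 71.4 | 30.9 | 15.1 | 35.7 | 73.6 | 26.3 | 21.8 | 36.3 | 85.2 | 38.2 | 15.8 | 43.0 |
| Ours | 88.1 | 47.6 | 57.8 | 58.8 | 72.9 | 62.1 | 61.0 | 56.5 | 88.8 | 45.6 | 68.3 | 51.2 | 85.5 | 39.0 | 40.6 | 46.2 |

Ablation Study

Table 3 shows the ablation results for the proposed modules. Setting (1) is the Qwen2.5-VL-7B baseline without any of the proposed modules, and setting (5) is the full TextShield-R1 model with all of them. Removing Forensic Continual Pre-training from the full model (setting (2)) causes a marked drop in performance, with scores falling even below the baseline (1). This highlights the key role of the pre-training stage, which lays the foundation for understanding complex text image forensics tasks. Without it, the model has difficulty learning effectively and converging during the subsequent GRPO fine-tuning. Although the full model (5) uses forgery-reasoning annotations for only a quarter of the training data, it performs on par with setting (3). This result confirms that the novel reward functions let the model learn forgery reasoning from a partially annotated dataset. In addition, with OCR Rectification integrated, the full model (5) outperforms setting (4) on tampered text localization across all four test sets. This confirms that OCR Rectification improves localization by making effective use of the OCR strength inherent in the MLLM.

Table 3: Ablation study on the proposed modules. 'w.o.' denotes 'without'. 'FCP' denotes the Forensic Continual Pre-training approach. 'OCR Rect.' denotes the proposed OCR Rectification.

| Num. | Ablation | Test Cls. | Test OCR | Test Loc. | Test Res. | CIS Cls. | CIS OCR | CIS Loc. | CIS Res. | CTM Cls. | CTM OCR | CTM Loc. | CTM Res. | CL Cls. | CL OCR | CL Loc. | CL Res. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) | Baseline | 79.1 | 24.3 | 18.2 | 42.9 | 71.1 | 30.7 | 26.5 | 35.7 | 73.6 | 26.3 | 34.2 | 36.2 | 85.1 | 38.2 | 25.5 | 43.1 |
| (2) | w.o. FCP | 75.8 | 21.9 | 12.7 | 39.0 | 68.4 | 25.0 | 20.9 | 30.6 | 66.3 | 23.7 | 25.5 | 30.1 | 83.9 | 38.5 | 26.0 | 41.8 |
| (3) | w.o. GRPO | 87.6 | 46.8 | 57.7 | 58.6 | 72.3 | 61.7 | 60.8 | 56.2 | 88.1 | 45.3 | 68.2 | 50.9 | 85.4 | 38.5 | 40.2 | 46.1 |
| (4) | w.o. OCR Rect. | 88.1 | 47.6 | 42.7 | 58.8 | 72.9 | 62.1 | 56.6 | 56.5 | 88.8 | 45.6 | 57.9 | 51.2 | 85.5 | 39.0 | 32.3 | 46.2 |
| (5) | TextShield-R1 | 88.1 | 47.6 | 57.8 | 58.8 | 72.9 | 62.1 | 61.0 | 56.5 | 88.8 | 45.6 | 68.3 | 51.2 | 85.5 | 39.0 | 40.6 | 46.2 |

Table 4 details the ablation of the Forensic Continual Pre-training stage. Setting (1) is the baseline without any continual pre-training. Setting (2) pre-trains only on distinguishing real, generated, and tampered images. This improves image-level classification but causes a large drop in OCR and localization performance, because the single pre-training objective leads to catastrophic forgetting of the model's original OCR and localization abilities. Adding the 3D Forensic Learning task (setting (3)) clearly improves localization, but the OCR score stays low because text recognition knowledge continues to be forgotten. The final pre-trained model (setting (5)) adds the OCR Reference Grounding task, which makes the model both forensics-aware and OCR-capable. Finally, setting (5) outperforming setting (4) shows that 3D Forensic Learning is indispensable, since it helps the model learn generalizable features for forgery localization.

Table 4: Ablation study on the proposed Forensic Continual Pre-training method. 'Nat.' denotes pre-training the model to identify whether a natural image is real/generated/tampered. '3D-FL' denotes the proposed 3D Forensic Learning approach. 'OCR' denotes including the OCR reference grounding task.

| Num. | Nat. | 3D-FL | OCR | Test Cls. | Test OCR | Test Loc. | Test Res. | CIS Cls. | CIS OCR | CIS Loc. | CIS Res. | CTM Cls. | CTM OCR | CTM Loc. | CTM Res. | CL Cls. | CL OCR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) | × | × | × | 75.8 | 21.9 | 12.7 | 39.0 | 68.4 | 25.0 | 20.9 | 30.6 | 66.3 | 23.7 | 25.5 | 30.1 | 83.9 | 38.5 |
| (2) | ✓ | × | × | 80.9 | 12.7 | 9.8 | 34.0 | 65.1 | 14.3 | 12.4 | 27.5 | 57.6 | 16.9 | 15.1 | 26.2 | 80.1 | 18.6 |
| (3) | ✓ | ✓ | × | 82.3 | 11.5 | 13.9 | 39.2 | 67.7 | 12.0 | 15.8 | 29.8 | 63.6 | 14.7 | 19.1 | 28.4 | 81.5 | 9.9 |
| (4) | ✓ | × | ✓ | 83.2 | 40.9 | 48.6 | 52.7 | 70.4 | 56.8 | 52.0 | 51.4 | 78.6 | 41.1 | 56.1 | 48.3 | 82.5 | 37.8 |
| (5) | ✓ | ✓ | ✓ | 88.1 | 47.6 | 57.8 | 58.8 | 72.9 | 62.1 | 61.0 | 56.5 | 88.8 | 45.6 | 68.3 | 51.2 | 85.5 | 39.0 |

Conclusion

This work introduces TextShield-R1, a new framework that systematically addresses the key challenges of MLLM-based tampered text detection. Forensic Continual Pre-training bridges the gap between general-purpose pre-training and fine-grained forensic analysis through an easy-to-hard curriculum. To reduce reliance on expensive annotations and foster deeper analytical ability, the authors pioneer a reinforcement learning approach guided by novel reward functions. In addition, OCR Rectification addresses low localization accuracy by using the MLLM's own strong text recognition ability to refine its predictions. The authors also build the Text Forensics Reasoning (TFR) benchmark. By combining diverse image domains, modern forgery techniques, and robust cross-style, cross-method, and cross-language test settings, this resource addresses seven major shortcomings of earlier datasets. Extensive experiments confirm that TextShield-R1 significantly advances the state of the art in detection accuracy, generalization, and interpretability. By filling key gaps in both methodology and evaluation, the work lays a solid foundation for more reliable and trustworthy forensic AI systems.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China (Grant No. 62476093) and the Ant Group Research Intern Program.

Original Abstract:

The growing prevalence of tampered images poses serious security threats, highlighting the urgent need for reliable detection methods. Multimodal large language models (MLLMs) demonstrate strong potential in analyzing tampered images and generating interpretations. However, they still struggle with identifying micro-level artifacts, exhibit low accuracy in localizing tampered text regions, and heavily rely on expensive annotations for forgery interpretation. To this end, we introduce TextShield-R1, the first reinforcement learning based MLLM solution for tampered text detection and reasoning. Specifically, our approach introduces Forensic Continual Pre-training, an easy-to-hard curriculum that well prepares the MLLM for tampered text detection by harnessing the large-scale cheap data from natural image forensic and OCR tasks. During fine-tuning, we perform Group Relative Policy Optimization with novel reward functions to reduce annotation dependency and improve reasoning capabilities. At inference time, we enhance localization accuracy via OCR Rectification, a method that leverages the MLLM's strong text recognition abilities to refine its predictions. Furthermore, to support rigorous evaluation, we introduce the Text Forensics Reasoning (TFR) benchmark, comprising over 45k real and tampered images across 16 languages, 10 tampering techniques, and diverse domains. Rich reasoning-style annotations are included, allowing for comprehensive assessment. Our TFR benchmark simultaneously addresses seven major limitations of existing benchmarks and enables robust evaluation under cross-style, cross-method, and cross-language conditions. Extensive experiments demonstrate that TextShield-R1 significantly advances the state of the art in interpretable tampered text detection.

PDF Link: 2602.19828v1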
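The fine-tuning stage relies on Group Relative Policy Optimization, but this summary does not spell out the reward functions or the training loop. As a rough illustration of the "group relative" part only, the sketch below normalizes the rewards of several responses sampled for the same prompt by the group's mean and standard deviation, which is the commonly described GRPO advantage. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from one prompt.
    eps: small constant (assumed) to avoid division by zero when
         every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one image.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses scoring above the group average get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.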

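The 'Loc.' columns in Tables 2 to 4 report IoU for tampered text localization. The exact evaluation protocol (boxes versus masks, handling of multiple regions) is not given in this summary, so the following is only a minimal sketch of IoU for two axis-aligned boxes, assuming an `(x1, y1, x2, y2)` corner format. The function name and box format are illustrative choices.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
    (an assumed format for this example).
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero if disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
score = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Because IoU penalizes both missed area and extra area, a predicted box that is only slightly misaligned with a small text region can score low, which is consistent with the near-zero 'Loc.' values of the models without fine-tuning in Table 2.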