Papers
arxiv:2410.21637

Mitigating Paraphrase Attacks on Machine-Text Detectors via Paraphrase Inversion

Published on Mar 19, 2025
Authors:
,

Abstract

Paraphrase inversion techniques restore original text from paraphrased versions, improving machine-text detector performance through translation-based approaches that generalize across unseen paraphrasing models.

High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models. Although this capability has a variety of benign applications, paraphrasing attacksx2013paraphrases applied to machine-generated textsx2013are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion, where, given paraphrased text, the objective is to recover an approximation of the original text. The closer the approximation is to the original text, the better machine-text detectors will perform. We propose an approach which frames the problem as translation from paraphrased text back to the original text, which requires examples of texts and corresponding paraphrases to train the inversion model. Fortunately, such training data can easily be generated, given a corpus of original texts and one or more paraphrasing models. We find that language models such as GPT-4 and Llama-3 exhibit biases when paraphrasing which an inversion model can learn with a modest amount of data. Perhaps surprisingly, we also find that such models generalize well, including to paraphrase models unseen at training time. Finally, we show that when combined with a paraphrased-text detector, our inversion models provide an effective defense against paraphrasing attacks, and overall our approach yields an average improvement of +22% AUROC across seven machine-text detectors and three different domains.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2410.21637
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.21637 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.21637 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.21637 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.