Reducing the Generation and Propagation of AI-generated Misinformation

By Chris Goh

Background Information

Generative artificial intelligence (GenAI), including large language models like GPT-4 and diffusion models such as Stable Diffusion, has advanced to a point where synthetic text, images, audio, and video can be produced rapidly, at low cost, and with high realism, often indistinguishable from authentic human-created content (Bommasani et al., 2021). This capability has lowered the barrier to creating deceptive media, enabling malicious actors to produce misinformation (Chesney & Citron, 2019; Brundage et al., 2018).

Research Question and Thesis Statement

Research Question: Which technical measures can we implement to meaningfully reduce the generation and propagation of AI-generated misinformation?

Thesis Statement: A multi-layered mitigation strategy combining cryptographic provenance, proactive watermarking, and algorithmic demotion policies offers a grounded solution to secure the information ecosystem against AI-driven misinformation.

Purpose and Scope of the Paper

This paper synthesizes current evidence to construct a coherent framework for mitigating AI-generated misinformation.

This paper focuses on what technical safeguards can be implemented to mitigate the spread of AI disinformation. No original data collection was performed during this study; instead, this paper relies on analysis of existing benchmarks, policy documents, and peer-reviewed literature.

2. Literature Review

Summary of Existing Research

Recent scholarly articles show that generative artificial intelligence (GenAI) significantly lowers the cost, effort, and technical skill required to produce persuasive synthetic content at scale (Brundage et al., 2018). Studies demonstrate that text generated by state-of-the-art large language models (LLMs) is perceived as equally or more credible than human-created content (Gehman et al., 2020; Buchanan et al., 2023). The same holds for diffusion-based image generators.

Efforts to detect AI-generated media have focused on deep learning classifiers such as XceptionNet and Vision Transformers (ViTs). However, these detection systems have been shown to underperform when evaluated on out-of-distribution data or perturbed data, i.e., data subjected to minor changes (Chai et al., 2022; Marra et al., 2023). In contrast, more proactive approaches, such as invisible watermarking and cryptographic provenance, are proving to be more reliable alternatives (Kirchenbauer et al., 2023; C2PA, 2023).

Key Theories, Findings, and Debates

Gaps in the Literature

Firstly, most technical evaluations of GenAI occur in controlled laboratory settings and fail to account for scalability constraints in real-world deployment, such as computational overhead, cross-platform interoperability, and user adoption barriers (Neupane et al., 2024).

Secondly, few studies systematically address the societal consequences of the “liar’s dividend”, a phenomenon wherein bad actors exploit public uncertainty about AI authenticity to dismiss genuine evidence as synthetic (Chesney & Citron, 2019).

This paper addresses these shortcomings by proposing a unified, evidence-based architecture that integrates technically feasible safeguards and a framework for regulatory action.

2. Technical Landscape

Modern GenAI systems exhibit high fidelity, i.e., a high degree of perceptual authenticity of synthetic content compared to genuine, human-created or naturally captured data:

● Text: GPT-4 generates news-style text rated as more trustworthy than human-written articles by 64% of participants in controlled studies [4].

● Images: Stable Diffusion v2.1 produces synthetic faces with Fréchet Inception Distance (FID) scores below 5, approaching photographic realism [5].

● Video/Audio: Open-source tools like Wav2Lip and Resemble.AI enable lip-synced deepfakes and voice cloning with minimal training data [6].

Most importantly, while adversaries can adapt generators to fool classifiers, defenders must generalize to account for all possible attacks. In practice, experimental results have confirmed that the accuracy of even the most effective deepfake classifiers dips below 70% on generators outside their training distribution.

3. Technical Approaches

3.1 Cryptographic Provenance (C2PA)

The Coalition for Content Provenance and Authenticity (C2PA) is a joint initiative by Adobe, Microsoft, Intel, and others to develop an open technical standard for tamper-evident content. This standard embeds cryptographically signed metadata (a “manifest”) that records the origin of digital content, its editing history, and whether generative AI was involved in its creation (C2PA, 2023). Critically, this manifest is designed to survive common format conversions (e.g., JPEG compression, MP4 transcoding) and can be verified independently by platforms or users.
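The tamper-evidence idea can be illustrated with a minimal Python sketch. This is not the C2PA format: real manifests rely on certificate-backed public-key signatures, whereas this toy substitutes an HMAC with a shared secret so that it runs on the standard library alone. The field names (`generator`, `ai_generated`, `content_sha256`), the key, and the helpers `make_manifest` and `verify_manifest` are all illustrative assumptions.

```python
import hashlib
import hmac
import json

# Hypothetical signing key; a real signer would hold a private key
# backed by a certificate instead of a shared secret.
SIGNING_KEY = b"demo-key-not-for-production"


def make_manifest(content: bytes, generator: str, ai_generated: bool) -> dict:
    """Bind provenance claims to the content hash and sign them."""
    claims = {
        "generator": generator,
        "ai_generated": ai_generated,
        "content_sha256": hashlib.sha256(content).hexdigest(),
    }
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "signature": signature}


def verify_manifest(content: bytes, manifest: dict) -> bool:
    """Return True only if both the claims and the content are unmodified."""
    payload = json.dumps(manifest["claims"], sort_keys=True).encode("utf-8")
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False  # claims were edited after signing
    actual_hash = hashlib.sha256(content).hexdigest()
    return actual_hash == manifest["claims"]["content_sha256"]


image = b"\x89PNG...synthetic image bytes..."
manifest = make_manifest(image, generator="example-model", ai_generated=True)
print(verify_manifest(image, manifest))            # True
print(verify_manifest(image + b"edit", manifest))  # False
```

Because the toy hashes exact bytes, any re-encoding also fails verification; it does not attempt to model how an editing history is recorded.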

Adobe’s implementation, Content Credentials, is now integrated into widely used tools including Photoshop, Microsoft Designer, and the Windows 11 Snipping Tool, among others. Empirical user studies conducted by the C2PA and independent researchers show that when visible badges (e.g., “Made with AI” labels) were used to mark such content, participants were 23% less likely to believe false claims presented in synthetic media than when the content was unlabeled (Hussain et al., 2023).

3.2 AI Watermarking

Watermarking offers a proactive alternative to post-hoc detection by embedding statistical or perceptual signals directly into AI-generated outputs during the generation process. Unlike classifiers, watermarking does not require retraining.

Two main approaches have emerged:

SynthID (Google DeepMind, 2023): This system modifies pixel values in AI-generated images using a diffusion-model-aware perturbation strategy. The watermark survives common transformations including JPEG compression (quality ≥70), cropping (up to 20%), and color adjustments with >95% detection accuracy in internal tests (Google DeepMind, 2023). SynthID is being piloted in Google’s Imagen and integrated into Vertex AI.

Stable Signature (Fernandez et al., 2023): Designed specifically for diffusion models like Stable Diffusion, this method injects subtle noise patterns in the latent space during generation. It achieves a 98.5% recovery rate after exposure to Gaussian noise, resizing, and brightness shifts, outperforming prior spatial-domain watermarks (Fernandez et al., 2023).

However, a key limitation to this approach is that such watermarking requires cooperation from model developers and cannot be applied retroactively to open-weight models (e.g., Llama, Stable Diffusion) deployed locally without modification (Kirchenbauer et al., 2023).
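To show what a "statistical signal" embedded at generation time can look like, the following Python sketch implements a toy version of the green-list text watermark described by Kirchenbauer et al. (2023). It is not SynthID or Stable Signature, which operate on images. The simplifications are assumptions made for brevity: tokens are plain integers, the green list is decided by hashing each (previous token, token) pair rather than by partitioning a real vocabulary, and the "generator" always picks a green token instead of softly biasing logits. `KEY`, `GAMMA`, and all function names are hypothetical.

```python
import hashlib
import math

GAMMA = 0.5             # assumed fraction of tokens treated as "green"
KEY = b"watermark-key"  # hypothetical secret shared by generator and detector


def is_green(prev_token: int, token: int) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by `prev_token`."""
    digest = hashlib.sha256(KEY + f"{prev_token}:{token}".encode()).digest()
    return digest[0] / 256.0 < GAMMA


def generate_watermarked(length: int, vocab_size: int = 1000) -> list[int]:
    """Toy 'generator' that always emits the first green token it finds."""
    tokens = [0]
    for _ in range(length):
        prev = tokens[-1]
        # Skip `prev` itself so the sequence does not get stuck repeating.
        nxt = next(t for t in range(vocab_size) if t != prev and is_green(prev, t))
        tokens.append(nxt)
    return tokens


def watermark_z_score(tokens: list[int]) -> float:
    """One-proportion z-test: are there more green tokens than chance predicts?"""
    scored = len(tokens) - 1  # the first token has no predecessor
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


marked = generate_watermarked(50)
print(round(watermark_z_score(marked), 2))  # 7.07: every scored token is green
```

Unwatermarked text would be expected to score near zero, so a detector holding the key can flag a sequence whose z-score exceeds a chosen threshold without access to the generating model.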

3.3 Statistical and Linguistic Anomaly Detection (for Text)

It is well known that AI tends to have its own distinct writing style, such as its overuse of em dashes. By analysing some of its statistical deviations from human writing, we can determine with greater certainty whether a text was written by AI or a human.

DetectGPT, introduced by Mitchell et al. (2023), is a zero-shot method (one requiring no task-specific training examples) for detecting AI-generated text. It does not require training a separate classifier, collecting labeled datasets, or modifying language models. Instead, it uses a fundamental property of how large language models (LLMs) assign probabilities to text: AI-generated passages tend to occupy regions of negative curvature in the model’s log-probability landscape, meaning they sit at local maxima, or “peaks”, where even small perturbations (e.g., rephrasing a sentence) cause a sharp drop in likelihood. In contrast, human-written text is more robust to such changes. DetectGPT operationalizes this by comparing the log-probability of an original passage to that of multiple paraphrased versions (e.g., generated by T5). If the original is significantly more probable, it is flagged as machine-generated. This approach achieves an AUROC of 0.95 on detecting GPT-NeoX–generated fake news, outperforming all prior zero-shot baselines.
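The comparison described above can be sketched in a few lines of Python. This is a minimal illustration of the perturbation-discrepancy idea, not the authors' implementation: the scoring model and the paraphraser are replaced by toy stand-ins (`toy_log_prob`, `toy_perturbations`) so that the example runs without any model, and the word costs are invented for illustration.

```python
from statistics import mean
from typing import Callable, Sequence


def perturbation_discrepancy(
    text: str,
    perturbed_texts: Sequence[str],
    log_prob: Callable[[str], float],
) -> float:
    """log p(text) minus the mean log p of its perturbed variants.

    Large positive values mean the text sits on a local probability peak,
    which the method treats as evidence of machine generation.
    """
    return log_prob(text) - mean(log_prob(p) for p in perturbed_texts)


# --- Toy stand-ins (assumptions, not part of DetectGPT) ------------------
COMMON = {"the", "model", "is", "a", "of", "and", "text", "in"}


def toy_log_prob(text: str) -> float:
    """Pretend scoring model: common words cost 1 nat, filler 3, others 5."""
    cost = {"thing": 3.0}
    return -sum(1.0 if w in COMMON else cost.get(w, 5.0) for w in text.split())


def toy_perturbations(text: str) -> list[str]:
    """Replace each word in turn with a neutral filler (stand-in for rewrites)."""
    words = text.split()
    return [" ".join(words[:i] + ["thing"] + words[i + 1:]) for i in range(len(words))]


machine_like = "the model is a model of the text"
human_like = "zephyr model quixotic a gossamer of the text"

for sample in (machine_like, human_like):
    score = perturbation_discrepancy(sample, toy_perturbations(sample), toy_log_prob)
    print(f"{score:.2f}")  # 2.00 for machine_like, then 0.50 for human_like
```

In the toy, every perturbation of the all-common-words passage lowers its score, while perturbations of the passage containing rare words sometimes raise it, so the discrepancy is smaller. A real detector would obtain log-probabilities from an LLM and choose a decision threshold empirically.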

DetectGPT’s effectiveness comes from its sensitivity to three anomalies that commonly appear in LLM output: low perplexity, unusual burstiness, and over-optimized semantic coherence.

a) Low Perplexity

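Perplexity measures how “surprised” a language model is by a sequence of text. For a sequence of N tokens x_1, ..., x_N scored by a model p_θ, it can be represented by the following equation:

$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$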

Human writing often includes irregularities such as unexpected metaphors, regional idioms and similar expressions, and syntactic quirks. These introduce controlled randomness and so result in higher perplexity. LLMs, however, are optimized to minimize this prediction error by selecting the most probable next token, producing text that is overly fluent, generic, and statistically “smooth”. This leads to low perplexity across AI-generated passages.

Because AI text sits in a narrow probability peak, its log-probability (and thus inverse perplexity) is unusually high compared to perturbed (slightly altered or paraphrased) versions. Buchanan et al. (2023) found that GPT-3.5–generated essays had a mean perplexity of 19.4 (std. dev. = 6.1), while human-written counterparts averaged 42.7 (std. dev. = 18.3).
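As a small worked example, perplexity can be computed directly from per-token log-probabilities using its standard definition (the exponential of the mean negative log-probability). The probabilities below are hypothetical values chosen so the arithmetic is easy to check; they do not come from any model.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities, chosen only for illustration.
predictable = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
surprising = [math.log(p) for p in (0.5, 0.02, 0.5, 0.02)]

print(round(perplexity(predictable), 2))  # 2.0
print(round(perplexity(surprising), 2))   # 10.0
```

Two unexpected tokens out of four raise the perplexity fivefold, which is the kind of gap the perplexity signal relies on.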

b) Unusual Burstiness

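Burstiness is the variance in token perplexity across a document or, in simple terms, the distribution of “non-standard” words or phrases. For a document of N tokens scored by a model p_θ, with s_i = -\log p_θ(x_i | x_{<i}) denoting the surprisal of token i, it can be expressed as the equation below.

$$B(x) = \frac{1}{N}\sum_{i=1}^{N}\left(s_i - \bar{s}\right)^2, \qquad \bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i$$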

Humans use rare terms or jargon sparingly and contextually (e.g., mentioning “quantum decoherence” once in a physics essay), leading to moderately small, natural burstiness patterns. LLMs, however, may overuse rare tokens due to overfitting, prompt-induced repetition (“repeat key terms”), or lack of true understanding, thus resulting in either unnaturally clustered rare words (high burstiness) or monotonous lexical uniformity (unusually low burstiness).

A text with abnormally low burstiness, i.e., one in which all sentences use similarly common and generic words, has low variability in token probability, so the entire text falls in a high-likelihood region. After the text is perturbed, its burstiness changes and its log-probability drops far more sharply than it would for human-written text. GPTZero (Tian, 2023) explicitly uses low perplexity and burstiness as signals but does not account for paraphrased input, which is the advantage of the curvature-based approach.
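Taking this section's definition literally (the variance of per-token surprisal), burstiness can be computed as in the Python sketch below. The use of population variance and the token probabilities are assumptions made for illustration; tools such as GPTZero may define the quantity differently.

```python
import math
from statistics import pvariance


def burstiness(token_log_probs: list[float]) -> float:
    """Population variance of per-token surprisal (negative log-probability)."""
    surprisals = [-lp for lp in token_log_probs]
    return pvariance(surprisals)


# Hypothetical token probabilities, chosen only for illustration.
uniform = [math.log(0.5)] * 4                           # every token equally expected
bursty = [math.log(p) for p in (0.5, 0.02, 0.5, 0.02)]  # rare tokens interleaved

print(burstiness(uniform))       # 0.0
print(burstiness(bursty) > 1.0)  # True
```

The uniform sequence corresponds to the "monotonous lexical uniformity" case: every token is equally expected, so the variance is zero.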

c) Over-Optimized Semantic Coherence

Semantic coherence refers to the topical consistency across a text. While human writing often includes irregularities such as tone shifts or ambiguous phrasing, LLMs are trained to maximise coherence through objectives like next-token prediction and instruction tuning. This produces “hyper-coherent” narratives with unnaturally smooth transitions and perfect adherence to the topic, paradoxically making them less human-like.

Mitchell et al. (2023) note that these over-optimizations create a sharp “peak” in the probability landscape: AI-generated text is so tightly optimized that minor perturbations (e.g., synonym substitution or reordering) drastically reduce its likelihood under the model.

Human text, however, is more “resilient” because it was never optimized for such “probabilistic fluency”. DetectGPT measures this fragility by generating perturbed versions of the input and then checking whether the original is a sharp local maximum. If so, it is likely machine-generated.

Results Table

[Table: results by approach]

Limitations and Challenges

As always, these measures are not foolproof, nor will they always detect AI content.

C2PA

● Metadata stripping: C2PA manifests, stored as file metadata such as XMP information or PNG chunks, are normally stripped during upload to platforms such as Instagram or WhatsApp, or when a screenshot is taken.

● Lack of protection for legacy or unmarked content: This only works for content produced after C2PA adoption. It cannot verify unmarked adversarial synthetic media or legacy content.

● Voluntary implementation: This model relies on the integrity of the creator; malicious creators can modify or delete the manifests without any cryptographic consequences.

AI Watermark

● Only works with closed model cooperation: Watermarking works only when model providers embed watermarks into their inference flows, for example, Google with SynthID. It does not work at all with open-weight models, such as Llama and Stable Diffusion, which are hosted locally.

● Vulnerable to analog attacks: Watermarks are prone to degradation through screen recording, printing/scanning, or video re-encoding, which are common methods of misinformation dissemination.

● False sense of security: The assumption of greater than 95% accuracy in lab settings is based on ideal conditions. In practical transformation scenarios, like TikTok compression, recovery rates may be affected significantly.

Perplexity-based Analysis

● Easily evaded by paraphrasing: According to Buchanan et al. (2023), detection accuracy falls from nearly 85% to below 60% when the LLM is prompted to "rewrite this in your own words."

● High False Positives: Formal human-written text, such as legal documents or academic abstracts, will likely have low perplexity scores, leading to misclassification as machine-generated.

● Model dependency: The scoring model must closely match the LLM that generated the text, and performance drops off badly across different model families.

Burstiness-based Analysis

● Genre-sensitive: Burstiness naturally varies across writing styles, e.g., poetry vs. news writing. Technical or repetitive human writing can resemble the low burstiness of AI text, so burstiness is an unreliable indicator for such genres.

● Not robust on its own: It only works well in combination with perplexity, as in GPTZero, because it does not have much discriminative power by itself.

● Fragile to editing: Minor stylistic changes (e.g., adding emphasis or examples) change token surprisal distribution and break the signal.

Semantic Coherence Analysis

● Computationally expensive: Many perturbed variants are generated per input, making it unsuitable for real-time or high-volume moderation.

● Log-probability access dependent: Requires either the original generator’s logit output or a very similar proxy, which is unavailable for closed APIs such as GPT-4 without special access.

● Breaks under meaningful edits: If a paraphrase is semantically equivalent yet different in form (for example, a rewrite produced by an instruction-tuned LLM), the curvature signal disappears and the method fails to detect the text.

Conclusion

This paper sought to answer the pivotal question: what technical measures show promise in mitigating the generation and propagation of AI-generated misinformation? It did so through a comparative analysis of five detection schemes: cryptographic provenance (C2PA), AI watermarking, perplexity-based analysis, burstiness-based analysis, and semantic coherence analysis. The analysis indicates that proactive methods are considerably more promising than reactive ones.

C2PA and AI watermarking embed verifiable markings at creation time that can still be identified after various transformations or compressions. Statistical approaches, though more accurate in theory, prove very fragile under even minimal adversarial conditions such as paraphrasing and restyling; their accuracy falls below 60% in such cases (Buchanan et al., 2023; Mitchell et al., 2023). This confirms the disparity noted earlier between the rate of advancement of generative models and that of detection systems.

However, all five approaches remain limited: C2PA adoption is voluntary, watermarking fails when open-weight models are run offline, and statistical methods suffer from false positives. Taken together, this suggests that technical solutions alone are not enough.

Therefore, we recommend a multi-faceted approach:

1. Standardizing mandatory watermarking and provenance for commercial AI systems through regulation (e.g., extending the EU AI Act).

2. Developing hybrid verification systems that combine cryptographic metadata with statistical analysis to determine the authenticity of content.

3. Continuing research in this field to develop more statistically robust methods of content detection.
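One way the hybrid verification in recommendation 2 could be structured is sketched below in Python. This is an illustrative assumption, not a tested design: the `Evidence` fields, the verdict labels, and both thresholds are hypothetical placeholders, and the ordering simply reflects this paper's finding that cryptographic and watermark signals are more reliable than statistical ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    manifest_valid: Optional[bool]   # None means no manifest was found
    manifest_says_ai: bool = False
    watermark_z: float = 0.0         # e.g. a watermark detector's z-score
    curvature_score: float = 0.0     # e.g. a DetectGPT-style discrepancy


def classify(e: Evidence, z_threshold: float = 4.0, curve_threshold: float = 1.5) -> str:
    """Consult the strongest available signal first; thresholds are placeholders."""
    if e.manifest_valid is False:
        return "tampered-provenance"
    if e.manifest_valid:
        return "labeled-ai" if e.manifest_says_ai else "verified-origin"
    if e.watermark_z >= z_threshold:
        return "likely-ai (watermark)"
    if e.curvature_score >= curve_threshold:
        return "possibly-ai (statistical)"
    return "unverified"


print(classify(Evidence(manifest_valid=None, watermark_z=7.1)))  # likely-ai (watermark)
print(classify(Evidence(manifest_valid=None)))                   # unverified
```

Statistical evidence is consulted last and yields the weakest verdict, which limits the impact of the false positives discussed under Limitations and Challenges.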

References

[1] Bommasani, R., et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv:2108.07258.

[2] Wardle, C., & Derakhshan, H. (2017). Information Disorder: Toward an interdisciplinary framework for research and policy making. Council of Europe Report.

[3] Wang, R., et al. (2023). Diffusion Models are Undetectable. arXiv:2311.10282.

[4] Clark, E., et al. (2023). Human Preference Evaluation of AI-Generated Text. ACL Findings.

[5] Saharia, C., et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS.

[6] Chung, J.S., et al. (2019). You said that? (Wav2Lip). BMVC.

[7] Coalition for Content Provenance and Authenticity (C2PA). (2023). C2PA Specification v1.2. https://c2pa.org/specifications/

[8] Adobe. (2023). Content Credentials: A New Standard for Digital Authenticity. https://contentauthenticity.org

[9] Hwang, T.J., et al. (2023). Effects of Provenance Labels on Misinformation Belief. Proceedings of the ACM on Human-Computer Interaction (CSCW).

[10] Google DeepMind. (2023). SynthID: Invisible Watermarking for AI-Generated Images. https://deepmind.google/technologies/synthid/

[11] Fernandez, A., et al. (2023). Stable Signature: Stealthy Watermarking for Diffusion Models. arXiv:2311.18800.

[12] Dolhansky, B., et al. (2020). The Deepfake Detection Challenge Dataset. arXiv:2006.07397.

[13] European Parliament. (2024). Regulation on Artificial Intelligence (AI Act). Official Journal of the EU.

[14] Meta. (2023). Labeling AI-Generated Images on Facebook and Instagram. https://about.fb.com/news/2023/09/ai-image-labeling/

[15] TikTok. (2023). AI-Generated Content Policy. https://www.tiktok.com/transparency/en/ai-content/

[16] Diakopoulos, N., & Johnson, I. (2023). Auditing AI-Generated Political Imagery on Social Media. arXiv:2310.09876.

[17] Weidinger, I., et al. (2021). Ethical and Societal Implications of Open Foundation Models. arXiv:2110.08368.