Why Fingerprints Cannot Be Reversed

When an academic institution adds previous student work, theses, or articles to PlagAware's reference library, we create a "fingerprint" of each text. The institution can choose to have only this fingerprint stored in our database, not the actual text.

Key Facts at a Glance

PlagAware stores only mathematical fingerprints, not the original text

Reversing a fingerprint back to the original text is mathematically impossible

Students' and authors' intellectual property remains fully protected

Even in the event of a data breach, no original texts can be reconstructed

What is a Text Fingerprint?

Think of a fingerprint as a unique code that represents a document—similar to how your actual fingerprint represents you but cannot be used to reconstruct your entire body. The fingerprint is much shorter than the original text and contains no readable words or sentences. It serves exclusively to detect matches between texts.

When a new student submission is checked, PlagAware compares it against the stored fingerprints to detect potential plagiarism—without ever needing to store or access the original reference texts.

How Academic Institutions Use PlagAware

The Primary Use Case

Universities, colleges, and research institutions use PlagAware to verify the originality of submitted academic work:

Students submit bachelor theses, master's theses, or term papers
PlagAware scans the submission against online sources and the institution's reference library
Results show matching passages with their sources
Faculty review the report and make academic integrity decisions

Example of a fingerprint

Original Text: "The quick brown fox jumps over the lazy dog. This sentence demonstrates fingerprinting."
↓
Stored Fingerprint: "ju1AE 6l3M"

The Reference Library

Institutions build a reference library over time containing:

Previously submitted student theses
Published articles and papers
Course materials and lecture notes
Any other texts that should be checked against

These reference texts are stored as fingerprints only—protecting the intellectual property of previous students and authors.

Why Fingerprint-Based Storage Matters

The Challenge for Institutions

Academic institutions face a dilemma when building a plagiarism reference database:

Need	Risk with Traditional Storage
Store previous theses for comparison	Students' intellectual property could be exposed
Include published articles	Copyright and licensing concerns
Build comprehensive database	Large collection = larger breach risk
Share across departments	Wider access = more vulnerability

The Fingerprint Solution

What We Store	What This Means
Mathematical fingerprint	Plagiarism can still be detected accurately
NOT the actual text	Original authors' work cannot be read or copied
NOT recoverable content	Even a database breach reveals nothing useful

Key Benefits for Academic Institutions

1. Student Privacy Protection

When you add a student's thesis to the reference library:

The thesis text is not stored—only its fingerprint
No one can read the thesis from the database—not staff, not hackers, not anyone
Graduates' work stays private even as it helps maintain academic integrity

2. No Intellectual Property Concerns

Previous students' work cannot be copied or sold
Published articles in the library cannot be redistributed
No copyright liability from storing third-party content

3. Full Detection Capability

Despite not storing original texts, the system detects:

Exact copying from previous submissions
Paraphrased content from reference materials
Partial matches indicating potential plagiarism
Matches across thousands of documents in milliseconds

4. Compliance and Data Protection

GDPR-friendly: No personal intellectual property is retained
Reduced liability: Cannot leak what you don't have
Easy deletion: Removing a fingerprint leaves no trace of the original

Why the Original Text Cannot Be Recovered

The "Blender" Problem

Imagine putting ingredients into a blender:

You put in an apple, banana, and orange
You get a smoothie
You cannot "un-blend" the smoothie back into the original fruits

The fingerprinting process works similarly. It combines information in a way that cannot be reversed.

Many Words → Same Code

The fingerprint uses a mathematical formula that assigns each word a single character (one of 62 possible: 0-9, A-Z, a-z).

The Math:

Each letter has a numeric value (a=97, b=98, c=99, etc.)
All letter values in a word are added together
The sum is divided by 62, and only the remainder (0 to 61) is kept; it selects one character from the sequence 0-9, A-Z, a-z

Example of a "Collision":

Word	Letter Values	Sum	÷ 62 Remainder	Fingerprint Character
"form"	102+111+114+109	436	2	2
"from"	102+114+111+109	436	2	2
"wort"	119+111+114+116	460	26	Q
"trow"	116+114+111+119	460	26	Q

Notice that "form" and "from" produce the exact same fingerprint character! This is called a "collision."

Key insight: With only 62 possible characters but well over a hundred thousand words in a language, thousands of different words share each fingerprint character on average.
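
The word-to-character step described above can be sketched in a few lines of Python. This is a minimal illustration built only from the formula in this section, not PlagAware's production code; the names ALPHABET and word_to_char are chosen here for illustration.

```python
import string

# 62-character alphabet in the order given above: 0-9, A-Z, a-z
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def word_to_char(word: str) -> str:
    """Map a word to one fingerprint character (sum of letter codes mod 62)."""
    total = sum(ord(letter) for letter in word.lower())
    return ALPHABET[total % len(ALPHABET)]


for word in ("form", "from", "wort", "trow"):
    print(word, "->", word_to_char(word))
# form -> 2
# from -> 2
# wort -> Q
# trow -> Q
```

Because addition ignores letter order, every anagram of a word collides with it, and many unrelated words share the same remainder as well.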

Information is Permanently Lost

The fingerprinting process discards:

All words shorter than 4 letters ("the", "a", "is", "and", etc.)
All punctuation and formatting
Capital letters (everything becomes lowercase)
The actual spelling of words (only a mathematical hash remains)
Numbers and special characters
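
The effect of these discards can be illustrated with a small normalization sketch. It assumes that a "word" is a run of the letters a-z after lowercasing; how the real system treats accented letters, hyphens, or apostrophes is not stated here, and the function name surviving_words is hypothetical.

```python
import re


def surviving_words(text: str) -> list[str]:
    """Return the lowercase words of 4+ letters; everything else is dropped."""
    # Assumption: only runs of a-z count as words, so digits, punctuation,
    # and formatting never reach the hashing step.
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if len(word) >= 4]


a = "In 2023, the Committee REJECTED all of the proposals!"
b = "committee rejected proposals"

print(surviving_words(a))
print(surviving_words(a) == surviving_words(b))  # True
```

Two visibly different sentences reduce to the same word list before any hashing takes place, so the differences between them cannot be recovered from the fingerprint.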

Could Someone Guess the Original Text?

The Scale of Impossibility

Let's calculate how many possible original texts could produce the same fingerprint:

Assumptions:

Average English word length: 5 letters
Words with 4+ letters in English: ~150,000
Words sharing one fingerprint character: ~2,400 (150,000 ÷ 62)

Fingerprint Length	Possible Combinations
1 character	2,400 words
2 characters	5,760,000 combinations
5 characters	8.0 × 10¹⁶ (about 80 quadrillion)
10 characters	6.3 × 10³³ combinations
50 characters	10¹⁶⁹ combinations

A typical academic paper might have a fingerprint of 500+ characters.

To put this in perspective:

There are approximately 10⁸⁰ atoms in the observable universe
A 50-character fingerprint has more possible source texts than atoms in the universe—by a factor of 10⁸⁹
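
The growth of the search space can be recomputed with a short sketch. It works in powers of ten and relies on the assumption stated above of roughly 2,400 candidate words per fingerprint character; the constant and function names are illustrative.

```python
import math

WORDS_PER_CHAR = 2_400  # assumption from above: ~150,000 words / 62 characters
ATOMS_LOG10 = 80        # ~10^80 atoms in the observable universe


def log10_candidates(length: int) -> float:
    """log10 of the number of word sequences matching a fingerprint of this length."""
    return length * math.log10(WORDS_PER_CHAR)


for length in (1, 2, 5, 10, 50, 500):
    print(f"{length:>3} characters: about 10^{log10_candidates(length):.1f} candidates")

excess = log10_candidates(50) - ATOMS_LOG10
print(f"50 characters exceed the atom count by a factor of about 10^{excess:.0f}")
```

Each additional fingerprint character multiplies the number of candidates by about 2,400, which adds roughly 3.4 to the exponent.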

Brute Force Time Estimates

If a supercomputer could check 1 trillion (10¹²) combinations per second:

Fingerprint Length	Time to Check All Possibilities
5 characters	About 22 hours
10 characters	About 200 trillion (2 × 10¹⁴) years
20 characters	About 10⁴⁸ years, roughly 10³⁸ times the age of the universe
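
The time estimate is simply the number of candidates divided by the checking rate. The sketch below recomputes it under this section's assumptions (about 2,400 candidate words per character, 10¹² checks per second); the names are illustrative, and the age of the universe is taken as roughly 13.8 billion years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
WORDS_PER_CHAR = 2_400            # assumed candidates per fingerprint character
CHECKS_PER_SECOND = 10 ** 12      # the hypothetical supercomputer above
AGE_OF_UNIVERSE_YEARS = 1.38e10   # commonly cited estimate


def brute_force_years(length: int) -> float:
    """Years needed to enumerate every candidate word sequence."""
    candidates = WORDS_PER_CHAR ** length  # exact integer arithmetic
    return candidates / CHECKS_PER_SECOND / SECONDS_PER_YEAR


for length in (5, 10, 20):
    years = brute_force_years(length)
    print(f"{length:>2} characters: {years:.2e} years "
          f"({years / AGE_OF_UNIVERSE_YEARS:.1e} x age of the universe)")
```

Short fragments can be enumerated quickly, but enumeration alone does not identify the original: every candidate produces the same fingerprint, so nothing in the stored data distinguishes the true text from the others.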

What About Using AI (ChatGPT, etc.)?

Could an AI Reconstruct the Text?

Modern AI language models are impressive, but they face the same fundamental limitations:

The Collision Problem Remains: Even if an AI knows that fingerprint character "A" maps to some word, it still has ~2,400 candidates. AI cannot know which specific word was used.
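Training Cannot Undo the Loss: AI models learn from examples, but every fingerprint corresponds to an enormous number of equally valid source texts. A model trained on "fingerprint → original text" pairs could at best produce some plausible text with a matching fingerprint, not the specific text that was stored.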
Grammatical Constraints Don't Help Enough: While an AI might generate grammatically correct text, the search space is still impossibly large.

Language Statistics: A Closer Look

Some might argue: "But language isn't random! Certain word combinations are more common."

This is true, but still insufficient:

Using word frequency data from English language corpora:

The 1,000 most common words account for ~70% of typical text
But ~300 of these are 3 letters or fewer (discarded!)
The remaining ~700 words still spread across 62 fingerprint characters
That's still ~11 words per character on average

Even limiting to only common words, a 20-character fingerprint still has:

11²⁰ = 6.7 × 10²⁰ possible combinations
That's 670 quintillion possibilities
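
The common-word estimate can be reproduced directly. The figures (1,000 common words, about 300 of them too short to count) are the ones quoted above; the variable and function names are illustrative.

```python
COMMON_WORDS = 1_000   # most frequent English words (figure from above)
SHORT_WORDS = 300      # of those, ~300 have 3 letters or fewer and are discarded
ALPHABET_SIZE = 62

# About 700 remaining words spread over 62 characters, rounded down
candidates_per_char = (COMMON_WORDS - SHORT_WORDS) // ALPHABET_SIZE


def common_word_candidates(length: int) -> int:
    """Candidate sequences if only common words are allowed."""
    return candidates_per_char ** length


print(candidates_per_char)                  # 11
print(f"{common_word_candidates(20):.1e}")  # 6.7e+20
```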

Practical Security Implications

What This Means for Reference Library Content

Concern	Reality
"Can anyone read stored theses?"	No. Only fingerprints exist—no readable text.
"Can hackers steal previous student work?"	No. There's nothing to steal—only irreversible codes.
"Can essay mills access the database?"	No. Fingerprints cannot be converted back to usable text.
"Are we liable for storing others' IP?"	Minimal. You store mathematical representations, not content.
"What if we're audited or subpoenaed?"	We can only provide fingerprints, which are meaningless without originals.

How This Compares to Other Services

Service Type	What They Store	Risk to Authors
Traditional plagiarism databases	Full text copies	High—texts can be leaked, sold, or misused
Document repositories	Complete documents	Medium—depends on security measures
PlagAware Reference Library	Fingerprints only	None—mathematical impossibility to recover

Summary

How PlagAware Protects Everyone

Stakeholder	How Fingerprints Help
Current students	Their submissions are checked fairly against comprehensive sources
Previous students	Their theses help detect plagiarism without exposing their work
Faculty	Reliable detection without managing sensitive text databases
Institution	Reduced liability, simplified compliance, effective integrity checks
Original authors	Reference articles cannot be extracted or redistributed

The Bottom Line

Reference library content is protected by mathematics, not just policies.

When an institution adds texts to PlagAware's reference library:

Plagiarism detection works accurately against all stored references

Original authors' intellectual property remains completely private

No text can ever be recovered—even by PlagAware

Students, authors, and institutions are all protected

This is not a policy decision that could change—it's a mathematical certainty built into how the system works.

Technical Reference

For those interested in the exact algorithm:

Word → Sum of ASCII values → Modulo 62 → Character (0-9, A-Z, a-z)

Words < 4 characters: Ignored
Punctuation: Creates sentence boundaries (spaces in the fingerprint); short words are dropped without leaving a space
Collision rate: ~2,400 words per character (assuming 150,000 words with 4+ letters)
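
Put together, the steps above can be sketched as a single function. This is an illustration under stated assumptions rather than PlagAware's actual implementation: words are taken to be runs of a-z after lowercasing, and '.', '!' and '?' are assumed to be the sentence boundaries, since the exact boundary characters are not listed here. The name fingerprint is hypothetical.

```python
import re
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 chars
MIN_WORD_LENGTH = 4


def fingerprint(text: str) -> str:
    """One character per word of 4+ letters, one space per sentence break."""
    groups = []
    # Assumption: '.', '!' and '?' end a sentence.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        chars = [
            ALPHABET[sum(map(ord, word)) % len(ALPHABET)]
            for word in words
            if len(word) >= MIN_WORD_LENGTH
        ]
        if chars:  # skip sentences with no qualifying words
            groups.append("".join(chars))
    return " ".join(groups)


text = ("The quick brown fox jumps over the lazy dog. "
        "This sentence demonstrates fingerprinting.")
print(fingerprint(text))  # ju1AE 6l3M
```

Under these assumptions the sketch reproduces the example fingerprint shown earlier in this article.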
