Why Fingerprints Cannot Be Reversed
When an academic institution stores previous student work, theses, or articles in PlagAware's reference library, we create a "fingerprint" of each text. Institutions can choose to have only this fingerprint stored in our database, not the actual text.
Key Facts at a Glance
- PlagAware stores only mathematical fingerprints, not the original text
- Reversing a fingerprint back to the original text is mathematically impossible
- Students' and authors' intellectual property remains fully protected
- Even in case of a data breach, no original texts can be reconstructed
What is a Text Fingerprint?
Think of a fingerprint as a unique code that represents a document—similar to how your actual fingerprint represents you but cannot be used to reconstruct your entire body. The fingerprint is much shorter than the original text and contains no readable words or sentences. It serves exclusively to detect matches between texts.
When a new student submission is checked, PlagAware compares it against the stored fingerprints to detect potential plagiarism—without ever needing to store or access the original reference texts.
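As a rough illustration of comparing fingerprints rather than texts, the sketch below looks for the longest run of characters that two fingerprints share. This page does not describe how PlagAware actually performs the comparison, so the approach (Python's standard `difflib`), the function name `longest_shared_run`, and both fingerprint strings are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def longest_shared_run(stored: str, submitted: str) -> str:
    """Return the longest stretch of characters found in both fingerprints."""
    matcher = SequenceMatcher(None, stored, submitted, autojunk=False)
    match = matcher.find_longest_match(0, len(stored), 0, len(submitted))
    return stored[match.a : match.a + match.size]


# Hypothetical fingerprints: one from the reference library, one from a new submission.
stored = "ju1AE 6l3M"
submitted = "Xk9 ju1AE 6l3M Tq"

print(longest_shared_run(stored, submitted))  # ju1AE 6l3M
```

A long shared run suggests that the two underlying texts contain a similar passage, and the comparison never touches the original reference text.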
How Academic Institutions Use PlagAware
The Primary Use Case
Universities, colleges, and research institutions use PlagAware to verify the originality of submitted academic work:
- Students submit bachelor's theses, master's theses, or term papers
- PlagAware scans the submission against online sources and the institution's reference library
- Results show matching passages with their sources
- Faculty review the report and make academic integrity decisions
Example of a fingerprint
Original Text: "The quick brown fox jumps over the lazy dog. This sentence demonstrates fingerprinting."
↓
Stored Fingerprint: "ju1AE 6l3M"
The Reference Library
Institutions build a reference library over time containing:
- Previously submitted student theses
- Published articles and papers
- Course materials and lecture notes
- Any other texts that should be checked against
These reference texts are stored as fingerprints only—protecting the intellectual property of previous students and authors.
Why Fingerprint-Based Storage Matters
The Challenge for Institutions
Academic institutions face a dilemma when building a plagiarism reference database:
| Need | Risk with Traditional Storage |
|---|---|
| Store previous theses for comparison | Students' intellectual property could be exposed |
| Include published articles | Copyright and licensing concerns |
| Build comprehensive database | Large collection = larger breach risk |
| Share across departments | Wider access = more vulnerability |
The Fingerprint Solution
| What We Store | What This Means |
|---|---|
| Mathematical fingerprint | Plagiarism can still be detected accurately |
| NOT the actual text | Original authors' work cannot be read or copied |
| NOT recoverable content | Even a database breach reveals nothing useful |
Key Benefits for Academic Institutions
1. Student Privacy Protection
When you add a student's thesis to the reference library:
- The thesis text is not stored, only its fingerprint
- No one can read the thesis from the database: not staff, not hackers, not anyone
- Graduates' work stays private even as it helps maintain academic integrity
2. No Intellectual Property Concerns
- Previous students' work cannot be copied or sold
- Published articles in the library cannot be redistributed
- No copyright liability from storing third-party content
3. Full Detection Capability
Despite not storing original texts, the system detects:
- Exact copying from previous submissions
- Paraphrased content from reference materials
- Partial matches indicating potential plagiarism
- Matches across thousands of documents in milliseconds
4. Compliance and Data Protection
- GDPR-friendly: No personal intellectual property is retained
- Reduced liability: Cannot leak what you don't have
- Easy deletion: Removing a fingerprint leaves no trace of the original
Why the Original Text Cannot Be Recovered
The "Blender" Problem
Imagine putting ingredients into a blender:
- You put in an apple, banana, and orange
- You get a smoothie
- You cannot "un-blend" the smoothie back into the original fruits
The fingerprinting process works similarly. It combines information in a way that cannot be reversed.
Many Words → Same Code
The fingerprint uses a mathematical formula that assigns each word a single character (one of 62 possible: 0-9, A-Z, a-z).
The Math:
- Each letter has a numeric value, its ASCII code (a=97, b=98, c=99, etc.)
- All letter values in a word are added together
- The result is divided by 62, and only the remainder is kept
- The remainder (0 to 61) selects the character: 0-9 give the digits, 10-35 give A-Z, and 36-61 give a-z
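The steps above can be sketched in a few lines of Python. This is a minimal illustration, not PlagAware's production code: the names `ALPHABET` and `word_to_char` are made up for this example, and the ordering of the 62 characters (digits, then uppercase, then lowercase letters) is assumed from the character range given above.

```python
import string

# Assumed ordering of the 62 fingerprint characters: 0-9, A-Z, a-z.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def word_to_char(word: str) -> str:
    """Map one word to a single fingerprint character."""
    total = sum(ord(letter) for letter in word.lower())  # add up the letter values
    return ALPHABET[total % 62]                          # keep only the remainder


print(word_to_char("form"))  # 2  (sum 436, remainder 2)
print(word_to_char("wort"))  # Q  (sum 460, remainder 26)
```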
Example of a "Collision":
| Word | Letter Values | Sum | Remainder After ÷ 62 | Fingerprint Character |
|---|---|---|---|---|
| "form" | 102+111+114+109 | 436 | 2 | 2 |
| "from" | 102+114+111+109 | 436 | 2 | 2 |
| "wort" | 119+111+114+116 | 460 | 26 | Q |
| "trow" | 116+114+111+119 | 460 | 26 | Q |
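Notice that "form" and "from" produce the exact same fingerprint character! This is called a "collision." Because the formula only adds the letter values, the order of the letters does not matter, so words made of the same letters always collide.
Key insight: With only 62 possible characters but roughly 150,000 English words of four or more letters, about 2,400 different words share each fingerprint character on average.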
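Collisions are easy to reproduce. The sketch below groups a small, hand-picked word list by fingerprint character; the helper names `word_to_char` and `group_by_char` are hypothetical, and the character ordering (0-9, A-Z, a-z) is assumed. It also shows that collisions are not limited to rearranged letters: "added" has a letter sum of 498, which leaves the same remainder as 436.

```python
import string
from collections import defaultdict

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # assumed order


def word_to_char(word: str) -> str:
    """Sum of letter values, modulo 62, mapped to one character."""
    return ALPHABET[sum(ord(letter) for letter in word.lower()) % len(ALPHABET)]


def group_by_char(words: list[str]) -> dict[str, list[str]]:
    """Collect all words that end up with the same fingerprint character."""
    groups = defaultdict(list)
    for word in words:
        groups[word_to_char(word)].append(word)
    return dict(groups)


groups = group_by_char(["form", "from", "added", "wort", "trow"])
print(groups)  # {'2': ['form', 'from', 'added'], 'Q': ['wort', 'trow']}
```

Given only the character "2", there is no way to tell which of the words in its group was in the original text.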
Information is Permanently Lost
The fingerprinting process discards:
- All words shorter than 4 letters ("the", "a", "is", "and", etc.)
- All punctuation and formatting
- Capital letters (everything becomes lowercase)
- The actual spelling of words (only a mathematical hash remains)
- Numbers and special characters
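To see how much disappears before any hashing happens, here is a minimal sketch of the filtering step. It assumes that a "word" is simply a run of ASCII letters; the function name `kept_words` and the second sample sentence are made up for illustration.

```python
import re

MIN_WORD_LENGTH = 4  # shorter words are dropped, as listed above


def kept_words(text: str) -> list[str]:
    """Return the lowercase words that survive the filtering step."""
    # Assumption: a word is a run of ASCII letters; digits, punctuation
    # and other symbols only separate words and are then discarded.
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if len(word) >= MIN_WORD_LENGTH]


first = kept_words("The quick brown fox jumps over the lazy dog.")
second = kept_words("QUICK brown; jumps over 2 lazy!!")

print(first)            # ['quick', 'brown', 'jumps', 'over', 'lazy']
print(first == second)  # True
```

Two visibly different inputs are already indistinguishable at this stage, before a single word has been reduced to one character.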
Could Someone Guess the Original Text?
The Scale of Impossibility
Let's calculate how many possible original texts could produce the same fingerprint:
Assumptions:
- Average English word length: 5 letters
- Words with 4+ letters in English: ~150,000
- Words sharing one fingerprint character: ~2,400 (150,000 ÷ 62)
| Fingerprint Length | Possible Combinations |
|---|---|
| 1 character | 2,400 words |
| 2 characters | 5,760,000 combinations |
| 5 characters | 8.0 × 10¹⁶ (about 80 quadrillion) |
| 10 characters | 6.3 × 10³³ combinations |
| 50 characters | 10¹⁶⁹ combinations |
A typical academic paper might have a fingerprint of 500+ characters.
To put this in perspective:
- There are approximately 10⁸⁰ atoms in the observable universe
- A 50-character fingerprint has more possible source texts than atoms in the universe, by a factor of 10⁸⁹
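These figures follow directly from the assumption of about 2,400 candidate words per fingerprint character, and can be checked with a short calculation. The function name `candidate_texts` is made up for this sketch, and the atom count is the rough estimate quoted above.

```python
import math

WORDS_PER_CHAR = 2_400          # rounded from 150,000 / 62
ATOMS_IN_UNIVERSE = 10 ** 80    # rough estimate quoted in the text


def candidate_texts(length: int, words_per_char: int = WORDS_PER_CHAR) -> int:
    """Number of word sequences that map to one fingerprint of this length."""
    return words_per_char ** length


for length in (1, 2, 5, 10, 50):
    exponent = math.log10(candidate_texts(length))
    print(f"{length:>2} characters: about 10^{exponent:.1f} candidate texts")

# How many times larger than the atom estimate is the 50-character case?
ratio = candidate_texts(50) // ATOMS_IN_UNIVERSE
print(f"factor over atom estimate: about 10^{math.log10(ratio):.0f}")
```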
Brute Force Time Estimates
If a supercomputer could check 1 trillion (10¹²) combinations per second:
| Fingerprint Length | Time to Check All Possibilities |
|---|---|
| 5 characters | About 22 hours |
| 10 characters | About 200 trillion (2 × 10¹⁴) years |
| 20 characters | About 10⁴⁸ years, far more than 10²⁰ times the age of the universe |
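The time estimate is a simple division: the number of candidate texts divided by the checking rate. The sketch below assumes about 2,400 candidate words per character and a rate of 10¹² checks per second, as stated above; the function name `brute_force_seconds` is made up for illustration.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * 3600
CHECKS_PER_SECOND = 10 ** 12  # hypothetical supercomputer from the text


def brute_force_seconds(length: int, words_per_char: int = 2_400) -> float:
    """Seconds needed to enumerate every candidate text for one fingerprint."""
    return words_per_char ** length / CHECKS_PER_SECOND


hours_5 = brute_force_seconds(5) / SECONDS_PER_HOUR
years_10 = brute_force_seconds(10) / SECONDS_PER_YEAR
years_20 = brute_force_seconds(20) / SECONDS_PER_YEAR

print(f" 5 characters: {hours_5:.0f} hours")
print(f"10 characters: {years_10:.1e} years")
print(f"20 characters: {years_20:.1e} years")
```

Note that enumerating candidates is not the same as recovering the text: every candidate produces the identical fingerprint, so nothing tells an attacker which one is the original.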
What About Using AI (ChatGPT, etc.)?
Could an AI Reconstruct the Text?
Modern AI language models are impressive, but they face the same fundamental limitations:
- The Collision Problem Remains: Even if an AI knows that fingerprint character "A" maps to some word, it still has ~2,400 candidates. AI cannot know which specific word was used.
- No Training Data Exists: AI models learn from examples. Since fingerprints are designed to be irreversible, there is no training data of "fingerprint → original text" pairs.
- Grammatical Constraints Don't Help Enough: While an AI might generate grammatically correct text, the search space is still impossibly large.
Language Statistics: A Closer Look
Some might argue: "But language isn't random! Certain word combinations are more common."
This is true, but still insufficient:
Using word frequency data from English language corpora:
- The 1,000 most common words account for ~70% of typical text
- But ~300 of these are 3 letters or fewer (discarded!)
- The remaining ~700 words still spread across 62 fingerprint characters
- That's still ~11 words per character on average
Even limiting to only common words, a 20-character fingerprint still has:
- 11²⁰ = 6.7 × 10²⁰ possible combinations
- That's 670 quintillion possibilities
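The common-words estimate can be reproduced the same way. The word counts below are the approximate figures quoted above, not measured values, and the variable names are made up for this sketch.

```python
COMMON_WORDS = 1_000        # most frequent English words (approximate)
SHORT_COMMON_WORDS = 300    # of those, about this many have 3 letters or fewer
FINGERPRINT_CHARS = 62

remaining = COMMON_WORDS - SHORT_COMMON_WORDS   # words that are not discarded
per_char = remaining / FINGERPRINT_CHARS        # average candidates per character
combinations = round(per_char) ** 20            # 20-character fingerprint

print(f"{remaining} words, about {per_char:.1f} per character")
print(f"{combinations:.1e} combinations for 20 characters")
```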
Practical Security Implications
What This Means for Reference Library Content
| Concern | Reality |
|---|---|
| "Can anyone read stored theses?" | No. Only fingerprints exist—no readable text. |
| "Can hackers steal previous student work?" | No. There's nothing to steal—only irreversible codes. |
| "Can essay mills access the database?" | No. Fingerprints cannot be converted back to usable text. |
| "Are we liable for storing others' IP?" | Minimal. You store mathematical representations, not content. |
| "What if we're audited or subpoenaed?" | We can only provide fingerprints, which are meaningless without originals. |
How This Compares to Other Services
| Service Type | What They Store | Risk to Authors |
|---|---|---|
| Traditional plagiarism databases | Full text copies | High—texts can be leaked, sold, or misused |
| Document repositories | Complete documents | Medium—depends on security measures |
| PlagAware Reference Library | Fingerprints only | None—mathematical impossibility to recover |
Summary
How PlagAware Protects Everyone
| Stakeholder | How Fingerprints Help |
|---|---|
| Current students | Their submissions are checked fairly against comprehensive sources |
| Previous students | Their theses help detect plagiarism without exposing their work |
| Faculty | Reliable detection without managing sensitive text databases |
| Institution | Reduced liability, simplified compliance, effective integrity checks |
| Original authors | Reference articles cannot be extracted or redistributed |
The Bottom Line
Reference library content is protected by mathematics, not just policies.
When an institution adds texts to PlagAware's reference library:
- Plagiarism detection works accurately against all stored references
- Original authors' intellectual property remains completely private
- No text can ever be recovered, even by PlagAware
- Students, authors, and institutions are all protected
This is not a policy decision that could change—it's a mathematical certainty built into how the system works.
Technical Reference
For those interested in the exact algorithm:
Word → Sum of ASCII values → Modulo 62 → Character (0-9, A-Z, a-z)
- Words < 4 characters: Ignored
- Punctuation/short words: Create sentence boundaries (spaces in fingerprint)
- Collision rate: ~2,400 words per character (assuming 150,000 words with 4+ letters)
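Putting the pieces together, the sketch below is one possible reading of the pipeline described on this page, not PlagAware's actual implementation. It assumes that words are runs of ASCII letters, that the 62 characters are ordered 0-9, A-Z, a-z, and that only sentence-ending punctuation (`.`, `!`, `?`) produces a space in the fingerprint while short words are simply dropped. The function name `fingerprint` is made up. Under these assumptions it reproduces the sample fingerprint shown earlier on this page.

```python
import re
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # assumed order
MIN_WORD_LENGTH = 4


def fingerprint(text: str) -> str:
    """Reduce a text to one character per kept word, one group per sentence."""
    groups = []
    # Assumption: sentences end at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Keep only runs of letters that are at least four letters long.
        words = [w for w in re.findall(r"[a-z]+", sentence) if len(w) >= MIN_WORD_LENGTH]
        if words:
            # Sum of ASCII values, modulo 62, mapped to one character per word.
            groups.append("".join(ALPHABET[sum(map(ord, w)) % 62] for w in words))
    return " ".join(groups)


sample = "The quick brown fox jumps over the lazy dog. This sentence demonstrates fingerprinting."
print(fingerprint(sample))  # ju1AE 6l3M
```

The function only runs in one direction: nothing in the output records which of the many colliding words produced each character.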
