Why Fingerprints Cannot Be Reversed

When an academic institution stores previous student work, theses, or articles in PlagAware's reference library, we create a "fingerprint" of each text. Institutions can choose to have only this fingerprint stored in our database, not the actual text.

Key Facts at a Glance

  • PlagAware stores only mathematical fingerprints, not the original text
  • Reversing a fingerprint back to the original text is mathematically impossible
  • Students' and authors' intellectual property remains fully protected
  • Even in case of a data breach, no original texts can be reconstructed

What is a Text Fingerprint?

Think of a fingerprint as a unique code that represents a document—similar to how your actual fingerprint represents you but cannot be used to reconstruct your entire body. The fingerprint is much shorter than the original text and contains no readable words or sentences. It serves exclusively to detect matches between texts.

When a new student submission is checked, PlagAware compares it against the stored fingerprints to detect potential plagiarism—without ever needing to store or access the original reference texts.

How Academic Institutions Use PlagAware

The Primary Use Case

Universities, colleges, and research institutions use PlagAware to verify the originality of submitted academic work:

  • Students submit – bachelor's theses, master's theses, or term papers
  • PlagAware scans – the submission against online sources and the institution's reference library
  • Results show – matching passages with their sources
  • Faculty reviews – the report and makes academic integrity decisions

Example of a fingerprint

Original Text: "The quick brown fox jumps over the lazy dog. This sentence demonstrates fingerprinting."
↓
Stored Fingerprint: "ju1AE 6l3M"

The Reference Library

Institutions build a reference library over time containing:

  • Previously submitted student theses
  • Published articles and papers
  • Course materials and lecture notes
  • Any other texts that should be checked against

These reference texts are stored as fingerprints only—protecting the intellectual property of previous students and authors.

Why Fingerprint-Based Storage Matters

The Challenge for Institutions

Academic institutions face a dilemma when building a plagiarism reference database:

Need | Risk with Traditional Storage
Store previous theses for comparison | Students' intellectual property could be exposed
Include published articles | Copyright and licensing concerns
Build comprehensive database | Large collection = larger breach risk
Share across departments | Wider access = more vulnerability

The Fingerprint Solution

What We Store | What This Means
Mathematical fingerprint | Plagiarism can still be detected accurately
NOT the actual text | Original authors' work cannot be read or copied
NOT recoverable content | Even a database breach reveals nothing useful

Key Benefits for Academic Institutions

1. Student Privacy Protection

When you add a student's thesis to the reference library:

  • The thesis text is not stored, only its fingerprint
  • No one can read the thesis from the database: not staff, not hackers, not anyone
  • Graduates' work stays private even as it helps maintain academic integrity

2. No Intellectual Property Concerns

  • Previous students' work cannot be copied or sold
  • Published articles in the library cannot be redistributed
  • No copyright liability from storing third-party content

3. Full Detection Capability

Despite not storing original texts, the system detects:

  • Exact copying from previous submissions
  • Paraphrased content from reference materials
  • Partial matches indicating potential plagiarism
  • Matches across thousands of documents in milliseconds
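
This article does not describe how PlagAware compares fingerprints internally. Purely as an illustration of how partial matches could surface from fingerprints alone, the sketch below looks for short character runs shared by two fingerprints. The function name `shared_runs`, the run length `k`, and the second fingerprint are hypothetical; the first fingerprint is the one from the example earlier in this article.

```python
def shared_runs(stored: str, submitted: str, k: int = 4) -> set[str]:
    """Return the k-character runs that occur in both fingerprints."""
    def runs(fp: str) -> set[str]:
        # Every contiguous window of k characters in the fingerprint
        return {fp[i:i + k] for i in range(len(fp) - k + 1)}
    return runs(stored) & runs(submitted)

stored = "ju1AE 6l3M"       # fingerprint from the example in this article
submitted = "Qx ju1AE 7b"   # hypothetical fingerprint of a new submission

print(sorted(shared_runs(stored, submitted)))
```

Overlapping runs flag a passage for review without either side holding readable text. A production system would need a more robust comparison than this toy window match.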

4. Compliance and Data Protection

  • GDPR-friendly: No personal intellectual property is retained
  • Reduced liability: Cannot leak what you don't have
  • Easy deletion: Removing a fingerprint leaves no trace of the original

Why the Original Text Cannot Be Recovered

The "Blender" Problem

Imagine putting ingredients into a blender:

  • You put in an apple, banana, and orange
  • You get a smoothie
  • You cannot "un-blend" the smoothie back into the original fruits

The fingerprinting process works similarly. It combines information in a way that cannot be reversed.

Many Words → Same Code

The fingerprint uses a mathematical formula that assigns each word a single character (one of 62 possible: 0-9, A-Z, a-z).

The Math:

  • Each letter has a numeric value, its ASCII code (a=97, b=98, c=99, etc.)
  • All letter values in a word are added together
  • The result is divided by 62, and only the remainder (0 to 61) is kept; it selects one of the 62 characters
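
The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not PlagAware's production code; the names `ALPHABET` and `word_to_char` are made up for this example, and the character order (0-9, then A-Z, then a-z) is taken from the description above.

```python
import string

# 62-symbol alphabet in the order given in the text: 0-9, A-Z, a-z
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def word_to_char(word: str) -> str:
    """Map one word to a single fingerprint character."""
    total = sum(ord(ch) for ch in word.lower())  # add up the letter values
    return ALPHABET[total % 62]                  # keep only the remainder

print(word_to_char("form"))  # 436 % 62 == 2  -> "2"
print(word_to_char("wort"))  # 460 % 62 == 26 -> "Q"
```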

Example of a "Collision":

Word | Letter Values | Sum | Remainder (Sum ÷ 62) | Fingerprint Character
"form" | 102+111+114+109 | 436 | 2 | 2
"from" | 102+114+111+109 | 436 | 2 | 2
"wort" | 119+111+114+116 | 460 | 26 | Q
"trow" | 116+114+111+119 | 460 | 26 | Q

Notice that "form" and "from" produce the exact same fingerprint character! This is called a "collision."
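
Collisions like these are easy to observe. Because addition ignores letter order, any two anagrams are guaranteed to collide, and unrelated words collide as well once the word list grows. The sketch below groups a tiny, hypothetical word list by fingerprint character; the helper `word_to_char` is an illustrative name, not an actual PlagAware function.

```python
import string
from collections import defaultdict

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def word_to_char(word: str) -> str:
    # Sum of ASCII values, reduced modulo 62, mapped to 0-9, A-Z, a-z
    return ALPHABET[sum(ord(ch) for ch in word.lower()) % 62]

# Hypothetical mini word list, chosen only for illustration
words = ["form", "from", "wort", "trow", "quick", "brown"]

buckets = defaultdict(list)
for w in words:
    buckets[word_to_char(w)].append(w)

for char, group in sorted(buckets.items()):
    print(char, group)
```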

Key insight: With only 62 possible characters but well over 100,000 words in a language, thousands of different words share each fingerprint character on average.

Information is Permanently Lost

The fingerprinting process discards:

  • All words shorter than 4 letters ("the", "a", "is", "and", etc.)
  • All punctuation and formatting
  • Capital letters (everything becomes lowercase)
  • The actual spelling of words (only a mathematical hash remains)
  • Numbers and special characters
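
To make the loss concrete, the following sketch shows which words of a sentence would still contribute to a fingerprint. The tokenization (keeping only runs of the letters a-z) is an assumption made for this illustration, and `surviving_words` is a hypothetical helper name.

```python
import re

def surviving_words(text: str) -> list[str]:
    """Return the words that would still contribute to a fingerprint."""
    # Lowercase first, then keep only runs of a-z; digits, punctuation,
    # and formatting fall away (assumed tokenization, not confirmed).
    candidates = re.findall(r"[a-z]+", text.lower())
    return [w for w in candidates if len(w) >= 4]  # drop words under 4 letters

print(surviving_words("The 3 Quick foxes, and a dog: 42 jumps!"))
# ['quick', 'foxes', 'jumps']
```

Everything else in the sentence is gone before hashing even starts, and each surviving word is then reduced to a single character.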

Could Someone Guess the Original Text?

The Scale of Impossibility

Let's calculate how many possible original texts could produce the same fingerprint:

Assumptions:

  • Average English word length: 5 letters
  • Words with 4+ letters in English: ~150,000
  • Words sharing one fingerprint character: ~2,400 (150,000 ÷ 62)

Fingerprint Length | Possible Combinations
1 character | 2,400 words
2 characters | 5,760,000 combinations
5 characters | 8.0 × 10¹⁶ (about 80 quadrillion)
10 characters | 6.3 × 10³³ combinations
50 characters | about 10¹⁶⁹ combinations

A typical academic paper might have a fingerprint of 500+ characters.

To put this in perspective:

  • There are approximately 10⁸⁰ atoms in the observable universe
  • A 50-character fingerprint has more possible source texts than there are atoms in the universe, by a factor of 10⁸⁹
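
These orders of magnitude can be re-derived with a few lines of arithmetic. The sketch uses the article's own assumption of about 2,400 candidate words per fingerprint character; the constant and function names are illustrative.

```python
import math

PER_CHAR = 2_400      # candidate words per fingerprint character (150,000 / 62, rounded)
ATOMS_LOG10 = 80      # roughly 10^80 atoms in the observable universe

def candidates_for(length: int) -> int:
    """Number of word sequences that share one fingerprint of this length."""
    return PER_CHAR ** length

for length in (1, 2, 5, 10, 50):
    exponent = math.log10(candidates_for(length))
    print(f"{length:>2} characters: about 10^{exponent:.1f} candidate texts")

excess = math.log10(candidates_for(50)) - ATOMS_LOG10
print(f"A 50-character fingerprint exceeds the atom count by about 10^{excess:.0f}")
```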

Brute Force Time Estimates

If a supercomputer could check 1 trillion (10¹²) combinations per second:

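Fingerprint Length | Time to Check All Possibilities
5 characters | About 22 hours
10 characters | About 2 × 10¹⁴ years (200 trillion years)
20 characters | About 10⁴⁸ years, roughly 10³⁸ times the age of the universe

Enumeration alone also does not reveal which candidate is the original, because every candidate produces the same fingerprint.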
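
The arithmetic behind such estimates is simple: divide the number of candidate texts by the checking rate. The sketch below assumes the article's figures of about 2,400 candidate words per character and 10¹² checks per second; `years_to_enumerate` is an illustrative name.

```python
PER_CHAR = 2_400                        # candidate words per fingerprint character
RATE = 10**12                           # assumed checks per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_enumerate(length: int) -> float:
    """Years needed to list every candidate text for a fingerprint length."""
    return PER_CHAR ** length / RATE / SECONDS_PER_YEAR

for length in (5, 10, 20):
    print(f"{length:>2} characters: {years_to_enumerate(length):.1e} years")
```

The time grows by a factor of 2,400 with every additional fingerprint character, so realistic document lengths are far out of reach.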

What About Using AI (ChatGPT, etc.)?

Could an AI Reconstruct the Text?

Modern AI language models are impressive, but they face the same fundamental limitations:

  • The Collision Problem Remains: Even if an AI knows that fingerprint character "A" maps to some word, it still has ~2,400 candidates. AI cannot know which specific word was used.
  • No Training Data Exists: AI models learn from examples. Since fingerprints are designed to be irreversible, there is no training data of "fingerprint → original text" pairs.
  • Grammatical Constraints Don't Help Enough: While an AI might generate grammatically correct text, the search space is still impossibly large.

Language Statistics: A Closer Look

Some might argue: "But language isn't random! Certain word combinations are more common."

This is true, but still insufficient:

Using word frequency data from English language corpora:

  • The 1,000 most common words account for ~70% of typical text
  • But ~300 of these are 3 letters or fewer (discarded!)
  • The remaining ~700 words still spread across 62 fingerprint characters
  • That's still ~11 words per character on average

Even limiting to only common words, a 20-character fingerprint still has:

  • 11²⁰ = 6.7 × 10²⁰ possible combinations
  • That's 670 quintillion possibilities
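
The same estimate can be checked in a few lines. The input figures (1,000 common words, about 300 of them too short) come from the list above; the variable names are illustrative.

```python
COMMON_WORDS = 1_000
SHORT_WORDS = 300        # roughly 300 have 3 letters or fewer (figure from the text)
CHARS = 62

remaining = COMMON_WORDS - SHORT_WORDS   # 700 words survive the length filter
per_char = remaining // CHARS            # 11 (integer part of 700 / 62)
combos = per_char ** 20                  # candidates for a 20-character fingerprint

print(per_char, f"{combos:.1e}")         # 11 6.7e+20
```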

Practical Security Implications

What This Means for Reference Library Content

Concern | Reality
"Can anyone read stored theses?" | No. Only fingerprints exist; there is no readable text.
"Can hackers steal previous student work?" | No. There's nothing to steal, only irreversible codes.
"Can essay mills access the database?" | No. Fingerprints cannot be converted back to usable text.
"Are we liable for storing others' IP?" | Minimal. You store mathematical representations, not content.
"What if we're audited or subpoenaed?" | We can only provide fingerprints, which are meaningless without originals.

How This Compares to Other Services

Service Type | What They Store | Risk to Authors
Traditional plagiarism databases | Full text copies | High: texts can be leaked, sold, or misused
Document repositories | Complete documents | Medium: depends on security measures
PlagAware Reference Library | Fingerprints only | None: mathematical impossibility to recover

Summary

How PlagAware Protects Everyone

Stakeholder | How Fingerprints Help
Current students | Their submissions are checked fairly against comprehensive sources
Previous students | Their theses help detect plagiarism without exposing their work
Faculty | Reliable detection without managing sensitive text databases
Institution | Reduced liability, simplified compliance, effective integrity checks
Original authors | Reference articles cannot be extracted or redistributed

The Bottom Line

Reference library content is protected by mathematics, not just policies.

When an institution adds texts to PlagAware's reference library:

  • Plagiarism detection works accurately against all stored references
  • Original authors' intellectual property remains completely private
  • No text can ever be recovered, even by PlagAware
  • Students, authors, and institutions are all protected

This is not a policy decision that could change—it's a mathematical certainty built into how the system works.

Technical Reference

For those interested in the exact algorithm:

Word → Sum of ASCII values → Modulo 62 → Character (0-9, A-Z, a-z)
  • Words < 4 characters: Ignored
  • Punctuation/short words: Create sentence boundaries (spaces in fingerprint)
  • Collision rate: ~2,400 words per character (assuming 150,000 words with 4+ letters)
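
Put together, the pipeline can be sketched end to end. This is a minimal reading of the description above, not PlagAware's actual implementation: the function name `fingerprint` is hypothetical, and the sketch assumes that only sentence-ending punctuation (".", "!", "?") produces a space, which is what the example fingerprint earlier in this article suggests.

```python
import re
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def fingerprint(text: str) -> str:
    """Sketch of the pipeline: word -> ASCII sum -> modulo 62 -> character."""
    groups = []
    # Assumption: '.', '!' and '?' end a sentence; each sentence becomes one group.
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Keep only letter runs of 4+ characters; shorter words are ignored.
        words = [w for w in re.findall(r"[a-z]+", sentence) if len(w) >= 4]
        if words:
            groups.append("".join(ALPHABET[sum(map(ord, w)) % 62] for w in words))
    return " ".join(groups)

sample = ("The quick brown fox jumps over the lazy dog. "
          "This sentence demonstrates fingerprinting.")
print(fingerprint(sample))  # ju1AE 6l3M
```

Under these assumptions the sketch reproduces the example fingerprint shown earlier: nine words of four or more letters become nine characters, with one space at the sentence boundary.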
