Derek Earnhart · April 2026 research package

The Voynich Manuscript as a generator-class discrimination problem.

This project does not start by claiming a translation. It asks a more testable question: does Voynichese behave like one generator, several generators, or a layout-driven production system? The framework turns EVA transcription into STA token families, transition graphs, entropy surfaces, and Hodge-style flow features that can succeed or fail under controls.

Core frame: discriminate the generator first, rather than trying to “translate the Voynich.” If the manuscript was produced by different regimes, the transition graph should expose structural differences across Currier language, hand, and section. Zandbergen’s RZ/Currier language work and the original Currier papers serve as the public source anchors for those labels. The plan of record is Action Plan v1: rigorous, falsifiable, and explicit about null results.

How the work unfolded

From “can it be decoded?” to “what kind of machine could have produced this?”

The project became credible when it stopped trying to force plaintext out of the manuscript and started measuring structure. Each layer below is designed to preserve evidence while reducing room for wishful interpretation.

01

Start with the transcription, not a theory

The pipeline uses the Landini–Stolfi IVTFF archive and filters to Takahashi ;H lines: 5,207 parsed lines across 225 pages. That textual substrate is credited against Zandbergen’s public transliteration and IVTFF documentation before any modeling begins.
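
The filtering step can be sketched roughly as follows. This is a minimal illustration, not the project's parser: it assumes locus tags of the form `<page.locus;T>` where the trailing letter is the transcriber code, and the function name `parse_transcriber_lines` and the sample lines are hypothetical. Real IVTFF files carry inline comments and alternate readings that this sketch does not clean.

```python
import re
from collections import defaultdict

# Assumed locus-tag shape: <page.locus;T> followed by the text,
# e.g. "<f1r.1,@P0;H>  fachys.ykal.ar". Page headers such as "<f1r>" do not match.
LOCUS = re.compile(
    r"^<(?P<page>[^.>]+)\.(?P<locus>[^;>]+);(?P<code>\w)>\s*(?P<text>.*)$"
)

def parse_transcriber_lines(lines, code="H"):
    """Return {page: [text, ...]} for loci tagged with the given transcriber code."""
    pages = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):  # skip file-level comment lines
            continue
        match = LOCUS.match(line)
        if match and match.group("code") == code:
            pages[match.group("page")].append(match.group("text"))
    return dict(pages)

# Illustrative sample only; not copied from the archive.
sample = [
    "# sample comment",
    "<f1r>  <! $Q=A $P=A>",
    "<f1r.1,@P0;H>  fachys.ykal.ar.ataiin",
    "<f1r.1,@P0;C>  fachys.ykal.ar.ataiin",
    "<f1r.2,+P0;H>  sory.ckhar.ory",
    "<f1v.1,@P0;H>  kchsy.chadaiin.ol",
]
parsed = parse_transcriber_lines(sample)
```

Counting `len(parsed)` and the total number of kept lines gives the page and line totals that a run on the full archive would be checked against.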

02

EVA becomes STA

EVA is good for transcription, but analysis benefits from families. The Super Transliteration Alphabet is René Zandbergen’s public processing alphabet; this framework uses STA-style family/member codes — q-like prefixes, gallows, vowel clusters, and terminal forms — to test structure without pretending we know phonetics.
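
To make the idea of family coding concrete, here is a deliberately coarse sketch. The family labels (`Q`, `G`, `V`, `T`, `X`) and the glyph groupings are hypothetical stand-ins chosen for illustration; they are not Zandbergen's STA codes, which should be taken from his published definition.

```python
# Hypothetical coarse families keyed on EVA glyphs. These are NOT STA codes.
GALLOWS = set("tkpf")        # EVA gallows glyphs
VOWEL_LIKE = set("aoeiy")    # assumed "vowel-like" shapes, no phonetic claim

def glyph_family(token, i):
    """Classify the glyph at position i of an EVA token into a coarse family."""
    ch = token[i]
    if ch == "q" and i == 0:
        return "Q"                       # q-like prefix
    if ch in GALLOWS:
        return "G"                       # gallows
    if ch in VOWEL_LIKE:
        # A word-final y is treated as a terminal form rather than a vowel.
        return "T" if (ch == "y" and i == len(token) - 1) else "V"
    return "X"                           # everything else

def family_signature(token):
    """Collapse a token into its sequence of families, merging repeated runs."""
    signature = []
    for i in range(len(token)):
        family = glyph_family(token, i)
        if not signature or signature[-1] != family:
            signature.append(family)
    return "".join(signature)
```

With these assumed groupings, `family_signature("qokedy")` gives `"QVGVXT"` and `family_signature("daiin")` gives `"XVX"`, so tokens can be compared by shape class without any claim about sound.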

03

Words become graph flow

Tokens are treated as nodes and adjacent token transitions as edges. This turns a page, paragraph, or full manuscript into a transition system: what follows what, how often, how constrained, and how different the regimes are.
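
A minimal sketch of that construction, assuming tokens within a line are separated by `.` and that transitions are counted within a line only (whether the project also links the last token of one line to the first of the next is not stated here). The function names are illustrative.

```python
from collections import Counter, defaultdict

def transition_counts(lines, sep="."):
    """Count adjacent token pairs (a -> b) within each line."""
    edges = Counter()
    for line in lines:
        tokens = [t for t in line.split(sep) if t]
        edges.update(zip(tokens, tokens[1:]))
    return edges

def successors(edges):
    """Group edge counts by source token: {a: Counter({b: n, ...})}."""
    by_source = defaultdict(Counter)
    for (a, b), n in edges.items():
        by_source[a][b] += n
    return by_source

lines = ["daiin.ol.chedy", "ol.chedy.daiin.ol"]
edges = transition_counts(lines)
succ = successors(edges)
```

The same two functions apply unchanged to a page, a paragraph, or the full manuscript; only the list of lines passed in changes.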

04

Hodge enters as a filter

Hodge-style graph analysis splits the transition flow into a circulating part (structured recurrence) and a gradient-like part (directional drift). In plain English: it asks whether the text has stable circulating patterns that survive when you compare sections, hands, and controls.

05

High-dimensional features, then projection

The framework can lift token windows into high-dimensional feature space — STA families, positions, transitions, entropy, graph features, and embedding modes — then project them back into interpretable measurements. The point is not mysticism; it is controlled compression.

06

Claims must survive holdout

Any candidate “meaning” or key has to generalize across folios, sections, scribal hands, and image context. If it only works after patching rules page-by-page, it fails.

Pipeline

From transcription to topology

  1. Parse Takahashi ;H lines from the Landini–Stolfi IVTFF archive, tied back to public transliteration documentation.
  2. Normalize EVA into Zandbergen-cited STA token families without semantic translation claims.
  3. Build token-transition graphs and successor distributions.
  4. Measure entropy, circuit rank, retention, out-degree, and role asymmetry.
  5. Run BEH tests, matched controls, and the period-17 spectral candidate.
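
Some of the step-4 measurements can be sketched under stated assumptions. Here "successor entropy" is taken to be the conditional entropy of the next token given the current one, circuit rank is E − V + C on the undirected simple graph with self-loops ignored, and mean out-degree averages distinct successors over all nodes. The project's exact definitions may differ, and retention and role asymmetry are not sketched because they are not defined in this text.

```python
import math
from collections import Counter, defaultdict

def shannon_entropy(counts):
    """Shannon entropy in bits of a {item: count} mapping."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

def successor_entropy(edges):
    """Assumed definition: H(next | current), weighted by source frequency."""
    by_source = defaultdict(Counter)
    for (a, b), n in edges.items():
        by_source[a][b] += n
    total = sum(edges.values())
    return sum(sum(c.values()) / total * shannon_entropy(c) for c in by_source.values())

def circuit_rank(edges):
    """E - V + C on the undirected simple graph underlying the edges."""
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = {v for e in edges for v in e}
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for pair in undirected:
        a, b = tuple(pair)
        parent[find(a)] = find(b)
    components = len({find(v) for v in nodes})
    return len(undirected) - len(nodes) + components

def mean_out_degree(edges):
    """Average number of distinct successors per node."""
    out = defaultdict(set)
    for a, b in edges:
        out[a].add(b)
    nodes = {v for e in edges for v in e}
    return sum(len(s) for s in out.values()) / len(nodes)

toy = Counter({("a", "b"): 2, ("b", "c"): 1, ("c", "a"): 1, ("a", "c"): 1})
```

On the toy graph, the three nodes form a single triangle, so the circuit rank is 1 and the mean out-degree is 4/3.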

Falsifiability

BEH was revised, not rescued.

The original one-sided prediction, that Currier A / Hand 1 would have higher entropy than B / 2, failed. The research frame records that failure and replaces the prediction with a two-sided BEH that is still falsifiable: the strata differ significantly, with the direction determined empirically.

Observed direction: Currier B and Hand 2 have higher per-unit entropy.

The period-17 candidate

A pattern worth testing, not a conclusion.

One thread of the work noticed possible periodic behavior near 17 in position-based sequences. The right response is not to declare a calendar key or hidden cipher. The right response is a pre-registered spectral test.

What gets measured

Tokens per line, first/last token families, gallows presence, line-initial and line-final markers, and interword-length sequences — per folio and concatenated.

What would count as evidence

A stable peak near period 17 with bootstrap confidence intervals and effect size above neighborhood power, surviving transliteration swaps and appropriate folio-level controls.

What would weaken it

If reflowed line breaks preserve the signal, it may be a token-frequency artifact. If page-order randomization destroys it, it may be sequence-order dependent rather than local layout structure.
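
One way such a test could be set up is sketched below. It assumes a single numeric sequence (for example, tokens per line), measures power at the Fourier bin nearest period 17 relative to the mean power of neighboring bins, and compares that ratio against an order-shuffled null. The function names, the neighborhood half-width, and the plain shuffle null are illustrative assumptions; the reflow and page-order controls described above would need their own resampling schemes.

```python
import numpy as np

def period_power_ratio(x, period=17, half_width=3):
    """Power at the bin nearest 1/period, divided by mean power of nearby bins."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x))
    k = int(np.argmin(np.abs(freqs - 1.0 / period)))
    lo = max(1, k - half_width)            # never include the DC bin
    hi = min(len(power), k + half_width + 1)
    neighbors = np.r_[power[lo:k], power[k + 1:hi]]
    return power[k] / neighbors.mean()

def shuffle_p_value(x, n_resamples=1000, seed=42, **kwargs):
    """One-sided p-value for the ratio against an order-shuffled null."""
    rng = np.random.default_rng(seed)
    observed = period_power_ratio(x, **kwargs)
    null = np.array(
        [period_power_ratio(rng.permutation(x), **kwargs) for _ in range(n_resamples)]
    )
    p = (1 + np.sum(null >= observed)) / (n_resamples + 1)
    return observed, p
```

Shuffling destroys sequence order while keeping the value distribution, so a peak that survives only in the observed order points to order-dependent structure rather than to the distribution of values alone.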

Main result

Two-sided BEH supported across all four tests.

All BH-adjusted q-values are below 0.001. Effect sizes range from medium to very large.

q < .001: all four corrected comparisons significant.

−0.578 → −1.810: Hedges’ g range across tests.

10k: permutation and bootstrap resamples, seed 42.

2.64×: full retention exceeds uniform-vocab control.

Comparison      Level      n₁/n₂    Mean₁  Mean₂  ΔH      95% CI            q      g
Currier A vs B  page       114/82   6.016  6.958  −0.943  [−1.123, −0.762]  <.001  −1.544
Currier A vs B  paragraph  141/111  5.712  6.407  −0.695  [−0.901, −0.484]  <.001  −0.870
Hand 1 vs 2     page       86/45    5.840  6.590  −0.750  [−0.908, −0.595]  <.001  −1.810
Hand 1 vs 2     paragraph  92/64    5.726  6.077  −0.351  [−0.557, −0.133]  <.001  −0.578
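
The quantities in the table (effect size g and BH-adjusted q) can be computed along the following lines. This is a sketch under assumptions, not the project's code: it uses the common small-sample correction for Hedges' g, a two-sided permutation test on the difference of means, and the standard Benjamini–Hochberg step-up adjustment. The defaults of 10,000 resamples and seed 42 mirror the figures quoted above.

```python
import numpy as np

def hedges_g(a, b):
    """Standardized mean difference with the usual small-sample correction."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(
        ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    )
    d = (a.mean() - b.mean()) / pooled_sd
    return d * (1 - 3 / (4 * (n1 + n2) - 9))

def permutation_p(a, b, n_resamples=10_000, seed=42):
    """Two-sided permutation p-value for the difference of means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_resamples):
        perm = rng.permutation(pooled)
        hits += abs(perm[:len(a)].mean() - perm[len(a):].mean()) >= observed
    return (hits + 1) / (n_resamples + 1)

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    q = np.empty_like(p)
    q[order] = np.minimum(ranked, 1.0)
    return q
```

Because the test is two-sided, a negative g such as those in the table counts as support for the revised BEH just as a positive one would; only the sign of the observed direction is read off afterwards.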

STA + Hodge, in human terms

The alphabet became a measurement instrument.

The Super Transliteration Alphabet is not a proposed pronunciation. Zandbergen defines STA as a computer-processing aid for comparing transliterations; this site uses that idea as a measurement layer while keeping Derek’s graph/Hodge claims separate from the source transcription.

Why STA?

Raw EVA tokens like qokedy, chedy, daiin, and ol are useful, but they mix local shape, position, and recurrence. Following Zandbergen’s STA family logic, the analysis asks whether glyph families behave differently by line position, folio, section, or hand.

Why “101 dimensions”?

The working framework can expand each token/window into a high-dimensional feature vector: token identity, STA family, member index, line position, word position, transition context, section metadata, entropy, successor structure, graph degree, and spectral modes. The exact dimension can change by experiment; the principle is stable: encode many weak signals, then test which survive controls.

What Hodge contributes

On a graph, Hodge-style decomposition is a way to separate flow-like recurrence from gradient-like drift. If a manuscript section has stable cycles — repeated transition pathways that are not explained by unigram frequency alone — Hodge features should help expose them.
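
A minimal numerical sketch of that split, assuming the edge flow is the net transition count between each pair of tokens. The gradient part is the least-squares fit of a node potential; the remainder is the divergence-free (cyclic) part. The sketch does not separate that remainder into curl and harmonic pieces, and the function name is hypothetical.

```python
import numpy as np

def hodge_split(edges):
    """Split net edge flow into gradient and cyclic parts.

    edges: {(a, b): count}. Returns (pairs, flow, gradient, cyclic).
    """
    nodes = sorted({v for e in edges for v in e})
    index = {v: i for i, v in enumerate(nodes)}
    pairs = sorted({tuple(sorted(e)) for e in edges if e[0] != e[1]})
    # Net flow on each unordered pair, oriented from first to second node.
    flow = np.array(
        [edges.get((a, b), 0) - edges.get((b, a), 0) for a, b in pairs], dtype=float
    )
    # Incidence (gradient) operator: (B @ s)[e] = s[b] - s[a].
    B = np.zeros((len(pairs), len(nodes)))
    for row, (a, b) in enumerate(pairs):
        B[row, index[a]] = -1.0
        B[row, index[b]] = 1.0
    potential, *_ = np.linalg.lstsq(B, flow, rcond=None)
    gradient = B @ potential          # part explained by a node potential
    cyclic = flow - gradient          # divergence-free remainder
    return pairs, flow, gradient, cyclic

cycle = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 1}
chain = {("a", "b"): 1, ("b", "c"): 1}
```

A pure cycle such as `cycle` has no gradient component at all, while a pure chain such as `chain` has no cyclic component, which is the contrast the Hodge features are meant to quantify on real transition graphs.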

What “meaning” means here

Meaning is not asserted from a few pretty outputs. The current meaning is structural: Currier B and Hand 2 show higher per-unit entropy, sections differ in successor entropy, and matched controls do not fully reproduce observed retention. Semantic meaning comes later, only if candidate keys align with illustrations and generalize.

Graph signature

Full-manuscript structural features

Entropy: 10.452 bits
Successor entropy: 4.361 bits
Circuit rank μ: 22,675
Mean out-degree: 4.037
daiin count: 864

Section successor entropy

Biological section stands out.

biological: 3.734
herbal: 3.372
stars: 3.330
pharmaceutical: 2.172
cosmological: 1.979
text: 1.382
zodiac: 1.265

Matched controls

The manuscript is not just a shuffled bag of Voynichese.

Observed: retention 0.8376%

Successor entropy 4.361 bits.

Frequency shuffle: retention 0.3109%

Preserves unigram frequencies, weakens local transition retention.

Uniform vocab: retention 0.0130%

Destroys manuscript-like recurrence almost entirely.
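
The two controls can be generated as sketched below. The retention statistic itself is not defined in this text, so the sketch substitutes a hypothetical stand-in, `repeated_bigram_rate` (the share of adjacent transitions whose bigram occurs more than once); it should not be read as the project's retention measure.

```python
import random
from collections import Counter

def frequency_shuffle(tokens, rng):
    """Permute token order: unigram frequencies kept, local transitions broken."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled

def uniform_vocab(tokens, rng):
    """Draw each position uniformly from the vocabulary: frequencies flattened too."""
    vocab = sorted(set(tokens))
    return [rng.choice(vocab) for _ in tokens]

def repeated_bigram_rate(tokens):
    """Hypothetical stand-in metric: share of transitions whose bigram recurs."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    return sum(n for n in bigrams.values() if n > 1) / total

rng = random.Random(42)
tokens = ["daiin", "ol", "chedy", "daiin", "ol", "qokedy", "daiin", "ol"]
shuffled = frequency_shuffle(tokens, rng)
flattened = uniform_vocab(tokens, rng)
```

Comparing the same statistic on the observed sequence and on many draws of each control is what separates structure carried by token frequencies from structure carried by token order.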

Research ledger

The wins matter because the failures are recorded.

This is the difference between a serious framework and a decoding story. The project keeps the negative results visible.

Win

Full transcription processed

5,207 Takahashi lines, 225 pages, 37,967 cleaned word tokens, and 8,071 vocabulary items were parsed into reproducible processed outputs.

Win

Currier/hand structure survived inference

Four BEH comparisons were significant after BH correction, with medium to very large effects (Hedges’ g from −0.578 to −1.810). That supports structural heterogeneity.

Win

Controls did not erase everything

Frequency-shuffle and uniform-vocab controls changed retention and successor structure, suggesting the manuscript is not merely a shuffled word bag.

Candidate

Period near 17

A possible period-17 signal is interesting enough to pre-register and test, but not strong enough to advertise as decoded meaning.

Fail

Original direction was wrong

The first one-sided BEH prediction expected A/1 entropy to exceed B/2. The data went the other way. The claim was revised to a two-sided structural hypothesis.

Fail

No decipherment key yet

No substitution table currently satisfies the standard: one rule set, all sections, historical-language grammar, image alignment, and independent replication.

Recent Seagate sources

Recovered working package

  • voynich_reproducibility_repo_final_polished.zip · Apr 28, 2026 · reproducibility repo
  • voynich_full_experiment_outputs.zip · Apr 28, 2026 · processed outputs
  • voynich_complete_updated_manuscript.pdf · Apr 28, 2026 · manuscript package
  • LSI_ivtff_0d.txt · Apr 28, 2026 · raw IVTFF transcription copy; cite against Zandbergen/IVTFF documentation

Sources and citations

Public source authority stays separate from the experimental framework.

The site’s graph, entropy, and Hodge results are Derek Earnhart’s analysis layer. The underlying manuscript record, transliteration conventions, STA terminology, and Currier/RZ language labels should be cited through the sources below.

Transliteration / IVTFF / Takahashi / ZL

René Zandbergen, “Transliteration of the Text”

Documents the major Voynich transliterations, the Stolfi-Landini interlinear lineage, Takahashi material, Zandbergen-Landini files, and the IVTFF common file format used for text processing.

voynich.nu/transcr.html

STA / family coding

René Zandbergen, “A superset of transliteration alphabets”

Defines STA as a superset alphabet and explains STA families. This site’s STA feature layer is a derived analytical use of that public convention, not an independent pronunciation or decipherment claim.

voynich.nu/extra/sta.html

Currier language / RZ language extension

René Zandbergen, “Extension to the Currier languages”

Summarizes Currier A/B as statistical groupings, distinguishes language from handwriting, and proposes the RZ language identification used as a modern classification reference.

voynich.nu/extra/rz_lang.html

Original Currier statistical split

Prescott H. Currier papers, hosted by René Zandbergen

Primary historical source for Currier’s language/hand terminology and his statistical framing of multiple Voynich text regimes.

voynich.nu/extra/curr_main.html

Manuscript custody / images / sections

Beinecke Rare Book & Manuscript Library, “Voynich Manuscript”

Yale’s institutional record for Beinecke MS 408, including collection history, section overview, and links to manuscript images and catalog material.

beinecke.library.yale.edu

Classic cryptologic survey

Mary E. D’Imperio, The Voynich Manuscript: An Elegant Enigma

NSA/CSS monograph from 1978 that remains a useful historical survey of Voynich research, cryptologic approaches, and earlier analytic traditions.

nsa.gov PDF

Attribution note: “STA” and the STA family concept should be credited to René Zandbergen. Derek’s contribution here is the downstream statistical/Hodge testing framework built on top of those public transcription and classification resources.