Official Release Guide | THE PDF FILES

Release Snapshot

High-level orientation for the public release bundle and the current scan pipeline, including estimated release-level picture volume and verified mirror/torrent coverage.

Pictures / Images: Where They Came From

The image corpus is not one single ZIP. It comes from official release IMAGES folders (Concordance/IPRO datasets) plus extracted page-image outputs used by OCR/CLIP processing.

Source	What It Is	Approx Volume	Current Path
DataSet10 (official release)	Concordance/IPRO image production set in `VOL00010/IMAGES`.	303,204 image files	`_DATA/DataSet10/VOL00010/IMAGES/`
DataSet11 (official release)	Concordance/IPRO image production set in `IMAGES/0001-0332`.	~332,664 image files (est.)	`_DATA/DataSet11/IMAGES/`
Extracted Image Corpus (working)	Page-image corpus used for OCR triage and CLIP tagging across extracted release materials.	~1.2M+ images	`_CLIP_OUTPUT/ALL_EXTRACTED_IMAGES/_by_first_category/`
CLIP Tagged Subsets	Tagged/organized subsets derived from the extracted image corpus.	3,349 (EG) + 689 (G) (+ test folders)	`_CLIP_OUTPUT/EG/`, `_CLIP_OUTPUT/G/`

Reference basis in-project: `AUDIT-REPORT-2026-02-12.md` (DS10/DS11 IMAGES counts/structure) and `PROJECT/SCRIPTS/ocr-text-separator-v2.py` ("Separates 1.2M+ images").
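
The per-source volumes above can be re-checked with a short file count. The following is a minimal sketch, not the audit's actual method: the function name `count_images_by_folder` and the extension list are assumptions, and the real productions may use other image formats.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust to match the actual production files.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def count_images_by_folder(images_root):
    """Map each top-level subfolder under images_root to its image-file count.

    Files that sit directly in images_root are counted under the key ".".
    """
    root = Path(images_root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            relative = path.relative_to(root)
            # Group by the first path component below the root (e.g. "0001").
            top = relative.parts[0] if len(relative.parts) > 1 else "."
            counts[top] += 1
    return counts
```

Calling `count_images_by_folder("_DATA/DataSet11/IMAGES/")` and summing the values would give a figure to compare against the DS11 estimate; the per-folder breakdown also shows whether the numbered subfolders are evenly filled.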

Main Parts Of The Release

Part A: Smaller Public Packages

DS1-DS7 and DS12 are the smaller extracted packages. They are already covered in the small-dataset person scans and are useful for quick targeted review.

Part B: Mid-size Package

DS8 is a medium-size package that produced the majority of early Maxwell hits in our person scans, making it a high-signal bridge between the small and large corpora.

Part C: Heavy Corpora

DS9, DS10, and DS11 are the largest workloads. DS10 is financial-heavy by keyword profile, DS11 is core email-heavy, and DS9 is the largest remaining gap in person-scan coverage and is now being processed.

Official ZIP Packages (Main View)

Each dataset row lists the official ZIP name and size, the file count, what the package mainly consists of according to scan evidence, and direct links.

Dataset	Official ZIP / Size	Files	What It Consists Of (scan-based)	Links

Scan Coverage Snapshot

Status of the scans run so far, showing which datasets are complete and which are still in progress.

DS1-DS7 and DS12: Existing 12-person scans complete.
DS8: Existing 12-person scan complete.
DS10: Financial keyword scan complete.
DS11: Existing scans complete for 11/12 tracked people; Maxwell was not included in that historical DS11 pass.
DS9: Existing 12-person scan is currently running in single-job mode.
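
For scripting against this snapshot, the status list above can be written as a small lookup table. This is an illustrative sketch only: the `COVERAGE` structure, the status labels (`complete`, `partial`, `running`), and the helper name are assumptions, not the format the site actually stores.

```python
# Status values transcribed from the coverage list; labels are hypothetical.
COVERAGE = {
    "DS1-DS7, DS12": {"scan": "12-person", "status": "complete"},
    "DS8": {"scan": "12-person", "status": "complete"},
    "DS10": {"scan": "financial keyword", "status": "complete"},
    "DS11": {
        "scan": "12-person",
        "status": "partial",
        "note": "11/12 tracked people; Maxwell not included in that pass",
    },
    "DS9": {"scan": "12-person", "status": "running"},
}


def datasets_with_status(coverage, status):
    """Return dataset names whose scan entry has the given status label."""
    return [name for name, entry in coverage.items() if entry["status"] == status]
```

With this table, `datasets_with_status(COVERAGE, "running")` returns only the DS9 entry, and the `partial` label isolates DS11 as the dataset that still needs a Maxwell pass.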

Where This Data Lives In The Site

`content/emails/{person}/` stores per-person JSONL outputs and person-level email artifacts.
`content/emails/cast-flagged/` stores aggregated email markdown evidence files used by the cast and explorer views.
`content/cast/pages/` and `content/cast/cast-index.md` map scanned people into browsable cast pages.
`data/public-data.json` drives the dataset cards and this guide page summary.
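
As a quick sanity check on the per-person outputs, the JSONL files can be tallied per person. This is a hedged sketch: it assumes the files use a `.jsonl` extension and sit directly inside each person folder, neither of which this guide confirms, and `count_records_per_person` is a hypothetical helper name.

```python
import json
from pathlib import Path


def count_records_per_person(emails_root="content/emails"):
    """Return {person_folder_name: number of JSONL records} under emails_root."""
    root = Path(emails_root)
    totals = {}
    if not root.is_dir():
        return totals
    for person_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if person_dir.name == "cast-flagged":
            # Aggregated markdown evidence, not per-person JSONL output.
            continue
        total = 0
        # Assumption: JSONL files live directly in the person folder.
        for jsonl_path in sorted(person_dir.glob("*.jsonl")):
            with jsonl_path.open(encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # raises ValueError on a malformed record
                    total += 1
        totals[person_dir.name] = total
    return totals
```

Parsing every line rather than just counting newlines means a truncated or corrupted record surfaces as an error instead of silently inflating the total.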