OFFICIAL RELEASE GUIDE

What was officially released (DOJ and mirrors), how the main ZIP packages map to DS1-DS12, and where torrent sources are available.

Release Snapshot

High-level orientation for the public release bundle and the current scan pipeline, including estimated release-level picture volume and verified mirror/torrent coverage.

Pictures / Images: Where They Came From

The image corpus is not one single ZIP. It comes from official release IMAGES folders (Concordance/IPRO datasets) plus extracted page-image outputs used by OCR/CLIP processing.

Source | What It Is | Approx Volume | Current Path
DataSet10 (official release) | Concordance/IPRO image production set in VOL00010/IMAGES. | 303,204 image files | _DATA/DataSet10/VOL00010/IMAGES/
DataSet11 (official release) | Concordance/IPRO image production set in IMAGES/0001-0332. | ~332,664 image files (est.) | _DATA/DataSet11/IMAGES/
Extracted Image Corpus (working) | Page-image corpus used for OCR triage and CLIP tagging across extracted release materials. | ~1.2M+ images | _CLIP_OUTPUT/ALL_EXTRACTED_IMAGES/_by_first_category/
CLIP Tagged Subsets | Tagged/organized subsets derived from the extracted image corpus. | 3,349 (EG) + 689 (G) (+ test folders) | _CLIP_OUTPUT/EG/, _CLIP_OUTPUT/G/

In-project basis for these figures: AUDIT-REPORT-2026-02-12.md (DS10/DS11 IMAGES counts and structure) and PROJECT/SCRIPTS/ocr-text-separator-v2.py ("Separates 1.2M+ images").
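The volumes above can be re-checked against a local copy of the release. The following is a minimal sketch, not the project's audit script: the function name `count_images` and the extension list are assumptions (the guide does not say which image formats the IMAGES folders contain), while the two paths are the ones listed in the table.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; the guide does not list the formats in IMAGES/.
IMAGE_EXTS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def count_images(root, exts=IMAGE_EXTS):
    """Return (total, per_folder) image-file counts under root.

    per_folder is keyed by the first directory level below root
    (for example "0001"); files directly in root are keyed as ".".
    """
    root = Path(root)
    per_folder = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in exts:
            rel = path.relative_to(root)
            key = rel.parts[0] if len(rel.parts) > 1 else "."
            per_folder[key] += 1
    return sum(per_folder.values()), dict(per_folder)


if __name__ == "__main__":
    # Paths taken from the table above; adjust to your local layout.
    targets = [
        ("DataSet10", "_DATA/DataSet10/VOL00010/IMAGES/"),
        ("DataSet11", "_DATA/DataSet11/IMAGES/"),
    ]
    for label, folder in targets:
        if Path(folder).is_dir():
            total, _ = count_images(folder)
            print(f"{label}: {total:,} image files")
        else:
            print(f"{label}: {folder} not found")
```

The per-folder breakdown is useful for DataSet11, where the count is marked as an estimate. The listed figure equals 332 × 1,002, which would be consistent with extrapolating one folder's count across folders 0001-0332, but the guide does not state how the estimate was made.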

Main Parts Of The Release

Part A: Smaller Public Packages

DS1-DS7 and DS12 are the smaller extracted packages. They are already covered in the small-dataset person scans and are useful for quick targeted review.

Part B: Mid-size Package

DS8 is a medium package and produced the majority of early Maxwell hits in our person scans, making it a high-signal bridge between small and large corpora.

Part C: Heavy Corpora

DS9, DS10, and DS11 are the largest workloads. By keyword profile, DS10 is financial-heavy and DS11 is mostly core email. DS9 is the largest dataset not yet covered by person scans and is now being processed.
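For scripting against the release, the Part A/B/C grouping can be written down as a small lookup. This is an illustrative sketch only: the tier labels `small`, `mid`, and `heavy` and the helper `tier_of` are hypothetical names, and only the dataset membership comes from the descriptions above.

```python
# Tier labels are hypothetical; membership follows Parts A-C above.
TIERS = {
    "small": [f"DS{n}" for n in (1, 2, 3, 4, 5, 6, 7, 12)],  # Part A
    "mid": ["DS8"],                                           # Part B
    "heavy": ["DS9", "DS10", "DS11"],                         # Part C
}


def tier_of(dataset):
    """Return the tier label for a dataset name such as 'DS10'."""
    for tier, members in TIERS.items():
        if dataset in members:
            return tier
    raise KeyError(f"unknown dataset: {dataset}")


# Sanity check: DS1-DS12 each appear in exactly one tier.
ALL_DATASETS = [f"DS{n}" for n in range(1, 13)]
assert sorted(ds for members in TIERS.values() for ds in members) == sorted(ALL_DATASETS)
```

A scan runner could use `tier_of` to decide, for example, whether a dataset is small enough for a quick targeted pass or needs a long-running batch job.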

Official ZIP Packages (Main View)

Each dataset row includes official ZIP name, file count and size, what it mainly consists of from scan evidence, and direct links.

Dataset | Official ZIP / Size | Files | What It Consists Of (scan-based) | Links

Scan Coverage Snapshot

Status view of what has been scanned so far, so you can see what is complete vs in progress.

Where This Data Lives In The Site