Part A: Smaller Public Packages
DS1-DS7 and DS12 are the smaller extracted packages. They are already covered in the small-dataset person scans and are useful for quick targeted review.
What was officially released (DOJ and mirrors), how the main ZIP packages map to DS1-DS12, and where torrent sources are available.
High-level orientation for the public release bundle and the current scan pipeline, including estimated release-level picture volume and verified mirror/torrent coverage.
The image corpus is not one single ZIP. It comes from official release IMAGES folders (Concordance/IPRO datasets) plus extracted page-image outputs used by OCR/CLIP processing.
| Source | What It Is | Approx Volume | Current Path |
|---|---|---|---|
| DataSet10 (official release) | Concordance/IPRO image production set in VOL00010/IMAGES. | 303,204 image files | _DATA/DataSet10/VOL00010/IMAGES/ |
| DataSet11 (official release) | Concordance/IPRO image production set in IMAGES/0001-0332. | ~332,664 image files (est.) | _DATA/DataSet11/IMAGES/ |
| Extracted Image Corpus (working) | Page-image corpus used for OCR triage and CLIP tagging across extracted release materials. | ~1.2M+ images | _CLIP_OUTPUT/ALL_EXTRACTED_IMAGES/_by_first_category/ |
| CLIP Tagged Subsets | Tagged/organized subsets derived from the extracted image corpus. | 3,349 (EG) + 689 (G) (+ test folders) | _CLIP_OUTPUT/EG/, _CLIP_OUTPUT/G/ |
Reference basis in-project: AUDIT-REPORT-2026-02-12.md (DS10/DS11 IMAGES counts/structure) and PROJECT/SCRIPTS/ocr-text-separator-v2.py ("Separates 1.2M+ images").
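The per-folder image counts in the table can be re-checked locally with a short script. The following is a minimal sketch, not the project's audit tooling: the function name `count_image_files` and the extension list are assumptions (Concordance/IPRO productions commonly ship TIFF or JPEG pages, but the actual formats in these folders should be confirmed against the audit report).

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust after checking the real folders.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def count_image_files(root, extensions=IMAGE_EXTENSIONS):
    """Return a Counter mapping lowercase extension -> file count under root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in extensions:
            counts[suffix] += 1
    return counts


if __name__ == "__main__":
    # Paths taken from the table above; they are relative to the project root.
    for folder in ("_DATA/DataSet10/VOL00010/IMAGES", "_DATA/DataSet11/IMAGES"):
        if Path(folder).is_dir():
            counts = count_image_files(folder)
            print(folder, sum(counts.values()), dict(counts))
        else:
            print(folder, "not found")
```

Comparing the printed totals against the table is a quick way to spot an incomplete download or extraction.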
DS8 is a medium package and produced the majority of early Maxwell hits in our person scans, making it a high-signal bridge between small and large corpora.
DS9, DS10, and DS11 are the largest workloads. DS10 is financial-heavy by keyword profile, DS11 is core email-heavy, and DS9 is the largest unscanned-person gap now being processed.
Each dataset row includes official ZIP name, file count and size, what it mainly consists of from scan evidence, and direct links.
| Dataset | Official ZIP / Size | Files | What It Consists Of (scan-based) | Links |
|---|---|---|---|---|
Status view of what has been scanned so far, so you can see what is complete vs in progress.
content/emails/{person}/ stores per-person JSONL outputs and person-level email artifacts.
content/emails/cast-flagged/ stores aggregated email markdown evidence files used by the cast and explorer views.
content/cast/pages/ and content/cast/cast-index.md map scanned people into browsable cast pages.
data/public-data.json drives the dataset cards and this guide page summary.
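To see how far the per-person scans have progressed, the JSONL outputs can be tallied per directory. This is a rough sketch under stated assumptions: the helper name `count_person_records` is hypothetical, the files are assumed to use a `.jsonl` extension with one JSON object per line, and no record fields are assumed because the schema is not described here.

```python
import json
from pathlib import Path


def count_person_records(emails_root):
    """Return {person_directory_name: number_of_JSONL_records}.

    Directories without any .jsonl file (for example a folder that only
    holds markdown evidence files) are left out of the result.
    """
    totals = {}
    for person_dir in sorted(Path(emails_root).iterdir()):
        if not person_dir.is_dir():
            continue
        jsonl_files = sorted(person_dir.glob("*.jsonl"))
        if not jsonl_files:
            continue
        count = 0
        for jsonl_path in jsonl_files:
            with jsonl_path.open(encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue  # tolerate blank lines
                    json.loads(line)  # raises ValueError on a malformed record
                    count += 1
        totals[person_dir.name] = count
    return totals


if __name__ == "__main__":
    root = Path("content/emails")
    if root.is_dir():
        for person, count in count_person_records(root).items():
            print(f"{person}: {count} records")
```

Parsing each line rather than only counting lines means a truncated or corrupted output file fails loudly instead of inflating the totals.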