1. Overview: Images in the Epstein Files
The official Epstein file releases contain 900,956 files across 12 datasets. Nearly every one of these files is a PDF — and inside those PDFs are embedded images: photographs, screenshots, scanned pages, charts, and hundreds of thousands of blank white rectangles.
Our pipeline has extracted and individually classified 949,123 images from these PDFs, sorted them into 47 content categories using an AI vision model, and initially flagged 546,976 blank-like pages by pixel analysis. After OCR/reclassification of the blank-labeled buckets (reprocessed pages were moved to _NEEDS_OCR), the current confirmed blank bucket is 17,831 pages (~1.9% of all extracted images).
2. Why Are Images Stored as PDFs?
This is one of the most important questions about the release format, and the answer says more about the releasing agency than it does about the files themselves.
What happened
Photographs, screenshots, scanned documents, text messages, business cards, and handwritten notes were all individually wrapped in PDF containers before release. A simple 200KB JPEG photograph becomes a 1.5MB single-page PDF. A screenshot of a text conversation that could be a PNG file becomes a PDF that requires a PDF reader to open.
What the original files were
These images originally existed as standard image files — .jpg,
.png, .tiff, .bmp — or as embedded images
within email clients, litigation databases, and forensic imaging tools. They were converted
to individual PDF pages as part of the Concordance/IPRO litigation support workflow used
by the DOJ and cooperating law firms.
Why this matters
- You can’t search images inside PDFs. Image-based PDFs don’t have searchable text. Every page requires OCR (Optical Character Recognition) before it becomes searchable — a process that takes seconds per page and hours at scale.
- You can’t thumbnail or preview PDFs easily. Standard file browsers can display image thumbnails instantly. PDFs require rendering, which makes visual scanning of large collections impractical without specialized tools.
- File sizes balloon. The PDF wrapper adds overhead to every file. Across 900,000+ images, this adds up to hundreds of gigabytes of wasted space.
- Metadata is lost. Original image files contain EXIF data — camera model, date taken, GPS coordinates, orientation. PDF conversion strips all of this. For evidentiary photographs, this metadata loss is significant.
- It fragments multi-page documents. A 47-page scanned deposition becomes 47 individual PDF files, each containing a single page image. Reassembling the original document requires knowing which page numbers belong together — information that is often only available in the proprietary Concordance load file.
3. The Extraction Pipeline
To make the images accessible, we built a multi-stage pipeline that extracts, classifies, and organizes every embedded image from every PDF in the release.
Filename Convention
Each extracted image gets a structured filename that encodes its classification and source:
deposition-transcript+adult-woman+affidavit__EFTA01609671_20260130_p001_i001.png
- Primary category + secondary tags — what the AI thinks the image contains
- EFTA ID — the original Bates/document identifier from the release
- Date, page, image number — extraction metadata
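As an illustration, the convention above can be parsed with a short regular expression. This is our own sketch: the field layout (labels joined by `+`, then the EFTA ID, date, page, and image index separated by underscores) is inferred from the example filename, and the regex and `parse_name` helper are hypothetical, not part of the pipeline's actual code.

```python
import re

# Hypothetical parser for the filename convention shown above.
# Layout inferred from the example:
#   labels__EFTAnnnnnnnn_YYYYMMDD_pNNN_iNNN.png
FNAME_RE = re.compile(
    r"^(?P<labels>[^_]+(?:\+[^_]+)*)"  # one or more '+'-joined category labels
    r"__(?P<efta>EFTA\d+)"             # original Bates/document identifier
    r"_(?P<date>\d{8})"                # extraction date, YYYYMMDD
    r"_p(?P<page>\d{3})"               # source page number within the PDF
    r"_i(?P<image>\d{3})"              # image index on that page
    r"\.png$"
)

def parse_name(name: str) -> dict:
    """Split an extracted-image filename into its encoded fields."""
    m = FNAME_RE.match(name)
    if not m:
        raise ValueError(f"unrecognized filename: {name}")
    fields = m.groupdict()
    fields["labels"] = fields["labels"].split("+")
    return fields
```

Applied to the example filename, this yields the three labels, the EFTA ID `EFTA01609671`, the date, and the page/image indices as separate fields.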
Technical Details
| Component | Detail |
|---|---|
| PDF parser | PyMuPDF (fitz) 1.27.1 — extracts embedded image XObjects from each page |
| Vision model | OpenAI CLIP ViT-B/32 — zero-shot image classification |
| Label set | 48 custom labels — tuned for legal/evidentiary content (see §4) |
| Blank detection | Pillow + NumPy — mean pixel > 252 & std < 3 = blank |
| Output format | PNG images, organized into category folders |
| Processing time | ~4 hours with CUDA GPU, ~40 hours CPU-only |
4. AI Classification (CLIP)
Every extracted image is classified using CLIP (Contrastive Language–Image
Pre-training), an AI model developed by OpenAI that can match images to text descriptions
without being specifically trained on any particular category. We use the
ViT-B/32 variant.
How it works
CLIP compares each image against a set of 48 custom text labels and returns a probability score for each one. The labels were designed specifically for this type of legal/evidentiary content.
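Mechanically, the zero-shot step reduces to cosine similarity between one image embedding and the 48 label-text embeddings, followed by a temperature-scaled softmax. The sketch below shows that scoring logic on plain NumPy vectors; in the real pipeline the embeddings would come from the ViT-B/32 image and text encoders, and `zero_shot_scores` is our illustrative name, not pipeline code.

```python
import numpy as np


def zero_shot_scores(image_emb: np.ndarray, label_embs: np.ndarray,
                     temperature: float = 100.0) -> np.ndarray:
    """CLIP-style zero-shot scoring.

    image_emb:  (D,) embedding of one image
    label_embs: (N, D) embeddings of the N text labels
    Returns an (N,) probability vector over the labels. CLIP scales the
    cosine similarities by a learned logit scale (around 100 after training),
    which is what `temperature` stands in for here.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # scaled cosine similarities
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()
```

The top 3 to 5 indices of the returned vector would then become the `+`-joined labels in the filename convention from §3.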
Accuracy & Limitations
- CLIP is good at broad categorization — it reliably distinguishes email screenshots from deposition pages from photographs
- It struggles with similar document types — an affidavit vs. a legal filing vs. a court document can look nearly identical, so these categories have overlap
- Multi-label tagging helps — each image gets its top 3–5 labels encoded in the filename, so an image classified as email-screenshot+contact-list captures both aspects
- Human review is still needed — the AI classification is a sorting tool, not a final determination. Critical images should always be verified
5. Content Breakdown: What’s in the Images
Based on CLIP classification of 949,123 extracted images, here is the high-level breakdown of what the Epstein files actually contain visually:
Content Type Distribution
What this tells us
Over 87% of all images are pages of typed text — depositions, emails, and legal documents that were printed or scanned and then stored as images inside PDFs. These are not “pictures” in any meaningful sense. They are text documents that have been rendered as pictures, making them unsearchable, uncopyable, and unindexable without OCR processing.
Actual photographs of people, places, and things make up less than 0.4% of the total. The Epstein files are overwhelmingly a text archive that has been converted into an image archive through format choices.
6. Full Category Inventory
All 26 active categories with current file counts. Categories were created by the CLIP classification pipeline and have been consolidated through manual review.
Complete Table
| Category | Images | % of Total | Blanks Swept | Notes |
|---|---|---|---|---|
| deposition-transcript | 327,717 | 46.4% | 95,721 removed | Largest category — court depositions, legal testimony pages |
| Emails Screenshots | 245,041 | 34.7% | 15,825 removed | Email bodies rendered as images within PDFs |
| uncategorized | 27,345 | 3.9% | Pending | Images that didn’t strongly match any label |
| scanned-document | 24,220 | 3.4% | 15,651 removed | Generic scanned pages — high blank rate (64.6%) |
| affidavit | 19,792 | 2.8% | Pending | Sworn statements, witness declarations |
| bank-statement | 16,641 | 2.4% | Pending | Financial account statements, transaction records |
| court-document | 16,145 | 2.3% | Pending | Court filings, orders, motions |
| Text Messages & Screenshots | 9,420 | 1.3% | Pending | Text/SMS conversations captured as screenshots |
| book-page | 5,903 | 0.8% | Pending | Scanned book/publication pages |
| Q | 5,011 | 0.7% | Skipped (misc) | Miscellaneous overflow category |
| Locations | 2,364 | 0.3% | Skipped (small) | Maps, property images, location shots |
| People | 2,263 | 0.3% | Pending | Photographs containing identifiable persons |
| computer-screen | 1,387 | 0.2% | Pending | Screenshots of computer interfaces |
| chart | 1,024 | 0.1% | Pending | Graphs, diagrams, organizational charts |
| timeline | 463 | 0.1% | Pending | Chronological visualizations |
| conference-room | 400 | 0.1% | Pending | Meeting/office environment photos |
| Handwritten Notes | 201 | 0.03% | Small set | Handwritten documents, notes, annotations |
| newspaper-clipping | 126 | 0.02% | Pending | News articles, press clippings |
| ETC | 66 | <0.01% | Small set | Miscellaneous items that don’t fit other categories |
| + 7 minor/empty categories | ~23 | <0.01% | n/a | Graphics, Documents, Legal, Paperwork, Xxx, check, form |
7. Blank Pages & Waste
Blank handling now has two distinct numbers: the initial image-level flags from the pixel sweep, and the current confirmed blank bucket after OCR/reclassification. The initial sweep flagged 546,976 pages as blank-like; after reprocessing the blank-labeled folders, the remaining confirmed blank bucket is 17,831 pages. Reprocessed pages were moved to _NEEDS_OCR, and the current bucket counts in _BLANK_PAGES are _blank=17,821 and _questionable=10 (17,831 total).
The category table below is a historical image-level sweep snapshot and should be read as pre-reclassification detail, not the current blank residual.
How blank detection works
Each image is loaded as a pixel array. If the mean pixel value exceeds 252 (out of 255) and the standard deviation is below 3, the image is classified as blank. This catches pure white pages, near-white pages, and pages with only faint scanner artifacts. The threshold is conservative — genuinely blank pages typically have mean > 254 and std < 1.
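A minimal implementation of that rule, assuming Pillow and NumPy as listed in the §3 table; the thresholds are the ones stated above, while the `is_blank` helper name is our own illustration.

```python
import numpy as np
from PIL import Image

MEAN_THRESHOLD = 252  # out of 255: the page is near-white on average
STD_THRESHOLD = 3     # almost no pixel variation anywhere on the page


def is_blank(img: Image.Image) -> bool:
    """Blank-page test described above: grayscale mean > 252 and std < 3."""
    px = np.asarray(img.convert("L"), dtype=np.float64)
    return bool(px.mean() > MEAN_THRESHOLD and px.std() < STD_THRESHOLD)
```

The standard deviation check is what makes the rule robust: a page that is mostly white but carries even a small block of text keeps a high mean yet fails the std test, so it is not flagged.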
Blank Page Breakdown
| Category | Total Images | Blanks Found | Blank Rate | Status |
|---|---|---|---|---|
| deposition-transcript | 424,716 | 233,747 | 55.0% | ✓ Complete |
| email-screenshot | 261,902 | 186,309 | 71.1% | ✓ Complete |
| uncategorized | 28,544 | 26,975 | 94.5% | ✓ Complete |
| contact-list | 29,072 | 24,135 | 83.0% | ✓ Complete |
| typed-page | 22,896 | 18,736 | 81.8% | ✓ Complete |
| invoice | 43,229 | 15,457 | 35.8% | ✓ Complete |
| affidavit | 19,856 | 12,380 | 62.3% | ✓ Complete |
| bank-statement | 16,658 | 8,300 | 49.8% | ✓ Complete |
| legal-filing | 24,847 | 7,226 | 29.1% | ✓ Complete |
| court-document | 16,253 | 6,895 | 42.4% | ✓ Complete |
| scanned-document | 25,013 | 139 | 0.6% | ✓ Complete |
| + 36 other categories | 36,137 | 6,677 | 18.5% | ✓ Complete |
| TOTAL | 949,123 | 546,976 | 57.6% | ✓ Complete |
What 17,831 confirmed blank pages means in practice: If a researcher is opening PDFs to review the files, the current blank residual is roughly 1 in 53 files. At 10 seconds per file to open, inspect, and close: that’s ~49.5 hours of wasted time — about 6 working days — still significant, but far below the original image-level blank estimate.
8. Cleanup: Duplicates, Tiny Files, Corrupt Images
Beyond blank pages, the extraction process identifies several other categories of problematic files that inflate the release without adding content.
What each category means
| Category | Count | Description |
|---|---|---|
| Blank pages | 17,831 | Current confirmed blank residual after OCR/reclassification of previously blank-labeled buckets. (Initial image-level flags were 546,976.) |
| Duplicates | 250,243 | Duplicate images identified via perceptual hash (dHash) matching. Same visual content stored under different filenames across datasets. Moved to DUPLICATES folder. |
| Bad / corrupt | 3,554 | Images that failed extraction or are corrupted — zero-byte files, truncated PNGs, images that won’t load. Moved to BLANKS BADS folder. |
| Tiny files (<5KB) | 697 | Images under 5KB — typically 1x1 pixel spacers, tiny icons, or PDF artifacts. Too small to contain meaningful content. Moved to TINY folder. |
| Verified keepers | 689 | Images manually reviewed and confirmed as significant/useful content. Moved to KEEPERS folder for priority access. |
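The duplicate pass above names perceptual hashing (dHash). A minimal sketch of that technique follows: shrink to a tiny grayscale grid, record whether each pixel is brighter than its right-hand neighbor, and pack the bits into a 64-bit fingerprint. The helper names are ours; the pipeline's actual matching code may differ in details such as the distance cutoff.

```python
import numpy as np
from PIL import Image


def dhash(img: Image.Image, hash_size: int = 8) -> int:
    """Difference hash: (hash_size+1) x hash_size grayscale thumbnail,
    one bit per horizontal gradient sign. 64 bits for hash_size=8."""
    small = img.convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
    px = np.asarray(small, dtype=np.int16)
    bits = px[:, 1:] > px[:, :-1]  # True where pixel > left neighbor's value
    return int("".join("1" if b else "0" for b in bits.flatten()), 2)


def hamming(a: int, b: int) -> int:
    """Number of differing bits; small distances indicate near-duplicates."""
    return bin(a ^ b).count("1")
```

Because the hash survives rescaling and recompression, the same photograph stored under different filenames in different datasets collapses to the same (or a near-identical) fingerprint, which is how visually identical content can be grouped across 12 separate releases.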
Total waste calculation
Adding up all identified waste: 17,831 blanks + 250,243 duplicates + 3,554 corrupt + 697 tiny = 272,325 junk files out of 949,123 extracted images. That’s 28.7% of extracted images. After removing this identified waste, approximately 676,798 pages remain.
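The totals above can be checked mechanically (all figures are the ones quoted in this section):

```python
# Sanity check of the waste arithmetic: blanks + duplicates + corrupt + tiny.
blanks, duplicates, corrupt, tiny = 17_831, 250_243, 3_554, 697
extracted = 949_123

junk = blanks + duplicates + corrupt + tiny
remaining = extracted - junk

assert junk == 272_325
assert round(100 * junk / extracted, 1) == 28.7
assert remaining == 676_798
```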
9. Which Datasets Contain What
The 12 datasets vary enormously in their image content. Based on our scanning and classification work so far, here is what we know about the visual content of each dataset:
| Dataset | Total Files | Archive Size | Primary Image Content |
|---|---|---|---|
| DS1 | 3,167 | 1.23 GB | Mixed — legal filings, correspondence pages, some scanned documents |
| DS2 | 579 | 0.62 GB | Correspondence, some financial document pages |
| DS3 | 70 | 0.58 GB | Legal filings, deposition pages — higher page-per-PDF ratio |
| DS4 | 155 | 0.34 GB | Property records, legal documents |
| DS5–DS7 | 159 | 0.20 GB | Small supplementary sets — mixed legal/correspondence |
| DS8 | 10,595 | 9.95 GB | MCC prison records — guard check sheets, incident reports, facility photos, surveillance-related docs |
| DS9 | 254,477 | 137.90 GB | Largest archive — Lesley Groff email images, Richard Kahn correspondence, bulk email screenshots. Not yet fully classified |
| DS10 | 302,600 | 78.64 GB | Concordance/IPRO format — single-page scanned images in PDF wrappers. Heavy deposition content. Highest blank page rate expected. |
| DS11 | 331,997 | 25.56 GB | Bulk email archive — mostly email screenshots. Largest dataset by file count. |
| DS12 | 155 | 0.11 GB | Supplementary documents |
Where the images are concentrated
Three datasets (DS9, DS10, DS11) account for 98.4% of all files. The remaining nine datasets combined are a rounding error in terms of image volume.
10. Current Processing Status
All major pipeline stages are complete. Here is the final state of each:
| Pipeline Stage | Status | Detail |
|---|---|---|
| PDF image extraction | ✓ Complete | 949,123 images extracted from all 12 datasets |
| CLIP classification | ✓ Complete | All images tagged and sorted into 47 categories |
| Blank page sweep | ✓ Complete | All 47 categories swept — 546,976 initial blank flags, with 17,831 current confirmed blank residual after OCR/reclassification |
| Duplicate detection | ✓ Complete | 250,243 duplicates identified via perceptual hash (dHash) matching |
| Corrupt/tiny cleanup | ✓ Complete | 3,554 corrupt + 697 tiny files isolated |
| Text extraction | ✓ Complete | 343 million words extracted from all datasets (including 40.3M from deep sweep recovery) |
| Person scan | ✓ Complete | 27 persons scanned across all 12 datasets |
| Manual review (keepers) | ✓ Complete | 689 images verified and flagged as significant |
THE PDF FILES
Image analysis based on extraction and CLIP classification of 949,123 images from 900,956 PDFs.
Classification: OpenAI CLIP ViT-B/32. Extraction: PyMuPDF. Blank detection: Pillow + NumPy.
All scripts and methodology are open source and reproducible.