
IMAGE & PHOTO ANALYSIS

A deep dive into the 949,123 images extracted from the Epstein file PDFs — what they contain, how they were classified, why they were buried in PDF wrappers, and what the blank page and duplicate cleanup reveals.

Last updated: February 15, 2026

1. Overview: Images in the Epstein Files

The official Epstein file releases contain approximately 900,956 files across 12 datasets. Nearly every one of these files is a PDF — and inside those PDFs are embedded images: photographs, screenshots, scanned pages, charts, and hundreds of thousands of blank white rectangles.

Our pipeline has extracted and individually classified 949,123 images from these PDFs, sorted them into 47 content categories using an AI vision model, and initially flagged 546,976 blank-like pages by pixel analysis. After OCR/reclassification of blank-labeled buckets, the current confirmed blank bucket is 17,831 pages (~1.9% of all extracted images).

Images Extracted: 949,123
Content Categories: 47
Initial Blank Flags: 546,976
Current Blank Rate: 1.9%
Bad / Corrupt: 3,554
Tiny (<5KB): 697

✅ Processing complete. All pipeline stages have finished. Blank page sweep covered all 47 categories. The initial image-level sweep flagged 546,976 pages. The blank organizer then reclassified/OCR-processed large blank-labeled buckets (including 241,686 moved from _NEEDS_OCR), leaving a current confirmed blank bucket of 17,831 pages.

2. Why Are Images Stored as PDFs?

This is one of the most important questions about the release format, and the answer says more about the releasing agency than it does about the files themselves.

What happened

Photographs, screenshots, scanned documents, text messages, business cards, and handwritten notes were all individually wrapped in PDF containers before release. A simple 200KB JPEG photograph becomes a 1.5MB single-page PDF. A screenshot of a text conversation that could be a PNG file becomes a PDF that requires a PDF reader to open.

What the original files were

These images originally existed as standard image files — .jpg, .png, .tiff, .bmp — or as embedded images within email clients, litigation databases, and forensic imaging tools. They were converted to individual PDF pages as part of the Concordance/IPRO litigation support workflow used by the DOJ and cooperating law firms.

Why this matters

The practical effect: Releasing images as PDFs means that no casual observer can review them. You need extraction software, classification tools, OCR engines, and significant storage just to see what’s in the pictures. This transforms a task that should be “look at the photos” into a multi-week engineering project.

3. The Extraction Pipeline

To make the images accessible, we built a multi-stage pipeline that extracts, classifies, and organizes every embedded image from every PDF in the release.

1. Extract: PyMuPDF opens each PDF and pulls out every embedded image as a PNG
2. Tag: CLIP AI vision model scores each image against 48 content labels
3. Rename: Files are renamed with their top labels embedded in the filename
4. Sort: Images are moved into category folders based on their primary classification
5. Sweep: Blank detection removes white/empty pages from each category
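
A minimal sketch of the Extract step, assuming a recent PyMuPDF release (imported as `pymupdf`; older code imports the same package as `fitz`). The function names `output_name` and `extract_images` are hypothetical, and the `_pNNN_iNNN` suffix is assumed to mean page number and image number, based on the filename shown under Filename Convention.

```python
from pathlib import Path


def output_name(pdf_stem: str, page_no: int, img_no: int) -> str:
    """Build an output filename; the p/i suffix layout is an assumption."""
    return f"{pdf_stem}_p{page_no:03d}_i{img_no:03d}.png"


def extract_images(pdf_path: str, out_dir: str) -> list[Path]:
    """Save every embedded image of one PDF as a PNG and return the paths."""
    import pymupdf  # PyMuPDF, imported lazily so the helper above has no dependency

    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with pymupdf.open(pdf_path) as doc:
        for page_no, page in enumerate(doc, start=1):
            # Each entry describes one image XObject; item 0 is its xref number.
            for img_no, img in enumerate(page.get_images(full=True), start=1):
                pix = pymupdf.Pixmap(doc, img[0])
                if pix.colorspace is None:
                    continue  # stencil masks cannot be written as PNG
                if pix.n - pix.alpha > 3:
                    # CMYK and similar: convert to RGB before saving as PNG
                    pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
                target = target_dir / output_name(Path(pdf_path).stem, page_no, img_no)
                pix.save(str(target))
                written.append(target)
    return written
```

Calling `extract_images("some.pdf", "out")` on a single-page PDF wrapper would typically produce one PNG whose name ends in `_p001_i001.png`.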

Filename Convention

Each extracted image gets a structured filename that encodes its classification and source:

deposition-transcript+adult-woman+affidavit__EFTA01609671_20260130_p001_i001.png
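
To show how such a filename can be taken apart again, here is a small sketch. It assumes the layout visible in the example above: labels joined by `+`, a double underscore, then four underscore-separated tokens. The field names (`source_id`, `stamp`, `page`, `image`) are our illustrative guesses at what each token means, not a documented schema.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ParsedName:
    labels: list[str]   # top classification labels, in filename order
    source_id: str      # assumed: identifier of the source PDF
    stamp: str          # assumed: a date-like token, kept as text
    page: int           # assumed: page number within the PDF
    image: int          # assumed: image number within the page


def parse_image_name(filename: str) -> ParsedName:
    """Split a structured image filename into labels and source tokens."""
    stem = Path(filename).stem
    label_part, _, source_part = stem.partition("__")
    source_id, stamp, page_tok, image_tok = source_part.split("_")
    return ParsedName(
        labels=label_part.split("+"),
        source_id=source_id,
        stamp=stamp,
        page=int(page_tok.lstrip("p")),
        image=int(image_tok.lstrip("i")),
    )


parsed = parse_image_name(
    "deposition-transcript+adult-woman+affidavit__EFTA01609671_20260130_p001_i001.png"
)
print(parsed.labels[0], parsed.source_id, parsed.page, parsed.image)
```

The first label is the one a sorter would most plausibly use as the primary category, which is why the labels are kept in their original order.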

Technical Details

Component | Detail
PDF parser | PyMuPDF (fitz) 1.27.1: extracts embedded image XObjects from each page
Vision model | OpenAI CLIP ViT-B/32: zero-shot image classification
Label set | 48 custom labels, tuned for legal/evidentiary content (see §4)
Blank detection | Pillow + NumPy: an image is blank if mean pixel > 252 and std < 3
Output format | PNG images, organized into category folders
Processing time | ~4 hours with CUDA GPU, ~40 hours CPU-only

4. AI Classification (CLIP)

Every extracted image is classified using CLIP (Contrastive Language–Image Pre-training), an AI model developed by OpenAI that can match images to text descriptions without being specifically trained on any particular category. We use the ViT-B/32 variant.

How it works

CLIP compares each image against a set of 48 custom text labels and returns a probability score for each one. The labels were specifically designed for this type of legal/evidentiary content:

scanned document typed page email screenshot text message screenshot deposition transcript court document legal filing affidavit bank statement invoice check receipt contact list passport id card business card handwritten note portrait photo group photo person adult man adult woman child indoor scene vehicle signature newspaper clipping spreadsheet + 20 more labels
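
The scoring step can be illustrated without loading the model. In zero-shot classification, the image-to-label similarity scores are scaled and passed through a softmax to give one probability per label. The sketch below uses made-up similarity values and a scale of 100, which is an assumption based on the commonly cited CLIP logit scale. The function names are hypothetical.

```python
import math


def label_probabilities(similarities: dict[str, float], scale: float = 100.0) -> dict[str, float]:
    """Turn image/label cosine similarities into softmax probabilities."""
    logits = {label: scale * sim for label, sim in similarities.items()}
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def top_labels(probs: dict[str, float], k: int = 3) -> list[str]:
    """Return the k labels with the highest probability, best first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


# Illustrative similarity values only; real values come from the model.
example = {
    "deposition transcript": 0.27,
    "email screenshot": 0.22,
    "portrait photo": 0.15,
}
probs = label_probabilities(example)
print(top_labels(probs))
```

Because the softmax is computed only over the supplied labels, every image receives a best label even when none of them fits well.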

Accuracy & Limitations

5. Content Breakdown: What’s in the Images

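Based on CLIP classification of the 949,123 extracted images, here is the high-level breakdown of what the Epstein files actually contain visually. The percentages are shares of the 705,552 images counted in the Content Type Distribution in this section, not of all 949,123 extracted images.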

Documents & Depositions: 52.4%
Email Screenshots: 34.7%
Financial Records: 2.4%
Text Messages: 1.3%
Photos of People: 0.4%
Other / Uncategorized: 8.8%

Content Type Distribution

Documents & depositions: 369,557
Email screenshots: 245,041
Other / uncategorized: 62,230
Financial records: 16,641
Text messages: 9,420
Photos of people / scenes: 2,663

What this tells us

Over 87% of all images are pages of typed text — depositions, emails, and legal documents that were printed or scanned and then stored as images inside PDFs. These are not “pictures” in any meaningful sense. They are text documents that have been rendered as pictures, making them unsearchable, uncopyable, and unindexable without OCR processing.

Actual photographs of people, places, and things make up less than 0.4% of the total. The Epstein files are overwhelmingly a text archive that has been converted into an image archive through format choices.
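
The percentages in this section can be recomputed from the Content Type Distribution counts. The sketch below uses only numbers from this page. The one assumption is the denominator: the published shares match when it is the sum of the six listed counts (705,552) rather than all 949,123 extracted images.

```python
counts = {
    "Documents & depositions": 369_557,
    "Email screenshots": 245_041,
    "Other / uncategorized": 62_230,
    "Financial records": 16_641,
    "Text messages": 9_420,
    "Photos of people / scenes": 2_663,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Share of images that are pages of typed text (documents plus emails).
typed_text_share = round(
    100 * (counts["Documents & depositions"] + counts["Email screenshots"]) / total, 1
)

for name, share in shares.items():
    print(f"{name}: {share}%")
print(f"Typed text: {typed_text_share}% of {total:,} images")
```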

6. Full Category Inventory

All 26 active categories with current file counts. Categories were created by the CLIP classification pipeline and have been consolidated through manual review.

deposition-transcript: 327,717
Emails Screenshots: 245,041
uncategorized: 27,345
scanned-document: 24,220
affidavit: 19,792
bank-statement: 16,641
court-document: 16,145
Text Messages: 9,420
book-page: 5,903
People: 2,263
Locations: 2,364

Complete Table

Category | Images | % of Total | Blanks Swept | Notes
deposition-transcript | 327,717 | 46.4% | 95,721 removed | Largest category: court depositions, legal testimony pages
Emails Screenshots | 245,041 | 34.7% | 15,825 removed | Email bodies rendered as images within PDFs
uncategorized | 27,345 | 3.9% | Pending | Images that didn’t strongly match any label
scanned-document | 24,220 | 3.4% | 15,651 removed | Generic scanned pages, high blank rate (64.6%)
affidavit | 19,792 | 2.8% | Pending | Sworn statements, witness declarations
bank-statement | 16,641 | 2.4% | Pending | Financial account statements, transaction records
court-document | 16,145 | 2.3% | Pending | Court filings, orders, motions
Text Messages & Screenshots | 9,420 | 1.3% | Pending | Text/SMS conversations captured as screenshots
book-page | 5,903 | 0.8% | Pending | Scanned book/publication pages
Q | 5,011 | 0.7% | Skipped (misc) | Miscellaneous overflow category
Locations | 2,364 | 0.3% | Skipped (small) | Maps, property images, location shots
People | 2,263 | 0.3% | Pending | Photographs containing identifiable persons
computer-screen | 1,387 | 0.2% | Pending | Screenshots of computer interfaces
chart | 1,024 | 0.1% | Pending | Graphs, diagrams, organizational charts
timeline | 463 | 0.1% | Pending | Chronological visualizations
conference-room | 400 | 0.1% | Pending | Meeting/office environment photos
Handwritten Notes | 201 | 0.03% | Small set | Handwritten documents, notes, annotations
newspaper-clipping | 126 | 0.02% | Pending | News articles, press clippings
ETC | 66 | <0.01% | Small set | Miscellaneous items that don’t fit other categories
+ 7 minor/empty categories | ~23 | <0.01% | | Graphics, Documents, Legal, Paperwork, Xxx, check, form

7. Blank Pages & Waste

Blank handling now has two distinct numbers: the initial image-level flags from pixel sweep, and the current confirmed blank bucket after OCR/reclassification. The initial sweep flagged 546,976 pages as blank-like; after reprocessing blank-labeled folders, the remaining confirmed blank bucket is 17,831 pages.

📌 Important update: Current blank buckets were re-audited from local workflow logs: initial blank flags = 546,976 (pixel sweep), organizer moved 241,686 files out of _NEEDS_OCR, and current bucket counts in _BLANK_PAGES are _blank=17,821 and _questionable=10 (17,831 total).

Initial Blank Flags: 546,976
Initial Flag Rate: 57.6%
Current Blank Bucket: 17,831
Current Blank Rate: 1.9%
Categories Swept: 47 / 47
Sweep Complete: 100%

The category table below is a historical image-level sweep snapshot and should be read as pre-reclassification detail, not the current blank residual.

How blank detection works

Each image is loaded as a pixel array. If the mean pixel value exceeds 252 (out of 255) and the standard deviation is below 3, the image is classified as blank. This catches pure white pages, near-white pages, and pages with only faint scanner artifacts. The threshold is conservative — genuinely blank pages typically have mean > 254 and std < 1.
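
The rule above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the exact sweep script: it treats the image as a single grayscale array and uses the two thresholds quoted above. The names `is_blank`, `MEAN_THRESHOLD`, and `STD_THRESHOLD` are hypothetical.

```python
import numpy as np

MEAN_THRESHOLD = 252.0  # mean pixel value must exceed this (scale 0 to 255)
STD_THRESHOLD = 3.0     # standard deviation must stay below this


def is_blank(pixels: np.ndarray) -> bool:
    """Return True when an image array is near-uniform and near-white."""
    values = np.asarray(pixels, dtype=np.float64)
    return bool(values.mean() > MEAN_THRESHOLD and values.std() < STD_THRESHOLD)


# A pure white page, and the same page with a dark block of "text".
white_page = np.full((100, 100), 255, dtype=np.uint8)
text_page = white_page.copy()
text_page[40:60, 10:90] = 0

print(is_blank(white_page), is_blank(text_page))
```

For a file on disk, the array could come from Pillow, for example `np.asarray(Image.open(path).convert("L"))`. Converting to grayscale first is our assumption. Note that a page with only a few faint characters can still pass both thresholds, which is consistent with the large gap between the initial flags and the confirmed blank bucket.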

Blank Page Breakdown

Category | Total Images | Blanks Found | Blank Rate | Status
deposition-transcript | 424,716 | 233,747 | 55.0% | ✓ Complete
email-screenshot | 261,902 | 186,309 | 71.1% | ✓ Complete
uncategorized | 28,544 | 26,975 | 94.5% | ✓ Complete
contact-list | 29,072 | 24,135 | 83.0% | ✓ Complete
typed-page | 22,896 | 18,736 | 81.8% | ✓ Complete
invoice | 43,229 | 15,457 | 35.8% | ✓ Complete
affidavit | 19,856 | 12,380 | 62.3% | ✓ Complete
bank-statement | 16,658 | 8,300 | 49.8% | ✓ Complete
legal-filing | 24,847 | 7,226 | 29.1% | ✓ Complete
court-document | 16,253 | 6,895 | 42.4% | ✓ Complete
scanned-document | 25,013 | 139 | 0.6% | ✓ Complete
+ 36 other categories | 36,137 | 6,677 | 18.5% | ✓ Complete
TOTAL | 949,123 | 546,976 | 57.6% | ✓ Complete

🔴 In the initial pixel sweep, the email-screenshot category was flagged 71.1% blank: nearly three-quarters of all email screenshot images registered as empty pages. The “uncategorized” folder was flagged 94.5% blank. The deposition-transcript category, the single largest in the entire release, was flagged 55% blank, so more than half of its page images registered as empty paper in the “official release” before the OCR/reclassification pass.

What 17,831 confirmed blank pages means in practice: if a researcher is opening PDFs to review the files, the current blank residual is roughly 1 in 53 extracted images. At 10 seconds per page to open, inspect, and close, that is ~49.5 hours of wasted time, or about 6 eight-hour working days. That is still significant, but far below the original image-level blank estimate.

8. Cleanup: Duplicates, Tiny Files, Corrupt Images

Beyond blank pages, the extraction process identifies several other categories of problematic files that inflate the release without adding content.

Current Blank Pages: 17,831
Duplicates: 250,243
Bad / Corrupt: 3,554
Tiny (<5KB): 697
Verified Keepers: 689

What each category means

Category | Count | Description
Blank pages | 17,831 | Current confirmed blank residual after OCR/reclassification of previously blank-labeled buckets. (Initial image-level flags were 546,976.)
Duplicates | 250,243 | Duplicate images identified via perceptual hash (dHash) matching. Same visual content stored under different filenames across datasets. Moved to DUPLICATES folder.
Bad / corrupt | 3,554 | Images that failed extraction or are corrupted: zero-byte files, truncated PNGs, images that won’t load. Moved to BLANKS BADS folder.
Tiny files (<5KB) | 697 | Images under 5KB, typically 1x1 pixel spacers, tiny icons, or PDF artifacts. Too small to contain meaningful content. Moved to TINY folder.
Verified keepers | 689 | Images manually reviewed and confirmed as significant/useful content. Moved to KEEPERS folder for priority access.
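
For readers unfamiliar with dHash, here is a minimal sketch of the general technique. It assumes the image has already been reduced to an 8-row by 9-column grayscale grid, for example with Pillow's `convert("L").resize((9, 8))`. The function names are hypothetical, and the distance below which two images count as duplicates is not stated on this page, so none is hard-coded here.

```python
import numpy as np


def dhash_bits(gray: np.ndarray) -> int:
    """Compute a 64-bit difference hash from an 8x9 grayscale grid."""
    grid = np.asarray(gray, dtype=np.int16)
    if grid.shape != (8, 9):
        raise ValueError("expected an array of shape (8, 9)")
    # One bit per horizontal neighbour pair: is the right pixel brighter?
    brighter = grid[:, 1:] > grid[:, :-1]
    bits = 0
    for flag in brighter.flatten():
        bits = (bits << 1) | int(flag)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


# A left-to-right gradient, the same gradient slightly brighter, and a flat page.
gradient = np.tile(np.arange(9) * 10, (8, 1))
brighter_copy = gradient + 5
flat = np.zeros((8, 9), dtype=np.uint8)

print(hamming_distance(dhash_bits(gradient), dhash_bits(brighter_copy)))
print(hamming_distance(dhash_bits(gradient), dhash_bits(flat)))
```

Because the hash encodes only whether each pixel is brighter than its left neighbour, a uniform brightness shift leaves it unchanged. That is what lets the same visual content be matched across differently encoded files.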

Total waste calculation

Adding up all identified waste: 17,831 blanks + 250,243 duplicates + 3,554 corrupt + 697 tiny = 272,325 junk files out of 949,123 extracted images. That’s 28.7% of extracted images. After removing this identified waste, approximately 676,798 pages remain.
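
The arithmetic can be checked directly from the counts on this page. The sketch assumes, as the sum above does, that the four waste categories do not overlap. If a file were counted in two categories, the true total would be lower.

```python
extracted = 949_123
waste = {
    "blank": 17_831,
    "duplicate": 250_243,
    "corrupt": 3_554,
    "tiny": 697,
}

total_waste = sum(waste.values())
remaining = extracted - total_waste
waste_share = round(100 * total_waste / extracted, 1)

print(f"{total_waste:,} junk files ({waste_share}%), {remaining:,} remaining")
```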

9. Which Datasets Contain What

The 12 datasets vary enormously in their image content. Based on our scanning and classification work so far, here is what we know about the visual content of each dataset:

Dataset | Total Files | Archive Size | Primary Image Content
DS1 | 3,167 | 1.23 GB | Mixed: legal filings, correspondence pages, some scanned documents
DS2 | 579 | 0.62 GB | Correspondence, some financial document pages
DS3 | 70 | 0.58 GB | Legal filings, deposition pages, higher page-per-PDF ratio
DS4 | 155 | 0.34 GB | Property records, legal documents
DS5–DS7 | 159 | 0.20 GB | Small supplementary sets: mixed legal/correspondence
DS8 | 10,595 | 9.95 GB | MCC prison records: guard check sheets, incident reports, facility photos, surveillance-related docs
DS9 | 254,477 | 137.90 GB | Largest archive: Lesley Groff email images, Richard Kahn correspondence, bulk email screenshots. Not yet fully classified
DS10 | 302,600 | 78.64 GB | Concordance/IPRO format: single-page scanned images in PDF wrappers. Heavy deposition content. Highest blank page rate expected.
DS11 | 331,997 | 25.56 GB | Bulk email archive, mostly email screenshots. Largest dataset by file count.
DS12 | 155 | 0.11 GB | Supplementary documents

Where the images are concentrated

DS11 (bulk email): 331,997
DS10 (Concordance/IPRO): 302,600
DS9 (Groff archive): 254,477
DS8 (MCC records): 10,595
DS1–7 + DS12 combined: 4,325

Three datasets (DS9, DS10, DS11) account for 98.4% of all files. The remaining nine datasets combined are a rounding error in terms of image volume.

10. Current Processing Status

All major pipeline stages are complete. Here is the final state of each:

Pipeline Stage | Status | Detail
PDF image extraction | ✓ Complete | 949,123 images extracted from all 12 datasets
CLIP classification | ✓ Complete | All images tagged and sorted into 47 categories
Blank page sweep | ✓ Complete | All 47 categories swept: 546,976 initial blank flags, with 17,831 current confirmed blank residual after OCR/reclassification
Duplicate detection | ✓ Complete | 250,243 duplicates identified via perceptual hash (dHash) matching
Corrupt/tiny cleanup | ✓ Complete | 3,554 corrupt + 697 tiny files isolated
Text extraction | ✓ Complete | 343 million words extracted from all datasets (including 40.3M from deep sweep recovery)
Person scan | ✓ Complete | 27 persons scanned across all 12 datasets
Manual review (keepers) | ✓ Complete | 689 images verified and flagged as significant

✅ Processing complete. All pipeline stages have finished. The full results are searchable and browsable on this site. Current identified waste: 272,325 files (28.7% of extracted images), with blank residual now at 17,831 files after the reclassification pass.


Image analysis based on extraction and CLIP classification of 949,123 images from 900,956 PDFs.
Classification: OpenAI CLIP ViT-B/32. Extraction: PyMuPDF. Blank detection: Pillow + NumPy.
All scripts and methodology are open source and reproducible.
