1. Overview: Images in the Epstein Files
The official Epstein file releases contain 900,956 files across 12 datasets. Nearly every one of these files is a PDF — and inside those PDFs are embedded images: photographs, screenshots, scanned pages, charts, and hundreds of thousands of blank white rectangles.
Our pipeline has extracted and individually classified 949,123 images from these PDFs, sorted them into 47 content categories using an AI vision model, and initially flagged 546,976 blank-like pages by pixel analysis. After OCR/reclassification of the blank-labeled buckets (reprocessed pages were moved to _NEEDS_OCR), the current confirmed blank bucket is 17,831 pages (~1.9% of all extracted images).
2. Why Are Images Stored as PDFs?
This is one of the most important questions about the release format, and the answer says more about the releasing agency than it does about the files themselves.
What happened
Photographs, screenshots, scanned documents, text messages, business cards, and handwritten notes were all individually wrapped in PDF containers before release. A simple 200KB JPEG photograph becomes a 1.5MB single-page PDF. A screenshot of a text conversation that could be a PNG file becomes a PDF that requires a PDF reader to open.
What the original files were
These images originally existed as standard image files — .jpg,
.png, .tiff, .bmp — or as embedded images
within email clients, litigation databases, and forensic imaging tools. They were converted
to individual PDF pages as part of the Concordance/IPRO litigation support workflow used
by the DOJ and cooperating law firms.
Why this matters
- You can’t search images inside PDFs. Image-based PDFs don’t have searchable text. Every page requires OCR (Optical Character Recognition) before it becomes searchable — a process that takes seconds per page and hours at scale.
- You can’t thumbnail or preview PDFs easily. Standard file browsers can display image thumbnails instantly. PDFs require rendering, which makes visual scanning of large collections impractical without specialized tools.
- File sizes balloon. The PDF wrapper adds overhead to every file. Across 900,000+ images, this adds up to hundreds of gigabytes of wasted space.
- Metadata is lost. Original image files contain EXIF data — camera model, date taken, GPS coordinates, orientation. PDF conversion strips all of this. For evidentiary photographs, this metadata loss is significant.
- It fragments multi-page documents. A 47-page scanned deposition becomes 47 individual PDF files, each containing a single page image. Reassembling the original document requires knowing which page numbers belong together — information that is often only available in the proprietary Concordance load file.
3. The Extraction Pipeline
To make the images accessible, we built a multi-stage pipeline that extracts, classifies, and organizes every embedded image from every PDF in the release.
Filename Convention
Each extracted image gets a structured filename that encodes its classification and source:
deposition-transcript+adult-woman+affidavit__EFTA01609671_20260130_p001_i001.png
- Primary category + secondary tags — what the AI thinks the image contains
- EFTA ID — the original Bates/document identifier from the release
- Date, page, image number — extraction metadata
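As an illustration, the convention above can be parsed with a short regular expression. This is our own sketch: the field layout (labels joined by `+`, then the EFTA ID, date, page, and image index separated by underscores) is inferred from the example filename, and the regex and `parse_name` helper are hypothetical, not part of the pipeline's actual code.

```python
import re

# Hypothetical parser for the filename convention shown above.
# Layout inferred from the example:
#   labels__EFTAnnnnnnnn_YYYYMMDD_pNNN_iNNN.png
FNAME_RE = re.compile(
    r"^(?P<labels>[^_]+(?:\+[^_]+)*)"  # one or more '+'-joined category labels
    r"__(?P<efta>EFTA\d+)"             # original Bates/document identifier
    r"_(?P<date>\d{8})"                # extraction date, YYYYMMDD
    r"_p(?P<page>\d{3})"               # source page number within the PDF
    r"_i(?P<image>\d{3})"              # image index on that page
    r"\.png$"
)

def parse_name(name: str) -> dict:
    """Split an extracted-image filename into its encoded fields."""
    m = FNAME_RE.match(name)
    if not m:
        raise ValueError(f"unrecognized filename: {name}")
    fields = m.groupdict()
    fields["labels"] = fields["labels"].split("+")
    return fields
```

Applied to the example filename, this yields the three labels, the EFTA ID `EFTA01609671`, the date, and the page/image indices as separate fields.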
Technical Details
| Component | Detail |
|---|---|
| PDF parser | PyMuPDF (fitz) 1.27.1 — extracts embedded image XObjects from each page |
| Vision model | OpenAI CLIP ViT-B/32 — zero-shot image classification |
| Label set | 48 custom labels — tuned for legal/evidentiary content (see §4) |
| Blank detection | Pillow + NumPy — mean pixel > 252 & std < 3 = blank |
| Output format | PNG images, organized into category folders |
| Processing time | ~4 hours with CUDA GPU, ~40 hours CPU-only |
4. AI Classification (CLIP)
Every extracted image is classified using CLIP (Contrastive Language–Image
Pre-training), an AI model developed by OpenAI that can match images to text descriptions
without being specifically trained on any particular category. We use the
ViT-B/32 variant.
How it works
CLIP compares each image against a set of 48 custom text labels and returns a probability score for each one. The labels were designed specifically for this type of legal/evidentiary content.
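Mechanically, the zero-shot step reduces to cosine similarity between one image embedding and the 48 label-text embeddings, followed by a temperature-scaled softmax. The sketch below shows that scoring logic on plain NumPy vectors; in the real pipeline the embeddings would come from the ViT-B/32 image and text encoders, and `zero_shot_scores` is our illustrative name, not pipeline code.

```python
import numpy as np


def zero_shot_scores(image_emb: np.ndarray, label_embs: np.ndarray,
                     temperature: float = 100.0) -> np.ndarray:
    """CLIP-style zero-shot scoring.

    image_emb:  (D,) embedding of one image
    label_embs: (N, D) embeddings of the N text labels
    Returns an (N,) probability vector over the labels. CLIP scales the
    cosine similarities by a learned logit scale (around 100 after training),
    which is what `temperature` stands in for here.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # scaled cosine similarities
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()
```

The top 3 to 5 indices of the returned vector would then become the `+`-joined labels in the filename convention from §3.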
Accuracy & Limitations
- CLIP is good at broad categorization — it reliably distinguishes email screenshots from deposition pages from photographs
- It struggles with similar document types — an affidavit vs. a legal filing vs. a court document can look nearly identical, so these categories have overlap
- Multi-label tagging helps — each image gets its top 3–5 labels encoded in the filename, so an image classified as email-screenshot+contact-list captures both aspects
- Human review is still needed — the AI classification is a sorting tool, not a final determination. Critical images should always be verified
5. Content Breakdown: What’s in the Images
Based on CLIP classification of 949,123 extracted images, here is the high-level breakdown of what the Epstein files actually contain visually:
Content Type Distribution
What this tells us
Over 87% of all images are pages of typed text — depositions, emails, and legal documents that were printed or scanned and then stored as images inside PDFs. These are not “pictures” in any meaningful sense. They are text documents that have been rendered as pictures, making them unsearchable, uncopyable, and unindexable without OCR processing.
Actual photographs of people, places, and things make up less than 0.4% of the total. The Epstein files are overwhelmingly a text archive that has been converted into an image archive through format choices.
6. Full Category Inventory
All 26 active categories with current file counts. Categories were created by the CLIP classification pipeline and have been consolidated through manual review.
Complete Table
| Category | Images | % of Total | Blanks Swept | Notes |
|---|---|---|---|---|
| deposition-transcript | 327,717 | 46.4% | 95,721 removed | Largest category — court depositions, legal testimony pages |
| Emails Screenshots | 245,041 | 34.7% | 15,825 removed | Email bodies rendered as images within PDFs |
| uncategorized | 27,345 | 3.9% | Pending | Images that didn’t strongly match any label |
| scanned-document | 24,220 | 3.4% | 15,651 removed | Generic scanned pages — high blank rate (64.6%) |
| affidavit | 19,792 | 2.8% | Pending | Sworn statements, witness declarations |
| bank-statement | 16,641 | 2.4% | Pending | Financial account statements, transaction records |
| court-document | 16,145 | 2.3% | Pending | Court filings, orders, motions |
| Text Messages & Screenshots | 9,420 | 1.3% | Pending | Text/SMS conversations captured as screenshots |
| book-page | 5,903 | 0.8% | Pending | Scanned book/publication pages |
| Q | 5,011 | 0.7% | Skipped (misc) | Miscellaneous overflow category |
| Locations | 2,364 | 0.3% | Skipped (small) | Maps, property images, location shots |
| People | 2,263 | 0.3% | Pending | Photographs containing identifiable persons |
| computer-screen | 1,387 | 0.2% | Pending | Screenshots of computer interfaces |
| chart | 1,024 | 0.1% | Pending | Graphs, diagrams, organizational charts |
| timeline | 463 | 0.1% | Pending | Chronological visualizations |
| conference-room | 400 | 0.1% | Pending | Meeting/office environment photos |
| Handwritten Notes | 201 | 0.03% | Small set | Handwritten documents, notes, annotations |
| newspaper-clipping | 126 | 0.02% | Pending | News articles, press clippings |
| ETC | 66 | <0.01% | Small set | Miscellaneous items that don’t fit other categories |
| + 7 minor/empty categories | ~23 | <0.01% | n/a | Graphics, Documents, Legal, Paperwork, Xxx, check, form |
7. Blank Pages & Waste
Blank handling now has two distinct numbers: the initial image-level flags from the pixel sweep, and the current confirmed blank bucket after OCR/reclassification. The initial sweep flagged 546,976 pages as blank-like; after reprocessing the blank-labeled folders, the remaining confirmed blank bucket is 17,831 pages. Reprocessed pages were moved to _NEEDS_OCR, and the current bucket counts in _BLANK_PAGES are _blank=17,821 and _questionable=10 (17,831 total).
The category table below is a historical image-level sweep snapshot and should be read as pre-reclassification detail, not the current blank residual.
How blank detection works
Each image is loaded as a pixel array. If the mean pixel value exceeds 252 (out of 255) and the standard deviation is below 3, the image is classified as blank. This catches pure white pages, near-white pages, and pages with only faint scanner artifacts. The threshold is conservative — genuinely blank pages typically have mean > 254 and std < 1.
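A minimal implementation of that rule, assuming Pillow and NumPy as listed in the §3 table; the thresholds are the ones stated above, while the `is_blank` helper name is our own illustration.

```python
import numpy as np
from PIL import Image

MEAN_THRESHOLD = 252  # out of 255: the page is near-white on average
STD_THRESHOLD = 3     # almost no pixel variation anywhere on the page


def is_blank(img: Image.Image) -> bool:
    """Blank-page test described above: grayscale mean > 252 and std < 3."""
    px = np.asarray(img.convert("L"), dtype=np.float64)
    return bool(px.mean() > MEAN_THRESHOLD and px.std() < STD_THRESHOLD)
```

The standard deviation check is what makes the rule robust: a page that is mostly white but carries even a small block of text keeps a high mean yet fails the std test, so it is not flagged.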
Blank Page Breakdown
| Category | Total Images | Blanks Found | Blank Rate | Status |
|---|---|---|---|---|
| deposition-transcript | 424,716 | 233,747 | 55.0% | ✓ Complete |
| email-screenshot | 261,902 | 186,309 | 71.1% | ✓ Complete |
| uncategorized | 28,544 | 26,975 | 94.5% | ✓ Complete |
| contact-list | 29,072 | 24,135 | 83.0% | ✓ Complete |
| typed-page | 22,896 | 18,736 | 81.8% | ✓ Complete |
| invoice | 43,229 | 15,457 | 35.8% | ✓ Complete |
| affidavit | 19,856 | 12,380 | 62.3% | ✓ Complete |
| bank-statement | 16,658 | 8,300 | 49.8% | ✓ Complete |
| legal-filing | 24,847 | 7,226 | 29.1% | ✓ Complete |
| court-document | 16,253 | 6,895 | 42.4% | ✓ Complete |
| scanned-document | 25,013 | 139 | 0.6% | ✓ Complete |
| + 36 other categories | 36,137 | 6,677 | 18.5% | ✓ Complete |
| TOTAL | 949,123 | 546,976 | 57.6% | ✓ Complete |
What 17,831 confirmed blank pages means in practice: If a researcher is opening PDFs to review the files, the current blank residual is roughly 1 in 53 files. At 10 seconds per file to open, inspect, and close: that’s ~49.5 hours of wasted time — about 6 working days — still significant, but far below the original image-level blank estimate.
8. Cleanup: Duplicates, Tiny Files, Corrupt Images
Beyond blank pages, the extraction process identifies several other categories of problematic files that inflate the release without adding content.
What each category means
| Category | Count | Description |
|---|---|---|
| Blank pages | 17,831 | Current confirmed blank residual after OCR/reclassification of previously blank-labeled buckets. (Initial image-level flags were 546,976.) |
| Duplicates | 250,243 | Duplicate images identified via perceptual hash (dHash) matching. Same visual content stored under different filenames across datasets. Moved to DUPLICATES folder. |
| Bad / corrupt | 3,554 | Images that failed extraction or are corrupted — zero-byte files, truncated PNGs, images that won’t load. Moved to BLANKS BADS folder. |
| Tiny files (<5KB) | 697 | Images under 5KB — typically 1x1 pixel spacers, tiny icons, or PDF artifacts. Too small to contain meaningful content. Moved to TINY folder. |
| Verified keepers | 689 | Images manually reviewed and confirmed as significant/useful content. Moved to KEEPERS folder for priority access. |
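The duplicate pass above names perceptual hashing (dHash). A minimal sketch of that technique follows: shrink to a tiny grayscale grid, record whether each pixel is brighter than its right-hand neighbor, and pack the bits into a 64-bit fingerprint. The helper names are ours; the pipeline's actual matching code may differ in details such as the distance cutoff.

```python
import numpy as np
from PIL import Image


def dhash(img: Image.Image, hash_size: int = 8) -> int:
    """Difference hash: (hash_size+1) x hash_size grayscale thumbnail,
    one bit per horizontal gradient sign. 64 bits for hash_size=8."""
    small = img.convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
    px = np.asarray(small, dtype=np.int16)
    bits = px[:, 1:] > px[:, :-1]  # True where pixel > left neighbor's value
    return int("".join("1" if b else "0" for b in bits.flatten()), 2)


def hamming(a: int, b: int) -> int:
    """Number of differing bits; small distances indicate near-duplicates."""
    return bin(a ^ b).count("1")
```

Because the hash survives rescaling and recompression, the same photograph stored under different filenames in different datasets collapses to the same (or a near-identical) fingerprint, which is how visually identical content can be grouped across 12 separate releases.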
Total waste calculation
Adding up all identified waste: 17,831 blanks + 250,243 duplicates + 3,554 corrupt + 697 tiny = 272,325 junk files out of 949,123 extracted images. That’s 28.7% of extracted images. After removing this identified waste, approximately 676,798 pages remain.
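The totals above can be checked mechanically (all figures are the ones quoted in this section):

```python
# Sanity check of the waste arithmetic: blanks + duplicates + corrupt + tiny.
blanks, duplicates, corrupt, tiny = 17_831, 250_243, 3_554, 697
extracted = 949_123

junk = blanks + duplicates + corrupt + tiny
remaining = extracted - junk

assert junk == 272_325
assert round(100 * junk / extracted, 1) == 28.7
assert remaining == 676_798
```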
9. Which Datasets Contain What
The 12 datasets vary enormously in their image content. Based on our scanning and classification work so far, here is what we know about the visual content of each dataset:
| Dataset | Total Files | Archive Size | Primary Image Content |
|---|---|---|---|
| DS1 | 3,167 | 1.23 GB | Mixed — legal filings, correspondence pages, some scanned documents |
| DS2 | 579 | 0.62 GB | Correspondence, some financial document pages |
| DS3 | 70 | 0.58 GB | Legal filings, deposition pages — higher page-per-PDF ratio |
| DS4 | 155 | 0.34 GB | Property records, legal documents |
| DS5–DS7 | 159 | 0.20 GB | Small supplementary sets — mixed legal/correspondence |
| DS8 | 10,595 | 9.95 GB | MCC prison records — guard check sheets, incident reports, facility photos, surveillance-related docs |
| DS9 | 254,477 | 137.90 GB | Largest archive — Lesley Groff email images, Richard Kahn correspondence, bulk email screenshots. Not yet fully classified |
| DS10 | 302,600 | 78.64 GB | Concordance/IPRO format — single-page scanned images in PDF wrappers. Heavy deposition content. Highest blank page rate expected. |
| DS11 | 331,997 | 25.56 GB | Bulk email archive — mostly email screenshots. Largest dataset by file count. |
| DS12 | 155 | 0.11 GB | Supplementary documents |
Where the images are concentrated
Three datasets (DS9, DS10, DS11) account for 98.4% of all files. The remaining nine datasets combined are a rounding error in terms of image volume.
10. Current Processing Status
All major pipeline stages are complete. Here is the final state of each:
| Pipeline Stage | Status | Detail |
|---|---|---|
| PDF image extraction | ✓ Complete | 949,123 images extracted from all 12 datasets |
| CLIP classification | ✓ Complete | All images tagged and sorted into 47 categories |
| Blank page sweep | ✓ Complete | All 47 categories swept — 546,976 initial blank flags, with 17,831 current confirmed blank residual after OCR/reclassification |
| Duplicate detection | ✓ Complete | 250,243 duplicates identified via perceptual hash (dHash) matching |
| Corrupt/tiny cleanup | ✓ Complete | 3,554 corrupt + 697 tiny files isolated |
| Text extraction | ✓ Complete | 343 million words extracted from all datasets (including 40.3M from deep sweep recovery) |
| Person scan | ✓ Complete | 27 persons scanned across all 12 datasets |
| Manual review (keepers) | ✓ Complete | 689 images verified and flagged as significant |
THE PDF FILES
Image analysis based on extraction and CLIP classification of 949,123 images from 900,956 PDFs.
Classification: OpenAI CLIP ViT-B/32. Extraction: PyMuPDF. Blank detection: Pillow + NumPy.
All scripts and methodology are open source and reproducible.