
FILE ANALYSIS & FINDINGS

A detailed breakdown of the official Epstein file releases — what’s actually in them, how they were formatted, what it takes to process them, and what the numbers reveal about the DOJ’s approach to “transparency.”

Last updated: February 2026 · Based on analysis of 904,395 files across 12 official datasets

1. The Scale of the Release

The DOJ released Jeffrey Epstein’s files across 12 separate ZIP/TAR archives totaling approximately 461 GB of compressed data, expanding to over 1.86 TB on disk. The release contains 904,395 individual files, of which 900,956 (99.6%) are PDFs.

904,395
Total Files
~461 GB
Compressed Archives
1.86 TB
Expanded on Disk
12
Separate Datasets
99.6%
Files Are PDFs
835,718
Extracted Images

To put this in perspective: if you spent 30 seconds reviewing each document, working 8 hours a day with no breaks, it would take 2.6 years to get through all of them. The format choices made by the releasing agency ensure this is exactly what would be required for any meaningful review.
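The 2.6-year figure follows directly from the stated assumptions (30 seconds per document, 8-hour days, no days off):

```python
# Back-of-the-envelope check of the review-time claim.
TOTAL_FILES = 904_395
SECONDS_PER_DOC = 30              # assumed review time per document
WORK_SECONDS_PER_DAY = 8 * 3600   # 8-hour day, no breaks

total_hours = TOTAL_FILES * SECONDS_PER_DOC / 3600
work_days = TOTAL_FILES * SECONDS_PER_DOC / WORK_SECONDS_PER_DAY
years = work_days / 365           # reviewing every single day of the year

print(f"{total_hours:,.0f} hours = {work_days:,.0f} workdays = ~{years:.1f} years")
```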

2. Format & Delivery Choices

Everything Is a PDF

Nearly every file in the release — 99.6% — is a PDF. This includes emails, text messages, screenshots, photographs, handwritten notes, spreadsheets, legal filings, and blank pages. Each individual email is its own separate PDF file. A conversation thread that might be a single scrollable email in your inbox becomes 5–15 individual PDF files that must be opened, read, and cross-referenced manually.

Why Not Just Release the Native Formats?

The emails in these files originally existed as .msg or .eml files, or inside PST archives. The images existed as .jpg and .png files. The spreadsheets were .xls or .csv. Converting everything to individual PDF files fragments conversation threads across dozens of separate documents, inflates file sizes several-fold, and discards the native structure that would make the material searchable, sortable, and machine-readable.

Images Stored as PDFs

Photographs, screenshots, scanned documents, and even business cards are all wrapped in PDF containers. A simple JPEG photograph that would be 200KB becomes a 1.5MB PDF. Our extraction pipeline pulled 835,718 images out of these PDF wrappers. Of those, over 564,000 appeared blank at the pixel level. However, our Deep Sweep later recovered hidden text from 400,913 of those pages — they were blank as images but contained embedded text layers that yielded 40.3 million additional words.

Concordance / IPRO Load File Format

Dataset 10 (302,600 files) uses Concordance/IPRO litigation support format. This is an industry tool used by law firms for document review — it assigns each page a Bates number and stores it as a single-page TIFF or PDF. A 47-page deposition becomes 47 individual files named EPSTEIN-00384721.pdf through EPSTEIN-00384767.pdf. Reassembling documents from this format requires the accompanying load file (DAT/OPT), which is not always included or is itself in a proprietary format.

⚠ Format note: The Concordance/IPRO format is specifically designed for billing-by-the-hour legal review. It is not designed for public transparency. The choice to release files in this format — rather than converting to merged, searchable PDFs — is itself a significant editorial decision.
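Reassembly is mechanical once the load file is present. The sketch below regroups single-page entries into documents using the document-break flag of a standard Opticon/OPT load file; the field layout shown is the common OPT convention, and the sample rows reuse the Bates range quoted above for illustration only:

```python
import csv
from io import StringIO

def group_opt_documents(opt_text: str):
    """Group single-page OPT entries into documents.

    Each OPT line is: ImageKey,Volume,ImagePath,DocBreak,FolderBreak,BoxBreak,PageCount
    A 'Y' in the DocBreak field marks the first page of a new document.
    """
    documents, current = [], []
    for row in csv.reader(StringIO(opt_text)):
        if not row:
            continue
        image_key, _volume, path, doc_break = row[0], row[1], row[2], row[3]
        if doc_break.upper() == "Y" and current:
            documents.append(current)   # close the previous document
            current = []
        current.append((image_key, path))
    if current:
        documents.append(current)
    return documents

# Illustrative sample: a 2-page document followed by a 1-page document.
sample = """\
EPSTEIN-00384721,VOL001,IMAGES/EPSTEIN-00384721.pdf,Y,,,2
EPSTEIN-00384722,VOL001,IMAGES/EPSTEIN-00384722.pdf,,,,
EPSTEIN-00384723,VOL001,IMAGES/EPSTEIN-00384723.pdf,Y,,,1
"""
print([len(d) for d in group_opt_documents(sample)])  # → [2, 1]
```

Without the OPT file, that grouping information is simply gone, and every page is an orphan.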

3. Content Quality Breakdown

We performed a systematic analysis of the file contents by manually categorizing a 500-document sample from Dataset 11 (the largest dataset, at 331,997 files). Because the files are released in numbered batches, we reviewed contiguous ID windows (plus spot checks outside those windows) so that adjacent files could supply missing thread context. Results:

43%
Admin / Scheduling
38%
Substantive Content
18%
Noise / Junk

What “Admin / Scheduling” Means (43%)

Nearly half the files are scheduling emails, appointment confirmations, flight arrangements, and logistical coordination. These are emails like: “Haircut at 2pm. Vegan lunch with Mort Zuckerman at 12:30. Car to Teterboro at 4.” While even scheduling details can be evidentiary (they prove who was where, when), the volume of mundane logistics dramatically dilutes the substantive material.

What “Substantive” Means (38%)

The genuinely significant material includes financial records, legal correspondence, witness depositions, wire transfer records, property documents, communications between Epstein and high-profile associates, and documents from criminal proceedings.

What “Noise” Means (18%)

Nearly one-fifth of the files are genuinely useless for investigative purposes: blank pages, cover sheets, fax headers, duplicate transmissions, auto-forwarded news articles, and personal messages with no evidentiary value (e.g., “Miss you! Good night xxxxoo” texts between Epstein and unnamed contacts).

Document Type Distribution

Across all datasets, approximately 96% of files are emails or email-derived content (email bodies, attachments, forwarded messages). The average document contains approximately 1,492 characters of extractable text — about half a printed page.

Emails & email content
~96%
Legal filings / court docs
~2.2%
Financial records
~1%
Photos / media
~0.5%
Other (forms, IDs, etc.)
~0.3%

4. Image Extraction & Classification

Because every document is a PDF — including photographs and screenshots — we built an image extraction pipeline that pulls embedded images out of every PDF and classifies them using CLIP (Contrastive Language–Image Pre-training), an AI vision model. This allows us to categorize the visual contents of the release without opening each PDF manually.

835,718
Images Extracted
47
Category Labels
564,807
Blank Pages Found
67.6%
Blank Rate

Top Categories by Image Count

Images classified by CLIP into descriptive categories. The top two categories alone account for 68.5% of all images.

deposition-transcript
327,718
email-screenshot
245,042
invoice
42,973
contact-list
28,999
uncategorized
27,346
legal-filing
24,595
scanned-document
24,221
typed-page
21,975
affidavit
19,792
bank-statement
16,641
court-document
16,145
text-message-screenshot
9,420
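Zero-shot CLIP classification works by embedding each image and one text prompt per label, then picking the label whose embedding is most similar to the image's. A sketch using the Hugging Face transformers CLIP API (ViT-B/32, as noted in the footer); the prompt template is illustrative, not the project's exact wording:

```python
def build_prompts(labels: list[str]) -> list[str]:
    """Turn hyphenated category labels into natural-language prompts for CLIP."""
    return [f"a photo of {label.replace('-', ' ')}" for label in labels]

def classify(image_path: str, labels: list[str]) -> str:
    """Return the best-matching label for one image (downloads the model on first use)."""
    # Imports are deferred so the heavy model is only loaded when classifying.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=build_prompts(labels),
                       images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image   # image-to-text similarity scores
    return labels[logits.argmax().item()]
```

In production the model and labels would be loaded once and images batched; per-image loading as shown is only for clarity.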

Full Category Inventory

Category                  Images    % of Total
deposition-transcript     327,718   39.2%
email-screenshot          245,042   29.3%
invoice                   42,973    5.1%
contact-list              28,999    3.5%
uncategorized             27,346    3.3%
legal-filing              24,595    2.9%
scanned-document          24,221    2.9%
typed-page                21,975    2.6%
affidavit                 19,792    2.4%
bank-statement            16,641    2.0%
court-document            16,145    1.9%
text-message-screenshot   9,420     1.1%
book-page                 5,907     0.7%
spreadsheet               5,075     0.6%
person                    1,853     0.2%
computer-screen           1,387     0.2%
indoor-scene              1,064     0.1%
chart                     1,024     0.1%
phone-screen              822       0.1%
business-card             730       0.1%
signature                 727       0.1%
letter                    693       0.1%
portrait-photo            346       0.04%
passport                  272       0.03%
handwritten-note          201       0.02%
group-photo               175       0.02%
newspaper-clipping        126       0.02%
id-card                   97        0.01%
receipt                   64        <0.01%
check                     56        <0.01%
+ 16 minor categories     ~5,700    0.7%

5. Blank Pages & Waste

A significant portion of the release consists of completely blank pages — white rectangles stored as full PDF documents with filenames, file system entries, and archive space. These are not redacted pages (which would typically show redaction bars or stamps) — they are genuinely empty pages.

564,807
Blank Pages Found
67.6%
of All Images
3,554
Bad/Corrupt Images
250,243
Duplicates
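Detecting blank pages programmatically is straightforward once the images are extracted: flag any page where nearly every pixel is near-white. A minimal sketch with Pillow and NumPy; the 250 intensity cutoff and 99.5% threshold are illustrative tuning values, not the project's exact parameters:

```python
import numpy as np
from PIL import Image

def is_blank(image_path: str,
             white_cutoff: int = 250,
             min_fraction: float = 0.995) -> bool:
    """Return True if nearly every pixel is near-white (a visually blank page)."""
    gray = np.asarray(Image.open(image_path).convert("L"))  # grayscale array
    near_white = (gray >= white_cutoff).mean()              # fraction of white pixels
    return near_white >= min_fraction
```

The thresholds matter: scanner noise and compression artifacts mean a "blank" page is rarely pure 255-white, so an exact-white test would miss most of them.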

Blank Page Breakdown by Source

Source Category           Scanned    Blanks Found   Blank %   Status
deposition-transcript     424,716    233,747        55.0%     ✓ Complete
email-screenshot          261,902    186,309        71.1%     ✓ Complete
uncategorized             28,544     26,975         94.5%     ✓ Complete
contact-list              29,072     24,135         83.0%     ✓ Complete
typed-page                22,896     18,736         81.8%     ✓ Complete
invoice                   43,229     15,457         35.8%     ✓ Complete
+ 41 other categories     138,764    59,448         42.8%     ✓ Complete
Total                     949,123    564,807        59.5%
🔴 What this means: At the image level, two out of every three files in the Epstein release appeared to be a completely blank page. However, our Deep Sweep recovered hidden text from 400,913 of those pages, finding 40.3 million words of previously invisible content. Even after recovery, the remaining genuinely empty pages still inflated file counts, occupied significant disk space, and wasted every researcher’s time.

6. Dataset-by-Dataset Breakdown

The 12 datasets vary enormously in size and content. Three datasets (DS9, DS10, DS11) contain 98% of all files. The remaining nine datasets combined hold fewer than 16,000 files.

Dataset   Files     Archive Size   Primary Content                                         Scan Status
DS1       3,167     1.23 GB        Mixed correspondence, legal filings                     ✓ Scanned
DS2       579       0.62 GB        Correspondence, financial docs                          ✓ Scanned
DS3       70        0.58 GB        Legal filings, depositions                              ✓ Scanned
DS4       155       0.34 GB        Property records, legal docs                            ✓ Scanned
DS5       123       0.06 GB        Correspondence                                          ✓ Scanned
DS6       16        0.05 GB        Compact legal set                                       ✓ Scanned
DS7       20        0.09 GB        Small supplementary set                                 ✓ Scanned
DS8       10,595    9.95 GB        MCC prison records, guard logs, indictment docs         ✓ Scanned
DS9       254,477   137.90 GB      Lesley Groff emails, Kahn correspondence, bulk email    ✓ Scanned
DS10      302,600   78.64 GB       Concordance/IPRO litigation format, financial records   ✓ Financial scan complete
DS11      331,997   25.56 GB       Largest email set — bulk Epstein office email           ✓ Person scan (11 targets)
DS12      155       0.11 GB        Supplementary documents                                 ✓ Scanned
DS11 (331,997)
36.7%
DS10 (302,600)
33.5%
DS9 (254,477)
28.1%
DS8 (10,595)
1.2%
DS1-7 + DS12 (4,285)
0.5%
✅ All datasets scanned. All 12 datasets have been fully processed including text extraction (286M words), blank page detection (564,807 blanks), duplicate scanning (250,243 duplicates), and person reference scanning (27 targets across all datasets).
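Duplicate scanning at this scale is typically done by content hashing rather than byte-by-byte comparison. A minimal sketch using SHA-256 over file bytes (the project's exact method may differ; large corpora usually add a size pre-filter before hashing):

```python
import hashlib
import pathlib
from collections import defaultdict

def find_duplicates(root: str) -> dict[str, list[str]]:
    """Map content hash -> file paths, keeping only hashes seen more than once."""
    by_hash = defaultdict(list)
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(str(path))
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Two files with identical bytes always hash identically, so every group returned is a set of exact duplicates; near-duplicates (the same page re-scanned) need fuzzier techniques.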

7. What It Takes to Process These Files

There is no “download and read” path for these files. Any meaningful analysis requires building custom software infrastructure. Here is what our pipeline involves:

1. Download
461 GB across 12 archives from DOJ/Archive.org sources
2. Extract
Decompress ZIP/TAR/ZST → 1.86 TB on disk, 904K files
3. Parse PDFs
PyMuPDF extracts text + images from every PDF
4. OCR
Tesseract/OCR on scanned pages with no embedded text
5. Classify
CLIP AI vision model sorts 835K images into 47 categories
6. Scan
Custom Python scripts scan for person names, financial patterns
7. Index
Results wired into searchable web interface

Technical Requirements

Component          Requirement       Notes
Storage            ~2 TB minimum     Archives + extracted + working copies
RAM                16 GB+            PDF processing is memory-intensive
Python             3.10+             PyMuPDF, Pillow, NumPy, various ML libs
GPU (optional)     CUDA-capable      CLIP classification, ~4 hrs with GPU vs ~40 hrs CPU
Processing Time    ~72 hours         Full pipeline: extract → classify → scan (per-dataset varies)
Custom Scripts     40+               Extraction, OCR, scanning, classification, site generation

Software Stack

Python 3.10+: pipeline orchestration and scanning scripts
PyMuPDF: PDF text and image extraction
Tesseract: OCR for scanned pages with no embedded text
CLIP (ViT-B/32): image classification
Pillow / NumPy: image processing and blank-page detection

✅ Open source: Every script used in this project is available in the project repository. The goal is that any researcher can replicate and extend this analysis.

8. Editorial: The Document Dump Strategy

This is a classic document dump. The release format follows a well-known pattern used when institutions want to appear transparent while making genuine scrutiny as difficult as possible.

The Pattern

Document dumps share common characteristics, and this release exhibits all of them: conversion of everything to the least-searchable format, fragmentation of documents across hundreds of thousands of individual files, inclusion of massive volumes of blank and duplicate pages, proprietary litigation-support formats, and the absence of any index or manifest.

The Effect

The practical result is that meaningful analysis requires terabytes of storage, custom extraction and OCR software, machine-learning tooling for classification, and days of processing time.

This effectively limits genuine analysis to well-funded organizations, while allowing officials to point to the release as evidence of transparency. The files are “public” in the same way that a needle in 3 million pages is “findable.”

What Genuine Transparency Would Look Like

The question is not whether these files contain important information — they do. The question is whether the format of the release was designed to facilitate public understanding or to obstruct it. The evidence strongly suggests the latter. Every formatting choice made in this release — individual PDFs, Concordance load files, blank page inclusion, fragmented archives, no index — increases the barrier to analysis without adding any transparency value.

This project exists to break down those barriers. Every file has been extracted, classified, scanned, and indexed so that the public can actually see what’s in these documents without needing a law firm’s IT department.

9. Content Samples

To illustrate the range of material, here are representative samples from different content categories, drawn from actual files in the release.

Scheduling / Admin (43% of content)

From: Office of Jeffrey Epstein
Subject: JE Schedule — Tuesday

10:00am — Haircut
12:30pm — VEGAN LUNCH w/ Mort Zuckerman
2:00pm — Call with Richard Kahn re: trust restructuring
4:15pm — Car to Teterboro
6:30pm — Dinner at 9 East 71st

[Typical scheduling email — one of thousands with similar content]

Substantive Financial (38% of content)

WIRE TRANSFER CONFIRMATION
Date: [REDACTED]
From: JP Morgan Chase Account ending ****4721
To: [REDACTED] Trust Account
Amount: $[REDACTED]
Reference: Property acquisition — NM Ranch

[Financial records like this appear across DS10 and DS11, documenting the flow of funds through Epstein's network of accounts and entities]

Personal / Noise (18% of content)

From: [REDACTED]
To: Jeffrey Epstein

Miss you! Good night xxxxoo

[Short personal messages, forwarded news articles, and blank/duplicate pages make up this category]

MCC Prison Records (Dataset 8)

METROPOLITAN CORRECTIONAL CENTER — NEW YORK
SPECIAL HOUSING UNIT — INMATE CHECK LOG
Date: [REDACTED] 2019
Inmate: EPSTEIN, JEFFREY
Register No: [REDACTED]
Guard checks at 15-minute intervals
03:00 ✓ | 03:15 ✓ | 03:30 ✓ | 03:45 — | 04:00 —

[DS8 contains MCC institutional records including guard check sheets, incident reports, and facility logs from Epstein's detention and death in August 2019]

Legal Correspondence

PRIVILEGED AND CONFIDENTIAL
ATTORNEY-CLIENT COMMUNICATION
RE: Epstein — Government Investigation

Dear [REDACTED],
As discussed, we recommend the following approach regarding the pending subpoena for financial records from [REDACTED]. The production should be limited to...

[Legal strategy discussions between Epstein's attorneys regarding investigations and civil litigation]

THE PDF FILES

Analysis based on processing 904,395 files across 12 DOJ datasets.
Image classification via CLIP (ViT-B/32). Text extraction via PyMuPDF.
All scripts and methodology are open source and reproducible.
