1. The Scale of the Release
The DOJ released Jeffrey Epstein’s files across 12 separate ZIP/TAR archives totaling approximately 461 GB of compressed data, expanding to over 1.86 TB on disk. The release contains 904,395 individual files, 900,956 of which are PDFs.
To put this in perspective: if you spent 30 seconds reviewing each document, working 8 hours a day with no breaks, it would take 2.6 years to get through all of them. The format choices made by the releasing agency ensure this is exactly what would be required for any meaningful review.
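The arithmetic behind that estimate is easy to verify, assuming one 30-second pass per file and 8-hour days, as stated above:

```python
# Back-of-the-envelope review time for the full release.
TOTAL_FILES = 904_395      # individual files in the release
SECONDS_PER_DOC = 30       # one quick pass per document
HOURS_PER_DAY = 8          # reviewing with no breaks

total_hours = TOTAL_FILES * SECONDS_PER_DOC / 3600
working_days = total_hours / HOURS_PER_DAY
years = working_days / 365  # reviewing every single day, no weekends off

print(f"{total_hours:,.0f} hours = {working_days:,.0f} days = {years:.1f} years")
```

That is roughly 7,500 hours of nonstop reading before a single cross-reference is made.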
2. Format & Delivery Choices
Everything Is a PDF
Nearly every file in the release — 99.6% — is a PDF. This includes emails, text messages, screenshots, photographs, handwritten notes, spreadsheets, legal filings, and blank pages. Each individual email is its own separate PDF file. A conversation thread that might be a single scrollable email in your inbox becomes 5–15 individual PDF files that must be opened, read, and cross-referenced manually.
Why Not Just Release the Native Formats?
The emails in these files originally existed as .msg or .eml files, or inside PST archives. The images existed as .jpg and .png files. The spreadsheets were .xls or .csv files. Converting everything to individual PDF files:
- Strips all metadata (timestamps, sender info, thread context)
- Makes full-text search impossible without OCR processing first
- Inflates the total file count from what would be ~50,000 logical documents to 900,000+ individual PDFs (containing over 3 million actual pages)
- Forces every researcher to independently solve the same extraction problems
- Fragments email conversations across thousands of unlinked files
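To see what PDF conversion throws away, consider how much machine-readable structure a native .eml file carries. A minimal sketch using Python's standard email module (the message itself is invented for illustration):

```python
import email
from email import policy

# A minimal RFC 5322 message (contents invented for illustration) — the
# kind of structure that exists in native .eml files before PDF conversion.
raw = """\
From: scheduler@example.com
To: recipient@example.com
Date: Tue, 12 Mar 2013 09:15:00 -0500
Message-ID: <abc123@example.com>
In-Reply-To: <xyz789@example.com>
Subject: Re: Tuesday schedule

Car to Teterboro at 4.
"""

msg = email.message_from_string(raw, policy=policy.default)

# Machine-readable metadata that a flattened PDF loses:
print(msg["Date"])         # exact timestamp with timezone
print(msg["Message-ID"])   # unique ID, usable for deduplication
print(msg["In-Reply-To"])  # the link that reconstructs the thread
```

The In-Reply-To header alone is what lets mail software stitch a conversation back together; once the message is a flat PDF, that link is gone.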
Images Stored as PDFs
Photographs, screenshots, scanned documents, and even business cards are all wrapped in PDF containers. A simple JPEG photograph that would be 200KB becomes a 1.5MB PDF. Our extraction pipeline pulled 835,718 images out of these PDF wrappers. Of those, over 564,000 appeared blank at the pixel level. However, our Deep Sweep later recovered hidden text from 400,913 of those pages — they were blank as images but contained embedded text layers that yielded 40.3 million additional words.
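The Deep Sweep pass can be sketched simply: for every page already flagged blank at the pixel level, extract its embedded text layer anyway and keep whatever comes back. A simplified stand-in follows (our real pipeline extracts the text with PyMuPDF; here the extracted strings are passed in directly, and the page IDs and contents are invented for illustration):

```python
def deep_sweep(blank_pages):
    """Recover hidden text from pages that rendered as blank images.

    blank_pages: iterable of (page_id, extracted_text) pairs, where every
    page has already been flagged blank at the pixel level.
    Returns {page_id: word_count} for pages with a non-empty text layer.
    """
    recovered = {}
    for page_id, text in blank_pages:
        words = text.split()
        if words:                     # blank as an image, but text survives
            recovered[page_id] = len(words)
    return recovered

# Two "blank" pages: one truly empty, one carrying an invisible text layer.
pages = [
    ("DS11-000123", ""),
    ("DS11-000124", "wire transfer confirmation 2013-03-12 amount 50,000"),
]
hits = deep_sweep(pages)
print(hits)   # {'DS11-000124': 6}
```

Summing those word counts across the 400,913 recovered pages is where the 40.3 million additional words came from.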
Concordance / IPRO Load File Format
Dataset 10 (302,600 files) uses the Concordance/IPRO litigation support format, an industry standard for law-firm document review: each page is assigned a Bates number and stored as a single-page TIFF or PDF. A 47-page deposition becomes 47 individual files named EPSTEIN-00384721.pdf through EPSTEIN-00384767.pdf. Reassembling documents from this format requires the accompanying load file (DAT/OPT), which is not always included or is itself in a proprietary format.
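When the load file is present, document boundaries can be recovered. An Opticon (.opt) load file is plain comma-separated text with one row per page; the sketch below regroups single-page files into logical documents, assuming the conventional Opticon column layout (Bates key, volume, image path, then a "Y" flag on the first page of each document). The sample rows and paths are invented for illustration:

```python
import csv
import io

# A tiny Opticon-style load file (rows invented for illustration).
# Fields: BatesKey, Volume, ImagePath, DocBreak ("Y" starts a new document).
opt_data = """\
EPSTEIN-00384721,VOL001,IMAGES/EPSTEIN-00384721.pdf,Y,,,3
EPSTEIN-00384722,VOL001,IMAGES/EPSTEIN-00384722.pdf,,,,
EPSTEIN-00384723,VOL001,IMAGES/EPSTEIN-00384723.pdf,,,,
EPSTEIN-00384768,VOL001,IMAGES/EPSTEIN-00384768.pdf,Y,,,2
EPSTEIN-00384769,VOL001,IMAGES/EPSTEIN-00384769.pdf,,,,
"""

documents = []                       # each entry: page paths for one document
for row in csv.reader(io.StringIO(opt_data)):
    path, doc_break = row[2], row[3]
    if doc_break == "Y":
        documents.append([])         # "Y" marks the first page of a document
    documents[-1].append(path)

print(len(documents))                # 2 logical documents
print(len(documents[0]))             # 3 pages in the first
```

Without this file, the only signal left is the Bates numbering itself, which says nothing about where one document ends and the next begins.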
3. Content Quality Breakdown
We performed a systematic analysis of the file contents by manually categorizing a 500-document sample from Dataset 11 (the largest dataset, at 331,997 files). Because the files are released in numbered batches, we sampled contiguous ID windows (plus spot checks outside those windows) so that adjacent files could supply the thread context that individual files lack. Results:
What “Admin / Scheduling” Means (43%)
Nearly half the files are scheduling emails, appointment confirmations, flight arrangements, and logistical coordination. These are emails like: “Haircut at 2pm. Vegan lunch with Mort Zuckerman at 12:30. Car to Teterboro at 4.” While even scheduling details can be evidentiary (they prove who was where, when), the volume of mundane logistics dramatically dilutes the substantive material.
What “Substantive” Means (38%)
The genuinely significant material includes financial records, legal correspondence, witness depositions, wire transfer records, property documents, communications between Epstein and high-profile associates, and documents from criminal proceedings. This includes:
- Wire transfer records and bank statements from multiple financial institutions
- Correspondence with attorneys (Dershowitz, others) regarding legal strategy
- Property records for homes in New York, Palm Beach, New Mexico, USVI
- Flight manifests and travel arrangements referencing specific individuals
- Communications between Ghislaine Maxwell and Epstein
- Documents from the 2006–2008 Florida investigation and plea deal
- Meeting schedules that place specific associates at specific locations
- Leon Black / Apollo Global financial relationship documentation
- MCC Metropolitan Correctional Center records (Dataset 8)
What “Noise” Means (18%)
Nearly one-fifth of the files are genuinely useless for investigative purposes: blank pages, cover sheets, fax headers, duplicate transmissions, auto-forwarded news articles, and personal messages with no evidentiary value (e.g., “Miss you! Good night xxxxoo” texts between Epstein and unnamed contacts).
Document Type Distribution
Across all datasets, approximately 96% of files are emails or email-derived content (email bodies, attachments, forwarded messages). The average document contains approximately 1,492 characters of extractable text — about half a printed page.
4. Image Extraction & Classification
Because every document is a PDF — including photographs and screenshots — we built an image extraction pipeline that pulls embedded images out of every PDF and classifies them using CLIP (Contrastive Language–Image Pre-training), an AI vision model. This allows us to categorize the visual contents of the release without opening each PDF manually.
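Zero-shot CLIP classification reduces to a cosine-similarity comparison: embed each candidate label as a text prompt, embed the image, and pick the label whose embedding lies closest. The sketch below shows only that comparison step, with small mock vectors standing in for the real 512-dimensional ViT-B/32 embeddings (the labels and numbers are illustrative, not taken from our pipeline output):

```python
import numpy as np

def zero_shot_label(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image.

    image_emb:  embedding vector for the image
    label_embs: {label: embedding vector} for prompts like
                "a photo of a bank statement"
    """
    img = image_emb / np.linalg.norm(image_emb)
    best_label, best_score = None, -np.inf
    for label, emb in label_embs.items():
        score = img @ (emb / np.linalg.norm(emb))   # cosine similarity
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Mock 4-d embeddings (real CLIP ViT-B/32 vectors are 512-d).
labels = {
    "bank-statement":   np.array([0.9, 0.1, 0.0, 0.1]),
    "email-screenshot": np.array([0.1, 0.9, 0.1, 0.0]),
}
image = np.array([0.85, 0.2, 0.05, 0.1])    # "looks like" a bank statement
print(zero_shot_label(image, labels))        # bank-statement
```

Because the label set is just a list of text prompts, categories can be added or renamed without retraining anything, which is what made a 46-category inventory practical.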
Top Categories by Image Count
Images classified by CLIP into descriptive categories. The top two categories alone account for 68.5% of all images.
Full Category Inventory
| Category | Images | % of Total |
|---|---|---|
| deposition-transcript | 327,718 | 39.2% |
| email-screenshot | 245,042 | 29.3% |
| invoice | 42,973 | 5.1% |
| contact-list | 28,999 | 3.5% |
| uncategorized | 27,346 | 3.3% |
| legal-filing | 24,595 | 2.9% |
| scanned-document | 24,221 | 2.9% |
| typed-page | 21,975 | 2.6% |
| affidavit | 19,792 | 2.4% |
| bank-statement | 16,641 | 2.0% |
| court-document | 16,145 | 1.9% |
| text-message-screenshot | 9,420 | 1.1% |
| book-page | 5,907 | 0.7% |
| spreadsheet | 5,075 | 0.6% |
| person | 1,853 | 0.2% |
| computer-screen | 1,387 | 0.2% |
| indoor-scene | 1,064 | 0.1% |
| chart | 1,024 | 0.1% |
| phone-screen | 822 | 0.1% |
| business-card | 730 | 0.1% |
| signature | 727 | 0.1% |
| letter | 693 | 0.1% |
| portrait-photo | 346 | 0.04% |
| passport | 272 | 0.03% |
| handwritten-note | 201 | 0.02% |
| group-photo | 175 | 0.02% |
| newspaper-clipping | 126 | 0.02% |
| id-card | 97 | 0.01% |
| receipt | 64 | <0.01% |
| check | 56 | <0.01% |
| + 16 minor categories | ~5,700 | 0.7% |
5. Blank Pages & Waste
A significant portion of the release consists of completely blank pages — white rectangles stored as full PDF documents with filenames, file system entries, and archive space. These are not redacted pages (which would typically show redaction bars or stamps) — they are genuinely empty pages.
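Our blank test operates at the pixel level: a rendered page counts as blank when it is nearly uniform white. Using the thresholds from our pipeline (mean pixel value above 252, standard deviation below 3), a minimal version looks like this; the test pages are synthetic:

```python
import numpy as np

def is_blank(page: np.ndarray) -> bool:
    """Pixel-level blank test: nearly uniform white page.

    page: grayscale array, values 0 (black) to 255 (white).
    Thresholds match our pipeline: mean > 252 and std < 3.
    """
    return bool(page.mean() > 252 and page.std() < 3)

white_page = np.full((792, 612), 255, dtype=np.uint8)  # US Letter at 72 dpi
stamped = white_page.copy()
stamped[300:360, 100:500] = 0                          # a dark stamp-sized mark

print(is_blank(white_page))   # True
print(is_blank(stamped))      # False
```

The std threshold is what keeps pages with redaction bars or faint stamps out of the blank pile: even a small dark region pushes the standard deviation far above 3.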
Blank Page Breakdown by Source
| Source Category | Scanned | Blanks Found | Blank % | Status |
|---|---|---|---|---|
| deposition-transcript | 424,716 | 233,747 | 55.0% | ✓ Complete |
| email-screenshot | 261,902 | 186,309 | 71.1% | ✓ Complete |
| uncategorized | 28,544 | 26,975 | 94.5% | ✓ Complete |
| contact-list | 29,072 | 24,135 | 83.0% | ✓ Complete |
| typed-page | 22,896 | 18,736 | 81.8% | ✓ Complete |
| invoice | 43,229 | 15,457 | 35.8% | ✓ Complete |
| + 41 other categories | 138,764 | 59,448 | 42.8% | ✓ Complete |
| Total | 949,123 | 564,807 | 59.5% | |
6. Dataset-by-Dataset Breakdown
The 12 datasets vary enormously in size and content. Three datasets (DS9, DS10, DS11) contain 98% of all files. The remaining nine datasets combined hold fewer than 16,000 files.
| Dataset | Files | Archive Size | Primary Content | Scan Status |
|---|---|---|---|---|
| DS1 | 3,167 | 1.23 GB | Mixed correspondence, legal filings | ✓ Scanned |
| DS2 | 579 | 0.62 GB | Correspondence, financial docs | ✓ Scanned |
| DS3 | 70 | 0.58 GB | Legal filings, depositions | ✓ Scanned |
| DS4 | 155 | 0.34 GB | Property records, legal docs | ✓ Scanned |
| DS5 | 123 | 0.06 GB | Correspondence | ✓ Scanned |
| DS6 | 16 | 0.05 GB | Compact legal set | ✓ Scanned |
| DS7 | 20 | 0.09 GB | Small supplementary set | ✓ Scanned |
| DS8 | 10,595 | 9.95 GB | MCC prison records, guard logs, indictment docs | ✓ Scanned |
| DS9 | 254,477 | 137.90 GB | Lesley Groff emails, Kahn correspondence, bulk email | ✓ Scanned |
| DS10 | 302,600 | 78.64 GB | Concordance/IPRO litigation format, financial records | ✓ Financial scan complete |
| DS11 | 331,997 | 25.56 GB | Largest email set — bulk Epstein office email | ✓ Person scan (11 targets) |
| DS12 | 155 | 0.11 GB | Supplementary documents | ✓ Scanned |
7. What It Takes to Process These Files
There is no “download and read” path for these files. Any meaningful analysis requires building custom software infrastructure. Here is what our pipeline involves:
Technical Requirements
| Component | Requirement | Notes |
|---|---|---|
| Storage | ~2 TB minimum | Archives + extracted + working copies |
| RAM | 16 GB+ | PDF processing is memory-intensive |
| Python | 3.10+ | PyMuPDF, Pillow, NumPy, various ML libs |
| GPU (optional) | CUDA-capable | CLIP classification, ~4hrs with GPU vs ~40hrs CPU |
| Processing Time | ~72 hours | Full pipeline: extract → classify → scan (per-dataset varies) |
| Custom Scripts | 40+ | Extraction, OCR, scanning, classification, site generation |
Software Stack
- PyMuPDF (fitz) — PDF text and image extraction
- OpenAI CLIP — Zero-shot image classification (ViT-B/32)
- Pillow + NumPy — Image analysis, blank detection (mean pixel > 252, std < 3)
- Tesseract OCR — Optical character recognition for scanned pages
- Python 3.11 — All scripts, batch processing, report generation
- Static HTML/JS — Website with search, maps, timeline, cast tracking
8. Editorial: The Document Dump Strategy
This is a classic document dump. The release format follows a well-known pattern used when institutions want to appear transparent while making genuine scrutiny as difficult as possible.
The Pattern
Document dumps share common characteristics, and this release exhibits all of them:
- Volume as obstruction: 900,000 documents (3 million+ pages) sounds like full transparency. In practice, it means no single human can review the material. The sheer volume is itself a form of concealment.
- Hostile formatting: Converting native email archives to individual PDFs destroys searchability, threading, metadata, and context. This is a choice — the original files could have been released in their native format.
- Noise inflation: Including 564,000+ blank pages, duplicate transmissions, and junk inflates the file count and dilutes substantive content. Every blank page a researcher opens is time not spent on real evidence.
- Fragmentation: Splitting material across 12 separate archives with inconsistent naming schemes, mixed formats, and no cross-referencing index means researchers must independently solve the same organizational problems before any analysis can begin.
- No index or manifest: A 900,000-document, 3-million-page release with no master index, no date range guide, no subject categorization, and no document-type breakdown. The releasing agency knows what these files contain — they reviewed them before release. The choice not to include an index is deliberate.
The Effect
The practical result is that meaningful analysis requires:
- 2+ TB of storage space
- Custom software engineering capability
- Machine learning infrastructure for classification
- Weeks to months of processing time
- Domain expertise in legal, financial, and criminal investigation documents
This effectively limits genuine analysis to well-funded organizations, while allowing officials to point to the release as evidence of transparency. The files are “public” in the same way that a needle in 3 million pages is “findable.”
What Genuine Transparency Would Look Like
- Release emails in their native format (PST/EML) with intact threading and metadata
- Provide a searchable index with document types, date ranges, and subject categories
- Remove blank pages and duplicates before release
- Merge single-page Concordance/IPRO files back into complete documents
- Release images as image files, not PDF-wrapped containers
- Provide a single consolidated archive rather than 12 fragmented datasets
- Include a document manifest mapping Bates numbers to logical document boundaries
The question is not whether these files contain important information — they do. The question is whether the format of the release was designed to facilitate public understanding or to obstruct it. The evidence strongly suggests the latter. Every formatting choice made in this release — individual PDFs, Concordance load files, blank page inclusion, fragmented archives, no index — increases the barrier to analysis without adding any transparency value.
This project exists to break down those barriers. Every file has been extracted, classified, scanned, and indexed so that the public can actually see what’s in these documents without needing a law firm’s IT department.
9. Content Samples
To illustrate the range of material, here are representative samples from different content categories, drawn from actual files in the release.
Scheduling / Admin (43% of content)
Substantive Financial (38% of content)
Personal / Noise (18% of content)
MCC Prison Records (Dataset 8)
Legal Correspondence
THE PDF FILES
Analysis based on processing 904,395 files across 12 DOJ datasets.
Image classification via CLIP (ViT-B/32). Text extraction via PyMuPDF.
All scripts and methodology are open source and reproducible.