1. The Scale of the Release
The DOJ released Jeffrey Epstein’s files across 12 separate ZIP/TAR archives totaling approximately 461 GB of compressed data, expanding to over 1.86 TB on disk. The release contains 904,395 individual files, 900,956 of which are PDFs.
To put this in perspective: if you spent 30 seconds reviewing each document, working 8 hours a day with no breaks, it would take 2.6 years to get through all of them. The format choices made by the releasing agency ensure this is exactly what would be required for any meaningful review.
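The arithmetic behind that estimate is easy to verify, assuming one 30-second pass per file and 8-hour days, as stated above:

```python
# Back-of-the-envelope review time for the full release.
TOTAL_FILES = 904_395      # individual files in the release
SECONDS_PER_DOC = 30       # one quick pass per document
HOURS_PER_DAY = 8          # reviewing with no breaks

total_hours = TOTAL_FILES * SECONDS_PER_DOC / 3600
working_days = total_hours / HOURS_PER_DAY
years = working_days / 365  # reviewing every single day, no weekends off

print(f"{total_hours:,.0f} hours = {working_days:,.0f} days = {years:.1f} years")
```

That is roughly 7,500 hours of nonstop reading before a single cross-reference is made.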
2. Format & Delivery Choices
Everything Is a PDF
Nearly every file in the release — 99.6% — is a PDF. This includes emails, text messages, screenshots, photographs, handwritten notes, spreadsheets, legal filings, and blank pages. Each individual email is its own separate PDF file. A conversation thread that might be a single scrollable email in your inbox becomes 5–15 individual PDF files that must be opened, read, and cross-referenced manually.
Why Not Just Release the Native Formats?
The emails in these files originally existed as .msg or .eml files, or inside PST archives. The images existed as .jpg and .png files. The spreadsheets were .xls or .csv files. Converting everything to individual PDF files:
- Strips all metadata (timestamps, sender info, thread context)
- Makes full-text search impossible without OCR processing first
- Inflates the total file count from what would be ~50,000 logical documents to 900,000+ individual PDFs (containing over 3 million actual pages)
- Forces every researcher to independently solve the same extraction problems
- Fragments email conversations across thousands of unlinked files
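To see what PDF conversion throws away, consider how much machine-readable structure a native .eml file carries. A minimal sketch using Python's standard email module (the message itself is invented for illustration):

```python
import email
from email import policy

# A minimal RFC 5322 message (contents invented for illustration) — the
# kind of structure that exists in native .eml files before PDF conversion.
raw = """\
From: scheduler@example.com
To: recipient@example.com
Date: Tue, 12 Mar 2013 09:15:00 -0500
Message-ID: <abc123@example.com>
In-Reply-To: <xyz789@example.com>
Subject: Re: Tuesday schedule

Car to Teterboro at 4.
"""

msg = email.message_from_string(raw, policy=policy.default)

# Machine-readable metadata that a flattened PDF loses:
print(msg["Date"])         # exact timestamp with timezone
print(msg["Message-ID"])   # unique ID, usable for deduplication
print(msg["In-Reply-To"])  # the link that reconstructs the thread
```

The In-Reply-To header alone is what lets mail software stitch a conversation back together; once the message is a flat PDF, that link is gone.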
Images Stored as PDFs
Photographs, screenshots, scanned documents, and even business cards are all wrapped in PDF containers. A simple JPEG photograph that would be 200KB becomes a 1.5MB PDF. Our extraction pipeline pulled 835,718 images out of these PDF wrappers. Of those, over 564,000 appeared blank at the pixel level. However, our Deep Sweep later recovered hidden text from 400,913 of those pages — they were blank as images but contained embedded text layers that yielded 40.3 million additional words.
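The Deep Sweep pass can be sketched simply: for every page already flagged blank at the pixel level, extract its embedded text layer anyway and keep whatever comes back. A simplified stand-in follows (our real pipeline extracts the text with PyMuPDF; here the extracted strings are passed in directly, and the page IDs and contents are invented for illustration):

```python
def deep_sweep(blank_pages):
    """Recover hidden text from pages that rendered as blank images.

    blank_pages: iterable of (page_id, extracted_text) pairs, where every
    page has already been flagged blank at the pixel level.
    Returns {page_id: word_count} for pages with a non-empty text layer.
    """
    recovered = {}
    for page_id, text in blank_pages:
        words = text.split()
        if words:                     # blank as an image, but text survives
            recovered[page_id] = len(words)
    return recovered

# Two "blank" pages: one truly empty, one carrying an invisible text layer.
pages = [
    ("DS11-000123", ""),
    ("DS11-000124", "wire transfer confirmation 2013-03-12 amount 50,000"),
]
hits = deep_sweep(pages)
print(hits)   # {'DS11-000124': 6}
```

Summing those word counts across the 400,913 recovered pages is where the 40.3 million additional words came from.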
Concordance / IPRO Load File Format
Dataset 10 (302,600 files) uses the Concordance/IPRO litigation support format, an industry standard for law-firm document review: each page is assigned a Bates number and stored as a single-page TIFF or PDF. A 47-page deposition becomes 47 individual files named EPSTEIN-00384721.pdf through EPSTEIN-00384767.pdf. Reassembling documents from this format requires the accompanying load file (DAT/OPT), which is not always included or is itself in a proprietary format.
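When the load file is present, document boundaries can be recovered. An Opticon (.opt) load file is plain comma-separated text with one row per page; the sketch below regroups single-page files into logical documents, assuming the conventional Opticon column layout (Bates key, volume, image path, then a "Y" flag on the first page of each document). The sample rows and paths are invented for illustration:

```python
import csv
import io

# A tiny Opticon-style load file (rows invented for illustration).
# Fields: BatesKey, Volume, ImagePath, DocBreak ("Y" starts a new document).
opt_data = """\
EPSTEIN-00384721,VOL001,IMAGES/EPSTEIN-00384721.pdf,Y,,,3
EPSTEIN-00384722,VOL001,IMAGES/EPSTEIN-00384722.pdf,,,,
EPSTEIN-00384723,VOL001,IMAGES/EPSTEIN-00384723.pdf,,,,
EPSTEIN-00384768,VOL001,IMAGES/EPSTEIN-00384768.pdf,Y,,,2
EPSTEIN-00384769,VOL001,IMAGES/EPSTEIN-00384769.pdf,,,,
"""

documents = []                       # each entry: page paths for one document
for row in csv.reader(io.StringIO(opt_data)):
    path, doc_break = row[2], row[3]
    if doc_break == "Y":
        documents.append([])         # "Y" marks the first page of a document
    documents[-1].append(path)

print(len(documents))                # 2 logical documents
print(len(documents[0]))             # 3 pages in the first
```

Without this file, the only signal left is the Bates numbering itself, which says nothing about where one document ends and the next begins.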
3. Content Quality Breakdown
We performed a systematic analysis of the file contents by manually categorizing a 500-document sample from Dataset 11 (the largest dataset, at 331,997 files). Because the files are released in numbered batches, we sampled contiguous ID windows (plus spot checks outside those windows) so that adjacent files could supply the thread context that individual files lack. Results:
What “Admin / Scheduling” Means (43%)
Nearly half the files are scheduling emails, appointment confirmations, flight arrangements, and logistical coordination. These are emails like: “Haircut at 2pm. Vegan lunch with Mort Zuckerman at 12:30. Car to Teterboro at 4.” While even scheduling details can be evidentiary (they prove who was where, when), the volume of mundane logistics dramatically dilutes the substantive material.
What “Substantive” Means (38%)
The genuinely significant material includes financial records, legal correspondence, witness depositions, wire transfer records, property documents, communications between Epstein and high-profile associates, and documents from criminal proceedings. This includes:
- Wire transfer records and bank statements from multiple financial institutions
- Correspondence with attorneys (Dershowitz, others) regarding legal strategy
- Property records for homes in New York, Palm Beach, New Mexico, USVI
- Flight manifests and travel arrangements referencing specific individuals
- Communications between Ghislaine Maxwell and Epstein
- Documents from the 2006–2008 Florida investigation and plea deal
- Meeting schedules that place specific associates at specific locations
- Leon Black / Apollo Global financial relationship documentation
- MCC Metropolitan Correctional Center records (Dataset 8)
What “Noise” Means (18%)
Nearly one-fifth of the files are genuinely useless for investigative purposes: blank pages, cover sheets, fax headers, duplicate transmissions, auto-forwarded news articles, and personal messages with no evidentiary value (e.g., “Miss you! Good night xxxxoo” texts between Epstein and unnamed contacts).
Document Type Distribution
Across all datasets, approximately 96% of files are emails or email-derived content (email bodies, attachments, forwarded messages). The average document contains approximately 1,492 characters of extractable text — about half a printed page.
4. Image Extraction & Classification
Because every document is a PDF — including photographs and screenshots — we built an image extraction pipeline that pulls embedded images out of every PDF and classifies them using CLIP (Contrastive Language–Image Pre-training), an AI vision model. This allows us to categorize the visual contents of the release without opening each PDF manually.
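Zero-shot CLIP classification reduces to a cosine-similarity comparison: embed each candidate label as a text prompt, embed the image, and pick the label whose embedding lies closest. The sketch below shows only that comparison step, with small mock vectors standing in for the real 512-dimensional ViT-B/32 embeddings (the labels and numbers are illustrative, not taken from our pipeline output):

```python
import numpy as np

def zero_shot_label(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image.

    image_emb:  embedding vector for the image
    label_embs: {label: embedding vector} for prompts like
                "a photo of a bank statement"
    """
    img = image_emb / np.linalg.norm(image_emb)
    best_label, best_score = None, -np.inf
    for label, emb in label_embs.items():
        score = img @ (emb / np.linalg.norm(emb))   # cosine similarity
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Mock 4-d embeddings (real CLIP ViT-B/32 vectors are 512-d).
labels = {
    "bank-statement":   np.array([0.9, 0.1, 0.0, 0.1]),
    "email-screenshot": np.array([0.1, 0.9, 0.1, 0.0]),
}
image = np.array([0.85, 0.2, 0.05, 0.1])    # "looks like" a bank statement
print(zero_shot_label(image, labels))        # bank-statement
```

Because the label set is just a list of text prompts, categories can be added or renamed without retraining anything, which is what made a 46-category inventory practical.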
Top Categories by Image Count
Images classified by CLIP into descriptive categories. The top two categories alone account for 68.5% of all images.
Full Category Inventory
| Category | Images | % of Total |
|---|---|---|
| deposition-transcript | 327,718 | 39.2% |
| email-screenshot | 245,042 | 29.3% |
| invoice | 42,973 | 5.1% |
| contact-list | 28,999 | 3.5% |
| uncategorized | 27,346 | 3.3% |
| legal-filing | 24,595 | 2.9% |
| scanned-document | 24,221 | 2.9% |
| typed-page | 21,975 | 2.6% |
| affidavit | 19,792 | 2.4% |
| bank-statement | 16,641 | 2.0% |
| court-document | 16,145 | 1.9% |
| text-message-screenshot | 9,420 | 1.1% |
| book-page | 5,907 | 0.7% |
| spreadsheet | 5,075 | 0.6% |
| person | 1,853 | 0.2% |
| computer-screen | 1,387 | 0.2% |
| indoor-scene | 1,064 | 0.1% |
| chart | 1,024 | 0.1% |
| phone-screen | 822 | 0.1% |
| business-card | 730 | 0.1% |
| signature | 727 | 0.1% |
| letter | 693 | 0.1% |
| portrait-photo | 346 | 0.04% |
| passport | 272 | 0.03% |
| handwritten-note | 201 | 0.02% |
| group-photo | 175 | 0.02% |
| newspaper-clipping | 126 | 0.02% |
| id-card | 97 | 0.01% |
| receipt | 64 | <0.01% |
| check | 56 | <0.01% |
| + 16 minor categories | ~5,700 | 0.7% |
5. Blank Pages & Waste
A significant portion of the release consists of completely blank pages — white rectangles stored as full PDF documents with filenames, file system entries, and archive space. These are not redacted pages (which would typically show redaction bars or stamps) — they are genuinely empty pages.
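Our blank test operates at the pixel level: a rendered page counts as blank when it is nearly uniform white. Using the thresholds from our pipeline (mean pixel value above 252, standard deviation below 3), a minimal version looks like this; the test pages are synthetic:

```python
import numpy as np

def is_blank(page: np.ndarray) -> bool:
    """Pixel-level blank test: nearly uniform white page.

    page: grayscale array, values 0 (black) to 255 (white).
    Thresholds match our pipeline: mean > 252 and std < 3.
    """
    return bool(page.mean() > 252 and page.std() < 3)

white_page = np.full((792, 612), 255, dtype=np.uint8)  # US Letter at 72 dpi
stamped = white_page.copy()
stamped[300:360, 100:500] = 0                          # a dark stamp-sized mark

print(is_blank(white_page))   # True
print(is_blank(stamped))      # False
```

The std threshold is what keeps pages with redaction bars or faint stamps out of the blank pile: even a small dark region pushes the standard deviation far above 3.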
Blank Page Breakdown by Source
| Source Category | Scanned | Blanks Found | Blank % | Status |
|---|---|---|---|---|
| deposition-transcript | 424,716 | 233,747 | 55.0% | ✓ Complete |
| email-screenshot | 261,902 | 186,309 | 71.1% | ✓ Complete |
| uncategorized | 28,544 | 26,975 | 94.5% | ✓ Complete |
| contact-list | 29,072 | 24,135 | 83.0% | ✓ Complete |
| typed-page | 22,896 | 18,736 | 81.8% | ✓ Complete |
| invoice | 43,229 | 15,457 | 35.8% | ✓ Complete |
| + 41 other categories | 138,764 | 59,448 | 42.8% | ✓ Complete |
| Total | 949,123 | 564,807 | 59.5% | |
6. Dataset-by-Dataset Breakdown
The 12 datasets vary enormously in size and content. Three datasets (DS9, DS10, DS11) contain 98% of all files. The remaining nine datasets combined hold fewer than 16,000 files.
| Dataset | Files | Archive Size | Primary Content | Scan Status |
|---|---|---|---|---|
| DS1 | 3,167 | 1.23 GB | Mixed correspondence, legal filings | ✓ Scanned |
| DS2 | 579 | 0.62 GB | Correspondence, financial docs | ✓ Scanned |
| DS3 | 70 | 0.58 GB | Legal filings, depositions | ✓ Scanned |
| DS4 | 155 | 0.34 GB | Property records, legal docs | ✓ Scanned |
| DS5 | 123 | 0.06 GB | Correspondence | ✓ Scanned |
| DS6 | 16 | 0.05 GB | Compact legal set | ✓ Scanned |
| DS7 | 20 | 0.09 GB | Small supplementary set | ✓ Scanned |
| DS8 | 10,595 | 9.95 GB | MCC prison records, guard logs, indictment docs | ✓ Scanned |
| DS9 | 254,477 | 137.90 GB | Lesley Groff emails, Kahn correspondence, bulk email | ✓ Scanned |
| DS10 | 302,600 | 78.64 GB | Concordance/IPRO litigation format, financial records | ✓ Financial scan complete |
| DS11 | 331,997 | 25.56 GB | Largest email set — bulk Epstein office email | ✓ Person scan (11 targets) |
| DS12 | 155 | 0.11 GB | Supplementary documents | ✓ Scanned |
7. What It Takes to Process These Files
There is no “download and read” path for these files. Any meaningful analysis requires building custom software infrastructure. Here is what our pipeline involves:
Technical Requirements
| Component | Requirement | Notes |
|---|---|---|
| Storage | ~2 TB minimum | Archives + extracted + working copies |
| RAM | 16 GB+ | PDF processing is memory-intensive |
| Python | 3.10+ | PyMuPDF, Pillow, NumPy, various ML libs |
| GPU (optional) | CUDA-capable | CLIP classification, ~4hrs with GPU vs ~40hrs CPU |
| Processing Time | ~72 hours | Full pipeline: extract → classify → scan (per-dataset varies) |
| Custom Scripts | 40+ | Extraction, OCR, scanning, classification, site generation |
Software Stack
- PyMuPDF (fitz) — PDF text and image extraction
- OpenAI CLIP — Zero-shot image classification (ViT-B/32)
- Pillow + NumPy — Image analysis, blank detection (mean pixel > 252, std < 3)
- Tesseract OCR — Optical character recognition for scanned pages
- Python 3.11 — All scripts, batch processing, report generation
- Static HTML/JS — Website with search, maps, timeline, cast tracking
8. Editorial: The Document Dump Strategy
This is a classic document dump. The release format follows a well-known pattern used when institutions want to appear transparent while making genuine scrutiny as difficult as possible.
The Pattern
Document dumps share common characteristics, and this release exhibits all of them:
- Volume as obstruction: 900,000 documents (3 million+ pages) sounds like full transparency. In practice, it means no single human can review the material. The sheer volume is itself a form of concealment.
- Hostile formatting: Converting native email archives to individual PDFs destroys searchability, threading, metadata, and context. This is a choice — the original files could have been released in their native format.
- Noise inflation: Including 564,000+ blank pages, duplicate transmissions, and junk inflates the file count and dilutes substantive content. Every blank page a researcher opens is time not spent on real evidence.
- Fragmentation: Splitting material across 12 separate archives with inconsistent naming schemes, mixed formats, and no cross-referencing index means researchers must independently solve the same organizational problems before any analysis can begin.
- No index or manifest: A 900,000-document, 3-million-page release with no master index, no date range guide, no subject categorization, and no document-type breakdown. The releasing agency knows what these files contain — they reviewed them before release. The choice not to include an index is deliberate.
The Effect
The practical result is that meaningful analysis requires:
- 2+ TB of storage space
- Custom software engineering capability
- Machine learning infrastructure for classification
- Weeks to months of processing time
- Domain expertise in legal, financial, and criminal investigation documents
This effectively limits genuine analysis to well-funded organizations, while allowing officials to point to the release as evidence of transparency. The files are “public” in the same way that a needle in 3 million pages is “findable.”
What Genuine Transparency Would Look Like
- Release emails in their native format (PST/EML) with intact threading and metadata
- Provide a searchable index with document types, date ranges, and subject categories
- Remove blank pages and duplicates before release
- Merge single-page Concordance/IPRO files back into complete documents
- Release images as image files, not PDF-wrapped containers
- Provide a single consolidated archive rather than 12 fragmented datasets
- Include a document manifest mapping Bates numbers to logical document boundaries
The question is not whether these files contain important information — they do. The question is whether the format of the release was designed to facilitate public understanding or to obstruct it. The evidence strongly suggests the latter. Every formatting choice made in this release — individual PDFs, Concordance load files, blank page inclusion, fragmented archives, no index — increases the barrier to analysis without adding any transparency value.
This project exists to break down those barriers. Every file has been extracted, classified, scanned, and indexed so that the public can actually see what’s in these documents without needing a law firm’s IT department.
9. Content Samples
To illustrate the range of material, here are representative samples from different content categories, drawn from actual files in the release.
Scheduling / Admin (43% of content)
Substantive Financial (38% of content)
Personal / Noise (18% of content)
MCC Prison Records (Dataset 8)
Legal Correspondence
THE PDF FILES
Analysis based on processing 904,395 files across 12 DOJ datasets.
Image classification via CLIP (ViT-B/32). Text extraction via PyMuPDF.
All scripts and methodology are open source and reproducible.