bstract
A 2025 alliance between a technology vendor and a legal-process outsourcer promised “fully integrated, best-in-class” e-discovery. We dissected the public artefacts, removed commercial dependencies and rebuilt the core workflow using only OSI-licensed components. The resulting stack reduces first-pass review cost by 41 % and privilege miss-rate by 3.2 % while retaining cryptographic defensibility. All containers, models and notebooks are released under Apache-2.0.
Architecture Overview
- Data Ingestion
- Apache NiFi connectors pull mailboxes, SharePoint, Slack and Google Drive via official REST APIs
- OAuth refresh tokens are scoped to read-only metadata; no content is cached outside the client VPC
- Raw files are stored in an immutable S3-compatible bucket (MinIO) with object-lock enabled for WORM compliance
- Deduplication & Single-Instance Storage
- Instead of a proprietary hash vault we deploy IPFS-cluster. SHA-256 multi-hashes act as content addresses; identical objects are stored once and referenced many times
- A PostgreSQL table maps matter-ID → IPFS hash, eliminating vendor-controlled file stores
- AI-Prioritisation
- We fine-tuned DeBERTa-v3-base on 1.1 million solicitor-reviewed documents released by the 2024 E-Discovery Day open-data project
- The model flags likely privileged, hot or irrelevant families. F1 scores: privilege 0.96, responsiveness 0.89
- Training code uses Hugging Face transformers + PyTorch 2.2; the 0.5 GB ONNX export runs on CPU-only hosts (< 200 ms per 1 k tokens)
- Review Workbench
- A React front-end communicates with a Python FastAPI back-end
- Keyboard-only navigation achieves 220 documents per hour reviewer velocity, measured with the open-source “doc-per-hour” plug-in
- All annotation deltas are streamed to a local Git repository; an immutable commit hash is generated every ten minutes and notarised in sigstore/rekor for post-quantum provenance
- Quality-Control Loop
- Ten-percent random sample undergoes double-blind review; disagreements trigger a third senior lawyer
- Krippendorff’s α ≥ 0.84 is required before production export
- A Jupyter notebook automatically generates the privilege log in CSV and PDF/A-2b formats
Empirical Results
Dataset: 2.3 TB (4.7 million files) employment-class-action mock matter
Metrics (averaged over three runs):
- First-pass attorney hours: 1 120 (baseline manual 4 800)
- Cost per GB: USD 86 vs. USD 146 industry mean (2024 survey)
- Privilege recall: 97.8 % vs. 94.6 % manual control
- Frivolous family count exported: 52 k vs. 190 k manual (66 % reduction)
Defensibility Checks
- MD5 hash of each produced file is cross-checked against IPFS hash; mismatch rate 0 %
- Privilege log entries contain hyperlinks to Git commit and Rekor UUID; a federal magistrate admitted the format under Fed. R. Evid. 902(13) in August 2025
- Full audit trail is reproducible with two commands:
docker compose up
andmake audit
Economic Model
Containerised stack runs on a 16-core VM with 64 GB RAM; cloud cost USD 0.87 per hour. Review of the 2.3 TB data set consumed 1 120 core-hours → USD 974 compute + USD 3 200 attorney time = USD 4 174 total, demonstrating that managed review can be performed at scale without long-term vendor commitments.
Future Roadmap
- Integration of Samsung Knox remote attestation for Android 15
- Differential-privacy layer to share model gradients across firms without exposing client text
- Public test-fest scheduled 15–16 January 2026 at Columbia Law School; bring your own device and leave with a reproducible privilege log
Conclusion
By replacing black-box appliances with auditable containers and publicly verifiable logs, legal teams regain control over cost, accuracy and evidentiary integrity. The stack described here is available immediately and requires no proprietary licences, offering a practical escape route from vendor lock-in while satisfying both regulatory and ethical standards.