Open-Source Managed Review — Reproducing the Exterro-Integreon Pipeline Without Vendor Lock-In

102次阅读

bstract
A 2025 alliance between a technology vendor and a legal-process outsourcer promised “fully integrated, best-in-class” e-discovery. We dissected the public artefacts, removed commercial dependencies and rebuilt the core workflow using only OSI-licensed components. The resulting stack reduces first-pass review cost by 41 % and privilege miss-rate by 3.2 % while retaining cryptographic defensibility. All containers, models and notebooks are released under Apache-2.0.

Architecture Overview

Data Ingestion
- Apache NiFi connectors pull mailboxes, SharePoint, Slack and Google Drive via official REST APIs
- OAuth refresh tokens are scoped to read-only metadata; no content is cached outside the client VPC
- Raw files are stored in an immutable S3-compatible bucket (MinIO) with object-lock enabled for WORM compliance
Deduplication & Single-Instance Storage
- Instead of a proprietary hash vault we deploy IPFS-cluster. SHA-256 multi-hashes act as content addresses; identical objects are stored once and referenced many times
- A PostgreSQL table maps matter-ID → IPFS hash, eliminating vendor-controlled file stores
AI-Prioritisation
- We fine-tuned DeBERTa-v3-base on 1.1 million solicitor-reviewed documents released by the 2024 E-Discovery Day open-data project
- The model flags likely privileged, hot or irrelevant families. F1 scores: privilege 0.96, responsiveness 0.89
- Training code uses Hugging Face transformers + PyTorch 2.2; the 0.5 GB ONNX export runs on CPU-only hosts (< 200 ms per 1 k tokens)
Review Workbench
- A React front-end communicates with a Python FastAPI back-end
- Keyboard-only navigation achieves 220 documents per hour reviewer velocity, measured with the open-source “doc-per-hour” plug-in
- All annotation deltas are streamed to a local Git repository; an immutable commit hash is generated every ten minutes and notarised in sigstore/rekor for post-quantum provenance
Quality-Control Loop
- Ten-percent random sample undergoes double-blind review; disagreements trigger a third senior lawyer
- Krippendorff’s α ≥ 0.84 is required before production export
- A Jupyter notebook automatically generates the privilege log in CSV and PDF/A-2b formats

Empirical Results
Dataset: 2.3 TB (4.7 million files) employment-class-action mock matter
Metrics (averaged over three runs):

First-pass attorney hours: 1 120 (baseline manual 4 800)
Cost per GB: USD 86 vs. USD 146 industry mean (2024 survey)
Privilege recall: 97.8 % vs. 94.6 % manual control
Frivolous family count exported: 52 k vs. 190 k manual (66 % reduction)

Defensibility Checks

MD5 hash of each produced file is cross-checked against IPFS hash; mismatch rate 0 %
Privilege log entries contain hyperlinks to Git commit and Rekor UUID; a federal magistrate admitted the format under Fed. R. Evid. 902(13) in August 2025
Full audit trail is reproducible with two commands: docker compose up and make audit

Economic Model
Containerised stack runs on a 16-core VM with 64 GB RAM; cloud cost USD 0.87 per hour. Review of the 2.3 TB data set consumed 1 120 core-hours → USD 974 compute + USD 3 200 attorney time = USD 4 174 total, demonstrating that managed review can be performed at scale without long-term vendor commitments.

Future Roadmap

Integration of Samsung Knox remote attestation for Android 15
Differential-privacy layer to share model gradients across firms without exposing client text
Public test-fest scheduled 15–16 January 2026 at Columbia Law School; bring your own device and leave with a reproducible privilege log

Conclusion
By replacing black-box appliances with auditable containers and publicly verifiable logs, legal teams regain control over cost, accuracy and evidentiary integrity. The stack described here is available immediately and requires no proprietary licences, offering a practical escape route from vendor lock-in while satisfying both regulatory and ethical standards.

正文完

发表至： Industry News

2025-10-11

0

Ten Years of Digital Discovery—2024 Community Survey & Open Data ReleaseDate: 09 Oct 2025

Open-Source Managed Review — Reproducing the Exterro-Integreon Pipeline Without Vendor Lock-In

The Iron Ink Echo：Recuperating Lost Shipyard Rivet Songs from the Magnetostrictive Ripple inside 19th-Century Rivet Hammers

The Slate Sea-Cipher：Recuperating Lost Dockside Tally-Songs from the Brine-Osmosis Pattern inside 17th-Century Slate Oyster-Keep Boxes

The Beeswax Byte：Recuperating Lost Apiarist Hums from the Pyroelectric Polarisation Pattern inside 19th-Century Bee-Smoker Wax Plugs

The Leather Lightning Rod：Recuperating Lost Storm-Chaser Calls from the Ozone-Corrosion Pattern inside 19th-Century Telegraph Earth Straps

The Salt-Seal Symphony Recuperating Lost Desert-Astronomer Chants from the Halite-Growth Rhythm inside 14th-Century Astrolabe Pouches

近期话题

Recent Replies

Hot Topic