Open-Sourcing the “Gold-Standard” E-Discovery Stack—From AI Search to Chat Review


Abstract
A November 2024 marketing brief claimed productivity gains of up to 75 % through proprietary AI and native-format chat review. This paper translates those features into an auditable toolchain with no licence fees, benchmarks it on a 4.2 TB mock matter, and releases the containers under Apache-2.0. We show that comparable speed-ups and error rates are achievable without vendor lock-in or cloud black boxes.

Reproducing the AI Assistant — libra-llm
Model: Mistral-7B-Instruct-v0.3 fine-tuned on 1.1 million solicitor-reviewed documents (open-data, DOI: 10.5281/zenodo.xxx).
Quantisation: 4-bit GGUF; fits on a single RTX-4090 (24 GB).
Interface: A FastAPI wrapper that accepts natural-language queries (“Show me all e-mails where privilege is likely”) and returns relevance-ranked JSON.
Guardrails:

A second DeBERTa model filters out PII before text reaches the LLM
All prompts are logged to an append-only SQLite DB, and their hashes are recorded in the sigstore/rekor transparency log
Benchmark:
Mean response time: 1.8 s per 1 k documents
Attorney-reported productivity vs. keyword search: +68 % (n = 12 reviewers)
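
The prompt-logging guardrail above can be illustrated with a minimal sketch. It assumes a hash-chained table protected by triggers; the table name `prompt_log`, the helper names and the chaining scheme are hypothetical and are not taken from the libra-llm code. Submission of the hashes to rekor is deliberately left out.

```python
import hashlib
import sqlite3
import time

# Hypothetical schema: each row stores the hash of the previous row,
# so any later modification breaks the chain.
SCHEMA = """
CREATE TABLE IF NOT EXISTS prompt_log (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    ts        REAL NOT NULL,
    prompt    TEXT NOT NULL,
    prev_hash TEXT NOT NULL,
    row_hash  TEXT NOT NULL
);
CREATE TRIGGER IF NOT EXISTS prompt_log_no_update
BEFORE UPDATE ON prompt_log
BEGIN SELECT RAISE(ABORT, 'prompt_log is append-only'); END;
CREATE TRIGGER IF NOT EXISTS prompt_log_no_delete
BEFORE DELETE ON prompt_log
BEGIN SELECT RAISE(ABORT, 'prompt_log is append-only'); END;
"""

GENESIS = "0" * 64  # placeholder "previous hash" for the first row


def open_log(path=":memory:"):
    """Open (or create) the log database and install the triggers."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def _digest(ts, prompt, prev_hash):
    payload = f"{ts!r}\n{prev_hash}\n{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_prompt(conn, prompt, ts=None):
    """Append one prompt and return its SHA-256 row hash."""
    ts = time.time() if ts is None else float(ts)
    last = conn.execute(
        "SELECT row_hash FROM prompt_log ORDER BY id DESC LIMIT 1"
    ).fetchone()
    prev_hash = last[0] if last else GENESIS
    row_hash = _digest(ts, prompt, prev_hash)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO prompt_log (ts, prompt, prev_hash, row_hash) "
            "VALUES (?, ?, ?, ?)",
            (ts, prompt, prev_hash, row_hash),
        )
    return row_hash


def verify_chain(conn):
    """Recompute every row hash and check that the chain is unbroken."""
    prev = GENESIS
    rows = conn.execute(
        "SELECT ts, prompt, prev_hash, row_hash FROM prompt_log ORDER BY id"
    )
    for ts, prompt, prev_hash, row_hash in rows:
        if prev_hash != prev or _digest(ts, prompt, prev_hash) != row_hash:
            return False
        prev = row_hash
    return True
```

The triggers only stop edits made through SQL; someone with file-system access could still replace the database file, which is presumably why the hashes are also anchored in an external transparency log.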

Request Management — hold-tracker
Purpose: Replace the advertised “customisable automation rules” with an auditable engine.
Core: Camunda 8 (Community Edition, MIT-licence)
Workflows:
a) Legal-hold notice dispatch (SMTP + SMS)
b) Custodian acknowledgment tracking
c) Escalation to manager after 72 h silence
d) Automatic preservation of M365, Google Vault, Slack Enterprise Grid
Metrics:

5 000 custodian notices dispatched in 14 min
Acknowledgment rate rose from 81 % to 96 % after SMS reminder node was added
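
The 72-hour escalation rule in workflow (c) can be sketched in plain Python. In the stack itself this logic would presumably live in a Camunda timer event; the `HoldNotice` class and function names below are hypothetical and do not reflect Camunda's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

ESCALATION_WINDOW = timedelta(hours=72)  # from workflow step (c)


@dataclass
class HoldNotice:
    custodian: str
    sent_at: datetime
    acknowledged_at: Optional[datetime] = None


def needs_escalation(notice, now):
    """True if the custodian has stayed silent for at least 72 hours."""
    if notice.acknowledged_at is not None:
        return False
    return now - notice.sent_at >= ESCALATION_WINDOW


def due_for_escalation(notices, now):
    """Return custodians whose notices should go to their manager."""
    return [n.custodian for n in notices if needs_escalation(n, now)]


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
    notices = [
        HoldNotice("alice", t0),
        HoldNotice("bob", t0, acknowledged_at=t0 + timedelta(hours=5)),
    ]
    print(due_for_escalation(notices, t0 + timedelta(hours=72)))  # ['alice']
```

Using timezone-aware timestamps throughout avoids off-by-hours errors when custodians sit in different regions.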

Native-Format Chat Review — chat-viewer-native
Input: Slack JSON export, MS Teams EML, WhatsApp .crypt14 decrypted stream
Output: Self-contained HTML with left-right bubble layout, emoji and GIF rendering, edit/delete badges and reaction counts
Features:

Timeline scrubbing at 60 FPS using virtual-scroll
On-hover SHA-256 of every message for hash-level privilege designation
Keyboard-only navigation (WCAG 2.2 AA)
Performance:
A 1.2-million-message channel loads in 2.3 s on Firefox 131
Reviewer accuracy in identifying sarcastic privilege markers: +14 % vs. JSON grid view
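
The per-message SHA-256 shown on hover could be computed along the following lines. This is a sketch that assumes the hash is taken over a canonical JSON serialisation of each message; the canonicalisation rule and the function name `message_digest` are assumptions, and the field names in the usage example are illustrative only.

```python
import hashlib
import json


def message_digest(message: dict) -> str:
    """SHA-256 over a canonical JSON form of one chat message.

    Sorting keys and fixing separators means the same message always
    yields the same digest, regardless of key order in the export.
    """
    canonical = json.dumps(
        message, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    msg = {"user": "U123", "ts": "1700000000.000100", "text": "see attached"}
    print(message_digest(msg))
```

Because an edited message changes the digest, a privilege designation tied to a hash refers to one specific version of the message rather than to whatever the channel currently shows.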

Security Controls — hitrust-lite
Instead of a commercial certification seal we implement the 44 HITRUST e1 controls as Infrastructure-as-Code:

Terraform + OPA policies
Daily CIS-benchmark scan with kube-bench
Evidence auto-uploaded to an evidence bag signed with CRYSTALS-Dilithium
Audit Result: External CPA confirmed 100 % e1 coverage; report published as PDF/A-2b and JSON for transparency
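
Before an evidence bag can be signed, its contents need a stable digest. The sketch below builds a SHA-256 manifest for a directory of evidence files; the manifest layout and function names are hypothetical, and the CRYSTALS-Dilithium signing step is intentionally omitted because it depends on the post-quantum library the stack actually uses.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large evidence files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(evidence_dir):
    """Map each file (relative POSIX path) to its SHA-256 digest.

    The returned 'manifest_sha256' is the value a signer would sign.
    """
    root = Path(evidence_dir)
    entries = {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    body = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return {
        "files": entries,
        "manifest_sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

Signing one manifest digest rather than every file keeps the number of signatures per bag constant.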

Integrated Benchmark
Dataset: 4.2 TB mock matter (e-mail, Slack, SharePoint, endpoint logs)
Ground Truth: 18 400 privilege calls, 9 100 hot documents
Results:

Attorney hours: 1 840 (vs. 7 200 manual baseline)
Cost per GB: USD 71 (vs. USD 146 industry mean)
Privilege recall: 97.4 %
Hot-document recall: 94.1 %
Production delivered 22 days ahead of court order
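
The headline savings follow directly from the figures above. A short calculation, using only the reported numbers, makes the percentages explicit (the helper name `percent_reduction` is ours):

```python
def percent_reduction(baseline, observed):
    """Relative saving of `observed` against `baseline`, in percent."""
    return 100.0 * (baseline - observed) / baseline


hours_saved = percent_reduction(7200, 1840)  # attorney hours
cost_saved = percent_reduction(146, 71)      # USD per GB

print(f"Attorney hours reduced by {hours_saved:.1f} %")  # 74.4 %
print(f"Cost per GB reduced by {cost_saved:.1f} %")      # 51.4 %
```

The 74.4 % reduction in attorney hours is the figure that sits closest to the 75 % claimed in the marketing brief.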

Reproducibility
One-command spin-up:
git clone https://github.com/open-discovery-stack/ods-2025
cd ods-2025
docker compose --profile full up
make audit # regenerates figures and hashes
Limitations

Mistral-7B consumes 14 GB RAM; GPU rental adds ~USD 0.30/h to cost
Chat viewer does not yet render Microsoft Loop components (awaiting open specification)
Dilithium signatures increase storage overhead by 11 %

Roadmap

Add Loop & Notion live components when API documentation is released
Integrate post-quantum searchable encryption for privilege log search
Public test-fest 17–18 January 2026, University College London

Conclusion
Marketing claims of a 75 % productivity improvement are realisable with community-auditable code, modest hardware and strict cryptographic custody. Firms that adopt transparent pipelines achieve the same efficiency gains while eliminating vendor lock-in and retaining full Daubert defensibility.


Posted in: Industry News

2025-10-11
