E-Discovery/Forensic Investigations
Machine intelligence in a legal context is most advanced in electronic discovery (or e-discovery). It can be used in a variety of ways, including streamlining aspects of document review, analysing a document production received from an opposing party, and preparing for depositions and expert discovery.
E-discovery was the first (big) data application for law, emerging in the early 2000s. Nowadays, it is a multi-billion-dollar industry in the U.S. alone, with many companies and notable exits. The e-discovery industry also continues to attract new funding rounds. E-discovery has already significantly changed and disrupted the document review process, which used to consume much of associates' time. Law firms set up e-discovery units or outsource this work to third-party providers.
E-discovery is, simply put, the application of general methods of machine search to the review of legal documents. In its simplest form, it uses keyword search terms combined with Boolean logic to find relevant documents. The search terms are applied electronically to a set of documents. Lawyers then manually review the preselected documents to determine whether they are relevant.
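The keyword approach can be illustrated with a minimal sketch. It is not taken from any particular e-discovery product: the query format (nested tuples), the function names `tokenize`, `matches` and `preselect`, and the three sample documents are all assumptions made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase a document and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, tokens):
    """Evaluate a nested Boolean query against a set of tokens.

    A query is either a plain string (one search term) or a tuple:
    ("AND", q1, q2, ...), ("OR", q1, q2, ...) or ("NOT", q).
    """
    if isinstance(query, str):
        return query.lower() in tokens
    op, *args = query
    if op == "AND":
        return all(matches(arg, tokens) for arg in args)
    if op == "OR":
        return any(matches(arg, tokens) for arg in args)
    if op == "NOT":
        return not matches(args[0], tokens)
    raise ValueError(f"unknown operator: {op}")


def preselect(documents, query):
    """Return the ids of documents that satisfy the query, for manual review."""
    return [doc_id for doc_id, text in documents.items()
            if matches(query, tokenize(text))]


# Hypothetical document set and search terms.
documents = {
    "doc1": "The merger agreement was signed by the board in March.",
    "doc2": "Lunch menu for the board meeting.",
    "doc3": "Draft merger terms, privileged and confidential.",
}
query = ("AND", "merger", ("OR", "agreement", "terms"), ("NOT", "privileged"))
hits = preselect(documents, query)
print(hits)
```

Only the documents returned by `preselect` would go to a lawyer for manual review. The sketch also shows the weakness of the approach: a document is found only if it happens to contain the exact terms chosen.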
Predictive coding (or Technology-Assisted Review, TAR) has fundamentally transformed the prospects for e-discovery. Predictive coding has proven to be faster, better, cheaper and more consistent than Human Powered Review (HPR) by applying methodologies including contextual searching, concept searching and metadata searching to document review. Predictive coding matches the text of entire documents (rather than individual words) to other documents using statistical sampling and modelling. It relies on machine learning and various algorithmic tools. In general, experienced lawyers “train” the software by identifying relevant documents (the seed set) from a broader set of potentially relevant documents.
Drawing on the seed set, the algorithm learns to identify relevant documents. Through an iterative process of several learning cycles, the software is “trained” with additional documents.
Predictive coding has its downsides and limitations. Currently, it cannot effectively evaluate spreadsheets, documents without searchable text, or other file types such as video, graphics and audio files. Predictive coding also shares some well-known problems with machine learning in general. It will only be useful if the class of future cases has pertinent features in common with the previously analysed cases in the training set. The relationship between future and past cases within a data set is therefore an important factor in the success of predictive coding. In addition, predictive coding generally requires a relatively large sample of past examples before robust generalisations can be inferred. Another common problem is overfitting, which can occur when a model is excessively complex, for example when it has too many parameters relative to the number of observations. An overfitted model “memorises” the training data rather than “learning” to generalise from the trends in it. Similarly, problems occur when the machine learning algorithm is trained on a biased data set and is therefore unable to infer useful rules for predictive purposes.
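The seed-set training described above can be sketched in a few lines. This is a toy illustration only: it uses a simple naive Bayes text classifier as a stand-in, which is an assumption and not a description of the algorithms used in commercial predictive coding tools. The class `NaiveBayesReviewer`, its methods, and all sample documents and lawyer codings are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a document into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesReviewer:
    """Toy naive Bayes model over whole-document word counts (illustrative only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, labelled_docs):
        """Add lawyer-coded (text, is_relevant) pairs to the model."""
        for text, is_relevant in labelled_docs:
            self.word_counts[is_relevant].update(tokenize(text))
            self.doc_counts[is_relevant] += 1

    def relevance_probability(self, text):
        """Return the estimated probability that a document is relevant."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            # Add-one smoothing; one extra vocabulary slot is reserved
            # for words never seen in training.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into a probability for the "relevant" class.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights[True] / (weights[True] + weights[False])


# Hypothetical seed set coded by a lawyer, plus a second learning cycle.
seed_set = [
    ("pricing agreement with competitor discussed by email", True),
    ("agenda for the office holiday party", False),
]
second_round = [
    ("call to align pricing with competitor before the bid", True),
    ("catering order for the holiday party", False),
]

model = NaiveBayesReviewer()
for batch in (seed_set, second_round):
    model.train(batch)  # each learning cycle adds more coded documents

# Score the documents nobody has reviewed yet and rank them for review.
unreviewed = {
    "x": "notes from the competitor pricing call",
    "y": "holiday party seating plan for the office",
}
scores = {doc_id: model.relevance_probability(text)
          for doc_id, text in unreviewed.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

With only four training documents, the model can do little more than echo the vocabulary of its training set. That is the sample-size and overfitting concern in miniature: an unreviewed document that describes the same conduct in different words would score poorly.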
The same techniques used in e-discovery can also be applied to forensic investigations, which typically arise in the context of internal investigations and compliance programs. Compared with e-discovery, forensic investigations tend to be more specific and targeted, less voluminous and much more technical. They can require a deeper inspection of lower-level data in operating systems, applications and server activity. The tools are not limited to reactive scenarios in which noncompliant or fraudulent activity has already happened. Legal analytics tools can also be used proactively to detect and head off potentially improper transactions before employees, third parties or even criminals carry them out.
4.2