“Data Cleansing to the Rescue: Wrangling with the Wild West of Unstructured Data”
by David Beauchemin, InOutsource
Decades-old paper files stored in hundreds or even thousands of boxes that have not been properly indexed. Oddly named files in shared drives. Information stored in Access databases alongside Excel spreadsheets that haven’t been opened in years.
Even with a physical records management system and a document management system in place, the reality is that most law firms still have a Wild West of unstructured repositories — both physical and electronic — where data resides. Applying and enforcing a records retention policy is haphazard at best if you are never sure where all the relevant client data is stored and exactly what those unstructured repositories contain.
Typically, the Office of the General Counsel, Information Governance and IT teams are keenly aware of the risks, particularly as a growing number of clients are specifying records retention requirements in RFPs and outside counsel guidelines, and conducting security audits to verify compliance.
At the same time, firms are understandably reluctant to dedicate staff time to manually reviewing these records — an extremely tedious, time-consuming and expensive undertaking. In the meantime, firms must continue to pay for the ongoing storage of files, many of which may be completely unnecessary (and even inadvisable) to retain.
Technology for data profiling and data cleansing
At InOutsource, we’ve worked with many firms that find themselves in this situation and turn to us to find out how they can apply technology to help with cleaning up their data to contain costs, improve security and reduce the risk of non-compliance.
The good news is that there are innovative data quality management tools available that can efficiently profile data to determine what exactly is in a large volume of “unstructured” files — an essential first step in the data cleansing process. Even better, these tools are now available with “low code” or “no code” graphical user interfaces which make them accessible to less technical users for a broader range of use cases than ever before, at a much lower cost of ownership to the firm over time.
Using regular expressions, we can quickly scan strings of data (file names on a shared drive, for example) to pick out client/matter numbers far faster than a human reviewing and categorizing electronic files one by one.
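As an illustration, here is a minimal Python sketch of that kind of scan, assuming a hypothetical numbering scheme (a five-digit client number, a separator, then a three-to-five-digit matter number); real firms' schemes vary, so the pattern would be tailored for each engagement.

```python
import re

# Hypothetical scheme: five-digit client number, a separator character,
# then a three-to-five-digit matter number (e.g. "10234-00057").
# Real numbering schemes vary; this pattern is for illustration only.
CLIENT_MATTER = re.compile(r"(?<!\d)(\d{5})[-._](\d{3,5})(?!\d)")

def extract_client_matter(filename: str):
    """Return all (client, matter) pairs found in a file name."""
    return CLIENT_MATTER.findall(filename)

samples = [
    "10234-00057 Smith v. Jones - Closing Binder.pdf",
    "Meeting notes 2014.docx",
    "engagement_letter_10234.00057_final_v2.doc",
]
for name in samples:
    print(name, "->", extract_client_matter(name))
```

The lookarounds keep the pattern from firing inside longer digit runs, so a stray year or version number is not mistaken for a client/matter pair.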
Fuzzy matching tools identify relevant files even when identifiers do not match exactly; it is not uncommon, for example, for inconsistencies to creep into client/matter numbers or names. Using fuzzy matching, we can reconcile these discrepancies and weed out the “wrong” client/matter numbers or names where they occur. These tools take advantage of a mathematical concept called “edit distance” — part of the underlying technology behind search engines such as Google.
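To make the concept concrete, here is a small Python sketch of edit distance (the Levenshtein variant) used to match a misspelled name against a list of invented canonical client names; dedicated fuzzy matching tools add many refinements beyond this.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def best_match(name, candidates, max_edits=3):
    """Closest candidate by edit distance, or None if nothing is close."""
    dist, match = min((edit_distance(name.lower(), c.lower()), c)
                      for c in candidates)
    return match if dist <= max_edits else None

# Hypothetical canonical names from a firm's records system.
clients = ["Acme Manufacturing Inc", "Beta Holdings LLC"]
print(best_match("Acme Manufactring Inc", clients))   # one dropped letter
print(edit_distance("10234-00057", "10234-00075"))    # transposed digits
```

A small edit-distance threshold catches typos and transpositions while still rejecting genuinely different names.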
Data analytics workflows can also chip away at the data, visualize the state of your data health, and help identify the most efficient path forward. If we know that one-third of the data has “matched” and the other two-thirds has not, we can begin a manual mapping process to review and cull the list of remaining files collaboratively with the internal information governance or records management team.
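A minimal sketch of how such a workflow might split scan results into a quick health summary and a manual-review worklist (the file paths and IDs below are invented for illustration):

```python
# Hypothetical scan results after automated matching: file path mapped to a
# resolved client/matter ID, or None where nothing matched.
scan_results = {
    "G:/shared/10234-00057 Closing Binder.pdf": "10234-00057",
    "G:/shared/Old memo.doc": None,
    "G:/shared/Notes 2009.txt": None,
}

matched = {path: cm for path, cm in scan_results.items() if cm is not None}
unmatched = sorted(path for path, cm in scan_results.items() if cm is None)

# A quick data-health summary, plus the worklist for collaborative review.
print(f"{len(matched)} of {len(scan_results)} files matched automatically")
for path in unmatched:
    print("needs review:", path)
```

The unmatched remainder is exactly the list the records or information governance team would work through by hand.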
In most of these projects, we find that the knowledge workers (Records, Information Governance and IT staff) are able to apply institutional knowledge about the data, pointing out patterns that make items easy to identify. By eliminating duplicate and non-essential files and sorting through unstructured and poorly structured data up front, we reduce the number of consultant hours required to complete a data cleansing project. And by working collaboratively to combine institutional knowledge with technology, we also cut down on the number of hours firm staff spend cleaning up unstructured data that would otherwise remain unidentified.
Assessing and managing multiple repositories
In a typical scenario, a firm will have several different repositories of physical files — including files that may have been only partially indexed and migrated to the firm’s main records management system. There might be SQL databases, Access files, Excel spreadsheets and cloud sources that need to be accessed via APIs. Data analytics tools allow us to compare repositories and quickly identify what has already been indexed in the records management system versus what is missing.
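Conceptually, that comparison boils down to set differences between what the records system has indexed and what an inventory actually found; a toy sketch with invented box IDs:

```python
# Hypothetical box IDs: what the records management system has indexed
# versus what a physical inventory (or vendor manifest) actually found.
indexed = {"BX-0001", "BX-0002", "BX-0003"}
inventory = {"BX-0002", "BX-0003", "BX-0004", "BX-0005"}

never_indexed = sorted(inventory - indexed)   # on the shelf, missing from the system
ghost_entries = sorted(indexed - inventory)   # in the system, nowhere to be found

print("never indexed:", never_indexed)
print("ghost entries:", ghost_entries)
```

Each difference drives a different follow-up: un-indexed boxes get profiled and indexed, while ghost entries trigger a search or a correction to the records system.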
Very large law firms tend to rely on different offsite records storage facilities over time, with each vendor adding a new bar code to the files with every move. Again, with the help of data analytics tools, it becomes straightforward to compare and marry a firm’s offsite physical data with its onsite data, and strip away the layers of unnecessary codes. This gives the firm better visibility into the state of its files, which not only has compliance and risk benefits but will also lead to cost savings in the long run by identifying material that can easily be disposed of.
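One way to picture the reconciliation: pull the firm's original box number out of the vendor's free-text description and join it against the onsite index. Everything below (barcode formats, field names) is an assumption for illustration only.

```python
import re

# Hypothetical offsite vendor manifest: each box carries the vendor's own
# barcode, with the firm's original box number buried in free text.
offsite_manifest = [
    {"vendor_barcode": "IRM-99812", "description": "FIRM-0042 closed files 1998"},
    {"vendor_barcode": "IRM-99813", "description": "unlabeled misc records"},
]

# Onsite index keyed by the firm's own box numbers (format is an assumption).
onsite_index = {"FIRM-0042": {"client": "10234", "retention_class": "10yr"}}

FIRM_BOX = re.compile(r"FIRM-\d{4}")

reconciled, unreconciled = {}, []
for box in offsite_manifest:
    m = FIRM_BOX.search(box["description"])
    if m and m.group(0) in onsite_index:
        reconciled[box["vendor_barcode"]] = m.group(0)
    else:
        unreconciled.append(box["vendor_barcode"])

print("reconciled:", reconciled)
print("flag for review:", unreconciled)
```

Reconciled boxes inherit the retention information already held onsite, while the remainder becomes a short, targeted review list instead of a full re-inventory.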
The best data analytics tools offer visualizations that give an intuitive understanding of the state of your data, surface key events, and invite exploration to find out more. Much of the underlying technology behind these data management tools has existed for some time, but it has only recently become useful to non-technical end users as well as IT.
If you are interested in learning more about how InOutsource can help you apply data analytics to your data cleansing projects to streamline information governance, reduce storage costs and reduce risks — or would like advice on how to define and enforce your data retention policies more broadly — please get in touch.