Companies are drowning in data today. Data lakes pull in all kinds of data to preserve it for later analysis, including unstructured data, like comments in user forums, which would have previously been discarded. In many cases, big data projects start as small experiments that collect this data without much oversight and grow to pull in new data sources as the team gains experience.
As a result, there is potential to discover valuable insights, but the volume and format of the data make it hard for companies to know exactly what they have and to meet their compliance mandates. Other risks come from the tools used, not just from the volume of data. Big data is often stored and processed using Hadoop rather than in traditional relational databases, which have well-defined metadata and controls.
Establishing data governance procedures that increase awareness of the data lake content is essential to ensure companies can satisfy regulatory requests and protect customers’ information. With data governance, a business can know where the data in a data lake originated, what it means, who accessed it, and how it’s used. That knowledge lets the business implement appropriate policies on whether the data needs to be archived, and for how long, in order to comply with regulations.
Establishing Data Governance Controls Over Data Lakes
As with any data, the most important step in establishing data governance over a data lake is to understand what the data contains. This requires an inventory that identifies every feed flowing into the lake and the owner of each one. Appropriate access and retention controls can then be put in place.
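To make the inventory step concrete, here is a minimal Python sketch of a feed inventory that records an owner, a retention period, and an access list for each feed, then flags feeds that are missing any of them. The names used (FeedRecord, ungoverned_feeds, is_past_retention, the feed names and roles) are hypothetical and chosen only for illustration; they do not reflect how any particular governance product stores this information.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class FeedRecord:
    """One entry in a hypothetical data lake inventory."""
    name: str
    source: str
    owner: Optional[str] = None           # data owner; None if not yet identified
    retention_days: Optional[int] = None  # None if no retention rule is defined
    allowed_roles: set = field(default_factory=set)


def ungoverned_feeds(inventory):
    """Return names of feeds missing an owner, a retention rule, or an access list."""
    return [
        feed.name
        for feed in inventory
        if feed.owner is None or feed.retention_days is None or not feed.allowed_roles
    ]


def is_past_retention(feed, ingested_on, today):
    """True if data ingested on `ingested_on` has exceeded the feed's retention period."""
    if feed.retention_days is None:
        return False  # no rule defined, so nothing can be expired automatically
    return today > ingested_on + timedelta(days=feed.retention_days)


# Illustrative inventory: one governed feed and one that still needs attention.
inventory = [
    FeedRecord("orders", "erp_export", owner="finance",
               retention_days=365, allowed_roles={"analyst"}),
    FeedRecord("forum_comments", "community_site"),
]

print(ungoverned_feeds(inventory))
print(is_past_retention(inventory[0], date(2023, 1, 1), date(2024, 6, 1)))
```

Even a simple list like this makes the gaps visible: any feed the check returns has no one accountable for it or no rule for who may read it and how long it is kept.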
The next step in governing big data is to evaluate its quality and define the metadata, so that the data is used appropriately and you get the most value from it. Because data lakes often include unstructured data, identifying metadata and data relationships can be challenging. It’s best to identify critical data first and get that metadata established before putting effort into assessing less important feeds.
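The "critical data first" ordering can be sketched as a simple backlog: list the feeds that have no documented metadata yet and sort them by business criticality. This is only an illustrative Python sketch; the FeedMetadata class, the 1-to-3 criticality scale, and the feed names are assumptions made for the example, not part of any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class FeedMetadata:
    """Hypothetical metadata status for one feed in the data lake."""
    name: str
    criticality: int  # assumed scale: 1 = most critical, 3 = least critical
    structured: bool
    documented_fields: dict = field(default_factory=dict)  # field name -> meaning


def metadata_backlog(feeds):
    """Return names of feeds lacking documented metadata, most critical first."""
    pending = [feed for feed in feeds if not feed.documented_fields]
    # Sort by criticality, then by name so the order is stable and predictable.
    return [feed.name for feed in sorted(pending, key=lambda f: (f.criticality, f.name))]


feeds = [
    FeedMetadata("clickstream", criticality=3, structured=False),
    FeedMetadata("customer_profiles", criticality=1, structured=True),
    FeedMetadata("orders", criticality=1, structured=True,
                 documented_fields={"order_id": "unique order identifier"}),
    FeedMetadata("forum_comments", criticality=2, structured=False),
]

print(metadata_backlog(feeds))
```

Feeds that already have documented metadata drop out of the backlog, so the list shrinks as the governance work progresses and new feeds appear in it as they are added to the lake.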
This is a time-consuming process, but tools like Veritas Data Insight and Veritas Information Map make it easier by giving visibility into both structured and unstructured data, as well as how it’s being used. Tools and automation are critical to data governance success because the process is ongoing: new analytics methods will constantly pull new data feeds into the data lake, and it’s impossible to keep up without technological support.
Data Governance Is A Team Effort
Effective data governance requires a team with the authority to impose controls, knowledge of the business to understand the data, and the technical ability to bring in the best technology to support the governance process. It requires executive support to promote the importance of the data governance project, plus hands-on data stewards with the authority to implement and enforce controls.
dcVAST can be your partner in implementing data governance around your big data and data lakes. We’re experienced in building and supporting the physical infrastructure required by big data as well as the Veritas technology to help you gain insight into that data. Contact us to learn what we can bring to your data lake data governance projects.