In "Understanding data ownership in the data lake," Elizabeth Koumpan, Executive Architect at IBM, writes:
Depending on the organizational point of view, different ownership rules may apply in different situations to data. It is a tricky part when we deal with Data ownership while using external sources, especially if we use social data which is an essential element, as we build our cases for front office digitization, customer sensitive analysis and so on. While we deal with tremendous amount of social data which describes the interactions of people, moving this data around, changing it for our analysis, at the end have a difficult question – who owns this social data? Is this the real authentic data that was truly originated from a person and has some valid purpose? Or it is modified, changed and became fake or misleading, which if used and analyzed can lead us to unreliable and wrong decisions.
While the topic of the above paragraph is "social" data obtained from external sources, the issue of "ownership" is an important one that attaches to all types of data and metadata that are created, managed, or used within an organization. This includes the structured data as well, regardless of whether it's tied to an external customer or to an internal process or system.
Authority and responsibility
The term "ownership" implies authority and responsibility. We would like to think that the "owner" of the data (a) will be responsible for the data's timeliness and accuracy and (b) can be contacted when questions arise about the data or its meaning. Both are relevant for internal and external data as well as for structured and unstructured data.
As IBM's Koumpan suggests, things become complex in the real world when various types of external (and internal) data are gathered, transformed, analyzed, interpreted, and moved around. Even when a formal determination is made for responsibility for a "master" source for key data or metadata (i.e., data about the data), the practical value of the data in question may only be realized when the data are used to support a defined or evolving business process or system.
A team effort
Because a particular piece of data may be only one of many inputs to the value-generating process, system, or decision where it is used, assigning credit (or blame) to a data "owner" can be difficult. It's like any team effort. Parsing the contribution made by an individual fact or data value may be a fool's errand.
As Koumpan states, "... different ownership rules may apply in different situations." Defining these situations, so that ownership and responsibility can be assigned, is important in many contexts, including the handling of personal or medical information, proprietary or top secret data, employee performance records, basic financial records, or data concerning any regulated industry or industrial process.
Sources and processes
In such real-world situations, data governance must address both the manner in which the data are sourced or generated and the subsequent processes where the data are manipulated or transformed. This means that anyone tasked with managing a data warehouse, data lake, or other resource where data are gathered, processed, and analyzed may need to address ownership-related questions that arise anywhere along the "data value chain" leading up to the addition of new or updated data to the collection.
From abstract to real
Here are examples where an understanding of "data origins" will be needed by the data warehouse manager:
- Two operating departments have two different approaches to defining customer addresses. Management needs data predicting customer churn by geographical area. Which geographic data will be used in the analysis? Who decides?
- The data warehouse department is building a database of financial data from different operating departments to compare product pricing with revenue and cost data. One high-turnover department's data has been recorded inconsistently over the years. How will this quality issue be addressed should management require this analysis to be updated on a regular basis?
- Collection of regular temperature and humidity data from Earth-based remote sensing stations was periodically interrupted by solar storms. Do public satellite data exist that can be used to model and estimate the missing ground-station data?
- Field trials of an experimental drug involve gathering data electronically from data collection forms completed by health care professionals as well as a content analysis of call center messages gathered via phone, text messaging, and email from consumers. The objective is to analyze data over time for a tracking study. The problem is that the language used by patients is less controlled and standardized than the language used by health care professionals. How can a mapping of one to the other be performed, and by whom?
While it's possible to talk about "data ownership" in the abstract, addressing situations such as the hypothetical examples above requires someone to take responsibility for understanding both the processes by which data are generated and the technical and business owners of those processes.
The data inventory
Ideally, an inventory of the data managed in a data warehouse or data lake will include information about both the technical aspects of the systems that supply the data and the identity of the business owners of the processes associated with the data. As the data collection expands, this inventory will grow as the number and types of data sources grow and as internally generated data, external data, and unstructured data are added.
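To make the idea of an inventory more concrete, here is a minimal sketch in Python of what one entry might record. It is an illustration only, not a description of any particular product or of how a given organization does this. Every class, field, dataset, and team name below is a hypothetical chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryEntry:
    """One data source in the inventory. All field names are illustrative."""
    dataset: str
    source_system: str        # technical aspect: the system that supplies the data
    technical_owner: str      # who to contact about the supplying system
    business_owner: str       # who owns the business process behind the data
    origin: str = "internal"  # "internal" or "external"
    structured: bool = True


@dataclass
class DataInventory:
    """A simple in-memory inventory keyed by dataset name."""
    entries: dict = field(default_factory=dict)

    def register(self, entry):
        self.entries[entry.dataset] = entry

    def owners_of(self, dataset):
        """Return (technical owner, business owner) for a dataset."""
        entry = self.entries[dataset]
        return (entry.technical_owner, entry.business_owner)

    def missing_owner(self):
        """List datasets where either owner has not been identified."""
        return sorted(
            name
            for name, entry in self.entries.items()
            if not entry.technical_owner or not entry.business_owner
        )


# Hypothetical entries, loosely modeled on the examples in this article.
inventory = DataInventory()
inventory.register(
    InventoryEntry("customer_addresses", "CRM", "crm-admins", "sales-operations")
)
inventory.register(
    InventoryEntry(
        "call_center_messages",
        "call center platform",
        "telephony-team",
        "",  # business owner not yet identified
        origin="external",
        structured=False,
    )
)

print(inventory.owners_of("customer_addresses"))
print(inventory.missing_owner())
```

Even a simple structure like this makes the gaps visible: a dataset with no identified business owner is one where the question "who decides?" has no answer yet.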
Copyright (c) 2017 by Dennis D. McDonald