


Data Architecture isn't agile...enough.
By Soyini Taylor and Michael Conlin

Data Science is by its nature Agile, but Data Architecture hasn't become Agile to match. By Agile we mean responding to ever-shifting priorities by delivering value in small increments through continuous, rapid cycles of time-boxed work. Sometimes the cycle speed may be driven by developers creating new data products in daily or fortnightly sprints. Or the cycle speed may be driven by Data Scientists training and releasing ML/AI algorithms over a 30-60 day period.

Sidebar – Hallmarks of Good Data Quality:
Visible (V) – Users can find needed data.
Accessible (A) – Users are able to access the data.
Understandable (U) – Users can find descriptions of the data to understand context and applicability.
Linked (L) – Users can leverage complementary data elements through innate data relationships.
Trusted (T) – Users can be confident in all aspects of the data for decision-making.
Interoperable (I) – Users can expect to successfully exchange data, information, and services between sender and recipient.
Secure (S) – Users know that data is protected from unauthorized access and manipulation.

But here's the catch: most of Data Science is applied research and development. The dirty secret of Data Science is that we're all winging it, even the experts. Especially the experts; we're dealing with cutting-edge problems and there's no cookbook. We're figuring it out as we go along.

By contrast, the practice of Data Architecture is anything but Agile, with its adherence to high ceremony, exhaustive steps, and endless frameworks. Data Architects have been trained to operate on long, predictable cycles and deliver comprehensive, nay exhaustive, products. The exercise often emphasizes documenting the Current Mode of Operations rather than the Digital Transformation the organization wants to create. What's worse, the outputs tend to be arcane diagrams the average person can't understand, let alone get value out of. It's no wonder Data Architects have developed a reputation as the Department of Slow. Few Data Scientists have an appetite for this approach.

Ironically, you can't conduct Data Science – Artificial Intelligence and Machine Learning – without good quality data (see sidebar). Most Data Science exercises spend 80% of their time and resources 'wrangling data' before they can analyze it. If you haven't heard the term, 'wrangling data' means the series of steps taken to prepare data for use. Typical steps include:
1. Looking for data.
2. Finding data and working with the owners to gather it.
3. Standardizing and cleaning the data.
4. Joining and integrating data from disparate sources.
5. Analyzing data features in a desperate search for useful patterns and relationships.
6. Wailing in frustration and despair.
7. Starting over.
Data wrangling effort is the single biggest source of delays (Figure 1) in delivering value through insights.
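To make those wrangling steps concrete, here is a minimal, hedged Python sketch using pandas. The file names, column names and cleaning rules are illustrative assumptions, not part of any real project; the point is simply how much routine standardizing, cleaning and joining happens before any analysis starts.

    import pandas as pd

    # Hypothetical extracts from two source systems (names are illustrative only).
    customers = pd.read_csv("crm_customers.csv")      # e.g. id, cust_name, region
    invoices = pd.read_csv("billing_invoices.csv")    # e.g. customer_id, amount, invoice_date

    # Standardize and clean: trim text, normalize case, drop obvious duplicates.
    customers["cust_name"] = customers["cust_name"].str.strip().str.title()
    customers = customers.drop_duplicates(subset="id")
    invoices["invoice_date"] = pd.to_datetime(invoices["invoice_date"], errors="coerce")
    invoices = invoices.dropna(subset=["invoice_date", "amount"])

    # Join and integrate data from disparate sources.
    combined = invoices.merge(customers, left_on="customer_id", right_on="id", how="left")

    # A first, crude look for useful patterns: spend by region over time.
    summary = (combined
               .groupby([combined["invoice_date"].dt.to_period("M"), "region"])["amount"]
               .sum())
    print(summary.head())

Every one of these steps depends on knowing where the data lives, what the fields mean and which copy is authoritative, which is exactly the knowledge a data architect curates.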
In other words, when it comes to good quality data (see sidebar – Hallmarks of Good Data Quality), the choice is 'pay me now, or pay me later'. So why aren't data architects being invited to the party?

Figure 1 – Source modelling is itself an iterative process, so it fits quite closely within an agile methodology.

There are lots of hurdles that can trip up data architects. They might lack the information to build an accurate data model. They might not be involved in the project from the start. They might lack the domain expertise to assimilate necessary information fast enough to produce the data models needed to get the program or project on its feet quickly. Or they might be slotting into a project or program already a few program iterations (PIs) in and need to get up to speed quickly.

So what are some possible ways to fix this? Let's examine them using the framework of People, Process and (technical) Products.

People

All change starts with people. Talent, culture and attitudes all have to be considered and addressed in the context of your organization and supply chain. What does data architecture need from a people perspective to become more agile?

- Crowdsourcing data quality tasks, including making sure everyone understands the data they're interacting with and why it matters, is one of the most powerful ways to make everyone a data steward. Everyone, from a developer to the customer who gets an inaccurate or confusing invoice, is affected by data quality issues.
- Mature data governance practices that institutionalize roles and responsibilities for data dictionaries, data quality issues, and so on. This might involve a cross-functional team structure (Table 1) where everyone (Developers, Data Modelers, Solution and Data Architects, Data Scientists, Data Analysts, Business Analysts, and Security Analysts) is responsible for gathering requirements, creating user stories that are detailed enough and valid, and creating and maintaining specific elements of the data dictionary, models, taxonomy and security classification.
- Organizational change knowledge and help from the business change and organizational change staff, teams or departments, because data changes affect many business processes and may cause huge changes to everyday ways of working.
- Better access to communal information and learning.
- Familiarity with concepts and techniques like Agile software development and DevSecOps.
- A common – or at least common-enough – data language and taxonomy. Many sectors have industry-standard taxonomies and ontologies that can help here, especially if there's a need to create one from scratch or to align current instances.

The biggest wins that data architecture provides are interoperability, linkage and understanding of the data we have. This leads to the data being trusted and understood.
Interoperability and linkage are provided through consistent naming conventions, definitions (through a data dictionary), and a data taxonomy.

Role | Gather requirements, user stories (U, T, S) | Data dictionary (U, T) | Data models (V, A, L, I) | Data taxonomy and metadata (U, L, T, I) | Data security classification (V, A, U, T, I, S)
Business Analyst* | Responsible and Accountable | Responsible | Consulted and Responsible | Responsible | Responsible
Data Architect | Responsible | Responsible | Accountable and Responsible | Accountable | Accountable
Solution Architect* | Responsible | Responsible | Responsible | Accountable | Accountable
Developer* | Responsible | Responsible | Responsible | Responsible | Responsible
Data Scientist | Responsible | Responsible | Consulted and Informed | Consulted and/or Responsible | Consulted and/or Responsible
Data Analyst* | Responsible | Responsible | Consulted and Informed | Consulted and/or Responsible | Consulted and/or Responsible
Security Analyst | Consulted | Consulted and/or Informed | Consulted and Informed | Consulted and/or Responsible | Accountable

Table 1 – The main artifacts data architecture relies on and needs to have updated quickly as changes arise, with the hallmarks of good data quality each role and artifact helps to amplify.
*At many organizations, the roles marked with an asterisk are also responsible for data modelling, in addition to the data architect.

Process

Data architecture's goal is to bridge the knowledge gaps between subject matter experts (SMEs) in order to map out an organization's critical data and make sure that data fits into the organization's overall data needs and strategies. This process can take a long time and can be very labor intensive because of the number of handoffs needed and the amount of time it takes to gather the level of detail needed to understand the data needs and their impact on the organization and its critical data. Also, the art of data architecture, whether drawn by hand on a flipchart, in Visio or within a data modelling tool, is still very siloed and produces visual art: separate diagrams that never fully join together digitally for testing, or that are too hard to update as change occurs on a regular basis. So the real question is how to change this. How do we make data architecture fit into the agile puzzle or lifestyle? New and improved practices include:

- MVP (Minimum Viable Product) approaches to Master Data Management, starting with utterly cynical and stingy definitions of what data is critical to the exercise at hand.
- Constrained diagrams: model only what is critical or necessary instead of trying to diagram the world, the universe and the stars. High-level (conceptual/logical) data models showing the critical data can be done within a sprint (see the sketch after this list). As conceptual-level changes happen, make sure dev teams clearly model their logical interpretations. This gives dev teams leeway in how to implement, means fewer changes to conceptual models, and allows data architecture to be iterative and play to its strength, which is mapping the big picture. It also stops data architects from being 'in the weeds', so far down in the detail that no one is looking at the big picture and how everything fits together.
- Start with the end in mind: a main criticism of data architecture is that it doesn't cope well with change or is unable to embrace it. This is very true when there is no end goal or vision for data architecture to align with and aspire to. Data architecture excels when we start with an end goal which we expect to evolve continuously.
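As a hedged illustration of what a constrained, sprint-sized conceptual model can look like when kept as plain text rather than as a diagram, here is a minimal Python sketch. The entities, attributes and relationships are invented examples, not a prescription for any particular domain.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Entity:
        name: str
        definition: str                        # plain-language definition from the data dictionary
        key_attributes: List[str] = field(default_factory=list)

    @dataclass
    class Relationship:
        from_entity: str
        to_entity: str
        cardinality: str                       # e.g. "1..*" or "0..1"
        description: str

    # A deliberately small conceptual model: only the entities critical to the exercise at hand.
    conceptual_model = {
        "entities": [
            Entity("Customer", "A party we sell to.", ["customer_id", "name", "region"]),
            Entity("Product", "Something we sell, tangible or intangible.", ["product_id", "name"]),
            Entity("Order", "A customer's purchase of one or more products.", ["order_id", "order_date"]),
        ],
        "relationships": [
            Relationship("Customer", "Order", "1..*", "A customer places orders."),
            Relationship("Order", "Product", "1..*", "An order contains products."),
        ],
    }

    for entity in conceptual_model["entities"]:
        print(f"{entity.name}: {entity.definition}")

Kept in version control next to the code, a model like this can evolve through the same review flow the dev teams already use, so the conceptual picture stays current without a separate diagramming exercise.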
The agile mindset allows dev teams to respond quickly to changes, but the big picture is usually thought about last, and this is where data architecture and agile can complement one another. Data architecture keeps tabs on the compass heading and the final destination. The best way to bridge the gap is an agile roadmap that exists as a living product, not as a stagnant snapshot that never evolves. Teams are still able to respond to change as needed, and data architecture can align with the overall goal and the program/project roadmap, with dedicated milestones and checkpoints that also adapt to the big picture as needed.

A lot of data modelling is done by hand or very manually, whether in Visio or in several separate models that never connect into a full picture or get updated as the data evolves. Updating an entire estate of diagrams can be painful and so time consuming that by the time the update is done it's already outdated. Diagramming changes at a conceptual and logical level is the quickest way to keep up with constant change. Modern digital collaboration tools (next section) can help, especially when it comes to tracking changes within the physical data model.

Work with application owners to address data quality issues at the root cause, within source applications, and not just in the working data set for the project.

Transparency and clarity surrounding definitions of data are vital! If a pen is a pen, then that understanding needs to transcend the whole organization, or at the very least there should be a mapping tool or method that outlines what a pen is and its various aliases. Without consistent definitions, linking data is almost impossible and it becomes hard to trust the data's validity. Having the same name doesn't mean two data fields hold the same meaning, or that two people in different departments have the same understanding. At smaller companies this is easier, as there are fewer people to pass the message along to, garner buy-in from, and check back in with. The harder question is how to achieve this at large companies. While technology can help with part of it, the main thing that needs to change is each employee's relationship with data. Let's look at an illustration.

Consider Kelani's induction at the fictitious Data Utopia, a data-driven company where data informs every business decision. Within the first few weeks, besides the company swag and other perks, there were a few mandatory induction courses to complete. One was called Data Utopia's Data View: Our Data Language and Vision, a course Kelani hadn't seen at any other company, so it was an easy decision to start with that one. Data Utopia knows that every new employee's learning needs to include the organization's data vocabulary and dictionary. Each employee's data knowledge at Data Utopia is aligned, at a very high level, with how the company's products, services and the things it sells are defined. "Wow, finally, a company that understands that a company-wide 'data speak' needs to be developed, maintained, and blasted from the rooftops to all employees, especially during induction," Kelani remarked, impressed. The induction video went on to say: "At Data Utopia, a product is something that we sell, whether it's intangible or tangible.
A service is anything that we as employees do to enhance that product and make it fit our customers' needs." This gave Kelani the foundation to be effective in meetings during the first weeks, because it provided a base understanding of what Data Utopia's critical data and focus are and put him on an equal footing with coworkers, rather than guessing. It also showed the level of dedication Data Utopia's senior management had put into presenting a cohesive data picture that employees could then build on. Nuances to the high-level data definitions can't be avoided, and this is where data curation and metadata tools come in, helping to keep track of any exceptions.

Overall, if everyone is a data steward, then all staff at all levels need to be brought in on the data big picture from the beginning. So on the first day, along with happy smiling faces, employees need access to a searchable data dictionary that includes all company acronyms and explains the company's high-level data definitions, along with a simple conceptual data model that shows how all the company's data fits together. This makes data a part of everyone's job and knowledge base; a company can't be data driven if only a few people understand the data. Please note, this company-wide understanding of data is not a one-time thing. The company needs to make sure that employees are kept up to date with any major data changes and that documentation stays current. So more tutorials with robot-voiced generic employees A, B and C will hopefully be coming to a computer near you, and it will be an ongoing learning process as the company's data evolves. It also means you can't limit data knowledge to just the data architects trying to carry the weight of the world by themselves, and it underlines why everyone being a data steward, from the HR and onboarding team to the C-suite, is so important.

Caution: Nothing in this paper is meant as an endorsement of a specific product or firm. We've named some widely adopted products purely to give a clear example of the capabilities we have in mind.

(technical) Products

You'll have noticed we placed 'Products' at the end of the list. This placement isn't an accident. Speaking as technologists, the simple fact is that technology never ranks as the first building block of a solution. Technology is always last. Nevertheless, technology and products do play a role, and you'll improve quality, speed and productivity with aggressive automation. Let's look at some examples:

DevSecOps Toolchain – Consider a tool for managing requirements, stories, tasks, risks, bugs and issues; JIRA is a useful example. Consider a tool for planning and collaboration, for example Confluence. Consider a tool for software configuration management, for example GitHub. Consider a tool for managing scripts, for example Jenkins. Consider a tool for managing digital recipes, for example Ansible.

Data Pipeline Toolchain – Make aggressive use of tools and automation to process data so everything is faster with each iteration. In an ideal world all of your data would be readily accessible through APIs. You probably don't live in the ideal world, so you're going to need to get down-and-dirty with the data. Consider tools for Data Processing, for example Apache Spark, Hadoop MapReduce and Cloudera Impala.
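To make the data-processing step of such a pipeline concrete, here is a minimal, hedged PySpark sketch. The paths, column names and transformations are illustrative assumptions only; the point is that the preparation work is scripted once and then rerun automatically on every iteration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("invoice-prep").getOrCreate()

    # Hypothetical raw extract landed by the ingestion layer (path is illustrative).
    raw = spark.read.option("header", True).csv("/landing/billing/invoices/*.csv")

    # Standardize types and drop records that cannot be trusted.
    clean = (raw
             .withColumn("amount", F.col("amount").cast("double"))
             .withColumn("invoice_date", F.to_date("invoice_date", "yyyy-MM-dd"))
             .dropna(subset=["customer_id", "amount", "invoice_date"]))

    # Aggregate into an analysis-ready data set and persist it in a columnar format.
    monthly = (clean
               .groupBy(F.trunc("invoice_date", "month").alias("month"), "customer_id")
               .agg(F.sum("amount").alias("total_amount")))

    monthly.write.mode("overwrite").parquet("/curated/billing/monthly_invoice_totals")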
Consider tools for Data Ingestion, for example StreamSets, SFTP, Apache Kafka and NODE. Consider tools for Data Storage, for example Hive, Avro, Parquet and Hadoop HDFS. Consider tools for data exploration, reporting and business intelligence, for example Trifacta, R, Python, Scala and Qlik. These are the tools data wranglers rely on.

Data Discovery Tools (also known as data spiders, data bots, web crawlers or net crawlers) – Most organizations have data in more places than they realize. Worse, between data warehouses, data marts and data cubes, most organizations have more copies of the same data than they realize. The trick is finding all the data, especially the authoritative data, in the first place. Data discovery tools can be a big help. Web crawlers, sometimes called spiders or spiderbots, can systematically browse your organization's intranet, identifying and indexing data. Other data spiders can take files of historic transaction data and create the best, most actionable set of variables. You can purchase some of these, or join open source communities and develop your own data discovery tools. Just be aware you are going to need support from your friendly neighborhood cybersecurity team if you intend to use data discovery tools inside your organization's firewalls.

Software Developer Kits (SDKs) – Your Data Scientists, and the app developers who support them, need tools that let them write, test and run code easily. SDKs are the solution. Most come with a wide range of workbook options, and most come with vast libraries of open source algorithms for Machine Learning and Deep Learning. Anaconda is one example of an open source SDK.

Data Modelling Tools – These are essential in removing the pain, and the avoidance, of creating your organization's big picture of data and tracking lower-level changes. Decreasing the manual effort helps with keeping pace with agile change. As with anything, data modelling tools come in a few different flavors. If you want something installed on a desktop, offerings like ERStudio and Erwin will fit in nicely. If you prefer an online archive of your models, draw.io and Lucidchart are useful, but keep in mind they may not have the same extensive feature list Erwin and ERStudio do. Choosing which features fit your team best is a tightrope balancing act, especially the balance between ease of use and drawing very detailed diagrams. Start with your priorities in mind when looking for a tool; this will help narrow down what will work best for you.

Metadata and Data Curation Tools – A metadata tool and/or data curation tool can help create and maintain an organization's data taxonomy, which alleviates many of the headaches associated with data integration, internally and externally, and means less time is needed to organize and understand the data. Many times, aliases and mappings are maintained in Excel or by siloed shadow IT, and since so few people know the definitions, work has to be done to understand and rationalize similar and/or competing definitions of data entities. There are a lot of metadata tools available, and in some ways the space seems overcrowded, with offerings from Collibra, SAP, Informatica, Alteryx, and Talend, to name a few. When choosing, you need to clearly understand your budget constraints and whether you have the appetite for a data steward to update and curate the metadata as a daily hands-on task, or whether you'd prefer an automated background activity where the only maintenance is error resolution.
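To show the core idea behind such tools in a few lines, here is a hedged Python sketch of a tiny data dictionary with alias mappings, addressing the 'a pen is a pen' problem discussed under Process. The terms, aliases and stewards are invented for the example; a commercial metadata tool adds workflow, lineage and governance on top of this basic idea.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Term:
        name: str                        # the canonical, company-wide name
        definition: str                  # plain-language definition everyone is inducted on
        steward: str                     # who is accountable for this definition
        aliases: List[str] = field(default_factory=list)   # names used in source systems

    class DataDictionary:
        def __init__(self) -> None:
            self._terms: Dict[str, Term] = {}
            self._alias_index: Dict[str, str] = {}

        def add(self, term: Term) -> None:
            self._terms[term.name.lower()] = term
            for alias in term.aliases:
                self._alias_index[alias.lower()] = term.name.lower()

        def lookup(self, name: str) -> Term:
            """Resolve either a canonical name or a source-system alias."""
            key = name.lower()
            key = self._alias_index.get(key, key)
            return self._terms[key]

    # Hypothetical entries.
    dictionary = DataDictionary()
    dictionary.add(Term("Product", "Something we sell, tangible or intangible.",
                        steward="Data Architect", aliases=["ITEM", "SKU_DESC"]))
    dictionary.add(Term("Customer", "A party we sell to.",
                        steward="Business Analyst", aliases=["CLIENT", "ACCT_HOLDER"]))

    print(dictionary.lookup("SKU_DESC").definition)   # the alias resolves to 'Product'

Even a registry this small makes competing definitions visible; the real work is the governance that keeps it current, which is where the tools above earn their keep.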
There are also other tools, like LeanIX, that provide metadata capabilities and can also be used as an application repository, which may align better with budgets that need one solution to serve more than one need.

Let's make data architecture agile...together.

Data Science is by its nature Agile, and data architecture can become Agile to match. Just as most of Data Science is experimental, we have proposed experimental approaches to people, process and (technical) products that can make data architects key players and partners in Data Science. Try the approaches out and see what works in your organization.

About the Authors:

Soyini Taylor – Ms. Taylor is a data practitioner who works within Vodafone's technology function as an Enterprise Data Architect. She provides best practice guidance and leadership on how best to structure data to ease application and data integration. In addition, she advises on how best to define, collate, and manage data as it evolves and changes. Current interest and study areas include AI and Data Analytics.

Michael Conlin – Following two years as the first Chief Data Officer of the U.S. Department of Defense, Mr. Conlin has been appointed the first Chief Business Analytics Officer (CBAO) of the Department. As CBAO, Mr. Conlin is responsible for providing Department executives with evidence-based analytics and insights on the cost and performance of the Department's operations, at the speed of relevance.

