Data lake operationalisation is a colossal topic, with much deliberation devoted to building the right data lake, defining the right strategy, and the points everyone stresses before the build even begins.
This blog provides six mantras for organisations to ruminate on in order to successfully tame the "Operationalising" of a data lake after its production release.
1. ALWAYS have a North Star Architecture
Data lakes are not only about pooling data, but also dealing with aspects of its consumption. The choice of data lake pattern depends on the masterpiece one wants to paint.
Central vs Federated vs Hybrid
Depending on the organisation's needs, you can choose to store enterprise data all in one location (Central), closest to the organisation's headquarters, or, where sovereignty requirements demand it, keep the data stored in the specific subsidiaries (Federated).
If an enterprise has a global footprint, adopting a hub-and-spoke model (Hybrid), with satellites of local data closer to the reporting countries, will do the trick. Even though this model brings alignment issues (data replication etc.), it aids performance, regional governance and development. (Fig 1)
Figure 1 – Hybrid Architecture
Streamed vs. Batch vs. Near Real Time
- NRT Streaming – data arriving every 15 minutes to one hour and processed immediately, only where needed
- Lambda – data is fed into both a batch layer and a speed layer. The speed layer computes real-time views, while the batch layer computes batch views at regular intervals. Together they cover all the needs of data ingestion and distribution.
- Define your Hot and Cold Paths – choose the right storage(s) for your data lake; Microsoft's offerings of Azure Cosmos DB and ADLS Gen2 serve the hot and cold paths respectively.
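The lambda flow described above can be sketched as follows. This is a minimal illustration, not production code: the two dicts stand in for a hot store (e.g. Cosmos DB) and a cold store (e.g. ADLS Gen2), and all names are illustrative.

```python
import time

# Hypothetical in-memory stand-ins for the hot path (e.g. Azure Cosmos DB)
# and the cold path (e.g. ADLS Gen2).
hot_store = {}    # low-latency lookups for real-time views
cold_store = []   # append-only history for batch views

def ingest(event: dict) -> None:
    """Lambda-style ingestion: every event lands in both paths."""
    # Speed layer: upsert the latest state for immediate queries.
    hot_store[event["id"]] = event
    # Batch layer: append the raw event for periodic recomputation.
    cold_store.append(event)

def batch_view() -> dict:
    """Recompute an aggregate view from the cold path at regular intervals."""
    counts = {}
    for event in cold_store:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts

ingest({"id": "a1", "type": "order", "ts": time.time()})
ingest({"id": "a2", "type": "click", "ts": time.time()})
```

The point of the dual write is that the speed layer serves queries the moment an event arrives, while the batch layer can always rebuild the authoritative view from history.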
Build the right HA-DR: High Availability & Disaster Recovery Strategy
High availability (HA) strategies handle temporary failure conditions so the system can continue functioning, while disaster recovery (DR) addresses recovering from a catastrophic loss of application functionality. For the right HA and DR framework, keep the following scenarios in mind along with business criticality: data corruption, accidental data deletion, regional outages, network/connectivity issues and component failures.
ADLS Gen2 now supports replication options such as ZRS and GZRS (preview), which improve HA, while GRS and RA-GRS improve DR. Azure Cosmos DB is known for its 99.999% high availability and globally distributed replication.
Each Azure component ticks most of these boxes, so I encourage you to look at the product documentation.
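To make the RA-GRS idea concrete, here is a toy failover read, assuming a primary endpoint and a read-only geo-replicated secondary. The dicts and exception are stand-ins, not a real storage client.

```python
# Illustrative failover read, mimicking the RA-GRS pattern of a primary
# endpoint with a read-only secondary; the stores here are plain dicts.
primary = {"report.csv": b"rows..."}
secondary = dict(primary)  # geo-replicated copy (read-only, may lag)

class RegionalOutage(Exception):
    pass

def read_blob(name: str, primary_up: bool = True) -> bytes:
    """Try the primary region first; fall back to the secondary on outage."""
    try:
        if not primary_up:
            raise RegionalOutage("primary region unreachable")
        return primary[name]
    except RegionalOutage:
        # RA-GRS-style fallback: reads survive a regional outage,
        # at the cost of possibly stale data.
        return secondary[name]
```

A client written this way keeps serving reads through a regional outage, which is exactly the scenario list above (regional outage, component failure) in miniature.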
2. Subscription Model
Planning a Data Lake and then scaling it up requires some contemplation.
Each Azure product comes with boundary considerations: subscription limits, quotas and constraints. Tread cautiously to avoid hitting product thresholds while scaling. When defining the lambda architecture you can choose your storage, and ADLS Gen2 and Cosmos DB both do an exceptional job of overcoming throughput and limit challenges. Environment isolation should also be thought through, especially for resource consumption in laboratory experiments and for testing features and functionality such as firewall rules or life-cycle management.
Businesses may want to keep the billing separate or define a chargeback model through different subscriptions for each business layer, and also consider other influencing factors such as regional legal obligations, regulatory constraints or data sovereignty.
Costs for Dev/Test environments can be reduced by seeking out providers like Microsoft who offer generous discounts on lower environments. It is always advisable to have separate subscriptions for Dev/Test and Production, split by business function. Choose wisely and save profusely.
Owing to these constraints, you could revisit the North Star architecture and look at hub-and-spoke models where suitable.
3. Understand the Soul of the Data Sources
It is imperative to feel the pulse of the different source systems and their interactions, as this gives a better idea of how to sufficiently hydrate the data lake. ADF does a great job of covering many data sources; however, for non-native connectors, identify a pattern for an alternative source pull. This can be achieved with an API pull, with Databricks, or by using blob storage as a landing zone for external files. If the external source system also sits on a data lake, Azure Data Share and similar options can be used.
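The API-pull-to-landing-zone pattern can be sketched as below. The `fetch` callable and the local folder are hypothetical stand-ins for a real REST client and a blob container.

```python
import json
import pathlib
import tempfile

def pull_to_landing_zone(fetch, landing_dir: str, source_name: str) -> pathlib.Path:
    """Pull records from a non-native source via an API callable and
    write them to a landing-zone folder (stand-in for blob storage)."""
    records = fetch()  # in practice: a paginated REST pull with auth/retries
    landing = pathlib.Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    out = landing / f"{source_name}.json"
    out.write_text(json.dumps(records))
    return out

# Hypothetical fetch callable standing in for a real API client.
fake_fetch = lambda: [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
path = pull_to_landing_zone(fake_fetch, tempfile.mkdtemp(), "crm")
```

Once the file lands in the zone, the regular ADF copy activity can pick it up exactly as it would any native source.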
Choose the right architecture for Ingestion patterns and Refreshes
- Separate the Batch and Stream: the aim is to mitigate throttling issues caused by spinning up individual job clusters. Instead, have a central way to spin up a limited set of shared clusters you can monitor, and eventually use cluster pools to leverage large clusters to run smaller jobs faster and with more control over execution. Centralised clusters avoid scaling and throughput issues and limitations. Tip: use ForEach and iteration in ADF for calling existing notebooks.
- ADF jobs should run in parallel to attain optimum performance and to leverage the central Databricks cluster. Choose your clusters wisely (and remember the limits!).
- Data Refresh: each source handles delta refreshes differently. For the Raw layer, keep the pattern of the data source. For subsequent layers, metadata or mapping tables/files in SQL DW can be used for reference once a strategy is defined. A mapping file containing the primary-key column of the data and the processing timestamp aids delta loads. Databricks Delta lets organisations remove complexity by providing the benefits of multiple storage systems in one.
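The mapping-file idea above boils down to a watermark per table: remember the last processed timestamp, load only what is newer, then advance the watermark. A minimal sketch, with illustrative table and column names:

```python
from datetime import datetime

# Illustrative mapping entry: primary-key column name and the last
# processed timestamp (the "watermark") per source table.
mapping = {"orders": {"key": "order_id", "watermark": datetime(2024, 1, 1)}}

def delta_rows(table: str, rows: list[dict]) -> list[dict]:
    """Return only rows changed since the stored watermark, then advance it."""
    wm = mapping[table]["watermark"]
    fresh = [r for r in rows if r["modified"] > wm]
    if fresh:
        mapping[table]["watermark"] = max(r["modified"] for r in fresh)
    return fresh

rows = [
    {"order_id": 1, "modified": datetime(2023, 12, 31)},  # already loaded
    {"order_id": 2, "modified": datetime(2024, 2, 1)},    # new delta
]
fresh = delta_rows("orders", rows)
```

In practice the `mapping` dict would live in the SQL DW mapping table mentioned above, so every pipeline run starts from a durable watermark.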
CI/CD should be well planned, with the right governance body driving central guidelines. Build a one-click deployment framework, parameterise templates, and adopt ARM templates and DataOps wherever necessary.
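Parameterising templates means one template, many environments. The sketch below loosely mimics the ARM `[parameters('…')]` reference style with plain string substitution; the template keys, parameter names and SKUs are illustrative assumptions, not a real ARM deployment.

```python
# One deployment template, rendered per environment; the reference syntax
# loosely follows ARM parameter files, but this is a toy substitution.
template = {
    "storageAccountName": "[parameters('name')]",
    "sku": "[parameters('sku')]",
}
env_params = {
    "dev":  {"name": "datalakedev",  "sku": "Standard_LRS"},
    "prod": {"name": "datalakeprod", "sku": "Standard_GZRS"},
}

def render(template: dict, params: dict) -> dict:
    """Substitute parameter references so one template serves every environment."""
    out = {}
    for key, value in template.items():
        for pname, pval in params.items():
            value = value.replace(f"[parameters('{pname}')]", pval)
        out[key] = value
    return out

dev = render(template, env_params["dev"])
prod = render(template, env_params["prod"])
```

The same pattern is what a one-click pipeline does for real: the template is fixed and versioned, and only the per-environment parameter file changes between Dev/Test and Production subscriptions.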
4. Access to your Data Lake
A data lake is accessed at three points: while hydrating the data lake, between layers of the data lake, and while exposing the data lake to downstream systems.
RBAC and ACLs
Azure role-based access control (RBAC) lets you assign roles to security principals and controls access at the resource level, whereas POSIX-like access control lists (ACLs) define access to individual files and directories. ADLS Gen2 secures data by supporting both RBAC and ACL-based access controls. Cosmos DB's RBAC controls can be leveraged for streamed data or hot-path access.
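The two layers can be pictured with a toy model: coarse roles at resource scope, fine-grained `rwx` entries per path, and read access granted if either layer allows it. Role names and paths here are illustrative; real ADLS evaluation has more nuance, so treat this as a mental model only.

```python
# Toy model of the two access layers: coarse RBAC roles at resource scope,
# and POSIX-like ACL entries ("rwx" strings) on individual paths.
rbac = {"alice": {"Storage Blob Data Reader"}}
acls = {"/raw/sales.csv": {"alice": "r--", "etl-svc": "rw-"}}

def can_read(principal: str, path: str) -> bool:
    """Grant read if a data-plane RBAC role applies OR the path ACL allows it."""
    has_role = "Storage Blob Data Reader" in rbac.get(principal, set())
    acl = acls.get(path, {}).get(principal, "---")
    return has_role or acl[0] == "r"
```

For example, `can_read("etl-svc", "/raw/sales.csv")` succeeds purely through the ACL, while `alice` is admitted at the RBAC layer before ACLs are even consulted.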
Managing job dependencies is very critical in a multi-layer environment (Raw, Curated, etc.). ADF job dependency can be managed by implementing a service bus approach: on completion of each job, an entry is made in the service bus, which publishes the status to all subscribers. Downstream ADF pipelines subscribe to the service bus to capture the completion of the job.
A sample pathway below depicts automation of ADF for downstream systems using a pub/sub technique to alert delta updates or new inserts:
Figure 2 – Pub/Sub Pattern
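The pub/sub dependency pattern can be reduced to a few lines. This in-process sketch stands in for a real service bus; topic names, job names and the `Succeeded`/`Failed` statuses are illustrative.

```python
from collections import defaultdict

# Minimal in-process stand-in for a service bus topic.
subscribers = defaultdict(list)
triggered = []

def subscribe(topic: str, handler) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, message: dict) -> None:
    """Fan the message out to every subscriber of the topic."""
    for handler in subscribers[topic]:
        handler(message)

def on_raw_load(message: dict) -> None:
    # A downstream pipeline reacts only when the upstream job succeeded.
    if message["status"] == "Succeeded":
        triggered.append(message["job"])

subscribe("raw-load-complete", on_raw_load)
publish("raw-load-complete", {"job": "curate-sales", "status": "Succeeded"})
publish("raw-load-complete", {"job": "curate-hr", "status": "Failed"})
```

The key property is decoupling: the upstream job never needs to know which downstream pipelines exist, only which topic to publish completion to.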
Databricks: Working Group Concept
We should aim to build an end-to-end data pipeline composed of the functional components of a Databricks workspace, one per "working group", to cater to the consumers of the data lake: namely data engineering, data analysis and machine learning.
Each "working group" may provide a Unified Analytics Platform that brings together Big Data and AI, and allows the organisation's different people, users and analysts to come together and collaborate in a common and secure space.
Figure 3 – Working Group
External Share within Azure
There is a need to share data within and across organisations. Azure Data Share enables organisations to simply and securely share data with multiple customers and partners. A comprehensive list of ways to access ADLS is shared here. A similar thought process applies to the other storage accounts.
5. Operations and Logging Framework
Building a data lake is a continuous and iterative process. Large organisations usually have built-in processes for application maintenance activities with which the data lake must integrate. Alerting and monitoring frameworks are set up to watch things like the health and performance of a system, and it is essential to monitor the data flow, alert on operational snags, and integrate with the organisation's ITSM systems.
Azure Monitor and Log Analytics provide microscopic detail on the functioning of Azure components, pipelines and notebooks, assisting a central operations and logging framework. However, capturing highly detailed information in Log Analytics can burn a hole in your pocket. Building a custom monitoring and logging framework with a central database, which enables logging throughout the orchestration pathway in ADF, can help establish a healthy pipeline framework for ADF and event-based data flow within the system.
Figure 4 – Operations and Logging Framework
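A custom central log can start as small as one table. In this sketch SQLite stands in for the central database, and pipeline, stage and status names are illustrative; ADF activities would call something like `log_event` at each step, and the alerting feed drives ITSM integration.

```python
import sqlite3

# SQLite stands in for the central logging database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipeline_log (pipeline TEXT, stage TEXT, status TEXT, "
    "logged_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)

def log_event(pipeline: str, stage: str, status: str) -> None:
    """Called from each orchestration step to record its outcome."""
    conn.execute(
        "INSERT INTO pipeline_log (pipeline, stage, status) VALUES (?, ?, ?)",
        (pipeline, stage, status),
    )
    conn.commit()

def failures() -> list:
    """Feed for alerting/ITSM integration: everything that did not succeed."""
    return conn.execute(
        "SELECT pipeline, stage FROM pipeline_log WHERE status != 'Succeeded'"
    ).fetchall()

log_event("ingest-sales", "copy", "Succeeded")
log_event("ingest-sales", "transform", "Failed")
```

Because the schema is yours, you decide exactly how verbose the logging is, which is how this approach avoids the Log Analytics ingestion costs mentioned above.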
6. Data Catalogue
Trust your data. One single version of the truth. A single copy. No redundancy. True definitions. A clear data owner.
These are the key challenges companies and data handlers face, fuelling the need for a data lake. A data catalogue traces the lineage and origin of the data and illuminates its transformations, which is the norm even in machine learning operations; it is a stepping stone towards Responsible AI.
Data cataloguing provides a comprehensive view of every detail across databases and data sources. A good data catalogue output should include a Data Dictionary, Data Lineage and Entity Relationships. Third-party solutions like Informatica serve this purpose, as per organisational needs. Economical alternatives, such as a combination of SharePoint and Excel documents with a manually maintained or automated SQL DB instance for search, can provide end-to-end cataloguing.
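Even the economical SQL-DB route reduces to a simple record shape covering the three outputs named above. A bare-bones sketch, with field and dataset names that are illustrative, not from any specific product:

```python
# Minimal catalogue record covering dictionary, lineage and relationships.
catalogue = {}

def register(dataset: str, columns: dict, upstream: list, related: list) -> None:
    catalogue[dataset] = {
        "dictionary": columns,     # column -> description (data dictionary)
        "lineage": upstream,       # datasets this one was derived from
        "relationships": related,  # linked entities
    }

def lineage(dataset: str) -> list:
    """Walk upstream recursively to trace a dataset back to its origins."""
    direct = catalogue.get(dataset, {}).get("lineage", [])
    result = list(direct)
    for parent in direct:
        result.extend(lineage(parent))
    return result

register("crm.accounts", {"id": "account key"}, upstream=[], related=[])
register("curated.sales", {"amount": "net amount"},
         upstream=["crm.accounts"], related=["curated.customers"])
```

The recursive `lineage` walk is the part that earns its keep: it is exactly the "where did this number come from?" question that builds trust in a single version of the truth.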
Choosing the right storage for your needs: