Building your Data Lake on Azure Data Lake Storage gen2 – Part 2


Introduction

Part 1 of this blog covered some fundamental data lake topics such as planning, design and structure. Part 2 will focus on key aspects of ADLS gen2 such as implementation, security and optimisation.

 

Data lakes, storage accounts & filesystems

A common implementation decision is whether to have a single data lake or multiple data lakes, storage accounts and filesystems. The data lake itself may be considered a single logical entity, yet it could span multiple storage accounts in different subscriptions, and even in different regions, with either centralised or decentralised management and governance. Whatever the physical implementation, the benefit of using a single storage technology is the ability to standardise across the organisation, with numerous ways in which to access the data. Whilst having multiple storage accounts or filesystems does not in itself incur any additional monetary cost (until one actually stores and accesses data), there is an inherent administrative and operational overhead associated with each resource in Azure, in order to ensure that provisioning, billing, security and governance (including backups and DR) are managed appropriately.

The decision to create one or multiple accounts requires thought and planning based on your individual scenario. Some of the most important considerations might be:

  • Planning for large-scale enterprise workloads may require significant throughput and resources. The various subscription and service quotas may influence your decision to split the lake physically across multiple subscriptions and/or storage accounts. See the addendum for more information.
  • Regional vs global lakes. Globally distributed consumers or processes on the lake may be sensitive to latency caused by geographic distance and therefore require the data to reside locally. Regulatory constraints or data sovereignty requirements may prevent data from leaving a particular region. These are just a few reasons why one physical lake may not suit a global operation.
  • Global enterprises may have multiple regional lakes but need to obtain a global view of their operations. A centralised lake might collect and store regionally aggregated data in order to run enterprise-wide analytics and forecasts.
  • Billing and organisational reasons. Certain departments or subsidiaries may require their own data lake due to billing or decentralised management reasons.
  • Environment isolation and predictability. Even though ADLS gen2 offers excellent throughput, there are still limits to consider. For example, one may wish to isolate the activities running in the laboratory zone from potential impact on the curated zone, which normally holds data with greater business value used in critical decision making.
  • Features and functionality at the storage account level. If you want to make use of options such as lifecycle management or firewall rules, consider whether these need to be applied at the zone or data lake level.

Whilst there may be many good reasons to have multiple storage accounts, one should be careful not to create additional silos, thereby hindering data accessibility and exploration. Take care to avoid duplicate data projects caused by a lack of visibility or knowledge-sharing across the organisation, which is all the more reason to ensure that a centralised data catalogue and project-tracking tool are in place. Fortunately, data processing tools and technologies like ADF and Databricks (Spark) can easily interact with data across multiple lakes, so long as permissions have been granted appropriately. For information on the different ways to secure ADLS from Databricks users and processes, please review the following patterns.
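To illustrate, the hedged sketch below shows a single Spark session reading from two hypothetical ADLS gen2 accounts (lakewesteurope and lakeuksouth, each with a lake container) using one service principal that has been granted access to both. The account names, container, tenant and credential placeholders are assumptions for illustration only:

```python
# Minimal sketch: one Spark session joining data held in two ADLS gen2 accounts.
# Account names, container name, tenant and credentials are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

tenant_id = "<tenant-id>"
client_id = "<service-principal-client-id>"
client_secret = "<retrieve-from-a-secret-scope-or-key-vault>"

# Register OAuth credentials for each storage account the session needs to reach.
for account in ["lakewesteurope", "lakeuksouth"]:
    suffix = f"{account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
                   f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Read from the regional lake and the central lake in the same job.
regional_sales = spark.read.parquet(
    "abfss://lake@lakewesteurope.dfs.core.windows.net/curated/sales/")
product_reference = spark.read.parquet(
    "abfss://lake@lakeuksouth.dfs.core.windows.net/curated/products/")

global_view = regional_sales.join(product_reference, "product_id")
```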

 

HNS, RBAC & ACLs

It should be reiterated that ADLS gen2 is not a separate service (as was gen1) but rather a normal v2 storage account with Hierarchical Namespace (HNS) enabled. A standard v2 storage account cannot be migrated to ADLS gen2 afterwards; HNS must be enabled at the time of account creation. Without HNS, the only mechanism to control access is role-based access control (RBAC) at container level, which, for some, does not provide sufficiently granular access control. With HNS enabled, RBAC can be used for storage account admins and container-level access, whereas access control lists (ACLs) specify who can access the files and folders, but not the storage account level settings. It should be noted that it is entirely possible to use a combination of both RBAC and ACLs. RBAC permissions are always evaluated first, and if the requested operation does not match the assigned permissions, then ACLs will be evaluated. For example, if a security principal has Storage Blob Data Reader but requests to write to a specific folder, write access could still be granted through ACLs. For further information please see the documentation.
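As a hedged illustration of that creation-time decision, the sketch below provisions an HNS-enabled account with the Python management SDK (azure-mgmt-storage); the resource group, account name, region and SKU are placeholder assumptions rather than recommendations:

```python
# Sketch: HNS can only be switched on when the storage account is created.
# Resource group, account name, region and SKU below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="rg-datalake",
    account_name="contosodatalake",
    parameters=StorageAccountCreateParameters(
        location="westeurope",
        kind="StorageV2",                # a normal v2 storage account...
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,             # ...with Hierarchical Namespace enabled at creation
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)     # the ADLS gen2 (dfs) endpoint
```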

 

Managing Access

As mentioned above, access to the data is managed through ACLs, using a combination of execute, read and write permissions at the appropriate folder and file level. Execute is only used in the context of folders, and can be thought of as search or list permission for that folder.

The easiest way to get started is with Azure Storage Explorer: navigate to the folder and select Manage Access. In production scenarios, however, it is always recommended to manage permissions via a script which is kept under version control. See here for some examples.

It is important to understand that in order to access (read or write) a folder or file at a certain depth, execute permissions must be assigned to every parent folder all the way back up to the root level as described in the documentation. In other words, a user (in the case of AAD passthrough) or service principal (SP) would need execute permissions to each folder in the hierarchy of folders that lead to the file.
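As a hedged sketch of what such a script might look like with the Python SDK (azure-storage-file-datalake), the example below grants a group rwx on a target folder and execute on its parent folders so that the group can traverse down to it. The account, container, folder path and group object ID are hypothetical placeholders, and container-level execute is assumed to be handled separately (see the next section):

```python
# Sketch: grant an AAD group rwx on raw/project-x/telemetry, plus execute (--x) on the
# parent folders it must traverse. Account, container, path and group OID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("lake")

group_oid = "<aad-group-object-id>"

def add_acl_entry(directory_client, entry):
    """Append one ACL entry, because set_access_control replaces the whole list."""
    current = directory_client.get_access_control()["acl"]
    if entry not in current.split(","):
        directory_client.set_access_control(acl=f"{current},{entry}")

# Execute-only on each parent folder in the hierarchy leading to the target.
for parent in ["raw", "raw/project-x"]:
    add_acl_entry(fs.get_directory_client(parent), f"group:{group_oid}:--x")

# Read/write/execute on the target folder, plus a default entry so that new child
# items created later inherit the same permission.
target = fs.get_directory_client("raw/project-x/telemetry")
add_acl_entry(target, f"group:{group_oid}:rwx")
add_acl_entry(target, f"default:group:{group_oid}:rwx")
```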

Resist assigning ACLs to individuals or service principals

When using ADLS, permissions can be managed at the directory and file level through ACLs, but as per best practice these should be assigned to groups rather than individual users or service principals. There are two main reasons for this: i) changing ACLs can take time to propagate if there are thousands of files, and ii) there is a limit of 32 access control entries per access control list, per file or folder. This is a general Unix-based limit, and if you exceed it you will receive an internal server error rather than an obvious error message. Note that each ACL already starts with four standard entries (the owning user, the owning group, the mask, and other), so this leaves only 28 entries available to you, which should be more than enough if you use groups…

“ACLs with a high number of ACL entries tend to become more difficult to manage. More than a handful of ACL entries are usually an indication of bad application design. In most such cases, it makes more sense to make better use of groups instead of bloating ACLs.”

One way to prevent a proliferation of execute ACLs at the top-level folders is to start out with a security group which is given both default and access execute permissions at the container level, and then to add the other groups into this group in order to allow them to traverse the folder tree. More on this in a follow-up blog, but be aware that this approach should preferably be applied before folders and files are created, due to the way in which permission inheritance works:

“…permissions for an item are stored on the item itself. In other words, permissions for an item cannot be inherited from the parent items if the permissions are set after the child item has already been created. Permissions are only inherited if default permissions have been set on the parent items before the child items have been created.”

In other words, default permissions are applied to new child folders and files so if one needs to apply a set of new permissions recursively to existing files, this will need to be scripted. See here for an example in PowerShell.
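The linked example uses PowerShell; as a rough, hedged Python equivalent, newer versions of the azure-storage-file-datalake SDK expose update_access_control_recursive, which adds or updates the specified entries on every existing item beneath a folder. The account, container, path and group object ID below are placeholders:

```python
# Sketch: push a new group permission down an existing folder tree. The update is
# additive: it adds/updates the given entries on each item rather than replacing its ACL.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("lake").get_directory_client("raw/project-x")

group_oid = "<aad-group-object-id>"
# An access entry for existing items plus a default entry so future children inherit it.
new_entries = f"group:{group_oid}:r-x,default:group:{group_oid}:r-x"

result = directory.update_access_control_recursive(acl=new_entries)
print(f"Updated {result.counters.directories_successful} folders and "
      f"{result.counters.files_successful} files")
```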

The recommendation is clear: planning and assigning ACLs to groups beforehand can save time and pain in the long run. Users and service principals can then be efficiently added to and removed from groups in the future as permissions need to evolve. If for some reason you decide to throw caution to the wind and add service principals directly to the ACL, then be sure to use the object ID (OID) of the service principal and not the OID of the registered application, as described in the FAQ. You may also wish to consider writing reports to monitor and manage ACL assignments, and to cross-reference these with Storage Analytics logs.

 

File Formats & File Size

As data lakes have evolved over time, Parquet has emerged as the most popular storage format for data in the lake. Depending on the scenario or zone, it may not be the only format chosen; indeed, one of the advantages of the lake is the ability to store data in multiple formats, although it is best (though not essential) to stick to a particular format within each zone, primarily for consistency from the point of view of that zone's consumers.

Choosing the most appropriate format will often be a trade-off between storage cost, performance and the tools used to process and consume data in the lake. The type of workload may also influence the decision, for example real-time/streaming, append-only or DML-heavy workloads.

As mentioned previously, lots of small files (KBs in size) generally lead to suboptimal performance and potentially higher costs due to the increased number of read/list operations.

Azure Data Lake Storage Gen2 is optimised to perform better on larger files. Analytics jobs will run faster and at a lower cost.

Costs are reduced due to the shorter compute (Spark or Data Factory) times but also due to more efficient read operations. For example, files greater than 4 MB in size incur a lower price for every 4 MB block of data read beyond the first 4 MB, so reading a single 16 MB file is cheaper than reading four 4 MB files: the larger file is charged at the higher rate for only its first 4 MB block, whereas each of the four small files is charged at that rate in full. Read more about Data Lake gen2 storage costs here, and in particular, see the FAQ section at the bottom of the page.

When processing data with Spark, the typical guidance is around 64 MB to 1 GB per file. It is well known in the Spark community that thousands of small files (KBs in size) are a performance nightmare. In the raw zone this can be a challenge, particularly for streaming data, which will typically arrive as smaller files/messages at high velocity. Files will need to be regularly compacted/consolidated, or, for those using the Databricks Delta Lake format, OPTIMIZE or even AUTO OPTIMIZE can help. If the stream is routed through Event Hubs, the Capture feature can be used to persist the data in Avro files based on time or size triggers. Another technique is to store the raw data as a column in a compressed format such as Parquet or Avro.
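As a hedged sketch of such a compaction job in PySpark, the example below periodically rolls many small raw JSON files up into a handful of larger Parquet files. The paths, account/container names and the output file count are placeholder assumptions that would need tuning towards the 64 MB to 1 GB sweet spot:

```python
# Sketch: compact small raw JSON files into fewer, larger Parquet files.
# Paths, account/container names and the output file count are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://lake@contosodatalake.dfs.core.windows.net/raw/telemetry/2020/06/01/"
compacted_path = "abfss://lake@contosodatalake.dfs.core.windows.net/standardised/telemetry/date=2020-06-01/"

small_files = spark.read.json(raw_path)

# Choose a file count that lands each output file in the ~64 MB-1 GB range for this dataset.
num_output_files = 8

(small_files
    .coalesce(num_output_files)
    .write.mode("overwrite")
    .parquet(compacted_path))
```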

In non-raw zones, read-optimised, columnar formats such as Parquet and the Databricks Delta Lake format are a good choice. Particularly in the curated zone, analytical performance becomes essential, and the advantages of predicate pushdown/file skipping and column pruning can save both time and cost. With a lack of RDBMS-like indexes in lake technologies, big data optimisations are obtained by knowing "where not to look". As mentioned previously, however, be cautious of over-partitioning and do not choose a partition key with high cardinality. Comparisons of the various formats can be found in the blogs here and here.
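The hedged PySpark sketch below illustrates the point: curated data is written partitioned by low-cardinality columns, and a filtered read then only touches the matching folders (partition pruning) and the referenced columns (column pruning). All paths, column names and sample values are hypothetical:

```python
# Sketch: partition curated data by low-cardinality columns so readers can prune.
# Paths, column names and sample values are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
curated_path = "abfss://lake@contosodatalake.dfs.core.windows.net/curated/sales/"

sales = spark.createDataFrame(
    [(2020, 6, "store-01", 1250.0), (2020, 7, "store-01", 980.0)],
    ["year", "month", "store_id", "net_sales"],
)

(sales.write.mode("overwrite")
      .partitionBy("year", "month")     # low cardinality: one folder per year/month
      .parquet(curated_path))

# Only the year=2020/month=6 folder is scanned, and only the selected columns are read.
june_sales = (spark.read.parquet(curated_path)
                   .where("year = 2020 AND month = 6")
                   .select("store_id", "net_sales"))
```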

In summary, with larger data volumes and greater data velocity, file formats are going to play a crucial role in ingestion and analytical performance. In the raw zone, where there is a greater likelihood of an accumulation of smaller files, particularly in IoT-scale scenarios, compression is going to be another important consideration. Leaving files in a raw format such as JSON or CSV may incur a performance or cost overhead. Here are some options to consider when faced with these challenges in the raw layer:

  • Consider writing files in batches and use formats with a good compression ratio such as Parquet or use a write optimised format like Avro.
  • Introduce an intermediate data lake zone/layer between raw and cleansed which periodically takes uncompressed and/or small files from raw, and compacts them into larger, compressed files in this new layer. If raw data ever needs to be extracted or analysed, these processes can run more efficiently against this intermediate layer rather than the raw layer.
  • Use lifecycle management to archive raw data to reduce long-term storage costs without having to delete data; a sample policy is sketched after this list.
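For the last option, a lifecycle management rule is simply a JSON policy applied to the storage account. The sketch below is an assumption-laden example (hypothetical container name "lake", prefix and 180-day threshold) that moves raw block blobs to the archive tier once they have not been modified for 180 days:

```json
{
  "rules": [
    {
      "name": "archive-raw-after-180-days",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "lake/raw/" ]
        },
        "actions": {
          "baseBlob": {
            "tierToArchive": { "daysAfterModificationGreaterThan": 180 }
          }
        }
      }
    }
  ]
}
```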

 

Conclusion

There is no one-size-fits-all approach to designing and building a data lake. Some may grow their data lake incrementally, starting quickly by taking advantage of more cost-effective storage and data processing techniques, such as ETL off-loading. Others may decide to spend time up front, planning their ingestion and consumption needs, the personas involved, and their security and governance requirements. As the data lake footprint expands, planning becomes even more crucial, but it should not stall progress indefinitely through "analysis paralysis". The data lake can promote a more data-centric, data-driven culture through the democratisation of data, but to achieve long-term success this should be an organisation-wide commitment, not just an IT-driven project.

 

Addendum — ADLS gen2 considerations

Whilst quotas and limits will be an important consideration, some of these are not fixed, and the Azure Storage Product Team will always try to accommodate your requirements for scale and throughput where possible. At the time of writing, these are the published quotas and items to consider:

  • Maximum storage account capacity of 5 PB for all regions. This is a default limit which can normally be raised through a support ticket.
  • Maximum request rate of 20,000 requests per second per storage account.
  • Maximum ingress rate of 25 Gbps per storage account.
  • 250 storage accounts per subscription.
  • A maximum of 32 entries per access ACL and per default ACL, for each file or folder. This is a hard limit, hence ACLs should be assigned to groups instead of individual users.
  • See other limits here. Note some default (max) limits or quotas may be increased via support request.
  • Azure services which support ADLS gen2.
  • Blob storage features which are supported.
  • Other important considerations.

Please note that limits, quotas and features are constantly evolving, therefore it is advisable to keep checking the documentation for updates.

 

Additional Reading