Best practices while migrating large-scale data from AWS S3 to Azure Blob

By RK Iyer, Cloud Solution Architect Data & AI; Amit Damle, Cloud Solution Architect Data & AI 

Relocating to a new place is always an exciting adventure, but one would surely agree that it is an equally stressful and exhausting journey. Still, there are ways to ensure that your relocation goes as smoothly as possible through careful planning, preparation, and execution. Similarly, data relocation/migration, especially from one cloud to another, requires a lot of thought to ensure that the migration journey is smooth.

While working with some of our large customers, we faced the challenge of moving around 600 TB to 1 PB of data from AWS S3 to Azure Blob. In this post, we will share our learnings on the different considerations and how to overcome the challenges using some best practices.

Typically, a large-scale data migration involves two main categories of data, regardless of domain (healthcare and life sciences, media and entertainment, financial services, or retail):

  • Moving large media assets and content data like audio, video, images, DICOM files, product catalog, email, web page data, etc. 
  • Data lake/Data warehouse migrations moving historical data stored in filesystem/databases. 
Architectural pillars of large-scale data movement

Before embarking on a large-scale data migration, there are a number of factors that need to be considered: 

  • Cost: Development, maintenance and management, monitoring, and debugging costs 
  • Time for setup: Time needed to create/arrange the infrastructure and complete the end-to-end migration 
  • Ease of use: How easy it is to perform the migration 
  • Performance: End-to-end data migration must be performed within an acceptable time range depending on business requirements 
  • Security: Data should be transferred in a secure manner, and data protection must ensure that only authorized entities can view, modify, or delete your data 
  • Monitoring: Data transfer should be continuously monitored; for offline options, one should also be able to track the physical order through the Azure portal 
  • Reliability: Data must be consistent; there needs to be a mechanism to ensure that the source and destination data match 
  • Control: Ability to selectively transfer data over a period of time 
Pros and cons of migration options

Although there are multiple migration options, both offline/physical and online/network-based, we've seen customers wanting to avoid offline/physical transfer due to the time, uncertainty, heavy coordination, and unreliability involved in the current pandemic scenario. In online/non-physical transfer mode, customers want to avoid reinventing the wheel with custom scripts, since many of the required features, such as monitoring, auditing, resumability, and security, would need to be developed from scratch.

In these scenarios, Azure Data Factory (ADF) becomes the clear choice, since most of the required features are available out of the box.

Lessons learnt while using ADF

Selective copy of data with prioritization 
ADF supports selective copying of data, letting you copy only specific folders or files so that the overall turnaround time is shorter. Fine-grained, priority-based control of the data transfer reduces migration risk, since a small amount of high-priority data can be transferred and validated first rather than waiting for the entire migration to complete before starting development.

Please refer to Copy data from/to a file system — Azure Data Factory & Azure Synapse | Microsoft Docs for more details.
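
As an illustration, the high-priority subset can be expressed as an explicit file list that the copy activity's file list option consumes. The sketch below is a minimal example, assuming a hypothetical bucket name and priority prefixes, that uses boto3 to enumerate only the prioritized folders and write one relative path per line:

```python
# Sketch: build a prioritized file-list manifest for the copy activity's
# file list option. Bucket name and prefixes are illustrative placeholders.
import boto3

BUCKET = "example-source-bucket"          # hypothetical bucket
PRIORITY_PREFIXES = ["gold/", "silver/"]  # hypothetical high-priority folders

def write_manifest(manifest_path: str) -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    with open(manifest_path, "w") as manifest:
        for prefix in PRIORITY_PREFIXES:
            for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
                for obj in page.get("Contents", []):
                    # One path per line, relative to the dataset's folder path.
                    manifest.write(obj["Key"] + "\n")

if __name__ == "__main__":
    write_manifest("priority_filelist.txt")
```

The manifest is then uploaded to a location the pipeline can read and referenced from the copy activity's source settings, so only the listed files are transferred in that run.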

Delta migration
Typically, we've observed that until the AWS S3 bucket is fully sunset, downstream applications keep writing files to AWS S3 after the initial one-time migration. This incremental data needs to be loaded into Azure Blob using an incremental data pipeline in ADF.

ADF provides the capability to identify new files created/updated in AWS S3 buckets using the "Filter By Last Modified" property of the Copy Data activity. Users can specify a start and end date-time to fetch the incremental data.

Note: The "Filter By Last Modified" feature of the Copy Data activity works efficiently only when the number of files to be listed is less than 100,000. To mitigate this limitation, users can capture the names of the new/modified files in a text file and use the selective copy method.
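
A minimal sketch of that mitigation, again with placeholder names: list the bucket with boto3, keep only the keys whose LastModified timestamp falls inside the delta window, and write them to a text file that the selective copy can consume.

```python
# Sketch: capture keys modified inside a delta window into a file list.
# Bucket name and window boundaries are illustrative placeholders.
from datetime import datetime, timezone
import boto3

BUCKET = "example-source-bucket"  # hypothetical bucket
WINDOW_START = datetime(2021, 6, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2021, 6, 2, tzinfo=timezone.utc)

s3 = boto3.client("s3")
with open("delta_filelist.txt", "w") as manifest:
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            # LastModified is a timezone-aware UTC datetime.
            if WINDOW_START <= obj["LastModified"] < WINDOW_END:
                manifest.write(obj["Key"] + "\n")
```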

Time for Setup 
Since ADF is a fully managed PaaS service, an ADF instance can be created within a minute with the click of a button, or by using an ARM template, the Azure CLI, or a PowerShell command.
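
As an example of scripting this, the sketch below uses the azure-mgmt-datafactory Python SDK, following the pattern from the ADF Python quickstart; the subscription, resource group, factory name, and region are placeholders.

```python
# Sketch: create an ADF instance programmatically (placeholder names throughout).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
rg_name = "migration-rg"                # hypothetical resource group
df_name = "s3-to-blob-adf"              # hypothetical factory name

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
factory = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))
print(factory.name, factory.provisioning_state)
```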

Monitoring and Alerting  
ADF provides out-of-the-box monitoring capabilities to monitor copy pipeline runs and set alerts in case of failures. Detailed analysis can be performed by looking at parameters such as data read, files read, data written, files written, copy duration, throughput, DIUs used, etc.

For more details, refer to Monitor data factories using Azure Monitor — Azure Data Factory | Microsoft Docs.

Pipeline monitoring
Detailed monitoring of copy
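
Beyond the portal views, the same run metadata can be pulled programmatically, for example to feed a custom migration dashboard. A hedged sketch using the azure-mgmt-datafactory SDK, with the same placeholder resource group and factory names as above:

```python
# Sketch: query copy pipeline runs from the last 24 hours (placeholder names).
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
filters = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
)
runs = adf_client.pipeline_runs.query_by_factory("migration-rg", "s3-to-blob-adf", filters)
for run in runs.value:
    print(run.pipeline_name, run.status, run.duration_in_ms)
```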

Auditing 
Although ADF doesn't provide out-of-the-box auditing, since auditing is application/use-case dependent, it is easy to implement. Auditing can be performed for all files, whether the copy is successful or not, and the resulting audit tables can also be referred to for future delta loads. Some of the key fields that can be audited are item name, data read, data written, rows copied, copy duration, load date-time, and a status flag (success or failure).

Auditing
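
The audit record itself can be as simple as one row per copied item. Below is a minimal sketch of such a record and an append helper; the field names mirror the list above, while the dataclass, table name, and SQLite store are purely illustrative choices.

```python
# Sketch: append one audit row per copied item (illustrative schema and SQLite store).
import sqlite3
from dataclasses import dataclass, astuple

@dataclass
class CopyAuditRecord:
    item_name: str
    data_read_bytes: int
    data_written_bytes: int
    rows_copied: int
    copy_duration_secs: int
    load_datetime: str       # e.g. an ISO-8601 string
    status_flag: str         # "SUCCESS" or "FAILURE"

def append_audit(db_path: str, record: CopyAuditRecord) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS copy_audit (
                   item_name TEXT, data_read_bytes INTEGER, data_written_bytes INTEGER,
                   rows_copied INTEGER, copy_duration_secs INTEGER,
                   load_datetime TEXT, status_flag TEXT)"""
        )
        conn.execute("INSERT INTO copy_audit VALUES (?, ?, ?, ?, ?, ?, ?)", astuple(record))
```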

Reliability 
Reliability and building a robust pipeline is a key requirement for any data migration project. Here are some of the key ADF capabilities that we utilized. 

Resumability
ADF has a resume capability with which you can build robust pipelines for many scenarios. With this capability, if one of the activities fails, you can rerun the pipeline from that failed activity. When moving data via the copy activity, you can resume the copy from the last failure point at the file level instead of starting from the beginning, which greatly increases the resilience of your data movement solution, especially when moving large files between file-based stores.

When you copy data from Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, or Google Cloud Storage, the copy activity can resume after an arbitrary number of files have already been copied.

Rerun from failed activity

Data Consistency verification
When the "Data consistency verification" option is selected, the copy activity performs an additional data consistency verification between the source and destination stores after the data movement. The verification includes a file size check and checksum verification for binary files, and a row count verification for tabular data.

There are two options to handle inconsistencies: abort on failure, or ignore and continue. With the latter, the copy activity continues copying the rest of the data by skipping the inconsistent objects, and it logs them if you also enable logging in the copy activity.

Data consistency verification
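
Independently of the built-in check, a lightweight spot check can compare object sizes between S3 and Blob after the copy, for example on a random sample of files. A sketch with placeholder bucket, storage account, and container names (a checksum comparison would additionally require Content-MD5 to be populated on both sides):

```python
# Sketch: spot-check that a copied object's size matches between S3 and Blob.
# Bucket, storage account, and container names are placeholders.
import boto3
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

def sizes_match(key: str) -> bool:
    s3_size = boto3.client("s3").head_object(
        Bucket="example-source-bucket", Key=key)["ContentLength"]
    blob = BlobClient(
        account_url="https://examplestorage.blob.core.windows.net",
        container_name="migrated-data",
        blob_name=key,
        credential=DefaultAzureCredential(),
    )
    return blob.get_blob_properties().size == s3_size
```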

Fault tolerance settings
By selecting the fault tolerance settings, one can ignore certain errors that occur in the middle of the copy process, for instance incompatible rows between source and destination stores, or a file being deleted during data movement.

Security
It is important to store the credentials in Azure Key Vault so that they are hidden from the data engineers.

Azure Key Vault
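
In ADF itself this is configured on the linked service, which can reference Key Vault secrets directly. For any supporting scripts around the migration, the same principle applies; a minimal sketch using the azure-keyvault-secrets SDK, with a placeholder vault URL and secret name:

```python
# Sketch: fetch a connection secret from Key Vault instead of hard-coding it.
# Vault URL and secret name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://example-migration-kv.vault.azure.net",
    credential=DefaultAzureCredential(),
)
aws_secret_key = client.get_secret("aws-secret-access-key").value
```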

Preserve metadata along with data 
While copying data from source to sink in scenarios like data lake migration, you can also choose to preserve the metadata and ACLs along with the data using the copy activity.

Low Cost 
Since ADF provides many of these features out of the box, it is less costly than custom scripts/solutions that need to be developed from scratch. The total cost of ownership with ADF is much lower compared to a custom solution. Please refer to this document for more details.

Additional references 
Choosing a data transfer technology — Azure Architecture Center | Microsoft Docs 
Monitor data factories using Azure Monitor — Azure Data Factory | Microsoft Docs 
Visually monitor Azure Data Factory — Azure Data Factory | Microsoft Docs