I was set an interesting challenge by a customer: copy the data in their Production subscription's Azure File Shares into their Development subscription's Azure File Shares. The aim was to keep the Development environment in line with any uploads to Production, enabling testing to be performed on ‘live’ data.
The customer wanted something that was easy to manage and that provided visibility of data movement tasks within the Azure Portal, without needing to manage and maintain PowerShell scripts.
The answer to this was Azure Data Factory.
What Is Azure Data Factory?
Azure Data Factory is a managed data integration service that enables data-driven workflows, either between on-premises and public cloud or within public clouds.
A pipeline is a logical grouping of activities that together perform a task. The activities within the pipeline define actions to perform on data.
Data Factory supports three types of activities: data movement activities, data transformation activities and control activities. In this use case, data movement activities will be used to copy data from the source data store to the destination data sink.
Linked services are used to link data stores to the Azure Data Factory: the ‘dataset’ represents the structure of the data, and the linked service defines the connection to the external data source. The diagram below provides a logical overview of this.
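To make the relationship between pipeline, activity, dataset and linked service more concrete, here is a minimal Python sketch that builds the kind of JSON Data Factory works with. All of the names (`ds_source_data`, `ls_source_data` and so on) are hypothetical, the property layout is my assumption of the JSON format rather than the customer's actual definitions, and the copy activity's source and sink settings are left out for brevity.

```python
import json


def build_dataset(name: str, linked_service: str) -> dict:
    """A dataset describes the data and points at a linked service (assumed shape)."""
    return {
        "name": name,
        "properties": {
            "type": "Binary",
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
        },
    }


def build_copy_pipeline(name: str, source_dataset: str, sink_dataset: str) -> dict:
    """A pipeline groups activities; here it holds a single copy activity."""
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": f"Copy_{source_dataset}_to_{sink_dataset}",
                    "type": "Copy",
                    "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
                    "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
                }
            ]
        },
    }


# Hypothetical names, for illustration only.
source_ds = build_dataset("ds_source_data", "ls_source_data")
sink_ds = build_dataset("ds_sink_data", "ls_sink_data")
pipeline = build_copy_pipeline("pl_copy_data", source_ds["name"], sink_ds["name"])

print(json.dumps(pipeline, indent=2))
```

The point of the sketch is the chain of references: the activity names its input and output datasets, and each dataset names the linked service that holds the connection.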
For copy activities, an integration runtime is required: it provides the compute that moves data between the source and sink linked services, which in turn define the direction of data flow. To ensure data locality, a custom integration runtime will be used within West Europe.
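As a rough illustration, an integration runtime pinned to a region might be described as below. This is a sketch rather than the configuration used here: the runtime name `ir-westeurope` is hypothetical, and the property layout is an assumption about Data Factory's JSON format, so check it against the current documentation before relying on it.

```python
import json

# Hypothetical name; the runtime's real name is not stated in this post.
IR_NAME = "ir-westeurope"


def build_integration_runtime(name: str, location: str) -> dict:
    """Return an assumed JSON shape for a region-pinned Azure integration runtime."""
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {"location": location}
            },
        },
    }


integration_runtime = build_integration_runtime(IR_NAME, "West Europe")
print(json.dumps(integration_runtime, indent=2))
```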
Each file share within the vmfwepsts001 Storage Account is an individual linked service. Therefore, four source linked services will be defined for data, documents, images and videos.
Each destination file share within the vmfwedsts001 Storage Account is an individual linked service. Therefore, four destination linked services will be defined for data, documents, images and videos.
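Because the eight linked services differ only by storage account and share name, they lend themselves to being generated rather than typed out. The sketch below shows one way that might look. The storage account and share names come from this post, but the `ls_<account>_<share>` naming pattern is hypothetical, the connection string is a placeholder, and the `AzureFileStorage` property layout is my assumption of the JSON format.

```python
SHARES = ["data", "documents", "images", "videos"]
SOURCE_ACCOUNT = "vmfwepsts001"       # Production storage account
DESTINATION_ACCOUNT = "vmfwedsts001"  # Development storage account


def build_linked_service(account: str, share: str) -> dict:
    """Return an assumed linked service definition for one file share."""
    return {
        # Hypothetical naming pattern, for illustration only.
        "name": f"ls_{account}_{share}",
        "properties": {
            "type": "AzureFileStorage",
            "typeProperties": {
                # Placeholder only; never hard-code real credentials.
                "connectionString": f"<connection string for {account}>",
                "fileShare": share,
            },
        },
    }


source_linked_services = [build_linked_service(SOURCE_ACCOUNT, s) for s in SHARES]
destination_linked_services = [build_linked_service(DESTINATION_ACCOUNT, s) for s in SHARES]

for linked_service in source_linked_services + destination_linked_services:
    print(linked_service["name"])
```

Running this prints four source names followed by four destination names, one per file share.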
Copy behaviour to the sink data store can be set to one of three methods:
- Preserve Hierarchy: the relative path of the source file to the source folder is identical to the relative path of the target file to the target folder
- Flatten Hierarchy: all files from the source folder are placed in the first level of the target folder, and the target files have auto-generated names
- Merge Files: all files from the source folder are merged into one file, using an auto-generated name
To maintain the file and folder structure, the Preserve Hierarchy copy behaviour will be used.
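To see why Preserve Hierarchy is the right fit, it can help to simulate what each behaviour does to a handful of file paths. The following is a plain Python illustration and does not call Data Factory at all. The sample paths and the `autogen_0001` style names are hypothetical; Data Factory generates its own names. As far as I am aware, the three options map to the `copyBehavior` values `PreserveHierarchy`, `FlattenHierarchy` and `MergeFiles` on the copy activity sink.

```python
def preserve_hierarchy(paths):
    """Each file keeps its path relative to the source folder."""
    return {path: path for path in paths}


def flatten_hierarchy(paths):
    """Every file lands in the first level of the target folder.

    The naming scheme is hypothetical; Data Factory generates its own names.
    """
    return {path: f"autogen_{i:04d}" for i, path in enumerate(paths, start=1)}


def merge_files(paths):
    """All source files are combined into a single target file (hypothetical name)."""
    return {path: "autogen_merged" for path in paths}


# Hypothetical sample of relative paths within a source file share.
source_paths = ["reports/2019/q1.csv", "reports/2019/q2.csv", "logo.png"]

for behaviour in (preserve_hierarchy, flatten_hierarchy, merge_files):
    print(behaviour.__name__)
    for source, target in behaviour(source_paths).items():
        print(f"  {source} -> {target}")
```

Only the first mapping leaves the folder structure intact, which is what a like-for-like copy between Production and Development needs.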
Tune in for the next blog post when we will cover the configuration settings.