You have data held in data stores that are serving applications and now you want other applications to use that data. You have decided that:
· You do not want the other applications to access the source data.
· You want to provide these other applications with a redundant copy of the data. In other words, you want to move a copy of the data to the other applications.
The structure of the data required by the other applications may be exactly the same as that of the existing data, or it may be completely different. You have not yet decided which data movement approach to use: replication or Extract-Transform-Load (ETL).
Note: The term data store refers to a collection of data that is managed by a database management system (DBMS) or is held in a file system.
What proven architectural approach should you follow to design the data movement services?
Any of the following compelling forces would justify using the solution described in this pattern:
· Data availability no longer matches requirements. For example, your existing centralized data stores were designed to support regular business hours of 08:00 to 18:00, and these data stores must be taken offline for after-hours maintenance and reporting. Your other applications, however, must support customer-direct self-service, which requires 24-hour availability. In another example, you are writing applications to be installed on laptops for a mobile field force that requires the data to be available while working offline. This requires copying the data to the laptops and synchronizing changes later when the laptops reconnect to the network.
· Network or application platform is unreliable. For example, your network fails frequently or is shut down for significant periods of time so that your new applications cannot deliver the required levels of service.
· Your other applications require differently structured data. The existing data store uses a structure that is suitable for the existing applications. If the other applications, however, require the data to be stored in a different structure, you may have to store the data redundantly in both structures.
· Network bandwidth does not support real-time data access performance requirements. In this case you may need to avoid the real-time problem by making a local data copy available.
Hint: This force can lead to disaster if you misjudge it. Your early requirements, benchmarks, or prototyping might lead you to believe that the bandwidth is acceptable. However, a new or rapidly growing application can degrade performance quickly. The redundant data approach can minimize the effects of these changes. This approach carries its own risk, though, if the volume of data to be copied increases significantly, if latency requirements change, or if network bandwidth drops at the time you need to copy the data.
The following enabling forces facilitate the adoption of the solution, and their absence may hinder such a move:
· Latency tolerance. The other applications must be able to tolerate the wait that is associated with moving the data.
· Data changes are generally non-conflicting. Often the business use of the data makes it relatively easy to isolate changes to the original data and its copies. For example, if you are providing a new application on a laptop for a client manager to use when making customer calls, the manager may update client data during the call. It is highly unlikely that the customer will call the company and request changes to an existing copy of the data during the same time period.
· Other applications require only read access or do not require updates to the target to persist. In these circumstances, the process of providing these applications with a copy of data to use locally can be much simpler, and hence easier to implement. Do not assume, however, that because providing a copy is simpler in these circumstances, it is the best solution.
Create a basic architectural building block and use it alone, or in combination with other such blocks to assemble a solution of greater complexity. The basic architectural building block is called a data movement building block.
The data movement building block consists of the following items:
· A movement set in a source data store
· A data movement link that provides a path from source to target and contains the Acquire, Manipulate, and Write services
· A target data store
Figure 1 illustrates a data movement building block.
Figure 1: Data movement building block
In the figure, the arrows represent the directional flow of the movement set. This does not mean that these are the only data actions. For example, in other data patterns, the Write service gets data from the target.
The source is a data store that contains a set of data to be copied. One data store can be the source for many data movement building blocks. The source data store can also serve as the target for another data movement building block. For example, in the Master-Master Replication pattern, the same pair of data stores swap roles (source becomes target, and target becomes source) for a common movement set that can be updated in either data store.
A movement set is an identified subset of data that is copied from a single source and sent across a data movement link to one or more targets. In the course of the copy operation, the movement set may change its content and form as it is acquired, manipulated, and written. For example, if you want to copy data from a server to laptops for salespersons to use in making daily calls, each person needs a movement set containing the details of the clients that they are going to call on that day. Thus a movement set is the name for the subject of a data copy operation at all stages of that operation.
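To make the idea concrete, here is a minimal sketch in Python of identifying a movement set. The table and field names (`SOURCE_CLIENTS`, `salesperson`) are illustrative assumptions, not part of the pattern:

```python
# Hypothetical client table held in the source data store.
SOURCE_CLIENTS = [
    {"client_id": 1, "name": "Ada", "salesperson": "kim"},
    {"client_id": 2, "name": "Bo",  "salesperson": "lee"},
    {"client_id": 3, "name": "Cy",  "salesperson": "kim"},
]

def movement_set(source, salesperson):
    """Identify the subset of source rows that forms one movement set:
    here, the clients a given salesperson will call on today."""
    return [row for row in source if row["salesperson"] == salesperson]

kim_set = movement_set(SOURCE_CLIENTS, "kim")
```

The selection predicate is the essential point: the movement set is defined by the target application's needs, not by the physical layout of the source.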
Variant: Composite Movement Set
An application may require redundant data from more than one existing database. A composite movement set comprises all the data you intend to replicate for any particular application. The application's requirements are what binds this grouping of data together and gives it a purpose.
Figure 2 illustrates a composite movement set composed of data from two sources.
Figure 2: Composite movement set
As the figure shows, a composite movement set is a collection of one or more movement sets. For example, the application on your salesperson's laptop needs client data, which might all come from Source 1. The application, however, also needs the contract data related to the clients from Source 2. Each movement set is part of a data movement building block. Both movement sets must be acquired together to select those contracts that are related to the clients.
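A sketch of acquiring a composite movement set, under the same illustrative schema as above (the source names and keys are assumptions). The two sets must be acquired together so that the contract set can be restricted to contracts related to the acquired clients:

```python
# Illustrative data from two independent sources.
SOURCE1_CLIENTS = [{"client_id": 1}, {"client_id": 3}]
SOURCE2_CONTRACTS = [
    {"contract_id": 10, "client_id": 1},
    {"contract_id": 11, "client_id": 2},
    {"contract_id": 12, "client_id": 3},
]

def composite_movement_set(clients, contracts):
    """Acquire both movement sets together so the contract set can be
    restricted to contracts related to the acquired clients."""
    client_ids = {c["client_id"] for c in clients}
    related = [k for k in contracts if k["client_id"] in client_ids]
    return {"clients": clients, "contracts": related}

composite = composite_movement_set(SOURCE1_CLIENTS, SOURCE2_CONTRACTS)
```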
Data Movement Link
A data movement link is a connection between the source and target along which the relevant movement set can be moved from one data store to another with appropriate security. Moving this data across the link is called a transmission.
The data movement link includes:
· The method of data transmission at each step that moves data (which includes any intermediary transient data stores). For example, the transmission method might be shared data storage, FTP, or a secure electronic link with managed row transmission.
· The Acquire, Manipulate, and Write data movement services.
The Acquire service gets the movement set from the source data store. Acquisition may be a simple one-step process, or it may be a multi-step process, for example if the movement set is in several tables in the data store.
Acquire may enrich the data by adding details, such as the time the data was acquired, to allow for management of the overall data integrity.
Acquire can obtain the movement sets from the data store rows directly, or it may acquire them from data caches where only data changes are stored. Typically these are either DBMS log record stores or user-written caching data stores. In this case, these stores should be considered the sources.
When acquiring data from these stores, Acquire must either collect all transactional changes or collect the net change, which is the final result of all changes that have occurred to this row since the last transmission.
Hint: If Acquire collects all transactional changes, the ordering of the changes is vital so that the Write service can follow the correct change sequence. This correct order can be difficult to establish across a composite movement set being acquired through multiple data movement building blocks. You may decide to acquire the composite movement set from one Acquire service so you can order the set. Even then you may have problems with time-clock inconsistencies across platforms. On the other hand, if Acquire collects the net change instead, you have to define rules to resolve the conflicts that arise (for an example, see the Master-Master Row-Level Synchronization pattern).
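The net-change option can be sketched as follows. The change-log structure is an assumption; the key point is that changes must arrive in commit order, and only the final state of each row since the last transmission is kept:

```python
def net_change(changes):
    """Collapse an ordered list of transactional changes, keyed by row,
    into the net change: the final state since the last transmission."""
    latest = {}
    for change in changes:   # `changes` must be in commit order
        latest[change["key"]] = change
    return list(latest.values())

# Illustrative change log: two updates to r1 and one insert of r2.
LOG = [
    {"key": "r1", "op": "update", "value": 10},
    {"key": "r2", "op": "insert", "value": 5},
    {"key": "r1", "op": "update", "value": 12},
]
collapsed = net_change(LOG)
```

Collecting all transactional changes instead would mean transmitting `LOG` as-is, which preserves ordering but moves more data and requires the Write service to replay the sequence faithfully.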
The Manipulate service changes the form of the data in some way and passes it on in a format that can easily be written to the target. Manipulate can vary in complexity from a null event (where it does nothing to change the data) to very radical data alterations. More detailed architectural patterns discuss this topic.
The Write service writes the data that Manipulate prepared to the target. If Write finds that the target has changed the data since the last data movement, then Write must behave as the attributes on the data movement link dictate: it must either force the new data over the target data, write the new data somewhere else and raise an error, or take some other conflict resolution action. These issues are discussed in the replication design patterns.
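The conflict-handling choice can be sketched as a per-row Write. The structure (a dictionary target, a record of values last transmitted, and the policy names `force` and `divert`) is an assumption made for illustration:

```python
def write_row(target, conflicts, incoming, last_sent, policy="force"):
    """Write one manipulated row to the target, honoring the data
    movement link's conflict attribute. `last_sent` holds the values
    this link transmitted previously; a target value differing from
    it signals that the target changed the row in the meantime."""
    key, value = incoming
    current = target.get(key)
    if current is not None and current != last_sent.get(key):
        if policy == "force":
            target[key] = value          # overwrite the target's change
        else:                            # "divert": write elsewhere, flag error
            conflicts.append(incoming)
        return "conflict"
    target[key] = value
    return "written"
```

For example, if the target changed row `r1` after the last transmission, writing `r1` again reports a conflict, while a previously unseen row is written normally.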
In a data movement building block, the target is the data store where the acquired and manipulated data is written. As noted earlier, sometimes the target can be several data stores.
If the data that you move to the target can be updated by applications there, and if the changes must be reflected back into the source, you should have a second data movement link returning so that the roles of source and target are exchanged on this link. This relationship must be explicit because of the data integrity issues; it is a data movement link attribute and is described in the Master-Master Replication pattern.
This pattern results in the following benefits and liabilities:
· Target data store optimization. The target data stores can be configured and optimized in different ways to best support their different types of data access. For example, one target might be optimized for manipulating individual rows, while another might be optimized for reports that scan many records.
· Data autonomy. When the data stores are relatively independent of each other and can have different owners, the content of the source data store can be provided as a product, and it is then the responsibility of the target owner to operate on its data.
· Data privacy. By restricting the movement set to an agreed subset of the source data store, Move Copy of Data can provide only data that the application (or users) at the target may see.
· Security. Source data stores are not accessed by the target applications and hence are more secure.
· Administration complexity. This pattern may introduce additional operational tasks for data store administrators. For example, the ongoing transmissions have to be checked to ensure that they are running smoothly. Also, administrators must monitor the resources involved, such as the growth of cached changes, log files, and so on.
· Potential increased overhead on the sources. Every acquisition of data places a certain overhead on the source. It is important to properly plan the additional load caused by extracting snapshots or by logging transactions that will be replicated. The additional load has to be compared to the load that would occur if all applications were connected to a single data store. You can use this pattern to optimize the operational load.
· Potential security exposure. The target data stores must not allow access to source data that the source would not permit. This is another administration challenge.
After you have decided to implement the data movement solution, the next challenge is to decide between the Data Replication and Extract-Transform-Load (ETL) patterns. The distinguishing criterion is the complexity of the data movement link, which essentially translates to one of the following options:
· ETL is appropriate if Acquire and Manipulate are complex, but Write is relatively simple. An ETL process can handle complex acquisition, such as merging data from heterogeneous sources. ETL also allows for complex manipulations, such as cleansing of the acquired data or aggregations.
· Replication is appropriate if Acquire and Manipulate are simple, and Write is either simple or complex only because of conflict detection and resolution. A replication process generally reads a single source, and the manipulations are restricted to calculations on the current record, such as data type conversions or concatenating and splitting strings. Write can detect changes in the target that have occurred since the last transmission and resolve any resulting conflicts by defined rules.
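The limits of a replication-grade Manipulate can be illustrated with a per-record transformation. The field names are assumptions; the point is that each output row depends only on the current input row:

```python
def manipulate(row):
    """A replication-grade Manipulate: restricted to calculations on
    the current record, such as type conversion and concatenation.
    Anything requiring other rows or other sources (merging, cleansing,
    aggregation) pushes the link toward ETL instead."""
    return {
        "id": int(row["id"]),                           # type conversion
        "full_name": row["first"] + " " + row["last"],  # concatenation
    }

result = manipulate({"id": "7", "first": "Ann", "last": "Lee"})
```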
The following examples illustrate how to use the data movement building block to solve common data movement problems of differing complexity. Some of these examples reappear in later patterns in this cluster.
Simple Data Movement for Reporting Purposes
The simplest use of this pattern moves data to a data mart or warehouse when the schema of the mart or warehouse is very similar to the counterparts in the operational data store. In this example, you need to build a new system that provides online transaction processing (OLTP) transactions and summarized reports based on the information of the previous day. The summary reports are not updatable; they are management reports and are not used for what-if analysis. You do not want the platform that hosts the operational data store to bear the additional load of the reporting and the additional complexity for accessing the previous day's data.
The solution is to implement a data movement link with target overwrite between the operational source data store and a reporting target data store as shown in Figure 3. (In this data movement, the applications on the target data store are either read-only or any updates to the movement set are not to be moved back to the source data store.)
Figure 3: Simple data movement from an operational data store to an informational data store
The operational data store remains available for the ongoing transactions. Every night a snapshot from the operational data store is taken and transferred to the reporting data store. Because all elements of the data movement link are simple, the implementation can follow the Data Replication pattern.
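The nightly transmission described above can be sketched in a few lines. The list-of-rows representation is an assumption; what matters is the shape of the link: Acquire takes a full snapshot, Manipulate is a null event, and Write performs a target overwrite:

```python
def nightly_snapshot(source, target):
    """Simple data movement link with target overwrite: Acquire takes
    a full snapshot, Manipulate is a null event, and Write replaces
    the target's contents entirely."""
    snapshot = [dict(row) for row in source]  # Acquire: copy rows, don't share them
    target.clear()                            # Write: target overwrite
    target.extend(snapshot)

operational = [{"order_id": 1, "total": 40}]
reporting = [{"order_id": 99, "total": 0}]   # yesterday's stale copy
nightly_snapshot(operational, reporting)
```

Because the snapshot copies rows rather than sharing them, later changes in the operational store do not leak into the reporting store until the next transmission.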
Complex Data Movement for Reporting Purposes
Frequently, the mart and warehouse schemas are very different, or the manipulation is very complex.
Suppose you have three source data stores, two of which are independent databases and one of which is a flat file. You plan to merge the contents of the data stores, which have partially overlapping information. Analysis shows that the data acquired from the different data stores contains some contradictions. Thus, you must do some data cleansing in the movement process. In addition, the target does not present the information on the same detailed level as the source, but it does aggregate the raw data and write these summaries to the target only.
The solution is to apply the ETL pattern because the Write is still simple. Figure 4 shows a sketch of the solution with the complex parts highlighted.
Figure 4: Complex data movement from several data stores to an informational data store
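The ETL shape of this link can be sketched as follows. The three-source layout matches the scenario above, but the cleansing rule (on contradictions, the first database wins) and the aggregation (a row count and a total) are assumptions chosen for illustration:

```python
def etl(db_a, db_b, flat_file_rows):
    """ETL-grade link: complex Acquire (merge three heterogeneous
    sources), complex Manipulate (cleansing plus aggregation), and a
    simple Write (a single summary record)."""
    merged = {}
    # Cleansing rule (assumed): apply db_a last so its values win
    # any contradiction between overlapping sources.
    for source in (flat_file_rows, db_b, db_a):
        for row in source:
            merged[row["key"]] = row["amount"]
    # Aggregate: write only a summary to the target, not row detail.
    return {"rows": len(merged), "total": sum(merged.values())}

summary = etl(
    db_a=[{"key": "x", "amount": 5}],
    db_b=[{"key": "x", "amount": 9}, {"key": "y", "amount": 1}],
    flat_file_rows=[{"key": "z", "amount": 2}],
)
```

Here the two databases contradict each other on row `x`; the cleansing rule resolves it in favor of `db_a` before aggregation.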
Master-Master Data Movement
In a master-master data movement, any changes that the target makes to the copied data are sent back to the source so that the source can stay synchronized with the target. Figure 5 illustrates this type of data movement.
Figure 5: Master-master data movement
This particular source-target relationship is two-way, and this is implemented by a pair of related data movement links. Write must include logic for conflict detection and conflict resolution. That is, it must check to see if the data has changed since the last transmission. If so, any conflicts must be resolved according to defined rules.
The solution is to apply Data Replication because Acquire and Manipulate are simple, but Write is complex. Then use the Master-Master Replication pattern, which deals with the conflict detection and resolution issues.
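One defined resolution rule can be sketched per row. Last-writer-wins, keyed on an update timestamp, is an assumed rule chosen for illustration; the Master-Master Replication pattern discusses the broader design space:

```python
def synchronize(row_a, row_b):
    """Master-master Write with a defined resolution rule: when both
    masters changed the same row, the version with the later update
    timestamp wins (last-writer-wins, an assumed rule)."""
    winner = row_a if row_a["updated_at"] >= row_b["updated_at"] else row_b
    return dict(winner)

resolved = synchronize(
    {"updated_at": 2, "phone": "555-0100"},
    {"updated_at": 5, "phone": "555-0199"},
)
```

Note that last-writer-wins silently discards the losing update, so it is only acceptable when, as discussed under the enabling forces, data changes are generally non-conflicting.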
For more information, see the following related patterns:
Patterns That May Have Led You Here
· Maintain Data Copies. This pattern may have led you to Move Copy of Data, based on your requirements and the complexity of your environment.
Patterns That You Can Use Next
· Data Replication. As mentioned in "Resulting Context," Move Copy of Data leads naturally to Data Replication, depending on the level of complexity of the data movement link. Data Replication presents the architecture of a data movement, where Acquire and Manipulate are relatively simple, but Write might be complex.
· Extract-Transform-Load (ETL). As mentioned in "Resulting Context," Move Copy of Data leads naturally to ETL, depending on the level of complexity of the data movement link. ETL describes the architecture of a data movement, where Acquire and Manipulate may be complex, but Write is always simple.
Other Patterns of Interest
· Publisher-Subscriber. The data movement building block is an instance of the more general Publisher-Subscriber pattern, where a publisher offers a content publication service and subscribers subscribe to all or parts of the publication service.