...
Below is a list of the most critical factors that influence transfer performance:
1) Corpus Profile
...
For the purposes of this document, the collection of all documents and folders located in any given storage platform is known as the “Corpus”. The composition of the corpus can have a significant impact on transfer throughput. Particularly when at least one of the transfer endpoints employs an API for managing content (essentially any endpoint other than a standard server file system), the size and number of documents can have a dramatic effect on performance.
Given 100GB of data, if that 100GB consists of (10,240) 10MB files, transfer throughput will be considerably higher than if that data consisted of (209,715) 500KB files. This is because SkySync must make approximately 200,000 more API calls to transfer 100GB of 500KB files than to transfer 100GB of 10MB files.
So if the corpus is weighted towards many small files rather than relatively fewer large files, transfer throughput should generally be expected to be lower due to the latency expense of significantly more API calls.
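As a quick check on those figures, assuming one API call per file transferred:

$$\frac{100 \times 1024\ \text{MB}}{10\ \text{MB/file}} = 10{,}240 \text{ files}, \qquad \frac{100 \times 1024 \times 1024\ \text{KB}}{500\ \text{KB/file}} \approx 209{,}715 \text{ files},$$

a difference of roughly 199,000 additional calls for the same 100GB of content. In practice, most platforms require several calls per file (create, upload, set metadata), so the real gap is typically even wider.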
2) Source or Destination Rate Limiting
...
When at least one of the transfer endpoints employs an API for managing content, those systems will generally employ some form of API “rate limiting”. Essentially, most API-based endpoint platforms implement algorithms that throttle the number of API calls that can be made within a specified length of time.
This throttling is necessary to ensure that all tenant environments have an opportunity to experience the same level of performance. It also provides a mechanism to guard against Denial of Service (DoS) attacks that could potentially render all tenant environments inoperable.
Typically, when a highly scalable transfer application such as SkySync “trips” the API throttle mechanism, the API platform will send a rate limit message to the calling application. When this happens, SkySync has no choice but to “back off” and wait for an increasing amount of time until throttle messages are no longer encountered. This process of throttling and backing off can result in a significant decrease in transfer throughput. SkySync utilizes “Smart-Throttling” technology to pull and push data as fast as it possibly can while respecting these rate limit messages.
...
3)
...
Source Platform Read Performance
...
It is possible to saturate a source platform with requests such that adding additional job processing servers and threads no longer improves transfer performance. Also, when saturation occurs, end users are typically affected, so transfer performance must be balanced against the ability of end users to continue performing their job functions.
...
4) Number of Parallel Writes. The number of parallel writes configured in SkySync determines how hard SkySync hits the destination platform. If that platform is API-based, the number of parallel writes directly correlates to the likelihood of seeing rate limit messages. When the destination connector supports multiple-user connection pooling, this is much less of a factor and Parallel Writes can often be increased. If the destination connector does not support multiple-user connection pooling, the likelihood of seeing rate limit messages is much higher and the number of Parallel Writes must often be decreased in the SkySync configuration.
...
5) Number of Processing Threads. Each SkySync server can be configured for a specific number of processing threads. By default, SQL CE-based deployments allow up to (2) processing threads, while SQL Server Express and standard SQL Server-based deployments allow up to (6). In general, if rate limits are not being encountered and the SQL Server CPU, RAM, and disk I/O are not being taxed, the number of processing threads can be increased.
...
6) Number of SkySync Servers. More processing servers running more job processing threads typically result in higher overall transfer throughput and lower overall transfer duration, at the expense of additional hardware and administrative overhead. However, as mentioned above, there is a limit to this.
...
7) SQL Server Performance. SQL Server storage I/O, CPU, and RAM resources all affect the performance of the migration from end to end. During data transfers, SkySync audits source and destination file system objects, allowing it to manage copy and synchronization processes while respecting delete propagation and conflict resolution rules. However, these logging operations can impact transfer throughput when the SQL Server is not properly scaled for the amount of content being tracked in the database and the number of SkySync servers and threads processing work. For this reason, SQL disk I/O should be optimized for highly intense READ and WRITE operations (enterprise-class solid state storage subsystems are generally recommended for data volumes). This is also one of the reasons it is recommended that the SkySync database be configured for the “Simple” recovery model, which reduces log file I/O.
...
8) Network Performance. Network performance affects all aspects of the migration. For example, legacy document retrieval, exported binary storage, SQL Server call latency, import server binary retrieval, and uploading to Office 365 are all affected by network performance. SkySync provides network utilization controls to manage network saturation.
...
9) Network / Cloud Connection Performance. The bandwidth available to connect to the source and destination endpoints is a significant factor in transfer throughput. For on-premises to on-premises transfers, network latency is much less of a factor. However, when a cloud storage platform is in play as a source, destination, or both, internet bandwidth and latency become significant factors in transfer performance.
For cloud to cloud migrations, it is strongly recommended that Azure or AWS VMs be used to host SkySync. Beyond the significant potential benefits of data center co-location between SkySync processing and the source or destination endpoint, it simply does not make sense to pull data down from the cloud to an on-premises server and then push it back up to the cloud; this transfer path is very inefficient. This concept will be expanded upon in additional prescriptive guidance later in the document.
Ultimately, it is impossible to accurately predict transfer throughput performance until the end-to-end infrastructure is configured and testing is performed. Based on various transfer throughput metrics, it is possible to make small adjustments to configuration or major adjustments like adding hardware resources to improve the weakest link within the throughput performance chain. Initial phases of deployment and configuration will include a series of performance tuning configuration adjustments. Throughput metrics should not be gathered until performance tuning has been completed.
...
SkySync can scale from very small, easy to implement single server deployments to very large multi-server clustered deployments. For simple implementations with a few million documents or less where transfer performance isn’t of primary concern, SkySync can be implemented very easily. But for larger corpus transfer solutions or when throughput is of high value, it’s important to understand the correct architectural considerations.
...
Because there are significant differences in the deployment guidance based on the size and throughput requirements of various transfer solutions, this guidance will be broken down into (3) different architectural models. Actual transfer throughput will not be identified as part of this guidance due to the significant variables involved in achievable throughput as identified in Transfer Performance Factors above.
The models below provide general architectural examples that serve as a good starting point for solution design. SkySync can easily be scaled up where necessary and prudent.
Model A – Low Volume / Modest Throughput
This model represents deployments that include the transfer or synchronization of (3) million file and folder objects or less. Achievable throughput will be somewhat modest and generally isn’t the top priority for the solution. This will be an easy configuration with minimal planning requirements.
While this whitepaper can help, and deployments can be fully managed by typical administrative staff, SkySync recommends the Client Solutions Silver Launch Package to ensure solution success.
Model A SkySync Architecture Guidance
...
This is a single server model with no clustering configured
8+GB RAM, 60GB+ system drive, dual-core processor or better
Windows Server 2008 SP2 or newer, fully patched, with .NET Framework 4.5 or newer
SkySync installation generally consists of accepting the default answers which results in the creation and usage of a local SQL CE database
SQL CE supports a maximum database size of 4GB
If the 4GB SQL CE database size is reached, it is possible to convert the database to a full SQL Server database format
When SkySync is deployed to use SQL CE for the database, a maximum of (2) jobs may be run simultaneously
This model is intended for ease of implementation as opposed to high throughput and/or redundancy
Model B – Moderate Volume and/or Moderate Throughput
This model represents deployments that include the transfer or synchronization of (10) million file and folder objects or less at a moderate throughput. This model essentially represents the scale of what can be reasonably accomplished with a single server solution that is provisioned with better than standard resources. This solution will require some planning around database configuration and overall server specifications.
Again, this whitepaper will be useful in deploying this solution model, but SkySync strongly encourages all customers to take advantage of the Client Solutions Gold Launch Package to ensure solution success.
Model B SkySync Architecture Guidance
The following architecture configuration concepts apply to Model B:
This is a single server model with no clustering configured
32+GB RAM, 60GB+ system drive (solid state if possible), 4- or 8-core processor or better
Windows Server 2008 SP2 or newer, fully patched, with .NET Framework 4.5 or newer
Before SkySync installation, SQL Server Express or full SQL Server Standard/Enterprise should be deployed on the SkySync processing server
Database Planning and Tuning Concepts (below) should also be implemented
SQL Server Express supports a maximum database size of 10GB
SQL Server Standard/Enterprise maximum database size is not a factor for this model
When SkySync is deployed to use SQL Server Express/Standard/Enterprise for the database, a maximum of (6) jobs may be run simultaneously by default
SkySync Tuning Concepts (below) may be employed to increase throughput when processing server resources are available and rate limiting is not a factor
This model is intended for advanced, single server implementation as opposed to the simpler default install that uses SQL CE
Model C – High Volume and/or High Throughput
This model represents deployments that include the transfer or synchronization of more than (10) million file and folder objects and/or transfers or synchronizations that require a very high throughput. Achievable throughput will be significantly determined by available bandwidth, network latency, SQL Server I/O and other resources, number of processing servers/threads and the level of API rate limiting on either the source or destination.
This solution will require very careful planning around database configuration, SkySync cluster configuration, cluster specifications, and cluster location with respect to source and destination location. To be able to achieve the highest levels of throughput, SkySync Client Solutions will need to be engaged to assist with planning, deployment and cluster tuning processes.
Model C SkySync Architecture Guidance
Exact cluster sizing guidance can be provided by SkySync Client Solutions based on an assessment of customer requirements. For the purposes of this document, a (4) server cluster is specified. With proper configuration (and guidance from SkySync Client Solutions), an extreme scale SkySync cluster serviced by a single high performance SQL Server (or Availability Group) has been tested to support up to (10) high performance processing servers.
The following architecture configuration concepts apply to Model C:
This is a (3) processing server and (1) database server cluster model
SkySync Processing Servers (3): 16+GB RAM, 60GB+ system drive, 4 core processor or better
Microsoft SQL Server [Standard Scale] (1): 32+GB RAM, 60GB+ system drive, 200GB data drive, 8 core processor or better
Microsoft SQL Server [Extreme Scale] (1): 128+GB RAM, 60GB+ system drive, 300GB solid state data drive (high IOPS/low latency), 300GB solid state tempdb/snapshot drive (high IOPS/low latency) to support report processing. SQL Server I/O performance will be a significant driving factor in cluster transfer throughput and stability.
Microsoft SQL Server [Extreme Scale] additional guidance: The Performance Best Practices for SQL Server in Azure Virtual Machines whitepaper from Microsoft provides additional guidance for building a high-scale SQL Server instance in an Azure VM. Much of this guidance also applies to an on-premises high-scale SQL Server.
SkySync Processing Server OS: Windows Server 2008 SP2 or newer, fully patched, with .NET Framework 4.5 or newer
Before SkySync installation, SQL Server Standard/Enterprise should be deployed
In order to deploy SkySync in a clustered configuration, follow the Guide to Installation and Upgrade of Multiple SkySync Nodes.
Database Planning and Tuning Concepts (below) should also be implemented
SQL Server Standard/Enterprise maximum database size is not a factor for this model
When SkySync is deployed to use SQL Server Express/Standard/Enterprise for the database, a maximum of (6) jobs may be run simultaneously by default
SkySync Tuning Concepts (below) may be employed to increase throughput when processing server resources are available and rate limiting is not a factor
...
The database subsystem can have a significant positive or negative impact on performance and scalability of the SkySync solution. That said, not all SkySync solutions are required to be high scale implementations. The concepts in this section walk through database deployment and tuning guidance depending on the amount of content being transferred and the desired performance of the solution.
...
Database Versions
At the most basic level, SkySync can function with a simple SQL Server CE database that is deployed automatically by the SkySync installer application. SkySync also supports SQL Server Express/Standard/Enterprise versions 2012, 2014, and 2016. There are, however, significant differences among these versions that can impact a SkySync synchronization or migration configuration.
Microsoft SQL Server Compact Edition (CE)
Microsoft SQL CE is a small-footprint relational database that is useful for low-scale applications. It is a freely licensed database that works well with SkySync solutions that do not require high throughput levels and that will manage a lower number of file and folder objects (3 million or less). The 3 million figure is a rough guideline, and other factors apply as well. The primary limiting factor of SQL CE is its 4GB database file size limitation.
...
Microsoft SQL Server Express
Microsoft SQL Express is also free to download and use. It is a more robust database engine, essentially a “light” version of the full SQL Server Standard database engine with many of the same capabilities. SQL Express can manage many more file and folder objects in a SkySync solution (up to approximately 10 million). It does, however, have some key limitations that affect solution scalability. These limitations make SQL Express a reasonable candidate for a single-server SkySync solution when SQL Server licensing cost is an issue, but they limit its effectiveness as a database supporting a multi-node SkySync cluster. Those limitations include:
...
A maximum database size of 10GB (SQL Server 2008 R2 Express and higher)
No SQL Server Agent Service to handle job scheduling and automated tasks. Database maintenance and backups are generally a manual effort involving command line operations and Windows Scheduler.
Will consume a maximum of 1GB RAM even when more is available
Limited to 1 physical CPU. Note that it can consume up to 4 CPU cores available in a single physical CPU.
...
Microsoft SQL Server Standard
...
Microsoft SQL Server Enterprise
Microsoft SQL Enterprise is the gold standard SQL platform for a standard or extreme scale multi-node SkySync clustered solution. CPU and RAM limitations are bound only by operating system maximums. Database file size can be as large as 524PB (petabytes). Database backup compression and SQL snapshots are supported. AlwaysOn Availability Groups, online table index rebuilds, and full resource governance are also available.
SQL Enterprise will always be the best choice for extreme scale SkySync solutions, particularly when custom reporting is required, which is almost always the case in larger synchronization/migration solutions. The ability to run maintenance jobs that perform online index rebuilds without interrupting 24/7 transfer processing is key to maintaining throughput performance. The ability to query large processing tables using SQL snapshots also ensures that continuous transfer operations are not impacted by reporting workloads.
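As a concrete illustration of the snapshot technique, a minimal sketch is shown below. The snapshot name, file path, and the assumption that the SkySync database is named “SkySync” with a single data file whose logical name is SkySync_data1 are all hypothetical:

```sql
-- Create a point-in-time database snapshot that reporting queries can use,
-- keeping them off the live tables that transfer processing is updating.
-- (Database snapshots were historically an Enterprise-only feature.)
CREATE DATABASE SkySync_ReportSnapshot
ON (NAME = SkySync_data1, FILENAME = 'F:\Snapshots\SkySync_data1.ss')
AS SNAPSHOT OF SkySync;
```

Reporting packages then query SkySync_ReportSnapshot instead of the live database, and the snapshot can be dropped and re-created on whatever cadence the reports require.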
Disk I/O Performance
Disk I/O performance will have a direct impact on the number of SkySync servers and total processing threads that can be used to execute transfer jobs. Because more concurrent transfer jobs (usually) means higher total throughput, database resources, particularly disk I/O, must be carefully planned for any SkySync solution where performance is important.
The most important storage resource for SkySync transfer performance and scalability is the disk volume where the SkySync database data file is stored. The disk volume where the SQL Server TempDB is stored can also be very important if custom reporting packages will be processing against SkySync data tables or SkySync database snapshot tables.
“Standard” storage performance for a given organization’s disk subsystem will be sufficient for approximately one to four processing servers. Beyond that, or when a very large volume of file and folder objects (tens of millions) will be under transfer management, storage volume performance must be better than standard.
...
For these extreme scale SkySync solutions with (5) or more clustered processing servers, it is time to start thinking about Solid State Drive (SSD) class or Tier 1 class storage. Some organizations consider SSD-class performance to be Tier 0. The point is that very high IOPS support and very low latency become the most important factors in ensuring high sustained transfer throughput and overall solution stability throughout the SkySync processing cluster. The SkySync transfer engine can be massively parallelized, and given the amount of data logging that occurs during data transfer operations, this can put tremendous READ/WRITE pressure on the SQL Server database disk subsystem.
Guidance for disk performance is simple for extreme scale solutions: if throughput is the highest priority, the SkySync SQL Server should be provisioned with the best available disk the organization has access to. If the disk is strong enough, additional processing servers can be added to the cluster; if not, adding processing servers will actually make transfer performance or cluster stability worse. The key, then, becomes understanding whether the disk subsystem is “keeping up” with requests.
The easiest way to monitor the SQL disk I/O subsystem and ensure that it is servicing requests fast enough is to watch the Disk Queue Length available in Resource Monitor. To do this, launch the Windows Task Manager and navigate to the “Performance” tab, where basic system performance can be viewed first. Then click the “Open Resource Monitor” link at the bottom of the Task Manager window.
Once the Resource Monitor is open, navigate to the “Disk” tab and observe the disk queue length for the logical disk associated with the volume where the SkySync database and TempDB database data files are stored.
...
The general rule of thumb is that Disk Queue Length should be less than the total number of physical drives that comprise the volume. That can get complicated with certain storage arrays where the number of disks for a given volume is obfuscated or, in the case of an SSD volume, there may only be a single drive.
In general, the ideal Disk Queue Length is less than 1. Anything in the single-digit range during heavy processing is generally OK. If Disk Queue Length is in the low double-digit range, the disk is somewhat underpowered and is beginning to impact performance. But if Disk Queue Length continues to climb, or is in the hundreds or thousands, then the disk subsystem is certainly insufficient to handle incoming requests. In this situation, the disk subsystem will need to be upgraded to improve transfer throughput. If this is not possible, it is recommended to shut down one or more SkySync processing servers while monitoring Disk Queue Length until reasonable numbers are achieved and maximum transfer throughput is identified.
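For administrators who prefer to measure from inside SQL Server rather than from Resource Monitor, the I/O stall statistics exposed by sys.dm_io_virtual_file_stats offer a complementary view. The sketch below assumes the SkySync database is simply named “SkySync”; sustained average stalls in the tens of milliseconds (and climbing) point to the same disk pressure that a growing Disk Queue Length indicates:

```sql
-- Average I/O stall per read and per write for each data file of the
-- SkySync database and TempDB ('SkySync' is an assumed database name).
SELECT
    DB_NAME(vfs.database_id)                              AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id     = vfs.file_id
WHERE DB_NAME(vfs.database_id) IN (N'SkySync', N'tempdb')
ORDER BY avg_write_stall_ms DESC;
```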
...
As data and log files grow in SQL Server, by default, SQL Server will overwrite any data in the newly allocated segment with zeros during the growth (“extent”) operation. This is a security measure designed to ensure that data from old files that once consumed the same space on disk has no possibility of being read by SQL administrators. It is an edge-case scenario, but it is a minor security risk, so Instant File Initialization is disabled by default, which allows the “zero overwrite” to occur.
This becomes a relevant concept as SQL databases grow by substantial amounts (another topic addressed below). An extent operation of several hundred megabytes or even gigabytes can take many seconds or even minutes under the default condition where zeros must be written. During this time, the tables in the database are blocked. Blocked tables are very bad for SkySync’s highly tuned transfer scheduler engine, and this condition can cause unexpected behavior as job processing threads try to determine how to continue.
There are (2) ways to combat this condition. The best way to ensure fast extent operations is to implement Instant File Initialization. Another way, also recommended as a best practice, is to pre-create multiple data files that are pre-sized in anticipation of the amount of data that may potentially be stored in the database. This latter option will be addressed below in further detail.
The issue with the second option is that it can be difficult for inexperienced SkySync administrators and DBAs to determine just how big those pre-sized files should be, so the tendency is to over-allocate, which can waste precious high-performance disk space. The best solution is a combination of the two: pre-create and conservatively pre-size the SkySync and TempDB database data files to minimize extent operations, then enable Instant File Initialization to ensure that any extent operations that do occur happen quickly.
Steps to Enable Instant File Initialization
This topic is discussed in greater detail on blogs from leading SQL experts such as Kimberly Tripp and Brent Ozar, among others, but the general steps are included below:
1) Ensure that the SQL Server Service is running as a domain service account or at least a standard local service account (essentially not System Account or another “special” account).
2) Launch the Local Security Policy editor by typing and then executing “secpol.msc” after clicking the Start button.
3) Navigate to Local Policies → User Rights Assignment and then edit the “Perform Volume Maintenance Tasks” policy.
4) Add the SQL Server service account to this list and click OK.
5) Restart the SQL Server service.
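On SQL Server 2016 SP1 and later, the result can be verified from within SQL Server itself; a quick check:

```sql
-- 'Y' in instant_file_initialization_enabled means the engine can grow
-- data files without zero-initializing them first.
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;
```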
Pre-Creating and Pre-Sizing SQL Data Files
SQL Server experts suggest that multiple database files can have a meaningful impact on database performance. Paul Randal is one of the leading authorities on all things SQL Server, and his company, SQLSkills, maintains a website full of useful and reliable information regarding SQL Server. Among the many articles in his “In Recovery…” blog is one that highlights the benefits of properly scaling out database data files and draws attention to the potential performance pitfalls of not doing it correctly.
When a database is first created in SQL Server through the user interface, the database will be created with a single “mdf” data file by default. It is possible to create this database with the primary “mdf” data file and then multiple “ndf” data files as well. This gives the database an opportunity to store content across multiple data files.
Configuring SQL data files in this way allows for improved data access performance because it allows SQL Server to multi-thread disk I/O operations. This is particularly helpful when the administrator has the freedom to store each of the data files on a unique, high performance disk volume. This technique can be used to enhance the ability of the SkySync solution to scale out with many processing servers.
The general rule of thumb is that the number of data files should be ¼ to ½ the number of physical cores available to SQL Server. For an 8-core SQL Server, deploying (2) to (4) data files is ideal; for a 16-core SQL Server, (4) to (8) files would be good. With fewer cores, trend towards the “½” end of the range; with 16 cores or more, trending towards the “¼” end is typical.
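To make the pre-creation and pre-sizing concrete, below is a minimal sketch for an 8-core SQL Server using four data files. The database name, logical file names, paths, sizes, and growth increments are all illustrative assumptions and should be sized against the expected volume of tracked content:

```sql
-- Pre-create the SkySync database with one .mdf and three .ndf data files,
-- pre-sized and spread across volumes to reduce file-growth ("extent")
-- operations and allow SQL Server to parallelize disk I/O.
CREATE DATABASE SkySync
ON PRIMARY
    (NAME = SkySync_data1, FILENAME = 'E:\SQLData\SkySync_data1.mdf',
     SIZE = 10GB, FILEGROWTH = 1GB),
    (NAME = SkySync_data2, FILENAME = 'E:\SQLData\SkySync_data2.ndf',
     SIZE = 10GB, FILEGROWTH = 1GB),
    (NAME = SkySync_data3, FILENAME = 'F:\SQLData\SkySync_data3.ndf',
     SIZE = 10GB, FILEGROWTH = 1GB),
    (NAME = SkySync_data4, FILENAME = 'F:\SQLData\SkySync_data4.ndf',
     SIZE = 10GB, FILEGROWTH = 1GB)
LOG ON
    (NAME = SkySync_log, FILENAME = 'G:\SQLLogs\SkySync_log.ldf',
     SIZE = 4GB, FILEGROWTH = 1GB);
```

Placing the files on separate high-performance volumes (E:, F:, G: here) is what allows the multi-threaded I/O benefit described above.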
SkySync Database Maintenance
Given the highly transactional nature of any migration platform, database maintenance plans become very important. Without proper maintenance, indexes can quickly become fragmented resulting in reduced performance.
Info: It is beyond the scope of this document to provide specific scripts for database maintenance. However, it is important to run index defragmentation and reorganization scripts at least twice per week to ensure that SkySync tables are properly maintained. If maintenance scripts are not standardized in the organization, Ola Hallengren provides very useful scripts that can rebuild indexes across all tables in a database, but only when necessary.
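For illustration only, the statements below show the kind of operation such maintenance scripts issue; dbo.TransferItems is a hypothetical table name, and production scripts (such as Ola Hallengren's) iterate the actual tables and apply fragmentation thresholds automatically:

```sql
-- Reorganize is lightweight and always online; commonly used for moderate
-- fragmentation (roughly 5-30%). dbo.TransferItems is a hypothetical table.
ALTER INDEX ALL ON dbo.TransferItems REORGANIZE;

-- Rebuild for heavy fragmentation (roughly >30%); ONLINE = ON requires
-- Enterprise edition but avoids blocking 24/7 transfer processing.
ALTER INDEX ALL ON dbo.TransferItems REBUILD WITH (ONLINE = ON);
```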
...
SkySync Database Recovery Mode
As with any migration solution, the data in the database can generally be considered transient. In other words, even if the database is lost, organizational content is not lost. This means that while database backups are useful to minimize any lost migration or synchronization processing time, recent log backups are not necessary.
Info: For these reasons, it is recommended that the SkySync database be configured for the Simple recovery model. There is no need to spend storage space and processing resources on retaining transaction log data.
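Assuming the SkySync database is named “SkySync”, the recovery model can be set with a single statement:

```sql
ALTER DATABASE SkySync SET RECOVERY SIMPLE;
```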
...
SkySync Database Backups
With the SkySync database operating in the Simple Recovery model, database backups are straightforward. For high throughput solutions, a nightly full backup is generally sufficient. However, if the Recovery Point Objective (RPO) for the organization indicates that a full day of migration/synchronization processing is too much time loss, then additional incremental backups can be scheduled at intervals throughout the day.
Ideally, database backups should be executed with compression enabled to minimize backup storage size and backup duration. However, this comes at a cost of increased CPU utilization on the SQL Server, which is important to consider.
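A minimal sketch of this backup scheme, assuming the “SkySync” database name and an illustrative backup path (SQL Server implements intraday incrementals as differential backups):

```sql
-- Nightly full backup with compression.
BACKUP DATABASE SkySync
TO DISK = 'H:\Backups\SkySync_full.bak'
WITH COMPRESSION, INIT;

-- Optional intraday differential backup when the RPO requires it.
BACKUP DATABASE SkySync
TO DISK = 'H:\Backups\SkySync_diff.bak'
WITH DIFFERENTIAL, COMPRESSION, INIT;
```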
...
SkySync includes a wide array of options for both scale up and scale out solutions to drive higher overall transfer throughput.
Parallel Processing Concepts
Simply having a multi-threaded transfer processing engine is not good enough in the complex world of migration and synchronization, because for any given transfer operation, source and destination systems operate at different speeds. The SkySync solution manages this imbalance by providing a single processing thread for each overall job, while also providing an additional pool of “parallel write” processors that are available to all job threads.
...
There are limits to scaling (up and out) the number of processing threads as well as the number of parallel writes. These limits are discussed below.
Scaling Up
In the world of server processing, conventional wisdom indicates that when any one server resource reaches 80% of capacity, it is time to consider scaling server resources (either up or out).
...
With just (1) processing server in the SkySync solution, scaling up the concurrent jobs and parallel writes will eventually cause the server to reach 80% capacity for network, CPU, or RAM. When this happens, the admin can either add additional resources to the server, or consider a scale OUT solution.
Scaling Out
Once the optimal number of parallel processing threads is identified for a given processing server hardware configuration, it is best to scale out using the exact same configuration. Having the processing servers use the same configuration will make it easier to manage transfer processing tests and eventually the transfer jobs during active migration.
...
The SkySync database is highly optimized. If the SQL Server is properly scaled, particularly in the I/O subsystem, a tremendous amount of parallelism is possible. A transfer solution of over 200 concurrent processing jobs has been successfully implemented resulting in a cloud to cloud transfer peak rate of over 22TB per day. This occurred under very carefully managed conditions, but it is a very real and proven throughput.
Parallel Writes
“Parallel Writes” is a concept that can have a significant effect, positive or negative, on transfer throughput. This setting allows SkySync to control exactly “how hard” (how parallel) the API of the destination solution is called.
...
2) Click on the Performance tab in the SkySync configuration window:
3) Increase or decrease parallel writes as necessary by (1) or (2) at a time.
...
When multiple servers are scaled up and/or out, it is common to see parallel writes at (4) or (6). But given the right conditions, they may be able to go higher. For example, when the destination is a Network File Share, then rate limiting is not really a factor. As long as I/O is not saturated, parallel writes can be increased.
Rate Limits
As mentioned above, rate limiting or throttling can occur when too many calls are made to a cloud platform API in too short a time. When this happens, most cloud providers require an incremental back-off strategy: SkySync must wait a specified period of time after a rate limit is encountered before retrying. If the retry receives another rate limit error, SkySync must double the previous wait time before trying again. This pattern continues until the rate limit error is no longer encountered. Given the increasing back-off strategy, sustained throttling can dramatically reduce transfer throughput.
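The cost of this doubling pattern compounds quickly. Assuming an initial wait of $w_0$ seconds, the $k$-th consecutive rate-limit response imposes a wait of $w_k = w_0 \cdot 2^k$, so $n$ consecutive throttle responses cost a cumulative delay of

$$\sum_{k=0}^{n-1} w_0 \, 2^k = w_0 \left(2^n - 1\right).$$

For example, with $w_0 = 1$ second, eight consecutive throttle responses impose $2^8 - 1 = 255$ seconds of waiting, which is why staying under the rate limit in the first place matters so much for throughput.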
...
There are two primary architectural nodes in a SkySync configuration. While the database node and the SkySync processing server node can both exist on the same server, they can also be deployed in such a way as to facilitate high availability.
Database Node
As mentioned above, the SkySync database can operate on SQL Server Express, or even SQL CE, on the same box as a SkySync processing node. However, for high availability it should be deployed in a fault-tolerant solution.
It is beyond the scope of this document to provide architectural guidance for SQL Server high availability, but standard high availability solutions are supported by the SkySync platform. This would include SQL Server Availability Groups or other clustered SQL Server solutions. However, it is important to understand that any delays in transaction processing may impact overall SkySync solution scalability and throughput. For example, a synchronous mirroring solution over a long distance or otherwise high latency link could have significant negative impacts on transfer throughput.
SkySync Processing Node
All SkySync processing servers talk directly to the database. There is no centralized communication server or other single point of failure in the SkySync solution. If there are at least 2 processing servers operating on independent hardware, the SkySync processing solution is highly available. If one server becomes unavailable, another available server will eventually pick up where the failing server left off and continue processing the job.
Summary
...