In-Depth
Build Big-Data Apps in SQL Azure with Federation
Get ready to scale out SQL Azure databases beyond today's 50GB limit with the Transact-SQL and ADO.NET elastic sharding features, which are coming in the 2011 SQL Azure Federation Community Technology Previews.
An urban myth that relational databases and SQL can't achieve Internet-scale of terabytes -- or petabytes -- has fostered a growing "NoSQL" developer community and a raft of new entity-attribute-value, also known as key value, data stores. The Microsoft SQL Azure team gave credence to the myth by limiting the maximum size of the initial Business Edition of the cloud database to just 10GB. SQL Azure adds a pair of secondary data replicas to assure high availability, and the team cited performance issues with replication as the early size-limiting factor.
Today, you can rent a 50GB SQL Azure database for $499.95 per month, but the SQL Azure Team isn't talking publicly about future scale-up options. Instead, Microsoft recommends that you scale out your SQL Azure databases by partitioning them horizontally into smaller instances, called shards, running on individual SQL Azure database instances, and group the shards into federations.
Sharding increases database capacity and query performance, because each added SQL Azure database brings its own memory and virtual CPU. Microsoft Software Architect Lee Novik first described SQL Azure sharding details in his "Building Scale-Out Applications with SQL Azure" session at the Microsoft Professional Developers Conference 2010, held last October in Redmond.
Horizontal partitioning isn't new to SQL Server. Horizontally partitioning SQL Server 7.0 and later tables to multiple files and filegroups improves performance by reducing average table and index size. Placing each filegroup on an individual disk drive speeds T-SQL queries. Partitioning also streamlines backup and maintenance operations by reducing their time-window length.
SQL Server 2005 automated the process with the CREATE PARTITION FUNCTION command, which lets you automatically map the rows of a table or index into specified partitions based on the values of a specified column. You design a CREATE PARTITION SCHEME to determine how to assign partitioned files to filegroups. SQL Server partitioned views make rows of all partitions appear as a single table; distributed partitioned views enable partitioning data across multiple linked servers, not just filegroups, for scaling out. SQL Server 2000 introduced updateable distributed views with distributed transactions, and SQL Server 2000 SP3 used OLE DB to optimize query-execution plans for distributed partitioned views.
A group of linked servers that participates in distributed partitioned views is called a "federation." The partitioning column, whose values determine the partition to which the row belongs, must be part of the primary key and can't be an identity, timestamp or default column.
New Taxonomy
Scaling out SQL Azure with federated database instances follows a pattern similar to that for on-premises SQL Server, but is subject to several important limitations. For example, SQL Azure doesn't support linked servers, CREATE PARTITION FUNCTION, CREATE PARTITION SCHEME, cross-database joins, distributed (cross-database) transactions, OLE DB or the T-SQL NewSequentialID function. These restrictions require architectural changes for SQL Azure federations, starting with this new taxonomy:
- Federation consists of a collection of all SQL Azure database instances that contain partitioned data having the same schema. The T-SQL script in "How to Create a Federation with Customers, Orders and OrderItems Tables" shows a T-SQL script to create an Orders_Federation with a schema based on three tables of the Northwind sample database.
- Federation Members comprise the collection of SQL Azure databases that contain related tables with partitioned data, called Federated Tables. A Federation Member also can contain replicated lookup tables (Products) that provide supplementary data that's not dependent on the Federation Key.
- Federation Key is the primary key value (CustomerID) that determines how data is partitioned among Federated Tables, each of which must contain the Federation Key in their primary key, which can be a big integer (bigint, a 64-bit signed integer) or GUID (uniqueidentifier) data type. For example, the Orders and OrderItems tables have composite primary keys (OrderID + CustomerID and OrderID + ProductID + CustomerID, respectively).
- Atomic Unit (AU)is a cluster of a single parent table (Customers) row and all related rows of its dependent tables (Orders and OrderItems). AU clusters can't be separated in the partitioning (sharding) process or when moving data between Federation Members.
- Federation Root is the initial database that contains metadata for specifying the Partitioning (sharding) Method, range of valid values for the Federation Key, and minimum/maximum Federation Key value ranges for each Federation Member.
- Partitioning Method determines whether the Federation Key is generated by the application or the data tier. For this article's example, the data tier uniqueidentifier data type provides random 128-bit (16-byte) GUID values, which balance additions across multiple Federation Members. Sequential bigint values are easier to read, but require a feature similar to the SQL Server Denali Sequence object to generate identity values that are unique over multiple Federated Tables.
The SQL Azure Team plans to release SQL Azure Federation features in piecemeal fashion starting with a Community Technology Preview (CTP) of version 1 in 2011. The current plan is for the CTP 1 to support partitioning by uniqueidentifier FederationKey values only; a post-CTP 1 drop will add bigint FederationKeys (see Table 1).
New T-SQL Key Words for Scripting SQL Azure Federation
The following new T-SQL syntax will support SQL Azure Federations in the version 1 CTP (see "How to Create a Federation with Customers, Orders and OrderItems Tables"):
CREATE FEDERATION Federation_Name(Federation_Key_Name RANGE uniqueidentifier)
DROP FEDERATION Federation_Name
CREATE TABLE Table_Name FEDERATE ON (Federation_Key_Name)
USE FEDERATION Federation_Name (Fed_Key_Value) WITH FILTERING=ON
USE FEDERATION Federation_Name (Fed_Key_Value) WITH FILTERING=OFF
ALTER FEDERATION Federation_Name SPLIT AT(Fed_Key_Value)
ALTER FEDERATION Orders_Federation DROP AT(Fed_Key_Value)
The FILTERING=ON option restricts the visibility of AUs to the specific cluster specified by Fed_Key_Value; FILTERING=OFF makes all AUs in the Federation Member visible for bulk operations. The ALTER FEDERATION ... SPLIT AT Fed_Key_Value instruction lets you balance the size of Federation Members that have grown substantially larger than the average by moving AUs greater than Fed_Key_Value into a new Federation Member. SPLIT AT and DROP AT operations work with AUs exclusively.
The version 1 CTP ADO.NET sharding library will include sample Microsoft .NET Framework-based code for connecting to the Federation Root, which automatically routes the connection to the appropriate Federation Member based on information from the sys.federation_member_columns view, and retrieves AUs as ADO.NET DataSets. Here's the generic code for instructions that open a connection to a Federation Member whose member_id specified by dbname_postfix:
SqlConnection cn = new
SqlConnection("Server=tcp:servername.db.windows.net;" +
"Db=SalesDB_" + dbname_postfix +
";User ID=username;Password=password;" +
"Trusted_Connection=False;Encrypt=True");
cn.Open();
...