Speed up ETL

Minimal logging to load data precisely where you want it as fast as possible:

Writing data to a database typically involves two separate writes to disk: one to the database and one to the log (so that transactions can be rolled back or redone). When inserting data into an existing table it is, in fact, possible in some cases to write only once by using the minimally logged INSERT feature. Minimal logging still allows transactions to be rolled back, but it does not support point-in-time recovery, and it is only available under the bulk-logged and simple recovery models. In SQL Server 2008 R2, minimal logging can be used with INSERT INTO … SELECT FROM Transact-SQL statements when inserting a large number of rows into an existing table, provided the target is either an empty table with a clustered index and no nonclustered indexes, or a heap (empty or not) with no indexes.

One huge benefit of minimal logging is that it speeds up the loading of empty partitions or tables that are on specific filegroups.

Now you can simply create a table on the filegroup(s) you want it to live in, define its partition scheme, and then load it with INSERT INTO tbl WITH (TABLOCK) SELECT … FROM …, and you achieve minimal logging. Minimal logging makes it much easier to put the data just where you want it and to write it to disk only once. As an added bonus, load performance is increased and the amount of log space required is reduced.
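
As a rough sketch, a minimally logged load into a new table on its own filegroup might look like this (the filegroup, table, and column names are illustrative, and it assumes the database is using the simple or bulk-logged recovery model):

    -- Illustrative names; assumes filegroup FG_2010 exists and the database
    -- is in the SIMPLE or BULK_LOGGED recovery model.
    CREATE TABLE dbo.FactSales_2010
    (
        DateKey      int    NOT NULL,
        ProductKey   int    NOT NULL,
        SalesAmount  money  NOT NULL
    ) ON FG_2010;

    CREATE CLUSTERED INDEX CIX_FactSales_2010
        ON dbo.FactSales_2010 (DateKey) ON FG_2010;

    -- The TABLOCK hint on the empty target table is what allows the insert
    -- to be minimally logged.
    INSERT INTO dbo.FactSales_2010 WITH (TABLOCK)
    SELECT DateKey, ProductKey, SalesAmount
    FROM   Staging.dbo.SalesExtract
    WHERE  DateKey BETWEEN 20100101 AND 20101231;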

Data extraction by using Change Data Capture in the SQL Server source systems: Change Data Capture is a data-tracking feature that is of particular benefit in data warehousing. It tracks changes to user tables and collects them into a relational format; a typical use is to track changes in an operational database for later inclusion in the warehouse. The capture process collects change data from the database’s transaction log and inserts it into a change table, and metadata about each transaction is written to a metadata table so that changes can be ordered with respect to time. This enables you to identify, for instance, the type of change made to each row and which column or columns changed in an updated row, and you can also request all rows that changed between two points in time. Change Data Capture is a big step towards improved extraction performance, and it makes programming the change-capture portion of your ETL jobs much easier.
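
As a minimal sketch (the database, schema, and table names are placeholders), enabling Change Data Capture and then reading a day’s worth of changes looks roughly like this:

    -- Enable CDC for the database and then for the source table
    -- (illustrative names; requires appropriate permissions).
    USE SourceDB;
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'Sales',
        @source_name   = N'Orders',
        @role_name     = NULL;   -- no gating role in this example

    -- The ETL job later asks for all changes between two points in time.
    DECLARE @from_lsn binary(10), @to_lsn binary(10);
    SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', '2010-07-01 00:00');
    SET @to_lsn   = sys.fn_cdc_map_time_to_lsn('largest less than or equal', '2010-07-02 00:00');

    SELECT __$operation, __$update_mask, *
    FROM   cdc.fn_cdc_get_all_changes_Sales_Orders(@from_lsn, @to_lsn, N'all');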

Speed up ETL with improved Lookup:

The performance of the Lookup component was greatly improved in SQL Server 2008 Integration Services, and it is much easier to program. A Lookup verifies whether each row in a stream of rows has a matching row in a set of reference data. This is often used within the ETL process to check, for example, the ProductID column in a fact table (acting as the data source) against a dimension table holding the complete set of products (the reference set). The Lookup transformation supports two connection types when connecting to the reference dataset: the Cache connection manager and the OLE DB connection manager. The reference dataset can be a cache file, an existing table or view, a new table, or the result of an SQL query.

Reference data is usually cached for efficiency, and now a data flow can be used to populate the cache. Many potential sources can be used as reference data: Excel, XML, text, Web services, anything within reach of an ADO.NET provider. Previously, a Lookup could take its reference data only from specific OLE DB connections, and the cache could be populated only by an SQL query. The new Cache Transform component populates a cache defined by the Cache connection manager. The cache no longer needs to be reloaded each time it is used, which removes the speed penalty incurred by reloading from a relational source. If a reference dataset is used by two pipelines in a single package, the cache can be saved to permanent file storage as well as to virtual memory, so it is available to multiple Lookups within one package. Furthermore, the cache file format is optimized for speed and its size is unrestricted.

The miss-cache feature is also new. When running directly against the reference dataset, a Lookup component can add to the cache any key values from the source for which there is no matching value in the reference dataset. So once the Lookup has determined that the reference set does not contain, for example, the value 885, it does not waste time inspecting the reference set for that value if it appears again in the source data. Under certain conditions this feature can produce a performance improvement of 40%. Finally, there is now a ‘Lookup no match output’ to which ‘miss-cache’ rows can be directed instead of sending them to the error output.

Relational Data Warehouse Setup, Query, and Management:

The relational database is the heart of any BI system. Best practices here affect not only the performance of the entire system, but also its flexibility and value to the enterprise. For a more in-depth discussion of how to get the best performance from a large-scale SQL Server 2008 data warehouse, see SQL Server Books Online.

Rebuild statistics and indexes: The query optimizer uses information from the database statistics (number of rows, data distribution, and so on) to help determine the optimal query plan. If the statistics are inaccurate, a less optimal plan may be chosen, which degrades query performance. If you can afford the time during your ETL process, rebuild statistics after every load of the data warehouse; this ensures that the statistics are always accurate. However, rebuilding statistics takes time and resources, and the time between rebuilds matters less than how much the distribution of the data has changed. When the data warehouse is new and evolving, and also relatively small, it makes sense to update statistics frequently, possibly after every load. As the warehouse matures, you can sometimes reduce the frequency of rebuilding without degrading query performance significantly. If it is important to reduce the cost of updating statistics, you can determine the appropriate refresh frequency by monitoring query performance as you reduce the frequency of refreshes. Be aware that if the bulk of your queries target only the most recently loaded data, you will not have statistics for that data unless you update statistics after each load. This is a fairly common situation, which is why we recommend updating statistics after each load of the data warehouse by default.
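
For example (table names are placeholders), the last step of a nightly load could simply be:

    -- Refresh statistics on the tables this load touched (illustrative names).
    UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;
    UPDATE STATISTICS dbo.FactShipments WITH FULLSCAN;

    -- Or, more coarsely, refresh any statistics in the database that the
    -- engine considers out of date.
    EXEC sp_updatestats;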

Backup compression to reduce storage footprint:

Backup compression is now available and should be used unless you find a good reason not to. The benefits are the same as for other compression techniques: there are both speed and volume gains. We anticipate that for most data warehouses the primary gain of backup compression is the reduced storage footprint, and the secondary gain is that the backup completes more rapidly. Moreover, a restore runs faster because the backup is smaller. Backup compression will not compress encrypted databases (because encryption hides the redundancy in the data). If you want small encrypted backups, back up the unencrypted data with backup compression and then use an encryption utility to encrypt the backup file.
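
Turning it on is a one-line change to the backup command (the database name and backup path below are placeholders), and compression can also be made the instance-wide default:

    -- Compress a single backup (database name and path are placeholders).
    BACKUP DATABASE DW
    TO DISK = N'E:\Backups\DW_full.bak'
    WITH COMPRESSION, CHECKSUM, STATS = 10;

    -- Or make compression the default for every backup on this instance.
    EXEC sp_configure 'backup compression default', 1;
    RECONFIGURE;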

Partition large fact tables: Partitioning a table means splitting it horizontally into smaller components (called partitions). Partitioning brings several benefits. Essentially, it enables the data to be segregated into sets, which alone has huge advantages in terms of manageability. Partitions can be placed in different filegroups so that they can be backed up independently, which also means that we can position the data on different spindles for performance reasons. In data warehouses it also means that we can isolate the rows that are likely to change and perform updates, deletes, and inserts on those rows alone.
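
As a sketch, a fact table partitioned by month on an integer date key might be set up like this (the boundary values, filegroup mapping, and table definition are illustrative):

    -- Monthly partitions keyed on an integer date key (illustrative boundaries).
    CREATE PARTITION FUNCTION pfMonthly (int)
    AS RANGE RIGHT FOR VALUES (20100101, 20100201, 20100301, 20100401);

    -- Everything is mapped to PRIMARY here; in practice each partition can go
    -- to its own filegroup (and therefore its own spindles).
    CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FactSales
    (
        DateKey      int    NOT NULL,
        ProductKey   int    NOT NULL,
        SalesAmount  money  NOT NULL
    ) ON psMonthly (DateKey);

    CREATE CLUSTERED INDEX CIX_FactSales
        ON dbo.FactSales (DateKey) ON psMonthly (DateKey);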

Query processing can be improved by partitioning because it sometimes enables query plans to eliminate entire partitions from consideration. For example, fact tables are frequently partitioned by date, such as by month. So when a report is run against the July figures, instead of accessing 1 billion rows, it may only have to access 20 million. Indexes as well as tables can be (and frequently are) partitioned, which increases the benefits. Imagine a fact table of 1 billion rows that is not partitioned. Every load (typically nightly) means insertion, deletion, and updating across that huge table. This can incur huge index maintenance costs, to the point where it may not be feasible to do the updates during your ETL window. If the same table is partitioned by time, generally only the most recent partition must be touched, which means that the majority of the table (and indexes) remain untouched. You can drop all the indexes prior to the load and rebuild them afterwards to avoid index maintenance overhead. This can greatly improve your load time.
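
Because only the newest partition changes, index maintenance can also be restricted to that partition rather than the whole table. Continuing the illustrative pfMonthly / FactSales example above:

    -- Find the partition that holds the rows just loaded...
    SELECT $PARTITION.pfMonthly(20100401) AS current_partition;  -- returns 5 here

    -- ...and rebuild the index for that partition only.
    ALTER INDEX CIX_FactSales ON dbo.FactSales
        REBUILD PARTITION = 5;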

Partition-align your indexed views:

You can create indexed views that are aligned with your partitioned fact tables and switch partitions of the fact tables in and out. This works if the indexed views and the fact table are partitioned using the same partition function. Typically, both the fact tables and the indexed views are partitioned by the surrogate key referencing the Date dimension table. When you switch in a new partition, or switch out an old one, you do not have to drop the indexed views first and re-create them afterwards. This can save a huge amount of time during your ETL process. In fact, it can make it feasible to use indexed views to accelerate your queries where it was not feasible before because of the impact on your daily load cycle.
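
A minimal sketch of the pattern, continuing the illustrative psMonthly / FactSales example: the view is indexed on the same partition scheme as the fact table, so partitions can be switched without touching it.

    -- An indexed view aligned with the fact table's partition scheme.
    CREATE VIEW dbo.vSalesByProduct
    WITH SCHEMABINDING
    AS
    SELECT DateKey, ProductKey,
           SUM(SalesAmount) AS SalesAmount,
           COUNT_BIG(*)     AS RowCnt
    FROM   dbo.FactSales
    GROUP BY DateKey, ProductKey;
    GO

    CREATE UNIQUE CLUSTERED INDEX CIX_vSalesByProduct
        ON dbo.vSalesByProduct (DateKey, ProductKey)
        ON psMonthly (DateKey);
    GO

    -- With the view partition-aligned, a staging table can be switched into the
    -- target partition without dropping the view first. The staging table is
    -- assumed to match the fact table's structure, constraints, filegroup, and
    -- aligned indexed views.
    ALTER TABLE dbo.FactSales_Staging
        SWITCH TO dbo.FactSales PARTITION 5;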

Manage Multiple Servers Uniformly:

Use Policy-Based Management to enforce good practice across multiple servers. SQL Server 2008 R2 supports Policy-Based Management, which makes it possible to declare policies (such as “all log files must be stored on a disk other than the data disk”) in one location and then apply them to multiple servers. So a (somewhat recursive) best practice is to set up best practices on one server and apply them to all servers. For example, you might build three data marts that draw data from a main data warehouse and use Policy-Based Management to apply the same rules to all the data marts.

PowerPivot: It’s a most powerful tool, and one I’d love to learn more about.

Microsoft introduced the ability to analyze hundreds of millions of rows of data in Microsoft Excel 2010. You can create pivot tables and pivot charts, and filter and summarize the data quickly and easily. The performance provided is incredible – you can filter and summarize hundreds of millions of rows on a desktop PC in sub-second time. Moreover, a new column store technology called VertiPaq is used in PowerPivot, and this means no aggregates are required to get this speed, simplifying development and management, and giving predictable fast performance. Analytical applications created in PowerPivot can be shared by uploading them to SharePoint and Excel Services. You should evaluate these tools for both end-user analytics and general reporting. Consider initiating a pilot project with analytics for “data explorers” or other power users with PowerPivot. See http://powerpivot.com for more information.

Hardware capacity and hardware price:

To improve performance under load, you can either scale up or scale out. Scaling up is undeniably simple: put the cube on an extra-large, high-performance server. This is an excellent solution: it is quick and easy and is the correct decision in many cases. However, it is also expensive, because these servers cost more per CPU than multiple smaller servers. Previous versions of Analysis Services offered a scale-out solution that used multiple cost-effective servers. The data was replicated across the servers, and a load-balancing solution such as Microsoft Network Load Balancing (NLB) was installed between the clients and the servers. This worked, but it incurred the additional costs of setup and ongoing maintenance. SQL Server 2008 Analysis Services introduced a new scale-out solution called Scalable Shared Database (SSD). It works very much like the SSD feature in the SQL Server 2005 relational database engine and comprises three components:

* Read-only database – enables a database to be designated ‘read-only’
* Database storage location – enables a database to reside outside the server Data folder
* Attach/detach database – a database can be attached to or detached from any UNC path

Used together, these components make it easier to build a scale-out solution for read-only Analysis Services cubes. For example, you can connect four blade servers to a shared, read-only database on a SAN and direct SSAS queries to any of the four, thereby improving your total throughput by a factor of four with inexpensive hardware. The best possible query response time remains constrained by the capabilities of an individual server, since only one server runs each query.

System Architecture and Performance:

Keep the report server catalog on the same computer as the report server: Reporting Services generates and uses a database that is essentially a catalog. This database is used during report processing for a number of tasks, including managing the data returned from queries. While it is possible to store it on a server other than the report server, doing so pushes the report data across the network unnecessarily, which slows down report execution. It is a much better practice to keep the catalog database on the report server to avoid this network traffic and the associated delays.

Run Reporting Services on a different server from your data warehouse: While pulling the query data across the network does slow things down, it is better not to have your data warehouse and SSRS competing for memory and processing time.

Decision Making:

Good decision making is as important in the working world as it is in the rest of our lives. Every day a number of decisions must be made that determine the direction and efficiency of the organizations we work for. Decisions are made concerning production, marketing, and personnel. Decisions are made affecting costs, sales, and margins. Just as in our personal lives, the key to organizational success is to make good choices. The organization must have effective decision making.

Who is a decision maker…?

Mehboob

MCTS & MCITP

 

 


One thought on “Speed up ETL”

  1. Gerson 12/05/2012 at 07:44

    The column names of the result set that are returned by EXCEPT or INTERSECT are the same names as those returned by the query on the left side of the operand. Column names or aliases in ORDER BY clauses must reference column names returned by the left-side query.

