SQL Server 2014 performance & improvements

The new performance features in SQL Server 2014 can help our systems perform more efficiently.










There have been some other improvements and features added in this latest version of SQL Server that, if used correctly, can also help improve the performance of our databases. Here are some of these improvements and how a DBA can make use of them to get the most out of a database (and keep bread and butter on the table):

* Partition switching and indexing
* Columnstore indexes
* Buffer pool extensions
* Cardinality estimator
* Statistics
* Resource Governor

Partition switching and indexing:
With SQL Server 2014, it is possible for individual partitions of partitioned tables to be rebuilt online. Let's start with indexing.
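For example, a single partition of an index can now be rebuilt online instead of rebuilding the whole index; the table, index, and partition number below are hypothetical:

```sql
-- Rebuild only partition 4 of a partitioned index, online,
-- without taking the rest of the table offline.
ALTER INDEX IX_Orders_OrderDate
ON dbo.Orders
REBUILD PARTITION = 4
WITH (ONLINE = ON);
```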







Columnstore indexes:
A columnstore index is a technology for storing, retrieving, and indexing data using a column-based layout, as opposed to the traditional row-based formats. Columnstore indexes were first introduced in SQL Server 2012 with certain limitations, and some of those limitations have been removed in SQL Server 2014.

In SQL Server 2014, there is support for both clustered and non-clustered columnstore indexes, which is a change from SQL Server 2012, which only supported non-clustered columnstore indexes. A columnstore index, if configured appropriately on the correct tables, can achieve up to 10 times the query performance of traditional row-based storage and up to 7 times the data compression over the original data size. It offers a number of benefits, including the following:


• Columns often have similar data that is stored contiguously on disks. This can result in high compression rates.

• High compression rates improve the query performance as they have a smaller in-memory footprint; therefore, SQL Server can carry out more query processing in the memory.

• Columnstore indexes can reduce CPU usage. Batch-mode execution is a new query execution mechanism added to SQL Server that reduces CPU usage. Batch-mode execution is integrated with the columnstore storage format. This is sometimes referred to as vector-based or vectorized execution.

• A typical query will often select only a few columns from a table, which reduces the total I/O from the physical storage, thus improving the overall performance again. In addition, the clustered columnstore index can now be updated in SQL Server 2014.

Creating a clustered columnstore index:
We can create a clustered columnstore index for a table using either T-SQL or the Management Studio GUI. Using Management Studio, connect to the instance of SQL Server that houses your database, expand the databases and tables, and right-click on the index folder to choose New Clustered Columnstore Index. It is worth noting here that if you have a normal rowstore clustered index, you won't be able to create a clustered columnstore index.


When the wizard starts, we will notice that there is no key column; all columns are included columns. It is also worth noting that the clustered columnstore index does not work with other indexes. If another index exists, you won't be able to create a clustered columnstore index, and if you have a clustered columnstore index, you won't be able to create another index type. Much like a normal clustered index, which sorts and stores the data rows of a table, a clustered columnstore index stores the data for the entire table. Clustered columnstore indexes are an Enterprise-only feature in SQL Server 2014.
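The T-SQL equivalent is a single statement; the fact table name below is hypothetical:

```sql
-- Create a clustered columnstore index on a fact table.
-- Any existing rowstore indexes on the table must be dropped first.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;
```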

Buffer pool extensions:
Buffer pool extensions allow you to make use of solid-state drives as extra RAM on your database server. They provide unified integration of a nonvolatile random access memory (solid-state drive) extension to the Database Engine buffer pool, which can significantly improve the I/O throughput. Buffer pool extensions are an Enterprise edition feature.
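Enabling the extension is a single server-level command; the file path and size below are hypothetical and should point at an SSD volume:

```sql
-- Enable the buffer pool extension on an SSD volume.
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'E:\SSDCACHE\Example.BPE', SIZE = 50 GB);

-- Check the current state of the extension:
SELECT path, state_description, current_size_in_kb
FROM sys.dm_os_buffer_pool_extension_configuration;
```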

You need Enterprise Edition to use this feature. The main goal of a SQL Server database is to store, retrieve, and manipulate data, so you should expect high disk access and disk I/O on many production systems. These disk storage operations can consume many server resources and take a relatively long time to complete; this delay can be the cause of performance issues that stem solely from the I/O throughput. In order to counteract the delays that I/O operations can cause, SQL Server has always tried to make I/O efficient by making use of memory (RAM). The buffer pool serves as the main memory allocation source for SQL Server.

The buffer management component consists of two mechanisms: the buffer manager to access and update database pages, and the buffer pool to reduce the database file I/O. Data pages get modified in memory, and the changed data pages are known as dirty pages. The checkpoint process, internal to SQL Server, writes the data pages back to the disk. Memory pressure on the server, along with database checkpoints, will cause dirty pages in the buffer cache to be removed from the cache and written to mechanical disks and then read back into the cache. SQL Server I/O operations are usually small random reads and writes of the order of 4 to 16 KB of data. Small random I/O patterns will incur frequent seeks, which will compete for the mechanical disk arm.

This can reduce the I/O performance and the aggregate I/O throughput of the system. Before SQL Server 2014, a reasonable approach to solving these I/O bottlenecks was to add more RAM, to add additional high-performance I/O spindles, or a combination of the two. These options are definitely helpful and are still likely to be helpful in SQL Server 2014. However, there are some drawbacks associated with them:

* RAM is generally more expensive than data storage drives and adding extra spindles and disk drives increases capital expenditure in hardware acquisition. This can increase operational costs by increasing the power consumption and the probability of component failure.

* Mechanical disk drives can and will fail eventually.

The buffer pool extension feature allows SQL Server to extend the buffer pool cache by making use of Solid State Drives (SSDs). This enables the buffer pool to accommodate a larger database working set, which forces the paging of I/O between RAM and the SSDs instead of the mechanical disk. This effectively offloads small random I/O operations from mechanical disks to SSDs. Because SSDs offer better performance through lower latency and better random I/O handling, the buffer pool extension can significantly improve the I/O throughput, thus removing I/O bottlenecks and speeding up the performance of a system, as I/O is usually the slowest component.

Cardinality estimator and query plans:

The cardinality estimator has been redesigned in SQL Server 2014 in order to improve the quality of query plans and thus improve query performance. The new cardinality estimator includes assumptions and algorithms that work well on modern transactional (OLTP) and data warehousing databases. The changes to the cardinality estimator mean that you should allow additional testing of your databases as you migrate them to SQL Server 2014, to ensure that the change does not affect your database performance in a negative way.
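One way to compare the two estimators during such testing is via the database compatibility level: under level 110 the legacy estimator is used, and under level 120 the new one. The database name below is an example:

```sql
-- Run the database under the legacy (pre-2014) cardinality estimator:
ALTER DATABASE MyDatabase SET COMPATIBILITY_LEVEL = 110;

-- Switch to the new SQL Server 2014 cardinality estimator:
ALTER DATABASE MyDatabase SET COMPATIBILITY_LEVEL = 120;
```

Capturing the query plans for your critical workload under both settings makes regressions easy to spot before going live.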







Statistics:
One of the problems with updating statistics on large tables in SQL Server is that the entire table has to be scanned, for example when using the WITH FULLSCAN option, even if only recent data has changed. This is also true when using partitioning: even if only the newest partition has changed since the last time the statistics were updated, updating them again requires a scan of the entire table, not just the current partition, including all the partitions that didn't change. With the introduction of SQL Server 2014, you can now update statistics incrementally, which helps with this problem.








The CREATE STATISTICS statement and related statistics statements now allow individual partition statistics to be created and updated using the INCREMENTAL option. Other related statements that allow or report incremental statistics include UPDATE STATISTICS, sp_createstats, CREATE INDEX, ALTER INDEX, ALTER DATABASE SET, DATABASEPROPERTYEX, sys.databases, and sys.stats.
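A minimal sketch of the incremental workflow; the table, column, statistics name, and partition number are hypothetical:

```sql
-- Create statistics that can later be refreshed per partition.
CREATE STATISTICS Stat_Orders_OrderDate
ON dbo.Orders (OrderDate)
WITH FULLSCAN, INCREMENTAL = ON;

-- Later, refresh only the partition that actually changed,
-- instead of rescanning the whole table:
UPDATE STATISTICS dbo.Orders (Stat_Orders_OrderDate)
WITH RESAMPLE ON PARTITIONS (12);
```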

Resource Governor:
In previous editions of SQL Server, the Resource Governor allowed you to specify limits on the amount of CPU or memory that a process can use as part of a resource pool. In SQL Server 2014, this has been improved with new Resource Governor settings to include I/O activity too. You can use the new MIN_IOPS_PER_VOLUME and MAX_IOPS_PER_VOLUME settings to control the physical I/Os issued for user threads in a given resource pool.
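For instance, an I/O-capped pool can be created like this; the pool name and limits are hypothetical:

```sql
-- Create a resource pool that caps physical I/O per disk volume.
CREATE RESOURCE POOL ReportingPool
WITH (MIN_IOPS_PER_VOLUME = 20,   -- reserve a minimum I/O rate
      MAX_IOPS_PER_VOLUME = 100); -- cap the maximum I/O rate

-- Apply the change:
ALTER RESOURCE GOVERNOR RECONFIGURE;
```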

As you can see, SQL Server 2014 has some great new features for the DBA. These new features will allow the DBA to implement a more robust and highly available production environment. The enhancements and changes to some of the SQL Server performance features, namely In-Memory Optimized tables and Delayed Durability, provide the DBA with an additional set of tools to get their databases performing at an optimal level.

Let me know if you have any further questions; your comments will be a learning point.


Microsoft Certified Solutions Associate (MCSA)

Hive – Hadoop


Hive (not the HIV disease, this is something different 🙂 ) is a framework that sits on top of core Hadoop. It acts as a data warehousing system on top of HDFS and provides easy query mechanisms for the underlying HDFS data.

Programming MapReduce jobs can be tedious and comes with its own development, testing, and maintenance investments. Hive queries, written in Hive Query Language (HQL), are broken down into MapReduce jobs under the hood; this remains a complete abstraction to the user and provides a query-based access mechanism for Hadoop data.
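For example, a familiar-looking HQL query like the one below (the table and columns are hypothetical) is compiled by Hive into map and reduce phases automatically, with no MapReduce code written by the user:

```sql
-- Top 10 products by number of sales; Hive turns the GROUP BY
-- into a MapReduce job behind the scenes.
SELECT product_id, COUNT(*) AS total_sales
FROM sales
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT 10;
```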

Hive behaves much like a traditional relational database, enabling business users to use familiar tools such as Excel and SQL Server Reporting Services to consume data from Hadoop in a similar way as they would from a database system such as SQL Server, remotely through an ODBC connection. The steps to use the Hive ODBC driver to consume the data are:

1. Download and install the Hive ODBC Driver
2. Configure the driver to connect to Hive running on your Hadoop cluster on Linux/Windows
3. Use SQL Server Integration Services (SSIS) to import data from Hive to SQL Server

One of the main advantages of Hive is that it provides a querying experience that is similar to that of a relational database, which is a familiar technique for many business users. Essentially, it allows all ODBC compliant clients to consume HDFS data through familiar ODBC Data Sources (DSN), thus exposing Hadoop to a wide and diverse range of client applications.

Microsoft provides an ODBC driver for Hive that bridges the gap between Hadoop and SQL Server along with its rich business intelligence and visualization tools. The driver can be downloaded here; it comes in two flavors, 64-bit and 32-bit, so please make sure that you download and install the appropriate driver depending upon the bit configuration of your client application.















Once the driver is installed successfully, perform the following steps to ensure that you can make a successful connection to Hive:

1. First, create a System DSN. In the ODBC Data Source Administrator, go to the System DSN tab and click on the Add button.








2. Select the Hive driver in the next screen of the Create New Data Source wizard.













3. The final ODBC Hive Setup screen asks for the Hadoop cluster details as well as the credentials used to connect to the cluster.

















4. Once created, the new System DSN appears in the list.







5. The quickest way to test this out is from a .udl (Universal Data Link) file. To do this, create an empty .txt file and change the file extension to .udl. Then double-click and open the .udl file; by default it should open with the Microsoft OLE DB Core components.
















Provide the User name and Password for your Hadoop cluster and click on the Test Connection button. A successful test connection will confirm that the Hive ODBC driver is correctly set up.

















Microsoft SQL Server is a complete suite of tools that includes an RDBMS system, multidimensional OLAP and tabular database engines, as well as other services, for example a broker service, a scheduling service (SQL Agent), and many more. One of these BI components is an extract, transform, and load (ETL) tool called SQL Server Integration Services (SSIS). SSIS has the ability to merge structured and unstructured data by importing Hive data into SQL Server, and to apply powerful analytics on the integrated data. Next up will be SSIS as an ETL tool, before Christmas time 🙂

Let me know if you have any further questions; your comments will be a learning point.




Big Data – Hadoop


Hadoop is an open source software framework that supports data-intensive distributed applications, available through the Apache open source community. It consists of a distributed file system, HDFS (the Hadoop Distributed File System), and an approach to distributed processing and analysis called MapReduce. It is written in Java and based on the Linux/Unix platform.

The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. It provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework. This enables applications to work with thousands of computation-independent computers and petabytes of data. The entire Apache Hadoop platform is commonly considered to consist of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects, including Apache Hive, Apache HBase, Apache Pig, ZooKeeper, etc.

The real magic of Hadoop is its ability to move the processing or computing logic to the data where it resides as opposed to traditional systems, which focus on a scaled-up single server, move the data to that central processing unit and process the data there.

This model does not work for the volume, velocity, and variety of data that present-day industry is looking to mine for business intelligence. Hence Hadoop, with its powerful fault-tolerant and reliable file system and highly optimized distributed computing model, is one of the leaders in the Big Data world.

The real power of Hadoop lies in its storage system and its distributed computing model:

Hadoop Distributed File System is a program level abstraction on top of the host OS file system. It is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster.

MapReduce is a programming model for processing large datasets using distributed computing on clusters of computers. MapReduce consists of two phases: dividing the data across a large number of separate processing units (called Map), and then combining the results produced by these individual processes into a unified result set called Reduce.

NameNode
This is the node that manages the HDFS filesystem metadata. It is also called the Head Node/Master Node of the cluster.

Secondary NameNode
This is an optional node that you can have in your cluster to back up the NameNode if it goes down. If a secondary NameNode is configured, it keeps a periodic snapshot of the NameNode configuration to serve as a backup when needed.

DataNodes
These are the systems across the cluster that store the actual HDFS data blocks. The data blocks are replicated on multiple nodes to provide fault-tolerant and highly available solutions.

JobTracker
This is a service running on the NameNode, which manages MapReduce jobs and distributes individual tasks.

TaskTracker
This is a service running on the DataNodes, which instantiates and monitors individual Map and Reduce tasks that are submitted.

Hive is a supporting project for the main Apache Hadoop project and is an abstraction on top of MapReduce, which allows users to query the data without developing MapReduce applications. It provides the user with a SQL-like query language called Hive Query Language (HQL) to fetch data from Hive store.

Pig is an alternative abstraction on top of MapReduce, which uses a dataflow scripting language called Pig Latin. It is favored by programmers who already have scripting skills.

Flume is another open source implementation on top of Hadoop, which provides a mechanism for ingesting data into HDFS as it is generated.

Sqoop provides a way to import and export data to and from relational database tables (for example, SQL Server) and HDFS.

Oozie allows the creation of workflows of MapReduce jobs. It will be familiar to developers who have worked on Workflow Foundation-based solutions.

HBase is the Hadoop database, a NoSQL database. It is another abstraction on top of Hadoop, which provides a near real-time query mechanism for HDFS data.

HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column orientation gives extreme flexibility in storing data, and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data and large indices, and need the flexibility to scale out quickly. HBase is now being used by many other workloads internally at Facebook, and many other companies use this amazing big data technology.

Mahout is a machine-learning library that contains algorithms for clustering and classification.

Good to know Apache Cassandra Terms:
CQL – Cassandra Query Language
RP – Random Partitioner
OPP – Order Preserving Partitioner
BOP – Byte Ordered Partitioner
RF – Replication Factor
CF – Column Family
JSON – JavaScript Object Notation
BSON – Binary JSON
TTL – Time To Live
HDFS – Hadoop Distributed File System
CFS – Cassandra File System
UUID – Universal Unique IDentifier
DSE – Datastax Enterprise
AMI – Amazon Machine Image
OOM – Out Of Memory
SSTables – Sorted String Table
SEDA – Staged Event-Driven Architecture
CRUD – Create Read Update Delete

Big Data Architecture :
Big Data architecture is premised on a skill set for developing reliable, scalable, completely automated data pipelines. That skill set requires profound knowledge of every layer in the stack, beginning with cluster design and spanning everything from Hadoop tuning to setting up the top chain responsible for processing the data.

The main detail here is that data pipelines take raw data and convert it into insight (or value). Along the way, the Big Data engineer has to make decisions about what happens to the data, how it is stored in the cluster, how access is granted internally, what tools to use to process the data, and eventually the manner of providing access to the outside world. The latter could be BI or other analytic tools, the former (for the processing) are likely tools such as Impala or Apache Spark. The people who design and/or implement such architecture I refer to as Big Data engineers.

Facebook has started to use Apache Hadoop technologies to serve real-time workloads: why Facebook decided to use Hadoop technologies, the workloads Facebook runs on real-time Hadoop, the enhancements Facebook made to Hadoop to support those workloads, and the processes and methodologies Facebook adopted to deploy these workloads successfully.

Here is a link to the complete paper for those who are interested in understanding the details of why they decided to use Hadoop: Hadoop_dwarehouse_2010. For a complete reference, please visit the Apache website.

NoSQL Databases

Amazon SimpleDB
Azure Table Storage
Berkeley DB
CDB (Constant Database)
Clusterpoint Server
Couchbase Server
EMC Documentum xDB
Event Store
Execom IOG
Faircom C-Tree
HSS Database
IBM Lotus/Domino
Infinite Graph
VMware vFabric GemFire
Intersystems Cache
ISIS Family
MarkLogic Server
Ninja Database Pro
OpenLink Virtuoso
Oracle NoSQL Database
RDM Embedded
Sones GraphDB
TIBCO Active Spaces
Tokyo Cabinet / Tyrant
U2 (UniVerse, UniData)


Let me know if you have any further questions; your comments will be a learning point.



Installing a Mirror Setup in SQL Server 2012

You can enable SQL Server databases for high availability by setting them up in a mirror configuration. There are two modes of database mirroring – synchronous and asynchronous. With synchronous mirroring, transactions cannot commit on the principal until all transaction log records have been successfully copied to the mirror (but not necessarily replayed yet). This guarantees that if a failure occurs on the principal and the principal and mirror are synchronized, committed transactions are present in the mirror when it comes online – in other words, it is possible to achieve zero data loss.

Synchronous mirroring can be configured to provide automatic failover, through the use of a third SQL Server instance called the witness server (usually hosted on another physically separate server). The sole purpose of the witness is to agree (or not) with the mirror that the principal cannot be contacted. If the witness and mirror agree, the mirror can initiate failover automatically. If synchronous mirroring is configured with a witness, the operating mode is known as high-availability mode and provides a hot standby solution. When no witness is defined, the operating mode is known as high-safety mode, which provides a warm standby solution.

With asynchronous mirroring there is no such guarantee, because transactions can commit on the principal without having to wait for database mirroring to copy all the transaction’s log records. This configuration can offer higher performance because transactions do not have to wait, and it is often used when the principal and mirror servers are separated by large distances (that is, implying a large network latency and possible lower network bandwidth). Consequently, the operating mode is also known as high-performance mode and provides a warm standby solution.

If a failure occurs on the principal, a mirroring failöver occurs, either manually (in the high-performance and high-safety modes) or automatically (only in the high-availability mode). The mirror database is brought online after all the transaction log records have been replayed (that is, after recovery has completed). The mirror becomes the new principal and the applications can reconnect to it. The amount of downtime required depends on how long it takes for the failure to be detected and how much transaction log needs to be replayed before the mirror database can be brought online.

I created three VMs and attached an extra disk to each, as was the case for the two domain controllers (I removed some info due to confidentiality). There are also virtual machines for the two front-end servers, the two application servers, and the three servers to be used by SQL Server. So let's start by installing .NET: go to Server Manager —>


select Manage —> Add Roles and Features. I used the server sp_sql1.








Next —>









Next —>










Next —>











Next —>  Install .NET Framework 3.5 Features











Next —>











Next —>











The installation has completed.


Download and attach media (run from DVD)










SQL Server 2012 SP1 Enterprise Edition.

Attach the downloaded media and run Setup.exe; after a couple of minutes this shows up:











Of course, select the New SQL Server stand-alone installation, then Next —>











After the support file installation, enter the product key.











Read & accept the terms and click Next —>











Select SQL Server Feature Installation, then Next —>
In my case I did not require SSAS, SSIS, or SSRS, but I did select the management tools. If you need those tools in the future, you can always come back and install them.













If all is green, click Next; otherwise the process will block you 🙂











Next —>











Make sure there is enough space for the installation, then Next —>


I kept the default settings, but if you plan to use this server in a mirror setup, I recommend that you use a domain account. It will make setting up security during the mirror configuration much easier, because the local account on Server 1 does not know anything about the local account on Server 2.











Just keep the default values; you can always change them later using SQL Server Configuration Manager. In the Database Engine Configuration dialog, on the Server Configuration tab, keep the default value for the Authentication Mode and click Next.











Select the Data Directories tab and change the Data root directory to the additional disk we attached (drives C, D, E, F:\).











You want to select both the Enable FILESTREAM for Transact-SQL access and the Enable FILESTREAM for file I/O access options.
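If you skip this during setup, the same FILESTREAM settings can be enabled later via T-SQL (access level 2 corresponds to both Transact-SQL and file I/O access):

```sql
-- Enable FILESTREAM for both T-SQL and file I/O access post-install.
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;
```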











Next —>











If all is green in the Installation Configuration Rules, the installation will begin; it takes some time. Everything should be green; if not, fix the errors, then click Next —>




Next —>


Note: repeat the above process for the other two SQL Servers. Once all are installed, we will have the primary, mirror partner, and witness servers, and we are ready to enable the mirror.

However, before actually doing this, we need to install SharePoint. The reason is that the mirror is enabled by backing up and restoring databases, so we need something "in" SQL, so to speak. As I am also a SharePoint admin, there is nothing to worry about; I need to configure the farm site in SharePoint and SQL for this to fully work.

Start with a working SharePoint installation on the primary SQL Server, SP-SQL1 (the name stands for SP = SharePoint and SQL = SQL). Ensure that all SQL logins are present on both the primary and the mirror server, and that all databases are running with the recovery model set to FULL. RDP into SP-SQL1 (the primary) and open SQL Server Management Studio.
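A quick check for the recovery model requirement (AdminContent is one of the databases used below; the ALTER is only needed for databases not already in FULL):

```sql
-- List databases that are not yet in FULL recovery:
SELECT name, recovery_model_desc
FROM sys.databases
WHERE recovery_model_desc <> 'FULL';

-- Switch a database to FULL recovery if required:
ALTER DATABASE AdminContent SET RECOVERY FULL;
```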

Note: to make sure there are no data connections to the SQL Server, you may want to close down the SharePoint site. We use simple syntax; here we go!!!

First take a full backup of the database:

USE master;

BACKUP DATABASE AdminContent
TO DISK = 'D:\BackUp\AdminContent.bak' -- Path can be different i.e. e:\ or f:\

And the log:

USE master;

BACKUP LOG AdminContent
TO DISK = 'D:\BackUp\AdminContent_log.bak'


Copy the files to the mirror partner (SP-SQL2). Ensure they are placed in the same location, e.g. D:\BackUp in the above example. It is not a requirement, but the syntax of the T-SQL is slightly different if the location is different. Connect to the mirror partner (SP-SQL2) from the open Management Studio or RDP into the server and open SSMS from here.

First we restore the database:

USE master;

RESTORE DATABASE AdminContent
FROM DISK = 'D:\BackUp\AdminContent.bak' -- Path can be different i.e. e:\ or f:\
WITH NORECOVERY;

And the log:

USE master;

RESTORE LOG AdminContent
FROM DISK = 'D:\BackUp\AdminContent_log.bak'
WITH FILE=1, NORECOVERY -- it must be restored with NORECOVERY for mirroring.

SQL Server AlwaysOn technology is a good high availability option compared with other high availability features like mirroring, log shipping, and clustering. SQL Server 2012 now has AlwaysOn technology, where the database can keep mirror copy/copies that are constantly being refreshed. In the event of a failure of the primary, it will seamlessly fail over to another copy. SQL Server 2012 AlwaysOn employs mirroring and clustering together to achieve high availability.

SQL Server 2012 Always On — SQL Server Mirroring

Traditional SQL Server Mirroring has significant disadvantages:

* In traditional mirroring we cannot manage groups of databases so that, when a single database fails, they all fail over together.
* There is only one mirror of the database allowed.
* The mirror is not usable unless you are prepared to work with snapshots.

With SQL Server 2012 AlwaysOn technology, SQL Server has an Availability Group option with which databases can be grouped together. When a failure happens, the entire group of databases is treated as one: when one fails over, they all fail over. This is very powerful when you have an application that uses more than one database in an instance.
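As a sketch, grouping two databases so they fail over together looks like this, assuming the AlwaysOn prerequisites (a Windows failover cluster and the HADR feature enabled) are already in place; the group name, second database, and endpoint URLs are hypothetical:

```sql
-- Create an availability group containing two databases.
CREATE AVAILABILITY GROUP AG_SharePoint
FOR DATABASE AdminContent, ContentDB1
REPLICA ON
    'SP-SQL1' WITH (ENDPOINT_URL = 'TCP://SP-SQL1.domain.local:5022',
                    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                    FAILOVER_MODE = AUTOMATIC),
    'SP-SQL2' WITH (ENDPOINT_URL = 'TCP://SP-SQL2.domain.local:5022',
                    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                    FAILOVER_MODE = AUTOMATIC);
```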

SQL Server 2012 Always On — SQL Server Log Shipping.

Log shipping is a sort of manual mirroring which allows more than one replica to be kept, perhaps a local one and a remote one. It is more difficult to set up, and failover is not automatic. The biggest downfall is that we cannot use the receiving database while the transaction logs are being applied, since it is in a recovery state or read-only. With SQL Server 2012 AlwaysOn technology, all the databases can fail over automatically.

SQL Server Always On Versus SQL Server Clustering.

With SQL Server clustering, you are dealing with a shared storage system that adds cost and complexity to the equation. AlwaysOn doesn't need shared storage, so it is more cost effective.

Now we are ready to enable mirroring: in SSMS, right-click on one of the databases and select Tasks and then Mirror. Don't forget to Configure Security in Database Properties. The first step in the configuration wizard is to decide whether or not to use a witness server; here we are using one. Good luck 🙂
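The wizard can also be bypassed with T-SQL once the mirroring endpoints exist on all three instances; the endpoint URLs and port below are hypothetical, and note that the mirror side must be configured before the principal:

```sql
-- On the mirror server (SP-SQL2), point back to the principal:
ALTER DATABASE AdminContent
    SET PARTNER = 'TCP://SP-SQL1.domain.local:5022';

-- On the principal server (SP-SQL1), point to the mirror:
ALTER DATABASE AdminContent
    SET PARTNER = 'TCP://SP-SQL2.domain.local:5022';

-- Optionally add the witness to enable automatic failover:
ALTER DATABASE AdminContent
    SET WITNESS = 'TCP://SP-SQL3.domain.local:5022';
```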

Let me know if you have any further questions; your comments will be a learning point.



SQL Server vs MongoDB vs MySQL vs Hadoop and HBase NoSQL


Microsoft SQL Server is one of the mainstream databases used in most operational systems built using the Microsoft technology stack. One of its biggest shortcomings is the inability to support horizontal scaling/sharding. So the next logical choice nearest to SQL Server would be MySQL.

If you are looking for horizontal scaling/sharding, that would mean that you are gearing up to deal with Big Data. MongoDB is arguably the first logical step into the NoSQL world if someone is considering experimenting with NoSQL to handle Big Data. At that stage, one is faced with the requirement to compare all these databases.


Many forms of analysis depend upon third-party data, and platforms like Windows Azure Marketplace are based on the same principle.

Social analytics is widely used to forecast the impact on a business and extract insights to act on it. The interesting question here is: what data source can be used to calculate/derive the sentiments of customers related to the respective business? A majority of this data would come from social/professional/collaboration forums. Examples of such sources are Facebook, YouTube, Twitter, LinkedIn, Pinterest, IMDb, blogs, etc. Anyone would agree that the analytics derived from unstructured data created by public interaction on social media can be expected to be much closer to precision than any data mining algorithm. But the big issue here is the amount of data – very, very big data. On a daily basis, there are 400 million tweets, 2.7 billion Facebook likes, and 2 billion YouTube views. Even these figures might be outdated today.

Say an organization is inspired by the SharePoint 2013 enhancements related to social media collaboration and intends to add sentiment analysis to its client offering. Let’s say that, as a starting point, Twitter is selected as the data source, and all public tweets for a particular product will be analyzed and the results stored for future use.

The first challenge is that, according to one study, Twitter generates approximately 1 billion tweets in less than 3 days. How do you process such a huge amount of unstructured data, and what kind of infrastructure is required to handle that processing? To proceed with the case study, let’s say that we live in the age of the cloud, so we just signed up on AWS and provisioned a beefy Amazon EMR cluster that uses Hadoop and the HBase NoSQL database.

Elasticsearch is also a very powerful value addition to any relational DBMS like SQL Server, Oracle, or DB2, provided it is used wisely. Before we look at how to use elasticsearch with SQL Server, we should look at why to use elasticsearch with SQL Server. This question holds the key to the answer.
SQL Server holds data either in relational form or in multi-dimensional form (through SSAS). Full-Text Search (FTS) in SQL Server provides some out-of-the-box search features, but when search queries require exhaustive searching over huge datasets, with added complexity in the search definition itself, the performance impact is evident. Elasticsearch is primarily a search engine, but loaded with features like facets and the aggregation framework, it helps solve many data analysis problems. For example, every one of us has visited sites like Amazon.com, eBay.com, or Flipkart.com. Whenever we search for a product, the site builds all the dynamic categories, ranges, and values on the fly. For such features, a product like elasticsearch can be extremely helpful. One such real project example can be read from here.
One of the best explanations on setting up an elasticsearch JDBC river with SQL Server can be read from here.

One point to keep in mind is that if you set up a river and then restart the elasticsearch server, the river will execute its configured query again. This could reload the entire dataset into the index. If the IDs are fetched from the source, all existing records simply get updated; but if IDs are auto-generated in elasticsearch, the reload creates new records, which ultimately leads to duplicate data. So use the river cautiously. You can also delete the river once the data is loaded into the index, if it is a one-time data migration.
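As a sketch, the SELECT statement fed to the JDBC river might look like the following. The table and column names are hypothetical; the important point is to fetch a stable source ID as `_id`, so that a re-run of the river updates existing documents instead of creating duplicates:

```sql
-- Expose a stable source key as _id so that re-running the river
-- overwrites documents rather than inserting duplicates.
SELECT ProductId AS _id,
       ProductName,
       Category,
       UnitPrice
FROM dbo.Products;
```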





High Availability | Disaster Recovery solutions in SQL 2012


With SQL 2012 and Windows 2012, the landscape of available options for SQL HA-DR architectures has improved significantly. Below is a summary of the high availability and disaster recovery solutions available for a SQL environment.

To start with, it is important to outline some key terminology.

*       RTO (Recovery Time Objective): The duration of acceptable application downtime, whether from unplanned outage or from scheduled maintenance/upgrades. The primary goal is to restore full service to the point that new transactions can take place.

*       RPO (Recovery Point Objective): The ability to accept potential data loss from an outage. It is the time gap or latency between the last committed data transaction before the failure and the most recent data recovered after the failure. The actual data loss can vary depending upon the workload on the system at the time of the failure, the type of failure, and the type of high availability solution used.

*       RLO (Recovery Level Objective): This objective defines the granularity with which you must be able to recover data, whether the whole instance, a database or set of databases, or specific tables.

*       High Availability (HA): The principal goal of a high availability solution is to minimize or mitigate the impact of downtime.

*       Disaster Recovery (DR): Disaster recovery efforts address what is done to re-establish availability after an outage. It refers to restoring your systems and data to a previous acceptable state in the event of partial or complete failure of computers due to natural or technical causes.

*       Quorum: Generally speaking, a quorum is the minimum number of members of an assembly, who must be present before the members can conduct business. The requirement for a quorum protects against unrepresentative action in the name of the body by a disproportionately small number of individuals (adapted from Bing Dictionary and Wikipedia).

*   In the context of Windows Clustering, the quorum configuration determines the number of failures the cluster can sustain.

*   In the context of SQL database mirroring or availability groups with automatic failover, quorum makes sure that a database is owned by only one partner or replication at a time.
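In SQL 2012 you can inspect the quorum configuration of the underlying Windows cluster directly from T-SQL. A minimal sketch using the built-in HADR DMVs:

```sql
-- Quorum type and state of the Windows cluster hosting this instance.
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each cluster member and how many quorum votes it holds.
SELECT member_name, member_type_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```

Both views return no rows if the instance is not part of a Windows cluster.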

Single Site SQL Failover Clustered Instance

1.     Primarily an HA feature

2.     Uses time tested Windows Clustering technology. Advanced heartbeat, failover, quorum model, and quorum voting options available in the latest versions of Windows.

3.     Protects against Machine failures (ex. CPU/motherboard causes machine to become unresponsive or failed)

4.     Protects against OS failures (ex. blue screens)

5.     Protects against SQL instance failures (ex. SQL hangs / access violations)

6.      Application connects using one virtual server name, which is not tied to a specific machine name. The current owning node is abstracted via the virtual server name.

7.     Works well with most other features such as log shipping, replication, and asynchronous availability groups

8.     Manages external dependencies well upon failover. All system and user databases reside on a shared drive – failover simply provides access to the same shared drives to the new owning node. Registry keys are replicated via cluster checkpoints.

9.       Supports rolling upgrades scenarios

10.       Instance-level protection without data redundancy (Instance [-]DATA RLO)

11.     Considerations:

a   There is no concept of a secondary database. It does not maintain a redundant copy of the data, and so does not protect against an I/O subsystem failure

b   No special requirements with respect to database recovery models

c   Must ensure nodes are maintained properly (patching levels should match)

d   A major outage like a datacenter power failure, or failure of the network link to the primary data center is not addressed because all nodes are within the same datacenter.
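You can verify the nodes of a clustered instance and the current owning node from T-SQL; a quick sketch:

```sql
-- Nodes that can host this clustered SQL Server instance.
SELECT NodeName FROM sys.dm_os_cluster_nodes;

-- The node currently running the instance (behind the virtual server name).
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentOwningNode;
```

On a non-clustered instance, `sys.dm_os_cluster_nodes` simply returns no rows.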

Database Mirroring (with Automatic Failover):

1.     Primarily a DR solution. Can serve as HA when used with the synchronous/automatic failover options.

2.     Protects against I/O subsystem failure on the primary server or datacenter

3.     Log stream compression available for increased performance

4.     Automatic page repair

5.     Automatic failover is possible, but requires a third witness server

6.      Automatic redirection to secondary is possible

7.     Can choose between several mirroring and failover options: synchronous or asynchronous streaming, automatic or manual failover

8.     DB Level protection (database RLO)

9.       Considerations:

*   Database must be using the FULL recovery model

*   Database on secondary is not writeable

*   Database on secondary is not readable (can create database snapshots to work around this, but this can quickly become burdensome)

*   Additional management overhead of third witness server

*   Deprecated in SQL 2012

*   Only 1 secondary allowed

*   Can only failover at database granularity – cannot group a set of databases to failover together

*   If automatic client redirection is required, the client must change the connection string to include the new FAILOVER_PARTNER option. Also, the client connectivity components must be at a version which supports the new connection string – so may not be suitable for legacy applications.

*   Database failover occurs at the user database level, not at the server level. Note that the system databases master, tempdb, and model cannot be mirrored and that only one secondary database is supported with database mirroring in SQL Server 2008. Databases have dependencies on objects that reside in other databases such as the master database. These objects often include logins, jobs, certificates, and server audits but can include other objects. A process (normally involving scripts) needs to be adopted to keep objects that are external to the database synchronized across the servers. Suitable for environments where these external dependencies are not changed very often (which tends to be true in most environments).

*   Some features are not supported, such as cross database transactions.

*   Special configuration considerations when using Replication. Visit here for more info
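Once mirroring is running, its state can be monitored from the catalog views; a minimal sketch:

```sql
-- Role, state, and partner for every mirrored database on this instance.
SELECT DB_NAME(database_id) AS database_name,
       mirroring_role_desc,
       mirroring_state_desc,
       mirroring_partner_instance
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;   -- filter out non-mirrored databases
```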

Log Shipping in 2012


1.     Primarily a DR solution

2.     Can use compressed backups

3.     Very good way to validate the transaction log backups you are already taking

4.     Can run in a delayed restore mode to protect the secondary from human error on the primary site (ex. if TableX was deleted on the primary by mistake and there is a 2-hour delay, you may have enough time to recover TableX from the secondary)

5.     Database Level protection – RLO

6.     Can have multiple secondaries

7.     Very good option when other DR options are not possible or supported by a vendor since it is essentially a constant transaction log backup and restore sequence.

8.     Considerations are as follows:

**   Database must be in Full or Bulk Logged recovery model

**   No automatic failover

**   No automatic redirection

**   Can only failover at database granularity – can’t group a set of databases to failover together

**   Database on secondary not readable during restore

**   Database on secondary not writeable

**   Databases have dependencies on objects that reside in other databases such as the master database. These objects often include logins, jobs, certificates, and server audits but can include other objects. A process normally involving scripts needs to be adopted to keep objects that are external to the database synchronized across the servers. Suitable for environments where these external dependencies are not changed very often which tends to be true in most environments.
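Log shipping is essentially a constant backup/copy/restore sequence of transaction log backups. Stripped of the agent jobs that automate it, the core of each cycle looks like this (the database name and share path are hypothetical):

```sql
-- On the primary: back up the transaction log (compressed) to a shared folder.
BACKUP LOG SalesDB
TO DISK = N'\\FileShare\LogShip\SalesDB_20140101.trn'
WITH COMPRESSION;

-- On the secondary: restore the copied backup, leaving the database
-- ready to accept the next log backup in the chain.
RESTORE LOG SalesDB
FROM DISK = N'\\FileShare\LogShip\SalesDB_20140101.trn'
WITH NORECOVERY;
```

Using WITH STANDBY instead of NORECOVERY would leave the secondary readable between restores, at the cost of extra work per cycle.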

Peer-to-Peer Transactional Replication:


1.     Primarily a scalability solution

2.     With Peer-To-Peer replication, we have the capability to keep multiple copies of the data on multiple sites, providing a scale-out, HA (if bandwidth permits) and DR solution. Because data is maintained across the nodes, peer-to-peer replication provides data redundancy, which increases the availability of data.

3.     Database can be in Simple recovery model

4.     Database on secondary is both readable and writeable

5.     No automatic redirection

6.      No concept of a failover since all  “nodes” are always “active”

7.     Can have multiple nodes participate in the topology

8.     Can choose to replicate only a subset of tables within the database (table level or filtered rows from table RLO)

9.       Since all databases are writeable, you can create different indexes optimized for reporting environment, optimal solution if you are running very expensive reporting queries which need custom indexes

10.       Considerations are as follows:

**   Client will have to change connection string to an available node if the node it is connected to becomes unavailable.

**   Replication is mainly intended for reporting offloading. Therefore the architecture builds an interim queue in the distribution database, causing higher latencies than would be observed with the Availability Groups architecture.

**   Does not handle conflicts gracefully, application owner will need to ensure data modified at different sites does not conflict (app should logically partition the updates). P2P replication will simply overwrite data (lost updates) unless configured to fail when a conflict occurs, which requires manual intervention Visit here to see more info.

**   Does not support initialization and re-initialization via a snapshot

**   Replicated tables must have a primary key
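Peer-to-peer publications are ordinary transactional publications flagged for P2P. A hedged sketch of the key step (the database and publication names are hypothetical, and the article and subscription steps are omitted):

```sql
-- Create a transactional publication enabled for peer-to-peer.
-- Conflict detection is turned on so conflicting updates fail loudly
-- instead of silently overwriting data (lost updates).
USE SalesDB;
EXEC sp_addpublication
    @publication = N'P2P_Sales',
    @enabled_for_p2p = N'true',
    @p2p_conflictdetection = N'true',
    @allow_initialize_from_backup = N'true';  -- P2P cannot initialize from a snapshot
```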

Multi-Site Clustering with SAN Replication:


1.     HA and DR solution using Failover Clustering in combination with SAN replication

2.     Protects against I/O subsystem failure on the primary datacenter

3.     Data disks can be synchronously or asynchronously mirrored

4.     External dependencies handled well as in typical single site failover clusters

5.     Instance-level protection WITH  data redundancy (Instance [+] DATA RLO)

6.     MS SQL 2012 supports multi-subnet environments, eliminating the need for a VLAN

7.     Considerations are as follows:

*   Additional expense of SAN replication technology

*   Additional complexity of managing and configuring SAN  replication technology

*   Quorum model and node vote options should be configured to ensure an outage at the DR site will not cause the cluster to lose quorum.

Multisite Clustering with NO SAN Replication and with NO shared storage:

1.     SQL 2012 supports multi-subnet environments, eliminating the need for a VLAN

2.     Will become even more relevant:

*      SQL 2012 now supports databases on SMB

*      Windows 2012 clusters support 64 nodes

*      Hyper-V Replicas feature on Windows 2012

*      Hyper-V support for SMB

3.     Used often to support DR within a SQL 2012 Availability Group topology.

SQL Availability Groups:

1.     HA (synchronous, local datacenter) and DR (asynchronous, remote datacenter) solution

2.     Protects against I/O subsystem failure on the primary server or datacenter

3.     Log Stream compression available for increased performance

4.     Automatic Page Repair

5.     Automatic redirection to secondary is possible via an AG listener

6.     Automatic failover is possible and does not require a third witness server; the Windows cluster is used for quorum

7.     Up to 4 secondaries

8.     Quorum is based on all nodes of the underlying Windows Cluster

9.     Can failover a set of databases as a group (database set RLO)

10.    Can choose between several mirroring and failover options: synchronous or asynchronous streaming, automatic or manual failover

11.    Can offload T-Log backups to a secondary

12.    Considerations are as follows:

*   Database must be using the FULL recovery model

*   Database on secondary is not writeable

*   Database on secondary can be set as readable (Readable or READ-INTENT secondary options)

*   Note that the system databases master, tempdb, and model cannot be part of an AG. Databases have dependencies on objects that reside in other databases such as the master database. These objects often include logins, jobs, certificates, and server audits but can include other objects. A process (normally involving scripts) needs to be adopted to keep objects that are external to the database synchronized across the servers. Suitable for environments where these external dependencies are not changed very often (which tends to be true in most environments).

*   Some features are not supported, such as cross database transactions visit here

*   Special configuration considerations when using Replication visit here

*   Quorum model and node vote options should be configured to ensure an outage at the DR site will not cause the cluster to lose quorum.
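The health of an availability group can be checked from the HADR DMVs; a minimal sketch:

```sql
-- Role and synchronization health of every replica in every AG
-- visible from this instance.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
     ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
     ON ar.replica_id = ars.replica_id;
```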


–        High Availability and Disaster Recovery (OLTP)—a Technical Reference Guide for Designing Mission-Critical OLTP Solutions

–        High Availability Solutions (SQL Server)

–        High Availability and Disaster Recovery at ServiceU: A SQL Server 2008 Technical Case Study

–        Microsoft SQL Server AlwaysOn Solutions Guide for High Availability and Disaster Recovery

–       Quorum: How a Witness Affects Database Availability (Database Mirroring )

–        Quorum vote configuration check in AlwaysOn Availability Group Wizards (Andy Jing)

–        WSFC Quorum Modes and Voting Configuration (SQL Server)

–        Cross-Database Transactions Not Supported For Database Mirroring or AlwaysOn Availability Groups (SQL Server)

–        Database Mirroring (OLTP)—a Technical Reference Guide for Designing Mission-Critical OLTP Solutions

–        How to create multiple listeners for same availability group (Goden Yao)

–        Peer-to-Peer Transactional Replication

–        Selecting a High Availability Solution

–        Automatic Page Repair (Availability Groups/Database Mirroring)

–        http://blogs.technet.com/b/josebda/archive/tags/smb3/

–        MMS 2013 Demo: Hyper-V over SMB at high throughput with SMB Direct and SMB Multichannel

–        Configure Microsoft SQL Server to Use Scale-Out File Server

–        Support policy for Microsoft SQL Server products that are running in a hardware virtualization environment




Famous Architectures in IT


If you want to really learn scalability and performance, take a look at the top architectures in the IT world listed below.


1) WhatsApp Architecture


2) Flickr Architecture



3) Amazon Architecture



4) Stack Overflow Architecture


5) Google Architecture


6) YouTube Architecture


7) Pinterest Architecture


8) Twitter Architecture


9) Instagram Architecture


10) Facebook Architecture


11) Akamai Network Architecture



12) TripAdvisor Architecture


13) Prismatic Architecture


14) Salesforce Architecture



15) Mailbox Architecture



16) Cinchcast Architecture


17) Stubhub Architecture


18) ESPN Architecture


19) AOL Architecture


20) Netflix Architecture


21) DataSift Architecture


22) Justin.tv Architecture


23) Playfish Architecture



24) FarmVille Architecture



Experience is the biggest teacher, and no book or coach can teach better than learning from experience. It’s never too late to learn more 🙂



