SQLDEBACLE DOT COM SQL, Business Intelligence, NoSQL

1Nov/110

Data Mining 101

I am taking a class in data mining and learned a few things I wanted to share. As you might know, data mining is a process of finding previously unknown, non-trivial patterns from high volumes of data. It is highly used in industry where actionable results are required to make business decision. I will start with two different process of data mining.

Unsupervised Data Mining: There is no target variable. One of the examples is Association Analysis (AA), which is also called Market Basket Analysis. Association analysis basically tells you what things occur together.

Supervised Data Mining: In this technique, target variable is present. For instance: neural network and decision tree. Decision tree has a target variable set, and it helps us understand the path is takes to reach to the leaf from the root note.

17May/110

Auto memory management in Cluster

I was musing into a problem. We currently have our production server in a two node cluster that has 128GB of RAM each. 1st node contains 4 instances of SQL Server running, and the 2nd node contains 1 instance of SQL Server running. These are active/active clusters, meaning that in the event of failure, the other node takes on the processing, which ultimately degrades the performance. In the SQL Server running into 2nd node, we use some 84GB of memory and leave the remaining for failure events and the OS. In the 1st node though, we leave more than 90 GB for Node2 to be able to failover and not cause memory problems. I totally agree with the concept of high availability, but, for me, leaving so much amount of memory is just waste of resources.

We are adding additional node now, and the configuration will slightly change to active/active/passive clustering. I gathered my thoughts around the idea of writing a stored proc to manage the memory and do some math and decided to poke around with auto-memory management and startup stored procedure to achieve what I wanted to do so that when a fail over happens, and the SQL Server shifts to another node, the startup stored proc checks to see which node the server is on, and sets the right amount of memory.

Please do not use it in production without thoroughly verifying that this is not going to cause any problems. Startup stored procs has to be created on the master database (master is the first one which recovers after a restart), and cannot contain any input or output variables. There are other gotchas too.

CREATE proc dbo.dba_startup
AS
SET NOCOUNT ON;
DECLARE @hostName VARCHAR(100)
DECLARE @tbl TABLE
(
      hostName VARCHAR(100)
)
 
INSERT INTO @tbl
EXEC xp_cmdshell 'hostname'
 
DELETE FROM @tbl
WHERE hostName IS NULL
 
 
 
 
SELECT @hostname = hostname FROM @tbl
 
--SELECT @hostName = HOST_NAME()
 
IF @hostName = 'Node1'
BEGIN
 
            EXEC sys.sp_configure 'show advanced options', 1;
            RECONFIGURE
            EXEC sys.sp_configure 'max server memory', 20000;
            RECONFIGURE
            EXEC sys.sp_configure 'min server memory', 20000;
            RECONFIGURE
            EXEC sys.sp_configure 'show advanced options', 0;
            RECONFIGURE
 
 
END
IF @hostName = 'Node2'
BEGIN
      EXEC sys.sp_configure 'show advanced options', 1;
      RECONFIGURE
      EXEC sys.sp_configure 'max server memory', 2147483647;
      RECONFIGURE
      EXEC sys.sp_configure 'min server memory', 0;
      RECONFIGURE
      EXEC sys.sp_configure 'show advanced options', 0;
      RECONFIGURE
END
 
 
SET NOCOUNT OFF;
 
 
--Configuring this as startup stored proc
EXEC dbo.sp_procoption N'dba_startup', 'startup', 'on'

17May/110

Tracking Database Growth and Trends over Time

I was looking for a solution where database sizes and growth (and autogrowth) can be tracked over a time period so that I can estimate the size requirements, see when the growth happened, and trace it down to an event, job, or regular operations. Autogrowth can be tracked via querying default traces, but default traces are valid only for a certain period of time, and does not track the database sizes, free spaces, etc, for a long period of time. In order to accommodate all of that, I decided to look for around to see if anyone has a custom solution or at least partial solution of what I wanted to do. Found a post at MSSQLTips that had similar implementation. My changes were to create a table and collect information there, run the collector as a stored proc, and write a query to check the autogrowth between 2 consecutive days.

Found the author: Tim Ford (b, t). Original post was at: http://www.mssqltips.com/tip.asp?tip=1426

I made a few changes to the original script, and put together a little bit more detailed mechanism, which collects information, and allows you to query it in different ways to get the result you want.

USE [DATABASE]
GO
 
CREATE TABLE [dbo].[DatabaseSpace](
	[DatabaseSpaceID] [INT] IDENTITY(1,1) NOT NULL,
	[ServerName] [VARCHAR](100) NULL,
	[DatabaseName] [VARCHAR](100) NULL,
	[FileSizeMB] [INT] NULL,
	[LogicalFileName] [sysname] NOT NULL,
	[PhysicalFileName] [nvarchar](520) NULL,
	[STATUS] [sysname] NOT NULL,
	[Updateability] [sysname] NOT NULL,
	[RecoveryMode] [sysname] NOT NULL,
	[FreeSpaceMB] [INT] NULL,
	[FreeSpacePct] [VARCHAR](7) NULL,
	[FreeSpacePages] [INT] NULL,
	[PollDate] [datetime] NULL,
 CONSTRAINT [PK_DatabaseSpace] PRIMARY KEY CLUSTERED 
(
	[DatabaseSpaceID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

--Stored proc creation. This stored prod runs everyday via SQL Server Agent Job
USE [DATABASE]
GO
/****** Object:  StoredProcedure [dbo].[dba_DatabaseSpace]    Script Date: 05/16/2011 23:18:24 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER proc [dbo].[dba_DatabaseSpace]
AS
SET NOCOUNT ON;
 
DECLARE @command VARCHAR(5000)  
 
SELECT @command = 'Use [' + '?' + '] SELECT  
@@servername as ServerName,  
' + '''' + '?' + '''' + ' AS DatabaseName,  
CAST(sysfiles.size/128.0 AS int) AS FileSize,  
sysfiles.name AS LogicalFileName, sysfiles.filename AS PhysicalFileName,  
CONVERT(sysname,DatabasePropertyEx(''?'',''Status'')) AS Status,  
CONVERT(sysname,DatabasePropertyEx(''?'',''Updateability'')) AS Updateability,  
CONVERT(sysname,DatabasePropertyEx(''?'',''Recovery'')) AS RecoveryMode,  
CAST(sysfiles.size/128.0 - CAST(FILEPROPERTY(sysfiles.name, ' + '''' +  
       'SpaceUsed' + '''' + ' ) AS int)/128.0 AS int) AS FreeSpaceMB,  
CAST(100 * (CAST (((sysfiles.size/128.0 -CAST(FILEPROPERTY(sysfiles.name,  
' + '''' + 'SpaceUsed' + '''' + ' ) AS int)/128.0)/(sysfiles.size/128.0))  
AS decimal(4,2))) AS varchar(8)) + ' + '''' + '%' + '''' + ' AS FreeSpacePct,  
GETDATE() as PollDate FROM dbo.sysfiles'  
INSERT INTO dbo.DatabaseSpace  
   (ServerName,  
   DatabaseName,  
   FileSizeMB,  
   LogicalFileName,  
   PhysicalFileName,  
   STATUS,  
   Updateability,  
   RecoveryMode,  
   FreeSpaceMB,  
   FreeSpacePct,  
   PollDate)  
EXEC sp_MSForEachDB @command  
SET NOCOUNT OFF;

--Sample Query to get the database growth between two consecutive days.
--Author: Subhash Pant
SELECT  
        a.DatabaseName ,
        a.FileSizeMB AS previousSize,
        a.LogicalFileName ,
        a.PhysicalFileName ,
        a.FreeSpaceMB ,
        a.FreeSpacePct ,
        a.PollDate,
        c.FileSizeMB AS newSize,
        c.PollDate,
        isGrowth = CASE WHEN a.FileSizeMB < c.FileSizeMB THEN 1 ELSE 0 END
FROM dbo.DatabaseSpace a 
CROSS APPLY
(
SELECT TOP 1 *
FROM dbo.DatabaseSpace b
WHERE a.PollDate < b.PollDate
AND a.DatabaseName = b.DatabaseName
AND a.LogicalFileName = b.LogicalFileName
ORDER BY b.PollDate
) c

Found the author: Tim Ford (b, t). Original post was at: http://www.mssqltips.com/tip.asp?tip=1426

11May/110

SSIS Lookup Transformation

I encountered a weird problem with SSIS lookup transformation a few weeks back and today when working on a DW project. Lookup transformation, when run in full cache mode, caches the results during the pre-execution of the data flow into the memory. I prefer to use this method because Merge Join needs a sorted data set. The cost of sorting, whether we do it on SQL or SSIS, is expensive.

So, I was trying to join on smart key we were using on one of our legacy systems. This transformation did a looking for AccountID from the payment table, using the smart key (which is AlternateAccountID), and put the values in the payment table (child table). I found more than 1000 records for which I could not get any match. Before mapping this to the unknown dimension (-1), I decided to poke around and see why such a large result set would be present.

Ran a query against this orphaned data and the base table to see that many results are matching. It immediately hit my mind that SSIS lookup transformation, when run in Full Cache Mode is case sensitive, unlike SQL Statements. After looking at the speed of Partial Cache and No Cache Mode, I decided to change the smart keys to UPPER(smartkeys) in T-SQL. Presto! This seemed to bring down the problem to 200 orphaned records.

To sum it all up, SSIS Lookup Transformation when run in Full Cache Mode is case sensitive.

3May/112

Scripting out composite foreign keys

For a while now, we have been maintaining our production database which feeds thousand of websites by utilizing home grown log shipping process + sync process. Our log shipping process occurs every 15 minutes, and sync process occurs every week. This sync process ensures that all the websites has the most current information, and fix any errors from the past, and expire old products. Before we sync the old database, we truncate/delete all the tables. We use a combination of:

EXEC sp_MSForEachTable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL'
GO
 
EXEC sp_MSForEachTable --'DELETE TABLE ?'
'BEGIN TRY
TRUNCATE TABLE ?
END TRY
BEGIN CATCH
DELETE FROM ?
END CATCH;' 
GO
 
-- enable referential integrity again
EXEC sp_MSForEachTable 'ALTER TABLE ? CHECK CONSTRAINT ALL'
GO

Although this disables all the constraint and allows you to delete from tables where foreign key is present, it still does not allow you to truncate the tables (parent tables) until the child table data is deleted. I decided to tweak this process a bit so that we could use the unlogged truncate operation in order to save time and decrease the resources needed.

I found a lot of scripts to script out the foreign keys, but I could not find any which worked on a composite key; because the scripts would provide column names for the same relationships in different rows. So, decided to write a script which will do this. Please freely use this, modify it to your needs, and let me know if there are any bugs so that I can update them.

 
--Build Common Table Expression which can be used multiple times
;WITH table_cte(ForeignKeyName, parentTableName, primaryKeyList, ChildTableName, ForeignKeyList)
AS
(
SELECT DISTINCT
        a.name AS ForeignKeyName 
        , OBJECT_NAME(a.referenced_object_id) AS parentTableName
        , d.name AS PrimaryKeyList 
        , OBJECT_NAME(a.parent_object_id) AS ChildTableName
        , c.name AS ForeignKeyList
 
FROM    sys.foreign_keys a
        JOIN sys.foreign_key_columns b ON a.object_id = b.constraint_object_id
        JOIN sys.COLUMNS c ON b.parent_column_id = c.column_id
                              AND a.parent_object_id = c.object_id
        JOIN sys.COLUMNS d ON b.referenced_column_id = d.column_id
                              AND a.referenced_object_id = d.object_id
 
)
SELECT DISTINCT ForeignKeyName ,
        parentTableName ,
        --primaryKeyList ,
        ChildTableName ,
        --ForeignKeyList,
         PrimaryKeys = LEFT(col.list, LEN(col.list) -1)
        , ForeignKeys = LEFT(ab.list, LEN(ab.list) -1)
INTO #temp
FROM table_cte t
CROSS APPLY
(
       SELECT QuoteNAME(CONVERT(VARCHAR(100), d.Name)) + ', ' AS [text()]
       FROM   sys.foreign_keys a
       JOIN sys.foreign_key_columns b
                  ON a.object_id=b.constraint_object_id
       JOIN sys.COLUMNS c
                  ON b.parent_column_id = c.column_id
                AND a.parent_object_id=c.object_id
       JOIN sys.COLUMNS d
                  ON b.referenced_column_id = d.column_id
               AND a.referenced_object_id = d.object_id
          WHERE t.ForeignKeyName = a.name
          ORDER BY b.constraint_object_id
           FOR XML PATH('')
) AS col(list)
CROSS APPLY
(
 
       SELECT QuoteNAME(CONVERT(VARCHAR(100), c.Name)) + ', ' AS [text()]
       FROM   sys.foreign_keys a
       JOIN sys.foreign_key_columns b
                  ON a.object_id=b.constraint_object_id
       JOIN sys.COLUMNS c
                  ON b.parent_column_id = c.column_id
                AND a.parent_object_id=c.object_id
       JOIN sys.COLUMNS d
                  ON b.referenced_column_id = d.column_id
               AND a.referenced_object_id = d.object_id
          WHERE t.ForeignKeyName = a.name
           ORDER BY b.constraint_object_id
           FOR XML PATH('')
) AS ab(list)
 
SELECT 'ALTER TABLE ' + ChildTableName + ' WITH CHECK ADD CONSTRAINT ' + ForeignKeyName + ' FOREIGN KEY (' + ForeignKeys + ') REFERENCES ' + ParentTableName + ' (' + PrimaryKeys + ')'
,ForeignKeyName ,
        parentTableName ,
        ChildTableName ,
        PrimaryKeys ,
        ForeignKeys
FROM #temp
 
 
--Work completed. Drop table.
DROP TABLE #temp

You can find the drop constraint scripts all throughout. But I could not find one which allows you to script out composite foreign keys.

Please be sure to make a copy of all the constraints above before you go and run the drop statements from below:

SELECT 'ALTER TABLE ' + TABLE_NAME + ' DROP CONSTRAINT ' + CONSTRAINT_NAME
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE CONSTRAINT_TYPE = 'Foreign Key'

Happy T-SQLing!!

Filed under: SQL Server, T-SQL 2 Comments
19Feb/110

Handling Duplicates through SSIS

At my work, we process a lot of data from raw data sources. Data feeds, IIS Log files, etc, etc are few ones. To get accurate information out of these requires a lot of manual labor. Vendors change their names, columns, data types, etc, etc all the time causing packages to break. Packages throwing error is often fine, because it actually threw an error message and stopped. I can come in the morning, troubleshoot, and ensure that right data gets into the database. However, if the raw data has duplicates, chances are that they wont break the package or throw any kind of error message and we wont be able to know about it. Finding this out at a later time will become more pain in the rear as you have to go and find out where the duplicate data is, and ensure that you only delete that. Almost have to re-invent the wheel....

Today, I had to design a package to handle log files from one of our vendors. We fetch the data using iMacro because they don't have any API, or any mechanism to send data to us. Basically what happens is, the iMacro runs, logs in to their website, choose the date range and extract the data in the form of CSV file, and saves it to one of our network location. Where it gets tricky is that it gives us the ability to select the date range. If the macro runs at 11 p.m. and gets all data, we will miss an hour worth of logs for the given day and so on. So I decided to use the Checksum transformation (Available from http://www.konesans.com/products.aspx). I designed the package which looks like this:

SSIS Duplicates

So when I created the table, I created with IGNORE Duplicate ON on the index. I pass certain rows which has chances of changing, and pass it through the checksum transformation which generates a checksum (int,bigint). I use this checksum as my primary key. It ignores all the duplicate records. This ensures that there is no duplicate data inserted into the table.

Hope this helps. Happy ETLing..

Filed under: SSIS No Comments
30Jan/110

Managing reading multiple blogs

I usually find things here and there, and bookmark them. For this, I use Google Chrome because it syncs my bookmarks etc based on my Google sign in, and no matter which computer I am logged in to, I find my best resources etc, easily. I usually read about Performance Tuning (SQL) from Kimberly Tripp (of SQLSkills), and about the internals from Brent Ozar and Paul Randall. Also, I read from many BI bloggers including Brian Knight and Jamie Thomson (for SSIS), and tons of other blogs from Sqlblog team. Over the time, all my bookmarks have become similar to page splits, fragmentations, and what not.

Anyways, to make it easier, I started using Yahoo Pipes. I had seen this webapp long time back, when it was in a very primitive stage. Basically, its very similar to SSIS, in terms of getting the blogs (RSS feeds), and even URLs, unioning them, and different way of organizing. I've made one pipe (I cloned the one made by Paul Randall of SQLSkills.com) and published this which I can access from my phone, computer, google reader and what not. If you want to be more organized, check out Yahoo Pipes and a few pipes I made to start with.

feed://pipes.yahoo.com/pipes/pipe.run?Age=20+days+ago&_id=0589b23e29276217ff817943202f6acc&_render=rss

I just wish it was would be easier links, and I could replace that guid (and have a better permalink).

Filed under: General No Comments
21Jan/113

Why I want to attend SQLSkill’s 5-day Internals and Performance class in Dallas

The reason for this blog post is outline:

1) How SQL Server community has really inspired me and helped me with my career.

2) My reasons to attend 5-day Internals and Performance class offered by SQLSkills.com

3) Win a free seat to the class

Luck never hits me by chance. I do not recall winning anything in my life, be it on raffle tickets in school or powerball tickets (I would probably be not writing this post if I had won the big powerball). However, I do believe in one thing, i.e. hard work. When I graduated from college back in 2008, jobs were hard to find. Economy was down, and nobody was hiring someone who didn't have the real work experience (internships, consulting etc didn't count much), connections plus needed H1B sponsorship. After much deliberation, I started working in a job which I was not very passionate about. However, I worked my way up to become one of the best assets in my department. After working there for more than a year, I finally made it to the DBA team. At first, I was more inclined to become a BI developer focusing on Microsoft or yada yada stack. I spent almost a year doing my own research, practice, hard work, etc, etc, before I could move to the team. I also read a couple of those, 'how to start as a Junior DBA?' post written by SQL Server gurus like Brent Ozar. Got a lot of ideas from it, however, shadowing DBA during their operations is not a 'very welcome' move at where I work, because it is a small shop, and people are busy doing their own stuff. So, I decided that I could do that myself. I want to give a lot of credit to Brent Ozar (blog, twitter) whose blogs are just amazing. Not only does it have a lot of resources for intermediate level DBA's like me, but also has a lot of inspirational posts. I am still upset that I did not win that iPad (jk). See, even a rand function in MySQL was not my friend :P

Anyways, back to the Internals and Performance class. Last couple of months, I spent ample time watching and learning some of the great stuff from MCM videos created by SQLSkills.com team. Those were followed by reading almost all the resources available in the blog posts, mostly by SQLSkills.com and a few others, and trying to apply newly learned things to solve my day to day problems. Some of the things came immediately beneficial for troubleshooting issues and I feel more and more confident. I've always wanted to attend classes offered by SQLSkills, Pragmatic Works, SQLCruise, etc, etc. But, at the same time, these are some real investments that neither my company would make, nor I am in a position to make. That is why getting all the knowledge and skills I can during my work hours or free time, reading tons of blogs, SQLServerPedia, Sqlskills.com articles, watching videos, or at least having these audios play while I work has helped me quite a bit. After coming to the DBA team, most of my work involves DBA based troubleshooting, assisting production DBA to administer production servers, moving data centers, and a report writing, BI, database architecting and little things here and there. That has brought a lot of confidence in me, and I can only imagine how much I would benefit from attending the ‘Internals and Performance’ class in Dallas to add value to my skills, be more valuable at my work, and start helping others in the community.

If you are interested in attending this event, and cannot afford it, here is a link:

http://www.sqlskills.com/BLOGS/PAUL/post/Master_Immersion_Event_Competition.aspx which has all the TOCs for entry into earning the class for FoC (free of cost).

If your company can pay, and you don't want to write a blog post, please register via here: http://www.sqlskills.com/Master1-Dallas-20110221.asp :)

I would highly encourage anyone to attend, as these are the top notch people in the SQL Server community. If you are planning for the MCM certification, this is the way to go!!!

Filed under: SQL Server 3 Comments
20Jan/110

Non-Trivial Recommendation Systems

Collaborative Filtering, recommendation systems, intelligent systems, etc are very fascinating to me. During my free time, I read about these systems and look for ways how I can utilize this to build great products. I have been following recommendation systems build by Amazon and other companies for a long time. Recently, I had to work on one of these projects where I had to generate item-to-item recommendation system based off of statistics collected from IIS logs. As this is still a proprietary product, I cannot discuss codes or process used to generate this. But the application was heavily influenced by Slope One algorithm and implemented via T-SQL and SSIS. If you are looking forward to build something of this nature, http://en.wikipedia.org/wiki/Slope_One and the corresponding links might be a good place to start with.

14Jan/110

Laptop – Quad Core vs Dual Cores

I have never been a big fan of gaming + laptops. In fact, I don't own a desktop at all, and I don't do any gaming. I work purely on SQL Server + some Windows Administration. Recently, I had weird applications being pushed to my laptop due to some workplace domain policy. Laptop was slower, and heard from others that they had to re-stage their laptops a couple of times because issues came back again and again. Due to the slowness, I decided to get another laptop (this one is a T400 with T9400 Core 2 Duo). I was debating with myself whether to get a Quad Core I7 or a Dual Core I7. Initially, the Quad Core sounded fantastic. I wanted to find the differences between the two in terms of speed, and how it affects everyday computing. Found something like this:


i7 720QM is a quad, but the clock speed with Turbo Mode goes like following.

1.60GHz Base
1.73GHz Quad Core Turbo
2.40GHz Dual Core Turbo
2.80GHz Single Core Turbo

i7 620M is like this:

2.66GHz Base
3.06GHz Dual Core Turbo
3.33GHz Single Core Turbo

See the clock speed differences? Even in heavily multi-threaded apps it'll be a wash because of the massive clock speed gap.

620M will be faster in most cases and competitive in the worst, and with better battery life and lower heat.

This, I decided to go and get the fancy Lenovo x201 with a I7 + and 620M.

Hope this helps anyone looking for a laptop purchase.

Filed under: General No Comments
4myphone alpensky cleaner4all cleanmachines