Tuesday, March 27, 2012

get % distribution of invalid data?

We've got the requirement where we need to load data (usually dirty of course) from a flat file, and for every column, if the value is invalid, I'm putting a -1 in the field as it flows through the data stream - no problem. (I'm converting all invalids to -1 so they'll go into my eventual int column)

My question though is - at the end of the data flow, is there any smooth way to capture the % invalid values in each column. An example: If my table is as follows...where the # 1 signifies "clean data" and -1 signifies dirty.

col1 col2 col3 col4

1 1 1 1

1 -1 -1 1

-1 1 -1 1

-1 1 -1 1

I would want my % distribution query/ssis result to return:

col1 col2 col3 col4

.5 .25 .75 0

I do not want to do a select/ssis process for each column. Any way to do this all in one shot?

You could write a script component to track the number of good and bad values, and output them to an asynchronous output. Or execute a SQL statement after the data flow to get the statistics.|||

jwelch wrote:

execute a SQL statement after the data flow to get the statistics.

That's what I would try first....just to keep it simpler

No comments:

Post a Comment