Monday, March 19, 2012

A "not in"-lookup?

Hi,

This may be an easy one, but I can't seem to crack it:
What would be the best way to implement a "not in"-loopkup. I want all the rows in my dataflow, where the key is not found in a specific table, to proceed in the data flow.

I could modify my data source only selecting the rows I'm interested in. However I don't like this solution as it is not very transparent.

I could also just use a Loopup transformation and then just use the Error Output for the rest of the flow. I don't like this solution either as it is a reverse approach.

Is there a third, and better, solution or will I have to go with one of the two above?

Regards,
Sune

Sune Hansen wrote:

Hi,

This may be an easy one, but I can't seem to crack it:
What would be the best way to implement a "not in"-loopkup. I want all the rows in my dataflow, where the key is not found in a specific table, to proceed in the data flow.

I could modify my data source only selecting the rows I'm interested in. However I don't like this solution as it is not very transparent.

I could also just use a Loopup transformation and then just use the Error Output for the rest of the flow. I don't like this solution either as it is a reverse approach.

Is there a third, and better, solution or will I have to go with one of the two above?

Regards,
Sune

Sune,
Using the error output is a legitamate approach and doesn't violate accepted best practice. What is your issue with using it?

-Jamie|||Wow - that was quick Smile Thanks.

My only issue was that I found it a little odd to use Error Output - there is no error.

Thanks again,
- Sune|||

Sune Hansen wrote:

Wow - that was quick Smile Thanks.

My only issue was that I found it a little odd to use Error Output - there is no error.

Thanks again,
- Sune

Yes, I'll grant you that. But don't worry - this is considered best practice.

-Jamie|||You could also use a MergeJoin and then use a conditional split to filter out any rows where the value of the column from the "lookup" table is NULL to a not in output (on the split) and then use that output for the rest of the flow. I don't know that this is any more performant (or less) than using the lookup's error output but it does keep you from using the "error" output when as you say there is no error. Just another alternative.

Thanks,
Matt|||Yeah... it is quick.

But, use it only if you aredealing with few rows.
When I was working with rows in the order of few hundred thousands to couple of millions, using error output was really slow. Possible reason is that new buffers are assigned for holding error output and then data is copied over from input buffers to the error output buffers. So, a work around to the peformance problem was to break it into 2 steps. In first step, perform lookup as earlier but this time select the ID column for that table. Now take the regular output and feed it into a conditional split transform. Put an expression like ISNULL(<ID Column name>). Output of conditional split will give you the desired not-in lookup.

HTH,
Nitesh|||

Nitesh Ambastha wrote:

Yeah... it is quick.

But, use it only if you aredealing with few rows.
When I was working with rows in the order of few hundred thousands to couple of millions, using error output was really slow. Possible reason is that new buffers are assigned for holding error output and then data is copied over from input buffers to the error output buffers. So, a work around to the peformance problem was to break it into 2 steps. In first step, perform lookup as earlier but this time select the ID column for that table. Now take the regular output and feed it into a conditional split transform. Put an expression like ISNULL(<ID Column name>). Output of conditional split will give you the desired not-in lookup.

HTH,
Nitesh

And that worked quicker? By what order of magnitude?

Wow, that's really valuable stuff to know. Thanks Nitesh.

-Jamie|||

Jamie Thomson wrote:


And that worked quicker? By what order of magnitude?

Wow, that's really valuable stuff to know. Thanks Nitesh.

-Jamie

Jamie,

I did not do any measures/benchmarks (yet).

But, in my packages I have lots of data flow tasks. And within each data flow tasks, I have one or more Ole Db Destinations. Right before each of these destinations, I need to do this not-in lookup. With that in mind, I feel I saw approximately 1.5 to 2 times better timings for the entire package... specially when I was dealing with millions of rows. This was on a 4 processor, 4GB RAM Intel box running Win 2k3 and June CTP.

- Nitesh

No comments:

Post a Comment