Friday, March 23, 2012

Generating Surrogate Keys

I've been playing with a couple of ways of doing this, based on the ideas in
the article "The Cost of GUID's as Primary Keys" :
http://www.informit.com/articles/ar...2&seqNum=1&rl=1 and a
couple of other online sources
and I've put together the following:
create view builtins as
select getdate() the_date, newid() the_id
go
create function comb_id() returns uniqueidentifier
as
begin
declare @.out uniqueidentifier
select @.out = CAST(CAST(builtins.the_id AS BINARY(10))
+ CAST(builtins.the_date AS BINARY(6)) AS UNIQUEIDENTIFIER)
from builtins
return @.out
end
GO
create table x(a int, b uniqueidentifier DEFAULT ([dbo].[comb_id]()))
GO
Now, inserts to x result in a new "comb guid" being generated, which I can
trap in an insert trigger from the inserted table. I can also, if I wish,
generate ready-made IDs in the application, by reimplementing the same
algorithm.
Are there any "gotchas" of which I should be aware when using this approach?
Enviroment is SQL 2000, no immediate plans to upgrade.
Any observations welcome,
GCWhy not just use an auto-incrementing identity column?
"GC" <reply_to_group@.please> wrote in message
news:O8RKZr9XFHA.1044@.TK2MSFTNGP10.phx.gbl...
> I've been playing with a couple of ways of doing this, based on the ideas
in
> the article "The Cost of GUID's as Primary Keys" :
> http://www.informit.com/articles/ar...2&seqNum=1&rl=1 and a
> couple of other online sources
> and I've put together the following:
> create view builtins as
> select getdate() the_date, newid() the_id
> go
> create function comb_id() returns uniqueidentifier
> as
> begin
> declare @.out uniqueidentifier
> select @.out = CAST(CAST(builtins.the_id AS BINARY(10))
> + CAST(builtins.the_date AS BINARY(6)) AS UNIQUEIDENTIFIER)
> from builtins
> return @.out
> end
> GO
> create table x(a int, b uniqueidentifier DEFAULT ([dbo].[comb_id]()))
> GO
> Now, inserts to x result in a new "comb guid" being generated, which I can
> trap in an insert trigger from the inserted table. I can also, if I wish,
> generate ready-made IDs in the application, by reimplementing the same
> algorithm.
> Are there any "gotchas" of which I should be aware when using this
approach?
> Enviroment is SQL 2000, no immediate plans to upgrade.
> Any observations welcome,
> GC
>|||Why not create use a generic function to which you pass the current DateTime
and
a new Id. Like so:
Create Function dbo.CombID(@.SeedId UniqueIdentifier, @.CurDate DateTime)
As
Begin
Return Cast(Cast(@.SeedId As Binary(10))
+ Cast(@.CurDate As Binary(6)) As UniqueIdentifier)
End
Thus, you would do something like so in a table:
Create Table TableName
(
AColumn UniqueIdentifier Default (dbo.CombId(NewId(), Current_Timestamp))
)
Beyond the general gotchas of using a Guid, the only gotcha to this approach
would be that you are sacrificing a little "uniqueness" for better indexing
performance. That means that the odds are higher of a collision. Realistical
ly,
the odds are incredibly low even using the full eight bytes of the current d
ate
time.
Thomas
"GC" <reply_to_group@.please> wrote in message
news:O8RKZr9XFHA.1044@.TK2MSFTNGP10.phx.gbl...
> I've been playing with a couple of ways of doing this, based on the ideas
in
> the article "The Cost of GUID's as Primary Keys" :
> http://www.informit.com/articles/ar...2&seqNum=1&rl=1 and a
> couple of other online sources
> and I've put together the following:
> create view builtins as
> select getdate() the_date, newid() the_id
> go
> create function comb_id() returns uniqueidentifier
> as
> begin
> declare @.out uniqueidentifier
> select @.out = CAST(CAST(builtins.the_id AS BINARY(10))
> + CAST(builtins.the_date AS BINARY(6)) AS UNIQUEIDENTIFIER)
> from builtins
> return @.out
> end
> GO
> create table x(a int, b uniqueidentifier DEFAULT ([dbo].[comb_id]()))
> GO
> Now, inserts to x result in a new "comb guid" being generated, which I can
> trap in an insert trigger from the inserted table. I can also, if I wish,
> generate ready-made IDs in the application, by reimplementing the same
> algorithm.
> Are there any "gotchas" of which I should be aware when using this approac
h?
> Enviroment is SQL 2000, no immediate plans to upgrade.
> Any observations welcome,
> GC
>|||A couple of reasons, really. I figured a GUID-type is more "meaningless",
and less likely to be the subject of hard-coded assumptions in the future,
and I also quite liked the idea of being able to generate the GUID in the
application if I wanted to.
"JT" <someone@.microsoft.com> wrote in message
news:%23Ubds49XFHA.3712@.TK2MSFTNGP09.phx.gbl...
> Why not just use an auto-incrementing identity column?
> "GC" <reply_to_group@.please> wrote in message
> news:O8RKZr9XFHA.1044@.TK2MSFTNGP10.phx.gbl...
> in
> approach?
>|||Thomas Coleman wrote:
> Why not create use a generic function to which you pass the current
> DateTime and a new Id. Like so:
Just because, for no particularly good reason, I felt the function should be
niladic.

> Beyond the general gotchas of using a Guid, the only gotcha to this
> approach would be that you are sacrificing a little "uniqueness" for
> better indexing performance. That means that the odds are higher of a
> collision. Realistically, the odds are incredibly low even using the
> full eight bytes of the current date time.
I guess I could test for this in @.@.ERROR and try again if there's been a
collision.
Thanks for your responses,
GC|||>> Why not create use a generic function to which you pass the current
> Just because, for no particularly good reason, I felt the function should
be
> niladic.
Understand. But wouldn't every call need to populate your little temp table
with
a guid and a datetime? Otherwise, you'll get duplicate results.

> I guess I could test for this in @.@.ERROR and try again if there's been a
> collision.
If the column is the primary key, of course this will not be necessary. If t
he
column in question is not the primary key, then you can simply create a uniq
ue
index on that column to ensure uniqueness.
Thomas|||Thomas Coleman wrote:
> Understand. But wouldn't every call need to populate your little temp
> table with a guid and a datetime? Otherwise, you'll get duplicate
> results.
It was a view, so newid() and getdate() would be called each time it was
selected from.

>
> If the column is the primary key, of course this will not be
> necessary. If the column in question is not the primary key, then you
> can simply create a unique index on that column to ensure uniqueness.
>
I was thinking about what to do in the (unlikely) event of the algorithm
giving rise to a collision.
If @.@.ERROR is 2627 (PK violation), then try again, on the basis that you'd
have to be seriously
unlucky to get two collisions in a row.

>
> Thomas|||An Integer is only 4 bytes long. GUIDs are 36 bytes long, and even a self
generated GUID would be at least twice as long as an integer. Not very
efficient for storage or performance reasons. Why is it important to
generate the IDs by the application, is this some sort of offline
application that periodically merges data to the database?
"GC" <reply_to_group@.please> wrote in message
news:ujmllH%23XFHA.2684@.TK2MSFTNGP09.phx.gbl...
> A couple of reasons, really. I figured a GUID-type is more "meaningless",
> and less likely to be the subject of hard-coded assumptions in the future,
> and I also quite liked the idea of being able to generate the GUID in the
> application if I wanted to.
> "JT" <someone@.microsoft.com> wrote in message
> news:%23Ubds49XFHA.3712@.TK2MSFTNGP09.phx.gbl...
ideas
a
wish,
>|||First of all this is not a surrogate key accordignt o Dr. Codd's
definition. It is an artifacial key. And it happens to be long,
unverifiable and a XXXXX to validate.
Here is an implementation of the additive congruential method of
generating values in pseudo-random order and is due to Roy Hann of
Rational Commerce Limited, a CA-Ingres consulting firm. It is based on
a shift-register and an XOR-gate, and it has its origins in
cryptography. While there are other ways to do this, this code is nice
because:
1) The algorithm can be written in C or another low level language for
speed. But math is fairly simple even in base ten.
2) The algorithm tends to generate successive values that are (usually)
"far apart", which is handy for improving the performance of tree
indexes. You will tend to put data on separate physical data pages in
storage.
3) The algorithm does not cycle until it has generated every possible
value, so we don't have to worry about duplicates. Just count how many
calls have been made to the generator.
4) The algorithm produces uniformly distributed values, which is a nice
mathematical property to have. It also does not include zero.
Generalizing the algorithm to arbitrary binary word sizes, and
therefore longer number sequences, is not as easy as you might think.
Finding the "tap" positions where bits are extracted for feedback
varies according to the word-size in an extremely non-obvious way.
Choosing incorrect tap positions results in an incomplete and usually
very short cycle, which is unusable. If you want the details and tap
positions for words of one to 100 bits, see E. J. Watson, "Primitive
Polynomials (Mod 2)", Mathematics of Computation, v.16, 1962,
p.368-369.
We need to tap bits 0 and 3 to construct the 31-bit random-order
generated value Generator (which is the one most people would want to
use in practice):
UPDATE Generator31
SET keyval =
keyval/2 + MOD(MOD(keyval, 2) + MOD(keyval/8, 2), 2)*2^30;
Or if you prefer, the algorithm in C:
int Generator31 ()
{static int n = 1;
n = n >> 1 | ((n^n >> 3) & 1) << 30;
return n;
}
Do not forget to add a check digit.|||Hello, JT
A GUID is stored on 16 bytes, not 36 bytes.
Sometimes it's better to generate the IDs by the application to avoid
the round trips (necessary to execute "SELECT SCOPE_IDENTITY()").
Sometimes a "less than perfect" environment (like Microsoft Access
ADP-s) is used to develop the application, that has "SELECT @.@.IDENTITY"
hardcoded (instead of "SELECT SCOPE_IDENTITY()"). Sometimes GUID-s are
needed for replication. Sometimes...
Regarding the performance reasons, did you read the article quoted by
GC in the first post? The author suggests that there is only a 10%
performance impact due to the 4 times increase of size (the bigger
impact in that test scenario is due to the randomness of GUIDs, and
this impact can be minimized with the converting of current date trick,
according to the article).
Razvan

No comments:

Post a Comment