Hi, I need some assistance with a query. To be honest, I'm not even
sure it can be done.
I'll try to keep the information limited to only what's relevant to
what I have and what I am trying to achieve with this query.
I have a table that contains around 100,000 records. For the sake of
this discussion, assume just two columns:
ID Data
1 000000000000
2 010000000000
3 011111111111
4 101100011101
5 110100011011
6 111100000000
7 111100000111
8 111100010111
9 111100011011
10 111111111111
The Data column contains only 1's and 0's.
The Data column is a text column, not numeric.
The Data column is actually 255 chars long. (I limited it above to 12
for this example only)
Duplicates on the Data column are allowed and do exist.
With 100,000 records, you would note that in the above example record
10 would actually be record 100,000.
My aim is to somehow sort the data (by creating a third column) so that
the records are in order of "string distance" or similar. In other
words, so that similar strings are located next to (or as close as
possible to) each other.
For example, taking the data above:
The "distance" between record 8 and record 9 is 2 (ie Only two
positions in the Data are not the same)
The "distance" between record 3 and record 4 is 6
However, the "distance" between record 3 and record 10 is only 1,
but sorted in the normal fashion, there would be some 99,990 records
between them.
I was hoping that as I add records to this table I could calculate a
number or a code or something to create a third column that could be
indexed. Accessing records using this index would give me the records
in order of "String Distance" or similar.
I have looked up functions such as "Levenshtein - Edit String
Distance", which is fine when I have to strings to compare. I can't
see it helping in generating an index though.
I hope I have been clear in my explanation.
I would really appreciate any comments or discussion that could help me
achieve this.
Thanks for your time,
MJPlease post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.
Based on your narrative, did you mean to post something like this?
CREATE TABLE Foobar
(foo_id INTEGER NOT NULL PRIMARY KEY,
vector CHAR(255) NOT NULL);
INSERT INTO Foobar VALUES(1, '000000000000');
INSERT INTO Foobar VALUES(2, '010000000000')
INSERT INTO Foobar VALUES(3, '011111111111');
INSERT INTO Foobar VALUES(4, '101100011101');
INSERT INTO Foobar VALUES(5, '110100011011');
INSERT INTO Foobar VALUES(6, '111100000000');
INSERT INTO Foobar VALUES(7, '111100000111');
INSERT INTO Foobar VALUES(8, '111100010111');
INSERT INTO Foobar VALUES(9, '111100011011');
INSERT INTO Foobar VALUES(10, '111111111111');
Look up the difference in rows and records, and what a scalar value is.
These strings look like (uuggh!) binary vectors, not scalar values as
required by First Normal Form.
There is no way to use string distance as a sorting order because it is
a **binary** relationship.
Look up the use of the Sequence Auxiliary table.
CREATE PROCEDURE StringDistance (@.foo_1 INTEGER, @.foo_2 INTEGER)
AS
SELECT SUM (CASE
WHEN SUBSTRING (F1.vector, S.seq, 1)
<> SUBSTRING (F2.vector, S.seq, 1)
THEN 1 ELSE 0 END) AS distance
FROM Foobar AS F1, Foobar AS F2, Sequence AS S
WHERE F1.foo_id = @.foo_1
AND F2.foo_id = @.foo_2
AND S.seq <= 255 -- or whatever
GROUP BY F1.foo_id, F2.foo_id;|||Hi Mary, I created an example that may help you to solve the problem.
Instead of calculating the difference between two binary
representations, i used products table in Northwind database to gererate a
new computed column that represents the price difference between two
consecutives products. In your case, you can replace this basic function to
any that you wish. Here's the script:
USE Northwind
GO
-- Creates a new table just like products
SELECT *
INTO ProductsComputed
FROM Products
GO
-- Creates a function that calculates the difference between the current
record and the one just before (it selects all the products lower than the
actual key and pick -- the first one, in the inverse order).
-- ISNULL is needed due to productId 1
CREATE FUNCTION fnMoneyDifference (@.ProductId INT)
RETURNS MONEY
AS
BEGIN
DECLARE @.Value Money
SELECT @.Value = P.UnitPrice -
ISNULL ((SELECT TOP 1 P2.UnitPRice
FROM ProductsComputed AS P2
WHERE P2.ProductID < P.ProductID
ORDER BY ProductID DESC), 0)
FROM ProductsComputed AS P
WHERE P.ProductID = @.ProductID
RETURN @.Value
END
GO
-- Add a computed columns to the table, using the function and passing the
product key
ALTER TABLE ProductsComputed
ADD ComputedColumn AS dbo.fnMoneyDifference(ProductID)
GO
-- Verify the result
SELECT * FROM ProductsComputed
GO
Using computed column you guarantee that the difference will always be
updated, but be careful, this can consume many resources when a query is
issued against a big table. If the data almost doesn't change, it's better
to create a real column and populate the values instead of using a computed
column.
I hope it helps.
[]s
Luciano Caixeta Moreira
Brazil
"--CELKO--" <jcelko212@.earthlink.net> wrote in message
news:1126146910.541347.236630@.g14g2000cwa.googlegroups.com...
> Please post DDL, so that people do not have to guess what the keys,
> constraints, Declarative Referential Integrity, data types, etc. in
> your schema are. Sample data is also a good idea, along with clear
> specifications. It is very hard to debug code when you do not let us
> see it.
> Based on your narrative, did you mean to post something like this?
> CREATE TABLE Foobar
> (foo_id INTEGER NOT NULL PRIMARY KEY,
> vector CHAR(255) NOT NULL);
> INSERT INTO Foobar VALUES(1, '000000000000');
> INSERT INTO Foobar VALUES(2, '010000000000')
> INSERT INTO Foobar VALUES(3, '011111111111');
> INSERT INTO Foobar VALUES(4, '101100011101');
> INSERT INTO Foobar VALUES(5, '110100011011');
> INSERT INTO Foobar VALUES(6, '111100000000');
> INSERT INTO Foobar VALUES(7, '111100000111');
> INSERT INTO Foobar VALUES(8, '111100010111');
> INSERT INTO Foobar VALUES(9, '111100011011');
> INSERT INTO Foobar VALUES(10, '111111111111');
>
> Look up the difference in rows and records, and what a scalar value is.
> These strings look like (uuggh!) binary vectors, not scalar values as
> required by First Normal Form.
> There is no way to use string distance as a sorting order because it is
> a **binary** relationship.
> Look up the use of the Sequence Auxiliary table.
> CREATE PROCEDURE StringDistance (@.foo_1 INTEGER, @.foo_2 INTEGER)
> AS
> SELECT SUM (CASE
> WHEN SUBSTRING (F1.vector, S.seq, 1)
> <> SUBSTRING (F2.vector, S.seq, 1)
> THEN 1 ELSE 0 END) AS distance
> FROM Foobar AS F1, Foobar AS F2, Sequence AS S
> WHERE F1.foo_id = @.foo_1
> AND F2.foo_id = @.foo_2
> AND S.seq <= 255 -- or whatever
> GROUP BY F1.foo_id, F2.foo_id;
>|||Hi Luciano,
Thank you very much for your response. I really appreciate you helping
me out with my problem.
I'll see what I can do to convert your code to my needs.
Thanks again,
MJ|||let's say you ahve data like this
ID Data
1 000000000000
2 100000000000
3 010000000000
4 001000000000
5 000100000000
rows 2-5 are all only 1 digit different foprm row 1. Hiow would you
sort them?|||Ok Mary. If you need any further assistance, just send a new message.
Good luck.
Luciano
"Mary" <maryjones11289@.hotmail.com> wrote in message
news:1126209734.166547.191330@.z14g2000cwz.googlegroups.com...
> Hi Luciano,
> Thank you very much for your response. I really appreciate you helping
> me out with my problem.
> I'll see what I can do to convert your code to my needs.
> Thanks again,
> MJ
>|||That's exactly my question.
I'm not looking to sort on the "data" column...I was hoping to add a
third column which contains some sort of "weighting factor" or
something similar, so that when I sort the data on this column I end up
with records with the smallest "distance" between each other "close" to
each other.
You are correct with your sample data that all of them have a distance
of 1 from each other and I would expect that when sorted by a weight
factor they will reside "close" to each other, but consider:
ID Data
1 00000000000
.
.
.
55 011111111111
.
.
.
100,000 111111111111
Record 55 and record 100000 have a distance of "1", yet are over 90,000
records apart when sorted on the "data" column.
If I could weight this column and sort by that weight, I am hoping to
get record 55 "as close as possible" to record 100,000.
Even if I have to read 50-100 records to get it would be much better
than having to read over 90,000
MJ|||On 7 Sep 2005 18:55:45 -0700, Mary wrote:
>Hi, I need some assistance with a query. To be honest, I'm not even
>sure it can be done.
Hi Mary,
I'm pretty sure it can't be done. Any ordering must be defined on the
contents of the row, not on the difference between a row and the
"previous" or "next" one, as you can only determine the "previous" and
"next" after applying the ordering.
Simplifying your example even further, to length 2:
ID Data
1 00
2 01
3 10
4 11
Now, which row should be first? There is no answer - because for each
row, there are two rows with a difference in one position, and one row
with a difference in two positions.
An ordering that WOULD bne possible is an ordering on the number of
position changed from a chosen "first" value. Let's say we arbitrarily
choose "01" to be the first value in the sample above. That would make
row #2 the first (no difference). The second and third rows would be #1
and #4 (or #4 and #1 - that's arbitrary again), that both differ in one
position from "01", and row 3 ("10" - both position changed) goes last.
But that is probably not the ordering you want, because the second and
third row now differ in two positions, and the third anf fourth only on
one!
I hope this example has managed to cinvince you that the ordering you
require is indeed impossible - not as a result of a limitation in SQL
Server, but because the requirement itself is insuffucient to define
"the" correct order.
The next step, of course, is to find out why you (think that you) need
the data to be ordered in this way, and to consider alternate ways of
achieveing the same goal. In short: what is the actual business
requirement that you're trying to solve?
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment