Saturday, February 11, 2012

4 million queries, or 4 million rows

Hello!
I need to verify that about 4 million rows were correctly inserted into
the database. Should I do this using one query returning 4 million rows,
or should I query the database 4 million times and return one row per
query?
The result of the query/queries will be processed in a client application.
Does it matter?
I would prefer the latter, if it can finish in a reasonable amount of time.
Thanks!
--
Jon"Jon" <jon@.noreply> wrote in message
news:xn0f10jy5j55qzl00t@.news.microsoft.com...
> Hello!
> I need to verify that about 4 million rows were correctly inserted into
> the database. Should I do this using one query returning 4 million rows,
> or should I query the database 4 million times and return one row per
> query?
>
Neither. Surely it would make far more sense to write a query that verifies
the result server-side. That way you need only return a Yes/No answer or
some other aggregate or exception result.
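
For example, a minimal sketch of that idea, assuming the file contents are
also available server-side in a staging table (dbo.SourceStage, dbo.Target
and KeyCol are placeholder names, not your real schema):

-- Aggregate answer: do the row counts agree at all?
SELECT
    (SELECT COUNT(*) FROM dbo.SourceStage) AS SourceRows,
    (SELECT COUNT(*) FROM dbo.Target)      AS TargetRows;

-- Exception result: keys present in the source but missing from the target.
SELECT KeyCol FROM dbo.SourceStage
EXCEPT
SELECT KeyCol FROM dbo.Target;
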
> The result of the query/queries will be processed in a client application.
> Does it matter?
> I would prefer the latter, if it can finish in a reasonable amount of
> time.
>
It matters! But if client-side row-by-row processing is an absolute
requirement then you'd better test performance for yourself. I don't know
what you consider reasonable.
--
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||The data is not imported from another database; it is a set of files that
needs to be processed before being inserted into the database. I need to
verify that the application doing this does the right thing (which it
currently doesn't). I am not really a pro at T-SQL (can I read from a file
using T-SQL?), so I feel more comfortable doing it in a "normal" application.
I have tried returning all rows in one query, and it takes about 20
minutes to verify the data. The problem is that when elements are missing
from the database, reading the next element from the file and the next
element from the database gives a mismatch, because I am simply comparing
the wrong items. I can of course "synchronize" this, but I want to keep
the development time down, so if the time difference is small, I may just
take the easy route and not complicate it (the more complicated my code
is, the greater the risk of me doing something wrong, which could mean
that the application verifying the insertion itself has a bug...).
Running time is not critical (it will just be used a few times), and I
would consider about 60 minutes still workable to verify the complete
database.
--
David Portas wrote:
>"Jon" <jon@.noreply> wrote in message
>news:xn0f10jy5j55qzl00t@.news.microsoft.com...
>>Hello!
>>I need to verify that about 4 million rows were correctly inserted into
>>the database. Should I do this using one query returning 4 million rows,
>>or should I query the database 4 million times and return one row per
>>query?
>Neither. Surely it would make far more sense to write a query that
>verifies the result server-side. That way you need only return a Yes/No
>answer or some other aggregate or exception result.
>>The result of the query/queries will be processed in a client application.
>>Does it matter?
>>I would prefer the latter, if it can finish in a reasonable amount of
>>time.
>It matters! But if client-side row-by-row processing is an absolute
>requirement then you'd better test performance for yourself. I don't know
>what you consider reasonable.|||Hi,
The insertion is done by another application, reading text files,
processing each line in the text file, and inserting it into the database.
Unfortunately, it does not log errors, and there can also be a bug in the
application (so I need to verify not only that the element exists, but
also that the data is correct).
Verifying that the data is correct cannot be done using a constraint,
because the application may have inserted data from another element into
the database (for example, by reading the wrong line in the file).
--
Jon
John Bell wrote:
>Hi Jon
>"Jon" wrote:
>>Hello!
>>I need to verify that about 4 million rows were correctly inserted into
>>the database. Should I do this using one query returning 4 million rows,
>>or should I query the database 4 million times and return one row per
>>query?
>What do you mean by this? If you inserted 4 million rows and no error status
>was returned and the transaction was committed then they will be inserted!
>If you need to make sure specific values are in a given column then make
>sure that you have the correct column/table constraints in place.
>John|||"Jon" <jon@.noreply> wrote in message
news:xn0f10m5sj87sy300v@.news.microsoft.com...
> Hi,
> The insertion is done by another application, reading text files,
> processing each line in the text file, and inserting it into the database.
> Unfortunately, it does not log errors, and there can also be a bug in the
> application (so I need to verify not only that the element exists, but
> also that the data is correct).
> Verifying that the data is correct cannot be done using a constraint,
> because the application may have inserted data from another element into
> the database (for example, by reading the wrong line in the file).
>
Maybe you could load the file to the server independently and then compare
the two results with a query. Perhaps that seems an odd idea given that you
already have an application doing the same. But if it's possible to bulk
load the file using BCP then I would expect you to see some improvement on
the 20-minute running time you mentioned. Take a look at the BCP topic in
Books Online.
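
As a rough illustration only, BULK INSERT is the T-SQL counterpart of the
command-line bcp utility; the path, delimiters and table name below are
assumptions about your file, not known details:

-- Load the flat file into a staging table that mirrors its layout.
BULK INSERT dbo.FileStage
FROM 'C:\data\source.txt'
WITH (
    FIELDTERMINATOR = ',',   -- adjust to the file's real delimiter
    ROWTERMINATOR   = '\n',
    TABLOCK                  -- allows a faster, minimally logged load
);
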
--
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||In the first place, I wanted to bulk load the data as-is into the
database, but the db admin said no to this. I was not given any reason for
this. :-(
So I am looking into what other options I have to do this job.
You don't think it is reasonable to query the database a couple of million
times?
--
David Portas wrote:
>"Jon" <jon@.noreply> wrote in message
>news:xn0f10m5sj87sy300v@.news.microsoft.com...
>>Hi,
>>The insertion is done by another application, reading text files,
>>processing each line in the text file, and inserting it into the database.
>>Unfortunately, it does not log errors, and there can also be a bug in the
>>application (so I need to verify not only that the element exists, but
>>also that the data is correct).
>>Verifying that the data is correct cannot be done using a constraint,
>>because the application may have inserted data from another element into
>>the database (for example, by reading the wrong line in the file).
>Maybe you could load the file to the server independently and then compare
>the two results with a query. Perhaps that seems an odd idea given that
>you already have an application doing the same. But if it's possible to
>bulk load the file using BCP then I would expect you to see some
>improvement on the 20-minute running time you mentioned. Take a look at
>the BCP topic in Books Online.|||Yes, the program will be improved. But to do this, we need to figure out
what is wrong, which is why I am doing this.
I don't think they are open to loading the data using a completely new
process. I think they want to stick with the original application.
John Bell wrote:
>Hi Jon
>"Jon" wrote:
>>Hi,
>>The insertion is done by another application, reading text files,
>>processing each line in the text file, and inserting it into the database.
>>Unfortunately, it does not log errors, and there can also be a bug in the
>>application (so I need to verify not only that the element exists, but
>>also that the data is correct).
>Have you considered BCP which will create a file of records that have
>errored? Possibly a DTS/SSIS package could process this better?
>>Verifying that the data is correct cannot be done using a constraint,
>>because the application may have inserted data from another element into
>>the database (for example, by reading the wrong line in the file).
>It sounds like you need to improve the program that does the inserts or the
>quality of the data!
>>--
>>Jon
>John|||No, it is not on a live system. But the problem still remains: I need to
somehow compare the data from the files with the data that is loaded in
the database, and the question is how best to do it.
--
John Bell wrote:
>Hi Jon
>>Yes, the program will be improved. But to do this, we need to figure out
>>what is wrong, hence why i am doing this.
>If that is the case then you will probably want to take a copy of the
>database and not work on the live system!
>John|||"Jon" <jon@.noreply> wrote in message
news:xn0f10o2ajatr8x00y@.news.microsoft.com...
> No, it is not on a live system. But the problem still remains: I need to
> somehow compare the data from the files with the data that is loaded in
> the database, and the question is how best to do it.
>
The DBA objects to you doing a bulk load on a non-production system? That
seems unhelpful, to say the least!
Since you are going through the file anyway in your code, maybe you could
insert the data into a temporary table that way. Then compare the two tables
with a query.
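
Purely as a sketch (the column list and temp table layout are assumptions;
substitute whatever your real table holds):

-- After your client code fills #FileRows from the file:
SELECT Col1, Col2, Col3 FROM #FileRows
EXCEPT
SELECT Col1, Col2, Col3 FROM dbo.Target;  -- in the file, missing or wrong in the db

SELECT Col1, Col2, Col3 FROM dbo.Target
EXCEPT
SELECT Col1, Col2, Col3 FROM #FileRows;   -- in the db but not in the file
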
Failing that, it's hard to be sure whether retrieving 1 x 4m rows or 4m x 1
row would be better. I suspect the single-query approach is the best unless
your client-side process is particularly heavy-duty, in which case the
difference may be negligible. That's really just a hunch though, because it
depends on too many factors only you can determine - the type of query,
indexing, hardware and network utilisation, etc.
--
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||On Tue, 09 Jan 2007 12:32:00 -0800, Jon wrote:
> I am not really a pro at T-SQL (can I read from a file using
>T-SQL?), so I feel more comfortable doing it in a "normal" application.
(snip)
>Running time is not critical (it will just be used a few times)
Hi Jon,
For a throwaway program that doesn't need to be fast, I'd suggest that
you use the technique that you are most comfortable with. And in this
case, that appears to be reading the rows one by one.
> and I
>would consider about 60 minutes still workable to verify the complete
>database.
The only way to find out if you'll be able to process the data in 60
minutes is to test it. (Consider testing on smaller databases first and
then try to extrapolate the execution time - test with several sizes to
check if the increase in execution time will be linear or exponential.)
--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis|||"Jon" <jon@.noreply> wrote in message
news:xn0f10ndvj9udcu00w@.news.microsoft.com...
> In the first place, I wanted to bulk load the data as-is into the
> database, but the db admin said no to this. I was not given any reason for
> this. :-(
> So I am looking into what other options I have to do this job.
It might be worth revisiting the bulk load. Ask the DBAs if you can bulk
load the data into a separate table, and then execute a query to move the
data from the new table into the actual table. Data is much easier to
manipulate once it's in the database, even in a different table.
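
Sketching that idea with hypothetical names (the cleanup expression stands
in for whatever per-row processing your application really performs):

-- Bulk load into dbo.FileStage first, then move the rows in one set-based pass.
INSERT INTO dbo.Target (Col1, Col2, Col3)
SELECT LTRIM(RTRIM(Col1)),   -- placeholder for the app's per-row processing
       Col2,
       Col3
FROM dbo.FileStage;
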
> You don't think it is reasonable to query the database a couple of million
> times?
I don't think that's the best option, but if you're stuck with it...|||Thanks, David, for your help!
I will spend some time this evening writing code. I think I will continue
with the one-query approach.
--
David Portas wrote:
>"Jon" <jon@.noreply> wrote in message
>news:xn0f10o2ajatr8x00y@.news.microsoft.com...
>>No, it is not on a live system. But the problem still remains: I need to
>>somehow compare the data from the files with the data that is loaded in
>>the database, and the question is how best to do it.
>The DBA objects to you doing a bulk load on a non-production system? That
>seems unhelpful, to say the least!
>Since you are going through the file anyway in your code, maybe you could
>insert the data into a temporary table that way. Then compare the two
>tables with a query.
>Failing that, it's hard to be sure whether retrieving 1 x 4m rows or 4m x
>1 row would be better. I suspect the single-query approach is the best
>unless your client-side process is particularly heavy-duty, in which case
>the difference may be negligible. That's really just a hunch though,
>because it depends on too many factors only you can determine - the type
>of query, indexing, hardware and network utilisation, etc.|||The structure of the files is not close to the table definitions; that is
one of the reasons we have developed an application to process the files
and insert the data into the database.
It is the verification I am working on. I have the original data (the
files), and I have the result of our application (the database), and I now
need to identify what data was not loaded correctly. With this
information, I hope we can identify which part(s) of our application are
not working properly, and fix the application.
John Bell wrote:
>Hi Jon
>"Jon" wrote:
>>No, it is not on a live system. But the problem still remains: I need to
>>somehow compare the data from the files with the data that is loaded in
>>the database, and the question is how best to do it.
>>--
>If you load the data into new tables using the same program it is not really
>going to prove anything, apart from possibly that the program makes the same
>mistakes consistently!! Maybe you need to export the data and compare what
>is exported with the original files? It is not clear how close the table
>definitions are to the flat file format.
>John|||Hi Hugo,
You have a very good point in that I should use what I feel most
comfortable with. I will see this evening which direction I will go in.
Thanks!
Hugo Kornelis wrote:
>On Tue, 09 Jan 2007 12:32:00 -0800, Jon wrote:
>>I am not really a pro at T-SQL (can I read from a file using
>>T-SQL?), so I feel more comfortable doing it in a "normal" application.
>(snip)
>>Running time is not critical (it will just be used a few times)
>Hi Jon,
>For a throwaway program that doesn't need to be fast, I'd suggest that
>you use the technique that you are most comfortable with. And in this
>case, that appears to be reading the rows one by one.
>>and I
>>would consider about 60 minutes still workable to verify the complete
>>database.
>The only way to find out if you'll be able to process the data in 60
>minutes is to test it. (Consider testing on smaller databases first and
>then try to extrapolate the execution time - test with several sizes to
>check if the increase in execution time will be linear or exponential.)|||"Jon" <jon@.noreply> wrote in message
news:xn0f11t16kd8jng010@.news.microsoft.com...
> The structure of the files is not close to the table definitions; that is
> one of the reasons we have developed an application to process the files
> and insert the data into the database.
> It is the verification I am working on. I have the original data (the
> files), and I have the result of our application (the database), and I now
> need to identify what data was not loaded correctly. With this
> information, I hope we can identify which part(s) of our application are
> not working properly, and fix the application.
If you could load all the data into a working table with a structure close
to that of your flat file, it would then be a simple matter to export the
data again and compare it to the original flat file. It is also easier to
move and manipulate the data once it is in the database than during loading.
