What is the purpose of a data model?
I have been trying to design a small data model today (only 4 tables) to capture the results of running queries under different workloads for a benchmarking tool. The topmost table, called RUN, stores information about every run of the tool:
RUN_NO (PK)
REMARKS
START_TIME
END_TIME
This seemingly simple table is raising some fundamental questions in data modelling:
1. The RUN_NO column is straightforward and is a primary key, no issues.
2. Remarks is whatever the user wants to supply for a specific run. It could be a comment such as "Benchmarking with new statistics" or "Benchmarking without query rewrite", etc. Let us assume that Remarks is mandatory for every Run (let the application figure out what to supply if the user doesn't pass remarks).
3. Start_Time and End_Time. This is where the problem lies: should both these columns be mandatory, should both be optional, or should Start_Time be mandatory and End_Time optional? If I look only at the nature of the data, and do not presume anything about the process or application used to create it, then I should make both columns mandatory. But what if some Runs start and never finish (or crash)? Should I not make End_Time optional in that case? The answer is: it depends. If I am going to create the record before a Run completes, then End_Time should be optional; but if I am going to create it after a Run completes, then it should be mandatory. Depending on how the application is coded, I may even have to make Start_Time itself optional, because I may be creating the record initially with just a Run_No and Remarks.
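The two choices above can be sketched as two physical variants of the RUN table. This is only an illustration (SQLite is used here for convenience, and the table names run_strict and run_relaxed are mine, not part of the actual model):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Variant A: the record is created only after a Run completes,
# so both time columns can be made mandatory.
con.execute("""
    CREATE TABLE run_strict (
        run_no     INTEGER PRIMARY KEY,
        remarks    TEXT NOT NULL,
        start_time TEXT NOT NULL,
        end_time   TEXT NOT NULL
    )
""")

# Variant B: the record is created when a Run starts, so End_Time
# must stay optional and is filled in later by an UPDATE.
con.execute("""
    CREATE TABLE run_relaxed (
        run_no     INTEGER PRIMARY KEY,
        remarks    TEXT NOT NULL,
        start_time TEXT NOT NULL,
        end_time   TEXT            -- NULL while the Run is in flight
    )
""")

# Variant B supports the "insert first, update later" workflow...
con.execute("INSERT INTO run_relaxed VALUES "
            "(1, 'Benchmarking with new statistics', '2008-01-01 10:00', NULL)")
con.execute("UPDATE run_relaxed SET end_time = '2008-01-01 10:05' WHERE run_no = 1")

# ...while Variant A rejects the same initial insert at the database level.
try:
    con.execute("INSERT INTO run_strict VALUES "
                "(1, 'Benchmarking with new statistics', '2008-01-01 10:00', NULL)")
except sqlite3.IntegrityError as e:
    print("strict variant rejected the in-flight insert:", e)
```

Neither variant is "wrong"; each simply bakes a different assumption about the writing application into the schema, which is exactly the tension discussed below.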
But hang on... isn't a data model supposed to represent only the true nature of the underlying data, and not be biased by the specific application or technique used to create that data? The technique may (and probably will) change frequently, while the basic nature of the data will not. What if two or more applications start writing into this model in future? They will probably do things differently. Does that mean I keep changing the model, or should I strictly enforce rules as a modeller and let the applications work by those rules? Will that not mean more effort for the applications? Who am I (as data modeller) to dictate how, and in what sequence, an app has to write data? Should I not just restrict myself to defining primary key and foreign key constraints (these are fundamental, and risky to trust an app with) and only enforce NOT NULL for columns which, from a business sense, can cause trouble if left optional?
Is it really wise to go and drive each little nail firmly down in my data model just so I can say 'My model will never permit dirty or missing data' and all applications can go to hell?
Is this why ERP applications like SAP do not enforce database constraints as rigorously as they ought to?
I am facing these issues because it is the first time that I am both developing a data model AND writing an application for it.
One obvious answer to these questions is to leave the columns optional in the logical model, but adapt the physical implementation to the application. But the fundamental question still is:
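One middle ground (my sketch, not something decided in the post): leave End_Time optional in the physical schema, but still enforce the one rule that is true of the data itself regardless of any application, namely that a Run cannot end before it starts. Again using SQLite for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# End_Time stays optional, but a CHECK constraint guards the invariant
# that holds no matter which application writes the row.
con.execute("""
    CREATE TABLE run (
        run_no     INTEGER PRIMARY KEY,
        remarks    TEXT NOT NULL,
        start_time TEXT NOT NULL,
        end_time   TEXT,
        CHECK (end_time IS NULL OR end_time >= start_time)
    )
""")

# A crashed Run with no End_Time is accepted...
con.execute("INSERT INTO run VALUES (1, 'crashed run', '2008-01-01 10:00', NULL)")

# ...but genuinely impossible data is rejected.
try:
    con.execute("INSERT INTO run VALUES "
                "(2, 'bad data', '2008-01-01 10:00', '2008-01-01 09:00')")
except sqlite3.IntegrityError:
    print("row with end_time before start_time rejected")
```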
Should a data model suit the way an application is built or should it almost dictate how an application is built?
This topic warrants much more thought...

1 Comment:
Just taking a lesson from the data modeling document you had sent: if you abstract this concept of RUN into something called an EVENT, then EVENT(run_no, remarks, start_time) is a super-type entity and you can model 2 sub-type entities, FAILURE RUN(end_time) and SUCCESS RUN(end_time). You can make END_TIME mandatory for the SUCCESS RUN sub-type entity and optional for the FAILURE RUN sub-type entity. In this case, both the applications would be able to use the data model for their use. So, I feel a data model can suit itself to the application and also dictate consistency to the application.
By Senthilkumar Bala, at 4:44 AM
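The super-type/sub-type split suggested in the comment above could be sketched like this (one possible reading, again in SQLite; the table and column names are my own rendering of the comment's entities):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection

# Super-type: attributes common to every Run, all mandatory.
con.execute("""
    CREATE TABLE event (
        run_no     INTEGER PRIMARY KEY,
        remarks    TEXT NOT NULL,
        start_time TEXT NOT NULL
    )
""")

# Sub-type 1: a successful Run must have finished, so end_time is NOT NULL.
con.execute("""
    CREATE TABLE success_run (
        run_no   INTEGER PRIMARY KEY REFERENCES event(run_no),
        end_time TEXT NOT NULL
    )
""")

# Sub-type 2: a failed/crashed Run may have no recorded end time.
con.execute("""
    CREATE TABLE failure_run (
        run_no   INTEGER PRIMARY KEY REFERENCES event(run_no),
        end_time TEXT
    )
""")

con.execute("INSERT INTO event VALUES (1, 'with new statistics', '2008-01-01 10:00')")
con.execute("INSERT INTO success_run VALUES (1, '2008-01-01 10:05')")

con.execute("INSERT INTO event VALUES (2, 'crashed run', '2008-01-01 11:00')")
con.execute("INSERT INTO failure_run VALUES (2, NULL)")   # allowed for a failure

try:
    con.execute("INSERT INTO success_run VALUES (2, NULL)")  # but not for a success
except sqlite3.IntegrityError:
    print("a SUCCESS RUN without an end_time is rejected")
```

The optionality question thus moves out of a single nullable column and into which sub-type a row belongs to, which is how the model can serve differently-behaving applications without loosening its rules.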