Learnings

Monday, December 31, 2007

Java program to load event monitor output into tables

A small Java program (original source is here) that loads the event monitor output into a database table to make it easier to analyze results. The program has a few small changes to handle DB2 versions 8 and 9 and to capture an identifier for the run whose events have been captured. That way, multiple event monitor runs can be stored in the same table and compared.



import java.io.*;
import java.sql.*;

/**
 * DB2Trace reads the output of the db2evmon event monitor and inserts
 * the data into a database table (DB2TRACE) for analysis with SQL.
 *
 * Creation Date: January 15 2003.
 */
class DB2Trace {
    /**
     * Usage: java DB2Trace sqltrace.txt jdbc:db2:db2mwh userid password
     *   sqltrace.txt    - the input file (its name is also used as the RUNID)
     *   jdbc:db2:db2mwh - the JDBC database URL
     *   userid          - the user ID
     *   password        - the password
     * @param args command-line arguments as above
     * @exception IOException, SQLException
     */
    public static void main(String[] args) throws IOException, SQLException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String s;
        boolean textYes = false;
        int exists = 0;

        COM.ibm.db2.jdbc.app.DB2Driver driver = new COM.ibm.db2.jdbc.app.DB2Driver();
        DriverManager.registerDriver(driver);
        Connection conn = DriverManager.getConnection(args[1], args[2], args[3]);
        conn.setAutoCommit(false);
        System.out.println("Connected");

        // Drop and recreate the DB2TRACE table and its index.
        Statement stmt = conn.createStatement();
        ResultSet rset = stmt.executeQuery(
                "SELECT COUNT(*) FROM SYSCAT.TABLES WHERE TABSCHEMA=USER AND TABNAME='DB2TRACE'");
        while (rset.next()) {
            exists = rset.getInt(1);
        }
        if (exists > 0) {
            stmt.executeUpdate("DROP TABLE DB2TRACE");
        }
        stmt.executeUpdate(
                "CREATE TABLE DB2TRACE(RUNID VARCHAR(30), OPERATION VARCHAR(30), SQLTXT VARCHAR(32000),"
                + " STARTTIME VARCHAR(30), STOPTIME VARCHAR(30), EXECTIME VARCHAR(20),"
                + " USRCPU VARCHAR(20), SORTS VARCHAR(20), TOTSORTTIME VARCHAR(20),"
                + " PKGNAME VARCHAR(20), SECTNO INTEGER, PKGSCHEMA VARCHAR(30)) IN TRACE_TSP");
        stmt.executeUpdate("CREATE INDEX DB2TRACE_CPU ON DB2TRACE(USRCPU)");

        PreparedStatement p = conn.prepareStatement(
                "INSERT INTO DB2TRACE(OPERATION, SQLTXT, EXECTIME, STARTTIME, STOPTIME, USRCPU,"
                + " SORTS, TOTSORTTIME, PKGNAME, SECTNO, PKGSCHEMA, RUNID)"
                + " VALUES(?,?,?,?,?,?,?,?,?,?,?,?)");

        // Scan the event monitor output line by line; each field of a statement
        // event appears on its own line with a fixed prefix.
        int i = 0;
        while ((s = in.readLine()) != null) {
            if (i % 100000 == 0) {
                System.out.println("Loop " + i);
                conn.commit();
            }
            i++;
            if (s.startsWith(" Operation: ")) {
                p.setString(1, s.substring(13));
            }
            if (s.startsWith(" Text :")) {
                textYes = true;
                p.setString(2, s.substring(13));
            }
            if (s.startsWith(" Type : Static")) {
                // The text of static SQL is filled in later from the STATEMENTS table.
                textYes = true;
                p.setString(2, "STATIC_SQL");
            }
            if (s.startsWith(" Exec Time:")) {
                p.setString(3, s.substring(14, s.length() - 7)); // strip trailing "seconds"
            }
            if (s.startsWith(" Start Time: ")) {
                p.setString(4, s.substring(14));
            }
            if (s.startsWith(" Stop Time: ")) {
                p.setString(5, s.substring(14));
            }
            if (s.startsWith(" User CPU:")) {
                p.setString(6, s.substring(13, s.length() - 7)); // strip trailing "seconds"
            }
            if (s.startsWith(" Sorts:")) {
                p.setString(7, s.substring(8));
            }
            if (s.startsWith(" Package :")) {
                p.setString(9, s.substring(13));
            }
            if (s.startsWith(" Section :")) {
                p.setString(10, s.substring(12));
            }
            if (s.startsWith(" Creator :")) {
                p.setString(11, s.substring(13));
            }
            p.setString(12, args[0]); // the input file name doubles as the RUNID
            if (s.startsWith(" Total sort time:")) {
                p.setString(8, s.substring(18));
                // "Total sort time" is the last field of a statement event, so
                // insert the row now if this event carried any statement text.
                if (textYes) {
                    p.executeUpdate();
                    textYes = false;
                }
            }
        }
        in.close();
        conn.commit();

        System.out.println("Updating...");
        // Fill in the SQL text of static statements from the STATEMENTS table.
        stmt.executeUpdate(
                "UPDATE DB2TRACE A SET A.SQLTXT = (SELECT B.TEXT FROM STATEMENTS B"
                + " WHERE B.PKGNAME=A.PKGNAME AND B.SECTNO=A.SECTNO AND B.PKGSCHEMA=A.PKGSCHEMA)"
                + " WHERE A.SQLTXT='STATIC_SQL' AND A.RUNID='" + args[0] + "'");
        conn.commit();
        stmt.close();
        p.close();
        conn.close();
    }
}

Tracing SQL Activity in DB2

The best way to trace SQL activity in DB2 - to find bad queries, high resource usage, etc. - is to use statement event monitors. Here is a good introduction to event monitors. To summarize:

The DB2 event monitor is a built-in utility that writes information to a file or pipe whenever a user-specified event occurs. The event could be, for example, the execution of an SQL statement, a commit, a sort, or a lock being taken.

To see which events can be traced, issue the following command from the DB2 CLP:
get monitor switches

You will see output like:
Monitor Recording Switches
Switch list for db partition number 0

Buffer Pool Activity Information (BUFFERPOOL) = OFF
Lock Information (LOCK) = OFF
Sorting Information (SORT) = ON 12/31/2007 14:42:44.078048
SQL Statement Information (STATEMENT) = ON 12/31/2007 14:42:44.078048
Table Activity Information (TABLE) = OFF
Take Timestamp Information (TIMESTAMP) = ON 12/31/2007 14:42:44.078048
Unit of Work Information (UOW) = OFF

To turn tracing of specific events on or off, issue, for example:
update monitor switches using sort off uow on
Then create an event monitor and activate it:
create event monitor mon_t for transactions write to file 'c:\temp'
set event monitor mon_t state=1
(If you get error SQLSTATE=51026 saying the path is in use by another monitor, check SYSCAT.EVENTMONITORS to verify.)
Now run the application or perform the activities that you want to trace. When finished, issue:
set event monitor mon_t state=0
terminate

Then issue:
db2evmon -path c:\temp > sqltrace.txt

This will write out all the information about the events into the file sqltrace.txt. This file can get quite big, contains many lines for each event, and is difficult to analyze directly. This is where the program DB2Trace.java comes in handy (it has been sourced from the IBM website referred to above and modified slightly to handle the newer versions of DB2). The program parses sqltrace.txt and loads the information into a table so that the data can be analyzed with regular SQL.

The code for the modified DB2Trace.java is in a separate post above; see the IBM post referred to above for details on how to run it.
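The heart of the program is simple prefix matching: each field of a statement event sits on its own line in the db2evmon output, introduced by a fixed prefix. A minimal sketch of that idea (the helper name and the exact prefix string below are my own simplifications, not from the program):

```java
// Sketch of the prefix-matching idea used by DB2Trace: pick a field's value
// off a db2evmon output line by matching its leading prefix.
class PrefixParser {
    // Returns the text after the prefix, or null if the line doesn't match.
    static String valueAfter(String line, String prefix) {
        return line.startsWith(prefix) ? line.substring(prefix.length()).trim() : null;
    }

    public static void main(String[] args) {
        // A typical-looking db2evmon output line (hypothetical example).
        System.out.println(valueAfter(" Operation: Prepare", " Operation: ")); // -> Prepare
        System.out.println(valueAfter(" Sorts: 2", " Operation: "));           // -> null
    }
}
```

The real program additionally trims fixed-width suffixes (such as a trailing "seconds") with hard-coded substring offsets.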

After running DB2Trace and capturing event information into the database, the following queries can be used to identify bad SQL:

-- Slow SQL (EXECTIME is stored as a VARCHAR, so cast it before ordering)
SELECT /* TOP SQL ON ELAPSED TIME */
SQLTXT, EXECTIME
FROM DB2TRACE
WHERE RUNID = 'sqltrace.txt'
ORDER BY CAST(EXECTIME AS DECIMAL(15,6)) DESC FETCH FIRST 10 ROWS ONLY


-- Frequent SQL
SELECT /* Frequent SQL */ count(*) as Num_Executions, sqltxt
FROM DB2TRACE
WHERE RUNID = 'sqltrace.txt'
GROUP BY sqltxt
ORDER BY count(*) desc
FETCH FIRST 10 ROWS ONLY


-- CPU hoggers
SELECT /* Top CPU Using SQL */
sum(cast(usrcpu as decimal(15,6))) as CPU_SECS,
sum(cast(totsorttime as decimal(15,6))) as SORT_SECS,
sqltxt
FROM DB2TRACE
WHERE RUNID = 'sqltrace.txt'
GROUP BY sqltxt
ORDER BY sum(cast(usrcpu as decimal(15,6))) + sum(cast(totsorttime as decimal(15,6))) desc
FETCH FIRST 10 ROWS ONLY

Friday, December 21, 2007

What is the purpose of a data model?

I have been trying to design a small data model today (only 4 tables) to capture the results of running queries with different workloads for a benchmarking tool. The topmost table is something called RUN which stores information about every run of the tool such as:
RUN_NO (PK)
REMARKS
START_TIME
END_TIME

This seemingly simple table raises some fundamental questions in data modelling:
1. The RUN_NO column is straightforward and is a primary key, no issues.
2. Remarks is whatever the user wants to supply for a specific run. It could be a comment such as "Benchmarking with new statistics" or "Benchmarking without query rewrite", etc. Let us assume that Remarks is mandatory for every Run (let the application figure out what to supply if the user doesn't pass remarks).
3. Start_Time and End_Time. This is where the problem lies - should I make both these columns mandatory or should both be optional or should Start_Time be mandatory and End_Time optional? If I look only at the nature of the data and not presume anything about the process or application used to create that data, then I should make both the columns mandatory. But what if some Runs start and never finish (or crash)? Should I not make End_Time optional in that case? The answer is - it depends; if I am going to create the record before a Run completes, then End_Time should be optional but if I am going to create it after a Run completes, then it should be mandatory. Depending on how the application is coded, I may even have to make Start_Time itself optional because I may be creating the record initially with just a Run_No and Remarks.
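For concreteness, one possible resolution of point 3 - Start_Time mandatory, End_Time optional to allow for runs that crash or never finish - could be written down as DDL like this (the column types and lengths here are assumptions for illustration):

```sql
-- One possible shape for the RUN table; types and lengths are assumptions.
CREATE TABLE RUN (
    RUN_NO     INTEGER      NOT NULL PRIMARY KEY,
    REMARKS    VARCHAR(200) NOT NULL,   -- mandatory, per point 2
    START_TIME TIMESTAMP    NOT NULL,   -- known when the record is created
    END_TIME   TIMESTAMP                -- NULL for runs that crashed or never finished
)
```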

But hang on... isn't a data model supposed to represent only the true nature of the underlying data and not be biased by the specific application or technique used to create the data? The technique may, and probably will, change frequently while the basic nature of the data will not. What if two or more applications start writing into this model in future? They will probably do things differently - does that mean I keep changing the model, or should I just strictly enforce rules as a modeller and let the applications work by those rules? Will that not mean more effort for the applications? Who am I (as data modeller) to dictate how and in what sequence an app has to write data? Should I not just restrict myself to defining primary key and foreign key constraints (these are fundamental and risky to trust an app with) and only enforce NOT NULL for columns which, in a business sense, can cause trouble if left optional?

Is it really wise to go and drive each little nail firmly down in my data model just so I can say 'My model will never permit dirty or missing data' and all applications can go to hell?

Is this why ERP applications like SAP do not enforce database constraints as rigorously as they ought to?

I am facing these issues because it is the first time that I am both developing a data model AND writing an application for it.

One obvious answer to these questions is to let the logical model leave the columns as optional but change the physical implementation as per the application. But the fundamental question still is:
Should a data model suit the way an application is built or should it almost dictate how an application is built?

This topic warrants much more thought...

Thursday, December 20, 2007

Where do I put the files - LOB or filesystem?

Here is another issue on which I wish I could speak authoritatively from hard-won experience but unfortunately, I can't do that yet. The question is, if I have an application where I need to store lots of files (such as audio, video, resumes, etc), should I store them in LOB columns inside the database or should I just store them on the filesystem and hold the path in the database? What are the pros and cons of each approach?

Like I said, since I haven't done it myself, I can only compile the aspects that need to be considered while making this decision:

Data Integrity
This is obviously going to be higher with LOBs and is in fact the biggest argument for storing the information inside the database as opposed to outside it. There is no danger of invalid links due to lost files (someone deleting a directory on the filesystem by mistake). This point becomes even stronger if the files are not just inserted and possibly deleted later but are also actively updated sometimes. Having them inside the database gives you the benefit of automatic locking and concurrency control, so LOBs win here hands down.

Indexing and Searching
If you are not just going to be shuffling the files up and down but are also going to be doing a lot of searching on keywords, content, etc., then consider that having them in LOBs might give you the benefits of database indexing (such as interMedia indexes for Oracle or DB2 Extenders). This is not to say that it can't be done in the filesystem - there are probably products (I vaguely remember that Thunderbird could do this) which can do this stuff very well on the filesystem too. So I would say that both options are even here - depending on what you have and what you need, decide on LOBs or filesystem.

Security
If you have stringent security requirements, it is probably better to leave the files in LOBs and let database security take care of them, especially if the security is tied to other relational data within the database (for example, if I have access to one department's employees, I should only be able to view their resumes). This sort of thing is possible in the filesystem also, but it can get messy if it gets complex. So LOBs are a marginal winner here, I would think.

Disk Space
If files are stored in LOBs, they will almost always take more space than on the filesystem. There are various factors involved here - one is that the RDBMS (whether Oracle or DB2) allocates space for LOBs in chunks, so you will always sacrifice some free space in the last few blocks/pages if you store in a LOB. With a filesystem, you probably only waste a little space on the last sector. So be aware of how "chunking" works for LOBs in your DB.
Another factor which can increase size is the need for handling rollbacks on LOBs - in Oracle, LOBs don't go into the rollback segment; instead, different versions are stored in the LOB segment itself, and the amount of space reserved for this is controlled by PCTVERSION or by the retention time (if you are using Automatic Undo Management). Check asktom.oracle.com for an explanation of this issue (in fact his new book on Designing Oracle has a chapter devoted to LOBs). Of course, this is configurable, so it is not to say that you will certainly lose space - if you don't want rollback, make sure you don't configure for it. However, this may work differently in other DBMSs, so keep it in mind.
A third factor which can increase size is if your DBMS doesn't support incremental backups, or if they are too complex to set up. With filesystems, one may argue that if the directories are dated and made read-only after some time, some scripting can be done to perform incremental backups; but if all the files are inside the DB and you have to take a full backup, this can become a big issue. Especially if you are also logging the LOB changes to the redo log files, recovery times may be affected. Again, it depends on what you need - if you need logging of LOBs for some reason, then the filesystem can't do it anyway; but if you don't, there may not be a strong reason to keep the files inside the DB. Something to consider.
Lastly, size may also be affected by compression - the operating system provides file compression options, while this may not be available for LOBs or may not perform as well. I don't know, I am just guessing, but this could be a factor too.
On the whole, with the size issue, filesystems seem to be the winner.
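To make the chunking point above concrete, here is a toy calculation of allocation slack - the space wasted in the last partially-filled allocation unit. The 32 KB chunk and 512-byte sector sizes are assumptions for illustration only; real allocation units vary by DBMS and filesystem:

```java
// Toy illustration of allocation slack: space wasted in the last
// partially-filled allocation unit. Unit sizes here are assumptions.
class SlackSpace {
    static long slack(long fileSize, long allocUnit) {
        long rem = fileSize % allocUnit;
        return rem == 0 ? 0 : allocUnit - rem;
    }

    public static void main(String[] args) {
        long file = 100_000;                        // a 100 KB file
        System.out.println(slack(file, 32 * 1024)); // in 32 KB LOB chunks -> 31072 bytes wasted
        System.out.println(slack(file, 512));       // on 512-byte sectors -> 352 bytes wasted
    }
}
```

Per file the difference is small, but multiplied across tens of thousands of files it adds up.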

Size Limitations
Be aware of the LOB size limitations in your DBMS and compare them with the filesize limitations and any limits on the number of files and directories in your filesystem. The DBMS may only support LOBs up to 2 GB in some cases, so please check this.

Administration
Folder management can become pretty complex if you are storing tens of thousands of files on the filesystem. Think about how they will be archived, purged, etc. Even with LOBs, the DBA will have to do some careful planning of where to put the LOBs, how to back them up, etc., so it may be more or less even here, though it may be somewhat easier inside the DB.

Complexity
In general, it may be a bit painful to understand LOBs and work with them from your application compared to just pointing to the file from a browser and picking it up. Be aware of limitations with JDBC drivers and versions, and understand LOBs fully before starting work on them.
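As a feel for the extra machinery involved, here is a hedged JDBC sketch of inserting a file into a BLOB column. The table, columns, and SQL string are hypothetical examples of mine, not from any real application:

```java
import java.io.*;
import java.sql.*;

// Hypothetical schema: RESUMES(EMP_ID INTEGER, DOC BLOB). The table and
// column names below are assumptions for illustration only.
class LobInsertSketch {
    static final String INSERT_SQL = "INSERT INTO RESUMES (EMP_ID, DOC) VALUES (?, ?)";

    // Streams the file into the BLOB column instead of reading it all into memory.
    static void insertFile(Connection conn, int empId, File file)
            throws SQLException, IOException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL);
             InputStream in = new FileInputStream(file)) {
            ps.setInt(1, empId);
            ps.setBinaryStream(2, in, (int) file.length());
            ps.executeUpdate();
        }
    }
}
```

Compare that with the filesystem approach, where the application just writes the file and stores its path in a plain VARCHAR column.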

Other issues
Also consider how things like load balancing, or having multiple applications read the same file at the same time, may affect you. Load balancing may be especially thorny to handle within the DBMS compared to setting it up on a router.

Lastly, even if you want to store the files in the filesystem, don't hardcode the entire path to the files inside the DB - use UNC paths instead, as that will make it more portable.

On the whole, the tilt may be slightly towards filesystems but then, I can't say for sure as I haven't worked hands-on on this. Your comments welcome!
