Real-time data warehousing
I have never really understood the need for real-time data warehouses (so what, there are so many other things that I haven't understood:-). I have come across a few instances where people were proposing a real-time DW. For example, in one case, a company wanted to capture up-to-the-minute measurements about the operation of all its compressors in the country. There were, if I remember, over 30 measurements to be captured every second for every machine. In addition, they wanted to be able to perform trend analysis on this data with data collected from several months back. It hence seemed like a real-time DW was the only solution.
On closer analysis though, it looked like they were going to do two different things with the data:
1. Examine the real-time data and try to predict machine failures
2. Examine data trends over a long period of time and probably try to predict failures as well.
For the first analysis, they needed very detailed measurements but they were only interested in that level of detail for readings from the last day or two (not last one year!). For the second case, they did not really need every reading from every second for every years. Rather, a reading taken every 5 mins, say, for a period of 1 year would have sufficed.
In cases like this, it may be much simpler to leave the main DW as it is (with data upto yesterday) and let the ODS store real-time data for the last few days. This immediately makes things simpler - there is no need to recompute any costly aggregates each time a new feed comes in (which is every second). The ODS table is likely to be reasonably small and can be well indexed to support all the typical queries. Importantly, there is never any AD-HOC analysis performed on recent real-time data. Instead, people typically use statistical tools to calculate things like moving averages, std deviations, etc and these are all well-defined queries known beforehand.
Here is a much better article on the same topic:
http://www.theregister.co.uk/2006/07/31/real_time_data_warehousing/
Here is one on using DataMirror for real-time DW but I wish the author had given one or two convincing business scenarios where a real-time DW is the only viable solution:
http://www.grcdi.nl/considerations.pdf
On closer analysis though, it looked like they were going to do two different things with the data:
1. Examine the real-time data and try to predict machine failures
2. Examine data trends over a long period of time and probably try to predict failures as well.
For the first analysis, they needed very detailed measurements but they were only interested in that level of detail for readings from the last day or two (not last one year!). For the second case, they did not really need every reading from every second for every years. Rather, a reading taken every 5 mins, say, for a period of 1 year would have sufficed.
In cases like this, it may be much simpler to leave the main DW as it is (with data upto yesterday) and let the ODS store real-time data for the last few days. This immediately makes things simpler - there is no need to recompute any costly aggregates each time a new feed comes in (which is every second). The ODS table is likely to be reasonably small and can be well indexed to support all the typical queries. Importantly, there is never any AD-HOC analysis performed on recent real-time data. Instead, people typically use statistical tools to calculate things like moving averages, std deviations, etc and these are all well-defined queries known beforehand.
Here is a much better article on the same topic:
http://www.theregister.co.uk/2006/07/31/real_time_data_warehousing/
Here is one on using DataMirror for real-time DW but I wish the author had given one or two convincing business scenarios where a real-time DW is the only viable solution:
http://www.grcdi.nl/considerations.pdf
