Correct, except that the role is different from a DBA. A DBA will take
care of backups, tablespace, indexes and other very physical things. A
data warehouse architect will take care of the data model and the
logical data load processes.
A data warehouse is the collection of cleansed and integrated information of an enterprise, used for informational purposes (As opposed to transactional). It must conform to very specific architecture and design rules to be viable. You then add a variety of analysis, reporting and decision making tools on top of it, alerts and dashboards as well. It is also a pre-requisite to a data mining project. Datawarehousing has typically been reserved for very large enterprises as serious datawarehousing projects usually cost several million dollars. The data volumes are generally (as of today) hundreds of gigabytes to a few terabytes. Real-time Datawarehousing is slowly becoming more critical and requires even more ressources.
There are two (correct) ways to build a datawarehouse:
The first is a corporate datawarehouse with a traditional E/R model as the enterprise datawarehouse that no one can (should) access and is only used to build datamarts which have star schema models (highly denormalized, with facts and dimensions).
The other way is a collection of datamarts that share a common "databus" with conformed dimensions. This one is more appropriate when the datawarehouse project is sponsored by a particular departement within the company instead of a strong corporate initiative. If done well, it can scale to enterprise levels by adding other datamart at the end of the "databus". Not to be confused with a bunch of first generation heterogeneous datamarts, which usually need to be replaced after a short while. (That's what I do, replace them)
A data warehouse MUST be built in one of those two ways following very strict design rules otherwise it WILL fail. Data dumps made by smart people who try to re-invent the wheel are NOT data warehouses (At best some people end up with bad Operational Data Stores)
The biggest effort in building a datawarehouse is spent on the ETL process. ETL means Extract, Tranform, Load, which also includes data cleansing and conforming.
Common pitfalls in a datawarehouse are
-scope creep
-two-year plans with no tangible milestones (Only to realise that "Oh sh*t did we ever go wrong there")
-weak project sponsorship plegued by internal politics (The best is a corporate commitment)
This is as short a summary as I can give of the following must read books:
Kimball's datawarehouse toolkit
Inmon's Building the data warehouse
^ Books by the guys who basically invented data warehousing. Think of those as K&R's book on C.
They oppose each other in datawarehouse design and are very dogmatic.
Imhof's mastering datawarehouse design
^ Book by Inmon's student/partner who obviously shares his philosophy but is less dogmatic and more practical.
You can get certified on both schools of thought .
Kimball's datawarehouse LIFECYCLE toolkit
Shares some chapters with the first book but is higher level and discusses aspects that relate to a whole warehouse project, not just data modeling.
Kimball's ETL toolkit
^ More technical and only discusses ETL processes. Another must-read.
Microsoft Data Warehouse toolkit
^ Have not read this one but it's apparently good and as the title indicates it is for SQL Server (2005)
If you're going to be doing this on SQL Server I guess it would be good to read this first but going back to the first two books I linked above is a must.
There are quite a lot of other theoretical DW books and quite honestly you should avoid them.
Other books one might want would be technical and geared towards the specific RDBMS being used. ( SSIS , Oracle & OWB , etc...) although they are not necessary if you understand the general principles and you are good at digging for technical details when you need them. They would make a horrible first read as they would introduce a technical biais which can only assure you a stovepipe career.
As far as the "OLAP stuff" goes that is quite a lenghty subject as well but basically OLAP is the multi-dimensional models (stars and cubes) I was talking about and the analysis tools you add on top of them.
OLAP is comprised of MOLAP (multi-dimensional), ROLAP (relational) and HOLAP (Hybrid: MOLAP functionalities on a ROLAP archtitecture).
While 10+ years ago MOLAP vendors had the lead, they've been trailing behind in the last 5 years because their technology didn't scale very well to the new data volumes (hundreds of GIGs to TERA). Although some vendors claim they now scale very well IMO they still won't gain their lead back on HOLAP vendors because the extra cost & maintenance is not justified compared to HOLAP.
To adress SQL Server 2005 and SSIS specifically I think they will make a serious KILLING with their pricing and functionalities and dominate the mid-market. While Oracle and DB2 both have mid-market products, no (serious) ETL and reporting vendors have introduced comparable price models. Oracle offers OWB for cheap but it still requires more knowledge and time to achieve the same result (For a mid market enterprise) while a guy who isn't too dumb and has read a book or two could get some decent results with SQL Server 2005 and SSIS.
The questions I presented are very much for a senior architect, so for the salary they offer if you can at least discuss most of those questions, you should totally get the job.
...Assuming they somewhat know what they are doing and they are not confusing the role with that of a DBA. A DBA is indeed very much needed on a data warehouse projectbut the roles are vastly different.
edit:
quicklinks for newsletters, conferences, etc..
www.tdwi.org
www.b-eye-network.com
www.dmreview.com
www.datawarehouse.com
www.bizintelligencepipeline.com
www.intelligententerprise.com
Obligatory wiki link:
http://en.wikipedia.org/wiki/Datawarehouse
A data warehouse is the collection of cleansed and integrated information of an enterprise, used for informational purposes (As opposed to transactional). It must conform to very specific architecture and design rules to be viable. You then add a variety of analysis, reporting and decision making tools on top of it, alerts and dashboards as well. It is also a pre-requisite to a data mining project. Datawarehousing has typically been reserved for very large enterprises as serious datawarehousing projects usually cost several million dollars. The data volumes are generally (as of today) hundreds of gigabytes to a few terabytes. Real-time Datawarehousing is slowly becoming more critical and requires even more ressources.
There are two (correct) ways to build a datawarehouse:
The first is a corporate datawarehouse with a traditional E/R model as the enterprise datawarehouse that no one can (should) access and is only used to build datamarts which have star schema models (highly denormalized, with facts and dimensions).
The other way is a collection of datamarts that share a common "databus" with conformed dimensions. This one is more appropriate when the datawarehouse project is sponsored by a particular departement within the company instead of a strong corporate initiative. If done well, it can scale to enterprise levels by adding other datamart at the end of the "databus". Not to be confused with a bunch of first generation heterogeneous datamarts, which usually need to be replaced after a short while. (That's what I do, replace them)
A data warehouse MUST be built in one of those two ways following very strict design rules otherwise it WILL fail. Data dumps made by smart people who try to re-invent the wheel are NOT data warehouses (At best some people end up with bad Operational Data Stores)
The biggest effort in building a datawarehouse is spent on the ETL process. ETL means Extract, Tranform, Load, which also includes data cleansing and conforming.
Common pitfalls in a datawarehouse are
-scope creep
-two-year plans with no tangible milestones (Only to realise that "Oh sh*t did we ever go wrong there")
-weak project sponsorship plegued by internal politics (The best is a corporate commitment)
This is as short a summary as I can give of the following must read books:
Kimball's datawarehouse toolkit
Inmon's Building the data warehouse
^ Books by the guys who basically invented data warehousing. Think of those as K&R's book on C.
They oppose each other in datawarehouse design and are very dogmatic.
Imhof's mastering datawarehouse design
^ Book by Inmon's student/partner who obviously shares his philosophy but is less dogmatic and more practical.
You can get certified on both schools of thought .
Kimball's datawarehouse LIFECYCLE toolkit
Shares some chapters with the first book but is higher level and discusses aspects that relate to a whole warehouse project, not just data modeling.
Kimball's ETL toolkit
^ More technical and only discusses ETL processes. Another must-read.
Microsoft Data Warehouse toolkit
^ Have not read this one but it's apparently good and as the title indicates it is for SQL Server (2005)
If you're going to be doing this on SQL Server I guess it would be good to read this first but going back to the first two books I linked above is a must.
There are quite a lot of other theoretical DW books and quite honestly you should avoid them.
Other books one might want would be technical and geared towards the specific RDBMS being used. ( SSIS , Oracle & OWB , etc...) although they are not necessary if you understand the general principles and you are good at digging for technical details when you need them. They would make a horrible first read as they would introduce a technical biais which can only assure you a stovepipe career.
As far as the "OLAP stuff" goes that is quite a lenghty subject as well but basically OLAP is the multi-dimensional models (stars and cubes) I was talking about and the analysis tools you add on top of them.
OLAP is comprised of MOLAP (multi-dimensional), ROLAP (relational) and HOLAP (Hybrid: MOLAP functionalities on a ROLAP archtitecture).
While 10+ years ago MOLAP vendors had the lead, they've been trailing behind in the last 5 years because their technology didn't scale very well to the new data volumes (hundreds of GIGs to TERA). Although some vendors claim they now scale very well IMO they still won't gain their lead back on HOLAP vendors because the extra cost & maintenance is not justified compared to HOLAP.
To adress SQL Server 2005 and SSIS specifically I think they will make a serious KILLING with their pricing and functionalities and dominate the mid-market. While Oracle and DB2 both have mid-market products, no (serious) ETL and reporting vendors have introduced comparable price models. Oracle offers OWB for cheap but it still requires more knowledge and time to achieve the same result (For a mid market enterprise) while a guy who isn't too dumb and has read a book or two could get some decent results with SQL Server 2005 and SSIS.
quote:Originally posted by Alask:
but I think I've got a reasonable understanding of most of them. Hopefully enough for what this business will want anyway, but who knows!
The questions I presented are very much for a senior architect, so for the salary they offer if you can at least discuss most of those questions, you should totally get the job.
...Assuming they somewhat know what they are doing and they are not confusing the role with that of a DBA. A DBA is indeed very much needed on a data warehouse projectbut the roles are vastly different.
edit:
quicklinks for newsletters, conferences, etc..
www.tdwi.org
www.b-eye-network.com
www.dmreview.com
www.datawarehouse.com
www.bizintelligencepipeline.com
www.intelligententerprise.com
Obligatory wiki link:
http://en.wikipedia.org/wiki/Datawarehouse
No comments:
Post a Comment