Skip to content

Rants of a Madman – I don’t care if it’s ETL, it’s still code!

August 23, 2012

A couple of weeks ago I kicked the hornet’s nest regarding how many individuals who work on automation efforts don’t practice software development best practices themselves.  I figured I might as well keep kicking this same nest with another group of technical practitioners.  Why not? I’m a glutton for punishment and somebody on an automation effort paid me off in chocolate to do it.

Riddle me this:  What is it that makes many who work on ETL think that they don’t have to follow best practices?  It is a classic issue that I’ve seen at every company I’ve ever worked at.  Somehow ETL is viewed by some as special and “not code”.

If you know what ETL is then skip this next paragraph as it’ll bore you and you most likely won’t agree with my dumbed down description…  That said, it’ll provide enough context for others…

What’s ETL?  It stands for Extraction Transformation and Load and it is probably one of the few areas in computer science where the promise of “Visual 4GL” might have actually played out to some degree.  Imagine a world where you can sit in front of a graphical IDE and “draw” a flowchart using a Visio like tool to describe how data enters a system (i.e. the “Extraction”), how it is manipulated or changed (the “Transformation”), and where the results are stored (the “Load”).  These “pictures” can be very simple as well as very complex.  They can have steps that run in a series, follow conditional paths, follow parallel paths with sync points to regroup, as well as explode when you drill down in to their own nested ETL definitions (as deep as you want to).  The extracting of data for an ETL can come from database tables, views, flat files, web services, and even custom connectors that were developed in a programming language.  This flexibility is also true for the loading of the results as literally many of the same connectors that act as input can be leveraged for outputs.  And now for the cool part…some ETL systems can even dynamically provision engines to spin up and down to execute your ETL jobs with as much parallelism as the rules will allow while ensuring accurate results.  Other than for managed code transactions in .NET many of us are still waiting for that magic to occur in other areas, no?

There are some companies and instances where ETL is truly valued over traditional code.  Yes, you could write code to create a complex transaction, but in many cases ETL may be easier to build and maintain.  It may even perform much better depending on the situation (as it may also perform much worse than code).  I’m not going to enter the debate as to which is better, ETL or code, as the truth is that it matters.  It is also like asking a father which of his kids he likes the best.  I like them both equally is my answer and I’m sticking to it.

So, given how great ETL can be and how often it is used why are there teams out there who don’t treat ETL with the same best practices as code? 

If you work on a team that does ETL work using such visual IDE tools let me ask the following:

  • Do you have unit test harnesses?  Do you use your ETL tool for automation testing?  You have a very robust ETL tool so why not create your test harnesses using the same ETL tool?  You can have your automation ETL preload tables with known data, have it run your ETL job(s) that are part of your “product”, and then use the automation ETL to then compare the result tables with known answer sets to flag and alert errors.  No?  Call me crazy, but for some reason this escapes a number of ETL teams I’ve met over the years.  Some of the best automation I’ve seen has been done with ETL tools.  (Extra credit: Have some Java web services you want to test?  Go out and ask one of your ETL buddies how they would design a test harness using their approach vs. writing more Java code; you’ll get an interesting solution that works quite well and is easy to maintain.)
  • Do you leverage configuration management capabilities or tools? Do you check out your ETL to make modifications?  Do you check in your changes and explain why the changes were made?  Do you paint your ETL versions with labels such that when you “build” your system you know exactly what ETL you are grabbing?  When you go to deploy your ETL to test or to production, was that based on a package created from what you’ve versioned and labeled?  Okay, I know it is harder to do than code as code is simpler to manage this way (code often sits in a file and you have easy integrations with version control systems in your IDE), but every ETL system has an artifact that could be managed with good software configuration management practices.  Some IDEs have evolved to have version and release management built in to them.  For the others?  Well worst case you could still import and export your ETL from the repository and apply version and release management to those artifacts.
  • Do you have a QA discipline, use defect tracking and management software, etc.?  Don’t chuckle.  I’ve had teams tell me that ETL isn’t “code” and therefore doesn’t need any of it.  Somehow Ralph is quite happy using e-mail, post-it notes, and a spiral binder of random scribbles to keep track of what works, what doesn’t ,what needs to be fixed, what to pick up at the store on the way home from work…  He knows his ETL works and doesn’t need Bob to test it because it “works on his machine”, on his repository, and with his “test” data.  Never heard all of those excuses before from your coding buddies have you?
  • Do you use performance analysis and profiling tools to optimize your ETL?  Hey, I’ve personally created more ETL than I’d like to admit using a variety of solutions; I know I can create something pig slow given the opportunity…  I bet you can too.
  • Where are your reusable components? Just because you can draw an ETL that takes yards of plotter paper to print out doesn’t mean that you should do so.  A coder can fill one source code file with megabytes of code if he wanted to.  Heck, he can even deblank it…  Just because your ETL is a “picture” doesn’t mean you cannot be organized and have re-usable components.  Spaghetti is spaghetti; the good news with ETL is your “picture” can be a good clue as to if it is spaghetti or not…
  • Where are your comments?  Would it kill you to annotate anything in your ETL?  Trust me, your data flow diagram isn’t self-documenting…  Nor are the reams of SQL or transformation commands you buried in one of the steps.
  • Agile Anyone? I’ve also heard from some that ETL just doesn’t condone itself to Agile.  How so?  You have prioritized work that needs to get done, it can be prioritized in a backlog queue, sprints can be planned as to what you’ll work on and why, demos can be given to show you’ve met the definition of done.  What about acceptance criteria in your stories?  Out of any computer science discipline could you show me a better example of where you can easily state what the acceptance criteria should be in terms of the outputs given a set of inputs?

I could go on for quite a while longer, but I think you get the point.  I’m not picking on anybody, but I’ve seen this movie hundreds of times and the ending always sucks.  In my experience this seems to be a very common discussion point amongst ETL teams.  So now I’ve gotta ask:  How’s your ETL doing?


From → Quality, Software

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: