Friday, November 16 • 1:40pm - 2:00pm
Distinguishing features of production-quality data pipelines


It's now easier than ever to execute SQL queries, or a few lines of code in a notebook, against massive datasets and produce exciting results on a whim. Sometimes, though, you might be asked to take your one-off proof of concept and "productionize" it. Simply putting your query or script on a schedule might be all that's needed for the problem at hand, but if you're building for the long term, there are costly pitfalls you might face in the future with new data and changing logic. What are the foundations to build on, and what are the nice-to-have qualities and utilities, so that you can avoid the big data equivalent of emailing spreadsheets to each other?

You may have heard of "lambda architecture," "immutable append-only data sources," "reproducible deterministic outputs," and "atomic deployments," and how nice it is for your data pipeline to have these qualities, but what are the specific benefits, and in what situations do they matter or not? This talk will detail practices and principles for data pipelines that can help you avoid costly mistakes and hours lost to debugging mysteries. For the most part we'll focus on why you might put effort into certain goals that don't directly affect your immediate results. The talk is geared mainly toward Scala/Spark data pipelines but aims to be relevant to other kinds of data pipelines as well.
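One of the qualities named above, atomic deployment of pipeline output, can be sketched in a few lines of plain Scala. This is an illustrative sketch under my own assumptions, not code from the talk: the `AtomicPublish` object and its method names are hypothetical. The idea is to write results into a staging directory and then publish them with a single atomic filesystem move, so downstream readers see either the old output or the complete new output, never a half-written dataset.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper illustrating atomic publication of pipeline output.
// Results are produced in a hidden staging directory next to the final
// location, then moved into place in one atomic operation (on filesystems
// that support ATOMIC_MOVE, e.g. a local POSIX filesystem).
object AtomicPublish {
  def publish(finalDir: Path)(writeOutput: Path => Unit): Unit = {
    // Stage on the same filesystem as the destination so the move is atomic.
    val staging = Files.createTempDirectory(finalDir.getParent, ".staging-")
    writeOutput(staging) // produce all output files under the staging dir
    // Clear any previous output, then swap the staging dir into place.
    if (Files.exists(finalDir)) deleteRecursively(finalDir)
    Files.move(staging, finalDir, StandardCopyOption.ATOMIC_MOVE)
  }

  private def deleteRecursively(p: Path): Unit = {
    if (Files.isDirectory(p)) {
      val children = Files.list(p)
      try children.forEach(child => deleteRecursively(child))
      finally children.close()
    }
    Files.delete(p)
  }
}
```

A Spark job would do the same thing by writing to a staging path and renaming it into the published location, rather than writing directly to the path that consumers read.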


Nimbus Goehausen

Principal Data Engineer, Demandbase

Friday November 16, 2018 1:40pm - 2:00pm PST