Saturday, November 17 • 1:40pm - 2:00pm
Journey of Building a Modern Data Prep Tool on Top of Apache Spark

Apache Spark is designed to be extensible and pluggable, which gives considerable flexibility in how the system can be used. In this talk, we show how we use Spark to build the data preparation engine that powers Workday Prism Analytics. Our data prep engine runs two types of Spark applications: one that is “always on” to serve interactive data prep queries, and another that is “on demand” to perform batch processing of data pipelines. We demonstrate how Spark and Catalyst made it possible for these two types of applications to share much of the same code, differing only in sampling, caching, and result extraction. Further, we illustrate how our engine today takes advantage of Spark SQL and Catalyst to generate DataFrames/Datasets optimized for our use cases, and relies on Tungsten to facilitate codegen for the 100+ custom library functions we expose to our users. We also describe how we leverage the Data Sources API to implement partition elimination and incremental data analysis on top of various file formats.
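
As a rough illustration of the shared-code idea described above, the following is a minimal sketch in plain Python rather than Spark, and it is not Workday's implementation. Every name in it (build_pipeline, InteractiveSession, run_batch, the amount and fx_rate columns) is hypothetical. It only shows the shape of the design: one pipeline definition, with the interactive mode adding sampling, caching, and a truncated preview, and the batch mode processing the full input.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, object]
Transform = Callable[[List[Row]], List[Row]]


def build_pipeline() -> List[Transform]:
    """Shared transformation logic used by both modes (hypothetical steps)."""

    def drop_missing_amount(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r.get("amount") is not None]

    def add_amount_usd(rows: List[Row]) -> List[Row]:
        return [{**r, "amount_usd": r["amount"] * r.get("fx_rate", 1.0)} for r in rows]

    return [drop_missing_amount, add_amount_usd]


def run(rows: List[Row], pipeline: List[Transform]) -> List[Row]:
    """Apply every pipeline step in order; identical for both modes."""
    for step in pipeline:
        rows = step(rows)
    return rows


class InteractiveSession:
    """'Always on' mode: sample the input, cache the result, return a preview."""

    def __init__(self, sample_size: int = 100, preview_rows: int = 20, seed: int = 0):
        self.sample_size = sample_size
        self.preview_rows = preview_rows
        self.seed = seed
        self._cache: Dict[str, List[Row]] = {}

    def preview(self, name: str, rows: List[Row], pipeline: List[Transform]) -> List[Row]:
        if name not in self._cache:
            rng = random.Random(self.seed)
            # Sampling: work on a bounded subset to keep latency low.
            if len(rows) > self.sample_size:
                sample = rng.sample(rows, self.sample_size)
            else:
                sample = rows
            # Caching: later previews of the same dataset reuse this result.
            self._cache[name] = run(sample, pipeline)
        # Result extraction: return only the first few rows.
        return self._cache[name][: self.preview_rows]


def run_batch(rows: List[Row], pipeline: List[Transform]) -> List[Row]:
    """'On demand' mode: full input, no cache, full result."""
    return run(rows, pipeline)
```

In this sketch, a call such as InteractiveSession(sample_size=50, preview_rows=5).preview("sales", rows, build_pipeline()) returns a short preview computed from a sample, while run_batch(rows, build_pipeline()) runs the same steps over every row.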

Jianneng Li

Software Engineer, Workday
Jianneng is a software engineer specializing in distributed systems and data processing. He works at Workday on Prism Analytics, leveraging Apache Spark to build an end-to-end data analytics solution that helps businesses better understand their financial and HR data.

Saturday November 17, 2018 1:40pm - 2:00pm PST