Managing the Lifecycle of Datasets at Scale: Pipelines + Storage + Access
Every day, data grows not only in size, but also in value. People regularly make data-driven decisions about everything from whether to wear rain boots to what to invest in (e.g., a pharmaceutical company with the next blockbuster weight loss drug). Bloomberg's Data Platform Engineering team manages diverse financial datasets. To scale, the team built configurable workflows that standardize a broad variety of structured and unstructured data and make it machine readable. These datasets are then delivered to internal clients and external customers through various applications and APIs. In this session, we will review one of Bloomberg's data pipeline architectures, which is used to onboard hundreds of datasets. We will explore our extensive use of Apache Airflow to orchestrate ingestion, horizontally scaled PostgreSQL clusters for storage, and Trino to access and combine disparate datasets. This session will inspire new ideas and strategies for standardizing and scaling your own data.
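To make the idea of configuration-driven standardization concrete, here is a minimal sketch (not Bloomberg's actual code; all names, fields, and the `CONFIG` mapping are hypothetical): a per-dataset config maps raw field names onto a common schema and applies simple type coercions, so a new dataset can be onboarded by adding configuration rather than code.

```python
# Hypothetical sketch of a config-driven standardization step.
# A per-dataset CONFIG maps raw source fields onto a common schema,
# so onboarding a new dataset means adding configuration, not code.
from datetime import date
from typing import Any, Callable

# Config for one example dataset: target field -> (source field, parser)
CONFIG: dict[str, tuple[str, Callable[[str], Any]]] = {
    "ticker": ("Symbol", str.upper),
    "price":  ("Px",     float),
    "as_of":  ("Date",   date.fromisoformat),
}

def standardize(raw: dict[str, str]) -> dict[str, Any]:
    """Map one raw record onto the common schema defined by CONFIG."""
    return {target: parse(raw[source])
            for target, (source, parse) in CONFIG.items()}

record = standardize({"Symbol": "ibm", "Px": "172.5", "Date": "2024-05-01"})
print(record)
```

In a production pipeline of the kind described above, each standardization step like this would run as a task orchestrated by Airflow, with the standardized output landing in PostgreSQL and queryable alongside other datasets via Trino.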