Data Engineering

Data Engineering

Hive Metastore (HMS): What it is & What Can Replace it

Einat Orr, PhD

What is Hive Metastore? Hive Metastore (HMS) provides a single repository of metadata that you can quickly analyze to make educated, data-driven decisions. It’s an important component of many data lake systems. Hive is based on Apache Hadoop and can store data on S3, ADLS, and other cloud storage services via HDFS. Hive enables users […]

Data Engineering

What is Data Lifecycle Management (DLM)?

Paul Singman, Einat Orr, PhD

What is Data Lifecycle Management? Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table. Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing this process is often

Data Engineering

How To Measure Data Engineering Teams

Einat Orr, PhD

Data teams love calculating and tracking everything with metrics. We already have the infrastructure in place to do so… yet often fail to apply the same strategy for our own work. To fix that, let’s take a look at a few metrics useful for measuring our own performance. In my old office, there was a

Data Engineering

Solving Data Reproducibility

Paul Singman

Debugging an issue is never fun, but why make it harder? In this post, we show how reproducing data is possible whether interacting with a single file, entire table, or data repository. Introducing Data Reproducibility There are two types of issues in the world — reproducible and unreproducible.  A reproducible issue is one where the original conditions for

Data Engineering Thought Leadership

The State of Data Engineering in 2021

Einat Orr, PhD

Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it? What’s

Data Engineering Tutorials

Concrete Graveler: Splitting for Reuse

Ariel Shaqed (Scolnicov)

Welcome to another episode of “Concrete Graveler”, our deep-dive into the implementation of Graveler, the committed object storage for lakeFS. Graveler is our versioned object store, inspired by Git. It is designed to store orders of magnitude more objects than Git does. The last episode focused on how we store a single commit — a snapshot

Data Engineering

Messing with AWS Endpoint URLs

Paul Singman

It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed. But there’s no hard rule that you have to connect to the real bucket. And in fact, there’s a simple parameter

Data Engineering

Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared

Oz Katz

Introduction to Data Lakehouse When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility. It is inspiring that by simply changing the format data is stored in, we can unlock new functionality

Data Engineering

Why I’m Joining lakeFS

Paul Singman

Thoughts on a personal journey into the world of developer advocacy at an open-source data project. In March of 2021, I chose to leave the data team at Equinox Media and join a nascent open-source project, lakeFS, as the first developer advocate. In this post, I share a few reasons why I’m excited about starting this

Data Engineering

3 Data Lake Anti-Patterns to Avoid

Paul Singman

Rid yourself of these troubling habits and start the journey towards data lake mastery! Introduction Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes though, the promise of technological performance can overshadow an unpleasant developer experience. This is troublesome since I believe the developer experience is as

Data Engineering

What is a Data Lake? Data Lake vs Data Warehouse

Paul Singman

What is a Data Lake? A data lake is a system of technologies that allow for the querying of data in file or blob objects.  When employed effectively, they enable the analysis of structured and unstructured data assets at tremendous scale and cost-efficiency. The number of organizations employing data lake architectures has increased exponentially since