Hive Metastore (HMS): What it is & What Can Replace it

Einat Orr, PhD

August 9, 2021

What is Hive Metastore? Hive Metastore (HMS) provides a single repository of metadata that you can quickly analyze to make educated, data-driven decisions. It’s an important component of many data lake systems. Hive is based on Apache Hadoop and can store data on S3, ADLS, and other cloud storage services via HDFS. Hive enables users […]

Data Engineering

What is Data Lifecycle Management (DLM)?

Paul Singman, Einat Orr, PhD

July 14, 2021

What is Data Lifecycle Management? Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table. Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing this process is often

Data Engineering

How To Measure Data Engineering Teams

Einat Orr, PhD

June 14, 2021

Data teams love calculating and tracking everything with metrics. We already have the infrastructure in place to do so… yet often fail to apply the same strategy for our own work. To fix that, let’s take a look at a few metrics useful for measuring our own performance. In my old office, there was a

Data Engineering

Solving Data Reproducibility

Paul Singman

May 19, 2021

Debugging an issue is never fun, but why make it harder? In this post, we show how reproducing data is possible whether interacting with a single file, an entire table, or a data repository. Introducing Data Reproducibility: There are two types of issues in the world: reproducible and unreproducible. A reproducible issue is one where the original conditions for

Data Engineering, Thought Leadership

The State of Data Engineering in 2021

Einat Orr, PhD

May 5, 2021

Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it? What’s

Data Engineering, Tutorials

Concrete Graveler: Splitting for Reuse

Ariel Shaqed (Scolnicov)

April 27, 2021

Welcome to another episode of “Concrete Graveler”, our deep-dive into the implementation of Graveler, the committed object storage for lakeFS. Graveler is our versioned object store, inspired by Git. It is designed to store orders of magnitude more objects than Git does. The last episode focused on how we store a single commit, a snapshot

Data Engineering

Messing with AWS Endpoint URLs

Paul Singman

April 20, 2021

It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed. But there’s no hard rule that you have to connect to the real bucket. And in fact, there’s a simple parameter

Data Engineering

Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared

Oz Katz

April 12, 2021

Introduction to Data Lakehouse: When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility. It is inspiring that by simply changing the format data is stored in, we can unlock new functionality

Data Engineering

Why I’m Joining lakeFS

Paul Singman

April 6, 2021

Thoughts on a personal journey into the world of developer advocacy at an open-source data project. In March of 2021, I chose to leave the data team at Equinox Media and join lakeFS, a nascent open-source project, as the first developer advocate. In this post, I share a few reasons why I’m excited about starting this

Data Engineering

3 Data Lake Anti-Patterns to Avoid

Paul Singman

March 30, 2021

Rid yourself of these troubling habits and start the journey towards data lake mastery! Introduction: Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes though, the promise of technological performance can overshadow an unpleasant developer experience. This is troublesome since I believe the developer experience is as

Data Engineering

What is a Data Lake? Data Lake vs Data Warehouse

Paul Singman

March 22, 2021

What is a Data Lake? A data lake is a system of technologies that allow for the querying of data in file or blob objects. When employed effectively, they enable the analysis of structured and unstructured data assets at tremendous scale and cost-efficiency. The number of organizations employing data lake architectures has increased exponentially since

Best Practices, Data Engineering

lakeFS Hooks: Implementing Write-Audit-Publish for Data Using Pre-Merge Hooks

Oz Katz

March 2, 2021

Write-Audit-Publish (continuous integration/continuous deployment of data) is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures the quality of data at each step of a production pipeline. In this blog post, I will present lakeFS’s webhooks, and

Data Engineering

Pick up the Slack with lakeFS