LakeFS


lakeFS
Original author(s): Einat Orr, Oz Katz
Developer(s): Treeverse
Initial release: August 3, 2020
Stable release: 0.104.0
Repository: https://github.com/treeverse/lakeFS
Written in: Go
Type: Data version control
License: Apache 2.0
Website: lakefs.io

lakeFS is free and open-source software developed by Treeverse.[1][2] It provides scalable, format-agnostic version control for data lakes,[3] using Git-like semantics to create and access different versions of the data.[1][2]

First released in August 2020, lakeFS offers data version tracking, isolated development and testing, repository rollback, and continuous data integration and deployment.

History

lakeFS was developed by Oz Katz and Einat Orr in 2020.[4][5]

Treeverse published the first public release, v0.8.1, in August 2020. This version offered Git-like operations for any file format, compatibility with AWS S3 storage, and a versioning engine based on multiversion concurrency control (MVCC).[6]

In 2021, the versioning engine was replaced by Graveler, which raised capacity to billions of objects with limited performance impact.[7]

In July 2021, Treeverse, the company behind lakeFS, raised $23 million in a Series A funding round led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.[5][8][9]

In June 2022, lakeFS Cloud was introduced as a managed service for versioning cloud data lakes.[1][3] The service is intended to make it easier to track data changes and to revert to previous versions.[3]

Software

Overview

lakeFS is a data versioning engine that manages data in a way similar to code. It provides Git-like operations such as branching, committing, merging, and reverting, which apply to data and its corresponding schema throughout the data life cycle.[10]
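The four operations named above can be illustrated with a minimal in-memory sketch. It is not the lakeFS API: the `ToyRepo` and `Commit` names, the snapshot-per-commit layout, and the "source overwrites destination" merge rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot mapping object paths to content (hypothetical model)."""
    snapshot: Dict[str, str]
    message: str
    parent: Optional["Commit"] = None


class ToyRepo:
    """Hypothetical in-memory model of Git-like operations on data."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit({}, "initial")}

    def branch(self, name: str, source: str) -> None:
        # A new branch is just another name pointing at an existing commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> Commit:
        head = self.branches[branch]
        new = Commit({**head.snapshot, **changes}, message, parent=head)
        self.branches[branch] = new
        return new

    def merge(self, source: str, dest: str) -> Commit:
        # Simplification: source entries overwrite destination entries,
        # with no conflict detection.
        src, dst = self.branches[source], self.branches[dest]
        merged = Commit({**dst.snapshot, **src.snapshot},
                        f"merge {source} into {dest}", parent=dst)
        self.branches[dest] = merged
        return merged

    def revert(self, branch: str) -> Commit:
        # Roll back by adding a commit that restores the parent's snapshot.
        head = self.branches[branch]
        if head.parent is None:
            raise ValueError("nothing to revert")
        restored = Commit(dict(head.parent.snapshot),
                          f"revert: {head.message}", parent=head)
        self.branches[branch] = restored
        return restored
```

In this sketch, committing to an `experiment` branch leaves `main` unchanged until `merge` is called, and `revert` restores the previous snapshot without erasing history.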

Features

lakeFS acts as an interface to object stores such as S3 and to data management systems such as AWS Glue and Databricks.[1] It delegates the actual data storage to backend services such as AWS, handles branch tracking itself, and supports multiple storage providers.[1]

lakeFS supports creating, tracking, merging, and deleting branches as required.[1] Because experimental changes are isolated on a branch, testing does not require duplicating the complete dataset.[1] lakeFS also integrates with continuous integration and deployment pipelines via webhooks.[1]

When working with arbitrary object storage, lakeFS handles data blocks through API calls.[1] It stores branching information as metadata, so that objects can be managed efficiently later.[1]
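The split between stored data blocks and branch metadata can be sketched as follows. This is a toy model under stated assumptions, not lakeFS internals: `ToyObjectStore` and `ToyMetadata` are hypothetical names, and addressing blocks by SHA-256 hash is an illustrative choice rather than something the sources describe.

```python
import hashlib
from typing import Dict


class ToyObjectStore:
    """Stands in for a backing object store (hypothetical)."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Assumption for this sketch: blocks are keyed by content hash.
        address = hashlib.sha256(data).hexdigest()
        self.blocks[address] = data
        return address


class ToyMetadata:
    """Branch metadata: branch name -> {logical path -> block address}."""

    def __init__(self, store: ToyObjectStore) -> None:
        self.store = store
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}

    def upload(self, branch: str, path: str, data: bytes) -> None:
        self.branches[branch][path] = self.store.put(data)

    def create_branch(self, name: str, source: str) -> None:
        # Only pointers are copied; no data block is read or duplicated.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch: str, path: str) -> bytes:
        return self.store.blocks[self.branches[branch][path]]
```

Creating a branch here leaves the number of stored blocks unchanged; a new block appears only when a branch writes different content.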

lakeFS hooks

lakeFS hooks run checks and validations before key lifecycle events.[10] Unlike Git hooks, they trigger a remote server to run the tests.[10] They can be configured to validate table schemas when data is merged from development or test branches into production; if validation fails, the merge is blocked.[10] This makes hooks a tool for enforcing schemas and applying standardized rules across data sources and producers.[10]

Events that can trigger hooks include commits, merges, branch creation, and changes to tags.[11] For a merge, a pre-merge hook runs against the source branch before the merge is finalized.[11]
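The pre-merge behaviour described above can be sketched as a schema check that blocks a merge. The sketch runs the check in-process for simplicity, whereas lakeFS hooks call out to a remote server; `make_schema_check`, `merge_with_hooks`, and `MergeBlocked` are hypothetical names, and the schema representation (column name to type name) is an assumption.

```python
from typing import Callable, Dict, List

Schema = Dict[str, str]                # column name -> type name
Hook = Callable[[Schema], List[str]]   # returns a list of violations


class MergeBlocked(Exception):
    """Raised when a pre-merge hook reports violations."""


def make_schema_check(required: Schema) -> Hook:
    """Build a hook that verifies required columns and their types."""
    def check(source_schema: Schema) -> List[str]:
        problems = []
        for column, expected in required.items():
            actual = source_schema.get(column)
            if actual is None:
                problems.append(f"missing column: {column}")
            elif actual != expected:
                problems.append(f"{column}: expected {expected}, got {actual}")
        return problems
    return check


def merge_with_hooks(source_schema: Schema, pre_merge_hooks: List[Hook]) -> str:
    # Hooks inspect the source branch before the merge is finalized.
    for hook in pre_merge_hooks:
        problems = hook(source_schema)
        if problems:
            raise MergeBlocked("; ".join(problems))
    return "merged"
```

A source branch whose schema drops or retypes a required column raises `MergeBlocked`, so the production branch is never updated; extra columns pass in this sketch because only the required ones are checked.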

References

  1. Wayner, Peter (June 27, 2022). "LakeFS brings branching to data lakes". VentureBeat. Archived from the original on June 27, 2023. Retrieved June 27, 2023.
  2. Borck, James R. (October 18, 2021). "The best open source software of 2021". InfoWorld. Archived from the original on March 8, 2023. Retrieved July 18, 2023.
  3. Kerner, Sean Michael (June 22, 2022). "Treeverse set to launch lakeFS cloud data lake service". TechTarget. Archived from the original on June 27, 2023. Retrieved June 27, 2023.
  4. Goldberg, Niva (July 29, 2021). "Israeli Startup Treeverse Secures $23 Million for Open Source Technology". Jewish Business News. Archived from the original on July 8, 2023. Retrieved July 18, 2023.
  5. Sawers, Paul (July 28, 2021). "Treeverse raises $23M to bring Git-like version control to data lakes". VentureBeat. Archived from the original on September 24, 2023. Retrieved June 27, 2023.
  6. "v0.8.1". GitHub. Archived from the original on June 28, 2024. Retrieved June 27, 2023.
  7. "lakeFS Architecture". Archived from the original on August 10, 2023. Retrieved August 10, 2023.
  8. Orbach, Meir (July 28, 2021). "Treeverse raises $15 million Series A to leverage lakeFS". Calcalist. Archived from the original on July 7, 2023. Retrieved July 18, 2023.
  9. Martin, Noga (July 28, 2021). "Open source technology lakeFS secures $23M in funding". Israel Hayom. Archived from the original on July 10, 2023. Retrieved August 10, 2023.
  10. Hemo, Yaniv Ben (February 3, 2023). "How To Avoid "Schema Drift"". Archived from the original on August 10, 2023. Retrieved August 10, 2023.
  11. Avneri, Iddo (June 27, 2023). "Managing Schema Validation in a Data Lake Using Data Version Control". Archived from the original on August 11, 2023. Retrieved August 11, 2023.