datalake

Here are 176 public repositories matching this topic...

trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

java distributed-systems data-science sql database big-data presto hive hadoop analytics jdbc databases distributed-database query-engine iceberg datalake prestodb trino delta-lake

Updated Jun 19, 2023
Java

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Updated Jun 19, 2023
Python

apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.

bigdata stream-processing data-integration datalake apachespark hudi apachehudi incremental-processing apacheflink

Updated Jun 19, 2023
Java

treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data

go golang apache-spark aws-s3 google-cloud-storage data-engineering data-lake azure-storage data-version-control object-storage datalake hadoop-filesystem data-quality data-versioning azure-blob-storage apache-sparksql git-for-data lakefs datalakes

Updated Jun 19, 2023
Go

DataLinkDC / dinky

Dinky is an out-of-the-box, one-stop real-time computing platform dedicated to building and practicing unified streaming and batch processing and a unified data lake and data warehouse. Based on Apache Flink, Dinky can connect to many big data frameworks, including OLAP and data lake systems.

sql olap flink datawarehouse datalake dlink flinksql flinkcdc real-time-computing-platform

Updated Jun 19, 2023
Java

leo-project / leofs

The LeoFS Storage System

erlang s3 nfs s3-storage distributed-storage distributed-file-system leofs nfs-server datalake

Updated Jun 2, 2020
Erlang

lakesoul-io / LakeSoul

LakeSoul is an end-to-end, real-time, cloud-native Lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.

rust streaming sql big-data spark postgresql flink datalake lakehouse lakesoul

Updated Jun 16, 2023
Scala

zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

Updated Jun 17, 2023
Java

NetEase / arctic

Arctic is a streaming lake warehouse service open-sourced by NetEase.

bigdata datalake lakehouse

Updated Jun 19, 2023
Java

leesf / hudi-resources

A collection of Apache Hudi resources

bigdata apache stream-processing data-integration datalake hudi apachehudi incremental-processing hudi-resources

Updated Jun 18, 2023

Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0-compliant data warehouses (powered by dbt, an open-source data engineering tool and registered trademark of dbt Labs)

metadata sql etl snowflake datawarehousing dbt elt datawarehouse datalake dataengineering datavault datavault20 data-vault

Updated Jun 14, 2023

cuebook / cuelake

Use SQL to build ELT pipelines on a data lakehouse.

sql apache-spark etl pipelines data-engineering data-lake data-transfer delta data-integration upsert elt data-pipeline datalake data-ingestion spark-sql zeppelin-notebook apache-iceberg lakehouse incremental-updates

Updated May 25, 2022
JavaScript

japila-books / delta-lake-internals

The Internals of Delta Lake

books book internals datalake delta-lake deltalake

Updated Apr 13, 2023

awslabs / aws-orbit-workbench

A Data Platform built for AWS, powered by Kubernetes.

kubernetes aws jupyter analytics gpu jupyterhub data-analysis redshift mach workbench datalake dataengineering eks eks-cluster orbit-workbench

Updated Jun 2, 2023
Python

WeBankFinTech / Streamis

A streaming application development and management system based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development.

streaming kafka warehouse flink iceberg datalake hudi deltalake linkis dataspherestudio wedatasphere streamis

Updated Mar 21, 2023
Java

izhangzhihao / Real-time-Data-Warehouse

Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi

Updated Feb 24, 2022
Dockerfile

martandsingh / ApacheSpark

This repository will help you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work. We use PySpark and Spark SQL for development. At the end of the course, we also cover a few case studies.