The Wayback Machine - https://web.archive.org/web/20200809162739/https://github.com/topics/data-lake
Here are 90 public repositories matching this topic...
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Updated
Mar 9, 2020
Python
Generic Data Ingestion & Dispersal Library for Hadoop
Updated
Jan 31, 2020
Java
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Updated
Mar 5, 2020
Python
U-SQL Examples and Issue Tracking
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
Updated
Apr 6, 2020
Jupyter Notebook
Samples and Docs for Azure Data Lake Store and Analytics
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Updated
Aug 8, 2020
Python
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
Updated
Aug 4, 2020
Scala
Reference Architectures for Datalakes on AWS
Updated
May 13, 2020
HTML
Query API for aggregated Zeebe data
Updated
Aug 7, 2020
Kotlin
Apache Spark Course Material
Updated
Jul 26, 2020
Scala
Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing reddit comments in realtime. 100-200 level tutorial.
Updated
Jun 26, 2020
Python
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Updated
Jul 27, 2020
Python
A K8s-based infrastructure for analytics
Updated
Jan 15, 2020
Shell
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
EU Budget for Results - Data Lake
Updated
Dec 2, 2019
JavaScript
Terraform module for an Azure Data Lake
Framework to quickly build and maintain Smart Data Lakes
Updated
Aug 7, 2020
Scala
Logstash output plugin for Azure Data Lake Store (ADLS)
Updated
Sep 15, 2017
Ruby
Demonstration of a Hive Input Format for Iceberg
Updated
Jun 17, 2020
Java
📒 (GitBook) A curated list of awesome Data Engineering resources
Sample and tutorial that creates interactive dashboards using: Dynamic Dashboard Embedded, Cloud Object Storage, SQL Query, DB2 Warehouse and AppID.
Updated
Jul 20, 2020
TypeScript
An idiomatic Kotlin dataframe toolkit for data engineering tasks on datasets of any size
Updated
Jul 30, 2020
Kotlin
Personal Data Engineering Projects
Updated
Jun 3, 2020
Jupyter Notebook
Apache Spark 3 - Structured Streaming Course Material
Updated
Aug 5, 2020
Python
The Simple Data Lake - Data Kale
Updated
Apr 17, 2020
Python
Create Data Lake on AWS S3 to store dimensional tables after processing data using Spark on AWS EMR cluster
Updated
Oct 10, 2019
Python
Prominent data platform design with AWS well-architected framework
Updated
Dec 20, 2019
Python