The Wayback Machine - https://web.archive.org/web/20200730133151/https://github.com/topics/data-engineering
Here are 468 public repositories matching this topic...
A modern data workflow platform
Updated
Jul 30, 2020
Python
📊 📋 Dashboards using YAML or JSON files
Updated
Jul 29, 2020
JavaScript
A list of useful resources to learn Data Engineering from scratch
📚 Curated papers, articles & videos on data science & machine learning applied in production, with results.
Quilt is a versioned data portal for S3
Updated
Jul 30, 2020
Jupyter Notebook
Updated
Jul 30, 2020
Python
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Updated
Jun 30, 2020
Jupyter Notebook
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Updated
Mar 9, 2020
Python
Clean APIs for data cleaning. Python implementation of R package Janitor
Updated
Jul 30, 2020
Python
Example project implementing best practices for PySpark ETL jobs and applications.
Updated
Jul 9, 2020
Python
📝 A compilation of everything that I learn: computer science, software development, engineering, math, and coding in general. Read the rendered results here ->
Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing workflows on various cluster computing platforms. Please see https://github.com/cwensel/cascading for access to all WIP branches.
Updated
Nov 29, 2018
Java
A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
Updated
Mar 5, 2020
Python
Updated
Apr 20, 2020
Python
Open Metadata and Governance
Updated
Jul 30, 2020
Java
Dataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift
Updated
Jul 30, 2020
TypeScript
A daily digest of the articles or videos I've found interesting, that I want to share with you.
An Awesome List of Open-Source Data Engineering Projects
A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.
Updated
Jun 18, 2020
Python
An automatic ML model optimization tool.
Updated
Jul 28, 2020
Python
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Updated
Jun 24, 2020
Python
Study materials for the Google Cloud Professional Data Engineering Exam
Updated
Jul 29, 2020
HTML
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Updated
Jun 30, 2020
Scala
Projects done in the Data Engineering Nanodegree by Udacity.com
Updated
Aug 7, 2019
Jupyter Notebook
Interactive computing for complex data processing, modeling and analysis in Python 3
Updated
Feb 24, 2020
Python
Tool to build production-ready pipelines for experimentation with Kedro and MLflow
Updated
Jun 22, 2020
Python
Ansible playbook to deploy distributed technologies
Updated
Nov 20, 2017
Python
The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation
Updated
Jul 30, 2020
Java
Documentation for data enthusiasts
Updated
Jul 17, 2020
JavaScript