The Wayback Machine - https://web.archive.org/web/20200627074006/https://github.com/topics/parquet

parquet

Here are 155 public repositories matching this topic...

Jonathanpro
Jonathanpro commented Jan 2, 2019

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, as the cluster uses Kerberos for authentication, using petastorm failed.
I figured out that petastorm relies on pyarrow, which does support Kerberos authentication.

I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with

driver = 'libhdfs'
return pyarrow.hdfs.c
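The workaround above can be sketched as follows. This is a minimal sketch assuming pyarrow's legacy `pyarrow.hdfs.connect` API; the host and port values are hypothetical placeholders, not anything from the original issue.

```python
# Minimal sketch, assuming pyarrow's legacy hdfs API: forcing the
# JVM-backed "libhdfs" driver, which delegates authentication to the
# local Hadoop configuration and so honors Kerberos tickets from kinit.

def hdfs_connect_kwargs(host="default", port=0, user=None):
    # Hypothetical helper: builds the arguments the patched namenode.py
    # would pass. The key point is driver="libhdfs" instead of the default.
    return {"host": host, "port": port, "user": user, "driver": "libhdfs"}

# On a real Kerberized cluster this would then be used as:
# import pyarrow
# fs = pyarrow.hdfs.connect(**hdfs_connect_kwargs("namenode.example.com", 8020))
```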

80+ DevOps & Data CLI Tools - AWS, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, Ambari, Blueprints, CloudFormation, Elasticsearch, Solr, Pig, IPython - Python / Jython Tools

  • Updated Jun 12, 2020
  • Python
pystore
yohplala
yohplala commented Jan 6, 2020

Hello,

I haven't tested append() yet, and I was wondering whether duplicates are removed when an append is performed.
I had a look at the collection.py script, and the following function is used:
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

After a look at the pandas documentation, I understand that duplicate lines are removed and only the last occurrence is kept.
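The dedup behavior described above can be demonstrated with plain pandas (pystore actually concatenates dask dataframes via `dd.concat`, but `drop_duplicates(keep="last")` behaves the same way; the sample data is made up for illustration):

```python
# Sketch of the append-dedup step: concatenate old and new data, then
# drop fully duplicated rows, keeping only the last occurrence.
import pandas as pd

current = pd.DataFrame({"ts": [1, 2], "v": [10.0, 20.0]})
new = pd.DataFrame({"ts": [2, 3], "v": [20.0, 30.0]})  # row (2, 20.0) overlaps

combined = pd.concat([current, new]).drop_duplicates(keep="last")
# The overlapping row survives only once (its last occurrence).
```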

idreeskhan
idreeskhan commented Dec 30, 2019

Over time we've had some things leak into the diff methods that make it more cumbersome to use BigDiffy via code instead of CLI.

For example diffAvro here https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to manually pass in a schema, otherwise they receive a non-informative error regarding a null schema, add

A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL

  • Updated Feb 1, 2019
  • TypeScript
hannesmiller
hannesmiller commented Mar 7, 2017

For documentation???

Spark is great at parallel processing of data already in a distributed store like HDFS, but it's not really designed for ingesting data at rest from a non-distributed store like a local file system, though there is support for it, i.e. local mode.

The disadvantages of ingesting data at rest from a local file system:

  • There's no advantage in using YARN o
tobias-hd
tobias-hd commented Aug 13, 2019

When opening a parquet file, ParquetViewer first launches a popup "Select fields to load", where you can either confirm to load all fields or select the fields you want.

In all use cases relevant to me, I want to display all fields. Hence I'm wondering if it would be possible to skip this popup altogether? It's just inconvenient to always confirm "All fields..." before you see any data.

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast iterative access to datasets. This project will have sample programs for Spark in the Scala language.

  • Updated May 15, 2020
  • Scala
