The Wayback Machine - https://web.archive.org/web/20200627074006/https://github.com/topics/parquet

parquet

Here are 155 public repositories matching this topic...

Jonathanpro
Jonathanpro commented Jan 2, 2019

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, as the cluster uses Kerberos for authentication, using petastorm failed.
I figured out that petastorm relies on pyarrow, which does support Kerberos authentication.

I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with

driver = 'libhdfs'
return pyarrow.hdfs.c
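The workaround above can be sketched as follows. This is a minimal sketch assuming pyarrow's legacy `pyarrow.hdfs.connect` API; the host and port values are hypothetical placeholders, not anything from the original issue.

```python
# Minimal sketch, assuming pyarrow's legacy hdfs API: forcing the
# JVM-backed "libhdfs" driver, which delegates authentication to the
# local Hadoop configuration and so honors Kerberos tickets from kinit.

def hdfs_connect_kwargs(host="default", port=0, user=None):
    # Hypothetical helper: builds the arguments the patched namenode.py
    # would pass. The key point is driver="libhdfs" instead of the default.
    return {"host": host, "port": port, "user": user, "driver": "libhdfs"}

# On a real Kerberized cluster this would then be used as:
# import pyarrow
# fs = pyarrow.hdfs.connect(**hdfs_connect_kwargs("namenode.example.com", 8020))
```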

80+ DevOps & Data CLI Tools - AWS, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, Ambari, Blueprints, CloudFormation, Elasticsearch, Solr, Pig, IPython - Python / Jython Tools

  • Updated Jun 12, 2020
  • Python
pystore
yohplala
yohplala commented Jan 6, 2020

Hello,

I haven't tested append() yet, and I was wondering whether duplicates are removed when an append is performed.
I had a look at the collection.py script, and the following function is used:
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

After a look at the pandas documentation, I understand that duplicate lines are removed and only the last occurrence is kept.
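The dedup behavior described above can be demonstrated with plain pandas (pystore actually concatenates dask dataframes via `dd.concat`, but `drop_duplicates(keep="last")` behaves the same way; the sample data is made up for illustration):

```python
# Sketch of the append-dedup step: concatenate old and new data, then
# drop fully duplicated rows, keeping only the last occurrence.
import pandas as pd

current = pd.DataFrame({"ts": [1, 2], "v": [10.0, 20.0]})
new = pd.DataFrame({"ts": [2, 3], "v": [20.0, 30.0]})  # row (2, 20.0) overlaps

combined = pd.concat([current, new]).drop_duplicates(keep="last")
# The overlapping row survives only once (its last occurrence).
```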

idreeskhan
idreeskhan commented Dec 30, 2019

Over time we've had some things leak into the diff methods that make it more cumbersome to use BigDiffy via code instead of CLI.

For example diffAvro here https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to manually pass in a schema, otherwise they receive a non-informative error regarding a null schema, add

A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL

  • Updated Feb 1, 2019
  • TypeScript
hannesmiller
hannesmiller commented Mar 7, 2017

For documentation???

Spark is great at parallel processing of data already in a distributed store like HDFS, but it's not really designed for ingesting data at rest from a non-distributed store like a local file system, though there is support for it, i.e. local mode.

The disadvantages of ingesting data at rest from a local file system:

  • There's no advantage in using YARN o
tobias-hd
tobias-hd commented Aug 13, 2019

When opening a parquet file, ParquetViewer first launches a popup "Select fields to load", where you can either confirm to load all fields or select the fields you want.

In all use cases relevant to me, I want to display all fields. Hence I'm wondering if it would be possible to skip this popup altogether? It's just inconvenient to always confirm "All fields..." before you see any data.

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast iterative access to datasets. This project will have sample programs for Spark in the Scala language.

  • Updated May 15, 2020
  • Scala
