jnorthrup / columnar

big dataframes

description

This is an idiomatic Kotlin dataframe toolkit to support data engineering tasks on collections of datasets of any size.

The primary focus of this toolkit is to support Pandas-like operations on a Dataframe iterator for large data extractions using function assignment and deferred reification instead of in-memory data manipulation.

So far, these are the fundamental composable unary operators (val newcursor = oldcursor.operator):

Resampling time-series datasets on LocalDate/LocalTime columns

cursor.resample(indexes)
Pivot any columns into any collection of other columns

cursor.pivot(preservedcolumns,newcolumnheaders,expansiontargets)
Group with reducers

cursor.group(columns,{reducer})
Slice, reorder, and join columns
- cursor[0] - slice the first column only
- cursor[0,1,2] - slice the first three columns
- cursor[(0 until 3).painfulKotlinCastFunctions] - slice the first three columns
- cursor[3,2,1,3,2, 1,1,1,1,2] - remap 3 source columns into 10 columns
- join(cursor[0],cursor[2],othercursor[0],...) - join any permutation of source cursors/columns as one cursor. Bounds checking is not done upfront here; know your row sizes.
Random access across combined rows from different sources

combine(cursor1,cursor..n,) - a binary-searched column dispatch into n cursors. Column bounds checking is not done here. Non-uniform column meta-models per cursor are built into the blackboard driver design to arrive at spreadsheet functionality (todo: formalization of cell functions).
Simplified one-hot encoding

cursor[0,1].categories([DummySpec.last])
ISO, Lunar, and Islamic JVM Calendar Time-Series support

...almost
- todo: Javanese + Balinese Calendars
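
The binary-searched column dispatch described for combine above can be illustrated with a small standalone sketch. The names `Source` and `combineColumns` are hypothetical and are not the library's API; the sketch assumes every source has at least one column and, like the operator it illustrates, does no bounds checking.

```kotlin
// Hypothetical column source: a column count plus a function from local column index to value.
class Source(val width: Int, val cell: (Int) -> Any?)

// Returns a lookup over the concatenated columns of all sources.
// Assumes every source has width >= 1; no bounds checking is done.
fun combineColumns(vararg sources: Source): (Int) -> Any? {
    // starts[k] is the global index of the first column of source k.
    val starts = IntArray(sources.size)
    var total = 0
    for (k in sources.indices) {
        starts[k] = total
        total += sources[k].width
    }
    return { x: Int ->
        val found = starts.binarySearch(x)
        // A miss returns -(insertionPoint) - 1; the owning source sits just before the insertion point.
        val k = if (found >= 0) found else -found - 2
        sources[k].cell(x - starts[k])
    }
}

fun main() {
    val a = Source(2) { "a$it" }
    val b = Source(3) { "b$it" }
    val c = Source(1) { "c$it" }
    val combined = combineColumns(a, b, c)
    println((0 until 6).map(combined)) // [a0, a1, b0, b1, b2, c0]
}
```

The dispatch costs O(log n) in the number of sources per access and copies nothing; each source keeps its own column model.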

runtime objects

The familiar dataset abstractions are as follows:

Cursor: a cursor is a typealias of a Vector (Vect0r) of Rows, accessible first by row (y) and then by a Vector of column pairs (value, type) on the x axis. This Row is a typealias called RowVec. Future implementations will include more complex arrangements of x, y, z and more, as described in the CoroutineContext at creation time.

Table: generally speaking, a virtual array of driver-specific x, y, z read and write access on homogeneous and heterogeneous backing stores.

Kotlin CoroutineContext (documented elsewhere) is the defining collection of factors describing the Table and Cursor configurations above. It uses ContextElements to differentiate driver-level execution strategies at creation from common top-level interfaces. One source input may potentially be accessed both as y,x and as x,y from two driver configurations.

architecture

The initial focus of the implementation rests on the fixed-width file (FWF) format obtainable via the companion project flatsql, part of jdbc2json.
The library is designed to leverage the ISAM properties of FWF and to extend toward reading and creating other data formats, such as binary rowsets and scalar column index volumes.
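
The ISAM property in question is that every record has the same byte length, so any field can be located by arithmetic alone, without scanning for delimiters. A minimal sketch of that addressing follows; the class `FixedWidth`, its parameters, and the sample layout are assumptions made for illustration, not the library's driver. It reads from an in-memory ByteBuffer, and a buffer obtained from FileChannel.map could be passed in the same way.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical layout: each record is recordLen bytes (newline included),
// and each column is a byte range within the record.
class FixedWidth(val buf: ByteBuffer, val recordLen: Int, val columns: List<IntRange>) {
    val rows: Int get() = buf.limit() / recordLen

    // O(1) field access: offset = row * recordLen + column start.
    // Int arithmetic limits this sketch to buffers under 2 GiB.
    fun field(row: Int, col: Int): String {
        val range = columns[col]
        val bytes = ByteArray(range.last - range.first + 1)
        val view = buf.duplicate() // independent position, shared content
        view.position(row * recordLen + range.first)
        view.get(bytes)
        return String(bytes, StandardCharsets.US_ASCII).trim()
    }
}

fun main() {
    // Two 10-byte records: a 4-byte id, a 5-byte name, and a newline.
    val text = "0001alice\n0002bob  \n"
    val fwf = FixedWidth(
        ByteBuffer.wrap(text.toByteArray(StandardCharsets.US_ASCII)),
        recordLen = 10,
        columns = listOf(0 until 4, 4 until 9)
    )
    println(fwf.rows)        // 2
    println(fwf.field(1, 1)) // bob
}
```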

internals: vectorlike

implementation distinctions from other implementations

The implementation relies on a set of typealiases and extension functions approximating various pure-functional constructs and retaining off-heap and deferred/lazy processing semantics.

To explain this briefly: the typealias feature in Kotlin enables a Pair (Pai2) as an interface, which provides Vectors (Vect0r) as pairs of a size and a function, along with some rich many-to-one indexing operations in function composition.

Operations on this particular Pair (Pai2) can map list or sequence semantics onto primitive arrays, or dynamically destructure a Vect0r to Vect02<First,Second> by casting alone and perform aggregate left and right functions without conversion.
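
To make the size-and-function idea concrete, here is a minimal sketch that uses the standard kotlin.Pair in place of the library's Pai2. The names `LazyVec`, `asLazyVec`, `remap`, `mapLazy`, and `reify` are hypothetical stand-ins, not the library's identifiers. It shows many-to-one indexing and deferred evaluation: nothing is computed until an element is read.

```kotlin
// Hypothetical stand-in for Vect0r: a size paired with an index function.
typealias LazyVec<T> = Pair<Int, (Int) -> T>

// View a primitive array as a lazy vector without copying it.
fun IntArray.asLazyVec(): LazyVec<Int> = size to { i: Int -> this[i] }

// Many-to-one remap: output position k reads source position indexes[k].
fun <T> LazyVec<T>.remap(vararg indexes: Int): LazyVec<T> =
    indexes.size to { k: Int -> second(indexes[k]) }

// Deferred map: f runs only when an element is actually read.
fun <T, R> LazyVec<T>.mapLazy(f: (T) -> R): LazyVec<R> =
    first to { i: Int -> f(second(i)) }

// Reification: materialize every element into a List.
fun <T> LazyVec<T>.reify(): List<T> = List(first) { i -> second(i) }

fun main() {
    val source = intArrayOf(10, 20, 30).asLazyVec()
    // Three source elements projected into four outputs, then doubled on read.
    val view = source.remap(2, 1, 0, 2).mapLazy { it * 2 }
    println(view.reify()) // [60, 40, 20, 60]
}
```

Because each operator only wraps the previous index function, a chain of operators costs one function call per layer at read time and no intermediate collections.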

features and todo

Blackboard defined Table, Cursor, Row metadata driving access behaviors (using CoroutineContext.Elements)
read an FWF text file and efficiently mmap the row access; it becomes a Cursor. [1]
enable index operations, reordering, expansions, preserving column metadata
resample timeseries data (jvm LocalDate initially) to fill in series gaps
concatenation of n cursors from dissimilar FP projections
pivot n rows by m columns (lazy) preserving l left-hand-side pass-thru columns
groupby n columns
cursor.group(n..){reducer}
One-hot Encodings
min/max scaling (same premise as resampling above)
support Numerics, Linear Algebra libraries
support for (resampling) Calendar, Time and Units conversion libraries
orthogonal offheap and indirect IO component taxonomy
nearly 0-copy direct access
nearly 0-heap direct access
large file access: JVM NIO mmap window addressability beyond MAXINT bytes
Algebraic Vector aggregate operations with lazy runtime execution on contents
Mapper Buffer pools
Access (named) Columns by name
heap object Object[][] cursor mappings - had I done this first, it would never have gone off-heap.
Review as Java lib via maven. what is available, what's not.
a token amount of jvm switch testing.
textual field format IO/mapping
binary field format IO/mapping (network endian binary int/long/ieee)
basic ISAM [de]hydration - in review of network-endian binary fwf, we ought to just call it ISAM.
idempotent ISAM - ISAM volumes can be cryptographically digested to give a placeholder of the contents in operator expressions of transforms
sharded [de]hydration - unit test to shard dayjob sample by column
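
As an illustration of the "resample timeseries data to fill in series gaps" item above, the following sketch fills missing calendar days lazily. The function `resampleDaily` and its Map input are assumptions for the example and do not reflect the signature of cursor.resample; gap days carry null rather than any interpolated value.

```kotlin
import java.time.LocalDate

// Lazily yields one entry per calendar day from the earliest to the latest observed date.
// Days without an observation carry null.
fun <V> resampleDaily(observed: Map<LocalDate, V>): Sequence<Pair<LocalDate, V?>> {
    val first = observed.keys.minOrNull() ?: return emptySequence()
    val last = observed.keys.maxOrNull() ?: return emptySequence()
    return generateSequence(first) { d -> if (d < last) d.plusDays(1) else null }
        .map { d -> d to observed[d] }
}

fun main() {
    val observed = mapOf(
        LocalDate.of(2020, 1, 1) to 1.0,
        LocalDate.of(2020, 1, 4) to 4.0
    )
    // Four rows are produced; the two gap days have a null value.
    resampleDaily(observed).forEach { (day, value) -> println("$day $value") }
}
```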

lower priorities (as-yet unimplemented orthogonals)

gossip mesh addressable cursor iterators (this branch) [2]
json field format IO/mapping
CBOR field format IO/mapping
csv IO tokenization +- headers
columnstore access patterns +- apache arrow compatibility
matrix math integrations with adjacent ecosystem libraries
key-value associative cursor indexes
hilbert curve iterators for converting (optimal/bad) cache affinity patterns to (good/good) cache affinity
R-Tree n-dimensional associative
parallel and concurrent access helpers
explicit platter and direct partition mapping
jdbc adapters [1]
sql query language driver [1]
jq query language driver
sharded datatables - IO Routines to persist idempotent rowsets across multiple rows and column divisions
mutability facade - append-only journal of volume mutations as idempotent transformation expressions
adaptive cavitation - consolidate and redivide TableRoots based on mutation patterns to absorb journalized mutations into contiguous and conjoined volumes with join and combine operators against digests

[1]: downstream of jdbc2json

[2]: borrowing from SWIM Implementation here

Figure below: Orthogonal Context elements (Sealed Class Hierarchies).

These describe different aspects of accessing data and projecting cursors and matrix transformations. They are easy to think of as hierarchical threadlocals to achieve IO-bound storage access to large datasets.

inspired by the STXXL project

priorities and organization

The code is only about the composable cursor abstraction. This part has been made as clean as possible.

However, the driver code is complex, the capabilities are unbounded, and the preamble for a cursor on the existing NIO driver is a little unsightly. It is hoped that macro simplifications can converge with similar libraries in the long run. The driver code is intended to be orthogonal rather than the cleanest possible implementation of one format, and for this reason the overly abstract class hierarchy was not collapsed after writing the first IO driver.

Experiments show that IO arrangement is the biggest factor enabling algorithmic code to compete with, and sometimes outperform, embarrassingly parallel approaches.

Scalability! But at what COST? slides

The tradeoff here is that a simplistic format-only serializer interface is going to induce users to write for loops to fix up near misses, instead of having composability first. This was my early experience with Pandas. For whatever reason, Pandas has a C++ optimized non-ISAM CSV reader, but its FWF implementation does not exploit the fixed-width guarantees and benchmarks much better on CSV than on FWF, when the hardware support is quite the opposite.

The end product of a blackboard driver construction layer is hopefully a format construction DSL to accommodate a variety of common and slightly tweaked combinations of IO and encodings. Progress here should have no impact on the Columnar cursor user API; datasources should be abstract even if there are potentially multiple driver implementations to enable better IO for specific cases.

jvm switches

Using -server -Xmx24g -XX:MaxDirectMemorySize=1G outperforms everything I've tried to hand-tune before adding -server.
