big dataframes
description
This is an idiomatic Kotlin dataframe toolkit to support data engineering tasks on dataset collections of any size.
The primary focus of this toolkit is to support Pandas-like operations on a Dataframe iterator for large data extractions, using function assignment and deferred reification instead of in-memory data manipulation.
So far, these are the fundamental composable unary operators (val newcursor = oldcursor.operator):
- Resampling time-series datasets on LocalDate/LocalTime columns
  - cursor.resample(indexes)
- Pivot any columns into any collection of other columns
  - cursor.pivot(preservedcolumns,newcolumnheaders,expansiontargets)
- Group with reducers
  - cursor.group(columns,{reducer})
- Slice, reorder, and join columns
  - cursor[0] - slice first column only
  - cursor[0,1,2] - slice first three columns
  - cursor[(0 until 3).painfulKotlinCastFunctions] - slice first three columns
  - cursor[3,2,1,3,2,1,1,1,1,2] - remap 3 source columns into 10 columns
  - join(cursor[0],cursor[2],othercursor[0],...) - join any permutation of source cursor/columns as one cursor. Bounds checking is not done upfront here; know your row sizes.
- Random access across combined rows from different sources
  - combine(cursor1,cursor..n,) - a binary-searched column dispatch into n cursors. Column bounds checking is not done here. Non-uniform column meta-models are built into the blackboard driver design to arrive at spreadsheet functionality (todo: formalization of cell functions).
- Simplified one-hot encoding
  - cursor[0,1].categories([DummySpec.last])
- ISO, Lunar, and Islamic JVM Calendar time-series support
  - ...almost
  - todo: Javanese + Balinese Calendars
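
The binary-searched dispatch behind combine can be pictured roughly as follows. This is a minimal sketch, not the library's code: RowSource and combineRows are hypothetical names invented here, a row is assumed to be a plain List, and the sketch dispatches on row index (following the "combined rows" description above).

```kotlin
// Hypothetical stand-in for a cursor: a row count plus a function from row index to row.
// The library's own Cursor/Vect0r types are not reproduced here.
class RowSource(val size: Int, val row: (Int) -> List<Any?>)

// Concatenate sources row-wise without copying. Each lookup binary-searches the
// cumulative row counts to find the owning source, then rebases the index.
fun combineRows(vararg sources: RowSource): RowSource {
    // ends[i] = total number of rows in sources[0..i], e.g. sizes 3,2,4 -> 3,5,9
    val ends = IntArray(sources.size)
    var total = 0
    for (i in sources.indices) {
        total += sources[i].size
        ends[i] = total
    }
    return RowSource(total) { y ->
        // This row check is an addition for the sketch's own safety.
        require(y in 0 until total) { "row $y out of bounds for $total rows" }
        // Find the first source whose cumulative end is greater than y.
        var lo = 0
        var hi = ends.size - 1
        while (lo < hi) {
            val mid = (lo + hi) ushr 1
            if (ends[mid] > y) hi = mid else lo = mid + 1
        }
        val start = if (lo == 0) 0 else ends[lo - 1]
        sources[lo].row(y - start)
    }
}
```

Nothing is read from any source until a row is requested, so combining n sources costs one small IntArray, and each access costs O(log n) on top of the underlying read.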
runtime objects
The familiar dataset abstractions are as follows:
Cursor: a typealias for a Vector (Vect0r) of Rows, accessible first by row (y) and then by a Vector of column pairs (value, type) on the x axis. The Row is a typealias called RowVec. Future implementations will include more complex arrangements of x, y, z and more, as described in the CoroutineContext at creation time.
Table: generally speaking, a virtual array of driver-specific x, y, z read and write access on homogeneous and heterogeneous backing stores.
Kotlin CoroutineContext: documented elsewhere, this is the defining collection of factors describing the Table and Cursor configurations above, using ContextElements to differentiate driver-level execution strategies at creation from common top-level interfaces. One source input may potentially be accessed y,x and x,y from two driver configurations.
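
To make the role of context elements concrete, here is a minimal sketch using only the kotlin.coroutines types from the standard library. AxisOrder, FieldEncoding, and readsRowFirst are hypothetical names made up for this illustration; the library's actual element hierarchy is not shown here.

```kotlin
import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext

// Hypothetical element: the axis order a driver uses to address cells.
sealed class AxisOrder : AbstractCoroutineContextElement(AxisOrder) {
    companion object Key : CoroutineContext.Key<AxisOrder>
    object RowMajor : AxisOrder()      // address by y, then x
    object ColumnMajor : AxisOrder()   // address by x, then y
}

// Hypothetical element: how fields are encoded in the backing store.
sealed class FieldEncoding : AbstractCoroutineContextElement(FieldEncoding) {
    companion object Key : CoroutineContext.Key<FieldEncoding>
    object Text : FieldEncoding()
    object Binary : FieldEncoding()
}

// A driver looks up only the elements it cares about, by key.
fun readsRowFirst(ctx: CoroutineContext): Boolean = ctx[AxisOrder] === AxisOrder.RowMajor

// Two configurations over the same source: adding an element with an existing key
// replaces that element and leaves the others alone.
val rowFirst: CoroutineContext = AxisOrder.RowMajor + FieldEncoding.Text
val columnFirst: CoroutineContext = rowFirst + AxisOrder.ColumnMajor
```

Because each sealed hierarchy owns one key, a context can hold at most one choice per aspect, which is what lets the aspects stay orthogonal.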
architecture
The initial focus of the implementation rests on the fixed-width file format obtainable via the companion project
flatsql, part of jdbc2json.
The library is designed to leverage the ISAM properties of FWF and to extend toward reading and creating other
data formats such as Binary rowsets and Scalar Column index volumes.
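
The ISAM property that matters here is that every record has the same byte length, so a cell's position is arithmetic rather than a scan. A minimal sketch of that idea, assuming ASCII text and a hypothetical FwfLayout/readCell pair that is not part of the library:

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical fixed-width layout: each column is an inclusive byte range within
// a record, and every record is exactly recordLen bytes (line terminator included).
class FwfLayout(val recordLen: Int, val columns: List<IntRange>)

// Read one cell without scanning preceding rows: the offset is pure arithmetic.
fun readCell(buf: ByteBuffer, layout: FwfLayout, row: Int, col: Int): String {
    val range = layout.columns[col]
    val offset = row * layout.recordLen + range.first
    val bytes = ByteArray(range.last - range.first + 1)
    // Work on a duplicate so the shared buffer's position is left untouched.
    val view = buf.duplicate()
    view.position(offset)
    view.get(bytes)
    return String(bytes, StandardCharsets.US_ASCII).trim()
}
```

The same function works on a heap buffer or a memory-mapped one, since both are ByteBuffers. The Int offset limits this sketch to buffers under 2 GB.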
internals: vectorlike
implementation distinctions from other implementations
The implementation relies on a set of typealiases and extension functions approximating various pure-functional constructs and retaining off-heap and deferred/lazy processing semantics.
To explain a little more: Kotlin's typealias feature lets a Pair (Pai2) serve as an interface, which provides Vectors (Vect0r) as pairs of a size and a function, along with some rich many-to-one indexing operations in function composition.
Operations on this particular Pair (Pai2) may be the mechanism for mapping list or sequence semantics onto primitive arrays, or for dynamically destructuring a Vect0r into a Vect02<First,Second> by casting alone and performing aggregate left and right functions without conversion.
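
The "pair of size and function" idea can be sketched with the standard Pair alone. LazyVec and its helpers below are hypothetical names chosen for this illustration; they only approximate the shape described above and are not the library's Pai2/Vect0r definitions.

```kotlin
// Hypothetical alias: a vector is just its size paired with an index function.
typealias LazyVec<T> = Pair<Int, (Int) -> T>

// Indexing calls the function; nothing is stored.
operator fun <T> LazyVec<T>.get(i: Int): T = second(i)

// Mapping composes functions; no element is computed until it is read.
fun <T, R> LazyVec<T>.mapLazy(f: (T) -> R): LazyVec<R> {
    val source = second
    return Pair(first, { i: Int -> f(source(i)) })
}

// Many-to-one indexing: the result has index.size elements, each read through the source.
fun <T> LazyVec<T>.remap(vararg index: Int): LazyVec<T> {
    val source = second
    return Pair(index.size, { i: Int -> source(index[i]) })
}

// Give a primitive array vector semantics without copying it.
fun IntArray.asLazyVec(): LazyVec<Int> = Pair(size, { i: Int -> this[i] })

// Reify on demand, e.g. at the end of a pipeline.
fun <T> LazyVec<T>.materialize(): List<T> = List(first) { second(it) }
```

For example, intArrayOf(10, 20, 30).asLazyVec().remap(2, 2, 0, 1) describes a four-element vector backed by the original three-element array; the elements are only read when materialize() or an index access asks for them.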
features and todo
- Blackboard defined Table, Cursor, Row metadata driving access behaviors (using CoroutineContext.Elements)
- read an FWF text and efficiently mmap the row access, it becomes a Cursor. [1]
- enable index operations, reordering, expansions, preserving column metadata
- resample timeseries data (jvm LocalDate initially) to fill in series gaps
- concatenation of n cursors from dissimilar FP projections
- pivot n rows by m columns (lazy) preserving l left-hand-side pass-thru columns
- groupby n columns
- cursor.group(n..){reducer}
- One-hot Encodings
- min/max scaling (same premise as resampling above)
- support Numerics, Linear Algebra libraries
- support for (resampling) Calendar, Time and Units conversion libraries
- orthogonal offheap and indirect IO component taxonomy
- nearly 0-copy direct access
- nearly 0-heap direct access
- large file access: JVM NIO mmap window addressability beyond MAXINT bytes
- Algebraic Vector aggregate operations with lazy runtime execution on contents
- Mapper Buffer pools
- Access (named) Columns by name
- heap object Object[][] cursor mappings - if I had done this first, it would never have gone off-heap.
- Review as Java lib via maven. what is available, what's not.
- a token amount of jvm switch testing.
- textual field format IO/mapping
- binary field format IO/mapping (network endian binary int/long/ieee)
- basic ISAM [de]hydration - in review of network-endian binary fwf, we ought to just call it ISAM.
- idempotent ISAM - ISAM volumes can be cryptographically digested to give a placeholder of the contents in operator expressions of transforms
- sharded [de]hydration - unit test to shard dayjob sample by column
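
For the large-file item in the list above: a single MappedByteBuffer is Int-indexed, so addressing beyond MAXINT bytes means splitting a Long position into a window number and an offset within that window. A minimal sketch under that assumption, using the standard FileChannel.map call; WindowedFile is a hypothetical helper and not the library's driver.

```kotlin
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Hypothetical helper: address a file of any length through fixed-size mmap windows.
class WindowedFile(private val channel: FileChannel, private val windowSize: Long = 1L shl 30) {
    init {
        require(windowSize in 1..Int.MAX_VALUE.toLong()) { "a window must fit in an Int-indexed buffer" }
    }

    private val windows = HashMap<Long, MappedByteBuffer>()

    // Which window holds this absolute byte position, and where inside it?
    fun windowIndex(pos: Long): Long = pos / windowSize
    fun windowOffset(pos: Long): Int = (pos % windowSize).toInt()

    // Map windows lazily; the last window is clipped to the end of the file.
    private fun window(index: Long): MappedByteBuffer = windows.getOrPut(index) {
        val start = index * windowSize
        val length = minOf(windowSize, channel.size() - start)
        channel.map(FileChannel.MapMode.READ_ONLY, start, length)
    }

    // Absolute read; does not move any buffer position.
    fun byteAt(pos: Long): Byte = window(windowIndex(pos)).get(windowOffset(pos))
}
```

With fixed-width records, choosing a window size that is a multiple of the record length keeps any single record from straddling two windows; the sketch does not handle reads that cross a window boundary.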
lower priorities (as-yet unimplemented orthogonals)
- gossip mesh addressable cursor iterators (this branch) [2]
- json field format IO/mapping
- CBOR field format IO/mapping
- csv IO tokenization +- headers
- columnstore access patterns +- apache arrow compatibility
- matrix math integrations with adjacent ecosystem libraries
- key-value associative cursor indexes
- hilbert curve iterators for converting (optimal/bad) cache affinity patterns to (good/good) cache affinity
- R-Tree n-dimensional associative
- parallel and concurrent access helpers
- explicit platter and direct partition mapping
- jdbc adapters [1]
- sql query language driver [1]
- jq query language driver
- sharded datatables - IO Routines to persist idempotent rowsets across multiple rows and column divisions
- mutability facade - append-only journal of volume mutations as idempotent transformation expressions
- adaptive cavitation - consolidate and redivide TableRoots based on mutation patterns to absorb journalized mutations into contiguous and conjoined volumes with join and combine operators against digests
[1]: downstream of jdbc2json
[2]: borrowing from SWIM Implementation here
Figure below: Orthogonal Context elements (Sealed Class Hierarchies).
These describe different aspects of accessing data and projecting cursors and matrix transformations. They are easy to think of as hierarchical threadlocals used to achieve IO-bound storage access to large datasets.
inspired by the STXXL project
priorities and organization
The code is only about composable cursor abstraction. This part has been made as clean as possible.
However, the driver code is complex, the capabilities are unbounded, and the preamble for a cursor on the existing NIO driver is a little bit unsightly. It is hoped that macro simplifications can converge with similar libraries in the long run. The driver code is intended to be orthogonal rather than the cleanest possible implementation of one format, and the overly abstract class hierarchy was not collapsed after writing the first IO driver for this reason.
Experiments show IO arrangement is the biggest factor enabling algorithmic code to compete with, and
sometimes outperform, embarrassingly parallel approaches
(see the "Scalability! But at what COST?" slides).
The tradeoff here is that a simplistic format-only serializer interface is going to induce users to write for loops to fix up near misses, instead of having composability first. This was my early experience with Pandas. For whatever reason, Pandas has a C++ optimized non-ISAM CSV reader, but its FWF implementation does not exploit the fixed-width guarantees, and it benchmarks much better on CSV than on FWF when the hardware support is quite the opposite.
The end product of a blackboard driver construction layer is hopefully a format construction DSL to accommodate a variety of common and slightly tweaked combinations of IO and encodings. Progress here should have no impact or influence on the columnar cursor user API; datasources should be abstract even if there are potentially multiple driver implementations to enable better IO for specific cases.
jvm switches
using -server -Xmx24g -XX:MaxDirectMemorySize=1G outperforms everything I've tried to hand-tune before adding -server


