465 questions
Advice
5
votes
1
replies
144
views
Parquet vs. ORC in Iceberg
Hi, I have recently become interested in learning Iceberg. There is something I was not able to understand, so I thought I would ask here.
I really want to know why Apache Parquet is the native file format used when ...
0
votes
1
answer
58
views
I'm writing repeated string values to a string column in an ORC file using Java and while reading the ORC file back, encounter a NullPointerException
When I write the same value to every row of a string column in an ORC file, only the first row returns the written value; reading the remaining rows hits a null pointer issue. In some cases, ...
0
votes
1
answer
168
views
Apache ORC buffer size too small
I face the attached problem when reading an ORC file:
Is it possible to change this buffer size of 65536 to the needed one of 1817279?
Which configuration values do I have to adapt in order to set ...
1
vote
1
answer
152
views
How to use Python to create an ORC file compressed with ZLIB compression level 9?
I want to create an ORC file compressed with ZLIB compression level 9.
Thing is, when using pyarrow.orc, I can only choose between "Speed" and "Compression" mode,
and can't control ...
0
votes
0
answers
172
views
Apache Beam code to write output in ORC format
I am new to Apache Beam, and I have a use case where I need to write Java streaming code to read from a KafkaTopic (from which I extract some CustomObject.class) and output the entries to HDFS in ...
1
vote
1
answer
2k
views
I get a "Fatal Python error: Aborted" and no explanatory error message I can work with when I try to open a simple .orc file with pyarrow
I am using:
Win 10 Pro
Intel(R) Xeon(R) W-1250 CPU @ 3.30GHz / 16 GB RAM
Anaconda Navigator 2.5.0,
Python 3.10.13 in venv
pyarrow 11.0.0
pandas 2.1.1
Running scripts in Spyder IDE 5.4.3
I want to open/...
0
votes
1
answer
229
views
Read ORC files from AWS S3 bucket in Flink app
We are using Flink 1.13.5 and trying to read ORC files from an AWS S3 location. We deploy our application in a self-managed Flink cluster. Please find the code below for ...
0
votes
1
answer
203
views
Binary format that allows storing multiple pandas DataFrames with different columns, widths, and rows
I have about 200 pandas DataFrames, and every DataFrame has some unique columns, or maybe completely different columns. Example:
df1 = pd.DataFrame({
'Product': ['Apple', 'Banana', 'Orange', 'Mango'],...
0
votes
0
answers
911
views
Detection and Cleaning of Strike-out Texts on Handwriting
I have images where text is struck out and replaced by the next words. Sometimes just one line is struck out. Other times, multiple lines are.
The expected output should be like this. remove ...
0
votes
0
answers
79
views
In Hadoop, why does the Parquet format take up more space than the original txt files in my test?
I am testing the impact of different data formats on Hive query efficiency (Win10, only my desktop). The original data is 400 txt files of almost the same size (169 MB in total). I first converted to ...
0
votes
0
answers
102
views
Issue downloading/parsing ORC File from S3, or from Local Path
I have an application deployed that is supposed to parse/download an ORC File from an S3 bucket.
I have tried multiple things, one of them being downloading the file locally in the app and trying to ...
0
votes
0
answers
444
views
How can I optimize ORC Snappy compression in Spark?
My Snappy-compressed ORC dataset was 3.3 GB when it was originally constructed via a series of small writes to 128 KB files. It totals 400 million rows, has one timestamp column, and the rest ...
1
vote
0
answers
127
views
Pyspark error while writing large dataframe to file
I am trying to write my DataFrame df_trans (which has about 10 million records) to a file, and I want to compare performance when writing it as Parquet vs. ORC vs. CSV.
df_trans.write.mode('overwrite').parquet(...
0
votes
0
answers
209
views
Reading an ORC file from a GCS bucket
To read an ORC file from a GCS bucket, I'm using the code snippet below, where I create a Hadoop configuration and set the required file system attributes to use the GCS bucket:
val hadoopConf = new ...
2
votes
1
answer
427
views
Reading ORC does not trigger projection pushdown and predicate pushdown
I have a file, fileA, in ORC with the following format:
key
id_1
id_2
value
value_1
....
value_30
If I use the following config:
'spark.sql.orc.filterPushdown' : true
And ...