Advice
5 votes
1 reply
144 views

Hi, I have recently become interested in learning Iceberg. There is something I was not able to understand, so I thought I would ask here. I really want to know why Apache Parquet is the native file format used when ...
katz daniel's user avatar
0 votes
1 answer
58 views

When I write the same value to every row of a string column in an ORC file, only the first row returns the written value; reading the remaining rows fails with a null pointer issue. In some cases, ...
user1885418's user avatar
0 votes
1 answer
168 views

I face the attached problem when reading an ORC file. Is it possible to change this buffer size from 65536 to the required 1817279? Which configuration values do I have to adapt in order to set ...
Ruben Hartenstein's user avatar
1 vote
1 answer
152 views

I want to create an ORC file compressed with ZLIB compression level 9. Thing is, when using pyarrow.orc, I can only choose between "Speed" and "Compression" mode, and can't control ...
Y.S's user avatar
  • 1,862
0 votes
0 answers
172 views

I am new to Apache Beam, and I have a use case where I need to write Java streaming code to read from a Kafka topic (from which I extract some CustomObject.class) and output the entries to HDFS in ...
vamsi's user avatar
  • 325
1 vote
1 answer
2k views

I am using: Win 10 Pro; Intel(R) Xeon(R) W-1250 CPU @ 3.30GHz / 16 GB RAM; Anaconda Navigator 2.5.0; Python 3.10.13 in venv; pyarrow 11.0.0; pandas 2.1.1; running scripts in Spyder IDE 5.4.3. I want to open/...
Esat Becco's user avatar
0 votes
1 answer
229 views

We are using Flink version 1.13.5 and trying to read ORC files from an AWS S3 location. We are deploying our application in a self-managed Flink cluster. Please find the code below for ...
nirmal's user avatar
  • 107
0 votes
1 answer
203 views

I have about 200 pandas DataFrames, and each one has some unique columns, or maybe completely different columns. Example: df1 = pd.DataFrame({ 'Product': ['Apple', 'Banana', 'Orange', 'Mango'],...
Abdulrahman Sheikho's user avatar
0 votes
0 answers
911 views

I have images where text is struck out and replaced by the words that follow. Sometimes just one line is struck out; other times, multiple lines are. The expected output should look like this: remove ...
Do Chi Bao's user avatar
0 votes
0 answers
79 views

I am testing the impact of different data formats on Hive query efficiency (Win 10, on my desktop only). The original data is 400 txt files of almost the same size (169 MB in total). I first converted to ...
fei yang's user avatar
0 votes
0 answers
102 views

I have a deployed application that is supposed to parse/download an ORC file from an S3 bucket. I have tried multiple things, one of them being downloading the file locally in the app and trying to ...
FluffyGus's user avatar
0 votes
0 answers
444 views

My Snappy-compressed ORC dataset was 3.3 GB when it was originally constructed via a series of small writes to 128 Kb files. It totals 400 million rows, has one timestamp column, and the rest ...
user19695124's user avatar
1 vote
0 answers
127 views

I am trying to write my dataframe df_trans (which has about 10 million records) to a file and want to compare the performance of writing it to Parquet vs ORC vs CSV. df_trans.write.mode('overwrite').parquet(...
OhMoh24's user avatar
  • 71
0 votes
0 answers
209 views

To read an ORC file from a GCS bucket, I'm using the code snippet below, where I create a Hadoop configuration and set the file system attributes required to use the GCS bucket: val hadoopConf = new ...
Nitish N Banakar's user avatar
2 votes
1 answer
427 views

I have a file, fileA, in ORC with the following format: key id_1 id_2 value value_1 .... value_30. If I use the following config: 'spark.sql.orc.filterPushdown' : true And ...
olaf's user avatar
  • 347
