Questions tagged [spark]
The spark tag has no summary.
20 questions
0
votes
0
answers
80
views
How to connect to SFTP using Apache Spark 3.5 with Scala 2.12 for parallel file transfers?
I am working on a project where I need to transfer thousands of files (each sized between 50-60 MB) every hour from an SFTP server to local storage or AWS S3. I am using Apache Spark 3.5 with Scala 2....
2
votes
3
answers
280
views
Method naming conventions "setX" vs "withX"
While learning about fluent interfaces, I came across this post, which states that a set prefix hints that the object is being mutated, whereas a with prefix indicates that a new object is returned.
I have seen this pattern first hand ...
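To make the distinction in this excerpt concrete, here is a minimal sketch of the two naming styles. The class names, the `timeout` field, and both methods are hypothetical, made up for illustration; they do not come from the linked post or from any library.

```python
from dataclasses import dataclass, replace


class MutableRequest:
    """setX style: the setter changes this object in place."""

    def __init__(self, timeout: int = 30) -> None:
        self.timeout = timeout

    def set_timeout(self, timeout: int) -> "MutableRequest":
        self.timeout = timeout  # mutates the receiver
        return self             # returned only so calls can be chained


@dataclass(frozen=True)
class ImmutableRequest:
    """withX style: the method leaves this object untouched."""

    timeout: int = 30

    def with_timeout(self, timeout: int) -> "ImmutableRequest":
        # Build a copy with one field changed; the original stays as it was.
        return replace(self, timeout=timeout)


mutable = MutableRequest()
same = mutable.set_timeout(60)        # same object, now changed

original = ImmutableRequest()
changed = original.with_timeout(60)   # new object, original unchanged
```

Under this convention a caller can tell from the method name alone whether earlier references to the object will observe the change.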
1
vote
3
answers
977
views
Python: Is returning self in method chaining a violation of Demeter's law?
In Python it is very common to see code that uses method chaining; the main difference from code elsewhere is that it is combined with returning a modified object of the same type. This ...
2
votes
1
answer
166
views
Data Ingest Architecture Advice
I have a requirement where we need to collect N different events and store them for analysis. I am having trouble coming up with a general architecture for this.
FINAL REQUIREMENTS
The end goal of the ...
5
votes
1
answer
1k
views
How do you perform accumulation on large data sets and pass the results as a response to REST API?
I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: ...
2
votes
1
answer
256
views
How (whether to?) include Apache Spark in my Architecture
Brief overview of general data flow
The general goal of my system is to allow users to upload many different types of files containing data (PDF, CSV, ZIP, etc.), then index it and perform some basic ...
-2
votes
1
answer
690
views
Export a huge Excel file
I am developing a web application in Angular (frontend) and Scala (backend) for a big data team. Because they use large files for export/import, I built a module which is a copy of Microsoft Excel.
So, what ...
-2
votes
1
answer
96
views
What are the benefits of running Apache Spark on Kubernetes?
When running Apache Spark, one submits jobs to a cluster manager. The cluster manager is delegated the task of accepting or declining requests for resources. The cluster manager could either be ...
2
votes
0
answers
71
views
How to manage scheduled ETL jobs that are time sensitive?
We have some ETL jobs that are scheduled to run every day, and some that are scheduled to run every week via Control-M. These types of jobs tag data with the date the job was run and perform filter ...
-3
votes
1
answer
110
views
Can accessing the same API from different languages be more performant?
I've just started my first proper internship in industry (not learning to code but learning to write software that does stuff). My employer makes use of Apache Spark, as they do a lot of Big Data ...
2
votes
0
answers
37
views
How to design a report processing model using Spark in the most efficient way
I have a reporting system that receives time-series data from numerous meters (referred to here as raw_data).
I need to generate several reports based on different combinations of the incoming ...
0
votes
1
answer
141
views
Where do you put tests that are not unit tests in a Maven project?
I'm building a Spark-based text analysis package using both Java and Scala. I have a series of transform functions, which take in one dataframe and spit out another, and that can be chained together ...
-1
votes
1
answer
173
views
How to incrementally update value of features in a machine learning pipeline?
I am working on a machine learning pipeline where we have to compute certain measures on streaming data. Every day, new raw data enters our pipeline. To update our features, we have to run an ETL that ...
1
vote
1
answer
3k
views
Processing only once the same message produced by two producers
If I have two different producers that could produce the same message for a Kafka broker, how can I ensure that only one of the two message occurrences gets processed?
Is the only way to have an ...
0
votes
1
answer
77
views
Could Apache Spark be an option?
Today we are using SQL Server with multiple indexed views. Whenever we update the source tables for the views, the delay is too long.
I have no experience with Spark, so the question is:
Can we input ...