Questions tagged [spark]
The spark tag has no summary.
                20 questions
            
            
            
        0 votes, 0 answers, 80 views
        
        
            
            
        How to connect to SFTP using Apache Spark 3.5 with Scala 2.12 for parallel file transfers?
                    I am working on a project where I need to transfer thousands of files (each sized between 50-60 MB) every hour from an SFTP server to local storage or AWS S3. I am using Apache Spark 3.5 with Scala 2....
                
            
       
        
            
        2 votes, 3 answers, 280 views
        
        
            
        Method naming conventions "setX" vs "withX"
                    While learning about Fluent Interfaces, I came across this post which states that a set prefix hints that one is mutating the object, whereas a with prefix hints that a new object is returned.
I have seen this pattern first hand ...
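A minimal Python sketch of the distinction the excerpt describes, assuming two hypothetical classes (`MutableRequest` and `ImmutableRequest`, neither of which appears in the original question). It only illustrates the naming convention and is not taken from the linked post:

```python
from dataclasses import dataclass, replace


class MutableRequest:
    """setX style: the setter changes this object in place."""

    def __init__(self, url: str, timeout: int = 30) -> None:
        self.url = url
        self.timeout = timeout

    def set_timeout(self, timeout: int) -> "MutableRequest":
        self.timeout = timeout  # mutates the receiver
        return self             # returned only so calls can be chained


@dataclass(frozen=True)
class ImmutableRequest:
    """withX style: each call returns a modified copy."""

    url: str
    timeout: int = 30

    def with_timeout(self, timeout: int) -> "ImmutableRequest":
        # dataclasses.replace builds a new instance; the original is untouched
        return replace(self, timeout=timeout)


# The URL below is a placeholder chosen for illustration.
a = MutableRequest("sftp://example.invalid")
b = a.set_timeout(5)    # b is the very same object as a

c = ImmutableRequest("sftp://example.invalid")
d = c.with_timeout(5)   # d is a new object; c still has timeout=30
```

With the `set` variant, every holder of a reference to `a` sees the new timeout; with the `with` variant, `c` can be shared safely because it never changes.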
                
            
       
        
            
        1 vote, 3 answers, 977 views
        
        
            
            
            
        Python: Is returning self in method chaining a violation of Demeter's law?
                    In Python it is very common to see code that uses method chaining; the main difference from code elsewhere is that this is also combined with returning an object of the same type but modified. This ...
                
            
       
        
            
        2 votes, 1 answer, 166 views
        
        
            
            
        Data Ingest Architecture Advice
                    I have a requirement where we need to collect N different events and store them for analysis. I am having trouble coming up with a general architecture for this.
FINAL REQUIREMENTS
The end goal of the ...
                
            
       
        
            
        5 votes, 1 answer, 1k views
        
        
            
            
            
        How do you perform accumulation on large data sets and pass the results as a response to REST API?
                    I have around 125 million event records on s3. The s3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: ...
                
            
       
        
            
        2 votes, 1 answer, 256 views
        
        
            
            
        How (whether to?) include Apache Spark in my Architecture
                    Brief overview of general data flow
The general goal of my system is to allow users to upload many different types of files containing data (PDF, CSV, ZIP, etc.), then index it and perform some basic ...
                
            
       
        
            
        -2 votes, 1 answer, 690 views
        
        
            
            
        Export huge Excel file
                    I am developing a web application in Angular (frontend) and Scala (backend) for a big data team. Because they use large files for export/import, I built a module which is a copy of Microsoft Excel.
So, what ...
                
            
       
        
            
        -2 votes, 1 answer, 96 views
        
        
            
        What are the benefits of running Apache Spark on Kubernetes?
                    When running Apache Spark, one submits jobs to a Cluster Manager. The cluster manager is delegated the task of accepting or declining requests for resources. The cluster manager could either be ...
                
            
       
        
            
        2 votes, 0 answers, 71 views
        
        
        How to manage scheduled ETL jobs that are time sensitive?
                    We have some ETL jobs that are scheduled to run every day, and some that are scheduled to run every week via Control-M. These types of jobs tag data with the date the job was run and perform filter ...
                
            
       
        
            
        -3 votes, 1 answer, 110 views
        
        
            
            
        Can accessing the same API from different languages be more performant?
                    I've just started my first proper internship in industry (not learning to code but learning to write software that does stuff). My employer makes use of Apache Spark, as they do a lot of Big Data ...
                
            
       
        
            
        2 votes, 0 answers, 37 views
        
        
        How to design a report processing model using Spark in the most efficient way
                    I have a reporting system which gets time-series data from numerous meters (referred to here as raw_data).
I need to generate several reports based on different combinations of the incoming ...
                
            
       
        
            
        0 votes, 1 answer, 141 views
        
        
            
        Where do you put tests that are not unit tests in a Maven project?
                    I'm building a Spark-based, text analysis package using both Java and Scala. I have a series of transform functions, which take in one dataframe and spit out another, and that can be chained together ...
                
            
       
        
            
        -1 votes, 1 answer, 173 views
        
        
            
            
        How to incrementally update value of features in a machine learning pipeline?
                    I am working on a machine learning pipeline where we have to compute certain measures on streaming data. Every day, new raw data enters our pipeline. To update our features, we have to run an ETL that ...
                
            
       
        
            
        1 vote, 1 answer, 3k views
        
        
            
            
            
        Processing only once the same message produced by two producers
                    If I have two different producers that could produce the same message for a Kafka broker, how can I ensure that only one of the two message occurrences gets processed?
Is the only way to have an ...
                
            
       
        
            
        0 votes, 1 answer, 77 views
        
        
            
            
        Could Apache Spark be an option?
                    Today we are using SQL Server with multiple indexed views. Whenever we update the source tables for the views, the delay is too long.
I have no experience with Spark, so the question is:
Can we input ...
                
            
       
         
         
         
        