So, I am practicing multi threading in java and trying to add the elements of a randomly generated 2D integer array both sequentially and using 4 threads. I measured the performance of my code and for some reason sequential part is a lot faster than multi-threaded. Here is the code for sequential addition:
public class ArraySum2DNonMT {
private int[][] arrayToSum;
private int totalSum;
public ArraySum2DNonMT(int[][] arr){
    this.arrayToSum = arr;
    this.setTotalSum(0);
}
public void runSequential(){
    for(int i = 0; i < arrayToSum[0].length; i++){
        for(int j = 0; j < arrayToSum.length; j++){
            setTotalSum(getTotalSum() + arrayToSum[j][i]);
        }
    }
}
public int getTotalSum() {
    return totalSum;
}
public void setTotalSum(int totalSum) {
    this.totalSum = totalSum;
}
}
Here is the code for Multi-threaded version:
package multiThreaded;
/**
* 
* @author Sahil Gupta
* 
* This class takes in a 2D integer array and adds it's contents. This
* addition will be concurrent between several threads which will divide
* the work of the array based on the threadID assigned to thread by the
* programmer. Assume that the passed in 2D array to the constructor is a 
* matrix with each array in the main array having same length.
*/
public class ArraySum2D implements Runnable{
private int[][] arrayToSum;
private int threadID;
private int totalSum;
public ArraySum2D(int[][] arr, int threadID){
    this.arrayToSum = arr;
    this.threadID = threadID;
    this.setTotalSum(0);
}
@Override
public void run() {
    int arrayCol = arrayToSum[0].length;
    int arrayRow = arrayToSum.length;
    int colStart = (int)((threadID%2) * (arrayCol/2));
    int rowStart = (int)((int)(threadID/2) * (arrayRow/2));
    int colEnd = colStart + (int)(arrayCol/2);
    int rowEnd = rowStart + (int)(arrayRow/2);
    for(int i = colStart; i < colEnd; i++){
        for(int j = rowStart; j < rowEnd; j++){
            setTotalSum(getTotalSum() + arrayToSum[j][i]);
        }
    }
}
public int getTotalSum() {
    return totalSum;
}
public void setTotalSum(int totalSum) {
    this.totalSum = totalSum;
}
}
Here is the main:
package controller;
import java.util.Random;
import multiThreaded.ArraySum2D;
import sequentialNonMT.ArraySum2DNonMT;
public class ControllerMain {
private final static int cols = 20;
private final static int rows = 10;
private static volatile int[][] arrayToAdd = new int[rows][cols];
private static Random rand = new Random();
private static ArraySum2D a0, a1, a2, a3;
public static void main(String[] args) throws InterruptedException{
    for(int j = 0; j < rows; j++){
        for(int i = 0; i < cols; i++){
            arrayToAdd[j][i] = rand.nextInt(100);
        }
    }
    ArraySum2DNonMT a = new ArraySum2DNonMT(arrayToAdd);
    long startTimeSequential = System.nanoTime();
    a.runSequential();
    long estimatedTimeSequential = System.nanoTime() - startTimeSequential;
    System.out.println("The total sum calculated by sequential program is: " + a.getTotalSum());
    System.out.println("The total time taken by sequential program is: " + estimatedTimeSequential);
    a0 = new ArraySum2D(arrayToAdd, 0);
    a1 = new ArraySum2D(arrayToAdd, 1);
    a2 = new ArraySum2D(arrayToAdd, 2);
    a3 = new ArraySum2D(arrayToAdd, 3);
    Thread t0 = new Thread(a0);
    Thread t1 = new Thread(a1);
    Thread t2 = new Thread(a2);
    Thread t3 = new Thread(a3);
    long startTimeMultiThreaded = System.nanoTime();
    t0.start();
    t1.start();
    t2.start();
    t3.start();
    t0.join();
    t1.join();
    t2.join();
    t3.join();
    int Sum = addThreadSum();
    long estimatedTimeMultiThreaded = System.nanoTime() - startTimeMultiThreaded;
    System.out.println("The total sum calculated by multi threaded program is: " + Sum);
    System.out.println("The total time taken by multi threaded program is: " + estimatedTimeMultiThreaded);
}
private static int addThreadSum(){
    return a0.getTotalSum() + a1.getTotalSum() + a2.getTotalSum() + a3.getTotalSum();
}
}
The output I am currently getting shows significant difference in runtime (measured here in nanoseconds). This is what I get:
The total sum calculated by sequential program is: 10109 
The total time taken by sequential program is: 46000
The total sum calculated by multi threaded program is: 10109
The total time taken by multi threaded program is: 641000
The sequential code is around 13 times faster. Can you please help me point out what I might not be doing correctly? I have a dual core i7 haswell, macbook air. I am not sure why it is taking longer, but here are a few ideas I have in mind which might be causing this: False sharing, Too much parallelism/threads (4 for dual core), Cache coherence protocol might not be favoring me, some other basic multi threaded stuff I am missing/don't know about.
Please help me identify the specific cause and ways I can make multi threaded run faster than sequential. Thank you so much for helping me out!
Edit: More info about the processor and it's caches: Processor Name: Intel Core i7 Processor Speed: 1.7 GHz Number of Processors: 1 Total Number of Cores: 2 L2 Cache (per Core): 256 KB L3 Cache: 4 MB
I think this can have up to 4 threads according to Intel's data sheet.
P.S. This is my first post to ask question but I have been using this site to clear doubts for a while now. Forgive me for any mistakes I make.