It's O(N): each data point must be read at least once.
However, assuming your question is practical rather than theoretical: if you have N/2 cores, each capable of adding two numbers in a single cycle, you can compute the sum in log2(N) cycles. Pseudocode for a fast parallel approach:
    while N > 1:             // assumes N is a power of two
        N = N / 2
        for i in 0..N:       // in parallel
            X[i] += X[i + N]
    // result in X[0]
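For concreteness, here is a runnable Python version of the same tree reduction (sequential as written, but each pass of the inner loop consists of independent updates that could execute in parallel; tree_sum is an illustrative name and, like the pseudocode, it assumes the input length is a power of two):

    def tree_sum(x):
        """Pairwise tree reduction; len(x) must be a power of two."""
        x = list(x)  # work on a copy
        n = len(x)
        while n > 1:
            n //= 2
            for i in range(n):    # independent updates: parallelizable
                x[i] += x[i + n]
        return x[0]

    print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # prints 36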
as opposed to a naive approach:
    accum = 0
    for i in 0..N:           // in serial
        accum += X[i]
    // result in accum
The bottleneck preventing parallelization in the naive case is the serial 'reduction' into accum: iteration i+1 cannot begin until iteration i has written its result. Any associative reduction operation (addition, multiplication, min/max, etc.) can be parallelized as above, because associativity lets you regroup the work into a balanced tree.
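In practice you can get the same effect with coarser chunks than one pair per core. Here is a minimal sketch using Python's standard multiprocessing module (the name parallel_sum, the worker count, and the chunking scheme are illustrative choices, not part of the original answer):

    from multiprocessing import Pool

    def parallel_sum(x, workers=4):
        """Illustrative sketch: sum each chunk in a separate process,
        then combine the partial sums. Works for any associative op."""
        size = max(1, len(x) // workers)
        chunks = [x[i:i + size] for i in range(0, len(x), size)]
        with Pool(workers) as pool:
            partials = pool.map(sum, chunks)  # parallel partial reductions
        return sum(partials)                  # small final serial reduction

    if __name__ == "__main__":
        print(parallel_sum(list(range(1000))))  # prints 499500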
Another practical consideration is that CPU and GPU cores can perform more than one addition per "cycle" using SIMD instructions (e.g. SSE), which shrinks the constant factor further.
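For example, NumPy's sum runs as vectorized native code and typically benefits from SIMD, in contrast to a Python-level loop (assuming NumPy is installed; the array size here is arbitrary):

    import numpy as np

    x = np.arange(1_000_000, dtype=np.float64)
    print(x.sum())  # 499999500000.0, computed by vectorized native code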
Big-O doesn't highlight bottlenecks like this serial reduction, and it does not necessarily reflect running time measured on real hardware.