This sounds like a really interesting project! Here are a few thoughts on your code.
# Naming
I have to admit that your naming is a mix of good and bad. The constructor for Parallel_process uses inputImgage, which, misspelling aside, tells me what it is. But then you have a vector named AA. When I see AA in image processing code, I usually assume it means either "axis-aligned" (as in AABB = "Axis-aligned bounding box"), or "antialiasing". It doesn't appear to mean either in this case. What does it mean? And you assign AA to a member variable named A. What is A?
The names diffVal and diff are almost as bad. At least I know each is the difference between two things. (Or is it a differential?) Whenever you find yourself using the word "value" (or some abbreviation of it) in a name, you probably need to rethink the name. What is it the difference of? You're dividing by it, so it seems like maybe it's actually a range, like the difference between the minimum and maximum values of… something. It would be nice to let a reader of your code know what that something is.
Then you have a value inside operator() named AAA which appears to be a copy of A, despite the fact that you don't modify it at all. Why are you not just using A wherever you have AAA? It would speed things up as you wouldn't need to copy A.
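To illustrate the point about AAA, here is a minimal sketch of reading a member directly instead of copying it on every call. I can't tell from your code what A actually holds, so the class name ScaleByA, the element type, and the body of operator() are all hypothetical stand-ins; only the copy-versus-reference idea carries over.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reviewed class. Only the member A and
// operator() are modeled; the division is a placeholder computation.
class ScaleByA {
public:
    explicit ScaleByA(std::vector<float> a) : A(std::move(a)) {}

    // operator() is const and reads A in place, so no vector is copied
    // each time the functor is invoked.
    float operator()(std::size_t channel, float pixel) const {
        // If a shorter local name is wanted, bind a const reference.
        // This is an alias for A, not a copy of it.
        const std::vector<float>& a = A;
        return pixel / a[channel];
    }

private:
    std::vector<float> A;
};
```

Writing `std::vector<float> AAA = A;` inside operator() would instead allocate and copy the whole vector on every call, which adds up quickly if operator() runs once per row or per pixel range.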
It's not immediately clear what dcp stands for.
I assume calculateSD() is calculating the standard deviation? I would spell SD out as standardDeviation. You can shorten it if you like, but SD is overloaded, too. (Standard definition, super density, standard deviation, Jacobi's function.)
# Functions
Your main() function could be made much simpler and easier to read if you broke it into functions with descriptive names. This:

    // build look up table
    unsigned char lut[256];
    auto fGamma=0.4;
    #pragma omp for
    for (size_t i=0; i<256; i++)
        lut[i] = cv::saturate_cast<uchar>(pow((float)(i / 255.0), fGamma) * 255.0f);
    //std::cout<<cv::getBuildInformation()<<std::endl;
    high_resolution_clock::time_point t1(high_resolution_clock::now());
    GammaCorrection(src_temp, lut, src_temp);

could all be put into a function named convertToLinearRGB(). That said, are you sure you want to use a gamma of 0.4? If you're dealing with the most common image (sRGB) or video (Rec. 709) formats, 1.0 / 2.2 = 0.4545… would be a better choice. The linear segment near 0 makes 2.4 or 2.5 suboptimal choices for a single power-law conversion, despite the fact that they are used in the actual calculation.
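As a rough sketch of what pulling the lookup-table setup into its own function might look like, here is a version in plain standard C++ so it can be compiled without OpenCV. The name buildGammaLut and its signature are my own invention, and the rounding and clamping only approximate what cv::saturate_cast<uchar> does, so treat it as an illustration rather than a drop-in replacement.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical helper: builds a 256-entry table mapping
// v -> 255 * (v / 255)^gamma for 8-bit input values.
std::array<std::uint8_t, 256> buildGammaLut(double gamma) {
    std::array<std::uint8_t, 256> lut{};
    // A plain serial loop: 256 iterations are too few to benefit from threads.
    for (int i = 0; i < 256; ++i) {
        const double normalized = i / 255.0;
        const double corrected = std::pow(normalized, gamma) * 255.0;
        // Clamp to [0, 255] and round to the nearest integer. This is an
        // assumption-level approximation of cv::saturate_cast<uchar>.
        const double clamped = std::min(255.0, std::max(0.0, corrected));
        lut[i] = static_cast<std::uint8_t>(std::lround(clamped));
    }
    return lut;
}
```

With something like this, the call site in main() would shrink to roughly `const auto lut = buildGammaLut(1.0 / 2.2);` followed by the GammaCorrection() call, and the gamma value becomes an explicit, named argument instead of a local buried in the middle of main().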
Next, you should put the histogram calculation into a function named histogram().
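For the histogram, a minimal sketch of such a function could look like the following. I'm assuming 8-bit single-channel data passed as a flat std::vector, which is not how your code stores it (you have a cv::Mat), so the parameter type is a placeholder to keep the example free of OpenCV.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical signature: counts how often each 8-bit intensity occurs
// in a flat buffer of pixels.
std::array<std::size_t, 256> histogram(const std::vector<std::uint8_t>& pixels) {
    std::array<std::size_t, 256> counts{};  // value-initialized to all zeros
    for (std::uint8_t p : pixels) {
        ++counts[p];  // p is always in [0, 255], so this index is in range
    }
    return counts;
}
```

The benefit is mostly readability: main() then says `auto counts = histogram(pixels);` and a reader doesn't have to reverse-engineer a loop to learn what it computes.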
For A_estim_lambda, why do you define the lambda and then immediately define a variable that is simply assigned to the lambda? Why not just do it in one line?
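Here is a small sketch of what I mean by the one-line form. The name estimateA and the body (taking the maximum of a vector) are hypothetical; I'm not claiming that is what A_estim_lambda computes, only showing the shape of the declaration.

```cpp
#include <algorithm>
#include <vector>

// Redundant two-step form (shown as a comment for comparison):
//   auto estimate_lambda = [](const std::vector<float>& v) { /* ... */ };
//   auto estimate = estimate_lambda;   // second name adds nothing

// One-step form: the lambda is bound directly to the name that gets used.
// The max-element body is a placeholder, not the reviewed code's logic.
const auto estimateA = [](const std::vector<float>& values) {
    return values.empty() ? 0.0f
                          : *std::max_element(values.begin(), values.end());
};
```

One name, one definition, and nothing for a reader to cross-reference.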
# Reduce Complexity
I see at least 2 different parallel computation systems in use - OpenMP and OpenCV's parallel structures. If you get some material advantage out of them, then maybe it's worth it, but using so many different systems makes it more complex for maintenance and understanding.
Also, the first loop in main() is unlikely to be helped by being parallelized. For 256 values, the overhead of creating multiple threads is very likely to be more than the time it takes to just do the calculations.
# Performance
Unfortunately, I can't run your code and profile it because I don't have OpenCV installed on my machine. But you should profile your code to see which specific lines in which specific functions are costing the most time. Without doing that, it doesn't make sense to start optimizing it because you don't know which parts are the slowest. You appear to be compiling with GCC, so you can probably use gprof for profiling. Depending on what OS you're on, there may be additional tools for profiling. I recommend you check them out.
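If you want a coarse first look before reaching for a real profiler, you could time individual stages yourself. The helper below is a sketch of that idea; elapsedMilliseconds is a name I made up, and it uses std::chrono::steady_clock rather than the high_resolution_clock in your code because steady_clock is guaranteed to be monotonic.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs fn once and returns the wall-clock time it
// took, in milliseconds. This gives per-stage totals only, not the
// per-line breakdown a profiler provides.
template <typename Fn>
double elapsedMilliseconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping each stage, for example `elapsedMilliseconds([&] { GammaCorrection(src_temp, lut, src_temp); })`, tells you which stage dominates. For gprof itself, the usual workflow is to compile and link with `-pg`, run the program once to produce gmon.out, and then run gprof on the executable and that file.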