Integrate( f(x) = 2.0 / sqrt(1 - x*x) , x = 0 to x = 1 )
Maple:

   > integrate(2.0 / sqrt(1 - x*x), x=0..1);

                         3.141592654
Reason:

   The antiderivative of 2 / sqrt(1 - x^2) is 2 × arcsin(x), and:

      arcsin(1) = Pi/2
      arcsin(0) = 0

   So the definite integral is 2 × (Pi/2 - 0) = Pi.
How to approximate:

(Figure omitted: it shows the area under f(x) on [0, 1] divided into N rectangles of width w, each rectangle's height taken at its midpoint.)
We can see that:

   Integral ~= w * f(0.5w) + w * f(1.5w) + w * f(2.5w) + ...... + w * f((N-0.5)w)
In pseudo code:
   w = 1.0/N;              // width of each rectangle
   Integral = 0;

   for (i = 0; i < N; i++)
   {
      x = (i + 0.5)*w;                  // midpoint of rectangle i
      Integral = Integral + w * f(x);   // w * f(x) is area of rectangle
   }
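For example (our own worked check, not from the original notes): with N = 4 we get w = 0.25 and midpoints x = 0.125, 0.375, 0.625, 0.875, so the approximation is roughly 0.25 × (2.016 + 2.157 + 2.562 + 4.131) ≈ 2.72. This is still far from Pi because f(x) blows up near x = 1; a much larger N is needed for a good answer.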
The entire program in C/C++:
   #include <iostream>
   #include <math.h>
   using namespace std;

   double f(double a)
   {
      return( 2.0 / sqrt(1.0 - a*a) );
   }

   int main(int argc, char *argv[])
   {
      int    i, N;
      double sum;
      double x, w;

      N = ...;          // Will determine the accuracy of the approximation
      w = 1.0/N;
      sum = 0.0;

      for (i = 0; i < N; i = i + 1)
      {
         x = (i + 0.5) * w;      // midpoint of rectangle i
         sum = sum + w*f(x);     // add area of rectangle i
      }

      cout << sum << endl;
   }
Compile and run the program; a typical invocation is shown below.
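Assuming the program is saved as pi.cpp (our choice of name; the original does not give one), a typical compile-and-run on a UNIX/Linux system might be:

   g++ -o pi pi.cpp -lm
   ./pi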
Often, a small amount of (shared) information is updated in every iteration of the loop.
The program can be sped up when the non-conflicting operations are performed concurrently (in parallel), while the conflicting operations on the shared information (variable) are performed sequentially.
   double f(double a)
   {
      return( 2.0 / sqrt(1.0 - a*a) );
   }

   int main(int argc, char *argv[])
   {
      int    i, N;
      double sum;
      double x, w;

      N = ...;          // Will determine the accuracy of the approximation
      w = 1.0/N;
      sum = 0.0;

      for (i = 0; i < N; i = i + 1)
      {
         x = w*(i + 0.5);        // We can make x non-shared...
         sum = sum + w*f(x);     // sum is SHARED !!!
      }

      cout << sum << endl;
   }
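To see why the update of sum conflicts, consider one possible interleaving of two threads that execute sum = sum + w*f(x) at the same time (a schematic with illustrative values; the actual machine instructions vary):

   Thread 1                        Thread 2
   --------                        --------
   load  sum       (reads 10)
                                   load  sum       (reads 10)
   add   w*f(x1)   (10 + a)
                                   add   w*f(x2)   (10 + b)
   store sum       (sum = 10+a)
                                   store sum       (sum = 10+b)

Thread 1's contribution is lost: sum ends up as 10+b instead of 10+a+b. This is why conflicting updates to sum must be performed sequentially.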
Suppose we use 2 threads. One way to divide the work is to let thread 1 compute the "first half" of the partial sum:

   w*( f(0.5w) + f(1.5w) + f(2.5w) + ... + f(0.5 - 0.5w) )

and thread 2 compute the "second half" of the partial sum:

   w*( f(0.5 + 0.5w) + f(0.5 + 1.5w) + f(0.5 + 2.5w) + ... + f(1 - 0.5w) )
Pictorially:
      values added by        values added by
          thread 1               thread 2
      <--------------->      <--------------->
      0                 0.5                  1
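For concreteness, here is a minimal, self-contained sketch of the "first half / second half" partition with 2 threads. The code and all names in it (NUM_INTERVALS, partial[], half()) are our own illustration, not part of the original notes; each thread writes its result into its own slot of partial[], so no shared variable is updated and no synchronization is needed yet:

   #include <iostream>
   #include <math.h>
   #include <pthread.h>
   using namespace std;

   #define NUM_INTERVALS 10000000         // # intervals (arbitrary choice)

   double w = 1.0/NUM_INTERVALS;          // width of one interval
   double partial[2];                     // one slot per thread

   double f(double a)
   {
      return( 2.0 / sqrt(1.0 - a*a) );
   }

   void *half(void *arg)
   {
      int    t = * (int *) arg;           // t = 0: first half, t = 1: second half
      double s = 0.0;                     // local (non-shared) accumulator
      int    i;

      for (i = t*(NUM_INTERVALS/2); i < (t+1)*(NUM_INTERVALS/2); i = i + 1)
         s = s + w * f( w*(i + 0.5) );    // area of rectangle i

      partial[t] = s;                     // each thread writes only its own slot
      return(NULL);
   }

   int main(int argc, char *argv[])
   {
      pthread_t tid[2];
      int       id[2] = {0, 1};
      int       t;

      for (t = 0; t < 2; t = t + 1)
         pthread_create(&tid[t], NULL, half, &id[t]);

      for (t = 0; t < 2; t = t + 1)
         pthread_join(tid[t], NULL);

      cout << partial[0] + partial[1] << endl;   // prints ~ 3.14159
   }

(The versions below share a single sum variable instead, which is where mutex synchronization comes in.)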
Another way is to let thread 1 compute the "even stepped" partial sum:

   w*( f((0 + 0.5)w) + f((2 + 0.5)w) + f((4 + 0.5)w) + ... )

and thread 2 compute the "odd stepped" partial sum:

   w*( f((1 + 0.5)w) + f((3 + 0.5)w) + f((5 + 0.5)w) + ... )

Pictorially:
   values added by thread 1
     V       V       V       V       V       V       V
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         ^       ^       ^       ^       ^       ^       ^
   values added by thread 2
It turns out that the "even stepped" / "odd stepped" partitioning is easier to program in many instances !!!
NOTE: We don't access any array, so the paging problem is not applicable !!!
The division of labor for num_threads threads is as follows:
   #include <iostream>
   #include <math.h>
   #include <pthread.h>
   #include <stdlib.h>
   using namespace std;

   double f(double a) { return( 2.0 / sqrt(1.0 - a*a) ); }
   void *PIworker(void *arg);       // Worker thread, defined below

   /*** Shared variables, but not updated.... ***/
   int    N;               // # intervals
   double w;               // Width of one interval
   int    num_threads;     // # threads

   /*** Shared variables, updated !!! ***/
   double          sum;
   pthread_mutex_t sum_mutex;    // Mutex to control access to sum

   int main(int argc, char *argv[])
   {
      int       Start[100];   // Start index values for each thread
      pthread_t tid[100];     // Used for pthread_join()
      int       i;

      N = ...;                // Read N in from keyboard...
      w = 1.0/N;              // "Broadcast" w
      num_threads = ...;      // Also the skip distance for each thread

      sum = 0.0;                              // Initialize shared variable
      pthread_mutex_init(&sum_mutex, NULL);   // Initialize the mutex

      /**** Make worker threads... ****/
      for (i = 0; i < num_threads; i = i + 1)
      {
         Start[i] = i;        // Start index for thread i

         if ( pthread_create(&tid[i], NULL, PIworker, &Start[i]) )
         {
            cout << "Cannot create thread" << endl;
            exit(1);
         }
      }

      /**** Wait for worker threads to finish... ****/
      for (i = 0; i < num_threads; i = i + 1)
         pthread_join(tid[i], NULL);

      cout << sum << endl;
   }
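(Design note: each thread receives a pointer to its own element of Start[]. Passing &i to every thread instead would be a bug: main() keeps changing i while the newly created threads are still reading it, so a thread could see the wrong starting index.)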
Worker thread:

   void *PIworker(void *arg)
   {
      int    i, myStart;
      double x;

      /*** Get the parameter (which is my starting index) ***/
      myStart = * (int *) arg;

      /*** Compute sum, skipping every "num_threads" items ***/
      for (i = myStart; i < N; i = i + num_threads)
      {
         x = w * ((double) i + 0.5);     // next x

         pthread_mutex_lock(&sum_mutex);
         sum = sum + w*f(x);             // Add to sum
         pthread_mutex_unlock(&sum_mutex);
      }

      return(NULL);   /* Thread exits (dies) */
   }
If you time this program, you will typically find it is barely faster than the sequential version, and it may even be slower. Are you surprised ??? The culprit is the mutex inside the loop: every single addition to sum must acquire and release the lock, so the threads spend most of their time synchronizing instead of computing.
Worker thread:

   void *PIworker(void *arg)
   {
      int    i, myStart;
      double x;
      double tmp = 0.0;   // local non-shared variable (must start at 0 !)

      /*** Get the parameter (which is my starting index) ***/
      myStart = * (int *) arg;

      /*** Compute sum, skipping every "num_threads" items ***/
      for (i = myStart; i < N; i = i + num_threads)
      {
         x = w * ((double) i + 0.5);     // next x
         tmp = tmp + w*f(x);             // No synchr. needed
      }

      pthread_mutex_lock(&sum_mutex);
      sum = sum + tmp;                   // Synch only ONCE !!!
      pthread_mutex_unlock(&sum_mutex);

      return(NULL);   /* Thread exits (dies) */
   }
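With N intervals and num_threads threads, the first worker locks the mutex N times in total (once per rectangle), while this worker locks it only num_threads times (once per thread). The loop itself now runs with no synchronization at all, so the threads can truly proceed in parallel.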
What a difference it can make where you put the synchronization points in a parallel program....