Auto-Parallelization Overview

The auto-parallelization feature of the Intel® compiler automatically translates serial portions of the input program into equivalent multithreaded code. Automatic parallelization identifies the loops that are good worksharing candidates, performs the dataflow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, all of which you would otherwise do by hand when programming with OpenMP* directives. The OpenMP and auto-parallelization features provide the performance gains of shared memory on multiprocessor and dual-core systems.

The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops which can safely and efficiently be executed in parallel.

This behavior enables the potential exploitation of the parallel architecture found in symmetric multiprocessor (SMP) systems.
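As a minimal sketch of what "safely" means here, the hypothetical program below contrasts a loop whose iterations are independent with a loop that carries a dependence from one iteration to the next. The program name, array names, and sizes are illustrative assumptions, and whether a given compiler version actually parallelizes the first loop also depends on its profitability analysis (a trip count this small is unlikely to qualify).

```fortran
program dependence_sketch
  implicit none
  integer, parameter :: n = 8
  integer :: a(n), b(n), c(n), i

  b = 1
  c = 2

  ! Independent iterations: each a(i) depends only on b(i) and c(i),
  ! so the iterations could run in any order, or concurrently.
  do i = 1, n
    a(i) = b(i) + c(i)
  enddo
  print *, a

  ! Loop-carried dependence: a(i) reads a(i-1), which the previous
  ! iteration wrote, so the iterations must run in order.
  a(1) = 1
  do i = 2, n
    a(i) = a(i-1) + b(i)
  enddo
  print *, a
end program dependence_sketch
```

Dataflow analysis is what distinguishes these two cases: only the first loop can be split across threads without changing the result.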

The guided auto-parallelization feature of the Intel® compiler helps you locate portions in your serial code that can be parallelized further. You can invoke guidance for parallelization, vectorization, or data transformation using specified compiler options of the -guide (Linux* OS) or /Qguide (Windows* OS) series.

Automatic parallelization frees developers from having to:

find loops that are good worksharing candidates
perform the dataflow analysis to verify correct parallel execution
partition the data for threaded code generation as is needed in programming with OpenMP* directives.

The parallel run-time support provides run-time features as found in OpenMP*, such as handling the details of loop iteration modification, thread scheduling, and synchronization. You can use the -par-runtime-control (Linux* OS) or the /Qpar-runtime-control (Windows* OS) compiler option to generate code that performs run-time checks for loops that have symbolic loop bounds. The loop is executed in parallel if the granularity of a loop is greater than the parallelization threshold. The parallelization threshold can be set using the -par-threshold (Linux OS) or the /Qpar-threshold (Windows OS) compiler option, which sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel.
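The sketch below shows the kind of loop the run-time check applies to: the upper bound n is a dummy argument, so the trip count is not known at compile time. The subroutine name scale_add and the driver program are hypothetical and exist only for illustration; the sketch does not show the check itself, which the compiler generates.

```fortran
! The loop bound n is symbolic: its value is known only at run time.
subroutine scale_add(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  real, intent(in) :: b(n), c(n)
  integer :: i

  do i = 1, n
    a(i) = a(i) + b(i) * c(i)
  enddo
end subroutine scale_add

program call_scale_add
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n)

  a = 1.0
  b = 2.0
  c = 3.0
  call scale_add(a, b, c, n)

  ! 1.0 + 2.0 * 3.0 is exactly 7.0 in floating point.
  if (any(a /= 7.0)) error stop 'unexpected result'
  print *, 'ok'
end program call_scale_add
```

Because the compiler cannot tell from the source whether n is 10 or 10 million, a decision made purely at compile time has to guess; a run-time check lets the generated code choose between the serial and the parallel version once n is known.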

Although OpenMP directives let you convert serial applications into parallel applications quickly, you must explicitly identify the portions of your application code that contain parallelism and add the appropriate compiler directives. Auto-parallelization, which is triggered by the -parallel (Linux* OS and Mac OS* X) or /Qparallel (Windows* OS) option, automatically identifies those loop structures that contain parallelism. During compilation, the compiler automatically attempts to divide the code sequences into separate threads for parallel processing. No other effort is needed.

Note

In order to execute a program that uses auto-parallelization on Linux* OS or Mac OS* X systems, you must include the -parallel compiler option when you compile and link your program.

Using this option enables parallelization for both Intel® microprocessors and non-Intel microprocessors. The resulting executable may get more performance gain on Intel microprocessors than on non-Intel microprocessors. The parallelization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Serial code can be divided so that the code can execute concurrently on multiple threads. For example, consider the following serial code.

Example 1: Original Serial Code

subroutine ser(a, b, c)
  integer, dimension(100) :: a, b, c
  do i=1,100
    a(i) = a(i) + b(i) * c(i)
  enddo
end subroutine ser

The following example shows one way the loop iteration space of the previous example might be divided to execute on two threads.

Example 2: Transformed Parallel Code

subroutine par(a, b, c)
  integer, dimension(100) :: a, b, c
  ! Thread 1
  do i=1,50
    a(i) = a(i) + b(i) * c(i)
  enddo
  ! Thread 2
  do i=51,100
    a(i) = a(i) + b(i) * c(i)
  enddo
end subroutine par
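The two-way split generalizes to any number of threads. The sketch below computes contiguous block bounds for a hypothetical thread count nthreads and runs the blocks one after another to confirm that together they cover the same iterations as the serial loop. This static block partition is one possible scheme chosen for illustration, not a description of how the compiler actually divides the work; the program and variable names are assumptions.

```fortran
program chunk_sketch
  implicit none
  integer, parameter :: n = 100, nthreads = 4
  integer :: a(n), b(n), c(n), expected(n)
  integer :: t, lo, hi, i, chunk

  a = 1
  b = 2
  c = 3
  expected = a + b * c          ! result of the serial loop

  ! Block size, rounded up so that all n iterations are covered.
  chunk = (n + nthreads - 1) / nthreads

  ! Each value of t stands for one thread; the blocks are executed
  ! sequentially here only to check the bounds.
  do t = 0, nthreads - 1
    lo = t * chunk + 1
    hi = min(n, lo + chunk - 1)
    do i = lo, hi
      a(i) = a(i) + b(i) * c(i)
    enddo
  enddo

  if (any(a /= expected)) error stop 'chunked result differs from serial result'
  print *, 'chunked result matches serial result'
end program chunk_sketch
```

With nthreads set to 2, the computed bounds are 1 to 50 and 51 to 100, which matches the split shown in Example 2.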

Auto-Vectorization and Parallelization

Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8 or up to 16 elements in one operation, depending on the data type. In some cases auto-parallelization and vectorization can be combined for better performance results. For example, in the code below, thread-level parallelism can be exploited in the outermost loop, while instruction-level parallelism can be exploited in the innermost loop.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in more performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Example

DO I = 1, 100     ! Execute groups of iterations in different threads (TLP)
  DO J = 1, 32    ! Execute in SIMD style with multimedia extension (ILP)
     A(J,I) = A(J,I) + 1
  ENDDO
ENDDO

With the relatively small effort of adding OpenMP* directives to existing code, you can transform a sequential program into a parallel program. The following example shows OpenMP* directives within the code. Options that use OpenMP* are available for both Intel® and non-Intel microprocessors, but these options may perform more optimizations on Intel® microprocessors than they perform on non-Intel microprocessors. The major, user-visible OpenMP constructs and features that may perform differently on Intel® microprocessors and non-Intel microprocessors include: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.

Example

!$OMP PARALLEL DO PRIVATE(NUM), SHARED(X,A,B,C)
! Specifies a parallel region that
! implicitly contains a single DO directive
DO I = 1, 1000
  NUM = FOO(B(I), C(I))
  X(I) = BAR(A(I), NUM)
  ! Assume FOO and BAR have no other effect
ENDDO
!$OMP END PARALLEL DO
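For readers who want to try the pattern, the sketch below is a self-contained variant that can be compiled as is. It uses the standard !$OMP sentinel and the combined PARALLEL DO directive. The bodies of FOO and BAR are hypothetical stand-ins (the text only assumes that they have no side effects), and the module name, array sizes, and initial values are likewise assumptions.

```fortran
module kernels
  implicit none
contains
  ! Hypothetical side-effect-free stand-ins for FOO and BAR.
  pure integer function foo(b, c)
    integer, intent(in) :: b, c
    foo = b + c
  end function foo

  pure integer function bar(a, num)
    integer, intent(in) :: a, num
    bar = a * num
  end function bar
end module kernels

program omp_sketch
  use kernels
  implicit none
  integer, parameter :: n = 1000
  integer :: a(n), b(n), c(n), x(n), num, i

  a = 2
  b = 3
  c = 4

  ! num is a per-iteration temporary, so each thread needs its own copy.
  ! The loop index i is private by default.
  !$OMP PARALLEL DO PRIVATE(num) SHARED(x, a, b, c)
  do i = 1, n
    num = foo(b(i), c(i))
    x(i) = bar(a(i), num)
  enddo
  !$OMP END PARALLEL DO

  ! Every element should be 2 * (3 + 4) = 14.
  if (any(x /= 14)) error stop 'unexpected result'
  print *, 'ok'
end program omp_sketch
```

If OpenMP support is not enabled at compile time, the directives are treated as ordinary comments and the program runs serially with the same result, which makes it easy to compare the two builds.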

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

Intel recommends that you evaluate other compilers to determine which best meet your requirements.
