Kernel Programming Model

Device code specifies the amount of parallelism to request through one of several invocation mechanisms:

  • single_task – execute a single instance of the kernel with a single work item.

  • parallel_for – execute a kernel in parallel across a range of processing elements. This form of parallel_for is typically employed for “embarrassingly parallel” workloads.

  • parallel_for_work_group – execute a kernel in parallel across a hierarchical range of processing elements using local memory and barriers (a minimal sketch of this form appears after the code sample below).

The following code sample shows two combinations of invocation mechanism and kernel form:

  1. single_task and C++ lambda (lines 32-34)

  2. parallel_for and functor (lines 8-16 and line 46)

 1  #include <array>
 2  #include <CL/sycl.hpp>
 3
 4  const int SIZE = 1024;
 5
 6  using namespace sycl;
 7
 8  class Vassign {
 9    accessor<int, 1, access::mode::read_write,
10             access::target::global_buffer> access;
11
12  public:
13    Vassign(accessor<int, 1, access::mode::read_write,
14            access::target::global_buffer> &access_) : access(access_) {}
15    void operator()(id<1> id) { access[id] = 1; }
16  };
17
18  int main() {
19    std::array<int, SIZE> a;
20
21    for (int i = 0; i < SIZE; ++i) {
22      a[i] = i;
23    }
24
25    {
26      range<1> a_size{SIZE};
27      buffer<int> a_device(a.data(), a_size);
28      queue q;
29
30      q.submit([&](handler &h) {
31        auto a_in = a_device.get_access<access::mode::write>(h);
32        h.single_task([=]() {
33          a_in[0] = 2;
34        });
35      });
36    }
37
38    {
39      range<1> a_size{SIZE};
40      buffer<int> a_device(a.data(), a_size);
41      queue q;
42      q.submit([&](handler &h) {
43        auto a_in = a_device.get_access<access::mode::read_write,
44                                        access::target::global_buffer>(h);
45        Vassign F(a_in);
46        h.parallel_for(range<1>(SIZE), F);
47      });
48    }
49  }
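
The third mechanism, parallel_for_work_group, does not appear in the sample above. The sketch below is a minimal illustration of the hierarchical form, assuming a one-dimensional launch; the kernel name hier_kernel and the sizes GROUPS and GROUP_SIZE are arbitrary values chosen only for this example. Variables declared at work-group scope are allocated in local memory and shared by the work-items of the group, and consecutive parallel_for_work_item calls are separated by an implicit barrier.

#include <array>
#include <CL/sycl.hpp>

using namespace sycl;

// Hypothetical sizes chosen only for illustration (16 * 64 = 1024).
constexpr int GROUPS = 16;
constexpr int GROUP_SIZE = 64;

int main() {
  std::array<int, GROUPS * GROUP_SIZE> data{};

  {
    buffer<int> data_device(data.data(), range<1>(GROUPS * GROUP_SIZE));
    queue q;

    q.submit([&](handler &h) {
      auto acc = data_device.get_access<access::mode::read_write>(h);

      // Launch GROUPS work-groups of GROUP_SIZE work-items each.
      h.parallel_for_work_group<class hier_kernel>(
          range<1>(GROUPS), range<1>(GROUP_SIZE), [=](group<1> grp) {
            // Declared at work-group scope: allocated in local memory and
            // shared by all work-items of this group.
            int scratch[GROUP_SIZE];

            // Phase 1: each work-item fills one slot of the local array
            // with its own global index.
            grp.parallel_for_work_item([&](h_item<1> it) {
              scratch[it.get_local_id()[0]] =
                  static_cast<int>(it.get_global_id()[0]);
            });

            // An implicit barrier separates the two phases.

            // Phase 2: each work-item reads its neighbour's slot.
            grp.parallel_for_work_item([&](h_item<1> it) {
              size_t next = (it.get_local_id()[0] + 1) % GROUP_SIZE;
              acc[it.get_global_id()] = scratch[next];
            });
          });
    });
  }
}

As with the buffers in the sample above, data_device is constructed over host memory, so the results are copied back into data when the buffer goes out of scope at the end of the block.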