CUDA从入门到放弃（十四）：CUDA Thrust库

Thrust 是一个基于标准模板库（STL）的 C++ 模板库，专为 CUDA 设计，旨在简化高性能并行应用的开发。它提供了一系列数据并行原语，如扫描、排序和归约，可组合实现复杂算法。通过高级抽象描述计算，Thrust 能自动选择最优实现，适用于 CUDA 应用的快速原型设计和生产环境，提高程序员生产率和性能。

1 安装

安装 CUDA Toolkit 会将 Thrust 的头文件复制到您系统的标准 CUDA 包含目录中。由于 Thrust 是一个由头文件组成的模板库，因此无需进行其他安装步骤即可开始使用 Thrust。

2 Vectors

Thrust 提供了 host_vector 和 device_vector 两种向量容器，分别用于主机和 GPU 设备内存。它们类似于 C++ STL 中的 std::vector，是泛型容器，可动态调整大小。使用 = 运算符可以轻松地复制容器内容。device_vector 的元素可以通过标准括号访问，但应谨慎使用，因为每次访问都涉及内存复制。

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <iostream>

int main(void)
{
    // H has storage for 4 integers
    thrust::host_vector<int> H(4);

    // initialize individual elements
    H[0] = 14;
    H[1] = 20;
    H[2] = 38;
    H[3] = 46;

    // H.size() returns the size of vector H
    std::cout << "H has size " << H.size() << std::endl;

    // print contents of H
    for(int i = 0; i < H.size(); i++)
        std::cout << "H[" << i << "] = " << H[i] << std::endl;

    // resize H
    H.resize(2);

    std::cout << "H now has size " << H.size() << std::endl;

    // Copy host_vector H to device_vector D
    thrust::device_vector<int> D = H;

    // elements of D can be modified
    D[0] = 99;
    D[1] = 88;

    // print contents of D
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    // H and D are automatically deleted when the function returns
    return 0;
}

Thrust 还提供了初始化向量元素和复制特定值集的方法。

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>

#include <iostream>

int main(void)
{
    // initialize all ten integers of a device_vector to 1
    thrust::device_vector<int> D(10, 1);

    // set the first seven elements of a vector to 9
    thrust::fill(D.begin(), D.begin() + 7, 9);

    // initialize a host_vector with the first five elements of D
    thrust::host_vector<int> H(D.begin(), D.begin() + 5);

    // set the elements of H to 0, 1, 2, 3, ...
    thrust::sequence(H.begin(), H.end());

    // copy all of H back to the beginning of D
    thrust::copy(H.begin(), H.end(), D.begin());

    // print D
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    return 0;
}

2-1 Thrust 命名空间

Thrust 命名空间允许我们调用特定的函数或类，如 thrust::host_vector 和 thrust::copy，从而避免与其他库中的函数或类名称冲突。

2-2 迭代器与静态分发 Iterators and Static Dispatching

在 Thrust 中，迭代器类似于指针，用于访问容器中的元素。Thrust 函数通过检查迭代器的类型，自动决定使用主机还是设备实现，这个过程称为静态分发，它在编译时确定，没有运行时开销。

原始指针和device_ptr 互相转换：
如果使用原始指针作为参数，Thrust 会默认使用主机路径。若指针指向设备内存，则需要用 thrust::device_ptr 包装后再调用函数。

size_t N = 10;

// raw pointer to device memory
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));

// wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr(raw_ptr);

// use device_ptr in thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

从 device_ptr 中提取原始指针：

size_t N = 10;

// create a device_ptr
thrust::device_ptr<int> dev_ptr = thrust::device_malloc<int>(N);

// extract raw pointer from device_ptr
int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);

STL 和互相转换：

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <list>
#include <vector>

int main(void)
{
    // create an STL list with 4 values
    std::list<int> stl_list;

    stl_list.push_back(10);
    stl_list.push_back(20);
    stl_list.push_back(30);
    stl_list.push_back(40);

    // initialize a device_vector with the list
    thrust::device_vector<int> D(stl_list.begin(), stl_list.end());

    // copy a device_vector into an STL vector
    std::vector<int> stl_vector(D.size());
    thrust::copy(D.begin(), D.end(), stl_vector.begin());

    return 0;
}

3 Algorithms

Thrust 提供众多常见并行算法，其中许多与 STL 中的算法相对应，并采用相同名称。这些算法都有主机和设备的实现，根据使用的迭代器类型自动选择执行路径。除了 thrust::copy 外，Thrust 算法的迭代器参数应全部位于同一位置（主机或设备），否则编译器会报错。

3-1 Transformations

以下源代码演示了几种转换算法。

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    // allocate three device_vectors with 10 elements
    thrust::device_vector<int> X(10);
    thrust::device_vector<int> Y(10);
    thrust::device_vector<int> Z(10);

    // initialize X to 0,1,2,3, ....
    thrust::sequence(X.begin(), X.end());

    // compute Y = -X
    thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

    // fill Z with twos
    thrust::fill(Z.begin(), Z.end(), 2);

    // compute Y = X mod 2
    thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

    // replace all the ones in Y with tens
    thrust::replace(Y.begin(), Y.end(), 1, 10);

    // print Y
    thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));

    return 0;
}

3-2 Reductions

归约算法使用二元操作将输入序列减少到单个值。
例如，通过使用加法操作对数字数组进行归约，可以得到数组的和。使用 thrust::reduce 实现数组的和如下：

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

通过使用接受两个输入并返回最大值的操作符进行归约，可以得到数组的最大值。

 int max_value = thrust::reduce(data.begin(), data.end(),  
                                    data[0], // 初始值  
                                    thrust::maximum<int>()); // 自定义操作符

thrust::count 返回给定序列中特定值的实例数量。

int result = thrust::count(vec.begin(), vec.end(), 1);

3-3 Prefix-Sums

并行前缀和，或扫描操作，是许多并行算法（如流压缩和基数排序）中的重要组成部分。
以下代码演示了使用默认加法运算符的 inclusive_scan 操作：

#include <thrust/scan.h>

int data[6] = {1, 0, 2, 2, 1, 3};

thrust::inclusive_scan(data, data + 6, data); // in-place scan

// data is now {1, 1, 3, 5, 6, 9}

在包含扫描中，输出序列的每个元素都是输入范围内相应部分的和。例如，data[2] = data[0] + data[1] + data[2]。

exclusive scan 类似，但向右移动了一个位置：

#include <thrust/scan.h>

int data[6] = {1, 0, 2, 2, 1, 3};

thrust::exclusive_scan(data, data + 6, data); // in-place scan

// data is now {0, 1, 1, 3, 5, 6}

因此，现在 data[2] = data[0] + data[1]。

Thrust 还提供了 transform_inclusive_scan 和 transform_exclusive_scan 函数，它们在执行扫描之前对输入序列应用一元函数。

3-4 Reordering

Thrust 通过以下算法提供了对分区和流压缩的支持：

copy_if：复制满足特定断言条件的元素到一个新的容器

#include <thrust/copy_if.h>  
#include <thrust/device_vector.h>  
#include <thrust/functional.h> // for thrust::is_even  
#include <iostream>  
  
int main(void)  
{  
    // 初始化设备向量  
    thrust::device_vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};  
      
    // 创建一个空的设备向量用于存储结果  
    thrust::device_vector<int> even_numbers(data.size());  
      
    // 使用 copy_if 复制偶数  
    // 注意：copy_if 的结果迭代器是输出范围的开始迭代器  
    thrust::copy_if(data.begin(), data.end(),  
                    even_numbers.begin(),  
                    thrust::is_even<int>()); // 断言条件：检查是否为偶数  
      
    // 获取实际复制的元素数量  
    size_t even_count = thrust::distance(even_numbers.begin(),  
                                          thrust::remove_if(even_numbers.begin(), even_numbers.end(), thrust::not1(thrust::is_even<int>())).end());  
      
    // 调整 even_numbers 的大小以匹配实际元素数量  
    even_numbers.resize(even_count);  
      
    // 输出结果  
    std::cout << "Even numbers: ";  
    thrust::copy(even_numbers.begin(), even_numbers.end(), std::ostream_iterator<int>(std::cout, " "));  
    std::cout << std::endl;  
      
    return 0;  
}

partition：partition 函数会根据提供的断言条件重新排列输入序列中的元素，使得满足条件的元素出现在不满足条件的元素之前。

#include <thrust/partition.h>  
#include <thrust/device_vector.h>  
#include <thrust/functional.h> // for thrust::less  
#include <iostream>  
  
int main(void)  
{  
    // 初始化设备向量  
    thrust::device_vector<int> data = {1, 3, 2, 5, 4, 6, 7, 8, 9, 10};  
      
    // 使用 partition 重新排列数据，使得小于 5 的数在前  
    thrust::partition(data.begin(), data.end(), thrust::less<int>(5));  
      
    // 输出结果  
    std::cout << "Partitioned data: ";  
    thrust::copy(data.begin(), data.end(), std::ostream_iterator<int>(std::cout, " "));  
    std::cout << std::endl;  
      
    return 0;  
}

remove 和 remove_if：remove函数会移除给定序列中等于某个特定值的所有元素，而remove_if函数则会根据提供的断言条件移除元素。

#include <thrust/remove.h>  
#include <thrust/device_vector.h>  
#include <thrust/copy.h>  
#include <iostream>  
  
int main(void)  
{  
    // 初始化设备向量  
    thrust::device_vector<int> data = {1, 2, 3, 2, 4, 2, 5, 6};  
      
    // 使用 remove 移除所有值为 2 的元素  
    thrust::device_vector<int>::iterator new_end = thrust::remove(data.begin(), data.end(), 2);  

	// 使用 remove_if 移除所有大于 5 的元素  
    // thrust::device_vector<int>::iterator new_end = thrust::remove_if(data.begin(), data.end(), thrust::greater<int>(5));
      
    // 调整数据向量的大小以匹配新的末尾迭代器  
    data.resize(thrust::distance(data.begin(), new_end));  
      
    // 输出结果  
    std::cout << "Data after removing 2s: ";  
    thrust::copy(data.begin(), data.end(), std::ostream_iterator<int>(std::cout, " "));  
    std::cout << std::endl;  
      
    return 0;  
}

unique：unique函数会移除序列中连续重复的元素，使得每个元素只出现一次。

#include <thrust/unique.h>  
#include <thrust/device_vector.h>  
#include <thrust/copy.h>  
#include <iostream>  
  
int main(void)  
{  
    // 初始化设备向量  
    thrust::device_vector<int> data = {1, 2, 2, 3, 4, 4, 4, 5, 6, 6};  
      
    // 使用 unique 移除连续重复的元素  
    thrust::device_vector<int>::iterator new_end = thrust::unique(data.begin(), data.end());  
      
    // 调整数据向量的大小以匹配新的末尾迭代器  
    data.resize(thrust::distance(data.begin(), new_end));  
      
    // 输出结果  
    std::cout << "Data after removing consecutive duplicates: ";  
    thrust::copy(data.begin(), data.end(), std::ostream_iterator<int>(std::cout, " "));  
    std::cout << std::endl;  
      
    return 0;  
}

3-5 Sorting

Thrust 提供多种函数，可按给定条件排序或重新排列数据。其中，thrust::sort 和 thrust::stable_sort 与 STL 中的对应函数相似。

#include <thrust/sort.h>

...
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};

thrust::sort(A, A + N);

// A is now {1, 2, 4, 5, 7, 8}

此外，Thrust 还有 thrust::sort_by_key 和 thrust::stable_sort_by_key，用于排序分别存储的键值对。

#include <thrust/sort.h>

...
const int N = 6;
int    keys[N] = {  1,   4,   2,   8,   5,   7};
char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};

thrust::sort_by_key(keys, keys + N, values);

// keys is now   {  1,   2,   4,   5,   7,   8}
// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

这些排序函数也支持用户自定义比较操作符。

#include <thrust/sort.h>
#include <thrust/functional.h>

...
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};

thrust::stable_sort(A, A + N, thrust::greater<int>());

// A is now {8, 7, 5, 4, 2, 1}

4 Fancy Iterators

4-1 constant_iterator

constant_iterator 是一种简单的迭代器，它在每次访问时返回相同的值。

#include <thrust/iterator/constant_iterator.h>
...
// create iterators
thrust::constant_iterator<int> first(10);
thrust::constant_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 10
first[100] // returns 10

// sum of [first, last)
thrust::reduce(first, last);   // returns 30 (i.e. 3 * 10)

4-2 counting_iterator

counting_iterator 生成递增的序列。

#include <thrust/iterator/counting_iterator.h>
...
// create iterators
thrust::counting_iterator<int> first(10);
thrust::counting_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 11
first[100] // returns 110

// sum of [first, last)
thrust::reduce(first, last);   // returns 33 (i.e. 10 + 11 + 12)

4-3 transform_iterator

transform_iterator 允许我们对序列中的每个元素应用转换。

#include <thrust/iterator/transform_iterator.h>
// initialize vector
thrust::device_vector<int> vec(3);
vec[0] = 10; vec[1] = 20; vec[2] = 30;

// create iterator (type omitted)
...
first = thrust::make_transform_iterator(vec.begin(), negate<int>());
...
last  = thrust::make_transform_iterator(vec.end(),   negate<int>());

first[0]   // returns -10
first[1]   // returns -20
first[2]   // returns -30

// sum of [first, last)
thrust::reduce(first, last);   // returns -60 (i.e. -10 + -20 + -30)

// 或者
// sum of [first, last)
thrust::reduce(thrust::make_transform_iterator(vec.begin(), negate<int>()),
               thrust::make_transform_iterator(vec.end(),   negate<int>()));

4-4 permutation_iterator

permutation_iterator 允许我们重新排列序列中元素的顺序。通过指定一个映射，我们可以按特定顺序访问元素：

#include <thrust/iterator/permutation_iterator.h>

...

// gather locations
thrust::device_vector<int> map(4);
map[0] = 3;
map[1] = 1;
map[2] = 0;
map[3] = 5;

// array to gather from
thrust::device_vector<int> source(6);
source[0] = 10;
source[1] = 20;
source[2] = 30;
source[3] = 40;
source[4] = 50;
source[5] = 60;

// fuse gather with reduction:
//   sum = source[map[0]] + source[map[1]] + ...
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
                         thrust::make_permutation_iterator(source.begin(), map.end()));

4-5 zip_iterator

zip_iterator 可以将多个序列合并为一个元组序列。这使得我们可以同时处理多个序列：

#include <thrust/iterator/zip_iterator.h>
...
// initialize vectors
thrust::device_vector<int>  A(3);
thrust::device_vector<char> B(3);
A[0] = 10;  A[1] = 20;  A[2] = 30;
B[0] = 'x'; B[1] = 'y'; B[2] = 'z';

// create iterator (type omitted)
first = thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin()));
last  = thrust::make_zip_iterator(thrust::make_tuple(A.end(),   B.end()));

first[0]   // returns tuple(10, 'x')
first[1]   // returns tuple(20, 'y')
first[2]   // returns tuple(30, 'z')

// maximum of [first, last)
thrust::maximum< tuple<int,char> > binary_op;
thrust::tuple<int,char> init = first[0];
thrust::reduce(first, last, init, binary_op); // returns tuple(30, 'z')