Question

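I am asking whether integer matrix multiplication can be improved considerably with bitwise operations. The matrices are small, and the elements are small non-negative integers (small means at most 20).

To keep us focused, let's be specific and say that I have two 3x3 matrices with integer entries 0 <= x < 15.

The following naive C++ implementation, executed a million times, takes around 1 second, measured with the Linux time command.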

#include <random>

int main() {
    // Random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> distr(0, 15);

    int A[3][3];
    int B[3][3];
    int C[3][3];
    for (int trials = 0; trials <= 1000000; trials++) {
        // Set up A[] and B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                A[i][j] = distr(eng);
                B[i][j] = distr(eng);
                C[i][j] = 0;
            }
        }
        // Compute C[] = A[] * B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                for (int k = 0; k < 3; ++k) {
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }
    }
    return 0;
}

Notes:

The matrices are not necessarily sparse.

Strassen-like comments do not help here.
Let's try not to use the circumstantial observation that in this specific problem the matrices A[] and B[] can each be encoded as a single 64-bit integer. Think of what would happen for just slightly larger matrices.
Computation is single-threaded.

Related: a question on multiplying binary (one-bit-element) matrices, and "What is the optimal algorithm for the game 2048?"

Answer 1

The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.

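For adding 2-bit elements, it might be plausible (and faster than unpacking) to build the add basically from scratch: XOR (carryless add), then generate the carry with AND, shift it, and mask off carries that would cross element boundaries.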
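As a rough sketch of the XOR/AND/shift/mask idea above, the following adds 32 two-bit fields packed in one 64-bit word, each field modulo 4. The names add_2bit_fields, field_2bit, and kLowBits are made up for this illustration and do not come from the answer.

```cpp
#include <cassert>
#include <cstdint>

// Low bit of every 2-bit field (hypothetical helper constant).
constexpr std::uint64_t kLowBits = 0x5555555555555555ULL;

// Add 32 packed 2-bit fields at once; each field wraps modulo 4.
inline std::uint64_t add_2bit_fields(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t sum   = a ^ b;               // carryless add of every bit
    const std::uint64_t carry = (a & b) & kLowBits;  // keep only carries out of low bits
    return sum ^ (carry << 1);                       // fold each carry into its own field's high bit
}

// Read field i (0..31) of a packed word.
inline unsigned field_2bit(std::uint64_t w, int i) {
    assert(i >= 0 && i < 32);
    return static_cast<unsigned>((w >> (2 * i)) & 3u);
}
```

Masking with kLowBits before the shift is what drops the carry out of each field's high bit, so nothing leaks into the neighbouring field.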

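A third bit would require detecting when adding the carry produces yet another carry. I don't think emulating even a 3-bit adder or multiplier would be a win compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For the add, you could also try a normal add and then undo the carry across element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.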
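One common way to use the ordinary adder on packed fields is sketched below for 4-bit fields. Note that it blocks the carry at each field's top bit rather than undoing it afterwards, so it is a variation on the suggestion above, not necessarily what the answer had in mind. The names add_4bit_fields, field_4bit, and kTopBits are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Top bit of every 4-bit field (hypothetical helper constant).
constexpr std::uint64_t kTopBits = 0x8888888888888888ULL;

// Add sixteen packed 4-bit fields at once; each field wraps modulo 16.
inline std::uint64_t add_4bit_fields(std::uint64_t a, std::uint64_t b) {
    // Ordinary add on the low three bits of each field: with the top bits
    // cleared, a carry can reach a field's top bit but never the next field.
    const std::uint64_t low = (a & ~kTopBits) + (b & ~kTopBits);
    // Add the top bits without carry by XOR-ing them in.
    return low ^ ((a ^ b) & kTopBits);
}

// Read field i (0..15) of a packed word.
inline unsigned field_4bit(std::uint64_t w, int i) {
    assert(i >= 0 && i < 16);
    return static_cast<unsigned>((w >> (4 * i)) & 0xFu);
}
```

This costs one add plus a handful of logic operations regardless of how many fields fit in the word.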

packed vs. unpacked-to-bytes storage formats

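If you have very many of these tiny matrices, storing them in memory in a compressed format (e.g. packed 4-bit elements) can help with cache footprint / memory bandwidth. 4-bit elements are fairly easy to unpack so that each element sits in a separate byte element of a vector.

Otherwise, store them with one matrix element per byte. From there you can easily unpack them to 16 or 32 bits per element if needed, depending on which element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format for reuse across multiplies, but pack them back to 4 bits per element for storage in an array.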
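To make the two storage formats concrete, here is a scalar sketch that packs a 3x3 matrix of 4-bit elements into one 64-bit word and unpacks it back to one element per byte. A SIMD version would do the unpack with shifts and masks on a whole vector; this version only illustrates the layout. Mat3, pack4, and unpack4 are names invented for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unpacked format: row-major, one element per byte.
using Mat3 = std::array<std::uint8_t, 9>;

// Packed format: element i occupies bits [4*i, 4*i + 3] of the word,
// so nine elements use the low 36 bits.
inline std::uint64_t pack4(const Mat3& m) {
    std::uint64_t w = 0;
    for (std::size_t i = 0; i < m.size(); ++i) {
        assert(m[i] < 16);  // each element must fit in 4 bits
        w |= static_cast<std::uint64_t>(m[i]) << (4 * i);
    }
    return w;
}

inline Mat3 unpack4(std::uint64_t w) {
    Mat3 m{};
    for (std::size_t i = 0; i < m.size(); ++i)
        m[i] = static_cast<std::uint8_t>((w >> (4 * i)) & 0xF);
    return m;
}
```

An array of such words takes 8 bytes per matrix instead of 36 bytes for int[3][3].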

Compilers suck at this with uint8_t in scalar C code for x86. See the comments on Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output of a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage that leaves outside the low 8 bits of the destination register.

The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.

You're probably going to get the best results from some kind of SIMD.

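Intel SSSE3 has a vector byte multiply (PMADDUBSW), but only combined with adding adjacent elements. Using it would require unpacking your matrix into a vector with some zeros between rows or something similar, so that data from one row does not get mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.

More likely to be useful is SSE2 PMADDWD, if you unpack each matrix element into a separate 16-bit vector element. Given a row in one vector and a transposed column in another vector, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].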

Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.
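A minimal sketch of the pmaddwd idea, doing one dot product per instruction and the single remaining horizontal add. It assumes B is supplied already transposed (Bt), builds the vectors from scalars for clarity rather than speed, and falls back to scalar code when SSE2 is not available. The function names dot3 and matmul3x3 are made up for illustration, and it does not attempt the further packing of several results into one vector.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Dot product of two 3-element vectors of 16-bit values.
inline std::int32_t dot3(const std::int16_t* row, const std::int16_t* col) {
    assert(row != nullptr && col != nullptr);
#if defined(__SSE2__)
    const __m128i r = _mm_setr_epi16(row[0], row[1], row[2], 0, 0, 0, 0, 0);
    const __m128i c = _mm_setr_epi16(col[0], col[1], col[2], 0, 0, 0, 0, 0);
    // pmaddwd: 32-bit lane 0 = r0*c0 + r1*c1, lane 1 = r2*c2 + 0*0
    const __m128i pairs = _mm_madd_epi16(r, c);
    // The one remaining horizontal add: lane 0 + lane 1
    const __m128i sum = _mm_add_epi32(pairs, _mm_srli_si128(pairs, 4));
    return _mm_cvtsi128_si32(sum);
#else
    return row[0] * col[0] + row[1] * col[1] + row[2] * col[2];  // scalar fallback
#endif
}

// C = A * B, where Bt is B transposed so every column of B is contiguous.
inline void matmul3x3(const std::int16_t A[3][3], const std::int16_t Bt[3][3],
                      std::int32_t C[3][3]) {
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            C[i][j] = dot3(A[i], Bt[j]);
}
```

In real code the rows would be loaded from memory already padded to vector width, and whether this beats the compiler's output for the scalar loop has to be measured.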

Answer 2

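You may find that reducing the data size gives you a considerable performance improvement if you are performing this calculation over a large number of matrices: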

#include <cstdint>
#include <cstdlib>

using T = std::uint_fast8_t;

void mpy(T A[3][3], T B[3][3], T C[3][3])
{
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
}

The Pentium can move and sign-extend an 8-bit value in one instruction. This means you're getting 4 times as many matrices per cache line.

UPDATE: Curiosity piqued, I wrote a test:

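#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <iterator>
#include <memory>
#include <random>
#include <typeinfo>
#include <utility>
#include <vector>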

template<class T>
struct matrix
{
    static constexpr std::size_t rows = 3;
    static constexpr std::size_t cols = 3;
    static constexpr std::size_t size() { return rows * cols; }

    template<class Engine, class U>
    matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
    : matrix(std::make_index_sequence<size()>(), engine, dist)
    {}

    template<class U>
    matrix(std::initializer_list<U> li)
    : matrix(std::make_index_sequence<size()>(), li)
    {

    }

    matrix()
    : _data { 0 }
    {}

    const T* operator[](std::size_t i) const {
        return std::addressof(_data[i * cols]);
    }

    T* operator[](std::size_t i) {
        return std::addressof(_data[i * cols]);
    }

private:

    template<std::size_t...Is, class U, class Engine>
    matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
    : _data { (void(Is), dist(eng))... }
    {}

    template<std::size_t...Is, class U>
    matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
    : _data { ((Is < li.size()) ? *(li.begin() + Is) : 0)... }
    {}


    T _data[rows * cols];
};

template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
    matrix<T> C;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

static constexpr std::size_t test_size = 1000000;
template<class T, class Engine>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<T>& dist)
{
    v.clear();
    v.reserve(test_size);
    generate_n(std::back_inserter(v), test_size,
               [&] { return matrix<T>(eng, dist); });
}

template<class T>
void test(std::random_device& rd)
{
    std::mt19937 eng(rd());
    std::uniform_int_distribution<T> distr(0, 15);

    std::vector<matrix<T>> As, Bs, Cs;
    fill(As, eng, distr);
    fill(Bs, eng, distr);
    fill(Cs, eng, distr);

    auto start = std::chrono::high_resolution_clock::now();
    auto ia = As.cbegin();
    auto ib = Bs.cbegin();
    for (auto&m : Cs)
    {
        m = *ia++ * *ib++;
    }
    auto stop = std::chrono::high_resolution_clock::now();

    auto diff = stop - start;
    // The duration is measured in microseconds, so name the variable accordingly
    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
    std::cout << "for type " << typeid(T).name() << " time is " << micros << "us" << std::endl;

}

int main() {
    //Random number generator
    std::random_device rd;
    test<std::uint64_t>(rd);
    test<std::uint32_t>(rd);
    test<std::uint16_t>(rd);
    test<std::uint8_t>(rd);
}

Example output (MacBook, 64-bit, compiled with -O3):

for type y time is 32787us
for type j time is 15323us
for type t time is 14347us
for type h time is 31550us

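Summary:

It turns out that on this platform int32 and int16 are as fast as each other. int64 and int8 are equally slow (the 8-bit result surprised me).

Conclusion:

As ever, express intent to the compiler and let the optimiser do its thing. If the program runs too slowly in production, take measurements and optimise the worst offenders.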

Answer 3

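Yes: with a grasp of the underlying linear algebra, and with the help of the 64-bit hardware multiplier, you can do this for integers.

If you want a bit of background, please refer to a previous answer of mine that does basically what you want, except over boolean algebra, and which can serve as a starting point for this problem:

• Fast multiplication of k x k boolean matrices, where 8 <= k <= 16

In your case, instead of representing an 8x8 boolean matrix in each 64-bit word, you can only pack a 2x2 matrix, with each element padded by zeros (in the code below, a 7-bit value in a 16-bit field).

I am leaving out further details, since the linked answer is needed to understand them.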

uint64_t uint7matmul2x2x2 (uint64_t A, uint64_t B) {

    const uint64_t ROW = 0x00000000007F007F;
    const uint64_t COL = 0x0000007F0000007F;

    uint64_t C = 0;

    for (int i=0; i<2; ++i) {
        uint64_t p = COL & (A>>i*16);
        uint64_t r = ROW & (B>>i*32);
        C += (p*r);
    }
    return C;
}

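In this case we pack 2x2 = 4 element products into each uint64 multiplication, so the 2x2x2 matrix product is computed with only 2 uint64 multiplications.

The packing and unpacking parts of the algorithm are more involved than in the boolean case, but fairly straightforward once you understand the rest of the algorithm. Also, as you say, if the integers are less than 32, you can accumulate the results of several partial matrix products in the u64 representation, using only 64-bit additions, before unpacking the partial results.

Packing 2x3 = 6 products per multiplication may be possible by extending the same basic analysis, and perhaps even more with the help of group theory, but that is far more work than I am prepared to do right now.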
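The packing and unpacking steps are not shown in the answer. The sketch below is one layout that is consistent with the ROW and COL masks: element (r, c) sits in the 16-bit field starting at bit 16 * (2 * r + c). The helpers pack2x2 and unpack2x2 are hypothetical names; the multiply is repeated from the answer so the example stands alone.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: element (r, c) lives in the 16-bit field at bit 16 * (2 * r + c).
inline std::uint64_t pack2x2(const unsigned m[2][2]) {
    std::uint64_t w = 0;
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c) {
            assert(m[r][c] < 128);  // inputs must fit in 7 bits
            w |= static_cast<std::uint64_t>(m[r][c]) << (16 * (2 * r + c));
        }
    return w;
}

inline void unpack2x2(std::uint64_t w, unsigned m[2][2]) {
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c)
            m[r][c] = static_cast<unsigned>((w >> (16 * (2 * r + c))) & 0xFFFF);
}

// The multiply from the answer, repeated so the sketch is self-contained.
inline std::uint64_t uint7matmul2x2x2(std::uint64_t A, std::uint64_t B) {
    const std::uint64_t ROW = 0x00000000007F007FULL;
    const std::uint64_t COL = 0x0000007F0000007FULL;
    std::uint64_t C = 0;
    for (int i = 0; i < 2; ++i) {
        const std::uint64_t p = COL & (A >> (i * 16));  // column i of A
        const std::uint64_t r = ROW & (B >> (i * 32));  // row i of B
        C += p * r;  // outer product: four result fields in one multiply
    }
    return C;
}
```

With 7-bit inputs each result field is at most 2 * 127 * 127 = 32258, which fits in 16 bits, so the four fields never overflow into each other.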
