Fast integer matrix multiplication with bit-twiddling hacks
Best answer

The question you linked to is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.

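For adding 2-bit elements, it might be plausible (and faster than unpacking) to build the add basically from scratch: XOR (carry-less add), then generate the carry with AND, shift it, and mask off carries that would cross element boundaries.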
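A minimal sketch of that XOR/AND/shift/mask add, assuming 32 two-bit elements packed into one uint64_t and wrap-around (mod 4) arithmetic per element. The names HI2 and add_packed2 are made up for illustration.

```cpp
#include <cstdint>

// High bit of every 2-bit field: binary 1010...10.
constexpr std::uint64_t HI2 = 0xAAAAAAAAAAAAAAAAull;

// Adds 32 packed 2-bit elements, each modulo 4.
constexpr std::uint64_t add_packed2(std::uint64_t a, std::uint64_t b) {
    std::uint64_t sum = a ^ b;                   // carry-less add of every bit
    // Carries out of each field's low bit move up into its high bit.
    // The mask drops carries that the shift pushed into the next field.
    std::uint64_t carry = ((a & b) << 1) & HI2;
    // Any carry out of the high bit is discarded, so a single XOR finishes the add.
    return sum ^ carry;
}
```

With 2-bit fields one carry step is enough, because a carry out of the high bit simply leaves the element. A wider field needs the carry to ripple further, which is where this approach starts to cost more.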

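A 3rd bit would require detecting when adding the carry produces yet another carry. I don't think emulating even a 3-bit adder or multiplier would be a win compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For addition, you could also try a normal add and then undo the carry across element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.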
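One way to read the "normal add" suggestion is the common SWAR idiom sketched here. Note that it prevents the cross-element carry (by clearing each field's top bit before the add and patching that bit back in with XOR) rather than literally undoing a carry afterwards. The 4-bit field width and the names H4 and add_packed4 are assumptions for illustration.

```cpp
#include <cstdint>

// Top bit of every 4-bit field.
constexpr std::uint64_t H4 = 0x8888888888888888ull;

// Adds 16 packed 4-bit elements, each modulo 16, using one ordinary add.
constexpr std::uint64_t add_packed4(std::uint64_t a, std::uint64_t b) {
    // With the top bits cleared, a per-field sum is at most 7 + 7 = 14,
    // so the ordinary add can never carry into the neighbouring field.
    std::uint64_t low = (a & ~H4) + (b & ~H4);
    // Top bit of each field = a_top ^ b_top ^ carry-in; the carry-in is
    // already sitting in the top bit of `low`.
    return low ^ ((a ^ b) & H4);
}
```

For comparison, a plain 0x00F9 + 0x0019 gives 0x0112, where the carries have leaked into the neighbouring elements; add_packed4 returns 0x0002 for the same inputs.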


packed vs. unpacked-to-bytes storage formats

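If you have very many of these tiny matrices, storing them in memory in a compressed format (e.g. packed 4-bit elements) can help with cache footprint / memory bandwidth. 4-bit elements are fairly easy to unpack so that each element sits in a separate byte element of a vector.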

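Otherwise, store them with one matrix element per byte. From there you can easily unpack to 16 or 32 bits per element if needed, depending on what element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format for reuse across multiplies, but pack them back to 4 bits per element for storage in an array.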
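As a rough scalar illustration of converting between the two storage formats, the sketch below spreads eight 4-bit elements from a uint32_t into the eight bytes of a uint64_t and packs them back. The function names and the choice of eight elements per word are assumptions; a 3x3 matrix has nine elements, so a real layout would need a second word or a wider type.

```cpp
#include <cstdint>

// Spread eight packed 4-bit elements into eight separate bytes.
constexpr std::uint64_t unpack_nibbles(std::uint32_t packed) {
    std::uint64_t bytes = 0;
    for (int i = 0; i < 8; ++i) {
        std::uint64_t nib = (packed >> (4 * i)) & 0xFu;  // element i
        bytes |= nib << (8 * i);                         // goes into byte i
    }
    return bytes;
}

// Inverse: keep the low 4 bits of each byte and pack them together again.
constexpr std::uint32_t pack_nibbles(std::uint64_t bytes) {
    std::uint32_t packed = 0;
    for (int i = 0; i < 8; ++i) {
        std::uint32_t nib = static_cast<std::uint32_t>((bytes >> (8 * i)) & 0xFu);
        packed |= nib << (4 * i);
    }
    return packed;
}
```

A SIMD version would do the same job with shifts, masks, and an interleave, but the data movement is the same as in this loop.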


Compilers suck at this with uint8_t in scalar C code for x86. See the comments on Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output of a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage it leaves outside the low 8 bits of the destination register.

The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.


You're probably going to get the best results from some kind of SIMD.

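Intel SSSE3 has a vector byte multiply (pmaddubsw), but only with adding of adjacent elements. Using it would require unpacking your matrix into a vector with some zeros between rows or something similar, so that data from one row doesn't get mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.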

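More likely to be useful is SSE2 PMADDWD, if you unpack so that each matrix element is in a separate 16-bit vector element. Given a row in one vector and a transposed column in another, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].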

Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.
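A minimal sketch of the pmaddwd idea, assuming an x86 target with SSE2 and a hypothetical layout in which each row of A and each column of B (i.e. each row of the transposed Bt) is stored as four int16 values with a zero in the fourth slot. It does the simple one-dot-product-per-pmaddwd version, not the packed variant that stores C[i][0..2] in one go; the function names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Dot product of a padded row and a padded (transposed) column.
inline std::int32_t dot3_madd(const std::int16_t* row, const std::int16_t* col) {
    // The padding lane must be zero in at least one operand.
    assert(row[3] == 0 || col[3] == 0);
    __m128i a = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row));
    __m128i b = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(col));
    // 32-bit lanes: { r0*c0 + r1*c1, r2*c2 + r3*c3, 0, 0 }
    __m128i prod = _mm_madd_epi16(a, b);
    // The one remaining horizontal add: bring lane 1 down to lane 0 and add.
    __m128i sum = _mm_add_epi32(prod, _mm_srli_si128(prod, 4));
    return _mm_cvtsi128_si32(sum);
}

// C = A * B, where Bt holds B transposed so its rows are B's columns.
inline void mul3x3_madd(const std::int16_t A[3][4], const std::int16_t Bt[3][4],
                        std::int32_t C[3][3]) {
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            C[i][j] = dot3_madd(A[i], Bt[j]);
}
```

Unlike the scalar loop, this overwrites C rather than accumulating into it. Whether it beats the compiler's scalar code for a single 3x3 product would need measuring.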

Other answers

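You may find that reducing the data size gives a considerable performance improvement if you are performing this calculation over a large number of matrices: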

#include <cstdint>
#include <cstdlib>

using T = std::uint_fast8_t;

// Accumulates A * B into C; the caller must zero C first to get a plain product.
void mpy(T A[3][3], T B[3][3], T C[3][3])
{
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
}

The Pentium can move and sign-extend an 8-bit value in one instruction. This means you're getting 4 times as many matrices per cache line (compared with 32-bit elements).

UPDATE: Curiosity piqued, I wrote a test:

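#include <random>
#include <utility>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <typeinfo>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <iterator>
#include <memory>
#include <vector>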

template<class T>
struct matrix
{
    static constexpr std::size_t rows = 3;
    static constexpr std::size_t cols = 3;
    static constexpr std::size_t size() { return rows * cols; }

    template<class Engine, class U>
    matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
    : matrix(std::make_index_sequence<size()>(), engine, dist)
    {}

    template<class U>
    matrix(std::initializer_list<U> li)
    : matrix(std::make_index_sequence<size()>(), li)
    {

    }

    matrix()
    : _data { 0 }
    {}

    const T* operator[](std::size_t i) const {
        return std::addressof(_data[i * cols]);
    }

    T* operator[](std::size_t i) {
        return std::addressof(_data[i * cols]);
    }

private:

    template<std::size_t...Is, class U, class Engine>
    matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
    : _data { (void(Is), dist(eng))... }
    {}

    template<std::size_t...Is, class U>
    matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
    : _data { ((Is < li.size()) ? *(li.begin() + Is) : 0)... }
    {}


    T _data[rows * cols];
};

template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
    matrix<T> C;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

static constexpr std::size_t test_size = 1000000;
template<class T, class Engine>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<T>& dist)
{
    v.clear();
    v.reserve(test_size);
    std::generate_n(std::back_inserter(v), test_size,
               [&] { return matrix<T>(eng, dist); });
}

template<class T>
void test(std::random_device& rd)
{
    std::mt19937 eng(rd());
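    // Note: the standard does not list 8-bit types as valid for
    // uniform_int_distribution, so the uint8_t run relies on the library accepting it.
    std::uniform_int_distribution<T> distr(0, 15);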

    std::vector<matrix<T>> As, Bs, Cs;
    fill(As, eng, distr);
    fill(Bs, eng, distr);
    fill(Cs, eng, distr);

    auto start = std::chrono::high_resolution_clock::now();
    auto ia = As.cbegin();
    auto ib = Bs.cbegin();
    for (auto&m : Cs)
    {
        m = *ia++ * *ib++;
    }
    auto stop = std::chrono::high_resolution_clock::now();

    auto diff = stop - start;
    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
    std::cout << "for type " << typeid(T).name() << " time is " << micros << "us" << std::endl;

}

int main() {
    //Random number generator
    std::random_device rd;
    test<std::uint64_t>(rd);
    test<std::uint32_t>(rd);
    test<std::uint16_t>(rd);
    test<std::uint8_t>(rd);
}

Example output (64-bit laptop, compiled with -O3):

for type y time is 32787us
for type j time is 15323us
for type t time is 14347us
for type h time is 31550us

Summary:

On this platform, int32 and int16 turned out to be as fast as each other. int64 and int8 were equally slow (the 8-bit result surprised me).

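Conclusion:

As ever, express intent to the compiler and let the optimizer do its thing. If the program runs too slowly in production, take measurements and optimize the worst offenders.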




