I am one of the lead developers of ILNumerics, so I am obviously biased ;) But since we are quite open about our internals, I will share some insights into our speed secrets.
It all depends on how system resources are utilized! If you care about pure speed and need to handle large arrays, you will make sure to (ordered by importance, most important first)
Manage your memory appropriately! Naive memory management leads to bad performance: it stresses the GC, causes memory fragmentation, and degrades memory locality (and hence cache performance). In a garbage collected environment like .NET, this boils down to avoiding frequent memory allocations. In ILNumerics we implemented a high performance memory pool to achieve this goal (plus deterministic disposal of temporary arrays, which gives a nice, comfortable syntax without clumsy function semantics).
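To illustrate the pooling idea (this is a minimal sketch, not ILNumerics' actual pool, which also handles bucket sizing, trimming and deterministic disposal): buffers are recycled by length instead of being handed to the GC.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical minimal pool: recycles double[] buffers by length so
// hot loops do not allocate at all after warm-up.
class SimpleArrayPool
{
    readonly ConcurrentDictionary<int, ConcurrentBag<double[]>> _buckets
        = new ConcurrentDictionary<int, ConcurrentBag<double[]>>();

    public double[] Rent(int length)
    {
        var bag = _buckets.GetOrAdd(length, _ => new ConcurrentBag<double[]>());
        double[] buffer;
        return bag.TryTake(out buffer) ? buffer : new double[length];
    }

    public void Return(double[] buffer)
    {
        _buckets.GetOrAdd(buffer.Length, _ => new ConcurrentBag<double[]>())
                .Add(buffer);
    }
}
```

A caller rents a temporary, uses it, and returns it; the next rent of the same size reuses the very same storage, so the GC never sees the churn.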
Utilize parallelism! This targets both thread-level and data-level parallelism. Multiple cores are utilized by threading the computation-intensive parts of a calculation. On X86/X64 CPUs, SIMD/multimedia extensions like SSE.XX and AVX allow a small but effective vectorization. They are not directly addressable by current .NET languages, and this is the only reason why the MKL may still be faster than pure .NET code. (But solutions are already emerging.)
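The thread-level part is straightforward in .NET with the TPL. A sketch (the chunking threshold and partitioning scheme are my assumptions, not ILNumerics' internals) - note the range partitioner: scheduling a lambda per element would drown the arithmetic in overhead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ParallelAdd
{
    // Element-wise addition, split into contiguous index ranges so each
    // worker thread processes a whole chunk with a tight inner loop.
    public static void Add(double[] x, double[] y, double[] result)
    {
        var ranges = Partitioner.Create(0, result.Length);
        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                result[i] = x[i] + y[i];
        });
    }
}
```

In practice one would fall back to a plain sequential loop below some problem-size threshold, since thread startup costs more than the work saved.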
To achieve the speed of highly optimized languages like FORTRAN and C++, the same optimizations must be applied to your code as are done for them. C# offers the option to do so.
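One concrete example of such an optimization is array bounds check removal. The .NET JIT elides the check on its own when the loop bound is provably the array's own Length; for access patterns it cannot prove safe, unsafe pointer code achieves the same effect. A sketch (compile with /unsafe; this is illustrative, not ILNumerics' actual code):

```csharp
static class Sum
{
    // The JIT removes the per-access bounds check here, because the
    // loop condition guarantees i < a.Length.
    public static double Checked(double[] a)
    {
        double s = 0.0;
        for (int i = 0; i < a.Length; i++) s += a[i];
        return s;
    }

    // Pointer version: no bounds checks at all, FORTRAN-style inner loop,
    // at the cost of an 'unsafe' context and a pinned array.
    public static unsafe double Unchecked(double[] a)
    {
        double s = 0.0;
        fixed (double* p = a)
        {
            for (int i = 0; i < a.Length; i++) s += p[i];
        }
        return s;
    }
}
```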
Note, these precautions should be followed in that order! It does not make sense to care about SSE extensions or even bounds check removal if the bottleneck is the memory bandwidth and the processor(s) spend most of the time waiting for new data. Also, for many simple operations it does not even pay off to invest huge effort to achieve the last tiny step towards peak performance! Consider the common example of the BLAS function DAXPY. It adds the elements of a vector X to the corresponding elements of another vector Y. If this is done for the first time, all the memory for X and Y has to be fetched from main memory. There is little to nothing you can do about it. And memory is the bottleneck! So regardless of whether the addition at the end is done the naive way in C#
for (int i = 0; i < C.Length; i++) {
C[i] = X[i] + Y[i];
}
or by using vectorization strategies - it will still have to wait for memory!
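For completeness, this is roughly what such a vectorized variant looks like with `System.Numerics.Vector<T>` (one of the "emerging solutions" mentioned above; whether ILNumerics uses it is not claimed here). For large arrays it hits the same memory wall as the scalar loop:

```csharp
using System;
using System.Numerics;

static class VectorAdd
{
    public static void Add(double[] x, double[] y, double[] c)
    {
        int i = 0;
        int w = Vector<double>.Count;        // SIMD lanes per register
        for (; i <= c.Length - w; i += w)
        {
            // Loads w doubles from each input, adds them in one
            // SIMD instruction, stores the result.
            var v = new Vector<double>(x, i) + new Vector<double>(y, i);
            v.CopyTo(c, i);
        }
        for (; i < c.Length; i++)            // scalar remainder
            c[i] = x[i] + y[i];
    }
}
```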
I know this answer somewhat overshoots the question, since most of these strategies are currently not utilized by the mentioned product (yet?). But by following those points, you would eventually end up with much better performance than any naive implementation in a native language.
If you are interested, could you disclose your implementation of L-BFGS? I'll be happy to convert it to ILNumerics and post comparison results, and I am sure other libraries listed here would like to follow.