English 中文(简体)
Calculating variance with large numbers
原标题:

I haven t really used variance calculation that much, and I don t know quite what to expect. Actually I m not too good with math at all.

I have a an array of 1000000 random numeric values in the range 0-10000.

The array could grow even larger, so I use 64 bit int for sum.

I have tried to find code on how to calc variance, but I don t know if I get correct output.

The mean is 4692 and median is 4533. I get variance 1483780.469308 using the following code:

// size is the element count, in this case 1000000
// value_sum is __int64

double p2 = pow( (double)(value_sum - (value_sum/size)), (double)2.0 );
double variance = sqrt( (double)(p2 / (size-1)) );

Am I getting a reasonable value?

Is anything wrong with the calculation?

最佳回答

Note: It doesn t look like you re calculating the variance.

Variance is calculated by subtracting the mean from every element and calculating the weighted sum of these differences.

So what you need to do is:

// Get mean
double mean = static_cast<double>(value_sum)/size;

// Calculate variance
double variance = 0;
for(int i = 0;i<size;++i) 
{
  variance += (MyArray[i]-mean)*(MyArray[i]-mean)/size;
}

// Display
cout<<variance;

Note that this is the sample variance, and is used when the underlying distribution is unknown (so we assume a uniform distribution).

Also, after some digging around, I found that this is not an unbiased estimator. Wolfram Alpha has something to say about this, but as an example, when MATLAB computes the variance, it returns the "bias-corrected sample variance".

The bias-corrected variance can be obtained by dividing by each element by size-1, or:

//Please check that size > 1
variance += (MyArray[i]-mean)*(MyArray[i]-mean)/(size-1); 

Also note that, the value of mean remains the same.

问题回答

First of all, if you re just looking to get a handle on what is a "reasonable" variance, keep in mind that variance is basically standard deviation squared. Standard deviation roughly measures the typical distance from a data point to its expected value.

So if your data has mean 4692, and your calculated variance is coming out to 1483780, that means your standard deviation is about 1218, which would suggest your numbers tend to be somewhere in the vicinity of the range 3474 - 5910. So that variance actually seems a bit low to me if the range of your numbers is 0 - 10000; but it obviously depends on the distribution of your data.

As for the calculation itself: You can calculate the variance using a running calculation as you re reading your data the first time around (you don t have to know the mean in advance) using Welford s Method:

Initialize M1 = x1 and S1 = 0.

For subsequent x s, use the recurrence formulas

Mk = Mk-1+ (xk - Mk-1)/k Sk = Sk-1 + (xk - Mk-1)*(xk - Mk).

For 2 ≤ k ≤ n, the kth estimate of the variance is s2 = Sk/(k - 1).

Just for fun, a slightly different route to the same result, using std::valarray instead of std::vector and (various) algorithms:

template <class T>
T const variance(std::valarray<T> const &v) {
    if (v.size() == 0)
        return T(0.0);
    T average = v.sum() / v.size();
    std::valarray<T> diffs = v-average;
    diffs *= diffs;
    return diffs.sum()/diffs.size();
}

As Jacob hinted, there are really two possible versions of a variance calculation. As it stands, this assumes your inputs are the "universe". If you ve taken only a sample of the overall universe, the last line should use: (diffs.size()-1) instead of diffs.size().

Use a different formula maybe?

#include <functional>
#include <algorithm>
#include <iostream>
int main()
{
 using namespace std;

 vector<double> num( 3 );
 num[ 0 ] = 4000.9, num[ 1 ] = 11111.221, num[ 2 ] = -2;


 double mean = std::accumulate(num.begin(), num.end(), 0.0) / num.size();
 vector<double> diff(num.size());
 std::transform(num.begin(), num.end(), diff.begin(), 
                std::bind2nd(std::minus<double>(), mean));
 double variance = std::inner_product(diff.begin(), diff.end(), 
                                     diff.begin(), 0.0) / (num.size() - 1);
 cout << "mean = " << mean << endl
      << "variance = " << variance << endl;
}

Outputs: mean = 5036.71 variance = 3.16806e+07

Sample Variance calculation:

#include <math.h>
#include <vector>

double Variance(std::vector<double>);

int main()
{
     std::vector<double> samples;
     samples.push_back(2.0);
     samples.push_back(3.0);
     samples.push_back(4.0);
     samples.push_back(5.0);
     samples.push_back(6.0);
     samples.push_back(7.0);

     double variance = Variance(samples);
     return 0;
}

double Variance(std::vector<double> samples)
{
     int size = samples.size();

     double variance = 0;
     double t = samples[0];
     for (int i = 1; i < size; i++)
     {
          t += samples[i];
          double diff = ((i + 1) * samples[i]) - t;
          variance += (diff * diff) / ((i + 1.0) *i);
     }

     return variance / (size - 1);
}

Since you re working with large numbers and then doing floating-point operations on them, you might want to do everything in doubles; that would save you a lot of casts.

Using pow .. 2 to calculate a square seems a bit awkward. You could calculate your number first, then multiply it by itself to get a square.

If you re doing division and feel the need to cast, cast the operands (i.e. the numerator and/or denominator) to double rather than the result. You re losing accuracy if you divide integers.

I m not sure if your formula for variance is correct. You may want to look at the explanation in Wikipedia, for example. But I m no math expert either, so I m not sure you have a mistake.

Since variance is the square of the standard deviation, the answers to SO 1174984 should help out. The short diagnosis is that you need to compute the sum of the squares of the values as well as the sum of the values, and you don t seem to be doing that.

Since you have 106 values, and the square of any value can be up to 108, you could end up with a sum of squares up to 1014; your 64-bit integers can store up to 1018, so you could still handle ten thousand times as many inputs, or values ranging up to one million instead of only ten thousand, without running into overflows. There s no urgent need, therefore, to move to pure double computations.





相关问题
Undefined reference

I m getting this linker error. I know a way around it, but it s bugging me because another part of the project s linking fine and it s designed almost identically. First, I have namespace LCD. Then I ...

C++ Equivalent of Tidy

Is there an equivalent to tidy for HTML code for C++? I have searched on the internet, but I find nothing but C++ wrappers for tidy, etc... I think the keyword tidy is what has me hung up. I am ...

Template Classes in C++ ... a required skill set?

I m new to C++ and am wondering how much time I should invest in learning how to implement template classes. Are they widely used in industry, or is this something I should move through quickly?

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

typedef ing STL wstring

Why is it when i do the following i get errors when relating to with wchar_t? namespace Foo { typedef std::wstring String; } Now i declare all my strings as Foo::String through out the program, ...

C# Marshal / Pinvoke CBitmap?

I cannot figure out how to marshal a C++ CBitmap to a C# Bitmap or Image class. My import looks like this: [DllImport(@"test.dll", CharSet = CharSet.Unicode)] public static extern IntPtr ...

Window iconification status via Xlib

Is it possible to check with the means of pure X11/Xlib only whether the given window is iconified/minimized, and, if it is, how?

热门标签