Question

I am currently struggling with a machine learning problem in which I have to deal with highly unbalanced data sets. There are six classes (1, 2, ..., 6). Unfortunately, class 1 has 150 examples/instances, class 2 has 90 and class 3 has only 20. All other classes can't be "trained" since there are no instances available for them.

So far, I have figured out that WEKA (the machine learning toolkit I am using) provides a supervised "Resample" filter. When I apply this filter with noReplacement=false and biasToUniformClass=1.0, the result is a data set in which the number of instances is almost equal for classes 1 to 3 (the other classes stay empty).

My question is now: how do WEKA and this filter generate "new"/additional instances for the different classes?

Thank you very much in advance for any hints or suggestions.

Cheers Julian

Answer 1

It doesn't. It's resampling existing instances. If you have one class-2 instance and ask for a resampling with a bias of 1.0, you can expect N copies of that instance and N instances of each other class for which there is already data.
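To make this concrete, here is a minimal sketch in plain Python of sampling with replacement towards a uniform class distribution, using the class sizes from the question. It is not WEKA's code: resample_uniform and its parameters are hypothetical names, and the per-class share formula is only my assumption about how a bias parameter of this kind could be applied.

```python
import random
from collections import Counter

def resample_uniform(labels, bias=1.0, sample_size=None, seed=1):
    """Draw instance indices with replacement (hypothetical helper, not WEKA's code)."""
    rng = random.Random(seed)
    # Group instance indices by class label; classes with no instances never appear.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    total = len(labels)
    sample_size = total if sample_size is None else sample_size
    n_classes = len(by_class)
    chosen = []
    for label, members in by_class.items():
        # Assumed rule: blend the original class share with a uniform share.
        share = (1 - bias) * len(members) / total + bias / n_classes
        n_draw = int(share * sample_size)
        # Only existing instances are picked, so small classes get duplicates.
        chosen.extend(rng.choice(members) for _ in range(n_draw))
    return chosen

labels = [1] * 150 + [2] * 90 + [3] * 20
picked = resample_uniform(labels)
counts = Counter(labels[i] for i in picked)
unique_class3 = {i for i in picked if labels[i] == 3}
print(counts)
print(len(unique_class3))
```

With these numbers each of the three populated classes ends up with 86 instances, but the class-3 part of the result can contain at most the 20 distinct originals, so most of its rows are repeats.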

Answer 2

WEKA's supervised Resample filter does add instances to a class. This is realized simply by adding instances from the class that has only a few instances to the result data set multiple times.

The resulting data set is therefore strongly biased towards the few available samples of the under-represented class, since each of them appears many times.

Answer 3

Try the SMOTE filter in the Preprocess tab.

It balances your dataset by generating new, synthetic data for the minority class.
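For intuition, the basic SMOTE idea is to create a synthetic point somewhere on the line between a minority instance and one of its nearest minority neighbours. The sketch below shows that idea in plain Python for numeric attributes only. It is not WEKA's implementation: smote_sketch, its parameters and the toy points are made up for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=1):
    """Generate n_new synthetic points by interpolation (illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # A random position on the segment between base and the chosen neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class with two numeric attributes (made-up values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Unlike resampling, the generated points are generally not copies of existing instances, although they always lie between existing minority instances.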
