Aggregating automatically-generated feature vectors

I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have 5 features to consider; it is basically a set of rules:

A  B  C  D  E  Result
1  2  b  5  3  X
1  2  c  5  4  X
1  2  e  5  2  X

We take a subject and get its values for A-E, then try matching the rules in sequence; the first rule that matches determines the result.

C is a discrete value, which could be any of a-e. The rest are just integers.
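
For reference, here is a minimal sketch of the first-match lookup described above (the Subject and Rule types are invented for illustration and only handle exact values, not the ranges and wildcards discussed below):

    import java.util.List;
    import java.util.Optional;

    // Illustrative types only; the real system's representation is not shown here.
    record Subject(int a, int b, char c, int d, int e) {}

    record Rule(int a, int b, char c, int d, int e, String result) {
        boolean matches(Subject s) {
            return s.a() == a && s.b() == b && s.c() == c && s.d() == d && s.e() == e;
        }
    }

    class Classifier {
        // Walk the rules in order and return the result of the first rule that matches.
        static Optional<String> classify(List<Rule> rules, Subject s) {
            for (Rule r : rules) {
                if (r.matches(s)) {
                    return Optional.of(r.result());
                }
            }
            return Optional.empty();
        }

        public static void main(String[] args) {
            // Two rules from the table above.
            List<Rule> rules = List.of(
                new Rule(1, 2, 'b', 5, 3, "X"),
                new Rule(1, 2, 'c', 5, 4, "X"));
            System.out.println(classify(rules, new Subject(1, 2, 'c', 5, 4))); // Optional[X]
        }
    }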

The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.

result("X") if $A >= 1 && $A <= 10 && $C eq  A ;

As you can see, the old rules often do not even use some features, or accept ranges. Some are more annoying:

result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);

The ruleset needs to be much smaller, as it has to be human-maintained, so I'd like to shrink the rule sets so that the first example would become:

A  B  C    D  E    Result
1  2  bce  5  2-4  X

The upshot is that we can split the ruleset by the Result column and shrink each part independently. However, I cannot think of an easy way to identify and shrink down the ruleset. I've tried clustering algorithms, but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:

A  B  C   Result
1  2  a   X
1  2  b   X
(repeat a few hundred times)
2  4  a   X  
2  4  b   X
(ditto)

In an ideal world, this would be two rules:

A  B  C  Result
1  2  *  X
2  4  *  X

That is, not only would the algorithm identify the relationship between A and B, it would also deduce that C is noise (not important for the rule).

Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.

Best answer

Check out the Weka machine learning lib for Java. The API is a little bit crufty but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise). You could try a decision tree, such as J48, as these are usually easy to visualize/interpret.
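
Assuming the rules (or a large sample of subjects labelled by the old system) have been exported to an ARFF file, training and printing a J48 tree looks roughly like this; "rules.arff" and the pruning options are placeholders you would need to adjust:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RuleTree {
        public static void main(String[] args) throws Exception {
            // Placeholder file: one instance per rule (or per labelled subject),
            // attributes A-E plus Result as the nominal class attribute.
            Instances data = new DataSource("rules.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // Result is the last column

            J48 tree = new J48();
            // Lower confidence factor and larger minimum leaf size prune harder,
            // which gives a smaller, more maintainable tree.
            tree.setOptions(new String[] { "-C", "0.1", "-M", "10" });
            tree.buildClassifier(data);

            // The printed tree is the compact, human-readable ruleset.
            System.out.println(tree);
        }
    }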

Other answers

Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in practical time? If you can, you could begin by separating the rules into groups by result.

Then, for each result, do the following. Considering each feature as a dimension, and the allowed values for a feature as the metric along that dimension, construct a huge Karnaugh map representing the entire rule set.

The map has two uses. One: research automated methods for the Quine-McCluskey algorithm. A lot of work has been done in this area. There are even a few programs available, although probably none of them will deal with a Karnaugh map of the size you're going to make.

Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
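
A hedged sketch of that equivalence check, treating each ruleset as a black-box lookup function and using made-up feature ranges (only feasible if the real ranges are small enough to enumerate):

    import java.util.Objects;
    import java.util.function.Function;

    class EquivalenceCheck {
        // Enumerate every feature combination and confirm the original and reduced
        // rulesets classify it identically (null meaning "no rule matched").
        static boolean equivalent(Function<int[], String> original, Function<int[], String> reduced) {
            for (int a = 0; a <= 10; a++)
                for (int b = 0; b <= 10; b++)
                    for (int c = 0; c < 5; c++)           // the discrete values a-e encoded as 0-4
                        for (int d = 0; d <= 10; d++)
                            for (int e = 0; e <= 10; e++) {
                                int[] subject = { a, b, c, d, e };
                                if (!Objects.equals(original.apply(subject), reduced.apply(subject))) {
                                    return false;         // the reduced ruleset disagrees somewhere
                                }
                            }
            return true;
        }

        public static void main(String[] args) {
            // Trivial demo: comparing a lookup against itself always succeeds.
            Function<int[], String> f = s -> (s[0] == 1 && s[1] == 2) ? "X" : null;
            System.out.println(equivalent(f, f)); // true
        }
    }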

-Al.

You could try a neural network approach, trained via backpropagation, assuming you have, or can randomly generate (based on the old ruleset), a large set of data that hits all your classes. Using a hidden layer of appropriate size will allow you to approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but due to the training paradigm it should have no issue with your discrete inputs.

This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, since this is a one-off process, you can get an arbitrary degree of confidence by checking against a gargantuan validation set).
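
One concrete way to try this is Weka's MultilayerPerceptron (an assumed choice; this answer does not name a library). A minimal sketch, with the hidden-layer size and epoch count as guesses to tune:

    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RuleNet {
        public static void main(String[] args) throws Exception {
            // Same placeholder ARFF layout as the decision-tree example above.
            Instances data = new DataSource("rules.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            MultilayerPerceptron net = new MultilayerPerceptron();
            net.setHiddenLayers("10");   // one hidden layer of 10 units
            net.setTrainingTime(500);    // epochs of backpropagation
            net.buildClassifier(data);

            // The learned weights are not human-readable, so treat this as a
            // black-box classifier or a cross-check, not the maintained ruleset.
            System.out.println(net);
        }
    }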




