Is there any chance of getting a negative value for information gain? The calculation is done according to the formula in the following paper. I can't write out the formula here because it contains some difficult notation.
Thank you!
No, the information gain is always non-negative:

IG(Y|X) = H(Y) - H(Y|X) >= 0

since H(Y) >= H(Y|X). The worst case is that X and Y are independent, thus H(Y|X) = H(Y).

Another way to see it: by observing the random variable X take some value, we either gain some information about Y or none at all (you never lose any).
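To make this concrete, here is a minimal sketch in Python (my own addition, not from the paper in question; the joint distribution `joint` is made up purely for illustration) that computes H(Y), H(Y|X), and the gain from a joint probability table:

    import math

    def entropy(probs):
        """Shannon entropy in bits; zero-probability terms are skipped."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical joint distribution P(X,Y): rows are values of X,
    # columns are values of Y.
    joint = [[0.3, 0.1],   # P(X=0, Y=0), P(X=0, Y=1)
             [0.2, 0.4]]   # P(X=1, Y=0), P(X=1, Y=1)

    p_x = [sum(row) for row in joint]        # marginal P(X)
    p_y = [sum(col) for col in zip(*joint)]  # marginal P(Y)

    h_y = entropy(p_y)

    # H(Y|X) = sum over x of P(X=x) * H(Y | X=x)
    h_y_given_x = sum(px * entropy([pxy / px for pxy in row])
                      for px, row in zip(p_x, joint) if px > 0)

    print(h_y - h_y_given_x)  # IG = H(Y) - H(Y|X); about 0.1245 here, never < 0

If you replace `joint` with any distribution where X and Y are independent (e.g. all four entries equal to 0.25), the printed gain drops to exactly 0, matching the worst case above.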
Edit
Let me explain information gain in terms of decision trees (which is actually what I had in mind in the first place, since I come from a machine learning background).
Consider a classification problem where we are given a set of instances and labels (discrete classes).
The idea behind choosing which feature to split on at each node of the tree is to select the feature that divides the class attribute into the two purest possible groups of instances.

The entropy after the split is the entropy of each branch, weighted by the number of instances under that branch.

This is best illustrated with an example. Take a binary classification problem where, at a certain node, we have 5 positive instances and 4 negative (9 in total). The entropy (before the split) is therefore:
H([4,5]) = -4/9*log2(4/9) - 5/9*log2(5/9) = 0.99107606
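If you want to reproduce this number, here is a small Python sketch (again my own addition; the helper works on raw class counts such as [4, 5]):

    import math

    def entropy(counts):
        """Entropy in bits of a node given its class counts, e.g. [4, 5]."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    print(entropy([4, 5]))  # 0.99107606...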
Now consider some possible splits. The best case scenario is that the current feature splits the instances perfectly (i.e., one branch gets all the positive instances, the other all the negative ones):

        [4+,5-]
         /   \       H([4,0],[0,5]) = 4/9*(-4/4*log2(4/4)) + 5/9*(-5/5*log2(5/5))
        /     \                    = 0   // zero entropy, perfect split
    [4+,0-] [0+,5-]
then
IG = H([4,5]) - H([4,0],[0,5]) = H([4,5])   // maximum possible in this case
Now imagine a second feature for which the worst case scenario holds: one branch gets no instances at all, and all of the instances go down the other branch (this happens, for example, if the feature is constant across the instances, and thus useless):
        [4+,5-]
         /   \       H([4,5],[0,0]) = 9/9*H([4,5]) + 0
        /     \                    = H([4,5])   // same entropy as before the split
    [4+,5-] [0+,0-]
and
IG = H([4,5]) - H([4,5],[0,0]) = 0   // minimum possible in this case
Any other split will fall somewhere between these two cases, as in:
        [4+,5-]
         /   \       H([3,2],[1,3]) = 5/9*(-3/5*log2(3/5) - 2/5*log2(2/5))
        /     \                    + 4/9*(-1/4*log2(1/4) - 3/4*log2(3/4))
    [3+,2-] [1+,3-]
and
IG = H([4,5]) - H([3,2],[1,3]) = [...] = 0.09109
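Putting the three splits side by side in one sketch (my own Python again, reusing the count-based `entropy` helper from above):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def info_gain(parent, branches):
        """Entropy before the split minus the weighted entropy after it."""
        n = sum(parent)
        after = sum(sum(b) / n * entropy(b) for b in branches)
        return entropy(parent) - after

    print(info_gain([4, 5], [[4, 0], [0, 5]]))  # 0.99107606...  perfect split
    print(info_gain([4, 5], [[4, 5], [0, 0]]))  # 0.0            useless split
    print(info_gain([4, 5], [[3, 2], [1, 3]]))  # 0.09109...     somewhere in between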
So no matter how you split those 9 instances, you always get a non-negative gain in information. I realize this is not a mathematical proof (for that, head over to MathOverflow!), but I thought an actual example might help.
(Note: all calculations done with the Google calculator.)