Classification: imbalanced data within a class

I am trying to build a classifier that predicts the probability that an input belongs to class 0 or class 1. I use a neural network with TensorFlow + Keras (maybe a CNN later).

The problem is this: the data is heavily imbalanced. There are far more negative examples than positive ones, roughly 90:10. As a result, my neural network almost always outputs very low probabilities for positive examples. Thresholded to binary values, it predicts 0 in most cases.

Accuracy is > 95%, but that is only because the network almost always predicts zero ... so the number of false negatives is very large.

Any recommendations for fixing this issue?

Here are the ideas I've reviewed so far:

  • Penalize false negatives more heavily through the loss function (my first attempt at this failed). Something like class weighting, where positive examples weigh more than negative ones. It is similar to class weights, but applied inside the class. How would you implement this in Keras?

  • Oversample the positive examples by duplicating them, then retrain the neural network so that the positive and negative examples are balanced.
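For the first idea, here is roughly what I have in mind, using the `class_weight` argument that Keras' `model.fit` accepts (the 90:10 label array and the inverse-frequency weighting below are just illustrative):

```python
import numpy as np

# Hypothetical labels with a 90:10 negative/positive split.
y_train = np.array([0] * 90 + [1] * 10)

# Weight each class inversely to its frequency, so mistakes on the
# rare positive class cost more in the loss.
n = len(y_train)
class_weight = {
    0: n / (2 * np.sum(y_train == 0)),  # ≈ 0.56
    1: n / (2 * np.sum(y_train == 1)),  # 5.0
}

# Keras scales each sample's loss term by its class weight:
# model.fit(x_train, y_train, class_weight=class_weight, ...)
```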

Thanks in advance!

+7
python neural-network tensorflow keras multilabel-classification
2 answers

You are on the right track.

Usually you either balance your data set before training, i.e. you reduce the over-represented class, or you generate artificial (augmented) data for the under-represented class to increase its presence.

  • Reducing the over-represented class This is the easier option: you randomly select as many samples as there are in the under-represented class, discard the rest, and train on the new subset. The downside, of course, is that you lose some learning potential, depending on how complex your task is (how many features it has).

  • Augmenting data Depending on the type of data you are working with, you can "augment" it. That just means you take existing samples from your data, modify them slightly, and use them as additional samples. This works very well with image and audio data. You can flip / rotate, scale, add noise, increase / decrease brightness, crop, etc. The important thing is that you stay within what can happen in the real world. If, for example, you want to recognize a 70 mph speed-limit sign, flipping it makes no sense: you will never encounter an actual flipped 70 mph sign. If you want to recognize a flower, flipping or rotating it is fine. The same goes for audio; changing the volume / frequency does not make much difference. But reversing an audio track changes its "meaning", and you won't need to recognize reversed words in the real world.

Augmenting tabular data, such as sales data or metadata, is much harder, since you must be careful not to bake your own assumptions into the model.
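As a minimal illustration of the image transformations mentioned above (flip, noise, brightness), sketched in plain numpy on a made-up image array:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_image(img):
    """Return a few simple augmented variants of one (H, W) image array."""
    flipped = np.fliplr(img)                      # horizontal flip
    noisy = img + rng.normal(0, 0.01, img.shape)  # mild Gaussian noise
    brighter = np.clip(img * 1.1, 0.0, 1.0)       # brightness increase
    return [flipped, noisy, brighter]

img = rng.random((8, 8))       # hypothetical 8x8 grayscale image
augmented = augment_image(img)
```

Each variant can then be added to the training set alongside the original sample.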

+1

I think your two suggestions are already good. Of course, you can also just undersample the negative class.

import numpy as np

def balance_occurences(dataframe, zielspalte, faktor=1):
    # Find the least frequent class and how many samples it has.
    least_frequent_observation = dataframe[zielspalte].value_counts().idxmin()
    bottleneck = len(dataframe[dataframe[zielspalte] == least_frequent_observation])
    # Keep every sample of the rarest class.
    balanced_indices = dataframe.index[dataframe[zielspalte] == least_frequent_observation].tolist()
    # From every other class, randomly draw bottleneck * faktor samples.
    for value in set(dataframe[zielspalte]) - {least_frequent_observation}:
        full_list = dataframe.index[dataframe[zielspalte] == value].tolist()
        selection = np.random.choice(a=full_list, size=bottleneck * faktor, replace=False)
        balanced_indices = np.append(balanced_indices, selection)
    df_balanced = dataframe[dataframe.index.isin(balanced_indices)]
    return df_balanced

The loss function could also take the recall of the positive class into account, in conjunction with some other metrics.
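To make that concrete: one common way to penalize false negatives more is a positively weighted cross-entropy (TensorFlow ships this as `tf.nn.weighted_cross_entropy_with_logits`). A plain numpy sketch, with `pos_weight` chosen illustratively to match the 90:10 imbalance:

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=9.0, eps=1e-7):
    # Standard binary cross-entropy, except the positive-class term is
    # multiplied by pos_weight, so a missed positive (false negative)
    # costs pos_weight times more than the symmetric false positive.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()

# A confident miss on a positive vs. the same-sized miss on a negative:
fn_loss = weighted_bce(np.array([1.0]), np.array([0.1]))
fp_loss = weighted_bce(np.array([0.0]), np.array([0.9]))
```

With `pos_weight=9.0`, `fn_loss` comes out exactly nine times `fp_loss`, which is the asymmetry the question is after.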

0
