Neural network with categorical variables (enum) as input

I am trying to solve some problems of machine learning using neural networks, mainly with the evolution of NEAT (NeuroEvolution of Augmented Topologies).

Some of my input variables are continuous, but some of them are categorical, for example:

  • Species: {Lion, Leopard, Tiger, Jaguar}
  • Affiliates: {Health, Insurance, Finance, IT, Advertising}

At first I wanted to simulate such a variable by comparing categories with discrete numbers, for example:

{Leo: 1, Leopard: 2, Tiger: 3, Jaguar: 4}

But I'm afraid that adds some kind of arbitrary topology to the variable. A tiger is not the sum of a lion and a leopard.

What approaches to this problem are usually used?

+7
enums neural-network
source share
1 answer

Unfortunately, there is no good solution, each of which leads to some problems:

  • Your solution adds topology as you mentioned; this may not be so bad, since NN can correspond to arbitrary functions and represent "ifs", but in many cases it will (since NN often fall into some local minima).
  • You can encode your data in the form is_categorical_feature_i_equal_j , which will not lead to any additional topology, but will increase the number of functions exponentially. Thus, with the help of "views" you get the functions "is_lion", "is_leopard", etc., And only one of them is equal to 1 at a time
  • in the case of a large amount of data compared to possible categorical values โ€‹โ€‹(for example, you have 10,000 data points and only 10 possible categorical values), you can also divide the problem into 10 independent ones, each of which is trained on one specific one (therefore we have a โ€œneural network for lions "," neural network for jaguars ", etc.).

These two first approaches relate to โ€œextremeโ€ cases - one of them is very inexpensive, but can lead to high bias, while semantics are complex, but should not affect the classification process. The latter is rarely used (due to the assumption of a small number of categorical meanings), but is quite reasonable from the point of view of machine learning.

+13
source share

All Articles