How to work with a feature vector of variable length?

Suppose you are trying to classify houses based on certain features:

  • Total area
  • Number of rooms
  • Garage area

But not all houses have garages. When they do, their garage area is a very discriminative feature. What is a good approach to use the information contained in this feature?

+8
machine-learning
3 answers

You can include a zero/one dummy variable indicating whether there is a garage, as well as the product (interaction term) of the garage area with that dummy (for houses without a garage, set the area to zero).
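A minimal sketch of this encoding, assuming a missing garage is represented as `None` in the raw data. The function name `encode_house` and the sample values are hypothetical and only serve as illustration:

```python
from typing import List, Optional


def encode_house(total_area: float, num_rooms: int,
                 garage_area: Optional[float]) -> List[float]:
    """Return [total_area, num_rooms, has_garage, has_garage * garage_area]."""
    # Dummy variable: 1.0 if the house has a garage, 0.0 otherwise.
    has_garage = 0.0 if garage_area is None else 1.0
    # Interaction term: the garage area only counts when a garage exists,
    # so houses without a garage get 0.0 here.
    area_term = has_garage * (garage_area or 0.0)
    return [float(total_area), float(num_rooms), has_garage, area_term]


# Hypothetical houses: (total area, number of rooms, garage area or None).
houses = [(100, 2, None), (300, 2, 5.0), (125, 1, 1.5)]
features = [encode_house(*h) for h in houses]
```

Every house now maps to a vector of the same length, and the dummy lets the model tell "no garage" apart from "very small garage".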

+5

The best approach is to build your data set with all the features; in most cases it is just fine to fill the values that are not available with zeros.

Using your example, this will be something like:

Total area   Number of rooms   Garage area
100          2                 0
300          2                 5
125          1                 1.5

Often, the learning algorithm you have chosen will be powerful enough to use these zeros to classify such records correctly. After all, the absence of a value is itself information for the algorithm. This can become a problem if your data is skewed, but in that case you need to address the skew anyway.

EDIT:

I just realized that there is another answer with a comment in which you say you are wary of using zeros because they could be confused with small garages. Although I still do not see a problem with this (there should be enough difference between a small garage and zero), you can keep the same structure and mark the absence of a garage with a negative number (say, -1).
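As a rough illustration of the two fill strategies, here is a sketch that assumes missing garages are stored as `None`. The helper `fill_missing` and the sample values are hypothetical:

```python
from typing import List, Optional


def fill_missing(values: List[Optional[float]], fill_value: float) -> List[float]:
    """Replace None entries with fill_value and leave real values untouched."""
    return [fill_value if v is None else float(v) for v in values]


# Hypothetical garage areas; None means the house has no garage.
garage_area = [None, 5.0, 1.5, 0.2]

zero_filled = fill_missing(garage_area, 0.0)       # missing -> 0
sentinel_filled = fill_missing(garage_area, -1.0)  # missing -> -1

# With the -1 sentinel, a single threshold at 0 separates "no garage"
# from every real garage, however small.
no_garage = [v < 0 for v in sentinel_filled]
```

A negative sentinel tends to suit tree-based models, which can split on it directly. For a linear model, -1 would be read as "slightly less garage than 0", which may not be what you want.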

The solution given in the other answer is also perfectly plausible: an additional feature indicating whether the house has a garage or not will work fine (especially with decision-tree-based algorithms). I just prefer to keep the dimensionality of the data as low as possible, but in the end this is more a matter of preference than a technical decision.

+1

You want to include a zero-indicator feature. That is, a feature that is 1 when the size of the garage is 0 and 0 for any other value.

Your feature vector will be: area | num_rooms | garage_size | garage_exists

Now your machine learning algorithm will be able to see this (non-linear) feature of the garage size.
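To see why the indicator adds something non-linear, consider this small sketch of a linear score over the garage-related features only. The weights are hypothetical values picked for illustration, not fitted to any data:

```python
def zero_indicator(garage_size: float) -> float:
    """1.0 when the garage size is exactly 0, and 0.0 for any other value."""
    return 1.0 if garage_size == 0 else 0.0


def linear_score(garage_size: float, w_size: float, w_indicator: float) -> float:
    """Linear model restricted to the two garage-related features."""
    return w_size * garage_size + w_indicator * zero_indicator(garage_size)


# Hypothetical weights, chosen only to make the jump at zero visible.
W_SIZE, W_INDICATOR = 2.0, -3.0

score_no_garage = linear_score(0.0, W_SIZE, W_INDICATOR)
score_small_garage = linear_score(0.5, W_SIZE, W_INDICATOR)
```

Without the indicator, the score would have to pass smoothly through 0 as the garage size shrinks. With it, the model can assign houses without a garage a score that is unrelated to the trend for small garages.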

0
