Choosing a classification algorithm for a combination of nominal and numeric data?

I have a dataset of about 100,000 customer purchase records. The dataset contains:

  • Age (a continuous value from 2 to 120), though I may also bin it into age ranges.
  • Gender (0 or 1)
  • Address (only six possible types, which I could encode as the numbers 1 to 6)
  • Preferred store (one of 7 stores), which is the class label I want to predict.

So my problem is to predict a customer's preferred store from their age, gender, and location. I tried naive Bayes and decision trees, but their classification accuracy is somewhat low.

I am also thinking about logistic regression, but I'm not sure how it handles discrete values like gender and address. I have also considered an SVM with some kernel trick, but have not tested it yet.

So what kind of machine learning algorithm would you suggest for better accuracy with these features?

+4
2 answers

The problem is that you are representing nominal variables on a continuous scale, which imposes (false) ordinal relations between categories when you apply machine learning methods. For example, if you encode the address as one of six possible integers, then address 1 is closer to address 2 than to addresses 3, 4, 5, and 6. This will cause problems when the algorithm tries to learn anything from it.

Instead, translate the 6-valued categorical variable into six binary variables, one for each category. Your original feature then becomes six features, of which exactly one is 1 for any given record (one-hot encoding). Also, keep age as an integer value, since you lose information by making it categorical.
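For example, a minimal one-hot encoding sketch with scikit-learn (the column layout and values below are made up for illustration; sparse_output needs scikit-learn ≥ 1.2, older versions use sparse=False):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Made-up sample: each row is [age, gender, address], with address in {1..6}.
    X = np.array([
        [34, 0, 2],
        [58, 1, 5],
        [21, 0, 2],
    ])

    # One-hot encode only the address column; leave age and gender as they are.
    encoder = OneHotEncoder(sparse_output=False, categories=[[1, 2, 3, 4, 5, 6]])
    address_onehot = encoder.fit_transform(X[:, [2]])

    # Final feature matrix: age, gender, then six 0/1 address indicators.
    X_encoded = np.hstack([X[:, :2], address_onehot])
    print(X_encoded)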

As for the choice of algorithm, it is unlikely to matter much (at least initially). Go with whatever is easiest for you to implement. However, before evaluating on your test set, make sure you do any parameter selection via cross-validation on a development set, since all of these algorithms have parameters that can significantly affect accuracy.
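A rough sketch of what that might look like with scikit-learn's GridSearchCV (the feature matrix, labels, and parameter grid below are placeholders, not recommendations for your data):

    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data: 1000 customers, 8 encoded features, 7 possible stores.
    rng = np.random.default_rng(0)
    X_encoded = rng.random((1000, 8))
    y = rng.integers(0, 7, size=1000)

    # Hold out a test set; tune parameters by 5-fold cross-validation on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X_encoded, y, test_size=0.2, random_state=0
    )
    param_grid = {"max_depth": [3, 5, 10, None]}
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(X_train, y_train)

    print("Best parameters:", search.best_params_)
    print("Held-out test accuracy:", search.score(X_test, y_test))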

+10

You really need to look at the data and determine whether your features separate the labels well enough. Since there are so few features but a lot of data, something like kNN may work well.
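A minimal kNN sketch along those lines, assuming scikit-learn and a one-hot-encoded feature matrix as in the first answer (the data below is a placeholder); scaling matters here because age is on a much larger scale than the binary features:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder data: 1000 customers, 8 encoded features, 7 possible stores.
    rng = np.random.default_rng(0)
    X_encoded = rng.random((1000, 8))
    y = rng.integers(0, 7, size=1000)

    # Scale features so age does not dominate the distance, then classify each
    # customer by the majority store among its k nearest neighbours.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
    scores = cross_val_score(knn, X_encoded, y, cv=5)
    print("Cross-validated accuracy: %.3f" % scores.mean())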

You could also adapt collaborative filtering to this problem, since it works well with features like these.

0
