What is the best way to create fake data for a classification problem?

Question


I am working on a project and I have a subset of a user's keystroke timing data. The user makes n login attempts, and I want to use the timing data recorded from those attempts with various classification algorithms, so that future login attempts can be verified as coming from the user rather than from some other person. (In short, this is biometrics.)

I have 3 different timing measurements from the login attempt process; of course, this is only a subset of an infinite set of possible data.

So far this is an easy classification problem. I decided to use WEKA, but as far as I understand, I need to create some fake data to feed the classification algorithm. The user's measured attempts will be labeled 1, and the fake data will be labeled 0.

Can I use some optimization algorithm for this? Or is there another way to create the fake data so that I get minimal false positives?

thanks

+6

pattern-recognition machine-learning classification weka biometrics

berkay Apr 10 '10 at 0:35


1 answer

dmcer · Accepted Answer · 2010-04-10T05:06:29+0000

There are several ways you could approach this.

Collect negative examples. One simple solution is to collect timing data from other people and use it as negative examples. If you want to collect a large sample very cheaply, say about 1000 samples for $10, you could use a service such as Amazon Mechanical Turk.

That is, you could put together a Human Intelligence Task (HIT) in which people type randomized, password-like sequences. To capture the timing information, you will need to use an External Question, since the restricted HTML allowed in regular questions does not support JavaScript.
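Once negative examples have been collected, they can be combined with the user's own attempts into a labeled file that WEKA can read. The following is only a minimal sketch: it assumes each attempt has already been reduced to a fixed-length list of inter-key delays in milliseconds, and the function name `to_arff`, the attribute names, and the sample values are all made up for illustration.

```python
def to_arff(genuine, impostor, relation="keystroke_timing"):
    """Build ARFF text from two lists of samples.

    Each sample is a list of inter-key delays (ms). Genuine attempts
    get class 1, attempts from other people get class 0.
    """
    n_features = len(genuine[0])
    rows = [(s, 1) for s in genuine] + [(s, 0) for s in impostor]
    if any(len(s) != n_features for s, _ in rows):
        raise ValueError("all samples must have the same number of delays")

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute delay_{i} numeric" for i in range(n_features)]
    lines += ["@attribute class {0,1}", "", "@data"]
    for sample, label in rows:
        values = ",".join(f"{v:g}" for v in sample)
        lines.append(f"{values},{label}")
    return "\n".join(lines) + "\n"


# Illustrative (made-up) delays: two attempts by the user, two by others.
genuine = [[120, 95, 143], [118, 101, 150]]
impostor = [[210, 180, 90], [75, 240, 130]]
arff_text = to_arff(genuine, impostor)
print(arff_text)
```

Writing `arff_text` to a file with an `.arff` extension should give something WEKA can load directly, with the last attribute used as the class.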

Use a generative model. Alternatively, you could train a generative probabilistic model of the user's behavior. For example, you could fit a Gaussian mixture model (GMM) to the user's delays between keystrokes.

Such a model gives you an estimate of how likely it is that a given set of keystroke timings was produced by a specific user. You then just need to set a threshold for how likely the timing information must be for the user to be authenticated.
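To make the generative approach concrete, here is a rough sketch that fits a two-component, one-dimensional GMM to inter-key delays with plain EM and then accepts or rejects an attempt by thresholding its average log-likelihood. Everything specific here is an assumption for illustration: the synthetic delays, the number of components, the helper names, and the threshold rule (lowest training score minus a margin of 2.0). In practice the threshold would have to be tuned on real data.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, iterations=50, min_var=1.0):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    data = sorted(data)
    n = len(data)
    # Start the means at evenly spaced quantiles of the data.
    means = [data[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids collapse
    return weights, means, variances


def avg_log_likelihood(attempt, model):
    """Average per-delay log-likelihood of one attempt under the model."""
    weights, means, variances = model
    total = 0.0
    for x in attempt:
        p = sum(w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances))
        total += math.log(max(p, 1e-300))
    return total / len(attempt)


# Synthetic stand-in for one user's recorded delays (ms), two typing rhythms.
random.seed(0)

def draw_delay():
    if random.random() < 0.6:
        return random.gauss(110, 10)
    return random.gauss(180, 15)

training_attempts = [[draw_delay() for _ in range(4)] for _ in range(30)]
model = fit_gmm_1d([d for attempt in training_attempts for d in attempt])

# Assumed threshold rule: lowest training score minus a safety margin.
threshold = min(avg_log_likelihood(a, model) for a in training_attempts) - 2.0

def accept(attempt):
    return avg_log_likelihood(attempt, model) >= threshold

print(accept([112, 105, 178, 120]))  # delays resembling the training data
print(accept([300, 320, 40, 290]))   # delays far from the training data
```

Raising the threshold trades false accepts for false rejects, which is the knob the question's concern about false positives comes down to.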

Use a one-class SVM. Finally, a one-class SVM lets you train an SVM classifier using only positive examples. To train one-class SVMs in WEKA, use the LibSVM wrapper if you are using v3.6. If you are using the bleeding-edge developer version, there is weka.classifiers.meta.OneClassClassifier.
