Statistics versus machine learning and which Java APIs to use

I am writing a piece of software that analyzes a dataset of timestamped events and should be able to predict (or extrapolate) when the next event will happen and what that event will be.

Take the hiring process as an example: events occur at specific times. At t0 the applicant submits an application, at t1 an HR manager looks at the form and performs basic screening, at t2 the file is forwarded to the technical reviewer, and so on, until the applicant is hired or rejected.

I have a good dataset for several applicants; the time and event samples look like this: applicant, date application submitted, date application reviewed by HR, date application viewed by a technical reviewer, and so on.

For a new applicant, I want the software to show when the next event will most likely happen.

I have evaluated several alternatives: machine learning algorithms are powerful but may be overkill here; statistical methods such as extrapolation seem relevant, but the human factor in the process (human delay) makes them tricky. So I'm not sure which direction to pursue and which libraries are appropriate.

Apache Commons Math seems like a good place to start for extrapolation.
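For concreteness, here is a minimal sketch of what extrapolation with Commons Math (assuming version 3.x) could look like. The stage indices and durations are made up, and a plain linear fit over stage index is almost certainly too crude for real data; it only shows the API shape:

```java
import org.apache.commons.math3.stat.regression.SimpleRegression;

public class StageExtrapolation {
    public static void main(String[] args) {
        // Hypothetical data: x = stage index, y = days elapsed since application
        SimpleRegression regression = new SimpleRegression();
        regression.addData(0, 0.0);   // application submitted
        regression.addData(1, 2.0);   // HR review after 2 days
        regression.addData(2, 7.0);   // technical review after 7 days

        // Extrapolate the elapsed time expected at the next stage
        double predictedDays = regression.predict(3);
        System.out.printf("Predicted elapsed days at stage 3: %.1f%n", predictedDays);
    }
}
```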

Any ideas?

2 answers

I'm not sure I follow the use of Monte Carlo in this context. Fundamentally, this is a problem of learning distributions. Once a distribution has been learned, Monte Carlo can be useful for answering certain questions that take the distribution as input. But if the question is just "what is the distribution of the next event for this applicant?", then the answer is exactly what has been learned, so no Monte Carlo is required.

I think you need to get your hands dirty and start analyzing your data. For example, for each of the times t_n involved, I would compute the mean. Then I would compute the covariance matrix. This matters: are the times mostly independent (which I think mikera assumes in his answer)? Is there a positive or negative correlation between successive steps? Between steps that are further apart? And so on. If they are mostly independent, your life is relatively simple: you can estimate each distribution separately, using either parametric or nonparametric methods.
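A minimal sketch of that first analysis pass, using the `StatUtils` and `Covariance` classes from Apache Commons Math 3.x; the inter-event times below are invented for illustration:

```java
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.StatUtils;
import org.apache.commons.math3.stat.correlation.Covariance;

public class InterEventStats {
    public static void main(String[] args) {
        // Hypothetical data: each row is one applicant, each column one
        // inter-event time in days (submit->HR review, HR->technical, ...)
        double[][] gaps = {
            {2.0, 5.0, 3.0},
            {1.0, 7.0, 2.5},
            {3.0, 4.0, 4.0},
            {2.5, 6.0, 3.5}
        };

        // Mean of each inter-event time (one column per step)
        for (int col = 0; col < gaps[0].length; col++) {
            double[] column = new double[gaps.length];
            for (int row = 0; row < gaps.length; row++) {
                column[row] = gaps[row][col];
            }
            System.out.printf("Mean gap %d: %.2f days%n", col, StatUtils.mean(column));
        }

        // Covariance matrix: off-diagonal entries near zero suggest the
        // steps are roughly independent
        RealMatrix cov = new Covariance(gaps).getCovarianceMatrix();
        System.out.println("Covariance matrix: " + cov);
    }
}
```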

If you do find important dependencies between the different distributions... well, then life is complicated. I won't go into everything you could try, because it depends on the details and there are too many cases to cover. I would probably use a 2D (adaptive) kernel density estimate in combination with Bayesian networks.
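For illustration only, here is a hand-rolled, fixed-bandwidth 1D Gaussian kernel density estimator, a much simpler cousin of the 2D adaptive estimator mentioned above; the sample delays and the bandwidth are made up:

```java
// Minimal fixed-bandwidth 1D Gaussian kernel density estimator.
public class KernelDensity {
    private final double[] samples;
    private final double bandwidth;

    KernelDensity(double[] samples, double bandwidth) {
        this.samples = samples.clone();
        this.bandwidth = bandwidth;
    }

    // Estimated density at x: average of Gaussian kernels centered on samples
    double density(double x) {
        double sum = 0.0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
        }
        return sum / (samples.length * bandwidth);
    }

    public static void main(String[] args) {
        // Hypothetical observed HR review delays in days
        double[] delays = {1.0, 2.0, 2.5, 3.0, 5.0, 2.0, 1.5};
        KernelDensity kde = new KernelDensity(delays, 0.8);
        System.out.printf("Estimated density at 2 days: %.3f%n", kde.density(2.0));
    }
}
```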

The main thing is that I would work with the data first and understand it better. Then think about a few possible algorithms (perhaps by asking here again once you have those details; I'm happy to elaborate on anything I wrote). Only then think about libraries.


Your best bet for this kind of thing would be a standard Monte Carlo simulation.

The main approach:

  • Use your dataset to build a distribution of the time between each pair of consecutive events. You can either fit a statistical distribution (truncated normal?) or simply treat the dataset as a large pool of observed times to sample from at random.
  • Simulate new applicants by randomly drawing inter-event times from the distributions above.
  • Now you can run whatever analysis you like, e.g. determine what percentage of applications are processed within 3 days, or answer questions such as "how much faster would step 3 need to be if we want applicants to be contacted within 2 days in 95% of cases?" (a minimal sketch follows below).
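A sketch of the resampling variant in plain Java; the observed gaps, the 3-day threshold, and the trial count are all invented for illustration:

```java
import java.util.Random;

public class HiringMonteCarlo {
    public static void main(String[] args) {
        // Hypothetical observed gaps (days) per stage from the historical
        // dataset: row 0 = submit->HR review, row 1 = HR->technical review
        double[][] observedGaps = {
            {0.5, 1.0, 2.0, 1.5, 1.0},
            {1.0, 2.0, 3.0, 1.5, 2.5}
        };

        Random rng = new Random(42);
        int trials = 100_000;
        int withinThreeDays = 0;

        for (int t = 0; t < trials; t++) {
            double total = 0.0;
            // Simulate one applicant by resampling a gap for each stage
            for (double[] stage : observedGaps) {
                total += stage[rng.nextInt(stage.length)];
            }
            if (total <= 3.0) {
                withinThreeDays++;
            }
        }

        System.out.printf("Estimated %% processed within 3 days: %.1f%%%n",
                100.0 * withinThreeDays / trials);
    }
}
```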
