What is the right way to prepare a dataset for machine learning?

First of all, thanks for reading this post.

I'm fairly new when it comes to machine learning, and I'm trying to use ML to classify some data. I have done some basic reading on supervised and unsupervised learning algorithms such as decision trees, clustering, neural networks, etc.

What I'm trying to understand is the correct general procedure for preparing datasets for an ML problem.

How do I prepare a dataset for ML so that I can measure the accuracy of the algorithms?

My current understanding is that, in order to evaluate accuracy, the algorithm must be given pre-labeled results (from a significant subset of the dataset?) so that the difference between the expected results and the algorithm's output can be measured. Is that right?
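
For example, is something like the following sketch the right idea? (I'm using scikit-learn's bundled iris data here only as a stand-in for my own dataset; the classifier choice is arbitrary.)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Pre-labeled example data standing in for my real dataset.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the labeled data for evaluation, train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)
predictions = clf.predict(X_test)

# Fraction of held-out examples the classifier got right.
print(accuracy_score(y_test, predictions))
```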

If that is correct, how do I pre-label a large dataset? Mine is quite large, and labeling it by hand is not feasible.

In addition, any tips on getting started with machine learning in Python would be greatly appreciated!

Thank you in advance for your help!

Yours faithfully,

Mike

1 answer

This is the most important part of any machine learning project. You need to build your dataset and extract, engineer, scale, and normalize features.

If you want to use a supervised learning algorithm, you need labeled data. There are several ways to get it:

  • Label it by hand.
  • Use an unsupervised learning algorithm to label the data (see the sketch after this list).
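
A rough sketch of the second approach, assuming your examples can be represented as numeric feature vectors: cluster them with scikit-learn's KMeans and treat the cluster assignments as provisional labels that you then spot-check by hand. The array shapes and cluster count below are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for your real data: one row of numeric features per example.
X = np.random.rand(1000, 10)

# Group the unlabeled examples into clusters.
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
provisional_labels = kmeans.labels_   # one cluster id per row

# Spot-check a few examples from each cluster by hand
# before trusting the cluster ids as class labels.
for cluster_id in range(3):
    sample_rows = np.where(provisional_labels == cluster_id)[0][:5]
    print(cluster_id, sample_rows)
```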

You should use a Python machine learning toolkit, for example scikit-learn. scikit-learn contains many useful tools for data processing, feature extraction, and preprocessing. For example, it can vectorize your data with DictVectorizer. You can impute missing values and scale and normalize features using scikit-learn alone.
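
As a rough illustration of those preprocessing steps (the record fields below are invented, and SimpleImputer assumes scikit-learn 0.20 or later), it might look something like this:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy records with a categorical field and a missing numeric value.
records = [
    {"city": "London", "temperature": 12.0},
    {"city": "Paris",  "temperature": np.nan},
    {"city": "Berlin", "temperature": 7.5},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)                        # one-hot encodes "city"

X = SimpleImputer(strategy="mean").fit_transform(X)   # fill the missing value
X = StandardScaler().fit_transform(X)                 # zero mean, unit variance

print(X)
```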

I recommend starting with the examples here: http://scikit-learn.org/stable/
