Weka: ReplaceMissingValues for test file

I'm a little worried about using Weka's ReplaceMissingValues to impute missing values in the test ARFF dataset only, without reference to the training set. This is the command line:

 java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -c last -i "test_file_with_missing_values.arff" -o "test_file_with_filled_missing_values.arff" 

From a previous post (Replace missing values with mean (Weka)), I learned that Weka's ReplaceMissingValues simply replaces each missing value with the mean of the corresponding attribute (or the mode, for nominal attributes). This means that the mean must be calculated for each attribute. Calculating it is fine for a training file, but it is not appropriate for a test file.

The reason is that in a typical test scenario we should not assume that we know the mean of a test attribute when imputing missing values. We usually have a single test record with several attributes to classify, not an entire set of test records in a test file. So the missing value should instead be imputed using the mean calculated from the training data. The command above is therefore wrong, since it needs another input (the means of the training attributes).

Has anyone thought of this before? How do you get around this with Weka?

1 answer

Easy, see Batch Filtering

 import weka.core.Instances;
 import weka.filters.Filter;
 import weka.filters.unsupervised.attribute.Standardize;

 Instances train = ...   // from somewhere
 Instances test = ...    // from somewhere
 Standardize filter = new Standardize();
 // initialize the filter once with the training set
 filter.setInputFormat(train);
 // configures the filter based on the training instances and returns the filtered instances
 Instances newTrain = Filter.useFilter(train, filter);
 // create the new test set with the already configured filter
 Instances newTest = Filter.useFilter(test, filter);

The filter is initialized using training data, and then applied to both training and test data.
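For the mean-imputation case in the question, the same "fit on train, apply to both" idea can be sketched in plain Java without Weka. This is only an illustration under stated assumptions: `TrainMeanImputer`, `fitMeans`, and `impute` are hypothetical names, missing values are encoded as `NaN`, and only numeric attributes are handled. It is not how Weka implements ReplaceMissingValues internally.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Weka): learns per-attribute means from the
// training rows and uses them to fill NaN cells in records seen later.
final class TrainMeanImputer {

    // Mean of each column over the training rows, skipping missing (NaN) cells.
    static double[] fitMeans(double[][] train) {
        int numAttributes = train[0].length;
        double[] sums = new double[numAttributes];
        int[] counts = new int[numAttributes];
        for (double[] row : train) {
            for (int j = 0; j < numAttributes; j++) {
                if (!Double.isNaN(row[j])) {
                    sums[j] += row[j];
                    counts[j]++;
                }
            }
        }
        double[] means = new double[numAttributes];
        for (int j = 0; j < numAttributes; j++) {
            // A column with no observed values stays NaN: nothing to impute from.
            means[j] = counts[j] == 0 ? Double.NaN : sums[j] / counts[j];
        }
        return means;
    }

    // Returns a copy of the record with NaN cells replaced by the training means.
    static double[] impute(double[] record, double[] trainMeans) {
        double[] filled = record.clone();
        for (int j = 0; j < filled.length; j++) {
            if (Double.isNaN(filled[j])) {
                filled[j] = trainMeans[j];
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        double[][] train = {
            {1.0, 10.0},
            {3.0, Double.NaN},
            {Double.NaN, 30.0}
        };
        double[] trainMeans = fitMeans(train);       // column means: 2.0 and 20.0
        double[] testRecord = {Double.NaN, 25.0};    // a single test record
        // prints [2.0, 25.0]
        System.out.println(Arrays.toString(impute(testRecord, trainMeans)));
    }
}
```

The point of the split into two methods is that a single test record can be filled in without ever computing statistics on test data.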

The problem is that you apply the ReplaceMissingValues filter outside of any processing pipeline: once the filtered data has been written out, you can no longer distinguish "real" values from "imputed" ones. This is why you should do everything you need in a single pipeline, for example with FilteredClassifier:

 java -classpath weka.jar weka.classifiers.meta.FilteredClassifier -t "training_file_with_missing_values.arff" -T "test_file_with_missing_values.arff" -F weka.filters.unsupervised.attribute.ReplaceMissingValues -W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a 

In this example, the ReplaceMissingValues filter is initialized using the "training_file_with_missing_values.arff" dataset, and the multilayer perceptron is trained on the filtered training data. The same filter is then applied to "test_file_with_missing_values.arff" (using the means computed on the training set), and the class is predicted from the filtered test data.
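To show why keeping imputation inside the pipeline helps, here is a rough plain-Java sketch of the wrapper idea. It is not Weka's FilteredClassifier: `ImputingNearestCentroid` is a hypothetical class, a nearest-centroid rule stands in for the multilayer perceptron, and missing values are assumed to be encoded as `NaN`. The raw data is never rewritten; the training means live inside the model and are reused at prediction time.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch (not Weka's FilteredClassifier): imputation statistics are
// learned inside train() and reused inside predict(), so callers pass raw data.
final class ImputingNearestCentroid {
    private double[] trainMeans;                                  // set by train()
    private final Map<String, double[]> centroids = new TreeMap<>();

    void train(double[][] rows, String[] labels) {
        final int n = rows[0].length;
        // Step 1: per-attribute means over the training rows, ignoring NaN cells.
        trainMeans = new double[n];
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            int count = 0;
            for (double[] row : rows) {
                if (!Double.isNaN(row[j])) { sum += row[j]; count++; }
            }
            trainMeans[j] = count == 0 ? 0.0 : sum / count;
        }
        // Step 2: class centroids computed from the imputed training rows.
        centroids.clear();
        Map<String, Integer> sizes = new TreeMap<>();
        for (int i = 0; i < rows.length; i++) {
            double[] filled = fill(rows[i]);
            double[] centroid = centroids.computeIfAbsent(labels[i], k -> new double[n]);
            for (int j = 0; j < n; j++) centroid[j] += filled[j];
            sizes.merge(labels[i], 1, Integer::sum);
        }
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double[] centroid = e.getValue();
            int size = sizes.get(e.getKey());
            for (int j = 0; j < n; j++) centroid[j] /= size;
        }
    }

    // Fills the raw record with the stored training means, then picks the
    // class whose centroid is closest in squared Euclidean distance.
    String predict(double[] rawRecord) {
        double[] x = fill(rawRecord);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double[] c = e.getValue();
            double distance = 0.0;
            for (int j = 0; j < x.length; j++) {
                distance += (x[j] - c[j]) * (x[j] - c[j]);
            }
            if (distance < bestDistance) { bestDistance = distance; best = e.getKey(); }
        }
        return best;
    }

    private double[] fill(double[] row) {
        double[] filled = row.clone();
        for (int j = 0; j < filled.length; j++) {
            if (Double.isNaN(filled[j])) filled[j] = trainMeans[j];
        }
        return filled;
    }

    public static void main(String[] args) {
        double[][] rows = {{1.0, 1.0}, {Double.NaN, 2.0}, {9.0, 9.0}, {10.0, Double.NaN}};
        String[] labels = {"a", "a", "b", "b"};
        ImputingNearestCentroid model = new ImputingNearestCentroid();
        model.train(rows, labels);
        // A single raw test record with a missing first attribute.
        System.out.println(model.predict(new double[] {Double.NaN, 1.0}));
    }
}
```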
