
Sum of dataset records != Sum of Confusion matrix #747

Open

dsimop opened this issue Aug 22, 2024 · 1 comment

Comments


dsimop commented Aug 22, 2024

Context:

  • Binary classification dataset with 35 input features and 17,230 records.
  • Class distribution: 8,615 class 0 and 8,615 class 1 records (balanced).
  • Rows with no missing values: 9,318.
  • Most missing values were concentrated in a single input feature column (7,911 missing values out of 17,230 records); a quick pandas check for such counts is sketched after this list.
  • The model proposed by AutoML was a Stacked Ensemble.
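
A minimal pandas sketch of how such missing-value counts can be obtained (illustrative only; it assumes balanced_df is the DataFrame used in the snippet below):

import pandas as pd  # balanced_df is assumed to be an existing pandas DataFrame

# Missing values per column; the top entry is the single feature with 7,911 NaNs.
print(balanced_df.isna().sum().sort_values(ascending=False).head())

# Rows with no missing values at all (reported above as 9,318).
print(balanced_df.notna().all(axis=1).sum())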

Although I cannot share the raw dataset, here is the relevant mljar code snippet:

from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

# 75/25 train/test split; the last column of balanced_df is the target.
X_train, X_test, y_train, y_test = train_test_split(
    balanced_df[balanced_df.columns[:-1]], balanced_df["output_column_name"], test_size=0.25)

# Compete mode, optimizing F1 with a 48-hour time budget.
automl = AutoML(mode="Compete",
                eval_metric="f1",
                total_time_limit=48*3600,
                ml_task="binary_classification")
automl.fit(X_train, y_train)
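
As a sanity check, the split sizes can be printed directly (a quick sketch; with 17,230 rows and test_size=0.25, scikit-learn rounds the test fraction up, giving 4,308 test and 12,922 training rows):

print(balanced_df.shape)  # (17230, 36): 35 input features plus the output column
print(X_train.shape)      # expected (12922, 35)
print(X_test.shape)       # expected (4308, 35)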

Problem:

The sum of the values in the reported confusion matrix is 12,922, which is lower than the total number of records (17,230).
Is this normal/expected? Could you clarify how mljar-supervised currently handles this scenario?

@pplonski
Contributor

12,922 / 17,230 = 75%. Are you sure that you checked the X_train shape?
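
For context, the arithmetic behind the 75% figure (an illustrative sketch, not from the original reply):

import math

n_total = 17_230
n_test = math.ceil(0.25 * n_total)  # scikit-learn rounds the test fraction up: 4,308 rows
n_train = n_total - n_test          # 12,922 rows
print(n_train)                      # matches the confusion matrix sum exactly

In other words, the reported confusion matrix appears to cover only the rows passed to automl.fit() (X_train, likely evaluated via validation/out-of-fold predictions within the training data), not the full 17,230-record dataset.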
