
Sum of dataset records != Sum of Confusion matrix #747

Open

dsimop opened this issue Aug 22, 2024 · 1 comment

Comments


dsimop commented Aug 22, 2024

Context:

  • Binary classification dataset with 35 input features and 17,230 records.
  • Class distribution: 8,615 class 0 and 8,615 class 1 records (balanced).
  • Rows with no missing values: 9,318.
  • Most missing values were concentrated in a single input feature column (7,911 missing values out of 17,230 records); a quick pandas check for such counts is sketched after this list.
  • The model proposed by AutoML was a Stacked Ensemble.
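
A minimal pandas sketch of how such missing-value counts can be obtained (illustrative only; it assumes balanced_df is the DataFrame used in the snippet below):

import pandas as pd  # balanced_df is assumed to be an existing pandas DataFrame

# Missing values per column; the top entry is the single feature with 7,911 NaNs.
print(balanced_df.isna().sum().sort_values(ascending=False).head())

# Rows with no missing values at all (reported above as 9,318).
print(balanced_df.notna().all(axis=1).sum())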

Although I cannot share the raw dataset, here is the relevant mljar code snippet:

from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

# 75/25 train/test split; the last column of balanced_df is the target.
X_train, X_test, y_train, y_test = train_test_split(
    balanced_df[balanced_df.columns[:-1]], balanced_df["output_column_name"], test_size=0.25)

# Compete mode, optimizing F1 with a 48-hour time budget.
automl = AutoML(mode="Compete",
                eval_metric="f1",
                total_time_limit=48*3600,
                ml_task="binary_classification")
automl.fit(X_train, y_train)
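
As a sanity check, the split sizes can be printed directly (a quick sketch; with 17,230 rows and test_size=0.25, scikit-learn rounds the test fraction up, giving 4,308 test and 12,922 training rows):

print(balanced_df.shape)  # (17230, 36): 35 input features plus the output column
print(X_train.shape)      # expected (12922, 35)
print(X_test.shape)       # expected (4308, 35)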

Problem:

The sum of the values in the reported confusion matrix is 12,922, which is lower than the total number of records (17,230).
Is this normal/expected? Could you clarify how mljar-supervised currently handles this scenario?

@pplonski
Contributor

12,922 / 17,230 = 75%. Are you sure that you checked the X_train shape?
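
For context, the arithmetic behind the 75% figure (an illustrative sketch, not from the original reply):

import math

n_total = 17_230
n_test = math.ceil(0.25 * n_total)  # scikit-learn rounds the test fraction up: 4,308 rows
n_train = n_total - n_test          # 12,922 rows
print(n_train)                      # matches the confusion matrix sum exactly

In other words, the reported confusion matrix appears to cover only the rows passed to automl.fit() (X_train, likely evaluated via validation/out-of-fold predictions within the training data), not the full 17,230-record dataset.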
