FIX SimpleImputer uses dtype seen in fit for transform #22063

thomasjpfan · 2021-12-22T19:52:16Z

Reference Issues/PRs

Fixes #19572

What does this implement/fix? Explain your changes.

This PR adjusts SimpleImputer to remember the dtype it used in fit and uses the same dtype for transform.

CC @glemaitre

glemaitre

LGTM. Thanks for checking it more closely. I misdiagnosed the bug :)

jjerphan

Thank you, @thomasjpfan.

jjerphan · 2022-05-30T07:48:22Z

doc/whats_new/v1.1.rst

+- |Fix| :class:`impute.SimpleImputer` now uses the dtype seen in `fit` for
+  `transform`. :pr:`22063` by `Thomas Fan`_.
+


This should be moved to doc/whats_new/v1.2

Actually, it should go in 1.1.2. I assume that we will do another bug fix release at some point.

This is likely too big of a change for 1.1.2. Currently, fitting on float64 and then transforming float32 data returns float32:

    import numpy as np
    from sklearn.impute import SimpleImputer

    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    X = np.asarray([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]], dtype=np.float64)
    imp_mean.fit(X)

    X_test = np.asarray([[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]], dtype=np.float32)
    X_trans = imp_mean.transform(X_test)
    print(X_trans.dtype)  # float32

Looking at this again, I think it's better to error when fitting on an object dtype, but transforming on a non-object.

glemaitre

LGTM

thomasjpfan · 2022-05-31T00:20:19Z

sklearn/impute/_base.py

@@ -278,6 +278,10 @@ def _validate_input(self, X, in_fit):
        else:
            dtype = FLOAT_DTYPES

+        if not in_fit and self._fit_dtype.kind == "O":
+            # Use object dtype if fitted on object dtypes
+            dtype = self._fit_dtype


I updated this PR to use the dtype seen in fit only if that dtype is object.

This preserves the current behavior of "fit on float64 -> transform on float32 returns float32":

    import numpy as np
    from sklearn.impute import SimpleImputer

    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    X = np.asarray([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]], dtype=np.float64)
    imp_mean.fit(X)

    X_test = np.asarray([[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]], dtype=np.float32)
    X_trans = imp_mean.transform(X_test)
    print(X_trans.dtype)  # float32
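The rule described in this comment can be sketched as a standalone helper. This is a simplified illustration, not the code in `sklearn/impute/_base.py`: `pick_validation_dtype` is a hypothetical name, and `FLOAT_DTYPES` here is a local stand-in for the scikit-learn constant.

```python
import numpy as np

# Local stand-in for the constant used in the diff (assumption for this sketch).
FLOAT_DTYPES = (np.float64, np.float32, np.float16)


def pick_validation_dtype(fit_dtype, in_fit, default=FLOAT_DTYPES):
    """Hypothetical helper mirroring the rule in the diff above."""
    if not in_fit and np.dtype(fit_dtype).kind == "O":
        # Fitted on object dtype: validate the transform input as object too.
        return np.dtype(fit_dtype)
    # Otherwise keep the default, so float32 input can stay float32.
    return default


print(pick_validation_dtype(object, in_fit=False))
print(pick_validation_dtype(np.float64, in_fit=False))
```

Only the object case overrides the default, which is why a float64 fit does not force float64 onto float32 input at transform time.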

This is a good point. Do you think that we should add a unit test regarding the bitness preservation since we try to have it in other cases?

I added the test here: e97a8df (#22063).
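A check along these lines might look like the sketch below. The function name and data are illustrative and are not copied from the commit; the assertion assumes a scikit-learn release that includes this change.

```python
import numpy as np
from sklearn.impute import SimpleImputer


def transform_dtype_after_float64_fit(dtype_test):
    """Fit on float64 data, then return the dtype produced by transform."""
    X = np.asarray(
        [[1.2, 3.4, np.nan], [np.nan, 1.2, 1.3], [4.2, 2.0, 1.0]],
        dtype=np.float64,
    )
    imp = SimpleImputer().fit(X)
    X_test = np.asarray([[np.nan, np.nan, np.nan]], dtype=dtype_test)
    return imp.transform(X_test).dtype


# The numeric dtype of the transform input should be preserved.
for dtype in (np.float32, np.float64):
    assert transform_dtype_after_float64_fit(dtype) == dtype
```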

glemaitre · 2022-06-01T09:03:08Z

Thanks @thomasjpfan

FIX SimpleImputer uses dtype seen in fit for transform

1eece58

github-actions bot added the module:impute label Dec 22, 2021

DOC Adds whats new PR number

2ce69b5

thomasjpfan mentioned this pull request Dec 22, 2021

SimpleImputer strategy "most_frequent" returning ValueError: could not convert string to float when imputing strings #19572

Closed

glemaitre approved these changes Dec 22, 2021

DOC Adds comment

ab29525

cmarmo added the Waiting for Reviewer label May 16, 2022

jjerphan approved these changes May 30, 2022

glemaitre approved these changes May 30, 2022

glemaitre removed the Waiting for Reviewer label May 30, 2022

glemaitre added this to the 1.1.2 milestone May 30, 2022

thomasjpfan added 3 commits May 30, 2022 16:36

Merge remote-tracking branch 'upstream/main' into imputer_dtype

f875601

DOC Move to whats_new 1.2

1dfae0a

DOC Update whats new

b68b1b9

thomasjpfan commented May 31, 2022

thomasjpfan and others added 4 commits May 31, 2022 10:33

Merge remote-tracking branch 'upstream/main' into imputer_dtype

eaa7fab

TST Improves test

7a9a5c3

TST Adds test based on dtype

e97a8df

Merge branch 'main' into imputer_dtype

776d3d2

glemaitre merged commit 8ea2997 into scikit-learn:main Jun 1, 2022
15 of 25 checks passed

ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022

FIX SimpleImputer uses dtype seen in fit for transform (scikit-learn#…

136fb51

…22063) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022

FIX SimpleImputer uses dtype seen in fit for transform (scikit-learn#…

cc569a5

…22063) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre added a commit that referenced this pull request Aug 5, 2022

FIX SimpleImputer uses dtype seen in fit for transform (#22063)

9d5cb36

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

eddiebergman mentioned this pull request Nov 15, 2022

Update scikit learn 1.2 automl/auto-sklearn#1611

Open

54 tasks

mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022

FIX SimpleImputer uses dtype seen in fit for transform (scikit-learn#…

58a70f4

…22063) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
