Implement Series.factorize() #1972
Merged
+241
−1
Codecov Report
@@            Coverage Diff             @@
##           master    #1972      +/-   ##
==========================================
+ Coverage   94.52%   94.58%   +0.06%
==========================================
  Files          50       50
  Lines       10952    11041      +89
==========================================
+ Hits        10352    10443      +91
+ Misses        600      598       -2
d26b899 to a86af14
databricks/koalas/series.py (Outdated)
    raise ValueError(
        "Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' "
        "to restrict the total number of unique values of the current Series."
        "Note that, before changing the 'compute.max_rows', "
        "this operation is considerably expensive."
    )
Comment on lines 1998 to 2003
xinrong-databricks (Author, Contributor), Jan 9, 2021:
Got it! Modified to toPandas() without limits for now.
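For readers following the thread, the guard that the outdated snippet above describes can be pictured roughly as follows. This is a plain-Python sketch only, not the Koalas code: `collect_uniques` is a hypothetical helper, and its `max_rows` argument merely stands in for the `'compute.max_rows'` option named in the error message.

```python
from typing import Hashable, Iterable, List


def collect_uniques(values: Iterable[Hashable], max_rows: int) -> List[Hashable]:
    """Return distinct values in first-seen order (hypothetical helper).

    Raises ValueError as soon as more than `max_rows` distinct values
    have been seen, so an oversized result is never fully collected.
    """
    seen = {}  # dicts preserve insertion order, so this keeps first-seen order
    for value in values:
        if value in seen:
            continue
        if len(seen) >= max_rows:
            raise ValueError(
                "Too many unique values: more than max_rows=%d were found." % max_rows
            )
        seen[value] = None
    return list(seen)
```

With such a guard, `collect_uniques(["a", "b", "a", "c"], max_rows=3)` returns `["a", "b", "c"]`, while `max_rows=2` raises `ValueError`. Collecting without a limit, as the comment above describes, trades that safety check for simpler behavior.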
Otherwise, LGTM.
databricks/koalas/series.py (Outdated)
    if na_sentinel is not None:
        # Drops the NaN from the uniques of the values
        non_na_list = [x for x in uniques_list if not pd.isna(x)]
        if len(non_na_list) == 0:
            uniques = pd.Index(non_na_list)
        else:
            uniques = ks.Index(non_na_list)
    else:
        uniques = ks.Index(uniques_list)
Comment on lines 2032 to 2040
ueshin (Collaborator), Jan 8, 2021:
I think we can always return pd.Index as uniques ..? cc @HyukjinKwon
LGTM, pending tests.
@xinrong-databricks Could you try the following as well?

    >>> kser = ks.Series([1, 2, np.nan, 4, 5])
    >>> kser.loc[3] = np.nan
    >>> kser.factorize(na_sentinel=None)
    (0    0
    1    1
    2    4
    3    4
    4    2
    dtype: int32, Float64Index([1.0, 2.0, 5.0, nan, nan], dtype='float64'))
    >>> kser.to_pandas().factorize(na_sentinel=None)
    (array([0, 1, 3, 3, 2]), Float64Index([1.0, 2.0, 5.0, nan], dtype='float64'))
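The pandas result in this example collapses all missing values into a single trailing `nan` entry in the uniques, whereas the Koalas output lists `nan` twice. As a point of reference, the expected semantics can be sketched in plain Python. The `factorize` and `is_missing` functions below are illustrative stand-ins written for this note, not the pandas or Koalas implementations, and they assume hashable input values.

```python
import math
from typing import List, Optional, Sequence, Tuple


def is_missing(value: object) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def factorize(
    values: Sequence[object], na_sentinel: Optional[int] = -1
) -> Tuple[List[int], List[object]]:
    """Encode values as integer codes plus a list of uniques (illustrative only)."""
    uniques: List[object] = []
    positions = {}  # non-missing value -> its code
    codes: List[Optional[int]] = []
    for value in values:
        if is_missing(value):
            codes.append(None)  # placeholder, resolved after the loop
            continue
        if value not in positions:
            positions[value] = len(uniques)
            uniques.append(value)
        codes.append(positions[value])

    if na_sentinel is None:
        # All missing values share ONE code, placed after the real uniques.
        fill = len(uniques)
        if None in codes:
            uniques.append(float("nan"))
    else:
        # Missing values get the sentinel and do not appear in uniques.
        fill = na_sentinel
    return [fill if code is None else code for code in codes], uniques
```

For the input `[1.0, 2.0, nan, nan, 5.0]`, this sketch yields codes `[0, 1, 3, 3, 2]` and uniques `[1.0, 2.0, 5.0, nan]` with `na_sentinel=None`, matching the pandas output quoted above, and codes `[0, 1, -1, -1, 2]` with uniques `[1.0, 2.0, 5.0]` for the default sentinel.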
@ueshin Good catch! Let me look into this.
Thanks! merging.
ce2d260 into databricks:master
10 checks passed
Conda (Python, Spark, pandas, PyArrow) (3.6, 2.4.7, 0.24.2, 0.14.1, databricks.koalas.usage_loggi...
Thank you for reviewing and merging the PR! @ueshin :)


ref #1929