X Tutup
The Wayback Machine - https://web.archive.org/web/20210121124105/https://github.com/databricks/koalas/pull/1972
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Series.factorize() #1972

Merged
merged 19 commits into from Jan 13, 2021
Merged

Conversation

@xinrong-databricks
Copy link
Contributor

@xinrong-databricks xinrong-databricks commented Dec 16, 2020

ref #1929

        >>> kser = ks.Series(['b', None, 'a', 'c', 'b'])
        >>> codes, uniques = kser.factorize()
        >>> codes
        0    1
        1   -1
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c'], dtype='object')

        >>> codes, uniques = kser.factorize(na_sentinel=None)
        >>> codes
        0    1
        1    3
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c', None], dtype='object')

        >>> codes, uniques = kser.factorize(na_sentinel=-2)
        >>> codes
        0    1
        1   -2
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c'], dtype='object')
@codecov-io
Copy link

@codecov-io codecov-io commented Dec 16, 2020

Codecov Report

Merging #1972 (2e538a7) into master (0e44bc7) will increase coverage by 0.06%.
The diff coverage is 95.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1972      +/-   ##
==========================================
+ Coverage   94.52%   94.58%   +0.06%     
==========================================
  Files          50       50              
  Lines       10952    11041      +89     
==========================================
+ Hits        10352    10443      +91     
+ Misses        600      598       -2     
Impacted Files Coverage Δ
databricks/koalas/missing/series.py 100.00% <ø> (ø)
databricks/koalas/series.py 96.75% <95.00%> (+<0.01%) ⬆️
databricks/koalas/plot/matplotlib.py 92.62% <0.00%> (-0.73%) ⬇️
databricks/koalas/generic.py 92.60% <0.00%> (-0.01%) ⬇️
databricks/koalas/missing/frame.py 100.00% <0.00%> (ø)
databricks/koalas/tests/plot/test_frame_plot.py 100.00% <0.00%> (ø)
...s/koalas/tests/plot/test_series_plot_matplotlib.py 100.00% <0.00%> (ø)
databricks/koalas/groupby.py 91.60% <0.00%> (+0.02%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e44bc7...2e538a7. Read the comment docs.

@xinrong-databricks xinrong-databricks requested a review from ueshin Dec 21, 2020
@xinrong-databricks xinrong-databricks marked this pull request as ready for review Dec 21, 2020
databricks/koalas/series.py Outdated Show resolved Hide resolved
@xinrong-databricks xinrong-databricks force-pushed the xinrong-databricks:seriesFactorize branch from d26b899 to a86af14 Jan 4, 2021
@xinrong-databricks xinrong-databricks requested a review from ueshin Jan 4, 2021
databricks/koalas/series.py Outdated Show resolved Hide resolved
raise ValueError(
"Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' "
"to restrict the total number of unique values of the current Series."
"Note that, before changing the 'compute.max_rows', "
"this operation is considerably expensive."
)
Comment on lines 1998 to 2003

This comment has been minimized.

@ueshin

ueshin Jan 7, 2021
Collaborator

In this case, we should just collect all the data? cc @HyukjinKwon

This comment has been minimized.

@xinrong-databricks

xinrong-databricks Jan 8, 2021
Author Contributor

What does it mean by collect all the data?

This comment has been minimized.

@ueshin

ueshin Jan 8, 2021
Collaborator

Just do toPandas() without limits.

This comment has been minimized.

@xinrong-databricks

xinrong-databricks Jan 9, 2021
Author Contributor

Got it! Modified to toPandas() without limits for now.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/tests/test_series.py Show resolved Hide resolved
Copy link
Collaborator

@ueshin ueshin left a comment

Otherwise, LGTM.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
if na_sentinel is not None:
# Drops the NaN from the uniques of the values
non_na_list = [x for x in uniques_list if not pd.isna(x)]
if len(non_na_list) == 0:
uniques = pd.Index(non_na_list)
else:
uniques = ks.Index(non_na_list)
else:
uniques = ks.Index(uniques_list)
Comment on lines 2032 to 2040

This comment has been minimized.

@ueshin

ueshin Jan 8, 2021
Collaborator

I think we can always return pd.Index as uniques ..? cc @HyukjinKwon

This comment has been minimized.

@xinrong-databricks

xinrong-databricks Jan 8, 2021
Author Contributor

Modified it to pd.Index for now.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
raise ValueError(
"Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' "
"to restrict the total number of unique values of the current Series."
"Note that, before changing the 'compute.max_rows', "
"this operation is considerably expensive."
)

This comment has been minimized.

@ueshin

ueshin Jan 8, 2021
Collaborator

Just do toPandas() without limits.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
@ueshin
ueshin approved these changes Jan 11, 2021
Copy link
Collaborator

@ueshin ueshin left a comment

LGTM, pending tests.

@ueshin
Copy link
Collaborator

@ueshin ueshin commented Jan 11, 2021

@xinrong-databricks Could you try the following as well?

>>> kser = ks.Series([1, 2, np.nan, 4, 5])
>>> kser.loc[3] = np.nan
>>> kser.factorize(na_sentinel=None)
(0    0
1    1
2    4
3    4
4    2
dtype: int32, Float64Index([1.0, 2.0, 5.0, nan, nan], dtype='float64'))

>>> kser.to_pandas().factorize(na_sentinel=None)
(array([0, 1, 3, 3, 2]), Float64Index([1.0, 2.0, 5.0, nan], dtype='float64'))
@xinrong-databricks
Copy link
Contributor Author

@xinrong-databricks xinrong-databricks commented Jan 11, 2021

@ueshin Good catch! Let me look into this.

@ueshin
Copy link
Collaborator

@ueshin ueshin commented Jan 13, 2021

Thanks! merging.

@ueshin ueshin merged commit ce2d260 into databricks:master Jan 13, 2021
10 checks passed
10 checks passed
PIP (Python, Spark, pandas, PyArrow) (3.5, 2.3.4, 0.23.4, 0.16.0)
Details
PIP (Python, Spark, pandas, PyArrow) (3.5, 2.3.4, 0.24.2, 0.10.0, distributed-sequence)
Details
Conda (Python, Spark, pandas, PyArrow) (3.6, 2.4.7, 0.24.2, 0.14.1, databricks.koalas.usage_loggi...
Details
Conda (Python, Spark, pandas, PyArrow) (3.6, 2.4.7, 0.25.3, 0.15.1, distributed-sequence)
Details
Conda (Python, Spark, pandas, PyArrow) (3.7, 2.4.7, 0.25.3, 0.14.1)
Details
Conda (Python, Spark, pandas, PyArrow) (3.7, 2.4.7, 1.0.5, 0.15.1, distributed-sequence)
Details
Conda (Python, Spark, pandas, PyArrow) (3.7, 3.0.1, 0.25.3, 1.0.1)
Details
Conda (Python, Spark, pandas, PyArrow) (3.8, 3.0.1, 1.1.5, 2.0.0, distributed-sequence)
Details
codecov/patch 95.00% of diff hit (target 94.52%)
Details
codecov/project 94.58% (+0.06%) compared to 0e44bc7
Details
@xinrong-databricks
Copy link
Contributor Author

@xinrong-databricks xinrong-databricks commented Jan 13, 2021

Thank you for reviewing and merging the PR! @ueshin :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
X Tutup