Feature/expected categories #1597
base: main
Duly noted, I'll take a look.
Thanks for this PR, it's useful.
My preference here would be to not adhere to sklearn, and set `categories` to `None` rather than `auto`.
Code review fixes Co-authored-by: Max Halford <maxhalford25@gmail.com>
Deal. I shall modify.
@ColdTeapot273K sorry for not replying in a while! I like the changes, we can merge. Before that though, could you add an entry to |
@MaxHalford no problem, I understand. Done, please check.
Add support for processing only explicitly expected categories for `preprocessing.OneHotEncoder` and `preprocessing.OrdinalEncoder`, akin to the `sklearn` API for the respective encoders.
All doctests pass (I've added some).
Rationale:
`sklearn` has this neat feature where you can explicitly pass the category values you want to see in the encoder state; other values are filtered out. See the `categories` parameter of OneHotEncoder and OrdinalEncoder.
This is convenient when you work with high-cardinality category spaces where some values are rare and you want to regularize your model. E.g. I've had a practical problem where constraining to only the pre-selected top 20% most frequent categories in a 1,000,000-cardinality space can give you a 10%+ latency boost with no significant loss in metrics, and also make the model lighter on RAM.
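To illustrate the idea, here is a minimal pure-Python sketch of a streaming one-hot encoder that only emits columns for explicitly expected categories. It is not the code from this PR: the class name `ExpectedOneHot`, the attribute `seen`, and the dict-of-sets shape of `categories` are all assumptions made for the example.

```python
from __future__ import annotations


class ExpectedOneHot:
    """Toy streaming one-hot encoder (hypothetical, not the encoder from this PR)."""

    def __init__(self, categories: dict[str, set] | None = None):
        # None means "learn every value seen so far";
        # a dict restricts each listed feature to the given values.
        self.categories = categories
        self.seen: dict[str, set] = {}

    def learn_one(self, x: dict) -> None:
        if self.categories is not None:
            return  # expected values are fixed up front, nothing to learn
        for feature, value in x.items():
            self.seen.setdefault(feature, set()).add(value)

    def transform_one(self, x: dict) -> dict:
        known = self.seen if self.categories is None else self.categories
        out = {}
        for feature, values in known.items():
            # Sort by string form so the column order is deterministic.
            for value in sorted(values, key=str):
                out[f"{feature}_{value}"] = int(x.get(feature) == value)
        return out


enc = ExpectedOneHot(categories={"city": {"paris", "rome"}})
unexpected = enc.transform_one({"city": "oslo"})  # "oslo" gets no column
expected = enc.transform_one({"city": "rome"})
```

With a fixed category list, the output width stays constant no matter how many distinct values the stream contains, which is where the latency and memory savings described above would come from.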
This implementation is hackable, so if a user wants to modify the lists of expected categories between training steps, they can do so by direct attribute access. E.g. it can be glued together with modules like TargetAgg for some cool dynamic reevaluation of expected category lists.
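A rough sketch of what such dynamic reevaluation could look like: a frequency tracker periodically overwrites the encoder's expected categories with the current top-k values. Everything here is hypothetical, including `FrequencyTracker`, `StubEncoder`, and the assumption that the encoder exposes a public `categories` attribute holding a dict of sets; it uses a plain `Counter` rather than TargetAgg.

```python
from __future__ import annotations

from collections import Counter


class FrequencyTracker:
    """Hypothetical helper that counts category values per feature."""

    def __init__(self):
        self.counts: dict[str, Counter] = {}

    def update(self, x: dict) -> None:
        for feature, value in x.items():
            self.counts.setdefault(feature, Counter())[value] += 1

    def top_k(self, k: int) -> dict[str, set]:
        # Keep the k most frequent values seen so far for each feature.
        return {f: {v for v, _ in c.most_common(k)} for f, c in self.counts.items()}


class StubEncoder:
    """Stand-in for an encoder with a public `categories` attribute (an assumption)."""

    def __init__(self, categories=None):
        self.categories = categories


tracker = FrequencyTracker()
encoder = StubEncoder(categories={"city": set()})

stream = [
    {"city": "paris"}, {"city": "rome"}, {"city": "paris"},
    {"city": "oslo"}, {"city": "rome"}, {"city": "paris"},
]
for i, x in enumerate(stream, start=1):
    tracker.update(x)
    if i % 3 == 0:
        # Re-evaluate the expected categories every 3 samples
        # by assigning directly to the attribute.
        encoder.categories = tracker.top_k(2)
```

After the six samples above, the rare value "oslo" is excluded and only the two most frequent cities remain expected.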
P.S. Please bump Ruff; my LSP config complains because of API changes. Also, MyPy complained a lot about the `str | dict | defaultdict` type hints for the `category` parameter. I just had to give up on them; maybe someone has better ideas on how to handle them.