improved autofix strategy #148
base: main
request my review when this is ready
add a short script, as a PR comment, showing how a user is going to call this
cleanlab_studio/internal/util.py
Outdated
    cleanset_df: pd.DataFrame, name_col: str, num_rows: int, asc=True
) -> List[str]:
    """
    Extracts the top specified number of rows based on a specified score column from a DataFrame.
This will only return the IDs of the datapoints to drop for a given setting of num_rows during autofix.
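As a rough illustration of the behavior described in this comment, the sketch below returns only the IDs of the `num_rows` datapoints that rank first on a score column. The function name `top_row_ids`, the `score_col` and `id_col` parameters, and the column names `cleanlab_row_ID` and `label_issue_score` are assumptions made for this example, not the PR's actual code.

```python
from typing import List

import pandas as pd


def top_row_ids(
    cleanset_df: pd.DataFrame,
    score_col: str,
    num_rows: int,
    asc: bool = True,
    id_col: str = "cleanlab_row_ID",  # assumed name of the ID column
) -> List[str]:
    """Return the IDs of the first `num_rows` rows after sorting by `score_col`."""
    if num_rows <= 0:
        return []
    # A stable sort keeps the original order among rows with equal scores.
    ranked = cleanset_df.sort_values(by=score_col, ascending=asc, kind="stable")
    return ranked[id_col].head(num_rows).astype(str).tolist()


# Toy cleanset with made-up scores (illustration only).
toy_cleanset = pd.DataFrame(
    {
        "cleanlab_row_ID": ["a", "b", "c", "d"],
        "label_issue_score": [0.9, 0.1, 0.5, 0.7],
    }
)
ids_to_drop = top_row_ids(toy_cleanset, "label_issue_score", num_rows=2, asc=False)
```

Returning IDs rather than a filtered DataFrame leaves the caller free to decide whether those rows are dropped or relabeled.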
    Parameters:
    - cleanset_df (pd.DataFrame): The input DataFrame containing the cleanset.
    - name_col (str): The name of the column indicating the category for which the top rows should be extracted.
    - num_rows (int): The number of rows to be extracted.
In autofix, we can get this simply by multiplying the default fraction of issues for the cleanset by the number of datapoints.
Right. When we spoke originally, we wanted this call to be similar to the Studio web interface call, so I rewrote it this way; it was a floating-point percentage before.
The function _get_autofix_defaults does the multiplication by the number of datapoints.
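The conversion discussed in this thread (a fraction per issue type multiplied by a number of datapoints) could be sketched as below. This is not the body of `_get_autofix_defaults`; the helper name `fractions_to_row_counts`, the choice to round to the nearest integer, and the toy counts are assumptions. It also assumes the caller passes whichever count the fraction should apply to (for example, the number of datapoints flagged with that issue).

```python
from typing import Dict


def fractions_to_row_counts(
    fractions: Dict[str, float], num_datapoints: Dict[str, int]
) -> Dict[str, int]:
    """Turn per-issue fractions into absolute row counts."""
    counts: Dict[str, int] = {}
    for issue, fraction in fractions.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {issue!r} must be in [0, 1], got {fraction}")
        # Rounding to the nearest integer is an assumption of this sketch.
        counts[issue] = round(fraction * num_datapoints[issue])
    return counts


# Toy numbers for illustration only.
example_counts = fractions_to_row_counts(
    {"outlier": 0.5, "near_duplicate": 0.2},
    {"outlier": 4, "near_duplicate": 10},
)
```

Keeping the public parameters as fractions and deriving counts internally means the same settings stay meaningful across datasets of different sizes.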
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
cleanlab_studio/internal/util.py
Outdated
}


def _get_autofix_defaults(cleanset_df: pd.DataFrame) -> dict:
TODO: Studio team should move this function to the backend of the app so it runs on the server (eventually it should be used in the web app too).
cleanlab_studio/internal/util.py
Outdated
    return default_values


def _get_top_fraction_ids(
TODO: Studio team should move this function to the backend of the app so it runs on the server (eventually it should be used in the web app too).
cleanlab_studio/studio/studio.py
Outdated
cleanset_df = self.download_cleanlab_columns(cleanset_id)
if params is None:
params = _get_autofix_defaults(cleanset_df)
print("Using autofix parameters:", params)
TODO: replace this code once an analogous method exists in the Studio backend.
From Anish: would you want this to be studio.autofix_dataset(cleanset_id)?
cleanlab_studio/internal/util.py
Outdated
"label_issue": 0.5, | ||
"near_duplicate": 0.2, | ||
"outlier": 0.5, | ||
"confidence_threshold": 0.95, |
change to: "relabel_confidence_threshold"
cleanlab_studio/internal/util.py
Outdated
@@ -63,3 +64,131 @@ def check_none(x: Any) -> bool:

def check_not_none(x: Any) -> bool:
    return not check_none(x)


def _get_autofix_default_params() -> dict:  # Studio team port to backend
allow string options (rethink names):
"optimized_training_data"
"drop_all_issues"
"suggested_actions"
drop_all_issues strategy should just drop all issues from the DF (no re-label)
suggested_actions strategy should relabel all label issues, drop outliers, and drop extra copies of near duplicates. Nothing is done to ambiguous examples or other issues.
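To make the two strategy descriptions in these comments concrete, here is a minimal pandas sketch. Everything about the data layout is an assumption for illustration: the column names (`is_label_issue`, `suggested_label`, `is_outlier`, `is_near_duplicate`, `near_duplicate_cluster_id`, `is_ambiguous`), the `label` column, and the rule of keeping the first row of each near-duplicate cluster. The `optimized_training_data` option is left out because the thread does not describe its behavior.

```python
import pandas as pd

# Assumed boolean issue columns; a real cleanset may use different names.
ISSUE_COLS = ["is_label_issue", "is_outlier", "is_near_duplicate", "is_ambiguous"]


def drop_all_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every row flagged with any issue; nothing is relabeled."""
    flagged = df[ISSUE_COLS].any(axis=1)
    return df[~flagged].reset_index(drop=True)


def suggested_actions(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Relabel label issues, drop outliers, drop extra near-duplicate copies."""
    out = df.copy()
    # Relabel rows flagged as label issues with their suggested label.
    relabel = out["is_label_issue"]
    out.loc[relabel, label_col] = out.loc[relabel, "suggested_label"]
    # Drop outliers.
    out = out[~out["is_outlier"]]
    # Within each near-duplicate cluster keep the first row, drop the rest.
    repeated = out.duplicated(subset="near_duplicate_cluster_id", keep="first")
    extra_copy = out["is_near_duplicate"] & repeated
    # Ambiguous examples and other issue types are left untouched.
    return out[~extra_copy].reset_index(drop=True)


# Toy data for illustration only.
toy_df = pd.DataFrame(
    {
        "label": ["cat", "dog", "cat", "dog", "cat", "dog"],
        "suggested_label": ["dog", None, None, None, None, None],
        "is_label_issue": [True, False, False, False, False, False],
        "is_outlier": [False, True, False, False, False, False],
        "is_near_duplicate": [False, False, True, True, False, False],
        "near_duplicate_cluster_id": [None, None, 0, 0, None, None],
        "is_ambiguous": [False, False, False, False, True, False],
    }
)

# String option -> strategy function, mirroring the names proposed above.
STRATEGIES = {
    "drop_all_issues": drop_all_issues,
    "suggested_actions": suggested_actions,
}
fixed_df = STRATEGIES["suggested_actions"](toy_df)
```

Both functions work on a copy or a filtered view, so the caller's original DataFrame is not modified.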
@@ -383,3 +386,32 @@ def poll_cleanset_status(self, cleanset_id: str, timeout: Optional[int] = None)

        except (TimeoutError, CleansetError):
            return False

    def autofix_dataset(
allow string options to be passed straight through into _get_autofix_default_params()
This should be added now, and it is clarified in the docs:
params (dict, optional): Default parameter dictionary containing confidence threshold for auto-relabelling, and
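One way the pass-through requested above might look is sketched here: a single argument that accepts either a strategy name or an explicit parameter dictionary. The helper name `resolve_autofix_params`, the table `EXAMPLE_STRATEGY_DEFAULTS`, its key names and values, and the choice of default strategy are all placeholders invented for this example; only the strategy names come from the review comments.

```python
from typing import Dict, Optional, Union

# Placeholder defaults; keys and values are illustrative, not the PR's settings.
EXAMPLE_STRATEGY_DEFAULTS: Dict[str, Dict[str, float]] = {
    "optimized_training_data": {
        "drop_outlier": 0.5,
        "drop_near_duplicate": 0.2,
        "relabel_confidence_threshold": 0.95,
    },
    "drop_all_issues": {
        "drop_outlier": 1.0,
        "drop_near_duplicate": 1.0,
    },
}


def resolve_autofix_params(
    params: Optional[Union[str, Dict[str, float]]] = None,
    default_strategy: str = "optimized_training_data",  # assumed default
) -> Dict[str, float]:
    """Return a parameter dict from a strategy name, a dict, or None."""
    if params is None:
        params = default_strategy
    if isinstance(params, str):
        if params not in EXAMPLE_STRATEGY_DEFAULTS:
            known = ", ".join(sorted(EXAMPLE_STRATEGY_DEFAULTS))
            raise ValueError(f"unknown strategy {params!r}; expected one of: {known}")
        # Copy so callers cannot mutate the shared defaults.
        return dict(EXAMPLE_STRATEGY_DEFAULTS[params])
    return dict(params)


by_name = resolve_autofix_params("drop_all_issues")
by_dict = resolve_autofix_params({"drop_outlier": 0.1})
```

Failing fast on an unknown strategy name gives the user a clearer error than a later KeyError inside the autofix logic.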
cleanlab_studio/studio/studio.py
Outdated
        self, original_df: pd.DataFrame, cleanset_id: str, params: dict = None
    ) -> pd.DataFrame:
        """
        This method returns the auto-fixed dataset.
The docstring should clarify that the dataset must be a DataFrame (text or tabular datasets only).
cleanlab_studio/internal/util.py
Outdated

def _get_autofix_default_params() -> dict:  # Studio team port to backend
    """returns default percentage-wise params of autofix"""
    return {
can choose more specific key names here
cleanlab_studio/studio/studio.py
Outdated

Example:
{
    'drop_ambiguous': 9,
change to fractions
cleanlab_studio/internal/util.py
Outdated
def get_autofix_defaults_for_strategy(strategy):
    return AUTOFIX_DEFAULTS[strategy]
make everything that should be ported to the backend a private method
cleanlab_studio/internal/util.py
Outdated
@@ -198,6 +221,130 @@ def check_not_none(x: Any) -> bool:
    return not check_none(x)


# Studio team port to backend
def get_autofix_defaults_for_strategy(strategy):
Suggested change:
- def get_autofix_defaults_for_strategy(strategy):
+ def _get_autofix_defaults_for_strategy(strategy):
cleanlab_studio/studio/studio.py
Outdated
dataset, cl_cols, id_col, label_col, keep_excluded
)
return corrected_ds
return snowflake_corrected_ds
Could this be from the Studio team?
In that case we need to merge their main branch first.
@aditya1503 will have a look
Yes, we need to rebase everything here against the latest master branch.
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Skeleton code for improved Auto-Fix strategies
Link to Notion: https://www.notion.so/cleanlab/Improve-ML-accuracy-with-Studio-via-better-Autofix-99434fa92a164131b3860093d85e5350?pvs=4
Note: this is only for text/tabular datasets, not image.