New filter for orphan afterglows #403

Open · wants to merge 5 commits into base: master

Changes from 4 commits
Binary file added fink_science/data/models/model_orphans.pkl
Binary file not shown.
1 change: 1 addition & 0 deletions fink_science/orphans/README.md
@@ -0,0 +1 @@
# Orphan GRB afterglow classifier
120 changes: 120 additions & 0 deletions fink_science/orphans/filter.py
@@ -0,0 +1,120 @@
from pyspark.sql.functions import pandas_udf, PandasUDFType
Review comment (Member):
All files should start with the header containing the license statement & authorship (see other modules for example)
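
A hedged sketch of such a header, modelled on the Apache 2.0 boilerplate used in other fink-science modules (the year and author below are placeholders, not taken from this PR):

# Copyright 2023 AstroLab Software
# Author: <author of this module>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.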

from pyspark.sql.types import BooleanType

import os
import pandas as pd
import numpy as np
import joblib
from sklearn import preprocessing

from filter_utils import clean_and_sort_light_curve, compute_duration_between_first_and_peak, compute_rate, compute_color, fit_light_curve
Review comment (Member):
To avoid name conflicts, the import should always be absolute:

from fink_science.orphans.filter_utils import ...
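
Applied to this module, with the names taken from the relative import above, that would read:

from fink_science.orphans.filter_utils import clean_and_sort_light_curve, compute_duration_between_first_and_peak, compute_rate, compute_color, fit_light_curve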


from fink_filters.tester import spark_unit_tests


@pandas_udf(BooleanType(), PandasUDFType.SCALAR)
def orphan_grb(ctimemjd, cabmags, cabmagserr, cfilts):
""" Filter to extract orphan GRB candidates.

The filter is based on several features computed from the light curve:
- Duration between the first detection and the peak
- Increase and decrease rates
- Colour between different filter pairs
- The 4 parameters (A, B, C, D) and the chi2 from fitting the light curve

A Boosted Decision Tree classifier was trained on these features to discriminate orphan afterglows.

Parameters
----------
ctimemjd: pandas.Series of list of float
Concatenated time in MJD for the object
cabmags: pandas.Series of list of float, or NaN
Concatenated magnitudes for the object
cabmagserr: pandas.Series of list of float, or NaN
Concatenated magnitude errors for the object
cfilts: pandas.Series of list of int
Concatenated filters for the object

Returns
-------
out: pandas.Series of bool
True if the alert is flagged as an orphan GRB candidate,
False otherwise.
"""


# keep only non-NaN values and sort points by time
times, mags, err, filts = clean_and_sort_light_curve(ctimemjd, cabmags, cabmagserr, cfilts)


# duration between the first detection and the peak
duration = np.array(
[
compute_duration_between_first_and_peak(t, m) for t, m in zip(times.values, mags.values)
]
)
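
The helpers used here live in filter_utils.py, which is not part of this diff. As an illustration only (this body is an assumption, not the PR's actual implementation), the duration feature could look like:

def compute_duration_between_first_and_peak(times, mags):
    # hypothetical sketch: the peak is the brightest point,
    # i.e. the minimum magnitude
    peak_idx = int(np.argmin(mags))
    return times[peak_idx] - times[0]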

# increase rate, and decrease rates (average, and in the first and last thirds of the light curve)
rates = np.array(
[
compute_rate(t, m, f) for t, m, f in zip(times.values, mags.values, filts.values)
]
)

# colours for the pairs (u, g), (g, r), (r, i), (i, z), (z, Y)
colour = np.array(
[
compute_color(t, m, f) for t, m, f in zip(times.values, mags.values, filts.values)
]
)

# parameters of the fit A, B, C, D and chi2
fit_parameters = np.array(
[
fit_light_curve(t, m, e, f) for t, m, e, f in zip(times.values, mags.values, err.values, filts.values)
]
)

# gather all the features in one DataFrame and normalise them
features = {
'duration': duration,
'increase_rate': rates[:, 0],
'decrease_rate': rates[:, 1],
'decrease_rate_1': rates[:, 2],
'decrease_rate_3': rates[:, 3],
'colour': colour,
'A': fit_parameters[:, 0],
'B': fit_parameters[:, 1],
'C': fit_parameters[:, 2],
'D': fit_parameters[:, 3],
'A/B': fit_parameters[:, 0] / fit_parameters[:, 1],
'chi2': fit_parameters[:, 4]
}

df_features = pd.DataFrame(data=features)

# clean non-valid data
df_features.replace([np.inf, -np.inf], 1000, inplace=True)
df_features.fillna(0, inplace=True)

features_norm = preprocessing.normalize(df_features, norm='max')
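
Note that sklearn's preprocessing.normalize works row-wise: with norm='max', each object's feature vector is divided by its own maximum absolute value (per-sample scaling, not per-feature scaling). A quick illustration:

import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, 2.0, 4.0],
              [10.0, 5.0, 20.0]])
# each row is divided by its own maximum absolute value:
# [[0.25, 0.5, 1.0],
#  [0.5, 0.25, 1.0]]
print(preprocessing.normalize(X, norm='max'))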

# use Boosting Decision Tree classifier
# the trained model is shipped in fink_science/data/models/ (one directory
# above this module, see the binary file added in this PR)
curdir = os.path.dirname(os.path.abspath(__file__))
model_path = os.path.join(curdir, '..', 'data', 'models')
clf = joblib.load(os.path.join(model_path, 'model_orphans.pkl'))
proba = clf.predict_proba(features_norm)

# `True` for the objects that have a probability >= 0.999999 of being an orphan, else `False`
Review comment (Member):
For more flexibility, we usually decouple the computation of a quantity (fink science module), and the relevant threshold(s) to be applied on this quantity (fink filter). Hence I would suggest to return the probability values here, and then design a filter in fink-filters.
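
Following that suggestion, a hedged sketch of the split (the function names and module layout below are illustrative, not part of this PR): the science module returns the probability as a column, and the cut lives in fink-filters.

# fink-science side (hypothetical): expose the probability itself
# (assumes: from pyspark.sql.types import DoubleType)
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def orphan_grb_proba(ctimemjd, cabmags, cabmagserr, cfilts):
    # ... same feature extraction and classifier as above ...
    return pd.Series(proba[:, 1])

# fink-filters side (hypothetical): apply the threshold on that column
@pandas_udf(BooleanType(), PandasUDFType.SCALAR)
def orphan_grb_filter(proba_orphan):
    return pd.Series(proba_orphan >= 0.999999)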

threshold = 0.999999
res = (proba[:, 1] >= threshold)

return pd.Series(res, dtype=bool)


if __name__ == "__main__":
""" Execute the test suite """

globs = globals()
spark_unit_tests(globs)
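
For reference, a hedged usage sketch on a Spark DataFrame holding the aggregated alert data (the column names here are assumed to match the UDF parameters):

df = df.withColumn(
    'orphan_grb_flag',
    orphan_grb('ctimemjd', 'cabmags', 'cabmagserr', 'cfilts')
)
df.filter('orphan_grb_flag').show()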