[MRG+1] Threshold for pairs learners #168
Conversation
I just added two commits. In the first commit, I added a minimal implementation of threshold, which already fixes #165 (since it gives a …). In the second commit, I also modified LSML to be able to … Note that I also realized that the score was not computing the accuracy, because the outputs were in fact either -1 or 1, not 0 or 1, so I changed it, including the newly introduced …
metric_learn/base_metric.py (Outdated)

        Returns
        -------
        score : float
          The quadruplets score.
        """
        return -np.mean(self.predict(quadruplets))
        quadruplets = check_input(quadruplets, y, type_of_inputs='tuples',
Note that here `quadruplets` will be checked twice (once here, once in `predict`). This is because when I do `y = np.ones(quadruplets.shape[0])`, I want to be sure that I can call `quadruplets.shape[0]`, and otherwise an error message would be returned beforehand (by the `check_input` method). I think it's fine, isn't it? I don't see any other way to do it. Note that I also check `y` at the same time, because I like the fact that the `column_or_1d` check will be called on `y` (since it is not done in `accuracy_score`).
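For context, here is a rough sketch of the pattern being discussed (simplified; the `check_input` arguments are taken from the excerpt above, the rest is illustrative and not the exact metric-learn code):

```python
import numpy as np
from sklearn.metrics import accuracy_score

from metric_learn._util import check_input  # assumed import path


def score(self, quadruplets):
    # validate the quadruplets once here so that `quadruplets.shape[0]` below
    # is safe to call; `predict` will run its own validation a second time
    quadruplets = check_input(quadruplets, None, type_of_inputs='tuples',
                              estimator=self, tuple_size=self._tuple_size)
    # every quadruplet is assumed "correctly ordered", so the true label is +1
    y = np.ones(quadruplets.shape[0])
    return accuracy_score(y, self.predict(quadruplets))
```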
I'll solve the errors in tests, but one of them is that 'roc_auc' is not defined if there is only one class (it fails for quadruplets, I think, because the true labels are …).

Plus, there cannot be another threshold than zero for quadruplets. Say we put threshold=1, and the score of a quadruplet is 0.7 (meaning the distance between the two first points is 0.7 + the distance between the two last points). Then the quadruplet would be classified as -1 (i.e. the two first points are considered "closer" than the two last points are). But if we reverse the order of the pairs in the quadruplet, the score would be -0.7, so the classification would be the same (-1), which would be inconsistent with what the classifier said just before (i.e. the two first points (which were the old two last points) are now considered "closer" than the two last points (which were the two first points)).

I think this code is still useful to use …
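A tiny numerical check of that argument (illustrative values only):

```python
threshold = 1.
score = 0.7   # decision value of a quadruplet; swapping its two pairs flips the sign

print(1 if score > threshold else -1)    # -1
print(1 if -score > threshold else -1)   # -1 as well: the pair-swapped quadruplet
                                         # gets the same label, which is contradictory
```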
I was about to make a test for being compatible with scikit-learn's …
metric_learn/base_metric.py (Outdated)

    @@ -296,6 +296,7 @@ def get_mahalanobis_matrix(self):
    class _PairsClassifierMixin(BaseMetricLearner):
      classes_ = [-1, 1]
We need estimators to have a `classes_` attribute for `CalibratedClassifierCV` to work. Typically, scikit-learn estimators deduce it at fit time, but in our case we will always have these two classes (see also issue #164). That's why I put it as a class attribute, to be more explicit.

Classes have to be in this order (`[-1, 1]`, i.e. sorted), otherwise there is a bug, because of https://github.com/scikit-learn/scikit-learn/blob/1a850eb5b601f3bf0f88a43090f83c51b3d8c593/sklearn/calibration.py#L312-L313: `self.label_encoder_.classes_` is a sorted list of classes (so it'll be `[-1, 1]` in every case). So when it transforms `self.base_estimator.classes_`, it will return `[0, 1]` if our estimator has `[-1, 1]` as classes, and `[1, 0]` if it has `[1, -1]`. In the latter case, it will raise an `IndexError`, because here (https://github.com/scikit-learn/scikit-learn/blob/1a850eb5b601f3bf0f88a43090f83c51b3d8c593/sklearn/calibration.py#L357) it will try to reach column n°1 (and not 0) of `Y` (which does not exist).

Also, we have to put -1 and 1 in this order so that for `label_binarizer` (which is called by `CalibratedClassifierCV`), -1 (the first element of `[-1, 1]`) will be considered the negative label, and 1 the positive label. (That's how `label_binarizer` works, see the example below.)
Example (warning: `pos_label` and `neg_label` do not say which input classes are positive or negative, as one could think, but rather how we want them to appear in the output):

    In [1]: from sklearn.preprocessing import label_binarize
       ...: print(label_binarize([-1, 1, -1], [-1, 1], pos_label=2019, neg_label=2018))
    [[2018]
     [2019]
     [2018]]

And:

    In [1]: from sklearn.preprocessing import label_binarize
       ...: print(label_binarize([-1, 1, -1], [1, -1], pos_label=2019, neg_label=2018))
    [[2019]
     [2018]
     [2019]]
Do we need the `classes_` attribute in this PR? Otherwise it could be saved for #173.
metric_learn/base_metric.py (Outdated)

                              estimator=self, tuple_size=self._tuple_size)
        # checked_input will be of the form `(checked_quadruplets, checked_y)` if
        # `y` is not None, or just `checked_quadruplets` if `y` is None
        quadruplets = checked_input if y is None else checked_input[0]
I find this a bit ugly, but I couldn't find another way to do it. Maybe refactor `check_input` with two functions like `check_X_y` and `check_array`? Or we can just keep it as it is for now.
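For what it's worth, such a split could be as thin as the following (illustrative names, relying on the fact that `check_input` returns only the checked tuples when `y` is None):

```python
def check_tuples(tuples, **kwargs):
    # analogous to sklearn's check_array: validate the tuples only
    return check_input(tuples, None, **kwargs)


def check_tuples_y(tuples, y, **kwargs):
    # analogous to sklearn's check_X_y: validate tuples and labels together
    return check_input(tuples, y, **kwargs)
```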
test/test_quadruplets_classifiers.py (Outdated)

    @@ -25,35 +25,11 @@ def test_predict_only_one_or_minus_one(estimator, build_dataset,
      assert np.isin(predictions, [-1, 1]).all()
    @pytest.mark.parametrize('with_preprocessor', [True, False])
I removed this test because it makes no sense for quadruplets, since the threshold should always be 0.
Some comments:
1/ I think we have a problem with `decision_function`: it is interpreted as "the larger, the more likely it corresponds to the positive class". However, since it is a distance, it is the other way around for us, which I think will make the computation of AUC wrong. I think we should return minus the distance instead. To return the actual distances, we already have `score_pairs`.

2/ For quadruplet metric learners: I think the only score that makes sense is the accuracy, because each quadruplet contains both a positive and a negative pair. So while the trick of adding an argument `y` works to call sklearn's accuracy score function, an easy way to prevent users from trying to use other scoring functions would be not to add the `y` and simply compute the accuracy ourselves. This way it will raise an error when trying to use other sklearn scoring functions. What do you think? I am not sure how we could raise a specific error to explain that only accuracy makes sense for quadruplets. In any case, we could deal with the quadruplet thing in a separate PR (see also point 3/ below) and keep this PR only about pairwise metric learners.

3/ For quadruplet metric learners but also the metric transformers, in terms of use case it would still be useful to allow people to easily get a pair classifier out of the learned distance. For this we could for instance introduce a new "…" kind of pair metric learner which would simply take a predefined metric as input instead of fitting on data (users can then easily set/calibrate the threshold using existing methods). There may be a better way to do this. We should discuss this separately (I will create an issue).
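A rough sketch of what point 1/ could look like (assuming `score_pairs` returns the learned distances and that `threshold_` is a positive distance; this is not the final implementation):

```python
import numpy as np


def decision_function(self, pairs):
    # larger decision values should mean "more likely similar", so return
    # minus the learned distance (score_pairs gives the raw distances)
    return -self.score_pairs(pairs)


def predict(self, pairs):
    # a pair is predicted similar (+1) when its distance is below the learned
    # threshold_, i.e. when its decision value is above -threshold_
    return np.where(self.decision_function(pairs) >= -self.threshold_, 1, -1)
```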
    @@ -296,6 +296,7 @@ def get_mahalanobis_matrix(self):
    class _PairsClassifierMixin(BaseMetricLearner):
      classes_ = [-1, 1]
      _tuple_size = 2  # number of points in a tuple, 2 for pairs
Shouldn't we document the attribute `threshold_` here in the general class rather than in the downstream classes?
I guess I should document it for both the abstract and the downstream classes, but I don't think I can document it only here: indeed, we can inherit docstrings, but I don't think we can inherit docstrings paragraph-wise (e.g. `_PairsClassifierMixin` has an attribute `threshold_` that we want to transmit to `MMC`, but `MMC` also has a title, and other attributes like `transformer_`). I didn't find an option to do this in sphinx, and this impossibility makes some sense: if we wrote two docstring paragraphs (below the title), would the result in the downstream class be the concatenation of both descriptions? It could give weird results (e.g. "This is the abstract class" and "This is the MMC class" would give "This is the abstract class. This is the MMC class"). However, there seem to exist some packages that could allow something like this. We could use them, but maybe later on in the development if things get too replicated:
https://github.com/Chilipp/docrep
https://github.com/meowklaski/custom_inherit

Then there's also the solution of not caring about inheriting the docstrings, but I think it's better for a user to know what all the attributes of their object are without needing to click through the doc of all the parent classes (that's what is done in scikit-learn, for instance, I think).

Finally, there's also another option, which is to declare attributes as properties so that we can have a more granular inheritance, but I think this adds too much complexity just for docstrings.
(done: I just added them to the base PairsClassifier class too)
Yes, absolutely, I just changed it.
I agree; however, I think functions that use … Also, all scikit-learn estimators have a … If we wanted to still implement a way to raise an error if the user used scikit-learn's scoring function, we could maybe do something with the … But to sum up, this would mean doing something hacky to allow, in the end, functionalities that are not mainstream (using the probabilistic losses, and calling the scoring with …).
I agree, let's deal with this in the new issue.
Yes I agree, I was hesitating between both, but I guess a negative threshold is weird indeed.
@bellet Thanks for your review, I addressed all your comments. Before merging, there just remains the following: …
Yes this is a good idea. I would maybe name this parameter …
metric_learn/itml.py (Outdated)

        Returns
        -------
        self : object
          Returns the instance.
        """
        return self._fit(pairs, y, bounds=bounds)
        self._fit(pairs, y, bounds=bounds)
        self.calibrate_threshold(pairs, y, **(threshold_params if
For all algorithms we should test that `threshold_params` has allowed values at the beginning of `fit`, otherwise we will fit for nothing (which may take some time) if bad params have been given...
That's a good point, done.
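For reference, the pattern being discussed might look roughly like this (a sketch with assumed helper names, not the exact ITML code):

```python
def fit(self, pairs, y, bounds=None, threshold_params=None):
    # validate the threshold-calibration parameters *before* the (potentially
    # long) metric fitting, so that bad values fail fast
    threshold_params = threshold_params if threshold_params is not None else {}
    self._validate_calibration_params(**threshold_params)  # assumed helper
    self._fit(pairs, y, bounds=bounds)
    self.calibrate_threshold(pairs, y, **threshold_params)
    return self
```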
metric_learn/base_metric.py (Outdated)

        return self
      def calibrate_threshold(self, pairs_valid, y_valid, strategy='accuracy',
                              threshold=None, beta=None):
Or desired_rate
        cum_accuracy = (cum_tp + cum_tn) / n_samples
        imax = np.argmax(cum_accuracy)
        # note: we want a positive threshold (distance), so we take - threshold
        self.threshold_ = - scores_sorted[imax]
shouldn't the threshold be positive?
nevermind, got confused again because I did not read the comment ;-) (this confirms that the comment is useful)
maybe the comment can be a bit more explicit: "we are working with negative distances but we want the threshold to be with respect to the actual distances so we take minus sign" or something like that
agreed, done
You can ignore this, you replied in the proper thread; not sure why it appeared out of context.
Sorry for the messy review (several tabs opened). About your questions: …
…t accepted value and highest rejected value), and max + 1 or min - 1 for extreme points
# Conflicts: # test/test_mahalanobis_mixin.py # test/test_utils.py
        cum_tn_inverted = stable_cumsum(y_ordered[::-1] == -1)
        cum_tn = np.concatenate([[0], cum_tn_inverted[:-1]])[::-1]
        cum_tn = np.concatenate([[0.], cum_tn_inverted])[::-1]
We need to include the last `cum_tn_inverted` now, since it's the one where all samples will be rejected.
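A toy illustration of this cumulative-true-negative trick (with `np.cumsum` standing in for sklearn's `stable_cumsum`):

```python
import numpy as np

# labels sorted by decreasing decision value (i.e. increasing distance)
y_ordered = np.array([1, 1, -1, 1, -1])

cum_tn_inverted = np.cumsum(y_ordered[::-1] == -1)
cum_tn = np.concatenate([[0.], cum_tn_inverted])[::-1]
# cum_tn[i] = true negatives when only the i highest-scoring pairs are accepted:
# the first entry corresponds to rejecting everything, the last to accepting all
print(cum_tn)  # [2. 2. 2. 1. 1. 0.]
```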
        y_valid, self.decision_function(pairs_valid), pos_label=1)
We need to take part of the code (but not all) from `precision_recall_curve` here, because there was a section of the code in `precision_recall_curve` that deleted points once the maximum recall is attained. This was OK when we set the threshold to the last value we accept, but it's not OK with the new computation of the threshold, because we need all the points to put the threshold in the middle of two points. (Otherwise the algorithm can think some point is the last point of the list and set the threshold to this value - 1, whereas it should be the mean between this value and the next one.)

I can detail this more if needed.
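Roughly, the midpoint idea amounts to the following (a sketch with assumed variable names, not the actual implementation):

```python
import numpy as np

# decision values (i.e. minus distances) sorted in decreasing order, and
# `imax` the index that maximizes the chosen criterion (accuracy, f_beta, ...)
scores_sorted = np.array([-0.2, -0.5, -1.3, -2.4, -3.1])
imax = 1

if imax < len(scores_sorted) - 1:
    # halfway between the lowest accepted and the highest rejected score
    decision_threshold = (scores_sorted[imax] + scores_sorted[imax + 1]) / 2
else:
    # extreme case where everything is accepted: go just below the minimum
    decision_threshold = scores_sorted[-1] - 1
threshold_ = -decision_threshold  # back to a (positive) distance threshold
```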
metric_learn/base_metric.py (Outdated)

        return self
        fpr, tpr, thresholds = roc_curve(y_valid,
                                         self.decision_function(pairs_valid),
                                         pos_label=1)
                                         pos_label=1, drop_intermediate=False)
We need `drop_intermediate=False` here for a problem similar to the one above: `drop_intermediate=True` worked fine when we were setting the threshold to an exact value of a point, but if we remove some points then the mean between one point and the next can be wrong (because the "next" point is not the actual "next" point: the actual "next" one would have been removed). `drop_intermediate=False` ensures that no point is dropped.
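To make this concrete, a minimal sketch of a 'max_tpr'-style calibration built on `roc_curve` (toy data; `min_rate` is an assumed parameter name):

```python
import numpy as np
from sklearn.metrics import roc_curve

# toy validation data: true pair labels and decision values (minus distances)
y_valid = np.array([1, 1, -1, 1, -1, -1])
decision_values = np.array([-0.2, -0.4, -0.9, -1.1, -1.8, -2.5])
min_rate = 0.6  # minimal true-negative rate we want to guarantee

fpr, tpr, thresholds = roc_curve(y_valid, decision_values, pos_label=1,
                                 drop_intermediate=False)
tnr = 1. - fpr
# among the thresholds achieving TNR >= min_rate, keep the one with highest TPR
candidates = np.where(tnr >= min_rate)[0]
best = candidates[np.argmax(tpr[candidates])]
threshold_ = -thresholds[best]  # back to a (positive) distance threshold
```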
I just added the new computation of the threshold as we said, so I'll just refactor it a bit (right now it works, but it's a bit messy), then I should document how we set the threshold with the new computation, and then I should add the check before fitting for the …
If it remains quite messy, we have to ask ourselves whether this mean strategy is worth the extra complexity in the code.
I agree. Also, I think some of the complexity comes from the fact that we use scikit-learn functions and deal with the special cases where they remove points in curves, etc. Maybe re-implementing the code would be simpler, more robust to sklearn's changes, and would need fewer comments to explain how we deal with special cases; probably some parts could also be mutualized, like the cumsums, so it wouldn't be as much code as just recoding each part... I'll also think about that to see what it would look like this way.
If we call sklearn routines in a straightforward way (avoiding edge cases), I think there is no reason to expect breaking changes. So this is an option (i.e., revert back to picking a simple returned threshold instead of averaging). Re-implementing everything on ROC/PR curves may seem like overkill?
Yes I agree, I'll put the recent changes in a branch of my fork in case we need them one day, but I agree that for now we can stick to the first version.
I'll just need to still cherry-pick from those changes the way we can put a threshold to the worst score - 1; it's the only solution if we want to reject everything (which is a case that is now also tested).
I came back to the previous implementation for the threshold, put a calibration params validation before fit, and addressed all the comments. Just one last question before final review and merge: I noted in the todo that we could raise a warning when calibrating the threshold on the trainset at fit time. Is it really necessary in the end? We already add a little comment about that in the User Guide. Isn't that enough?
LGTM. Warning is not necessary (and would be quite annoying). |
This PR fits a threshold for tuples learners to allow a predict (and scoring) on pairs
Fixes #131
Fixes #165
Finally, we decided that it would be good to have at least a minimal implementation of threshold calibration inside metric-learn, so that `_PairsClassifiers` can have a threshold, hence a `predict`, directly out of the box, without the need for a MetaEstimator. A MetaEstimator could however be used outside the algorithm for more precise threshold calibration (with cross-validation). The following features should be implemented (see the usage sketch after this list):

- We should have two methods for `_PairsClassifiers`: `set_threshold()` and `calibrate_threshold(validation_set, method='max_tpr', args={'min_tnr': 0.1})` (we went for the same syntax as scikit-learn's PR). `set_threshold` will set the threshold to a hard value, and `calibrate_threshold` will take a validation set and a method ('accuracy', 'f1'...) and will find the threshold which optimizes the metric in `method` on the validation set.
- At fit time, we should either have a simple rule to set a threshold (for instance the median of distances, or the mean between the mean distance of positive pairs and the mean distance of negative pairs), or we should run `calibrate_threshold(trainset)`, and in this case also raise a warning at the end saying that the threshold has been fitted on the trainset, so we should check scikit-learn's utilities for calibration to have a calibration less prone to overfitting. Also in this case we could allow arguments in `fit`, like `fit(pairs, y, threshold_method='max_tpr', threshold_args={'min_tnr': 0.1})`.
- The following scores should be implemented: see scikit-learn's calibration PR ([MRG] Add decision threshold calibration wrapper scikit-learn/scikit-learn#10117) for more details, and its documentation here: https://35753-843222-gh.circle-artifacts.com/0/doc/modules/calibration.html
- For some estimators for which a natural threshold exists (like `ITML`: the mean between the lower threshold and the higher threshold), we should use this threshold.
- Decide what to do by default: rule-of-thumb scoring or calibration on the trainset?
Questions:

TODO:

- `CalibratedClassifierCV`
- Think about which scoring makes sense with quadruplets and what impact it has on the code
- Use/Adapt `CalibratedClassifierCV` or an equivalent for quadruplets
- Maybe test `CalibratedClassifierCV` that it returns a coherent value (like all pairs that have predict_proba = 0.8 indeed have 80% success). `CalibratedClassifier`'s behaviour should be tested in scikit-learn: in metric-learn we should just test the API (but on a small draft example I tested it for ITML and it worked more or less).
or an equivalent for quadrupletsMaybe test CalibratedClassifierCV that it returns a coherent value (like all that have predict_proba = 0.8 have indeed 80% success)CalibratedClassifier's behaviour should be tested in scikit-learn: in metric learn we should just test the API (but on a small draft example I tested it for ITML and it worked more or less)