
[MRG+1] Threshold for pairs learners #168

Merged (43 commits) on Apr 15, 2019

Conversation

@wdevazelhes (Member) commented Feb 4, 2019

This PR fits a threshold for tuples learners, to allow a predict (and scoring) on pairs.

Fixes #131
Fixes #165

Finally, we decided that it would be good to have at least a minimal implementation of threshold calibration inside metric-learn, so that _PairsClassifiers can have a threshold, and hence a predict, directly out of the box, without the need for a MetaEstimator. A MetaEstimator could however still be used outside the algorithm for more precise threshold calibration (with cross-validation).

The following features should be implemented:

  • We should have two methods for _PairsClassifiers: set_threshold() and calibrate_threshold(validation_set, method='max_tpr', args={'min_tnr': 0.1}) (see the sketch after this list). set_threshold will set the threshold to a hard value, and calibrate_threshold will take a validation set and a method ('accuracy', 'f1', ...) and find the threshold that optimizes that metric on the validation set. We went for the same syntax as scikit-learn's PR.

  • At fit time, we should either have a simple rule to set a threshold (for instance the median of the distances, or the mean between the mean distance of positive pairs and the mean distance of negative pairs), or call calibrate_threshold(trainset), in which case we should also raise a warning at the end saying that the threshold has been fitted on the trainset, and check scikit-learn's calibration utilities to get a calibration less prone to overfitting. Also, in this case we could allow passing arguments to fit, like `fit(pairs, y, threshold_method='max_tpr', threshold_args={'min_tnr': 0.1})`.

  • The following scores should be implemented:

  • For some estimators for which a natural threshold exists (like ITML: the mean between the lower and the higher threshold), we should use this threshold.

  • Decide what to do by default: rule-of-thumb threshold or calibration on the trainset?
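
As a rough illustration of the API proposed in the bullets above (method and argument names are taken from this description and may well differ from what is finally merged), usage could look like this:

import numpy as np
from metric_learn import ITML

rng = np.random.RandomState(42)
pairs = rng.randn(40, 2, 3)                     # (n_pairs, 2, n_features)
y = np.tile([1, -1], 20)                        # +1 = similar pair, -1 = dissimilar pair
pairs_valid, y_valid = rng.randn(20, 2, 3), np.tile([1, -1], 10)

itml = ITML().fit(pairs, y)

# set the threshold to a hard value...
itml.set_threshold(0.5)

# ...or calibrate it on a validation set, optimizing a chosen metric
itml.calibrate_threshold(pairs_valid, y_valid, method='max_tpr',
                         args={'min_tnr': 0.1})

predictions = itml.predict(pairs_valid)         # predict now works out of the box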

Questions:

  • Should we do the same thing for QuadrupletsLearners? (The natural threshold there is 0, so I don't think it would make as much sense, and users would maybe rather use the future meta-estimator from scikit-learn's PR [MRG] Add decision threshold calibration wrapper scikit-learn/scikit-learn#10117.) But since we will already have it implemented for pairs, and maybe for coherence, we could also have it for QuadrupletsLearners.

TODO:

  • Implement tests that check that we can use custom scores in cross-validation (cf. Scoring functions #165). This needs a predict, hence is related to this PR.
  • Implement API tests (that the behaviour is as expected)
  • Implement numerical tests (on examples where we know the f_1 score, etc.)
  • Implement the actual method (cf. features above)
  • Add this to the doc; also mention CalibratedClassifierCV there
  • Think about which scorings make sense with quadruplets and what impact this has on the code
  • Use/Adapt CalibratedClassifierCV or an equivalent for quadruplets
  • Maybe test that CalibratedClassifierCV returns coherent values (e.g. that pairs with predict_proba = 0.8 indeed have 80% success). CalibratedClassifierCV's behaviour should be tested in scikit-learn: in metric-learn we should just test the API (but on a small draft example I tested it for ITML and it worked more or less)
  • Add a case (and a test case) where the best accuracy is reached when we need to reject all points (so threshold = best score + 1). See if this also applies to other strategies

@wdevazelhes (Member Author)

I just added two commits:

In the first commit, I just added a minimal implementation of the threshold, which already fixes #165 (since it gives a predict function to all pairs metric learners).

In the second commit, I also modified LSML to be usable with cross_val_score. To do that I added an optional argument y. This allows using things like cross_val_score(lsml, quadruplets, y, scoring='f1'), which uses scikit-learn's scoring functions and hence needs an explicit y. I know this adds an extra argument that will almost never be used (y), but since we deal well with the default case, and document it, I think it's worth it because it allows using scikit-learn functions out of the box. (The alternative would be to include a cross_val_score specific to LSML in metric-learn's codebase, filling y with ones before doing the actual scoring.)
(Note that there is no need to refactor other parts of the code, because LSML was always used with a custom dataset that set the second argument to np.ones(n_samples) (see build_quadruplets in test_utils.py). Before, it therefore passed weights=np.ones(n_samples); now it passes y=np.ones(n_samples), which is the same (and desired) behaviour.)
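
For illustration, the kind of usage this change aims to enable looks like the toy sketch below (hypothetical data; as discussed further down, the explicit y was eventually dropped again):

import numpy as np
from sklearn.model_selection import cross_val_score
from metric_learn import LSML

rng = np.random.RandomState(0)
quadruplets = rng.randn(40, 4, 3)   # (n_quadruplets, 4, n_features)
y = np.ones(len(quadruplets))       # by convention, every quadruplet is "in the right order"

# with an explicit y, scikit-learn scorers that require true labels become usable
print(cross_val_score(LSML(), quadruplets, y, scoring='f1'))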

Note that I also realized that the score was not computing the accuracy, because the outputs were in fact -1 or 1, not 0 or 1, so I changed it, taking into account the newly introduced y. Note that we use accuracy here instead of roc_auc, which is what _PairsClassifier uses. We could use roc_auc as the default score, what do you think? I don't know what the right default score would be.


Returns
-------
score : float
The quadruplets score.
"""
return -np.mean(self.predict(quadruplets))
quadruplets = check_input(quadruplets, y, type_of_inputs='tuples',
wdevazelhes (Member Author), commenting on the diff above:

Note that here quadruplets will be checked twice (once here, once in predict). This is because when I do y = np.ones(quadruplets.shape[0]), I want to be sure that I can call quadruplets.shape[0]; otherwise, an error message would already have been returned (by the check_input method). I think that's fine, isn't it? I don't see any other way to do it. Note that I also check y at the same time, because I like the fact that the column_or_1d check will be called on y (since it is not done in accuracy_score).

@wdevazelhes (Member Author) commented Feb 6, 2019

I'll fix the errors in the tests, but one of them is that 'roc_auc' is not defined if there is only one class (it fails for quadruplets, I think, because the true labels are np.ones(n_samples)). This makes sense: 'roc_auc' should indeed not be defined for quadruplets, since the task of classifying a quadruplet can be defined arbitrarily with a label +1 and some ordering of the pairs in the quadruplet, or with a label -1 and the opposite ordering. Therefore there is no notion of "positive" or "negative" like in usual classification, I guess.

Plus, there cannot be a threshold other than zero for quadruplets: say we set threshold=1, and the score of a quadruplet is 0.7 (meaning the distance between the two first points is 0.7 plus the distance between the two last points). Then the quadruplet would be classified as -1 (i.e. the two first points are considered "closer" than the two last points). But if we reverse the order of the pairs in the quadruplet, the score would be -0.7, so the classification would again be -1, which is inconsistent with what the classifier said just before (i.e. the two first points, which were the old two last points, are now considered "closer" than the two last points, which were the old two first points).
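
An abstract illustration of this argument (the conventions here are hypothetical, this is not the library's code): the quadruplet score flips sign when the two pairs are swapped, so any nonzero threshold gives the same label to both orderings, which is contradictory, whereas a zero threshold does not.

def predict_with_threshold(score, threshold):
    return 1 if score > threshold else -1

score = 0.7                 # quadruplet in one ordering
score_swapped = -score      # same quadruplet with the two pairs swapped

for t in (1.0, 0.0):
    same = (predict_with_threshold(score, t)
            == predict_with_threshold(score_swapped, t))
    print("threshold=%s: same label for both orderings? %s" % (t, same))
# threshold=1.0 -> True (inconsistent), threshold=0.0 -> False (consistent)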

I think this code is still useful for cross_val_score(..., 'accuracy'), but I'll double-check every score, since I guess some other useful classification scorings will also be problematic for quadruplets.

@wdevazelhes (Member Author)

I was about to write a test for compatibility with scikit-learn's CalibratedClassifierCV, and realized I also need to check whether it would do the right thing for quadruplets.
Even if it doesn't, I think some probability calibration should still be useful for quadruplets, like a function that returns a probability between 0 and 1 such that quadruplets with a probability of 0.2 have an 80% chance of being in the wrong order and a 20% chance of being in the right order. This would also mean that swapping the pairs in a quadruplet should return the opposite probability. I still need to understand CalibratedClassifierCV better to see how it could be used/adapted for quadruplets.

@@ -296,6 +296,7 @@ def get_mahalanobis_matrix(self):

class _PairsClassifierMixin(BaseMetricLearner):

classes_ = [-1, 1]
wdevazelhes (Member Author), commenting on the diff above:

We need estimators to have a classes_ attribute for CalibratedClassifierCV to work. Typically, scikit-learn estimators deduce it at fit time, but in our case we will always have these two classes (see also issue #164). That's why I put it as a class attribute, to be more explicit.

Classes have to be in this order ([-1, 1], i.e. sorted), otherwise there is a bug, because here https://github.com/scikit-learn/scikit-learn/blob/1a850eb5b601f3bf0f88a43090f83c51b3d8c593/sklearn/calibration.py#L312-L313:
self.label_encoder_.classes_ is a sorted list of classes (so it will be [-1, 1] in every case). So when it transforms self.base_estimator.classes_, it will return [0, 1] if our estimator has [-1, 1] as classes, and [1, 0] if it has [1, -1]. In the latter case, it will raise an IndexError, because here (https://github.com/scikit-learn/scikit-learn/blob/1a850eb5b601f3bf0f88a43090f83c51b3d8c593/sklearn/calibration.py#L357) it will try to reach column 1 (and not 0) of Y, which does not exist.

Also, we have to put -1 and 1 in this order so that for label_binarize (which is called by CalibratedClassifierCV), -1 (the first element of [-1, 1]) will be considered the negative label, and 1 the positive label. (That's how label_binarize works, see the example below.)

Example (warning: pos_label and neg_label do not say which input classes are positive or negative, as one could think, but rather how we want them to appear in the output):

In [1]: from sklearn.preprocessing import label_binarize 
   ...: print(label_binarize([-1, 1, -1], [-1, 1], pos_label=2019, neg_label=2018))                                                                                                                                                                                                        
[[2018]
 [2019]
 [2018]]

And:

In [1]: from sklearn.preprocessing import label_binarize 
   ...: print(label_binarize([-1, 1, -1], [1, -1], pos_label=2019, neg_label=2018))                                                                                                                                               
[[2019]
 [2018]
 [2019]]

Member:

Do we need the classes_ attribute in this PR? Otherwise it could be saved for #173

estimator=self, tuple_size=self._tuple_size)
# checked_input will be of the form `(checked_quadruplets, checked_y)` if
# `y` is not None, or just `checked_quadruplets` if `y` is None
quadruplets = checked_input if y is None else checked_input[0]
wdevazelhes (Member Author), commenting on the diff above:

I find this a bit ugly, but I couldn't find another way to do it. Maybe refactor check_input into two functions, like check_X_y and check_array? Or we can just keep it as it is for now.

@@ -25,35 +25,11 @@ def test_predict_only_one_or_minus_one(estimator, build_dataset,
assert np.isin(predictions, [-1, 1]).all()


@pytest.mark.parametrize('with_preprocessor', [True, False])
wdevazelhes (Member Author), commenting on the diff above:

I removed this test because it makes no sense for quadruplets, since the threshold should always be 0.

@bellet (Member) left a comment:

Some comments:

1/ I think we have a problem with decision_function: it is interpreted as "the larger the value, the more likely the positive class". However, since it is a distance, it is the other way around for us, which I think will make the computation of the AUC wrong. I think we should return minus the distance instead. To return the actual distances, we already have score_pairs.

2/ For quadruplet metric learners: I think the only score that makes sense is the accuracy, because each quadruplet contains both a positive and a negative pair. So while the trick of adding an argument y works to call sklearn's accuracy score function, an easy way to avoid users trying other scoring functions would be not to add the y and simply compute the accuracy ourselves. This way it will raise an error when trying to use other sklearn scoring functions. What do you think? I am not sure how we could raise a specific error explaining that only accuracy makes sense for quadruplets. In any case, we could deal with the quadruplet question in a separate PR (see also point 3/ below) and keep this PR only about pairwise metric learners.

3/ For quadruplet metric learners, but also the metric transformers, in terms of use case it would still be useful to allow people to easily get a pair classifier out of the learned distance. For this we could for instance introduce a new kind of pair metric learner which would simply take a predefined metric as input instead of fitting on data (users could then easily set/calibrate the threshold using the existing methods). There may be a better way to do this. We should discuss this separately (I will create an issue).

@@ -296,6 +296,7 @@ def get_mahalanobis_matrix(self):

class _PairsClassifierMixin(BaseMetricLearner):

classes_ = [-1, 1]
_tuple_size = 2 # number of points in a tuple, 2 for pairs
Member, commenting on the diff above:

shouldn't we document the attribute threshold_ here in the general class rather than in the downstream classes?

wdevazelhes (Member Author):

I guess I should document it for both the abstract and the downstream classes, but I don't think I can document it only here: we can inherit docstrings, but I don't think we can inherit them paragraph-wise (e.g. _PairsClassifierMixin has an attribute threshold_ that we want to transmit to MMC, but MMC also has its own title and other attributes like transformer_). I didn't find an option to do this in sphinx, and this limitation makes some sense: if we wrote two docstring paragraphs (below the title), would the result in the downstream class be the concatenation of both descriptions? It could give weird results (e.g. "This is the abstract class" and "This is the MMC class" would give "This is the abstract class. This is the MMC class"). However, there seem to be packages that allow something like this; we could use them later in the development if things get too duplicated: https://github.com/Chilipp/docrep https://github.com/meowklaski/custom_inherit
Then there is also the option of not inheriting the docstrings at all, but I think it's better for users to see all the attributes of their object without having to click through the docs of all parent classes (that's what scikit-learn does, I think).
Finally, there is the option of declaring attributes as properties so that we get more granular inheritance, but I think this adds too much complexity just for docstrings.

wdevazelhes (Member Author):

(done: I just added them to the base PairsClassifier class too)

@wdevazelhes (Member Author)

1/ I think we have a problem with decision_function: it is interpreted as "the larger the value, the more likely the positive class". However, since it is a distance, it is the other way around for us, which I think will make the computation of the AUC wrong. I think we should return minus the distance instead. To return the actual distances, we already have score_pairs.

Yes, absolutely, I just changed it
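
A minimal sketch of the convention adopted here (method names follow the discussion; the exact merged code may differ): decision_function returns minus the distance so that larger means more likely similar, score_pairs keeps returning the raw distances, and threshold_ stays a positive distance.

import numpy as np

class _PairsClassifierSketch:
    def decision_function(self, pairs):
        # minus the distance: larger value = more likely a positive (similar) pair
        return -self.score_pairs(pairs)

    def predict(self, pairs):
        # accept (+1) when the distance is below threshold_, i.e. when the
        # decision function is above -threshold_
        return np.where(self.decision_function(pairs) >= -self.threshold_, 1, -1)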

@wdevazelhes (Member Author)

2/ For quadruplet metric learners: I think the only score that makes sense is the accuracy, because each quadruplet contains both a positive and a negative pair. So while the trick of adding an argument y works to call sklearn's accuracy score function, an easy way to avoid users trying other scoring functions would be not to add the y and simply compute the accuracy ourselves. This way it will raise an error when trying to use other sklearn scoring functions. What do you think? I am not sure how we could raise a specific error explaining that only accuracy makes sense for quadruplets. In any case, we could deal with the quadruplet question in a separate PR (see also point 3/ below) and keep this PR only about pairwise metric learners.

I agree; however, I think functions that use predict_proba (like scikit-learn's Brier loss or negative log loss) could still make sense for quadruplets, assuming we have a predict_proba function for them, since they don't rely on a notion of positive or negative class. But it seems more important to prevent the use of many wrong common scoring functions than to enable two rarely used ones.

Also, all scikit-learn estimators have a y argument even if it is not used, for compatibility with pipelines.
However, I don't think users would want pipelines with LSML.

If we still wanted a way to raise an error when the user calls an unsupported scikit-learn scoring function, we could maybe do something with the inspect module (inspecting which function called predict and raising an error if it is a _score other than accuracy, Brier, negative log loss, or something like that).

But to sum up, this would mean doing something hacky to allow functionalities that are not mainstream (using the probabilistic losses, and calling the scoring with cross_val_score(scoring='accuracy') instead of cross_val_score(scoring=None)). So you're right, for now I'll go with the simpler thing (removing the y), and indeed doing so in another PR seems better.
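
A hedged sketch of the simpler option settled on here (not necessarily the exact merged code): compute the quadruplet accuracy directly in score, without routing through scikit-learn scorers and a dummy y.

import numpy as np

def score(self, quadruplets):
    # accuracy of the quadruplets classifier: the fraction of quadruplets
    # predicted to be in the right order (label +1)
    return np.mean(self.predict(quadruplets) == 1)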

@wdevazelhes (Member Author)

3/ For quadruplet metric learners, but also the metric transformers, in terms of use case it would still be useful to allow people to easily get a pair classifier out of the learned distance. For this we could for instance introduce a new kind of pair metric learner which would simply take a predefined metric as input instead of fitting on data (users could then easily set/calibrate the threshold using the existing methods). There may be a better way to do this. We should discuss this separately (I will create an issue).

I agree, let's deal with this in the new issue

@wdevazelhes (Member Author)

can't we rather keep the threshold positive (i.e., which works with actual distances and not negative distances), and change the predict accordingly? It feels weird to have a negative threshold

Yes I agree, I was hesitating between both, but I guess a negative threshold is weird indeed

@wdevazelhes changed the title from [MRG] Threshold for pairs learners to [WIP] Threshold for pairs learners on Mar 21, 2019
@wdevazelhes (Member Author)

@bellet Thanks for your review, I addressed all your comments

Before merging, there just remains the following:

  • Before implementing the mean between the smallest accepted score and the highest rejected score, we need to decide what to do in the case where we want to accept all points (see comment [MRG+1] Threshold for pairs learners #168 (comment))
  • We need to choose what to do for ITML (threshold with the mean of the bounds or default calibrate threshold)
  • We need to handle, for accuracy (and the other strategies), the extreme-point case (see the TODO)

@bellet (Member) commented Mar 22, 2019

Thanks for the review @bellet !

* It would be nice to have a parameter in `fit` to set the calibration strategy (with accuracy as default)

I agree. I was thinking of going for something like threshold_params=None (None would default to accuracy), which could be, say, {'strategy': 'accuracy'} or {'strategy': 'max_tpr', 'threshold': 0.1}. What do you think?
This way we don't add a lot of arguments to fit, and we don't have to change the code if the API of calibrate_threshold changes.

Yes this is a good idea. I would maybe name this parameter calibration_params
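
For illustration, with the parameter name suggested just above (calibration_params; note the diff below still uses threshold_params, and the dict keys here are only the ones proposed in this thread, so the merged names may differ), a fit-time calibration could be requested like this:

import numpy as np
from metric_learn import MMC

rng = np.random.RandomState(0)
pairs = rng.randn(40, 2, 3)       # (n_pairs, 2, n_features)
y = np.tile([1, -1], 20)          # +1 = similar, -1 = dissimilar

mmc = MMC()
# None would default to accuracy; otherwise pass the strategy and its parameter
mmc.fit(pairs, y, calibration_params={'strategy': 'max_tpr', 'threshold': 0.1})
print(mmc.threshold_)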


Returns
-------
self : object
Returns the instance.
"""
return self._fit(pairs, y, bounds=bounds)
self._fit(pairs, y, bounds=bounds)
self.calibrate_threshold(pairs, y, **(threshold_params if
Member, commenting on the diff above:

For all algorithms, we should test at the beginning of fit that threshold_params has allowed values; otherwise, if bad params have been given, we will fit for nothing (which may take some time)...

wdevazelhes (Member Author):

That's a good point,
done

return self

def calibrate_threshold(self, pairs_valid, y_valid, strategy='accuracy',
threshold=None, beta=None):
Member, commenting on the diff above:

Or desired_rate

cum_accuracy = (cum_tp + cum_tn) / n_samples
imax = np.argmax(cum_accuracy)
# note: we want a positive threshold (distance), so we take - threshold
self.threshold_ = - scores_sorted[imax]
Member, commenting on the diff above:

shouldn't the threshold be positive?

@bellet (Member), Mar 22, 2019:

nevermind, got confused again because I did not read the comment ;-) (this confirms that the comment is useful)

Member:

maybe the comment can be a bit more explicit: "we are working with negative distances but we want the threshold to be with respect to the actual distances so we take minus sign" or something like that

wdevazelhes (Member Author):

agreed, done
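
To make the accuracy calibration in the diff hunk above concrete, here is a minimal standalone sketch (simplified; the merged code also handles the reject-everything case and reuses decision_function). Scores are minus the distances, so the stored threshold is obtained by flipping the sign back.

import numpy as np

def accuracy_threshold(scores, y):
    # scores: decision_function values (minus distances); y in {-1, +1}
    order = np.argsort(-scores)                  # sort scores in decreasing order
    scores_sorted, y_sorted = scores[order], y[order]
    n_samples = len(y)
    # accepting the (i + 1) highest-scoring pairs:
    cum_tp = np.cumsum(y_sorted == 1)                            # true positives among accepted
    cum_tn = np.sum(y_sorted == -1) - np.cumsum(y_sorted == -1)  # true negatives among rejected
    cum_accuracy = (cum_tp + cum_tn) / n_samples
    imax = np.argmax(cum_accuracy)
    # we work with minus distances, but the threshold is stored as a distance
    return -scores_sorted[imax]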

@bellet (Member) commented Mar 22, 2019

I guess that the same reasoning could be applied when setting the threshold for other calibration strategies (other than accuracy)?

You mean the cumsum? It's already used by precision_recall_curve and roc_curve.

You can ignore this, you replied in the proper thread, not sure why it appeared out of context

@bellet (Member) commented Mar 22, 2019

Sorry for the messy review (several tabs opened).

About your questions:

  • Let's do the same for ITML as for all other algorithms for consistency
  • For the extreme-point question, I do not think it matters much. You may just use +1 / -1 to give a safety margin. This is not an interesting case anyway!

cum_tn_inverted = stable_cumsum(y_ordered[::-1] == -1)
cum_tn = np.concatenate([[0], cum_tn_inverted[:-1]])[::-1]
cum_tn = np.concatenate([[0.], cum_tn_inverted])[::-1]
wdevazelhes (Member Author), commenting on the diff above:

We now need to include the last entry of cum_tn_inverted, since it corresponds to the case where all samples are rejected.

y_valid, self.decision_function(pairs_valid), pos_label=1)

wdevazelhes (Member Author), commenting on the diff above:

We need to take part of the code (but not all) from precision_recall_curve here, because precision_recall_curve has a section that deletes points once the maximum recall is attained. That was OK when we set the threshold to the last accepted value, but it is not OK with the new computation of the threshold, because we need all the points to place the threshold at the midpoint between two values. (Otherwise the algorithm can think some point is the last point of the list and set the threshold to that value - 1, whereas it should be the mean between that value and the next one.)
I can detail this more if needed.

return self

fpr, tpr, thresholds = roc_curve(y_valid,
self.decision_function(pairs_valid),
pos_label=1)
pos_label=1, drop_intermediate=False)
@wdevazelhes (Member Author), Mar 27, 2019, commenting on the diff above:

We need drop_intermediate=False here for a problem similar to the one above: drop_intermediate=True worked fine when we set the threshold to the exact value of a point, but if some points are removed, the mean between one point and the next can be wrong (because the "next" point is not the actual next point: the actual next one would have been dropped).
drop_intermediate=False ensures that no point is dropped.
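
For reference, a hedged sketch of a max_tpr-style calibration built on roc_curve with drop_intermediate=False (a standalone illustration, not necessarily the merged code): pick the highest TPR among thresholds whose TNR (= 1 - FPR) satisfies the minimum rate, then flip the sign since the scores are minus distances.

import numpy as np
from sklearn.metrics import roc_curve

def max_tpr_threshold(y_valid, scores, min_tnr=0.1):
    # scores are decision_function values (minus distances)
    fpr, tpr, thresholds = roc_curve(y_valid, scores, pos_label=1,
                                     drop_intermediate=False)
    feasible = (1. - fpr) >= min_tnr   # thresholds meeting the TNR constraint
    # assumes at least one threshold satisfies the constraint
    best = np.argmax(tpr[feasible])
    # flip the sign to store a (positive) distance threshold
    return -thresholds[feasible][best]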

@wdevazelhes (Member Author)

I just added the new computation of the threshold as we discussed. I'll refactor it a bit (right now it works, but it's a bit messy), then document how we set the threshold with the new computation, add the check on the calibration params before fitting, and address the few remaining comments; then we should be good to go.

@bellet (Member) commented Mar 27, 2019

If it remains quite messy, we have to ask ourselves whether this mean strategy is worth the extra complexity in the code

@wdevazelhes (Member Author)

If it remains quite messy, we have to ask ourselves whether this mean strategy is worth the extra complexity in the code

I agree. Also, I think some of the complexity comes from the fact that we use scikit-learn functions and have to deal with the special cases where they remove points from the curves, etc. Maybe re-implementing the code would be simpler, more robust to sklearn's changes, and need fewer comments explaining how we handle special cases; some parts, like the cumsums, could also be shared, so it wouldn't amount to recoding each part from scratch. I'll think about what it would look like this way.

@bellet (Member) commented Mar 28, 2019

If we call sklearn routines in a straightforward way (avoiding edge cases), I think there is no reason to expect breaking changes. So this is an option (i.e., revert to picking a simple returned threshold instead of averaging). Re-implementing everything on ROC/PR curves may be overkill?

@wdevazelhes (Member Author)

Yes, I agree. I'll put the recent changes in a branch of my fork in case we need them one day, but for now we can stick to the first version.

@wdevazelhes (Member Author)

I'll still need to cherry-pick from those changes the possibility of setting the threshold to the worst score - 1: it's the only solution if we want to reject everything (a case that is now also tested).

@wdevazelhes (Member Author)

I came back to the previous implementation of the threshold, added a validation of the calibration params before fit, and addressed all the comments. Just one last question before final review and merge: I noted in the TODO that we could raise a warning when calibrating the threshold on the trainset at fit time. Is it really necessary in the end? We already add a little note about that in the User Guide. Isn't that enough?

@bellet changed the title from [WIP] Threshold for pairs learners to [MRG+1] Threshold for pairs learners on Apr 3, 2019
@bellet (Member) commented Apr 3, 2019

LGTM. The warning is not necessary (and would be quite annoying).
Maybe @perimosocordiae or @terrytangyuan want to have a final look at this; otherwise we can merge.

@bellet merged commit edad55d into scikit-learn-contrib:master on Apr 15, 2019
Successfully merging this pull request may close these issues: Scoring functions, Use a threshold for prediction for pairs.