Unable to approach loss of less than 0.7 even when testing multiple learning rates. #30
Comments
Hi, I have seen the VGG-Face2 dataset and have always wanted to train on it, but never found the time, unfortunately. So far, I have almost always been able to train on any dataset I have tried, but for a few especially difficult ones, it took considerable time to find good hyperparameters. (I think for the worst case, only a single hyperparameter value converged!) You might also want to try another optimizer or, for Adam, it can be necessary to also tune the "epsilon" value, as mentioned in the TensorFlow documentation. An option to make training more robust, which has been done in many recent papers, is to add a softmax loss to the total loss; this usually helps overcome the "difficult phase", which is what you are seeing. Finally, this is also what happens when you make a mistake somewhere, for example mixing up PIDs, pre-processing images in a way that corrupts them, or whatever other mistakes can happen that kill the structure in the data. |
Yes, the latter is what I am very worried about. Do you know if there is any way to efficiently output an anchor, positive and negative at some point during training? I am new to TF. I have tried the other optimiser you mentioned in the paper; no difference. For efficiency's sake, I am judging whether training is working by whether the min loss drops below 0.69 early on; is this a really poor idea? Thank you very much for getting back to me! Currently trying facenets:
|
For completeness, I'm repeating here what I wrote in an e-mail to you. For especially difficult datasets, you really need to try many hyperparameter values to find some that converge. For example, try learning rates 1e-1, 3e-2, 1e-2, 3e-3, ..., 1e-8, each of them for 2-5k updates, and stop early for those that don't go below maybe 0.65 or so. That should be feasible in one night or so. About outputting things, have a look at "summaries" in TensorFlow; you can output even images and look at them in TensorBoard. It doesn't make sense to try optimizer settings from other papers: every model/dataset combination has its own sweet spot, so if the paper you take one from doesn't use the same model/dataset as you, it's only luck whether the optimizer setting will work or not. However, RMSProp is a good one to try, too. |
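For example, a generic TF 1.x sketch (placeholder tensors, not code from this repo) of how such summaries can be wired up and then viewed in TensorBoard:

```python
import tensorflow as tf

# Generic TF 1.x sketch (not this repo's code): add image and scalar summaries so
# input batches and the loss can be inspected in TensorBoard during training.
images = tf.placeholder(tf.float32, [None, 160, 160, 3], name='images')  # stand-in for the input batch
losses = tf.placeholder(tf.float32, [None], name='losses')               # stand-in for per-triplet losses

tf.summary.image('input_batch', images, max_outputs=4)
tf.summary.scalar('loss/mean', tf.reduce_mean(losses))
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('/tmp/tb_logs')

# Inside the training loop, fetch `merged` together with the train op:
#   summary, _ = sess.run([merged, train_op], feed_dict=...)
#   writer.add_summary(summary, global_step=step)
```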
Okay, after starting the sesh I ran:
this returns:
Am I correct in thinking things have gone terribly wrong, given that the first two files are not class |
Yes, good find, it seems things have gone wrong in the dataset preparation somewhere! PS: Thanks for going to the trouble of moving this to a new issue! |
You should probably run print(sess.run([pids, fids])) instead. The way you call it separately will show you the pids and fids from different batches. |
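To spell out the difference (this just reuses the pids, fids and sess already discussed in the thread; it is an illustration, not new repo code):

```python
# Each sess.run() call pulls a *new* batch from the input pipeline, so these two
# prints show pids from one batch and fids from a different one:
print(sess.run(pids))
print(sess.run(fids))

# Fetching both tensors in a single call returns values from the same batch,
# so the printed pids and fids actually correspond:
print(sess.run([pids, fids]))
```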
Awkward. Yeah works fine :( what is the problem!! |
@Pandoro is right, I rejoiced too early. |
Definitely a good port of call for anyone running into this issue in the future though. I am going to carry on going through the |
I just set the |
@maxisme what is happening is that your embeddings are all collapsing to a single point. This is also what we discussed quite extensively in the supplementary material of our paper. When you use a margin-based loss, if all embedding vectors collapse to the same point, your loss will always be equal to the margin. If you use the soft-margin loss, that is why you end up with ln(1 + exp(0)) ≈ 0.7 as the value it converges to. Also make sure you don't confuse "learning rate" with "loss". Those are fundamentally different things. Nevertheless, I can only second the things @lucasb-eyer suggested in order to fix the collapsing of the embeddings. |
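A quick numerical check of where that plateau value comes from (plain numpy, purely illustrative):

```python
import numpy as np

# If every embedding collapses to the same point, all pairwise distances become zero,
# so the soft-margin (softplus) triplet loss sits at ln(1 + exp(0)) = ln(2) per triplet.
d_ap, d_an = 0.0, 0.0                     # collapsed embeddings: distances are all zero
loss = np.log1p(np.exp(d_ap - d_an))      # softplus of (d_ap - d_an)
print(loss)                               # 0.6931..., the plateau seen during training
```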
@Pandoro thank you for that explanation. Does "collapsing all to a single point" mean all the values/embeddings/features outputted by the CNN are becoming the same no matter the input data (image) - |
All embeddings/output values are becoming the same, yes, but that doesn't mean the network weights are becoming the same (that isn't too meaningful of a statement, actually). Please check the paper's appendix. In addition to what I recommended above (highly recommended to try thoroughly!) please also check whether you are actually loading pre-trained weights; forgetting to do that is another common mistake! |
I do https://github.com/VisualComputingInstitute/triplet-reid/blob/master/train.py#L286 but I have deleted L365 through L367 as I am not bothered about checkpointing. Is that what you mean? |
OMG completely misunderstood checkpointing. Well half misunderstood. Downloading now.... |
😆 you deleted exactly important lines, yes. |
Never been happier to see a float under |
hahaha congrats! If it does converge, please close the issue, and if you are especially nice, report the scores you get here :) |
Haha. Definitely will do! I probably have the wrong learning rate but I will get there! |
Unfortunately I have been playing with values all day and the mean still effectively converges to |
Okay I ran all the learning rates and epsilons:
for 1000 iterations each. Never did the |
Given that you removed some lines, did you maybe make any other changes to the code? Did you try a smaller P? For the ranges that you tried, epsilon might want to go larger; as you can see in the TF docs, they mention that values of 1.0 or 0.1 are good ones for ImageNet, so the range you tried is not a good one. Instead, I'd try something like:
And, I think I mentioned it already, but people have great success adding a softmax loss (on a separate embedding) to the triplet loss. I think it should be relatively straightforward; I might even have a look at it myself soon. |
Finally, I might implement the trick I mentioned here in the coming days; it can be helpful for especially difficult datasets. |
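For reference, a rough sketch of what "a softmax loss on a separate embedding" could look like; the function and variable names below are placeholders and this is not the repo's actual code:

```python
import tensorflow as tf

# Rough sketch (placeholder names): next to the triplet embedding, branch a second
# head off the backbone features and train it with a plain softmax classification
# loss over the training identities; the total loss is then the sum of the triplet
# loss and this cross-entropy term.
def build_heads(backbone_features, num_identities, emb_dim=128):
    # Embedding used by the (batch-hard) triplet loss.
    triplet_emb = tf.layers.dense(backbone_features, emb_dim, name='triplet_emb')
    # Separate embedding that only feeds the identity classifier.
    softmax_emb = tf.layers.dense(backbone_features, emb_dim, activation=tf.nn.relu,
                                  name='softmax_emb')
    logits = tf.layers.dense(softmax_emb, num_identities, name='softmax_logits')
    return triplet_emb, logits

# total_loss = triplet_loss + tf.reduce_mean(
#     tf.nn.sparse_softmax_cross_entropy_with_logits(labels=pid_labels, logits=logits))
```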
Here is my full code. I have cross-referenced it with yours many times now and I don't think it has any major differences, just removed args and logging etc...
|
That would be fantastic. Unfortunately I have until Thursday to implement this. Just read the comment - only two lines of code lol. |
I am sorry, can you explain "softmax loss (on a separate embedding)" in a bit more detail? What is the separate embedding? |
Btw I have also tried different architectures, with no luck. I am currently trying resnet_v1_101 |
Large epsilon values don't ever seem to converge when testing with the |
A few more points:
Again, I believe I have never encountered a dataset for which I couldn't find at least one working setup, and I have trained on a lot of datasets so far. |
Re 7: How do you get to 160x160? Are all your detections square, too, or do you stretch them? I can imagine stretching destroys quite some information about a face. You could try rescaling the larger side to 160 and padding with grey, or rescaling the smaller side to 160 and cropping the middle. As I said, for me choosing the right one of these once surprisingly made the difference between convergence and stuck training. Also, try other sizes that are divisible by 32 (because ResNet downsamples by a factor of 32), such as 128x128, 192x192, 224x224. |
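As an illustration of those two alternatives (a PIL-based sketch with made-up helper names, not the repo's input pipeline):

```python
from PIL import Image

def resize_pad(img, size=160, fill=(127, 127, 127)):
    # Rescale the larger side to `size`, then pad the shorter side with grey.
    w, h = img.size
    scale = float(size) / max(w, h)
    img = img.resize((max(1, int(round(w * scale))), max(1, int(round(h * scale)))),
                     Image.BILINEAR)
    canvas = Image.new('RGB', (size, size), fill)
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

def resize_center_crop(img, size=160):
    # Rescale the smaller side to `size`, then crop out the middle.
    w, h = img.size
    scale = float(size) / min(w, h)
    img = img.resize((int(round(w * scale)), int(round(h * scale))), Image.BILINEAR)
    left, top = (img.width - size) // 2, (img.height - size) // 2
    return img.crop((left, top, left + size, top + size))
```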
Absolutely no reason. Detections are not square. FaceNet performs stretching... though I initially thought the same as you. Will give that a try as a replacement for your:
|
I think I am going mad 😞 . The mean loss is now always returning values ~2 no matter the learning rate. What have I done...!! I need to take a break 😉 |
Sorry, I likely won't get to it this evening anymore, I have too much to do! (I'm in the middle of a cross-country move actually.) Will do it tomorrow at work first thing. |
Ahh, that is okay! Oh jeez! Good luck! Thank you again! |
Can I take back what I said about part 8? A horizontal flip would maybe help, but cropping definitely would not on already-cropped facial data. That is the great thing about pre-processing all the data with a face localisation model: I can guarantee similar data if it passes through that stage. I was worried that it was a bit too hardcore of a crop too, so I have also added a bit of padding to the output, as the function usually misses out ears and fringes etc... (could be quite helpful for recognition) |
There is no such thing as a "guarantee" in ML, believe me 😄 See #33; especially give the other losses such as batch_all and batch_sample a try, they might solve your problem, although they typically converge to slightly worse results than if you got batch_hard to work. |
😀 sweet jesus tell me about it!! Thank you!! Currently testing:
No luck so far. |
Woah, scary how much the size of the image affects your mAP in the paper. Was that using stretching similar to Line 155 in 250eb17
|
Is the target loss for this:
1? I am understanding this as: where x is 0. Is that correct? I also don't get why a loss of <= x is something to care about; surely it should be when |
By design the triplet loss can never be smaller than zero, be it with a margin or with a soft-margin formulation. Your formula is wrong: the denominator is not the sum over all losses that are bigger than zero, but the number of non-zero losses. The idea of this loss is to not "wash out" the actual loss with all the triplets that are in fact giving a loss of zero, but to take the mean over only the "active" triplets among all the possible triplets in the batch. |
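A minimal sketch of that "mean over active triplets" idea, with made-up distances and variable names (not the repo's exact code):

```python
import tensorflow as tf

# Illustrative anchor-positive / anchor-negative distances for four triplets.
d_ap = tf.constant([0.3, 0.9, 0.1, 0.5])
d_an = tf.constant([1.0, 0.8, 0.7, 0.4])
margin = 0.2

# Per-triplet hinge loss: zero whenever the margin is already satisfied.
losses = tf.maximum(0.0, margin + d_ap - d_an)

# A plain mean gets "washed out" by the triplets whose loss is already zero.
mean_all = tf.reduce_mean(losses)

# Mean over active triplets: divide by the *count* of non-zero losses.
num_active = tf.reduce_sum(tf.cast(tf.greater(losses, 1e-5), tf.float32))
mean_active = tf.reduce_sum(losses) / tf.maximum(num_active, 1.0)

with tf.Session() as sess:
    print(sess.run([losses, mean_all, mean_active]))
```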
Does 'active' mean being selected as the a, p and ns? Do you mention this concept in your paper? I cannot find it. Unfortunately I have now tried 'batch_hard', 'batch_sample' and 'batch_all' but none of them are converging 😞 . |
To be honest, if none of the three converge after learning-rate search, I think you have a bug somewhere or your dataset is broken. Neither I, nor anyone I know has encountered this situation with any dataset that I know of, and collectively we have tried a lot of datasets. The only time this happens is when there's a bug somewhere. |
Are there any techniques, other than the initially mentioned printing of the pids and fids, that I can use to check for a buggy dataset? I have visually checked about 10% and all the folders match the same individual. |
Oh, and we implicitly define "active" as non-zero triplets in 3.4; it means triplets which still violate the margin, i.e. where the loss is non-zero. You could try to do classification and see if that works, as it's a completely different thing. But you may also have made other mistakes when extracting your code from ours. Maybe pre-processing, data fiddling, whatever kills structure in the data. It's really hard to say; if I could say, I would know what your mistake is 😄 |
Alternatively, you can try your code/pipeline with another dataset, maybe a toy one. |
As in a dataset of toys, or a small easy dataset? haha! |
@Pandoro can I ask you to have a quick look at my code: #30 (comment) to see if anything jumps out as wrong? |
I just glanced over your code and I think you copied the most relevant parts, although I do wonder why you did that instead of just using our code. The only thing that strikes me is the fact that the optimizer used in the version you posted is RMSProp; I don't think either @lucasb-eyer or I have ever used this optimizer during training, and I have no idea what epsilon is used for in that optimizer, but I do guess it will need a completely different learning rate. Nevertheless, I think you have ignored some key hints that @lucasb-eyer has given you. The image size of 160x160 surely isn't optimal; I think trying something like 256x256 or even 320x320 would be way more appropriate. Also, you should definitely try a smaller K, in the "easiest" case P=16, K=2. That should make the batch-hard loss a lot easier to optimize, even though it will not converge to some great score in the end, I guess. Finally, on an unrelated note, we pretty much always used the ResNet-50 and didn't get real improvements with the ResNet-101, apart from it taking more memory and longer to train. |
I have learnt two important things in the last week: check your dataset about 5,000,000 times or write tests 😉, and don't let iteration count impact your assumptions at all. I used the parameters and, as a last effort, let it run overnight:
by iteration 25,000 pretty much every |
Does that mean you indeed still had an error in your data? Would you mind sharing a screenshot of a training curve? |
And you might want to look at the samples causing such a huge max-loss after long training; they will either be extremely hard cases (and thus interesting to look at) or errors/noise in the data. |
@lucasb-eyer I will definitely do that. No, I just thought it was worth a mention, as that was the cause of a bug in the face localisation model I was training haha! Yes, will send it over now, just training with slightly different params:
I tested over 50,000 iterations with the params:
and the final mean was the dreaded 0.69 |
Okay, here are the graphs over 30,000 iterations, sorry for the lack of an x axis, Excel is rubbish. [Graphs: Mean Loss, Max Loss, Min Loss] CSV with the actual data and iterations (just replace .txt with .csv): |
Thanks, I think that looks better, right? I take this to mean the original issue of "Unable to approach loss of less than 0.7 [...]" is solved and so am closing it, but feel free to re-open if you disagree. |
Hey, Sorry to bring this back to life! How would I go about debugging which files were the 'extremely hard cases'? |
It's those which always violate the margin. Edit: or even better, those which are always selected as the hard examples in the batch-hard loss. |
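One possible way to surface those during training, sketched with assumed tensor names (dists, same_pid and fids are illustrative, not necessarily the repo's exact variables):

```python
import tensorflow as tf

# Assumed names: `dists` is the (B, B) pairwise distance matrix for a batch,
# `same_pid` a boolean (B, B) mask of same-identity pairs, and `fids` the B file
# names in the batch. Fetching the returned tensors alongside the loss shows
# which files keep getting picked as the hard positives/negatives.
def hardest_fids(dists, same_pid, fids):
    neg_inf = tf.fill(tf.shape(dists), float('-inf'))
    pos_inf = tf.fill(tf.shape(dists), float('inf'))
    hardest_pos = tf.argmax(tf.where(same_pid, dists, neg_inf), axis=1)  # furthest same-pid image
    hardest_neg = tf.argmin(tf.where(same_pid, pos_inf, dists), axis=1)  # closest other-pid image
    return tf.gather(fids, hardest_pos), tf.gather(fids, hardest_neg)

# e.g.  hard_pos_files, hard_neg_files = sess.run(hardest_fids(dists, same_pid, fids))
```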
I have tried many different learning rates and optimizers, but I have not once seen a min loss drop below 0.69.
If I use learning_rate = 1e-2:
If I use learning_rate = 1e-6:
What does this effectively mean? "Nonzero triplets never decreases" - not quite sure what that means?
I am using the vgg dataset with the file structure like this:
I set the pids, fids = [], [] like this:
where DATA_DIR is the directory of the vgg dataset.
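For anyone landing here later, a hypothetical sketch of building such pids/fids lists from a folder-per-identity layout (DATA_DIR/<pid>/<image>.jpg); the names and path are illustrative, not the code actually used in this issue:

```python
import os

DATA_DIR = '/path/to/vggface2/train'  # placeholder path

# One pid (the folder name) and one fid (the path relative to DATA_DIR) per image;
# every fid must be paired with the correct pid, otherwise training cannot converge.
pids, fids = [], []
for pid in sorted(os.listdir(DATA_DIR)):
    person_dir = os.path.join(DATA_DIR, pid)
    if not os.path.isdir(person_dir):
        continue
    for fname in sorted(os.listdir(person_dir)):
        pids.append(pid)
        fids.append(os.path.join(pid, fname))
```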