
Unable to approach loss of less than 0.7 even when testing multiple learning rates. #30

Closed
maxisme opened this issue Apr 23, 2018 · 57 comments

Comments

@maxisme
Contributor

maxisme commented Apr 23, 2018

I have tried many different learning rates and optimizers, but I have not once seen the min loss drop below 0.69.

If I use learning_rate = 1e-2:

iter:    20, loss min|avg|max: 0.713|2.607|60.013, batch-p@3: 4.43%, ETA: 6:01:18 (0.87s/it)
iter:    40, loss min|avg|max: 0.696|1.204|21.239, batch-p@3: 5.99%, ETA: 5:58:53 (0.86s/it)
iter:    60, loss min|avg|max: 0.696|1.643|25.543, batch-p@3: 4.69%, ETA: 5:36:32 (0.81s/it)
iter:    80, loss min|avg|max: 0.695|1.679|42.339, batch-p@3: 7.03%, ETA: 5:58:01 (0.86s/it)
iter:   100, loss min|avg|max: 0.694|1.806|47.572, batch-p@3: 6.51%, ETA: 6:08:57 (0.89s/it)
iter:   120, loss min|avg|max: 0.695|1.200|21.791, batch-p@3: 4.43%, ETA: 6:14:15 (0.90s/it)
iter:   140, loss min|avg|max: 0.694|2.744|87.940, batch-p@3: 5.47%, ETA: 6:21:29 (0.92s/it)

If I use learning_rate = 1e-6:

iter:    20, loss min|avg|max: 0.741|14.827|440.151, batch-p@3: 1.04%, ETA: 6:23:26 (0.92s/it)
iter:    40, loss min|avg|max: 0.712|9.662|146.125, batch-p@3: 2.86%, ETA: 6:03:24 (0.87s/it)
iter:    60, loss min|avg|max: 0.697|3.944|100.707, batch-p@3: 4.17%, ETA: 6:10:44 (0.89s/it)
iter:    80, loss min|avg|max: 0.695|2.408|75.002, batch-p@3: 2.86%, ETA: 5:44:48 (0.83s/it)
iter:   100, loss min|avg|max: 0.694|2.272|67.504, batch-p@3: 2.86%, ETA: 6:03:45 (0.88s/it)
iter:   120, loss min|avg|max: 0.694|1.091|17.292, batch-p@3: 2.86%, ETA: 5:42:45 (0.83s/it)
iter:   140, loss min|avg|max: 0.693|1.069|15.975, batch-p@3: 5.73%, ETA: 5:46:48 (0.84s/it)
...
iter:   900, loss min|avg|max: 0.693|0.694| 0.709, batch-p@3: 2.08%, ETA: 5:15:00 (0.78s/it)
iter:   920, loss min|avg|max: 0.693|0.693| 0.701, batch-p@3: 2.34%, ETA: 5:39:12 (0.85s/it)
iter:   940, loss min|avg|max: 0.693|0.694| 0.704, batch-p@3: 5.99%, ETA: 5:46:12 (0.86s/it)
iter:   960, loss min|avg|max: 0.693|0.693| 0.705, batch-p@3: 2.86%, ETA: 5:24:59 (0.81s/it)
iter:   980, loss min|avg|max: 0.693|0.693| 0.700, batch-p@3: 3.65%, ETA: 5:39:47 (0.85s/it)
iter:  1000, loss min|avg|max: 0.693|0.693| 0.698, batch-p@3: 3.39%, ETA: 5:27:59 (0.82s/it)
iter:  1020, loss min|avg|max: 0.693|0.693| 0.700, batch-p@3: 6.51%, ETA: 5:36:38 (0.84s/it)
iter:  1040, loss min|avg|max: 0.693|0.694| 0.699, batch-p@3: 2.86%, ETA: 5:22:05 (0.81s/it)
...
iter:  1640, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 2.60%, ETA: 5:09:58 (0.80s/it)
iter:  1660, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 2.08%, ETA: 5:48:27 (0.90s/it)
iter:  1680, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 4.43%, ETA: 5:23:23 (0.83s/it)
iter:  1700, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 6.51%, ETA: 5:25:04 (0.84s/it)
iter:  1720, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 3.12%, ETA: 5:39:08 (0.87s/it)

What does this effectively mean? "Nonzero triplets never decreases" — I'm not quite sure what that means.


I am using the vgg dataset with the file structure like this:

class_a/file.jpg
class_b/file.jpg
class_c/file.jpg
...

I set the pids, fids = [], [] like this:

classes = [path for path in os.listdir(DATA_DIR) if os.path.isdir(os.path.join(DATA_DIR, path))]
for c in classes:
    for file in glob.glob(DATA_DIR+c+"/*.jpg"):
        pids.append(c)
        fids.append(file)

where DATA_DIR is the directory of the vgg dataset.
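A quick sanity check that the two lists stay aligned (just a sketch; it only assumes the pids/fids lists built above) is to confirm that every file path sits inside the folder named after its PID:

import os

assert len(pids) == len(fids)
for pid, fid in zip(pids, fids):
    # every file should live directly inside the directory named after its PID
    assert os.path.basename(os.path.dirname(fid)) == pid, (pid, fid)
print("checked %d files, pids and fids are aligned" % len(fids))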

@lucasb-eyer
Member

Hi, I have seen the VGG-Face2 dataset and have always wanted to train on it, but never found the time, unfortunately. So far I have almost always been able to train on any dataset I tried, but for a few especially difficult ones it took considerable time to find good hyperparameters. (I think in the worst case, only a single hyperparameter setting converged!)

You might also want to try another optimizer or, for Adam, it can be necessary to also tune the "epsilon" value, as mentioned in the TensorFlow documentation.

An option to make training more robust, which has been done in many recent papers, is to add a softmax loss to the total loss; this usually helps overcome the "difficult phase", which is what you are seeing.

Finally, this is also what happens when you make a mistake somewhere, for example mixing up PIDs, pre-processing images in a way that corrupts them, or any other mistake that kills the structure in the data.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Yes, the latter is what I am very worried about. Do you know if there is any way to efficiently output an anchor, positive and negative at some point during training? I am new to TF. I have tried the other optimiser you mentioned in the paper, with no difference. For efficiency's sake I am judging whether training is working by whether the min loss drops below 0.69 early on; is that a really poor idea?

Thank you very much for getting back to me!


Currently trying FaceNet's optimizer settings:

optimizer = tf.train.RMSPropOptimizer(learning_rate, decay=0.9, momentum=0.9, epsilon=1.0)

@lucasb-eyer
Member

For completeness, I'm repeating here what I wrote in an e-mail to you. For especially difficult datasets, you really need to try many hyperparameter values to find some that converge. For example, try learning rates 1e-1, 3e-2, 1e-2, 3e-3, ..., 1e-8, each of them for 2-5k updates, and stop early for those that don't go below maybe 0.65 or so. That should be feasible in one night or so.

About outputting things, have a look at "summaries" in TensorFlow; you can output even images and look at them in TensorBoard.
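For instance, a minimal TF 1.x sketch (assuming an `images` tensor, a `loss_mean` tensor and a log directory `out_dir`; adapt the names to your script):

tf.summary.scalar('loss_mean', loss_mean)
tf.summary.image('batch_images', images, max_outputs=4)
merged_summary = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(out_dir, graph=tf.get_default_graph())

# inside the training loop:
#   _, summ, step = sess.run([train_op, merged_summary, global_step])
#   summary_writer.add_summary(summ, step)
# then inspect with: tensorboard --logdir <out_dir>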

It doesn't make sense to try optimizer settings from other papers; every model/dataset combination has its own sweet spot, so if the paper you take a setting from doesn't use the same model and dataset as you, it's only luck whether it works or not. However, RMSProp is a good one to try, too.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Okay, after starting the session I ran:

print(sess.run(fids))
print(sess.run(pids))

this returns:

['/path/to/data-set/n006101/0320_01.jpg'
 '/path/to/data-set/n006101/0131_05.jpg'
 '/path/to/data-set/n006101/0043_01.jpg'
 '/path/to/data-set/n006101/0114_01.jpg'
 '/path/to/data-set/n004359/0174_04.jpg'
 '/path/to/data-set/n004359/0058_01.jpg'
 '/path/to/data-set/n004359/0241_01.jpg'
 '/path/to/data-set/n004359/0175_01.jpg'
 '/path/to/data-set/n007003/0417_05.jpg'
 '/path/to/data-set/n007003/0209_01.jpg'
 '/path/to/data-set/n007003/0077_01.jpg'
 '/path/to/data-set/n007003/0057_01.jpg'
 '/path/to/data-set/n000457/0203_01.jpg'
 '/path/to/data-set/n000457/0159_01.jpg'
 '/path/to/data-set/n000457/0197_01.jpg'
 '/path/to/data-set/n000457/0161_01.jpg'
 '/path/to/data-set/n000549/0363_02.jpg'
 '/path/to/data-set/n000549/0141_01.jpg'
 '/path/to/data-set/n000549/0414_01.jpg'
 '/path/to/data-set/n000549/0328_01.jpg'
 '/path/to/data-set/n006797/0147_02.jpg'
 '/path/to/data-set/n006797/0035_02.jpg'
 '/path/to/data-set/n006797/0101_02.jpg'
 '/path/to/data-set/n006797/0145_01.jpg'
 '/path/to/data-set/n001789/0210_01.jpg'
 '/path/to/data-set/n001789/0012_01.jpg'
 '/path/to/data-set/n001789/0087_01.jpg'
 '/path/to/data-set/n001789/0159_01.jpg'
 '/path/to/data-set/n000473/0074_01.jpg'
 '/path/to/data-set/n000473/0039_02.jpg'
 '/path/to/data-set/n000473/0174_01.jpg'
 '/path/to/data-set/n000473/0211_01.jpg'
 '/path/to/data-set/n000489/0008_01.jpg'
 '/path/to/data-set/n000489/0131_02.jpg'
 '/path/to/data-set/n000489/0176_01.jpg'
 '/path/to/data-set/n000489/0221_01.jpg'
 '/path/to/data-set/n002198/0159_07.jpg'
 '/path/to/data-set/n002198/0033_02.jpg'
 '/path/to/data-set/n002198/0181_03.jpg'
 '/path/to/data-set/n002198/0126_01.jpg'
 '/path/to/data-set/n000777/0445_01.jpg'
 '/path/to/data-set/n000777/0126_01.jpg'
 '/path/to/data-set/n000777/0456_04.jpg'
 '/path/to/data-set/n000777/0392_01.jpg'
 '/path/to/data-set/n007482/0196_03.jpg'
 '/path/to/data-set/n007482/0013_01.jpg'
 '/path/to/data-set/n007482/0344_01.jpg'
 '/path/to/data-set/n007482/0064_01.jpg'
 '/path/to/data-set/n005586/0061_02.jpg'
 '/path/to/data-set/n005586/0100_01.jpg'
 '/path/to/data-set/n005586/0144_01.jpg'
 '/path/to/data-set/n005586/0382_01.jpg'
 '/path/to/data-set/n000944/0317_01.jpg'
 '/path/to/data-set/n000944/0144_01.jpg'
 '/path/to/data-set/n000944/0469_01.jpg'
 '/path/to/data-set/n000944/0030_01.jpg'
 '/path/to/data-set/n003644/0695_01.jpg'
 '/path/to/data-set/n003644/0104_02.jpg'
 '/path/to/data-set/n003644/0032_01.jpg'
 '/path/to/data-set/n003644/0131_01.jpg'
 '/path/to/data-set/n006191/0241_01.jpg'
 '/path/to/data-set/n006191/0186_03.jpg'
 '/path/to/data-set/n006191/0073_01.jpg'
 '/path/to/data-set/n006191/0157_05.jpg'
 '/path/to/data-set/n004641/0269_01.jpg'
 '/path/to/data-set/n004641/0030_01.jpg'
 '/path/to/data-set/n004641/0179_01.jpg'
 '/path/to/data-set/n004641/0132_01.jpg'
 '/path/to/data-set/n000881/0086_01.jpg'
 '/path/to/data-set/n000881/0351_03.jpg'
 '/path/to/data-set/n000881/0233_01.jpg'
 '/path/to/data-set/n000881/0130_01.jpg']
['n000203' 'n000203' 'n000203' 'n000203' 'n009265' 'n009265' 'n009265'
 'n009265' 'n006279' 'n006279' 'n006279' 'n006279' 'n005480' 'n005480'
 'n005480' 'n005480' 'n005396' 'n005396' 'n005396' 'n005396' 'n007609'
 'n007609' 'n007609' 'n007609' 'n002699' 'n002699' 'n002699' 'n002699'
 'n008955' 'n008955' 'n008955' 'n008955' 'n000885' 'n000885' 'n000885'
 'n000885' 'n007587' 'n007587' 'n007587' 'n007587' 'n008725' 'n008725'
 'n008725' 'n008725' 'n006369' 'n006369' 'n006369' 'n006369' 'n008052'
 'n008052' 'n008052' 'n008052' 'n000116' 'n000116' 'n000116' 'n000116'
 'n008270' 'n008270' 'n008270' 'n008270' 'n000668' 'n000668' 'n000668'
 'n000668' 'n006747' 'n006747' 'n006747' 'n006747' 'n002827' 'n002827'
 'n002827' 'n002827']

Am I correct in thinking things have gone terribly wrong, given that the first files are not from class n000203, etc.?

@lucasb-eyer
Member

Yes, good find, it seems things have gone wrong in the dataset preparation somewhere!

PS: Thanks for taking the trouble to move this to a new issue!

@Pandoro
Member

Pandoro commented Apr 23, 2018

You should probably run

print(sess.run([pids, fids]))

instead. The way you call it separately will show you the pids and fids from different batches.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Awkward. Yeah, that works fine :( so what is the problem?!

@lucasb-eyer
Member

@Pandoro is right, I rejoiced too early.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Definitely a good port of call for anyone running into this issue in the future, though. I am going to carry on going through the learning rates, but I have a feeling this is not the solution: since the variance is so massive, you would expect to hit a number below 0.7 very early on, yet that has never happened with ~10 different learning rates.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

I just set the margin to 10.0 and now the min loss is 10. Is that expected? So the soft-margin equivalent is 0.7? And with margin = 'none' my min loss hits 0 and stays at 0.

@Pandoro
Member

Pandoro commented Apr 23, 2018

@maxisme what is happening is that your embeddings are all collapsing to a single point. This is also what we discussed quite extensively in the supplementary material of our paper. When you use a margin-based loss, if all embedding vectors collapse to the same point, your loss will always be equal to the margin. If you use the soft-margin loss, that is why you end up with ln(1 + exp(0)) = ln(2) ≈ 0.693 as the value it converges to.

Also make sure you don't confuse "learning rate" with "loss". Those are fundamentally different things.

Nevertheless, I can just second the things @lucasb-eyer suggested in order to fix the collapsing of the embeddings.
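As a tiny numeric illustration of the point above (a sketch only, not code from the repo): when all embeddings coincide, every anchor-positive and anchor-negative distance is 0, so the loss becomes a constant:

import numpy as np

def triplet_loss(d_pos, d_neg, margin):
    diff = d_pos - d_neg
    if margin == 'soft':
        return np.log1p(np.exp(diff))      # soft-margin (softplus) formulation
    return np.maximum(diff + margin, 0.0)  # hinge with a hard margin

print(triplet_loss(0.0, 0.0, 'soft'))  # ln(2) ~ 0.693, the plateau seen in the logs above
print(triplet_loss(0.0, 0.0, 10.0))    # 10.0, matching "margin 10 gives min loss 10"
# and with no margin at all, positive minus negative is simply 0, matching margin='none'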

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

@Pandoro thank you for that explanation. Does "collapsing all to a single point" mean all the values/embeddings/features output by the CNN are becoming the same no matter the input data (image), i.e. the network weights are becoming the same? Or have I misunderstood?

@lucasb-eyer
Member

All embeddings/output values are becoming the same, yes, but that doesn't mean the network weights are becoming the same (that isn't a very meaningful statement, actually). Please check the paper's appendix.

In addition to what I recommended above (which I highly recommend trying thoroughly!), please also check whether you are actually loading the pre-trained weights; forgetting to do that is another common mistake!

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

I do: https://github.com/VisualComputingInstitute/triplet-reid/blob/master/train.py#L286 but I have deleted L365 through L367 as I am not bothered about checkpointing. Is that what you mean?

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

OMG, I completely misunderstood checkpointing. Well, half misunderstood. Downloading now...

@lucasb-eyer
Member

😆 yes, you deleted exactly the important lines.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

iter:   120, loss min|avg|max: 0.607|1.615| 2.571, batch-p@3: 15.74%, ETA: 3:20:07 (0.48s/it)

Never been happier to see a float under 0.693, although all the others are still above it, haha.

@lucasb-eyer
Member

hahaha congrats! If it does converge, please close the issue, and if you are especially nice, report the scores you get here :)

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Haha. Definitely will do! I probably have the wrong learning rate but I will get there!

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Unfortunately I have been playing with values all day and the mean still effectively converges to 0.693. This is what commonly happens: https://i.imgur.com/3dl8zli.png

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Okay I ran all the learning rates and epsilons:

nums = (1e-9, 3e-9, 1e-8, 3e-8, 1e-7, 3e-7, 1e-6, 3e-6, 1e-5, 3e-5, 1e-4, 3e-4)
for learning_rate in nums:
    for epsilon in nums:
        ...

for 1000 iterations each.

The mean loss never dropped below the dreaded 0.693. I guess it may just be the dataset? 😞

@lucasb-eyer
Member

Given that you removed some lines, did you maybe make any other changes to the code?

Did you try smaller P?

For the ranges that you tried, epsilon probably wants to go larger: as you can see in the TF docs, they mention that values of 1.0 or 0.1 are good for ImageNet, so the range you tried is not a good one. Instead, I'd try something like:

for lr in (1e0, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4, 3e-5, 1e-5, 3e-6, 1e-6):
    for eps in (1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8):
        ...

And, I think I mentioned it already, but people have had great success adding a softmax loss (on a separate embedding) to the triplet loss. I think it should be relatively straightforward; I might even have a look at it myself soon.
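Roughly, such an auxiliary classification loss could look like this (a sketch only; `backbone_out` stands for whatever features the head sees, `pid_labels` are the PIDs mapped to integer class indices, and the 1.0 weight is an arbitrary knob):

softmax_emb = tf.layers.dense(backbone_out, 1024, activation=tf.nn.relu,
                              name='softmax_embedding')
logits = tf.layers.dense(softmax_emb, num_classes, name='pid_logits')
xent_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=pid_labels, logits=logits))

# The classification term keeps gradients flowing while the triplet term is
# stuck in the "difficult phase".
total_loss = loss_mean + 1.0 * xent_loss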

@lucasb-eyer
Member

Finally, I might implement the trick I mentioned here in the next few days; it can be helpful for especially difficult datasets.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Here is my full code. I have cross-referenced it with yours many times now and I don't think there are any major differences; I just removed the args and logging, etc.

#!/usr/bin/env python3
from importlib import import_module
import os, time, glob
import numpy as np
import tensorflow as tf

from train import triplet_loss as loss
from settings import config

################
## variables  ##
################
DATA_DIR = "/root/vggface/train_cropped/"

batch_p = 18
batch_k = 4
learning_rate = 1e-8
epsilon = 1e-9
train_iterations = 600
decay_start_iteration = 450
checkpoint_frequency = 1000
net_input_size = (config.feature_extractor_img_size, config.feature_extractor_img_size)
embedding_dim = 128
margin = 'soft'
metric='euclidean' #sqeuclidean
output_model = config.feature_model_dir + "tmp"
out_dir = output_model + "/save/"
log_every = 20
resume = False  # set to True to resume training from the latest checkpoint

################
##    run     ##
################

# make out directory
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

"""
PIDs are the "person IDs", i.e. class names/labels.
FIDs are the "file IDs", which are individual relative filenames.
"""
pids, fids = [], []
classes = [path for path in os.listdir(DATA_DIR) if os.path.isdir(os.path.join(DATA_DIR, path))]
for c in classes:
    for file in glob.glob(DATA_DIR+c+"/*.jpg"):
        pids.append(c)
        fids.append(file)

# Setup a tf.Dataset where one "epoch" loops over all PIDS.
# PIDS are shuffled after every epoch and continue indefinitely.
unique_pids = np.unique(pids)
dataset = tf.data.Dataset.from_tensor_slices(unique_pids)
dataset = dataset.shuffle(len(unique_pids))

# Constrain the dataset size to a multiple of the batch-size, so that
# we don't get overlap at the end of each epoch.
dataset = dataset.take((len(unique_pids) // batch_p) * batch_p)
dataset = dataset.repeat(None)  # Repeat forever. Funny way of stating it.

# For every PID, get K images.
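# NOTE: sample_k_fids_for_pid (and fid_to_image below) are not defined in this
# file; they are assumed to be available here, e.g. copied or imported from the
# original train.py of the triplet-reid repo.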
dataset = dataset.map(lambda pid: sample_k_fids_for_pid(
    pid, all_fids=fids, all_pids=pids, batch_k=batch_k))

# Ungroup/flatten the batches for easy loading of the files.
dataset = dataset.apply(tf.contrib.data.unbatch())

# Convert filenames to actual image tensors.
dataset = dataset.map(
    lambda fid, pid: fid_to_image(
        fid, pid,
        image_size=net_input_size),
    num_parallel_calls=8)

# Group it back into PK batches.
batch_size = batch_p * batch_k
dataset = dataset.batch(batch_size)

# Overlap producing and consuming for parallelism.
dataset = dataset.prefetch(1)

# Since we repeat the data infinitely, we only need a one-shot iterator.
images, fids, pids = dataset.make_one_shot_iterator().get_next()

# Create the model and an embedding head.
model = import_module('train.resnet_v1_101')
head = import_module('train.fc1024')

# Feed the image through the model. The returned `body_prefix` will be used
# further down to load the pre-trained weights for all variables with this
# prefix.
endpoints, body_prefix = model.endpoints(images, is_training=True)
with tf.name_scope('head'):
    endpoints = head.head(endpoints, embedding_dim, is_training=True)

# Create the loss in two steps:
# 1. Compute all pairwise distances according to the specified metric.
# 2. For each anchor along the first dimension, compute its loss.
dists = loss.cdist(endpoints['emb'], endpoints['emb'], metric=metric)
losses, train_top1, prec_at_k, _, neg_dists, pos_dists = loss.LOSS_CHOICES['batch_hard'](
    dists, pids, margin, batch_precision_at_k=batch_k-1)

# Count the number of active entries, and compute the total batch loss.
num_active = tf.reduce_sum(tf.cast(tf.greater(losses, 1e-5), tf.float32))
loss_mean = tf.reduce_mean(losses)

# These are collected here before we add the optimizer, because depending
# on the optimizer, it might add extra slots, which are also global
# variables, with the exact same prefix.
model_variables = tf.get_collection(
    tf.GraphKeys.GLOBAL_VARIABLES, body_prefix)

# Define the optimizer and the learning-rate schedule.
# Unfortunately, we get NaNs if we don't handle no-decay separately.
global_step = tf.Variable(0, name='global_step', trainable=False)
if 0 <= decay_start_iteration < train_iterations:
    learning_rate = tf.train.exponential_decay(
        learning_rate,
        tf.maximum(0, global_step - decay_start_iteration),
        train_iterations - decay_start_iteration, 0.001)
else:
    learning_rate = learning_rate
tf.summary.scalar('learning_rate', learning_rate)
# optimizer = tf.train.AdamOptimizer(learning_rate, epsilon=epsilon)
optimizer = tf.train.RMSPropOptimizer(learning_rate, epsilon=epsilon)

# Update_ops are used to update batchnorm stats.
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = optimizer.minimize(loss_mean, global_step=global_step)

# Define a saver for the complete model.
checkpoint_saver = tf.train.Saver(max_to_keep=0)

cp = tf.ConfigProto()
cp.gpu_options.allow_growth = True
with tf.Session(config=cp) as sess:
    if resume:
        # In case we're resuming, simply load the full checkpoint to init.
        last_checkpoint = tf.train.latest_checkpoint(out_dir)
        checkpoint_saver.restore(sess, last_checkpoint)
    else:
        sess.run(tf.global_variables_initializer())

        saver = tf.train.Saver(model_variables)
        saver.restore(sess, config.resnet_init)  # load the pre-trained ResNet weights

        checkpoint_saver.save(sess, os.path.join(
            out_dir, 'checkpoint'), global_step=0)

    merged_summary = tf.summary.merge_all()
    start_step = sess.run(global_step)

    # Finally, here comes the main loop. (The original train.py wraps this in an
    # `Uninterrupt` utility so that an iteration still finishes on Ctrl+C and
    # training can be stopped cleanly; that wrapper has been removed here.)
    for i in range(start_step, train_iterations):

        # Compute gradients, update weights, store logs!
        start_time = time.time()
        _, summary, step, b_prec_at_k, b_embs, b_loss, b_fids = \
            sess.run([train_op, merged_summary, global_step,
                      prec_at_k, endpoints['emb'], losses, fids])
        elapsed_time = time.time() - start_time

        if step % log_every == 0 or np.min(b_loss) < 0.693:
            print(str(step)+","+str(np.min(b_loss))+","+str(np.mean(b_loss))+","+str(np.max(b_loss)))

        if np.mean(b_loss) < 0.693:
            print("wahooooooooooooo")


        # Save a checkpoint of training every so often.
        if (checkpoint_frequency > 0 and step % checkpoint_frequency == 0):
            checkpoint_saver.save(sess, os.path.join(out_dir, 'checkpoint'), global_step=step)

    # Store one final checkpoint. This might be redundant, but it is crucial
    # in case intermediate storing was disabled and it saves a checkpoint
    # when the process was interrupted.
    checkpoint_saver.save(sess, os.path.join(out_dir, 'checkpoint'), global_step=step)

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

That would be fantastic. Unfortunately I have until Thursday to implement this. Can I help you at all?

Just read the comment - only two lines of code lol.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

I am sorry, can you explain "softmax loss (on a separate embedding)" in a bit more detail? What is the separate embedding?

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Btw, I have also tried different architectures, with no luck. I am currently trying resnet_v1_101.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Large epsilon values don't ever seem to converge when testing with the AdamOptimizer.

@lucasb-eyer
Member

A few more points:

  1. I'll try to get at least the trick I mentioned in the other issue in this evening.
  2. re softmax: Well, attach a second embedding to the head and use a softmax loss on it. I'll also see if I manage to implement it this evening, but no promises.
  3. You seem to have ignored my suggestion of trying a smaller P? It once helped me on a very hard dataset.
  4. Given that you don't use our exact code, I cannot be sure that you don't have other (possibly silly/simple) mistakes like you had with loading the pre-trained weights.
  5. IIRC you mentioned in an e-mail that you are running a custom detector for cropping. I know it's not your goal, but have you tried to see if it converges with the official dataset's crops?
  6. Large epsilon values typically need to be coupled with larger learning rates. Actually, only a loop such as the one I posted above can give a definitive answer.
  7. I now remember that I once had the surprising experience that only a specific scaling of the input image worked at all! I don't remember with which dataset, but it can make a lot of sense to play with the input image scale. For one, a consistent aspect ratio is very helpful. For another, many people report better success with larger input images. What is your input size? Try some larger sizes and other scalings/crops.
  8. You don't use cropping/flipping augmentation. Why? Especially for "more wild" datasets where the image is not perfectly aligned, crop augmentation makes a lot of sense (see the sketch at the end of this comment).

Again, I believe that I have never encountered a dataset where I couldn't find a single working setup, and I have trained on a lot of datasets so far.
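Re 8, a rough sketch of what crop/flip augmentation could look like in your tf.data pipeline (assuming elements of the form (image, fid, pid) after the fid_to_image map, and reusing your net_input_size; the +32 px pre-crop margin is an arbitrary choice):

def augment(image, fid, pid):
    image = tf.image.random_flip_left_right(image)
    bigger = (net_input_size[0] + 32, net_input_size[1] + 32)
    image = tf.image.resize_images(image, bigger)
    image = tf.random_crop(image, [net_input_size[0], net_input_size[1], 3])
    return image, fid, pid

# insert between the fid_to_image map and dataset.batch(...)
dataset = dataset.map(augment, num_parallel_calls=8)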

@lucasb-eyer
Member

Re 7: How do you get to 160x160? Are all your detections square, too, or do you stretch them? I can imagine stretching destroys quite some information about a face. You could try rescaling the larger side to 160 and padding with grey, or rescaling the smaller side to 160 and cropping the middle. As I said, for me choosing the right one of these once surprisingly made the difference between convergence and stuck training. Also, try other sizes that are divisible by 32 (because that's what resnet does) such as 128x128, 192x192, 224x224.
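In TF the two alternatives could look roughly like this (a sketch only; note the padding here is zeros/black rather than grey, and `target` should ideally be a multiple of 32):

def resize_keep_aspect(image, target, pad=True):
    hw = tf.cast(tf.shape(image)[:2], tf.float32)
    if pad:
        scale = float(target) / tf.reduce_max(hw)  # fit the longer side, then pad
    else:
        scale = float(target) / tf.reduce_min(hw)  # fit the shorter side, then centre-crop
    new_hw = tf.cast(tf.round(hw * scale), tf.int32)
    image = tf.image.resize_images(image, new_hw)
    return tf.image.resize_image_with_crop_or_pad(image, target, target)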

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Absolutely no reason. Detections are not square; FaceNet performs stretching, and I initially thought the same as you. I will give that a try in place of your:

image_resized = tf.image.resize_images(image_decoded, image_size)

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

I think I am going mad 😞. The mean loss is now always returning values of ~2 no matter the learning rate. What have I done?! I need to take a break 😉

@lucasb-eyer
Member

Sorry, I likely won't get to it this evening anymore; I have too much to do! (I'm in the middle of a cross-country move, actually.) I will do it tomorrow at work, first thing.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Ahh, that is okay! Oh jeez! Good luck! Thank you again!

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Can I take back what I said about point 8? A horizontal flip would maybe help, but cropping definitely would not on already-cropped facial data. That is the great thing about pre-processing all the data with a face-localisation model: I can guarantee similar data if it passes through that stage. I was also worried that the crop was a bit too hardcore, so I have added a bit of padding to the output, as the function usually misses ears and fringes etc. (which could be quite helpful for recognition).

@lucasb-eyer
Member

There is no such thing as a "guarantee" in ML, believe me 😄

See #33; especially give the other losses such as batch_all and batch_sample a try. They might solve your problem, although they typically converge to slightly worse results than if you get batch_hard to work.

@maxisme
Contributor Author

maxisme commented Apr 25, 2018

😀 sweet jesus tell me about it!! Thank you!! Currently testing:

loss_mean = tf.reduce_sum(losses) / (1e-33 + tf.count_nonzero(losses, dtype=tf.float32))

No luck so far.

@maxisme
Contributor Author

maxisme commented Apr 25, 2018

Woah, it is scary how much the size of the image affects your mAP in the paper. Was that using stretching similar to:

image_resized = tf.image.resize_images(image_decoded, image_size)
?

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

Is the target loss for this:

loss_mean = tf.reduce_sum(losses) / (1e-33 + tf.count_nonzero(losses, dtype=tf.float32))

equal to 1?

I am understanding this as:

[screenshot of my formula]

where x is 0. Is that correct? I also don't get why a loss of <= x is what we care about; surely it should be when loss = positive - negative that we care about x being 0, not when loss = apply_margin(positive - negative, margin)?

@Pandoro
Member

Pandoro commented Apr 26, 2018

By design the triplet loss can never be smaller than zero, be it with a hard margin or with the soft-margin formulation. Your formula is wrong: the denominator is not the sum over all losses that are bigger than zero, but the number of losses that are non-zero. The idea of this loss is not to "wash out" the actual loss with all the triplets that in fact give a loss of zero, but to take the mean over only the "active" triplets among all possible triplets in the batch.
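A tiny numeric example of the difference (a sketch, not from the paper):

import numpy as np

losses = np.array([0.0, 0.0, 0.0, 0.9, 1.5])  # 3 inactive and 2 active triplets
mean_all = losses.mean()                                       # 0.48, washed out by the zeros
mean_active = losses.sum() / max(np.count_nonzero(losses), 1)  # 1.2, mean over active triplets only
print(mean_all, mean_active)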

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

Does 'active' mean being selected as the a, p and ns? Do you mention this concept in your paper? I cannot find it. Unfortunately I have now tried 'batch_hard', 'batch_sample' and 'batch_all', but none of them are converging 😞.

@lucasb-eyer
Member

To be honest, if none of the three converge after a learning-rate search, I think you have a bug somewhere or your dataset is broken. Neither I nor anyone I know has encountered this situation with any dataset, and collectively we have tried a lot of datasets. The only time this happens is when there's a bug somewhere.

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

Are there any techniques, other than the printing of the pids and fids mentioned initially, that I can use to check for a buggy dataset? I have visually checked about 10% of the folders and in each one the images match the same individual.

@lucasb-eyer
Member

Oh and we implicitly define "active" as non-zero triplets in 3.4; it means triplets which violate the margin, i.e. where n is closer to a than p is, plus margin.

You could try to do classification and see if that works, as it's a completely different thing. But you may also have made other mistakes when extracting your code from ours: maybe pre-processing, data fiddling, or whatever else kills the structure in the data. It's really hard to say; if I could say, I would know what your mistake is 😄

@lucasb-eyer
Member

Alternatively, you can try your code/pipeline with another dataset, maybe a toy one.

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

As in one of toys, or a small, easy dataset? haha!

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

@Pandoro can I ask you to have a quick look at my code: #30 (comment) to see if anything jumps out as wrong?

@Pandoro
Member

Pandoro commented Apr 26, 2018

I just glanced over your code and I think you copied the most relevant parts, although I do wonder why you did that instead of just using our code.

The only thing that strikes me is that the optimizer used in the version you posted is RMSProp. I don't think either @lucasb-eyer or I have ever used this optimizer during training, and I have no idea what epsilon is used for in it; however, I do guess that it will need a completely different learning rate.

Nevertheless, I think you have ignored some key hints that @lucasb-eyer has given you. The image size of 160x160 surely isn't optimal; I think trying something like 256x256 or even 320x320 would be far more appropriate. Also, you should definitely try a smaller K; in the "easiest" case try P=16, K=2. That should make the batch-hard loss a lot easier to optimize, even though I guess it will not converge to a great score in the end.

Finally, on an unrelated note, we pretty much always used ResNet-50 and didn't get real improvements with ResNet-101, apart from it taking more memory and longer to train.

@maxisme
Contributor Author

maxisme commented Apr 28, 2018

I have learnt two important things in the last week: check your dataset about 5,000,000 times or write tests 😉, and don't let the iteration count bias your assumptions at all. I used the parameters below and, as a last effort, let it run overnight:

tech = 'batch_hard'# 'batch_hard' 'batch_sample' 'batch_all'
arch = 'resnet_v1_50'
batch_k = 2
batch_p = 18
learning_rate = 3e-5
epsilon = 1e-8
optimizer_name = 'Adam' # Adam MO RMS
train_iterations = 50000
decay_start_iteration = 20000
checkpoint_frequency = 1000
net_input_size = (256, 256)
embedding_dim = 128
margin = 'soft' # 'soft'
metric='euclidean' #sqeuclidean
output_model = config.feature_model_dir + "tmp"
out_dir = output_model + "/save/"
log_every = 5

By iteration 25,000 pretty much every mean loss is below 0.7, averaging 0.312; the average min loss is < 0.05, but slightly worryingly the max loss has not changed at all (still in the 1-4 range). Looking forward to the implementation!

@lucasb-eyer
Member

lucasb-eyer commented Apr 29, 2018

Check your dataset about 5,000,000 times or write tests 😉

Does that mean you indeed still had an error in your data?

Would you mind sharing a screenshot of a training curve?
I forgot to mention that one, but for much larger datasets it is indeed good to train much longer, and to start decaying much later!

@lucasb-eyer
Member

And you might want to look at the samples causing such huge max-loss after long training, they will either be extremely hard cases (and thus interesting to look at) or errors/noise in the data.

@maxisme
Contributor Author

maxisme commented Apr 29, 2018

@lucasb-eyer I will definitely do that. No, I just thought it was worth a mention, as that was the cause of a bug in the face-localisation model I was training, haha! Yes, I will send it over now; I am just training with slightly different params:

margin: soft 
tech: batch_hard 
arch: resnet_v1_50 
batch_p: 18 
lr: 3e-05 
input: (256, 256) 
metric:euclidean 
epsilon: 1e-07 
optimizer: Adam

I tested over 50,000 iterations with the params:

margin: soft 
tech: batch_hard 
arch: resnet_v1_50 
batch_p: 18 
lr: 0.001 
input: (256, 256) 
metric:euclidean 
epsilon: 1e-08 
optimizer: Adam

and the final mean was the dreaded 0.69

@maxisme
Contributor Author

maxisme commented Apr 29, 2018

Okay, here are the graphs over 30,000 iterations. Sorry for the lack of an x-axis; Excel is rubbish.

Mean Loss

[screenshot: mean loss over iterations]

Max Loss

[screenshot: max loss over iterations]

Min Loss

[screenshot: min loss over iterations]

CSV

With the actual data and iterations (just replace .txt with .csv):

loss.txt

@lucasb-eyer
Member

Thanks, I think that looks better, right? I consider the original issue of "Unable to approach loss of less than 0.7 [...]" to be solved and so am closing it, but feel free to re-open if you disagree.

@maxisme
Contributor Author

maxisme commented Sep 23, 2018

And you might want to look at the samples causing such huge max-loss after long training, they will either be extremely hard cases (and thus interesting to look at) or errors/noise in the data.

Hey, sorry to bring this back to life! How would I go about debugging which files were the 'extremely hard cases'?

@lucasb-eyer
Member

lucasb-eyer commented Oct 28, 2018

It's those which always violate the margin. Edit: or even better, those which are always selected as the hard examples in the batch-hard loss.
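One way to do that (a sketch against the training script you posted earlier, where b_loss and b_fids are already fetched every step):

from collections import Counter
import numpy as np

hard_counts = Counter()

def record_hard_examples(b_loss, b_fids, top_k=5):
    # count how often each file is among the largest batch-hard losses
    for idx in np.argsort(b_loss)[-top_k:]:
        fid = b_fids[idx]
        hard_counts[fid.decode('utf-8') if isinstance(fid, bytes) else fid] += 1

# Call record_hard_examples(b_loss, b_fids) once per iteration; after training,
# hard_counts.most_common(20) lists the files to inspect by eye.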
