
Unable to approach loss of less than 0.7 even when testing multiple learning rates. #30

Closed
maxisme opened this issue Apr 23, 2018 · 57 comments

Comments

@maxisme
Contributor

maxisme commented Apr 23, 2018

I have tried many different learning rates and optimizers, but I have not once seen the min loss drop below 0.69.

If I use learning_rate = 1e-2:

iter:    20, loss min|avg|max: 0.713|2.607|60.013, batch-p@3: 4.43%, ETA: 6:01:18 (0.87s/it)
iter:    40, loss min|avg|max: 0.696|1.204|21.239, batch-p@3: 5.99%, ETA: 5:58:53 (0.86s/it)
iter:    60, loss min|avg|max: 0.696|1.643|25.543, batch-p@3: 4.69%, ETA: 5:36:32 (0.81s/it)
iter:    80, loss min|avg|max: 0.695|1.679|42.339, batch-p@3: 7.03%, ETA: 5:58:01 (0.86s/it)
iter:   100, loss min|avg|max: 0.694|1.806|47.572, batch-p@3: 6.51%, ETA: 6:08:57 (0.89s/it)
iter:   120, loss min|avg|max: 0.695|1.200|21.791, batch-p@3: 4.43%, ETA: 6:14:15 (0.90s/it)
iter:   140, loss min|avg|max: 0.694|2.744|87.940, batch-p@3: 5.47%, ETA: 6:21:29 (0.92s/it)

If I use learning_rate = 1e-6:

iter:    20, loss min|avg|max: 0.741|14.827|440.151, batch-p@3: 1.04%, ETA: 6:23:26 (0.92s/it)
iter:    40, loss min|avg|max: 0.712|9.662|146.125, batch-p@3: 2.86%, ETA: 6:03:24 (0.87s/it)
iter:    60, loss min|avg|max: 0.697|3.944|100.707, batch-p@3: 4.17%, ETA: 6:10:44 (0.89s/it)
iter:    80, loss min|avg|max: 0.695|2.408|75.002, batch-p@3: 2.86%, ETA: 5:44:48 (0.83s/it)
iter:   100, loss min|avg|max: 0.694|2.272|67.504, batch-p@3: 2.86%, ETA: 6:03:45 (0.88s/it)
iter:   120, loss min|avg|max: 0.694|1.091|17.292, batch-p@3: 2.86%, ETA: 5:42:45 (0.83s/it)
iter:   140, loss min|avg|max: 0.693|1.069|15.975, batch-p@3: 5.73%, ETA: 5:46:48 (0.84s/it)
...
iter:   900, loss min|avg|max: 0.693|0.694| 0.709, batch-p@3: 2.08%, ETA: 5:15:00 (0.78s/it)
iter:   920, loss min|avg|max: 0.693|0.693| 0.701, batch-p@3: 2.34%, ETA: 5:39:12 (0.85s/it)
iter:   940, loss min|avg|max: 0.693|0.694| 0.704, batch-p@3: 5.99%, ETA: 5:46:12 (0.86s/it)
iter:   960, loss min|avg|max: 0.693|0.693| 0.705, batch-p@3: 2.86%, ETA: 5:24:59 (0.81s/it)
iter:   980, loss min|avg|max: 0.693|0.693| 0.700, batch-p@3: 3.65%, ETA: 5:39:47 (0.85s/it)
iter:  1000, loss min|avg|max: 0.693|0.693| 0.698, batch-p@3: 3.39%, ETA: 5:27:59 (0.82s/it)
iter:  1020, loss min|avg|max: 0.693|0.693| 0.700, batch-p@3: 6.51%, ETA: 5:36:38 (0.84s/it)
iter:  1040, loss min|avg|max: 0.693|0.694| 0.699, batch-p@3: 2.86%, ETA: 5:22:05 (0.81s/it)
...
iter:  1640, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 2.60%, ETA: 5:09:58 (0.80s/it)
iter:  1660, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 2.08%, ETA: 5:48:27 (0.90s/it)
iter:  1680, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 4.43%, ETA: 5:23:23 (0.83s/it)
iter:  1700, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 6.51%, ETA: 5:25:04 (0.84s/it)
iter:  1720, loss min|avg|max: 0.693|0.693| 0.694, batch-p@3: 3.12%, ETA: 5:39:08 (0.87s/it)

What does this effectively mean? "Nonzero triplets never decreases" — I'm not quite sure what that means.


I am using the vgg dataset with the file structure like this:

class_a/file.jpg
class_b/file.jpg
class_c/file.jpg
...

I set the pids, fids = [], [] like this:

classes = [path for path in os.listdir(DATA_DIR) if os.path.isdir(os.path.join(DATA_DIR, path))]
for c in classes:
    for file in glob.glob(DATA_DIR+c+"/*.jpg"):
        pids.append(c)
        fids.append(file)

where DATA_DIR is the directory of the vgg dataset.
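A quick sanity check that the two lists stay aligned (just a sketch; it only assumes the pids/fids lists built above) is to confirm that every file path sits inside the folder named after its PID:

import os

assert len(pids) == len(fids)
for pid, fid in zip(pids, fids):
    # every file should live directly inside the directory named after its PID
    assert os.path.basename(os.path.dirname(fid)) == pid, (pid, fid)
print("checked %d files, pids and fids are aligned" % len(fids))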

@lucasb-eyer
Member

Hi, I have seen the VGG-Face2 dataset and have always wanted to train on it, but never found the time, unfortunately. So far I have almost always been able to train on any dataset I tried, but for a few especially difficult ones it took considerable time to find good hyperparameters. (I think in the worst case, only a single hyperparameter setting converged!)

You might also want to try another optimizer or, for Adam, it can be necessary to also tune the "epsilon" value, as mentioned in the TensorFlow documentation.

An option to make training more robust, which has been done in many recent papers, is to add a softmax loss to the total loss; this usually helps overcome the "difficult phase", which is what you are seeing.

Finally, this is also what happens when you make a mistake somewhere, for example mixing up PIDs, pre-processing images in a way that corrupts them, or any other mistake that kills the structure in the data.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Yes, the latter is what I am very worried about. Do you know if there is any way to efficiently output an anchor, positive and negative at some point during training? I am new to TF. I have tried the other optimiser you mentioned in the paper, with no difference. For efficiency's sake I am judging whether training is working by whether the min loss drops below 0.69 early on; is that a really poor idea?

Thank you very much for getting back to me!


Currently trying FaceNet's optimizer settings:

optimizer = tf.train.RMSPropOptimizer(learning_rate, decay=0.9, momentum=0.9, epsilon=1.0)

@lucasb-eyer
Member

For completeness, I'm repeating here what I wrote in an e-mail to you. For especially difficult datasets, you really need to try many hyperparameter values to find some that converge. For example, try learning rates 1e-1, 3e-2, 1e-2, 3e-3, ..., 1e-8, each of them for 2-5k updates, and stop early for those that don't go below maybe 0.65 or so. That should be feasible in one night or so.

About outputting things, have a look at "summaries" in TensorFlow; you can output even images and look at them in TensorBoard.
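For instance, a minimal TF 1.x sketch (assuming an `images` tensor, a `loss_mean` tensor and a log directory `out_dir`; adapt the names to your script):

tf.summary.scalar('loss_mean', loss_mean)
tf.summary.image('batch_images', images, max_outputs=4)
merged_summary = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(out_dir, graph=tf.get_default_graph())

# inside the training loop:
#   _, summ, step = sess.run([train_op, merged_summary, global_step])
#   summary_writer.add_summary(summ, step)
# then inspect with: tensorboard --logdir <out_dir>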

It doesn't make sense to try optimizer settings from other papers; every model/dataset combination has its own sweet spot, so if the paper you take a setting from doesn't use the same model and dataset as you, it's only luck whether it works or not. However, RMSProp is a good one to try, too.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Okay, after starting the session I ran:

print(sess.run(fids))
print(sess.run(pids))

this returns:

['/path/to/data-set/n006101/0320_01.jpg'
 '/path/to/data-set/n006101/0131_05.jpg'
 '/path/to/data-set/n006101/0043_01.jpg'
 '/path/to/data-set/n006101/0114_01.jpg'
 '/path/to/data-set/n004359/0174_04.jpg'
 '/path/to/data-set/n004359/0058_01.jpg'
 '/path/to/data-set/n004359/0241_01.jpg'
 '/path/to/data-set/n004359/0175_01.jpg'
 '/path/to/data-set/n007003/0417_05.jpg'
 '/path/to/data-set/n007003/0209_01.jpg'
 '/path/to/data-set/n007003/0077_01.jpg'
 '/path/to/data-set/n007003/0057_01.jpg'
 '/path/to/data-set/n000457/0203_01.jpg'
 '/path/to/data-set/n000457/0159_01.jpg'
 '/path/to/data-set/n000457/0197_01.jpg'
 '/path/to/data-set/n000457/0161_01.jpg'
 '/path/to/data-set/n000549/0363_02.jpg'
 '/path/to/data-set/n000549/0141_01.jpg'
 '/path/to/data-set/n000549/0414_01.jpg'
 '/path/to/data-set/n000549/0328_01.jpg'
 '/path/to/data-set/n006797/0147_02.jpg'
 '/path/to/data-set/n006797/0035_02.jpg'
 '/path/to/data-set/n006797/0101_02.jpg'
 '/path/to/data-set/n006797/0145_01.jpg'
 '/path/to/data-set/n001789/0210_01.jpg'
 '/path/to/data-set/n001789/0012_01.jpg'
 '/path/to/data-set/n001789/0087_01.jpg'
 '/path/to/data-set/n001789/0159_01.jpg'
 '/path/to/data-set/n000473/0074_01.jpg'
 '/path/to/data-set/n000473/0039_02.jpg'
 '/path/to/data-set/n000473/0174_01.jpg'
 '/path/to/data-set/n000473/0211_01.jpg'
 '/path/to/data-set/n000489/0008_01.jpg'
 '/path/to/data-set/n000489/0131_02.jpg'
 '/path/to/data-set/n000489/0176_01.jpg'
 '/path/to/data-set/n000489/0221_01.jpg'
 '/path/to/data-set/n002198/0159_07.jpg'
 '/path/to/data-set/n002198/0033_02.jpg'
 '/path/to/data-set/n002198/0181_03.jpg'
 '/path/to/data-set/n002198/0126_01.jpg'
 '/path/to/data-set/n000777/0445_01.jpg'
 '/path/to/data-set/n000777/0126_01.jpg'
 '/path/to/data-set/n000777/0456_04.jpg'
 '/path/to/data-set/n000777/0392_01.jpg'
 '/path/to/data-set/n007482/0196_03.jpg'
 '/path/to/data-set/n007482/0013_01.jpg'
 '/path/to/data-set/n007482/0344_01.jpg'
 '/path/to/data-set/n007482/0064_01.jpg'
 '/path/to/data-set/n005586/0061_02.jpg'
 '/path/to/data-set/n005586/0100_01.jpg'
 '/path/to/data-set/n005586/0144_01.jpg'
 '/path/to/data-set/n005586/0382_01.jpg'
 '/path/to/data-set/n000944/0317_01.jpg'
 '/path/to/data-set/n000944/0144_01.jpg'
 '/path/to/data-set/n000944/0469_01.jpg'
 '/path/to/data-set/n000944/0030_01.jpg'
 '/path/to/data-set/n003644/0695_01.jpg'
 '/path/to/data-set/n003644/0104_02.jpg'
 '/path/to/data-set/n003644/0032_01.jpg'
 '/path/to/data-set/n003644/0131_01.jpg'
 '/path/to/data-set/n006191/0241_01.jpg'
 '/path/to/data-set/n006191/0186_03.jpg'
 '/path/to/data-set/n006191/0073_01.jpg'
 '/path/to/data-set/n006191/0157_05.jpg'
 '/path/to/data-set/n004641/0269_01.jpg'
 '/path/to/data-set/n004641/0030_01.jpg'
 '/path/to/data-set/n004641/0179_01.jpg'
 '/path/to/data-set/n004641/0132_01.jpg'
 '/path/to/data-set/n000881/0086_01.jpg'
 '/path/to/data-set/n000881/0351_03.jpg'
 '/path/to/data-set/n000881/0233_01.jpg'
 '/path/to/data-set/n000881/0130_01.jpg']
['n000203' 'n000203' 'n000203' 'n000203' 'n009265' 'n009265' 'n009265'
 'n009265' 'n006279' 'n006279' 'n006279' 'n006279' 'n005480' 'n005480'
 'n005480' 'n005480' 'n005396' 'n005396' 'n005396' 'n005396' 'n007609'
 'n007609' 'n007609' 'n007609' 'n002699' 'n002699' 'n002699' 'n002699'
 'n008955' 'n008955' 'n008955' 'n008955' 'n000885' 'n000885' 'n000885'
 'n000885' 'n007587' 'n007587' 'n007587' 'n007587' 'n008725' 'n008725'
 'n008725' 'n008725' 'n006369' 'n006369' 'n006369' 'n006369' 'n008052'
 'n008052' 'n008052' 'n008052' 'n000116' 'n000116' 'n000116' 'n000116'
 'n008270' 'n008270' 'n008270' 'n008270' 'n000668' 'n000668' 'n000668'
 'n000668' 'n006747' 'n006747' 'n006747' 'n006747' 'n002827' 'n002827'
 'n002827' 'n002827']

Am I correct in thinking things have gone terribly wrong, given that the first files are not from class n000203, etc.?

@lucasb-eyer
Member

Yes, good find, it seems things have gone wrong in the dataset preparation somewhere!

PS: Thanks for taking the trouble to move this to a new issue!

@Pandoro
Member

Pandoro commented Apr 23, 2018

You should probably run

print(sess.run([pids, fids]))

instead. The way you call it separately will show you the pids and fids from different batches.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Awkward. Yeah, that works fine :( so what is the problem?!

@lucasb-eyer
Member

@Pandoro is right, I rejoiced too early.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Definitely a good port of call for anyone running into this issue in the future, though. I am going to carry on going through the learning rates, but I have a feeling this is not the solution: since the variance is so massive, you would expect to hit a number below 0.7 very early on, yet that has never happened with ~10 different learning rates.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

I just set the margin to 10.0 and now the min loss is 10. Is that expected? So the soft-margin equivalent is 0.7? And with margin = 'none' my min loss hits 0 and stays at 0.

@Pandoro
Member

Pandoro commented Apr 23, 2018

@maxisme what is happening is that your embeddings are all collapsing to a single point. This is also what we discussed quite extensively in the supplementary material of our paper. When you use a margin-based loss, if all embedding vectors collapse to the same point, your loss will always be equal to the margin. If you use the soft-margin loss, that is why you end up with ln(1 + exp(0)) = ln(2) ≈ 0.693 as the value it converges to.

Also make sure you don't confuse "learning rate" with "loss". Those are fundamentally different things.

Nevertheless, I can just second the things @lucasb-eyer suggested in order to fix the collapsing of the embeddings.
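As a tiny numeric illustration of the point above (a sketch only, not code from the repo): when all embeddings coincide, every anchor-positive and anchor-negative distance is 0, so the loss becomes a constant:

import numpy as np

def triplet_loss(d_pos, d_neg, margin):
    diff = d_pos - d_neg
    if margin == 'soft':
        return np.log1p(np.exp(diff))      # soft-margin (softplus) formulation
    return np.maximum(diff + margin, 0.0)  # hinge with a hard margin

print(triplet_loss(0.0, 0.0, 'soft'))  # ln(2) ~ 0.693, the plateau seen in the logs above
print(triplet_loss(0.0, 0.0, 10.0))    # 10.0, matching "margin 10 gives min loss 10"
# and with no margin at all, positive minus negative is simply 0, matching margin='none'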

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

@Pandoro thank you for that explanation. Does "collapsing all to a single point" mean all the values/embeddings/features output by the CNN are becoming the same no matter the input data (image), i.e. the network weights are becoming the same? Or have I misunderstood?

@lucasb-eyer
Member

All embeddings/output values are becoming the same, yes, but that doesn't mean the network weights are becoming the same (that isn't a very meaningful statement, actually). Please check the paper's appendix.

In addition to what I recommended above (which I highly recommend trying thoroughly!), please also check whether you are actually loading the pre-trained weights; forgetting to do that is another common mistake!

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

I do: https://github.com/VisualComputingInstitute/triplet-reid/blob/master/train.py#L286 but I have deleted L365 through L367 as I am not bothered about checkpointing. Is that what you mean?

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

OMG, I completely misunderstood checkpointing. Well, half misunderstood. Downloading now...

@lucasb-eyer
Member

😆 yes, you deleted exactly the important lines.

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

iter:   120, loss min|avg|max: 0.607|1.615| 2.571, batch-p@3: 15.74%, ETA: 3:20:07 (0.48s/it)

Never been happier to see a float under 0.693, although all the others are still above it, haha.

@lucasb-eyer
Member

hahaha congrats! If it does converge, please close the issue, and if you are especially nice, report the scores you get here :)

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Haha. Definitely will do! I probably have the wrong learning rate but I will get there!

@maxisme
Contributor Author

maxisme commented Apr 23, 2018

Unfortunately I have been playing with values all day and the mean still effectively converges to 0.693. This is what commonly happens: https://i.imgur.com/3dl8zli.png

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Okay I ran all the learning rates and epsilons:

nums = (1e-9, 3e-9, 1e-8, 3e-8, 1e-7, 3e-7, 1e-6, 3e-6, 1e-5, 3e-5, 1e-4, 3e-4)
for learning_rate in nums:
    for epsilon in nums:
        ...

for 1000 iterations each.

The mean loss never dropped below the dreaded 0.693. I guess it may just be the dataset? 😞

@lucasb-eyer
Member

Given that you removed some lines, did you maybe make any other changes to the code?

Did you try smaller P?

For the ranges that you tried, epsilon probably wants to go larger: as you can see in the TF docs, they mention that values of 1.0 or 0.1 are good for ImageNet, so the range you tried is not a good one. Instead, I'd try something like:

for lr in (1e0, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4, 3e-5, 1e-5, 3e-6, 1e-6):
    for eps in (1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8):
        ...

And, I think I mentioned it already, but people have had great success adding a softmax loss (on a separate embedding) to the triplet loss. I think it should be relatively straightforward; I might even have a look at it myself soon.
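Roughly, such an auxiliary classification loss could look like this (a sketch only; `backbone_out` stands for whatever features the head sees, `pid_labels` are the PIDs mapped to integer class indices, and the 1.0 weight is an arbitrary knob):

softmax_emb = tf.layers.dense(backbone_out, 1024, activation=tf.nn.relu,
                              name='softmax_embedding')
logits = tf.layers.dense(softmax_emb, num_classes, name='pid_logits')
xent_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=pid_labels, logits=logits))

# The classification term keeps gradients flowing while the triplet term is
# stuck in the "difficult phase".
total_loss = loss_mean + 1.0 * xent_loss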

@lucasb-eyer
Member

Finally, I might implement the trick I mentioned here in the next few days; it can be helpful for especially difficult datasets.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Here is my full code. I have cross-referenced it with yours many times now and I don't think there are any major differences; I just removed the args and logging, etc.

#!/usr/bin/env python3
from importlib import import_module
import os, time, glob
import numpy as np
import tensorflow as tf

from train import triplet_loss as loss
from settings import config

################
## variables  ##
################
DATA_DIR = "/root/vggface/train_cropped/"

batch_p = 18
batch_k = 4
learning_rate = 1e-8
epsilon = 1e-9
train_iterations = 600
decay_start_iteration = 450
checkpoint_frequency = 1000
net_input_size = (config.feature_extractor_img_size, config.feature_extractor_img_size)
embedding_dim = 128
margin = 'soft'
metric='euclidean' #sqeuclidean
output_model = config.feature_model_dir + "tmp"
out_dir = output_model + "/save/"
log_every = 20
resume = False  # set to True to resume training from the latest checkpoint

################
##    run     ##
################

# make out directory
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

"""
PIDs are the "person IDs", i.e. class names/labels.
FIDs are the "file IDs", which are individual relative filenames.
"""
pids, fids = [], []
classes = [path for path in os.listdir(DATA_DIR) if os.path.isdir(os.path.join(DATA_DIR, path))]
for c in classes:
    for file in glob.glob(DATA_DIR+c+"/*.jpg"):
        pids.append(c)
        fids.append(file)

# Setup a tf.Dataset where one "epoch" loops over all PIDS.
# PIDS are shuffled after every epoch and continue indefinitely.
unique_pids = np.unique(pids)
dataset = tf.data.Dataset.from_tensor_slices(unique_pids)
dataset = dataset.shuffle(len(unique_pids))

# Constrain the dataset size to a multiple of the batch-size, so that
# we don't get overlap at the end of each epoch.
dataset = dataset.take((len(unique_pids) // batch_p) * batch_p)
dataset = dataset.repeat(None)  # Repeat forever. Funny way of stating it.

# For every PID, get K images.
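# NOTE: sample_k_fids_for_pid (and fid_to_image below) are not defined in this
# file; they are assumed to be available here, e.g. copied or imported from the
# original train.py of the triplet-reid repo.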
dataset = dataset.map(lambda pid: sample_k_fids_for_pid(
    pid, all_fids=fids, all_pids=pids, batch_k=batch_k))

# Ungroup/flatten the batches for easy loading of the files.
dataset = dataset.apply(tf.contrib.data.unbatch())

# Convert filenames to actual image tensors.
dataset = dataset.map(
    lambda fid, pid: fid_to_image(
        fid, pid,
        image_size=net_input_size),
    num_parallel_calls=8)

# Group it back into PK batches.
batch_size = batch_p * batch_k
dataset = dataset.batch(batch_size)

# Overlap producing and consuming for parallelism.
dataset = dataset.prefetch(1)

# Since we repeat the data infinitely, we only need a one-shot iterator.
images, fids, pids = dataset.make_one_shot_iterator().get_next()

# Create the model and an embedding head.
model = import_module('train.resnet_v1_101')
head = import_module('train.fc1024')

# Feed the image through the model. The returned `body_prefix` will be used
# further down to load the pre-trained weights for all variables with this
# prefix.
endpoints, body_prefix = model.endpoints(images, is_training=True)
with tf.name_scope('head'):
    endpoints = head.head(endpoints, embedding_dim, is_training=True)

# Create the loss in two steps:
# 1. Compute all pairwise distances according to the specified metric.
# 2. For each anchor along the first dimension, compute its loss.
dists = loss.cdist(endpoints['emb'], endpoints['emb'], metric=metric)
losses, train_top1, prec_at_k, _, neg_dists, pos_dists = loss.LOSS_CHOICES['batch_hard'](
    dists, pids, margin, batch_precision_at_k=batch_k-1)

# Count the number of active entries, and compute the total batch loss.
num_active = tf.reduce_sum(tf.cast(tf.greater(losses, 1e-5), tf.float32))
loss_mean = tf.reduce_mean(losses)

# These are collected here before we add the optimizer, because depending
# on the optimizer, it might add extra slots, which are also global
# variables, with the exact same prefix.
model_variables = tf.get_collection(
    tf.GraphKeys.GLOBAL_VARIABLES, body_prefix)

# Define the optimizer and the learning-rate schedule.
# Unfortunately, we get NaNs if we don't handle no-decay separately.
global_step = tf.Variable(0, name='global_step', trainable=False)
if 0 <= decay_start_iteration < train_iterations:
    learning_rate = tf.train.exponential_decay(
        learning_rate,
        tf.maximum(0, global_step - decay_start_iteration),
        train_iterations - decay_start_iteration, 0.001)
else:
    learning_rate = learning_rate
tf.summary.scalar('learning_rate', learning_rate)
# optimizer = tf.train.AdamOptimizer(learning_rate, epsilon=epsilon)
optimizer = tf.train.RMSPropOptimizer(learning_rate, epsilon=epsilon)

# Update_ops are used to update batchnorm stats.
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = optimizer.minimize(loss_mean, global_step=global_step)

# Define a saver for the complete model.
checkpoint_saver = tf.train.Saver(max_to_keep=0)

cp = tf.ConfigProto()
cp.gpu_options.allow_growth = True
with tf.Session(config=cp) as sess:
    if resume:
        # In case we're resuming, simply load the full checkpoint to init.
        last_checkpoint = tf.train.latest_checkpoint(out_dir)
        checkpoint_saver.restore(sess, last_checkpoint)
    else:
        sess.run(tf.global_variables_initializer())

        saver = tf.train.Saver(model_variables)
        saver.restore(sess, config.resnet_init)  # load the pre-trained ResNet weights

        checkpoint_saver.save(sess, os.path.join(
            out_dir, 'checkpoint'), global_step=0)

    merged_summary = tf.summary.merge_all()
    start_step = sess.run(global_step)

    # Finally, here comes the main loop. (The original train.py wraps this in an
    # `Uninterrupt` utility so that an iteration still finishes on Ctrl+C and
    # training can be stopped cleanly; that wrapper has been removed here.)
    for i in range(start_step, train_iterations):

        # Compute gradients, update weights, store logs!
        start_time = time.time()
        _, summary, step, b_prec_at_k, b_embs, b_loss, b_fids = \
            sess.run([train_op, merged_summary, global_step,
                      prec_at_k, endpoints['emb'], losses, fids])
        elapsed_time = time.time() - start_time

        if step % log_every == 0 or np.min(b_loss) < 0.693:
            print(str(step)+","+str(np.min(b_loss))+","+str(np.mean(b_loss))+","+str(np.max(b_loss)))

        if np.mean(b_loss) < 0.693:
            print("wahooooooooooooo")


        # Save a checkpoint of training every so often.
        if (checkpoint_frequency > 0 and step % checkpoint_frequency == 0):
            checkpoint_saver.save(sess, os.path.join(out_dir, 'checkpoint'), global_step=step)

    # Store one final checkpoint. This might be redundant, but it is crucial
    # in case intermediate storing was disabled and it saves a checkpoint
    # when the process was interrupted.
    checkpoint_saver.save(sess, os.path.join(out_dir, 'checkpoint'), global_step=step)

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

That would be fantastic. Unfortunately I have until Thursday to implement this. Can I help you at all?

Just read the comment - only two lines of code lol.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

I am sorry, can you explain "softmax loss (on a separate embedding)" in a bit more detail? What is the separate embedding?

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Btw, I have also tried different architectures, with no luck. I am currently trying resnet_v1_101.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Large epsilon values don't ever seem to converge when testing with the AdamOptimizer.

@lucasb-eyer
Member

A few more points:

  1. I'll try to get at least the trick I mentioned in the other issue in this evening.
  2. re softmax: Well, attach a second embedding to the head and use a softmax loss on it. I'll also see if I manage to implement it this evening, but no promises.
  3. You seem to have ignored my suggestion of trying a smaller P? It once helped me on a very hard dataset.
  4. Given that you don't use our exact code, I cannot be sure that you don't have other (possibly silly/simple) mistakes like you had with loading the pre-trained weights.
  5. IIRC you mentioned in an e-mail that you are running a custom detector for cropping. I know it's not your goal, but have you tried to see if it converges with the official dataset's crops?
  6. Large epsilon values typically need to be coupled with larger learning rates. Actually, only a loop such as the one I posted above can give a definitive answer.
  7. I now remember that I once had the surprising experience that only a specific scaling of the input image worked at all! I don't remember with which dataset, but it can make a lot of sense to play with the input image scale. For one, a consistent aspect ratio is very helpful. For another, many people report better success with larger input images. What is your input size? Try some larger sizes and other scalings/crops.
  8. You don't use cropping/flipping augmentation. Why? Especially for "more wild" datasets where the image is not perfectly aligned, crop augmentation makes a lot of sense (see the sketch at the end of this comment).

Again, I believe that I have never encountered a dataset where I couldn't find a single working setup, and I have trained on a lot of datasets so far.
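Re 8, a rough sketch of what crop/flip augmentation could look like in your tf.data pipeline (assuming elements of the form (image, fid, pid) after the fid_to_image map, and reusing your net_input_size; the +32 px pre-crop margin is an arbitrary choice):

def augment(image, fid, pid):
    image = tf.image.random_flip_left_right(image)
    bigger = (net_input_size[0] + 32, net_input_size[1] + 32)
    image = tf.image.resize_images(image, bigger)
    image = tf.random_crop(image, [net_input_size[0], net_input_size[1], 3])
    return image, fid, pid

# insert between the fid_to_image map and dataset.batch(...)
dataset = dataset.map(augment, num_parallel_calls=8)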

@lucasb-eyer
Member

Re 7: How do you get to 160x160? Are all your detections square, too, or do you stretch them? I can imagine stretching destroys quite some information about a face. You could try rescaling the larger side to 160 and padding with grey, or rescaling the smaller side to 160 and cropping the middle. As I said, for me choosing the right one of these once surprisingly made the difference between convergence and stuck training. Also, try other sizes that are divisible by 32 (because that's what resnet does) such as 128x128, 192x192, 224x224.
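In TF the two alternatives could look roughly like this (a sketch only; note the padding here is zeros/black rather than grey, and `target` should ideally be a multiple of 32):

def resize_keep_aspect(image, target, pad=True):
    hw = tf.cast(tf.shape(image)[:2], tf.float32)
    if pad:
        scale = float(target) / tf.reduce_max(hw)  # fit the longer side, then pad
    else:
        scale = float(target) / tf.reduce_min(hw)  # fit the shorter side, then centre-crop
    new_hw = tf.cast(tf.round(hw * scale), tf.int32)
    image = tf.image.resize_images(image, new_hw)
    return tf.image.resize_image_with_crop_or_pad(image, target, target)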

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Absolutely no reason. Detections are not square; FaceNet performs stretching, and I initially thought the same as you. I will give that a try in place of your:

image_resized = tf.image.resize_images(image_decoded, image_size)

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

I think I am going mad 😞. The mean loss is now always returning values of ~2 no matter the learning rate. What have I done?! I need to take a break 😉

@lucasb-eyer
Member

Sorry, I likely won't get to it this evening anymore; I have too much to do! (I'm in the middle of a cross-country move, actually.) I will do it tomorrow at work, first thing.

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Ahh, that is okay! Oh jeez! Good luck! Thank you again!

@maxisme
Contributor Author

maxisme commented Apr 24, 2018

Can I take back what I said about point 8? A horizontal flip would maybe help, but cropping definitely would not on already-cropped facial data. That is the great thing about pre-processing all the data with a face-localisation model: I can guarantee similar data if it passes through that stage. I was also worried that the crop was a bit too hardcore, so I have added a bit of padding to the output, as the function usually misses ears and fringes etc. (which could be quite helpful for recognition).

@lucasb-eyer
Member

There is no such thing as a "guarantee" in ML, believe me 😄

See #33; especially give the other losses such as batch_all and batch_sample a try. They might solve your problem, although they typically converge to slightly worse results than if you get batch_hard to work.

@maxisme
Contributor Author

maxisme commented Apr 25, 2018

😀 sweet jesus tell me about it!! Thank you!! Currently testing:

loss_mean = tf.reduce_sum(losses) / (1e-33 + tf.count_nonzero(losses, dtype=tf.float32))

No luck so far.

@maxisme
Contributor Author

maxisme commented Apr 25, 2018

Woah, it is scary how much the size of the image affects your mAP in the paper. Was that using stretching similar to:

image_resized = tf.image.resize_images(image_decoded, image_size)
?

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

Is the target loss for this:

loss_mean = tf.reduce_sum(losses) / (1e-33 + tf.count_nonzero(losses, dtype=tf.float32))

equal to 1?

I am understanding this as:

[screenshot of my formula]

where x is 0. Is that correct? I also don't get why a loss of <= x is what we care about; surely it should be when loss = positive - negative that we care about x being 0, not when loss = apply_margin(positive - negative, margin)?

@Pandoro
Member

Pandoro commented Apr 26, 2018

By design the triplet loss can never be smaller than zero, be it with a hard margin or with the soft-margin formulation. Your formula is wrong: the denominator is not the sum over all losses that are bigger than zero, but the number of losses that are non-zero. The idea of this loss is not to "wash out" the actual loss with all the triplets that in fact give a loss of zero, but to take the mean over only the "active" triplets among all possible triplets in the batch.
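A tiny numeric example of the difference (a sketch, not from the paper):

import numpy as np

losses = np.array([0.0, 0.0, 0.0, 0.9, 1.5])  # 3 inactive and 2 active triplets
mean_all = losses.mean()                                       # 0.48, washed out by the zeros
mean_active = losses.sum() / max(np.count_nonzero(losses), 1)  # 1.2, mean over active triplets only
print(mean_all, mean_active)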

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

Does 'active' mean being selected as the a, p and ns? Do you mention this concept in your paper? I cannot find it. Unfortunately I have now tried 'batch_hard', 'batch_sample' and 'batch_all', but none of them are converging 😞.

@lucasb-eyer
Member

To be honest, if none of the three converge after a learning-rate search, I think you have a bug somewhere or your dataset is broken. Neither I nor anyone I know has encountered this situation with any dataset, and collectively we have tried a lot of datasets. The only time this happens is when there's a bug somewhere.

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

Are there any techniques, other than the printing of the pids and fids mentioned initially, that I can use to check for a buggy dataset? I have visually checked about 10% of the folders and in each one the images match the same individual.

@lucasb-eyer
Member

Oh and we implicitly define "active" as non-zero triplets in 3.4; it means triplets which violate the margin, i.e. where n is closer to a than p is, plus margin.

You could try to do classification and see if that works, as it's a completely different thing. But you may also have made other mistakes when extracting your code from ours: maybe pre-processing, data fiddling, or whatever else kills the structure in the data. It's really hard to say; if I could say, I would know what your mistake is 😄

@lucasb-eyer
Member

Alternatively, you can try your code/pipeline with another dataset, maybe a toy one.

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

As in one of toys, or a small, easy dataset? haha!

@maxisme
Contributor Author

maxisme commented Apr 26, 2018

@Pandoro can I ask you to have a quick look at my code: #30 (comment) to see if anything jumps out as wrong?

@Pandoro
Member

Pandoro commented Apr 26, 2018

I just glanced over your code and I think you copied the most relevant parts, although I do wonder why you did that instead of just using our code.

The only thing that strikes me is that the optimizer used in the version you posted is RMSProp. I don't think either @lucasb-eyer or I have ever used this optimizer during training, and I have no idea what epsilon is used for in it; however, I do guess that it will need a completely different learning rate.

Nevertheless, I think you have ignored some key hints that @lucasb-eyer has given you. The image size of 160x160 surely isn't optimal; I think trying something like 256x256 or even 320x320 would be far more appropriate. Also, you should definitely try a smaller K; in the "easiest" case try P=16, K=2. That should make the batch-hard loss a lot easier to optimize, even though I guess it will not converge to a great score in the end.

Finally, on an unrelated note, we pretty much always used ResNet-50 and didn't get real improvements with ResNet-101, apart from it taking more memory and longer to train.

@maxisme
Contributor Author

maxisme commented Apr 28, 2018

I have learnt two important things in the last week: check your dataset about 5,000,000 times or write tests 😉, and don't let the iteration count bias your assumptions at all. I used the parameters below and, as a last effort, let it run overnight:

tech = 'batch_hard'# 'batch_hard' 'batch_sample' 'batch_all'
arch = 'resnet_v1_50'
batch_k = 2
batch_p = 18
learning_rate = 3e-5
epsilon = 1e-8
optimizer_name = 'Adam' # Adam MO RMS
train_iterations = 50000
decay_start_iteration = 20000
checkpoint_frequency = 1000
net_input_size = (256, 256)
embedding_dim = 128
margin = 'soft' # 'soft'
metric='euclidean' #sqeuclidean
output_model = config.feature_model_dir + "tmp"
out_dir = output_model + "/save/"
log_every = 5

By iteration 25,000 pretty much every mean loss is below 0.7, averaging 0.312; the average min loss is < 0.05, but slightly worryingly the max loss has not changed at all (still in the 1-4 range). Looking forward to the implementation!

@lucasb-eyer
Member

lucasb-eyer commented Apr 29, 2018

Check your dataset about 5,000,000 times or write tests 😉

Does that mean you indeed still had an error in your data?

Would you mind sharing a screenshot of a training curve?
I forgot to mention that one, but for much larger datasets it is indeed good to train much longer, and to start decaying much later!

@lucasb-eyer
Member

And you might want to look at the samples causing such huge max-loss after long training, they will either be extremely hard cases (and thus interesting to look at) or errors/noise in the data.

@maxisme
Contributor Author

maxisme commented Apr 29, 2018

@lucasb-eyer I will definitely do that. No, I just thought it was worth a mention, as that was the cause of a bug in the face-localisation model I was training, haha! Yes, I will send it over now; I am just training with slightly different params:

margin: soft 
tech: batch_hard 
arch: resnet_v1_50 
batch_p: 18 
lr: 3e-05 
input: (256, 256) 
metric:euclidean 
epsilon: 1e-07 
optimizer: Adam

I tested over 50,000 iterations with the params:

margin: soft 
tech: batch_hard 
arch: resnet_v1_50 
batch_p: 18 
lr: 0.001 
input: (256, 256) 
metric:euclidean 
epsilon: 1e-08 
optimizer: Adam

and the final mean was the dreaded 0.69

@maxisme
Contributor Author

maxisme commented Apr 29, 2018

Okay, here are the graphs over 30,000 iterations. Sorry for the lack of an x-axis; Excel is rubbish.

Mean Loss

[screenshot: mean loss over iterations]

Max Loss

[screenshot: max loss over iterations]

Min Loss

[screenshot: min loss over iterations]

CSV

With the actual data and iterations (just replace .txt with .csv):

loss.txt

@lucasb-eyer
Member

Thanks, I think that looks better, right? I consider the original issue of "Unable to approach loss of less than 0.7 [...]" to be solved and so am closing it, but feel free to re-open if you disagree.

@maxisme
Contributor Author

maxisme commented Sep 23, 2018

And you might want to look at the samples causing such huge max-loss after long training, they will either be extremely hard cases (and thus interesting to look at) or errors/noise in the data.

Hey, sorry to bring this back to life! How would I go about debugging which files were the 'extremely hard cases'?

@lucasb-eyer
Member

lucasb-eyer commented Oct 28, 2018

It's those which always violate the margin. Edit: or even better, those which are always selected as the hard examples in the batch-hard loss.
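One way to do that (a sketch against the training script you posted earlier, where b_loss and b_fids are already fetched every step):

from collections import Counter
import numpy as np

hard_counts = Counter()

def record_hard_examples(b_loss, b_fids, top_k=5):
    # count how often each file is among the largest batch-hard losses
    for idx in np.argsort(b_loss)[-top_k:]:
        fid = b_fids[idx]
        hard_counts[fid.decode('utf-8') if isinstance(fid, bytes) else fid] += 1

# Call record_hard_examples(b_loss, b_fids) once per iteration; after training,
# hard_counts.most_common(20) lists the files to inspect by eye.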
