Active Learning Yields Poor Results in Multi-Label Task #191

Open

shadikhamsehh opened this issue Sep 10, 2024 · 0 comments

@shadikhamsehh
I am using modAL for an active learning project on a multi-label classification task. My implementation is in PyTorch, with DinoV2 as the backbone model.
On the same dataset, I run both active learning (using the minimum-confidence and average-confidence strategies) and random sampling, selecting the same number of samples with each strategy. However, the results from random sampling are significantly better than those from the active learning approach. I would like to know whether this discrepancy is due to an issue in my code or in the modAL library's handling of multi-label classification.
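For context, the learner is set up roughly along these lines (a simplified sketch, not my exact code; classifier stands for an sklearn-compatible wrapper around my PyTorch/DinoV2 model, and the initial labelled seed set is abbreviated):

from modAL.models import ActiveLearner
from modAL.multilabel import min_confidence  # avg_confidence for the other runs

# classifier must expose fit/predict/predict_proba in sklearn style
learner = ActiveLearner(
    estimator=classifier,
    query_strategy=min_confidence,
    X_training=X_initial,
    y_training=y_initial,
)

My active learning loop is below: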

for i in range(n_queries):
    # On the last scheduled query (i == 12), take everything left in the pool
    if i == 12:
        n_instances = X_pool.shape[0]
    else:
        # batch() is a helper defined elsewhere in my code; the requested
        # query size grows with POWER (updated at the end of the loop)
        n_instances = batch(int(np.ceil(np.power(10, POWER))), BATCH_SIZE)

    print(f"\nQuery {i + 1}: Requesting {n_instances} samples from a pool of size {X_pool.shape[0]}")

    if X_pool.shape[0] < n_instances:
        print("Not enough samples left in the pool to query the desired number of instances.")
        break

    query_idx, _ = learner.query(X_pool, n_instances=n_instances)
    query_idx = np.unique(query_idx)

    if len(query_idx) == 0:
        print("No indices were selected, which may indicate an issue with the query function or pool.")
        continue

    # Add the newly selected samples to the cumulative training set
    cumulative_X_train.append(X_pool[query_idx])
    cumulative_y_train.append(y_pool[query_idx])

    # Concatenate all the samples to form the cumulative training data
    X_train_cumulative = np.concatenate(cumulative_X_train, axis=0)
    y_train_cumulative = np.concatenate(cumulative_y_train, axis=0)

    learner.teach(X_train_cumulative, y_train_cumulative)  # refit on everything selected so far

    # Log the selected sample names (query_idx is relative to the current,
    # shrinking X_pool, not necessarily to train_df's original row order)
    selected_sample_names = train_df.loc[query_idx, "image"].tolist()
    print(f"Selected samples in Query {i + 1}: {selected_sample_names}")
    with open(samples_log_file, mode='a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([i + 1] + selected_sample_names)

    # Remove the selected samples from the pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

    # Evaluate the model
    y_pred = learner.predict(X_test_np)
    accuracy = accuracy_score(y_test_np, y_pred)
    f1 = f1_score(y_test_np, y_pred, average='macro')
    acc_test_data.append(accuracy)
    f1_test_data.append(f1)
    print(f"Accuracy after query {i + 1}: {accuracy}")
    print(f"F1 Score after query {i + 1}: {f1}")

    # Early stopping logic
    if f1 > best_f1_score:
        best_f1_score = f1
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print(f"Stopping early after {i + 1} queries due to no improvement in F1 score.")
            break

    total_samples += len(query_idx)
    print(f"Total samples used for training after query {i + 1}: {total_samples}")
    POWER += 0.25  # increase the exponent so the next query requests more samples
    torch.cuda.empty_cache()
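
For reference, the random-sampling baseline picks the same number of instances uniformly at random each round. As a drop-in modAL-style query strategy it looks roughly like this (a sketch, not my exact code):

import numpy as np

def random_sampling(classifier, X_pool, n_instances=1):
    # Uniform random baseline with the same return convention as modAL's
    # query strategies: (indices into X_pool, the selected instances)
    query_idx = np.random.choice(X_pool.shape[0], size=n_instances, replace=False)
    return query_idx, X_pool[query_idx]

For the baseline run, this replaces min_confidence as the query_strategy; the rest of the loop above stays the same.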
