Bag of visual words giving same images as output

Hi everyone,

I am trying to implement bag of visual words (BoVW) for place recognition. I am using this dataset.
Here are the steps that I am following:

  1. Extract ORB keypoints/descriptors from every image in the dataset
  2. Cluster the descriptors from step 1 into K = 1000 clusters using k-means. The cluster centers act as my vocabulary
  3. For each image, build a histogram of visual-word occurrences by assigning each descriptor to its nearest vocabulary word
  4. To find similar place images for a query image, compare histograms with cosine distance: the lower the distance, the more similar the images.

To test this, I use one of the images from the dataset as the query image. However, regardless of which query image I use, I always get the same result images back. I suspect there is an issue with my implementation. Please find the code below:

#include <opencv2/opencv.hpp>
#include <opencv2/features2d.hpp>
#include <algorithm> // std::sort
#include <cfloat>    // FLT_MAX
#include <iostream>
#include <string>
#include <vector>

bool readimage(const std::string& image_name, cv::Mat& img, bool show_image) {
    img = cv::imread(image_name, cv::IMREAD_COLOR);
    if(img.empty())
    {
        std::cerr << "Could not read the image: " << image_name << std::endl;
        return false;
    }

    if (show_image) {
        cv::imshow("Display_window", img);
        cv::waitKey(0);
    }
    return true;
}

void extract_orb_feature_descriptors(const cv::Mat& img, std::vector<cv::KeyPoint>& kp, cv::Mat& desc, int nfeatures = 1000) {
    cv::Mat gray_img;
    if (img.channels() > 1) {
        cv::cvtColor(img, gray_img, cv::COLOR_BGR2GRAY);
    } else {
        gray_img = img;
    }

    // Create the ORB detector; remaining parameters are left at their defaults
    cv::Ptr<cv::ORB> orb = cv::ORB::create(nfeatures);
    // Detect keypoints and compute descriptors
    orb->detectAndCompute(gray_img, cv::noArray(), kp, desc);
    // Debug: Check if descriptors are extracted
    if (desc.empty()) {
        std::cerr << "No descriptors extracted for the image." << std::endl;
    }
}
/*
void extract_surf_feature_descriptors(cv::Mat img, std::vector<cv::KeyPoint>& kp, cv::Mat& desc)  {
    cv::Mat gray_img;
    if (img.channels() > 1) {
        cv::cvtColor(img, gray_img, cv::COLOR_BGR2GRAY);
    } else {
        gray_img = img;
    }
    int minHessian = 400;
    cv::Ptr<cv::xfeatures2d::SURF> surf;
    surf = cv::xfeatures2d::SURF::create(minHessian); // use other default parameters
    // Detect keypoints and compute descriptors
    surf->detectAndCompute(gray_img, cv::noArray(), kp, desc);
}
*/


cv::Mat kMeansClustering(const std::vector<cv::Mat>& Descs, int numClusters) {
    // Flatten the descriptors into a single matrix for clustering
    cv::Mat allDesc;
    for (const auto& desc : Descs) {
        allDesc.push_back(desc);
    }
    allDesc.convertTo(allDesc, CV_32F); // cv::kmeans requires CV_32F input

    cv::Mat labels, centers;
    cv::kmeans(allDesc, numClusters, labels, cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::MAX_ITER, 100, 0.2), 3,  cv::KMEANS_PP_CENTERS, centers);
    return centers;
}

// Compute BoW Histogram (Term Frequency)
cv::Mat computeBoWHistogram(const cv::Mat& descriptors, const cv::Mat& vocabulary)  {
    cv::Mat histogram = cv::Mat::zeros(1, vocabulary.rows, CV_32F);
    cv::Mat descriptors32F;
    descriptors.convertTo(descriptors32F, CV_32F); // Ensure it's in the correct type

    for (int i = 0; i < descriptors32F.rows; i++) {
        float minDist = FLT_MAX;
        int closestCentroid = -1;

        // Find the closest centroid (visual word)
        for (int j = 0; j < vocabulary.rows; j++) {
            float dist = (float)cv::norm(descriptors32F.row(i), vocabulary.row(j), cv::NORM_L2);  // Euclidean distance
            if (dist < minDist) {
                minDist = dist;
                closestCentroid = j;
            }
        }
        // Increment the histogram count for the closest centroid
        histogram.at<float>(0, closestCentroid)++;
    }
    // L1-normalize so the histogram sums to 1 (i.e. term frequency)
    cv::normalize(histogram, histogram, 1, 0, cv::NORM_L1);
    
    return histogram;
}

// currently unused
// Compute TF-IDF for a collection of descriptors (documents)
cv::Mat computeTFIDF(const std::vector<cv::Mat>& desc_vec, const cv::Mat& vocabulary) {
    int numDocuments = desc_vec.size();
    int numTerms = vocabulary.rows;

    // Document Frequency (DF) and Inverse Document Frequency (IDF)
    cv::Mat df = cv::Mat::zeros(1, numTerms, CV_32F);
    cv::Mat idf = cv::Mat::zeros(1, numTerms, CV_32F);

    // Initialize a matrix to store the TF-IDF for each document
    cv::Mat tfidf(numDocuments, numTerms, CV_32F, cv::Scalar(0));

    // Calculate document frequency (DF) for each term
    for (int i = 0; i < numDocuments; i++) {
        cv::Mat histogram = computeBoWHistogram(desc_vec[i], vocabulary);

        // Count terms that appear in each document
        for (int termIdx = 0; termIdx < numTerms; termIdx++) {
            if (histogram.at<float>(0, termIdx) > 0) {
                df.at<float>(0, termIdx) += 1;  // Increment DF for this term
            }
        }
    }

    // Calculate the IDF for each term
    for (int i = 0; i < numTerms; i++) {
        // Avoid division by zero by adding 1 to DF
        float idfValue = log((float)numDocuments / (1 + df.at<float>(0, i)));
        idf.at<float>(0, i) = idfValue;
    }

    // Calculate TF-IDF for each document
    for (int i = 0; i < numDocuments; i++) {
        cv::Mat histogram = computeBoWHistogram(desc_vec[i], vocabulary);

        // Compute TF-IDF for each term in the document
        for (int termIdx = 0; termIdx < numTerms; termIdx++) {
            float tf = histogram.at<float>(0, termIdx);  // TF (histogram is already L1-normalized)
            float tfidfValue = tf * idf.at<float>(0, termIdx);  // TF-IDF = TF * IDF
            tfidf.at<float>(i, termIdx) = tfidfValue;
        }
    }
    return tfidf;
}

// Compute Cosine Distance between two histograms
float computeCosineDistance(const cv::Mat& hist1, const cv::Mat& hist2) {
    // Normalize histograms
    cv::Mat normHist1, normHist2;
    cv::normalize(hist1, normHist1, 1, 0, cv::NORM_L1);
    cv::normalize(hist2, normHist2, 1, 0, cv::NORM_L1);
    
    // Dot product
    float dotProduct = normHist1.dot(normHist2);

    // Avoid division by zero in magnitudes
    float normHist1Magnitude = cv::norm(normHist1, cv::NORM_L2);
    float normHist2Magnitude = cv::norm(normHist2, cv::NORM_L2);
    if (normHist1Magnitude == 0 || normHist2Magnitude == 0) {
        std::cerr << "Zero magnitude in one of the histograms. Returning maximum distance." << std::endl;
        return 1.0f; // Maximum cosine distance
    }

    // Cosine similarity
    float cosineSim = dotProduct / (normHist1Magnitude * normHist2Magnitude);

    // Cosine distance (1 - Cosine similarity)
    return 1.0f - cosineSim;
}

std::vector<int> give_NIndices_of_similar_images(const cv::Mat& histograms, const cv::Mat& queryhist, int nIndices) {
    std::vector<int> indices;
    std::vector<std::pair<int, float>> distances;
    float dist = 0.0f;
    for (int i = 0; i < histograms.rows; i++) {
        dist = computeCosineDistance(queryhist, histograms.row(i));
        std::cout << dist << std::endl;
        distances.push_back(std::make_pair(i, dist));
    }
    // Sort by distance (ascending order)
    std::sort(distances.begin(), distances.end(), [](const std::pair<int, float>& a, const std::pair<int, float>& b) {
        return a.second < b.second;
    });


    for (int i = 0; i < nIndices && i < (int)distances.size(); i++) {
        int imgIdx = distances[i].first;
        std::cout << "Image index: " << imgIdx << " with distance: " << distances[i].second << std::endl;
        indices.push_back(imgIdx);
    }
    return indices;
}


int main(int argc, char** argv) {
    std::string pattern_name = "../images-freiburg-x10/data1/imageCompressedCam0_*0.png";
    std::vector<std::string> images;
    cv::glob(pattern_name, images, false);
    cv::Mat img_, desc1;
    std::vector<cv::KeyPoint> kp;
    std::vector<cv::Mat> desc_vec;
    int nCluster = 1000;
    for (const auto& img_n : images) {
        std::cout << "Processing image: " << img_n << std::endl;
        bool success = readimage(img_n, img_, false);
        if (success) {
            extract_orb_feature_descriptors(img_, kp, desc1);
            // std::cout << kp.size() << " " << desc1.size() << std::endl;
            desc_vec.push_back(desc1);
        }
    }
    cv::Mat vocabulary = kMeansClustering(desc_vec, nCluster);
    //std::cout << vocabulary.size() << std::endl;
    // Compute the BoW histogram for each image
    cv::Mat hist, hist_all;
    for (size_t i = 0; i < desc_vec.size(); i++) {
        hist = computeBoWHistogram(desc_vec[i], vocabulary);
        //std::cout << hist.size() << std::endl;
        //std::cout << hist << std::endl;
        
        hist_all.push_back(hist);
    }
    std::cout << hist_all.size() << std::endl;

    //cv::Mat tfidf = computeTFIDF(desc_vec, vocabulary);
    //std::cout << tfidf.size() << std::endl;

    // query image
    std::string query_img_name = "../images-freiburg-x10/data1/imageCompressedCam0_0005740.png";
    cv::Mat query_img;
    readimage(query_img_name, query_img, false);
    cv::Mat query_desc;
    std::vector<cv::KeyPoint> query_kp;
    extract_orb_feature_descriptors(img_, query_kp, query_desc); 
    cv::Mat query_hist = computeBoWHistogram(query_desc, vocabulary);
    auto indices = give_NIndices_of_similar_images(hist_all, query_hist, 5);
    std::vector<cv::Mat> result_images;
    result_images.push_back(query_img);
    for (int idx : indices) {
        result_images.push_back(cv::imread(images[idx]));
    }
    cv::Mat combined;
    cv::hconcat(result_images, combined);
    cv::imshow("Query and Top Matches", combined);
    cv::waitKey(0);
    return 0;
}

Could someone help me identify where I might be going wrong?

whatever dataset you work on, imo you cannot use ORB features here (rather use SIFT).
opencv’s BoW implementation only has a KMeans clusterer for training, and there is no meaningful L2 distance (which is what it uses) for ORB, a binary feature.
to make it work with ORB, you’d need a KMajority (or similar) clusterer, something that works properly with HAMMING distance.
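
a minimal sketch of what nearest-word assignment could look like with binary descriptors and HAMMING distance (this assumes a "vocabulary" of binary CV_8U centroids, e.g. from a k-majority clusterer, which opencv does not ship):

#include <opencv2/opencv.hpp>
#include <cfloat> // DBL_MAX

// Sketch: assign a binary ORB descriptor to its nearest visual word using
// HAMMING distance (number of differing bits) instead of L2. Assumes
// `vocabulary` is a CV_8U matrix of binary centroids, e.g. produced by a
// k-majority clusterer -- OpenCV does not provide one out of the box.
int nearestWordHamming(const cv::Mat& descriptor, const cv::Mat& vocabulary) {
    double minDist = DBL_MAX;
    int closest = -1;
    for (int j = 0; j < vocabulary.rows; j++) {
        double dist = cv::norm(descriptor, vocabulary.row(j), cv::NORM_HAMMING);
        if (dist < minDist) {
            minDist = dist;
            closest = j;
        }
    }
    return closest;
}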

clearly, don’t steal from idiot posts on SO!!

the clusters/histograms need weighting. look up “tf-idf”.

that is, unless there’s magic in opencv that claims to do this for you.

don’t just go by what someone wrote in the paper, go by their proof-of-concept implementation.


I appreciate your support. I am now using SIFT.

Also, I have corrected the mistake you pointed out:

cv::Mat allDesc32F;
allDesc.convertTo(allDesc32F, CV_32F); // for kmeans

what? not at all.
SIFT already returns float32 features, so it does not need any further conversion.

just remove it, it’s a leftover of the ill-fated idea of converting binary features into something that works with L2/kmeans.

again, the 32*8 bits in an ORB descriptor do NOT form ‘numbers’, they’re just ‘bitstrings’.
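
for illustration, a minimal sketch of the SIFT path (assuming opencv >= 4.4, where SIFT lives in the main features2d module); the descriptors come back as CV_32F and can go straight into cv::kmeans:

#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: SIFT descriptors are CV_32F by construction, so no convertTo()
// is needed before cv::kmeans. Assumes OpenCV >= 4.4 (SIFT in features2d).
void extract_sift(const cv::Mat& gray, std::vector<cv::KeyPoint>& kp, cv::Mat& desc) {
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    sift->detectAndCompute(gray, cv::noArray(), kp, desc);
    CV_Assert(desc.empty() || desc.type() == CV_32F); // float by construction
}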


Thank you for your suggestion. TF-IDF = TF * IDF = (n_id / n_d) * log(N / N_i), where n_id is the number of occurrences of visual word i in image d, n_d is the total number of words in image d, N is the number of images, and N_i is the number of images containing word i.
I have implemented TF-IDF as two functions: one that returns the IDF and another that computes the TF-IDF. I build a histogram containing the TF-IDF values for every image in the dataset.
Now, for the query image, I reuse the IDF values computed from the dataset but calculate the TF from the query image itself. This approach seems to work, but is it correct? Or do I have to update the IDF as well?
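
For reference, a minimal sketch of that query path, assuming computeBoWHistogram from the code above (its L1-normalized histogram is exactly the TF) and an idf row vector (1 x numTerms, CV_32F) precomputed once from the dataset:

// Sketch: query-time TF-IDF that reuses the dataset IDF. The IDF vector is
// computed once from the database images and kept fixed; only the TF part
// comes from the query image.
cv::Mat computeQueryTFIDF(const cv::Mat& queryDesc, const cv::Mat& vocabulary,
                          const cv::Mat& idf) { // idf: 1 x numTerms, CV_32F
    cv::Mat tf = computeBoWHistogram(queryDesc, vocabulary); // TF of the query
    cv::Mat tfidf;
    cv::multiply(tf, idf, tfidf); // element-wise TF * IDF
    return tfidf;
}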