Bag of visual words giving same images as output

Hi everyone,

I am trying to implement bag of visual words (BoVW) for place recognition. I am using this dataset.
Here are the steps that I am following:

  1. Extract ORB keypoints/descriptors from every image in the dataset
  2. Cluster the descriptors from step 1 into K = 1000 clusters using k-means. The cluster centers act as my vocabulary
  3. For each image, build a histogram of visual-word occurrences by assigning each descriptor to its nearest vocabulary word
  4. To find similar place images for a query image, compare histograms with cosine distance: the lower the distance, the more similar the images.

To test this, I use one of the images from the dataset as the query image. However, regardless of which query image I use, I always get the same result images back. I suspect there is an issue with my implementation. Please find the code below:

#include <opencv2/opencv.hpp>
#include <opencv2/features2d.hpp>
#include <algorithm> // std::sort
#include <cfloat>    // FLT_MAX
#include <iostream>
#include <string>
#include <vector>

bool readimage(const std::string& image_name, cv::Mat& img, bool show_image) {
    img = cv::imread(image_name, cv::IMREAD_COLOR);
    if(img.empty())
    {
        std::cerr << "Could not read the image: " << image_name << std::endl;
        return false;
    }

    if (show_image) {
        cv::imshow("Display_window", img);
        cv::waitKey(0);
    }
    return true;
}

void extract_orb_feature_descriptors(const cv::Mat& img, std::vector<cv::KeyPoint>& kp, cv::Mat& desc, int nfeatures = 1000) {
    cv::Mat gray_img;
    if (img.channels() > 1) {
        cv::cvtColor(img, gray_img, cv::COLOR_BGR2GRAY);
    } else {
        gray_img = img;
    }

    // Create the ORB detector; remaining parameters are left at their defaults
    cv::Ptr<cv::ORB> orb = cv::ORB::create(nfeatures);
    // Detect keypoints and compute descriptors
    orb->detectAndCompute(gray_img, cv::noArray(), kp, desc);
    // Debug: Check if descriptors are extracted
    if (desc.empty()) {
        std::cerr << "No descriptors extracted for the image." << std::endl;
    }
}
/*
void extract_surf_feature_descriptors(cv::Mat img, std::vector<cv::KeyPoint>& kp, cv::Mat& desc)  {
    cv::Mat gray_img;
    if (img.channels() > 1) {
        cv::cvtColor(img, gray_img, cv::COLOR_BGR2GRAY);
    } else {
        gray_img = img;
    }
    int minHessian = 400;
    cv::Ptr<cv::xfeatures2d::SURF> surf;
    surf = cv::xfeatures2d::SURF::create(minHessian); // use other default parameters
    // Detect keypoints and compute descriptors
    surf->detectAndCompute(gray_img, cv::noArray(), kp, desc);
}
*/


cv::Mat kMeansClustering(const std::vector<cv::Mat>& Descs, int numClusters) {
    // Flatten the descriptors into a single matrix for clustering
    cv::Mat allDesc;
    for (const auto& desc : Descs) {
        allDesc.push_back(desc);
    }
    allDesc.convertTo(allDesc, CV_32F); // cv::kmeans requires CV_32F input

    cv::Mat labels, centers;
    cv::kmeans(allDesc, numClusters, labels, cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::MAX_ITER, 100, 0.2), 3,  cv::KMEANS_PP_CENTERS, centers);
    return centers;
}

// Compute BoW Histogram (Term Frequency)
cv::Mat computeBoWHistogram(const cv::Mat& descriptors, const cv::Mat& vocabulary)  {
    cv::Mat histogram = cv::Mat::zeros(1, vocabulary.rows, CV_32F);
    cv::Mat descriptors32F;
    descriptors.convertTo(descriptors32F, CV_32F); // Ensure it's in the correct type

    for (int i = 0; i < descriptors32F.rows; i++) {
        float minDist = FLT_MAX;
        int closestCentroid = -1;

        // Find the closest centroid (visual word)
        for (int j = 0; j < vocabulary.rows; j++) {
            float dist = (float)cv::norm(descriptors32F.row(i), vocabulary.row(j), cv::NORM_L2);  // Euclidean distance
            if (dist < minDist) {
                minDist = dist;
                closestCentroid = j;
            }
        }
        // Increment the histogram count for the closest centroid
        histogram.at<float>(0, closestCentroid)++;
    }
    // L1-normalize so the histogram sums to 1 (i.e. term frequency)
    cv::normalize(histogram, histogram, 1, 0, cv::NORM_L1);
    
    return histogram;
}

// currently unused
// Compute TF-IDF for a collection of descriptors (documents)
cv::Mat computeTFIDF(const std::vector<cv::Mat>& desc_vec, const cv::Mat& vocabulary) {
    int numDocuments = desc_vec.size();
    int numTerms = vocabulary.rows;

    // Document Frequency (DF) and Inverse Document Frequency (IDF)
    cv::Mat df = cv::Mat::zeros(1, numTerms, CV_32F);
    cv::Mat idf = cv::Mat::zeros(1, numTerms, CV_32F);

    // Initialize a matrix to store the TF-IDF for each document
    cv::Mat tfidf(numDocuments, numTerms, CV_32F, cv::Scalar(0));

    // Calculate document frequency (DF) for each term
    for (int i = 0; i < numDocuments; i++) {
        cv::Mat histogram = computeBoWHistogram(desc_vec[i], vocabulary);

        // Count terms that appear in each document
        for (int termIdx = 0; termIdx < numTerms; termIdx++) {
            if (histogram.at<float>(0, termIdx) > 0) {
                df.at<float>(0, termIdx) += 1;  // Increment DF for this term
            }
        }
    }

    // Calculate the IDF for each term
    for (int i = 0; i < numTerms; i++) {
        // Avoid division by zero by adding 1 to DF
        float idfValue = log((float)numDocuments / (1 + df.at<float>(0, i)));
        idf.at<float>(0, i) = idfValue;
    }

    // Calculate TF-IDF for each document
    for (int i = 0; i < numDocuments; i++) {
        cv::Mat histogram = computeBoWHistogram(desc_vec[i], vocabulary);

        // Compute TF-IDF for each term in the document
        for (int termIdx = 0; termIdx < numTerms; termIdx++) {
            float tf = histogram.at<float>(0, termIdx);  // TF (histogram is already L1-normalized)
            float tfidfValue = tf * idf.at<float>(0, termIdx);  // TF-IDF = TF * IDF
            tfidf.at<float>(i, termIdx) = tfidfValue;
        }
    }
    return tfidf;
}

// Compute Cosine Distance between two histograms
float computeCosineDistance(const cv::Mat& hist1, const cv::Mat& hist2) {
    // Normalize histograms
    cv::Mat normHist1, normHist2;
    cv::normalize(hist1, normHist1, 1, 0, cv::NORM_L1);
    cv::normalize(hist2, normHist2, 1, 0, cv::NORM_L1);
    
    // Dot product
    float dotProduct = normHist1.dot(normHist2);

    // Avoid division by zero in magnitudes
    float normHist1Magnitude = cv::norm(normHist1, cv::NORM_L2);
    float normHist2Magnitude = cv::norm(normHist2, cv::NORM_L2);
    if (normHist1Magnitude == 0 || normHist2Magnitude == 0) {
        std::cerr << "Zero magnitude in one of the histograms. Returning maximum distance." << std::endl;
        return 1.0f; // Maximum cosine distance
    }

    // Cosine similarity
    float cosineSim = dotProduct / (normHist1Magnitude * normHist2Magnitude);

    // Cosine distance (1 - Cosine similarity)
    return 1.0f - cosineSim;
}

std::vector<int> give_NIndices_of_similar_images(const cv::Mat& histograms, const cv::Mat& queryhist, int nIndices) {
    std::vector<int> indices;
    std::vector<std::pair<int, float>> distances;
    float dist = 0.0f;
    for (int i = 0; i < histograms.rows; i++) {
        dist = computeCosineDistance(queryhist, histograms.row(i));
        std::cout << dist << std::endl;
        distances.push_back(std::make_pair(i, dist));
    }
    // Sort by distance (ascending order)
    std::sort(distances.begin(), distances.end(), [](const std::pair<int, float>& a, const std::pair<int, float>& b) {
        return a.second < b.second;
    });


    for (int i = 0; i < nIndices && i < (int)distances.size(); i++) {
        int imgIdx = distances[i].first;
        std::cout << "Image index: " << imgIdx << " with distance: " << distances[i].second << std::endl;
        indices.push_back(imgIdx);
    }
    return indices;
}


int main(int argc, char** argv) {
    std::string pattern_name = "../images-freiburg-x10/data1/imageCompressedCam0_*0.png";
    std::vector<std::string> images;
    cv::glob(pattern_name, images, false);
    cv::Mat img_, desc1;
    std::vector<cv::KeyPoint> kp;
    std::vector<cv::Mat> desc_vec;
    int nCluster = 1000;
    for (const auto& img_n : images) {
        std::cout << "Processing image: " << img_n << std::endl;
        bool success = readimage(img_n, img_, false);
        if (success) {
            extract_orb_feature_descriptors(img_, kp, desc1);
            // std::cout << kp.size() << " " << desc1.size() << std::endl;
            desc_vec.push_back(desc1);
        }
    }
    cv::Mat vocabulary = kMeansClustering(desc_vec, nCluster);
    //std::cout << vocabulary.size() << std::endl;
    // Compute the BoW histogram for each image
    cv::Mat hist, hist_all;
    for (size_t i = 0; i < desc_vec.size(); i++) {
        hist = computeBoWHistogram(desc_vec[i], vocabulary);
        //std::cout << hist.size() << std::endl;
        //std::cout << hist << std::endl;
        
        hist_all.push_back(hist);
    }
    std::cout << hist_all.size() << std::endl;

    //cv::Mat tfidf = computeTFIDF(desc_vec, vocabulary);
    //std::cout << tfidf.size() << std::endl;

    // query image
    std::string query_img_name = "../images-freiburg-x10/data1/imageCompressedCam0_0005740.png";
    cv::Mat query_img;
    readimage(query_img_name, query_img, false);
    cv::Mat query_desc;
    std::vector<cv::KeyPoint> query_kp;
    extract_orb_feature_descriptors(img_, query_kp, query_desc); 
    cv::Mat query_hist = computeBoWHistogram(query_desc, vocabulary);
    auto indices = give_NIndices_of_similar_images(hist_all, query_hist, 5);
    std::vector<cv::Mat> result_images;
    result_images.push_back(query_img);
    for (int idx : indices) {
        result_images.push_back(cv::imread(images[idx]));
    }
    cv::Mat combined;
    cv::hconcat(result_images, combined);
    cv::imshow("Query and Top Matches", combined);
    cv::waitKey(0);
    return 0;
}

Could someone help me identify where I might be going wrong?

whatever dataset you work on, imo you cannot use ORB features here (rather use SIFT).
opencv’s BoW implementation only has a KMeans clusterer for training, and there is no meaningful L2 distance (which is what it uses) for ORB, a binary feature.
to make it work with ORB, you’d need a KMajority (or similar) clusterer, something that works properly with HAMMING distance.
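
a minimal sketch of what nearest-word assignment could look like with binary descriptors and HAMMING distance (this assumes a "vocabulary" of binary CV_8U centroids, e.g. from a k-majority clusterer, which opencv does not ship):

#include <opencv2/opencv.hpp>
#include <cfloat> // DBL_MAX

// Sketch: assign a binary ORB descriptor to its nearest visual word using
// HAMMING distance (number of differing bits) instead of L2. Assumes
// `vocabulary` is a CV_8U matrix of binary centroids, e.g. produced by a
// k-majority clusterer -- OpenCV does not provide one out of the box.
int nearestWordHamming(const cv::Mat& descriptor, const cv::Mat& vocabulary) {
    double minDist = DBL_MAX;
    int closest = -1;
    for (int j = 0; j < vocabulary.rows; j++) {
        double dist = cv::norm(descriptor, vocabulary.row(j), cv::NORM_HAMMING);
        if (dist < minDist) {
            minDist = dist;
            closest = j;
        }
    }
    return closest;
}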

clearly, don’t steal from idiot posts on SO!!

the clusters/histograms need weighting. look up “tf-idf”.

that is, unless there’s magic in opencv that claims to do this for you.

don’t just go by what someone wrote in the paper, go by their proof-of-concept implementation.


I appreciate your support. I am now using SIFT.

Also, I have corrected the mistake you pointed out:

cv::Mat allDesc32F;
allDesc.convertTo(allDesc32F, CV_32F); // for kmeans

what? not at all.
SIFT already returns float32 features, so it does not need any further conversion.

just remove it, it’s a leftover of the ill-fated idea of converting binary features into something that works with L2/kmeans.

again, the 32*8 bits in an ORB descriptor do NOT form ‘numbers’, they’re just ‘bitstrings’.
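
for illustration, a minimal sketch of the SIFT path (assuming opencv >= 4.4, where SIFT lives in the main features2d module); the descriptors come back as CV_32F and can go straight into cv::kmeans:

#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: SIFT descriptors are CV_32F by construction, so no convertTo()
// is needed before cv::kmeans. Assumes OpenCV >= 4.4 (SIFT in features2d).
void extract_sift(const cv::Mat& gray, std::vector<cv::KeyPoint>& kp, cv::Mat& desc) {
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    sift->detectAndCompute(gray, cv::noArray(), kp, desc);
    CV_Assert(desc.empty() || desc.type() == CV_32F); // float by construction
}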


Thank you for your suggestion. TF-IDF = TF * IDF = (n_id / n_d) * log(N / N_i), where n_id is the number of occurrences of visual word i in image d, n_d is the total number of words in image d, N is the number of images, and N_i is the number of images containing word i.
I have implemented TF-IDF as two functions: one that returns the IDF and another that computes the TF-IDF. I build a histogram containing the TF-IDF values for every image in the dataset.
Now, for the query image, I reuse the IDF values computed from the dataset but calculate the TF from the query image itself. This approach seems to work, but is it correct? Or do I have to update the IDF as well?
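
For reference, a minimal sketch of that query path, assuming computeBoWHistogram from the code above (its L1-normalized histogram is exactly the TF) and an idf row vector (1 x numTerms, CV_32F) precomputed once from the dataset:

// Sketch: query-time TF-IDF that reuses the dataset IDF. The IDF vector is
// computed once from the database images and kept fixed; only the TF part
// comes from the query image.
cv::Mat computeQueryTFIDF(const cv::Mat& queryDesc, const cv::Mat& vocabulary,
                          const cv::Mat& idf) { // idf: 1 x numTerms, CV_32F
    cv::Mat tf = computeBoWHistogram(queryDesc, vocabulary); // TF of the query
    cv::Mat tfidf;
    cv::multiply(tf, idf, tfidf); // element-wise TF * IDF
    return tfidf;
}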