Item request has been placed! ×
Item request cannot be made. ×
loading  Processing Request

METHODS AND SYSTEMS FOR INDEXING EMBEDDING VECTORS REPRESENTING DISJOINT CLASSES AT ABOVE-BILLION SCALE FOR FAST HIGH-RECALL RETRIEVAL

Item request has been placed! ×
Item request cannot be made. ×
loading   Processing Request
  • Publication Date:
    January 2, 2025
  • Additional Information
    • Document Number:
      20250005896
    • Appl. No:
      18/214782
    • Application Filed:
      June 27, 2023
    • Abstract:
      This disclosure provides novel methods and systems for indexing embedding vectors representing disjoint classes at above-billion scale for fast high-recall retrieval. The disclosed methods and systems solved above-billion scale image (e.g., face image) search problems by building a new vector database, the new Nearest Neighbor Database (NNDB) based on a novel architecture for indexing vectors.
    • Assignees:
      Clearview AI, Inc. (New York, NY, US)
    • Claim:
      1. A method of indexing vectors of image data, comprising: a) distributing from a vector store a plurality of batches of vectors of the image data respectively to a first set of nodes, wherein each of the first set of nodes receives a batch of vectors; b) deduplicating vectors in the batch of the vectors in each of the first set of nodes by removing duplicated vectors to obtain a deduplicated vector batch, wherein the duplicated vectors have a similarity score with another vector in the batch of vectors that is equal to or greater than a similarity threshold; c) generating a vector index based on the deduplicated vector batch in each of the first set of nodes; d) performing steps a) to b) to obtain an additional batch-deduplicated vector batch, then deduplicating against the vector index, wherein the duplicated vectors have a similarity score with another vector in the vector index that is equal to or greater than the similarity threshold, and combining the index-deduplicated vector batch into the vector index in each of the set of nodes; e) repeating step d) until there is no remaining vector in the vector store; f) selecting one node from the first set of nodes, and replicating the vector index of the selected node in each of a second set of nodes, wherein the second set of nodes have fewer nodes than the first set of nodes; g) reconstructing vectors from the vector index of each of the first set of nodes excluding the selected node, combining the reconstructed vectors, and saving the combined reconstructed vectors to a new vector store; h) performing steps a), c), and e) to obtain a second vector index of each of the second set of nodes; and i) repeating steps f) to h) until only one node remains to obtain a final vector index of the image data.
    • Claim:
      2. The method of claim 1, wherein the vector index is generated based on a Hierarchical Navigable Small Worlds (HNSW) vector index data structure.
    • Claim:
      3. The method of claim 1, wherein the batch of vectors comprise vectors representing disjoint classes.
    • Claim:
      4. The method of claim 1, wherein the batch of vectors comprise k-dimensional feature vectors.
    • Claim:
      5. The method of claim 1, wherein the vector store comprises vectors at an above-billion scale.
    • Claim:
      6. The method of claim 1, wherein the second set of nodes has half the number of nodes of the first set of nodes.
    • Claim:
      7. The method of claim 1, wherein the similarity threshold is between about 0.2 and about 0.6.
    • Claim:
      8. The method of claim 7, wherein the similarity threshold is between about 0.35 and about 0.4.
    • Claim:
      9. The method of claim 1, wherein the image data comprises iris, face, fingerprint, hand/palm scan, or vein scan image data.
    • Claim:
      10. The method of claim 1, wherein prior to step a), the method comprises generating a k-dimensional feature vector for each of images in the image data.
    • Claim:
      11. The method of claim 10, wherein the k-dimensional feature vector is generated by a deep neural network (DNN).
    • Claim:
      12. The method of claim 1, wherein at step b), deduplicating vectors in the batch of the vectors comprises comparing each vector in the batch of vectors to each vector in the deduplicated vector batch, such that the deduplicated vector batch grows incrementally by adding the non-duplicated vectors from comparisons.
    • Claim:
      13. The method of claim 1, wherein at step d), deduplicating the additional batch-deduplicated vector batch against the vector index comprises approximating nearest-neighbor comparison using a Hierarchical Navigable Small Worlds (HNSW) vector index data structure.
    • Claim:
      14. A method of searching an image, comprising: receiving a query image; generating a query vector based on one or more features of the query image; generating, according to the method of claim 1, a vector index for vectors of image data in a database as an assigner index; generating an on-disk vector database, wherein bucket placements of the on-disk vector database are guided by the assigner index; and identifying candidate vectors through the assigner index and the vector database, wherein the query vector and the candidate vectors have similarity scores equal to or greater than a similarity threshold.
    • Claim:
      15. The method of claim 14, comprising linking a candidate image from the candidate vector.
    • Claim:
      16. A system for indexing vectors of image data, comprising: a non-transitory, computer-readable memory; one or more processors; and a computer-readable medium containing programming instructions that, when executed by the one or more processors, configure the system to: 1) distribute from a vector store a plurality of batches of vectors of the image data respectively to a first set of nodes, wherein each of the first set of nodes receives a batch of vectors; 2) deduplicate vectors in the batch of the vectors in each of the first set of nodes by removing duplicated vectors to obtain a deduplicated vector batch, wherein the duplicated vectors have a similarity score with another vector in the batch of vectors that is equal to or greater than a similarity threshold; 3) generate a vector index based on the deduplicated vector batch in each of the first set of nodes; 4) perform steps 1) to 2) to obtain an additional batch-deduplicated vector batch, then deduplicating against the vector index, wherein the duplicated vectors have a similarity score with another vector in the vector index that is equal to or greater than the similarity threshold, and combining the index-deduplicated vector batch into the vector index in each of the set of nodes; 5) repeat step 4) until there is no remaining vector in the vector store; 6) select one node from the first set of nodes, and replicate the vector index of the selected node in each of a second set of nodes, wherein the second set of nodes have fewer nodes than the first set of nodes; 7) reconstruct vectors from the vector index of each of the first set of nodes excluding the selected node, combining the reconstructed vectors, and saving the combined reconstructed vectors to a new vector store; 8) perform steps 1), 3), and 5) to obtain a second vector index of each of the second set of nodes; and 9) repeat steps 6) to 8) until only one node remains to obtain a final vector index of the image data.
    • Claim:
      17. The system of claim 16, wherein the vector index is generated based on a Hierarchical Navigable Small Worlds (HNSW) vector index data structure.
    • Claim:
      18. The system of claim 16, wherein the batch of vectors comprise vectors representing disjoint classes.
    • Claim:
      19. The system of claim 16, wherein the batch of vectors comprise k-dimensional feature vectors.
    • Claim:
      20. The system of claim 16, wherein the vector store comprises vectors at an above-billion scale.
    • Claim:
      21. The system of claim 16, wherein the second set of nodes has half the number of nodes of the first set of nodes.
    • Claim:
      22. The system of claim 16, wherein the similarity threshold is between about 0.2 and about 0.6.
    • Claim:
      23. The system of claim 22, wherein the similarity threshold is between about 0.35 and about 0.4.
    • Claim:
      24. The system of claim 16, wherein the image data comprises iris, face, fingerprint, hand/palm scan, or vein scan image data.
    • Claim:
      25. The system of claim 16, wherein the system is configured, prior to step 1), to generate a k-dimensional feature vector for each of images in the image data.
    • Claim:
      26. The system of claim 25, wherein the k-dimensional feature vector is generated by a deep neural network (DNN).
    • Claim:
      27. The system of claim 16, wherein at step 2) for deduplicating vectors in the batch of the vectors, the system is configured to compare each vector in the batch of vectors to each vector in the deduplicated vector batch, such that the deduplicated vector batch grows incrementally by adding the non-duplicated vectors from comparisons.
    • Claim:
      28. The system of claim 16, wherein at step 4) for deduplicating the additional batch-deduplicated vector batch against the vector index, the system is configured to approximate nearest-neighbor comparison using a Hierarchical Navigable Small Worlds (HNSW) vector index data structure.
    • Claim:
      29. A system of searching an image, comprising: a vector indexing engine comprising the system for indexing vectors of claim 16; a non-transitory, computer-readable memory; one or more processors; and a computer-readable medium containing programming instructions that, when executed by the one or more processors, configure the system to: receive a query image; generate a query vector based on one or more features of the query image; generate, by the vector indexing engine, a vector index for vectors of image data in a database as an assigner index; generate an on-disk vector database, wherein bucket placements of the on-disk vector database are guided by the assigner index; and identify candidate vectors through the assigner index and the vector database, wherein the query vector and the candidate vectors have similarity scores equal to or greater than a similarity threshold.
    • Claim:
      30. The system of searching an image of claim 29, configured to link a candidate image from the candidate vector.
    • Current International Class:
      06; 06
    • Accession Number:
      edspap.20250005896