January 26, 2023
Mining for Meaning, Part 2: Database Indexing
3 MIN READ

In the previous post, we discussed how to translate a variety of questions on semantic similarity into a single, tractable mathematical problem: which vector is closest to a query vector?

Figure: Assuming a 2-dimensional Euclidean space, we need to find the data point that is closest to the given query point.

One way to address the nearest-vector question is, of course, to compare the query vector with each data vector. For a 2-dimensional Euclidean space this would translate to applying the distance formula for each resulting pair of vectors; for most modern embeddings, the cosine similarity measure is preferred as a more accurate representation of semantic proximity (in fact, cosine similarity is equivalent to the Euclidean distance between normalized vectors). Regardless of the similarity measure used, this method is slow and doesn’t scale well at all — we need to iterate over each point in the dataset, calculate the distance/similarity between the query point and this data point, and then identify the closest data point. Effectively, this approach scales linearly with the size of the dataset. Can we do better?

The key to solving this problem is realizing that we can build a structure ahead of time, well before the query comes in, that sits with the database and allows us to answer the query quickly. Such a structure is called an index. Indexes are by no means specific to semantic search — they have been used for decades to speed up database queries by orders of magnitude. If you have a simple database table that maps your friends’ names to their addresses and phone numbers, you may build an index on the name column to allow queries (e.g. “What is Fred’s address?”) to be answered quickly. An index is a sorted data structure, and in the same way that sorted lists can be scanned in logarithmic time via binary search, database indexes allow the database engine to perform a log-time search rather than a linear scan through all records. Most database indexes are B-trees — hierarchical, rapidly-navigable data structures that allow the database engine to quickly locate the row corresponding to a query.

While traditional B-tree indexes are well suited to exact-match queries (“find the address corresponding to the name Fred”), they do not address similarity or closeness. But semantic similarity search, by definition, requires similar matches rather than exact ones. So, how can we build an index on a billion-vector dataset to efficiently answer the ‘find the closest vector’ query? Or, more generally, how can we find the k-closest vectors efficiently?

In the final part of this post, we’ll talk about some approaches to building the right index for semantic similarity search.

At Quilt.AI, we use machine learning models to analyze semantic relationships between text, images, and ideas. Reach out to us at [email protected] for more information.

south_east

synthesizing vast data into actionable insights that reflect each market's unique cultural and economic backdrop

south_east

grasping the distinct consumer perspectives that these diverse regions offer

Curated digital profiles:

-Instagram, Twitter, and TikTok (US)

-Weibo and Douyin (China)

Pulled 400 million unique searches to estimate the growth of each segment

Used Quilt.AI’s Sphere language and image capabilities to categorise lifestyle areas into specific segments

Glamour Seekers

These consumers are confident, bold, and comfortable with modern masculinity. They also often turn to social media to express their personal style and interests.

Actionable Insight: Collaborate with high-profile fashion influencers to create vibrant, trend-setting campaigns that resonate with this segment's desire for attention and admiration.

Vanity Vanguards

Highly image-driven, these individuals often seek validation through their appearance and are likely to engage heavily with both grooming and fashion products.

Actionable Insight:Leverage digital marketing strategies that feature before-and-after visuals and testimonials that showcase the transformative power of the products

Conscious Icons

These men aim to be recognized as modern, open-minded, and sensitive – embodying the image of "the woke good guy" in today's society by actively participating in movements related to activism and gender equality.

Actionable Insight:Design marketing campaigns that highlight their participation in these movements, showcasing products that enable them to express and amplify their desired social identities.

Youthful Trendsetters

They value beauty while still maintaining traditional masculine ideals of what it means to be good-looking. These men also tend to seek out methods of maintaining their youthful appearances.

Actionable Insight:Market products that boost physical appeal and suit active lifestyles, and focus on dynamic marketing that highlights masculine elegance.

Trusted Patrons

Despite seeing gender in traditionally binary terms, these men aren’t afraid of behaving in more feminine manners. They own their uniqueness and tend to be deeply loyal to brands that affirm their identity.

Actionable Insight:Focusing on brand narratives that celebrate individuality and personal expression will better engage this segment. Brands can also offer personalized services to maintain their commitment.

Innovation Advocates

Despite seeing gender in traditionally binary terms, these men aren’t afraid of behaving in more feminine manners. They own their uniqueness and tend to be deeply loyal to brands that affirm their identity.

Actionable Insight:Focusing on brand narratives that celebrate individuality and personal expression will better engage this segment. Brands can also offer personalized services to maintain their commitment.

Visuals illustrated are to bring concepts to life only.
Visuals illustrated are to bring concepts to life only.
Visuals illustrated are to bring concepts to life only.
Request a Sphere demo
Transform the way you understand online data
Try now
arrow_upward