January 25, 2023
The Who: Filtering Personal Data

The Internet is the world’s largest repository of user-generated data. Webpages, social media, forums, reviews, blog posts, and search data, when analyzed at scale, can reveal profound insights into consumer preferences and behavior. We at Quilt.AI specialize in interpreting the Internet to lead organizations toward better business decisions.

The Problem

How do we make sure that the information we gather is anonymized and does not infringe on individual privacy? Although all of the information we collect is publicly available, the open Web holds a large quantity of personally identifiable information (PII), including names, phone numbers, and email addresses.

We are extra careful about personal privacy for both ethical and compliance reasons. We neither want nor need PII to extract insights for a given demographic; we only need aggregate data. We are not in the business of the individual; we are in the business of the cohort. So, how do we discard PII seamlessly while retaining important information?

The Solution

To address the issue of data anonymization, we needed to build a PII filter. When an item of content is pulled from the Internet, the PII filter should discard any sensitive data that might exist in that content item.

An engineering team might choose to process each content item with hand-written rules that perform

  1. a keyword lookup using a table containing all possible values for specific PII categories (e.g., a table containing all possible names), and
  2. a pattern search to identify phone numbers and email addresses.
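
The two steps above can be sketched with the Python standard library alone. This is only an illustration: the KNOWN_NAMES table, the two regular expressions, and the naive_pii_filter function are hypothetical, and the patterns are deliberately simple rather than complete validators.

```python
import re

# Hypothetical lookup table; a real one would have to list every possible name.
KNOWN_NAMES = {"alice", "bob", "luke"}

# Deliberately simple illustrative patterns, not production-grade validators.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def naive_pii_filter(text: str) -> str:
    """Mask emails and phone numbers by pattern, then names by table lookup."""
    # Patterns run first so that a name inside an email address is not
    # masked on its own, which would stop the email pattern from matching.
    text = EMAIL_RE.sub("<EMAIL_ADDRESS>", text)
    text = PHONE_RE.sub("<PHONE_NUMBER>", text)

    def mask_name(match: re.Match) -> str:
        word = match.group(0)
        return "<PERSON>" if word.lower() in KNOWN_NAMES else word

    # Check every alphabetic word against the lookup table.
    return re.sub(r"[A-Za-z]+", mask_name, text)


print(naive_pii_filter("Email luke@example.com or call +1 415 555 0100; ask for Alice."))
# Email <EMAIL_ADDRESS> or call <PHONE_NUMBER>; ask for <PERSON>.
print(naive_pii_filter("My name is Apple"))
# My name is Apple
```

The second call returns its input unchanged, because 'Apple' is not in the table and the filter has no notion of context.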

This solution is not very robust: a previously unseen name would not be recognized by such a filter. It is also not context-aware: in a sentence like “My name is Apple”, ‘Apple’ is a PII item and should therefore be discarded, but a simple lookup would not achieve this.

With recent advances in machine learning (ML), and specifically in natural language processing, it is possible to filter PII in a more contextual way. For our use case, we found the right tool in Presidio, an open-source Python library from Microsoft that offers pre-trained models for identifying and removing PII from text.

Using Presidio

To use Presidio, we need two packages: presidio-analyzer and presidio-anonymizer. The former does the heavy processing and outputs a list of detected items; the latter uses that list to replace each detected item in the sentence with the appropriate tag. Both packages can be installed with the usual pip command, pip install presidio-analyzer presidio-anonymizer.

After both packages are installed, we need to download a model. Presidio can use either spaCy (default) or Stanza. After looking into the different models available from these repositories, we decided to stay with the default spaCy engine and the default English model. In Presidio's default configuration this model is en_core_web_lg, which can be downloaded with python -m spacy download en_core_web_lg.

After the model is downloaded, we need to load it and specify the entities we want to detect as well as the language of our input text. The entities and languages each model supports are listed in the respective project documentation. For our test case, we’ll use the PERSON and EMAIL_ADDRESS entities and the English language.

Let’s instantiate the model, then pass a sentence to it, and see the results:
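
A minimal sketch of this step might look like the following, assuming presidio-analyzer and its default spaCy English model are installed. The helper names find_pii and summarize are hypothetical; AnalyzerEngine and its analyze method come from presidio-analyzer.

```python
def find_pii(text, entities=("PERSON", "EMAIL_ADDRESS"), language="en"):
    """Return Presidio's list of results for the requested entity types."""
    # Imported inside the function so this sketch can be loaded even where
    # Presidio is not installed.
    from presidio_analyzer import AnalyzerEngine

    # Default configuration: spaCy NLP engine with the default English model.
    # In practice, build the engine once and reuse it across content items.
    analyzer = AnalyzerEngine()
    return analyzer.analyze(text=text, entities=list(entities), language=language)


def summarize(results, text):
    """Turn each result into (entity type, start, end, matched text)."""
    return [(r.entity_type, r.start, r.end, text[r.start:r.end]) for r in results]
```

With the packages installed, a call such as summarize(find_pii(sentence), sentence) on a test sentence of your choice lists each detected entity next to the characters it covers.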

The analyzer outputs a list containing all identified PII entities. Each result carries the entity type, the start and end positions of the entity within the sentence, and a confidence score.

After this, we instantiate our Presidio anonymizer and feed the results of the analyzer to it:
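
As a rough sketch, the two stages can be chained as shown here, assuming both Presidio packages and the default English model are installed. The function name mask_pii is hypothetical; AnonymizerEngine and its anonymize method come from presidio-anonymizer.

```python
def mask_pii(text, entities=("PERSON", "EMAIL_ADDRESS"), language="en"):
    """Detect the requested entities and return the text with them masked."""
    # Imported inside the function so this sketch can be loaded even where
    # Presidio is not installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    # Engines are created per call to keep the sketch short; building them
    # once and reusing them avoids reloading the model for every item.
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    results = analyzer.analyze(text=text, entities=list(entities), language=language)
    # With no operators specified, each detected span is replaced by a tag
    # naming its entity type.
    return anonymizer.anonymize(text=text, analyzer_results=results).text
```

Custom replacement behavior, such as redacting or hashing instead of tagging, can be configured through the anonymizer's operators argument.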

The final result is the original text with each PII entity replaced by a tag naming its type, such as <PERSON> or <EMAIL_ADDRESS>.


Concluding Thoughts

Once we have our anonymized text, we can proceed with our analytics (sentiment, semiotics, etc.) with no risk of infringing on individual privacy.

An open question remains around the “lossiness” of the PII filter. Since sentences such as “I love Luke but hate Anakin” would be transformed to “I love <PERSON> but hate <PERSON>”, do we actually dilute our insights when using the PII filter? While the intuitive answer is yes, it is interesting to note that for large real-world Internet datasets we did not find a large difference in the quality of insights obtained. This is likely attributable to the nature of our datasets: we choose data about brands, places, and experiences, not data about people. Nevertheless, an intelligent masking system that distinguishes between PERSON1 and PERSON2 might be useful to explore.
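
One way such numbered masking could work is sketched below. This is a hypothetical illustration and not something we have built: the function mask_with_indices and its (entity_type, start, end) input format are assumptions, chosen to mirror the fields the analyzer reports.

```python
def mask_with_indices(text, spans):
    """Replace each span with a numbered tag such as <PERSON1>.

    `spans` is an iterable of non-overlapping (entity_type, start, end)
    tuples. The same surface string always receives the same number.
    """
    numbering = {}  # (entity_type, lowercased surface text) -> index
    counters = {}   # entity_type -> number of distinct values seen so far
    pieces = []
    cursor = 0
    for entity_type, start, end in sorted(spans, key=lambda s: s[1]):
        key = (entity_type, text[start:end].lower())
        if key not in numbering:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            numbering[key] = counters[entity_type]
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"<{entity_type}{numbering[key]}>")
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last span
    return "".join(pieces)


sentence = "I love Luke but hate Anakin"
print(mask_with_indices(sentence, [("PERSON", 7, 11), ("PERSON", 21, 27)]))
# I love <PERSON1> but hate <PERSON2>
```

Numbering restarts for every content item, so the tags preserve who is who within a sentence without linking mentions across items.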

At Quilt.AI, we use machine learning to extract cultural meaning from publicly-available, anonymized Internet data. Reach out to us at [email protected] for more information!
