Deduplication of pages by intent
improved
Deduplication is a key feature for Similar.ai. We list the clusters of pages that answer the same user need and, for each cluster of duplicates, choose the best page. When we integrate with our clients, they typically add a 301 redirect from each duplicate to its new canonical. In this way we get more 'juice from the squeeze': more traffic from fewer pages. Our goal is to have one page for each search intent. A search intent can be expressed by hundreds or thousands of keywords (check out our demo of categorising keywords into a search intent).
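As a rough illustration of this workflow, the sketch below groups pages by intent, picks one canonical page per cluster, and builds a redirect map that a client could turn into 301 rules. The `Page` fields, the example URLs, and the use of a `traffic` number to pick the "best" page are all assumptions made for illustration; they are not a description of how Similar.ai actually ranks pages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    intent: str   # hypothetical intent label assigned to the page
    traffic: int  # hypothetical quality signal; the real criterion is not stated


def build_redirects(pages):
    """Return {duplicate_url: canonical_url}, with one canonical per intent."""
    clusters = defaultdict(list)
    for page in pages:
        clusters[page.intent].append(page)

    redirects = {}
    for cluster in clusters.values():
        # Assumption: "best" means highest traffic; ties are broken by URL
        # so the choice is deterministic.
        canonical = max(cluster, key=lambda p: (p.traffic, p.url))
        for page in cluster:
            if page.url != canonical.url:
                redirects[page.url] = canonical.url
    return redirects


pages = [
    Page("/vw-golf-mk1", "volkswagen mk1 golf", 120),
    Page("/volkswagen-golf-mark-1", "volkswagen mk1 golf", 300),
    Page("/bmw-x5-7-seater", "bmw x5 7-seater", 80),
]
redirects = build_redirects(pages)
for source, target in redirects.items():
    print(f"301 {source} -> {target}")
```

A page that is alone in its cluster, such as the BMW page here, is its own canonical and gets no redirect.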
In this feature we made two updates:
  • We grouped pages with the same intent instead of listing which pages matched which keyword.
  • We used our new machine learning classifiers to match pages and keywords to intents, and expressed these intents as entities in our knowledge graph.
There are also two big advantages:
  • We can find duplicates even when pages have completely different names, omit superfluous words, use synonyms, or contain misspellings.
  • A page often targets many keywords, so in a keyword-by-keyword listing it could appear under several keywords, even though it can only redirect to one canonical page. With one group per intent, this conflict can no longer happen.
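To make the second advantage concrete, here is a minimal sketch comparing the two ways of grouping. The keyword-to-intent mapping stands in for the output of the classifiers and is entirely hypothetical, as are the URLs and keywords; the sketch also assumes that all of a page's keywords map to the same intent.

```python
from collections import defaultdict

# Hypothetical classifier output: keyword -> intent entity.
KEYWORD_TO_INTENT = {
    "vw golf mk1": "volkswagen mk1 golf",
    "volkswagen golf mark 1": "volkswagen mk1 golf",
    "golf mk1 for sale": "volkswagen mk1 golf",
}

# Hypothetical pages and the keywords each one targets.
PAGE_KEYWORDS = {
    "/vw-golf-mk1": ["vw golf mk1", "golf mk1 for sale"],
    "/volkswagen-golf-mark-1": ["volkswagen golf mark 1", "golf mk1 for sale"],
}


def group_by_keyword(page_keywords):
    """Old view: one group per keyword, so a page can sit in several groups."""
    groups = defaultdict(set)
    for url, keywords in page_keywords.items():
        for keyword in keywords:
            groups[keyword].add(url)
    return dict(groups)


def group_by_intent(page_keywords, keyword_to_intent):
    """New view: keywords collapse into intents, giving one group per intent."""
    groups = defaultdict(set)
    for url, keywords in page_keywords.items():
        for keyword in keywords:
            groups[keyword_to_intent[keyword]].add(url)
    return dict(groups)


by_keyword = group_by_keyword(PAGE_KEYWORDS)
by_intent = group_by_intent(PAGE_KEYWORDS, KEYWORD_TO_INTENT)

print(len(by_keyword), "keyword groups")
print(len(by_intent), "intent group(s)")
```

In the keyword view each page lands in two groups, so it is unclear which group should decide its canonical. In the intent view both pages fall into a single cluster, which leaves exactly one canonical to choose.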
For instance:
  • volkswagen mk1 golf, for automotive in the UK
  • chesterfield zetels, for homeware in Belgium
  • bmw x5 7-seater, for automotive in the UK
  • dames t-shirts, for clothing in the Netherlands
  • gucci riemen, for clothing in the Netherlands