Linked Entity Recognition (LER) Explainer

Modified on Thu, 05 Jan 2023 at 08:28 AM

Linked Entity Recognition (LER) is one method Schema App uses for its Entity Linking service. This document describes what the feature is, what it does, and how Schema App implements it. It also covers content considerations and best practices, and notes current limitations and planned enhancements.



What is Linked Entity Recognition?

Linked Entity Recognition (LER) is the automated process of identifying named entities in text and then linking them to external identifiers from authoritative knowledge bases (like Wikipedia and the Google Knowledge Graph). These identifiers are automatically embedded within your schema markup.


What does Linked Entity Recognition do?

Once embedded in your markup, these entities provide additional semantic value to your metadata. They help Google (and other web crawlers) better understand what your content is about by pulling in known entities. This reduces ambiguity in the interpretation of your content and supports more accurate matching to user queries.


How does Linked Entity Recognition work?

LER can be applied to Schema App’s Highlighter templates. We find linked entities using an API. If an entity is recognized, the API will often (but not always!) return matching entities from authoritative knowledge bases such as Wikipedia and the Google Knowledge Graph. These are connected to your content using the sameAs property.

How is Linked Entity Recognition implemented?

First, you’ll consult with your Customer Success Manager (CSM) to find a pageset that follows the Best Practices for Content Structure. The best options are usually areas of a webpage which have longer-form descriptions, or categories that mention entities (e.g. people, places, brands or organizations).


Your CSM will then test some entities across the pageset to see if running them through the API yields any matches. If they do, the LER highlight will be added to the Highlighter template for your chosen pageset. The highlight will be appended to a Schema.org type for the chosen entity.


For example, if a series of blog posts is about different brands of products, the LER highlight would be added to the Product’s Brand schema. The existing Brand schema would then be enhanced with a sameAs property linking it to Wikipedia and/or Google Knowledge Graph entities.
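As a rough illustration of the Brand example above, the sketch below builds a Product node in Python and adds a sameAs list to its Brand. The helper name `add_same_as` is hypothetical, and both URLs are illustrative stand-ins (the second is a deliberate placeholder), not identifiers returned by Schema App or its API.

```python
import json


def add_same_as(node, links):
    """Return a copy of a schema.org node with the given links merged into sameAs."""
    enhanced = dict(node)
    current = enhanced.get("sameAs", [])
    if isinstance(current, str):  # sameAs may be a single string or a list
        current = [current]
    merged = list(current)
    for link in links:
        if link not in merged:  # avoid duplicate identifiers
            merged.append(link)
    enhanced["sameAs"] = merged
    return enhanced


product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "DEWALT Universal Mitre Saw Stand",
    "brand": {"@type": "Brand", "name": "DEWALT"},
}

# Illustrative identifiers only; real values would come from the entity lookup.
links = [
    "https://en.wikipedia.org/wiki/DeWalt",
    "https://example.com/knowledge-graph/PLACEHOLDER-ID",
]

product["brand"] = add_same_as(product["brand"], links)
print(json.dumps(product, indent=2))
```

The rest of the Brand node is left untouched, which mirrors the idea that LER enhances existing markup rather than replacing it.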




General Content Considerations

1. Use standardized names

When possible, use terms that can be found on authorities like Wikipedia or Wikidata. Provide additional content that includes language familiar to users. This way your content is optimized for both entity SEO and content SEO.

2. Capitalization Is Important!

Proper nouns are differentiated from common nouns with the same name by capitalization. Ensure that proper nouns are consistently capitalized to facilitate matches with Linked Entity Recognition.

Example: Apple, the “American multinational technology company” and apple, the “fruit of the apple tree”.
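One way to audit capitalization before LER is applied is a simple text check like the sketch below. It is only a content-review aid under our own assumptions (the function name `find_capitalization_variants` is hypothetical), and it cannot tell a miscapitalized proper noun from a legitimate common noun, so a person still needs to review each hit.

```python
import re


def find_capitalization_variants(text, canonical):
    """List occurrences of `canonical` whose capitalization differs from it."""
    pattern = re.compile(r"\b" + re.escape(canonical) + r"\b", re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(text) if m.group(0) != canonical]


text = "Apple released a new phone. Many apple fans lined up; APPLE stores were full."
print(find_capitalization_variants(text, "Apple"))  # ['apple', 'APPLE']
```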

3. Only include entities that are relevant to your content (E-A-T)

While the inclusion of entities in your schema can improve site performance, we caution against overuse of terms that are not directly related to page contents. Entities should either be:

  1. The primary entity on the page

  2. Linked to the primary entity with an accurate property

4. Consider surrounding content

What other entities are relevant to your primary entity? Take “Visa” for example. When surrounded by keywords like “country”, “passport”, and “citizenship”, it is identified as a travel visa, whereas keywords such as “debit”, “transaction”, or “payment” make it clear that the entity being mentioned is the card payment organization Visa Inc.

This approach isn’t that different from keyword clusters in content SEO. The only difference is that it takes NLP APIs into account alongside human users searching for content. 
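To make the “Visa” example concrete, here is a toy sketch that scores two senses by counting nearby context keywords. It is not how the NLP API works internally; the keyword sets come from the example above, and the names `CONTEXT_KEYWORDS` and `guess_sense` are hypothetical.

```python
import re

# Context keywords taken from the "Visa" example; the sense labels are our own.
CONTEXT_KEYWORDS = {
    "travel visa": {"country", "passport", "citizenship"},
    "Visa Inc.": {"debit", "transaction", "payment"},
}


def guess_sense(text):
    """Return the sense whose keywords appear most often, or None if none appear."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & keywords) for sense, keywords in CONTEXT_KEYWORDS.items()}
    best = max(scores, key=scores.get)  # on a tie, the first sense listed wins
    return best if scores[best] > 0 else None


print(guess_sense("You need a passport and a visa to enter the country"))
print(guess_sense("Pay with Visa debit for every transaction"))
```

With no supporting keywords at all, the function returns None, which reflects why a bare mention of an ambiguous name is hard to link.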


Best Practices for Content Structure

In its current iteration, LER is most successful when it follows these best practices. At the moment, these are largely dictated by the constraints listed in the “Current Limitations” section (below).


1. Differentiation

Ensure entities are differentiated in the HTML. The exact character string can be matched by:

  1. Nesting the entity in a particular HTML element (e.g. <strong>, <a>) which CSMs can target with an XPath

  2. Keeping the entity in the same location and surrounded by the same/similar character strings (e.g. Preferred rates for University of Guelph alumni; Preferred rates for University of Toronto alumni). Rules can be applied in the XPath to omit extraneous characters provided they are the same across a pageset.
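The two differentiation patterns above can be sketched with Python's standard library. This is an approximation under stated assumptions: `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, the sample snippets and the `offer` class are invented for illustration, and the helper names are hypothetical. The Highlighter's own XPath handling may differ.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample markup for the two patterns.
html_nested = "<div><p>Made by <strong>DEWALT</strong> in 2023</p></div>"
html_fixed = "<div><p class='offer'>Preferred rates for University of Guelph alumni</p></div>"


def entity_from_element(markup, path):
    """Pattern 1: return the text of the element selected by a simple XPath."""
    node = ET.fromstring(markup).find(path)
    return node.text.strip() if node is not None and node.text else None


def entity_between(markup, path, prefix, suffix):
    """Pattern 2: select an element, then drop the fixed text around the entity."""
    text = entity_from_element(markup, path)
    if text is None:
        return None
    match = re.fullmatch(re.escape(prefix) + r"(.+)" + re.escape(suffix), text)
    return match.group(1).strip() if match else None


print(entity_from_element(html_nested, ".//strong"))
print(entity_between(html_fixed, ".//p[@class='offer']", "Preferred rates for ", " alumni"))
```

The second helper only works when the surrounding strings are identical on every page, which is the same condition the text places on XPath rules.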

2. Strict Type Checking

LER looks for entities with a predetermined type in order to provide exact matches with 100% confidence. This means LER works best on content where the same type of entity is found in the same location across a pageset.

  • E.g. Ecommerce site with branded products that follows the formula “[BRAND] [PRODUCT NAME]”:

    <body>
      <h1>DEWALT Universal Mitre Saw Stand</h1>
      … 
    </body>


  • E.g. Insurance agency with products that target specific locations and follow the formula “[LOCATION] [INSURANCE TYPE]”:

    <body>
      <h2>Illinois Car Insurance</h2>
      … 
    </body>
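The two heading formulas above could be expressed as one rule per pageset, where each rule fixes both the text pattern and the entity type in advance. The sketch below is a simplification under our own assumptions: the `RULES` table and `extract_typed_entity` are hypothetical, the brand is assumed to be the first word of the heading, and “Place” is used as a generic schema.org type for the location.

```python
import re

# One rule per pageset: a heading pattern plus a predetermined schema.org type.
RULES = {
    "product": {"pattern": r"^(?P<entity>\S+) .+$", "type": "Brand"},
    "insurance": {"pattern": r"^(?P<entity>.+?) Car Insurance$", "type": "Place"},
}


def extract_typed_entity(heading, rule):
    """Return the entity found in a heading, tagged with the rule's fixed type."""
    match = re.match(rule["pattern"], heading.strip())
    if match is None:
        return None  # the heading does not follow the expected formula
    return {"@type": rule["type"], "name": match.group("entity")}


print(extract_typed_entity("DEWALT Universal Mitre Saw Stand", RULES["product"]))
print(extract_typed_entity("Illinois Car Insurance", RULES["insurance"]))
```

A heading that breaks the formula yields no entity at all, which is the trade-off of strict type checking: fewer matches, but no guessing about the type.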


LER V1 Limitations

1. Limitations of API

The API used for Linked Entity Recognition will occasionally match to entities that are incorrect (for example, matching to Hamilton the person, rather than Hamilton the place). In cases like this, it’s best to omit the impacted URL from the Highlighter pageset to avoid deploying inaccurate markup. This is a limitation of the API itself, not Schema App. 
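A review step for this kind of mismatch might look like the sketch below, which separates URLs whose matched entity has an unexpected type so they can be omitted from the pageset. The result structure (a dict of URL to match, with a `type` key) and the function name `split_pageset` are assumptions made for illustration; the API's real response format is not described here, and a type check will not catch every incorrect match.

```python
def split_pageset(matches, expected_type):
    """Split URLs into (keep, omit) based on the type of the matched entity."""
    keep, omit = [], []
    for url, match in matches.items():
        if match is not None and match.get("type") != expected_type:
            omit.append(url)  # wrong kind of entity: leave this URL out
        else:
            keep.append(url)  # correct type, or no match at all
    return keep, omit


# Hypothetical results for a pageset where every entity should be a place.
matches = {
    "https://example.com/locations/hamilton": {"name": "Hamilton", "type": "Person"},
    "https://example.com/locations/guelph": {"name": "Guelph", "type": "Place"},
    "https://example.com/locations/unknown": None,
}

keep, omit = split_pageset(matches, "Place")
print("omit from pageset:", omit)
```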

2. Content structure

Because LER relies on patterns (XPaths) to find entities in content, it is limited to content that is similarly structured across a pageset. As a result, content structure may need to be modified to accommodate LER implementation.

3. Reporting

Current reporting allows us to survey entire pagesets where an LER highlight has been implemented on a Highlighter template. However, not all URLs will receive matches through LER, so pages must be reviewed individually to confirm whether entities have been applied to the markup (an enhancement is currently planned to resolve this).
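Until that enhancement arrives, a per-page check could be scripted along the lines of the sketch below, which looks for any sameAs value in a page's JSON-LD. It assumes you have already collected each page's JSON-LD as a string; the function names and sample data are hypothetical. Note that it cannot tell a sameAs added by LER from one that was already in the markup.

```python
import json


def has_same_as(node):
    """Return True if a non-empty sameAs appears anywhere in the JSON-LD tree."""
    if isinstance(node, dict):
        if node.get("sameAs"):
            return True
        return any(has_same_as(value) for value in node.values())
    if isinstance(node, list):
        return any(has_same_as(item) for item in node)
    return False


def pages_missing_entities(markup_by_url):
    """List the URLs whose JSON-LD contains no sameAs value."""
    return [url for url, raw in markup_by_url.items() if not has_same_as(json.loads(raw))]


# Hypothetical sample: one page with a linked Brand, one without.
markup_by_url = {
    "https://example.com/a": '{"@type": "Product", "brand": {"@type": "Brand", "name": "DEWALT", "sameAs": ["https://en.wikipedia.org/wiki/DeWalt"]}}',
    "https://example.com/b": '{"@type": "Product", "brand": {"@type": "Brand", "name": "Unknown Brand"}}',
}

print(pages_missing_entities(markup_by_url))
```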


LER V2 Enhancement

An enhancement is currently planned to improve the following aspects of LER:

  1. Find multiple entities (not just one per highlight)

  2. Increase flexibility of entity location across content

  3. Provide flexibility of type (no longer predetermined, but does require the use of a property with the range of Thing to permit any category)

  4. Provide additional information about an entity, such as:

    1. Name - Name of the Thing

    2. Type - Category of the Thing

    3. URI - “Dereferenceable” in that we can reference it and know exactly which entity we’re referencing (more accurate than a “name” which can be shared across many entities and from many sources)

    4. Salience - “The salience score for an entity provides information about the importance or centrality of that entity to the entire document text.”
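The four pieces of information listed above could be modelled as a small record, as in the sketch below. The class and field names are hypothetical, the URIs are placeholders, and the salience values are made up for illustration; none of this reflects a published LER V2 data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedEntity:
    name: str          # name of the Thing
    entity_type: str   # category of the Thing
    uri: str           # identifier that pins down exactly one entity
    salience: float    # importance of the entity to the whole document


def most_salient(entities, limit=1):
    """Return the `limit` entities with the highest salience scores."""
    return sorted(entities, key=lambda e: e.salience, reverse=True)[:limit]


# Two entities share a name; only the URI and type tell them apart.
entities = [
    LinkedEntity("Hamilton", "Place", "https://example.org/entity/hamilton-place", 0.62),
    LinkedEntity("Hamilton", "Person", "https://example.org/entity/hamilton-person", 0.08),
    LinkedEntity("Ontario", "Place", "https://example.org/entity/ontario", 0.30),
]

for entity in most_salient(entities, limit=2):
    print(entity.name, entity.entity_type, entity.uri)
```

Sorting by salience is one plausible use of the score, for example to decide which of several detected entities is central enough to include in the markup.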
