Data Item Similarity Report

Modified on Sat, 10 Jan at 3:53 PM

The Data Item Similarity report identifies items in your Knowledge Graph that are highly similar to each other based on how they are connected. It is a data quality report that helps you detect:

  • Potential duplicates

  • Near-duplicates

  • Overlapping or redundant entities

  • Items that should be merged, linked, or clarified


Part of Knowledge Graph Health Reports


What This Report Shows

Each row represents a pair of items that the system has identified as being similar.

Similarity is determined by comparing:

  • The items’ relationships

  • Their position in the Knowledge Graph

  • The overlap in how they connect to other items

In short:

Two items are similar if they are connected to many of the same things.
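One common way to turn this idea into a number is Jaccard similarity: the count of shared connections divided by the count of all connections either item has. The sketch below is an illustration only. This article does not state which measure the report uses, and the function name, item identifiers, and sample graph are all hypothetical.

```python
def jaccard_similarity(neighbors_a: set[str], neighbors_b: set[str]) -> float:
    """Share of connected items that both items have in common (0.0 to 1.0)."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0  # neither item has connections, so there is nothing to compare
    return len(neighbors_a & neighbors_b) / len(union)


# Hypothetical graph: each item maps to the set of items it is connected to.
graph = {
    "event/spring-gala-2024": {"org/acme", "place/town-hall", "person/jane-doe"},
    "event/spring-gala-2025": {
        "org/acme",
        "place/town-hall",
        "person/jane-doe",
        "person/john-roe",
    },
    "person/jane-doe": {"org/acme"},
}

# The two events share 3 of 4 distinct connections.
score = jaccard_similarity(
    graph["event/spring-gala-2024"], graph["event/spring-gala-2025"]
)
print(score)  # 0.75
```

Under this assumed measure, the two events score high because almost everything one connects to, the other connects to as well.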


How to Read the Report

Each row shows one item and its most similar counterpart.

Columns Explained

Similarity Score

A score between 0 and 1 indicating how similar the two items are.

  • Higher score = more overlap in relationships

  • Scores above ~0.3 typically indicate meaningful similarity

  • Scores closer to 1 suggest near-duplicates

This is a relative signal, not a definitive verdict.
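If you export the report and want to label rows in bulk, the guidance above can be sketched as a small helper. The 0.3 threshold comes from this article. The 0.9 cutoff for "closer to 1" is an assumption chosen for illustration, and the function name and labels are hypothetical.

```python
def describe_score(score: float, near_duplicate_cutoff: float = 0.9) -> str:
    """Map a similarity score to a rough review label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity scores are expected to fall between 0 and 1")
    if score >= near_duplicate_cutoff:  # assumed cutoff, not defined by the report
        return "likely near-duplicate"
    if score > 0.3:  # threshold mentioned in this article
        return "meaningful similarity"
    return "weak similarity"


for value in (0.95, 0.5, 0.1):
    print(value, describe_score(value))
```

Treat the labels as a starting point for review rather than a decision, and adjust the cutoff to suit your own graph.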


Entity Types

The type of the primary item.

Examples:

  • Event

  • Person

  • Thing

  • ProfilePage


Similar Entity Types

The type of the matched item.

Matching types often indicate duplication.
Mismatched types can signal modeling or authoring issues.


Entity Name

The name of the primary item.


Similar Entity Name

The name of the most similar item.

This is often where issues become immediately obvious (e.g. two events with different dates but identical structure).


Entity IRI

The unique identifier of the primary item. 


Similar Entity IRI

The unique identifier of the similar item.

Together, the two IRIs confirm that the system is comparing distinct graph items, not aliases.


Actions

Contextual actions you can take, such as:

  • Reviewing the items side by side

  • Editing or consolidating entities

  • Correcting relationships or types


Why Similarity Matters for Data Quality

High similarity usually indicates one of the following:

1. Duplicate Entities

Two separate items represent the same real-world thing.

Examples:

  • Two versions of the same event

  • Multiple entities for the same person


2. Fragmented Modeling

The same concept is split across multiple items, each partially connected.

This weakens:

  • Entity authority

  • Graph clarity

  • Downstream insights


3. Legitimate Variants (That Need Clarity)

Some similar items are valid but require:

  • Clear differentiation

  • Stronger contextual relationships

  • More precise naming or typing


What You Should Do Next

For each similar pair, decide one of three actions:

1. Merge

If both items represent the same real-world thing:

  • Consolidate into a single item

  • Preserve the best relationships and properties

  • Remove or redirect the duplicate
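The merge steps above can be sketched in code to show what "preserve" and "redirect" mean in practice. This is a minimal illustration, not the product's merge feature: the Item class, its fields, and the example IRIs are all hypothetical, and it assumes that the kept item's property values win when both items define the same property.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    iri: str
    name: str
    properties: dict[str, str] = field(default_factory=dict)
    # Each relationship is a (predicate, target IRI) pair.
    relationships: set[tuple[str, str]] = field(default_factory=set)


def merge_items(keep: Item, duplicate: Item) -> tuple[Item, dict[str, str]]:
    """Consolidate two items and return the merged item plus a redirect map."""
    merged = Item(
        iri=keep.iri,
        name=keep.name,
        # Start from the duplicate's properties, then let the kept item override.
        properties={**duplicate.properties, **keep.properties},
        # Keep every relationship that either item had.
        relationships=keep.relationships | duplicate.relationships,
    )
    redirects = {duplicate.iri: keep.iri}  # old IRI -> surviving IRI
    return merged, redirects


keep = Item(
    iri="https://example.org/person/1",
    name="Jane Doe",
    properties={"jobTitle": "Editor"},
    relationships={("worksFor", "https://example.org/org/acme")},
)
duplicate = Item(
    iri="https://example.org/person/2",
    name="J. Doe",
    properties={"jobTitle": "Writer", "email": "jane@example.org"},
    relationships={
        ("worksFor", "https://example.org/org/acme"),
        ("attended", "https://example.org/event/9"),
    },
)

merged, redirects = merge_items(keep, duplicate)
```

Here the merged item keeps the title "Editor", gains the email and the attended relationship from the duplicate, and the redirect map records where the removed IRI should now point.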


2. Differentiate

If both items are valid but distinct:

  • Strengthen distinguishing relationships

  • Improve names or descriptions

  • Add clarifying properties (dates, roles, context)


3. Ignore (Intentionally)

Some similarity is expected (e.g. recurring events or series).
In these cases:

  • Confirm the similarity is intentional

  • No action may be required


Best Practices

  • Review this report after resolving orphan nodes

  • Focus first on high similarity scores

  • Prioritize items with:

    • The same type

    • Very similar names

    • High importance in the graph

  • Use this report regularly as part of graph hygiene
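As a rough sketch of the prioritization above, the snippet below orders exported rows by score first, then by matching type, then by name similarity. The row structure and field names are hypothetical, the sample rows are made up, and name similarity is estimated with Python's standard difflib module, which is a choice made for this example rather than anything the report itself does. An importance value, if you have one, could be added to the sort key in the same way.

```python
from difflib import SequenceMatcher

# Hypothetical rows, loosely mirroring the report's columns.
rows = [
    {"score": 0.35, "type": "Thing", "similar_type": "Thing",
     "name": "Widget", "similar_name": "Gadget"},
    {"score": 0.92, "type": "Person", "similar_type": "ProfilePage",
     "name": "Jane Doe", "similar_name": "Jane Doe Profile"},
    {"score": 0.92, "type": "Event", "similar_type": "Event",
     "name": "Spring Gala 2024", "similar_name": "Spring Gala 2025"},
]


def review_priority(row: dict) -> tuple[float, bool, float]:
    """Sort key: higher score, then same type, then more similar names."""
    same_type = row["type"] == row["similar_type"]
    name_similarity = SequenceMatcher(
        None, row["name"].lower(), row["similar_name"].lower()
    ).ratio()
    return (row["score"], same_type, name_similarity)


review_queue = sorted(rows, key=review_priority, reverse=True)
for row in review_queue:
    print(row["score"], row["name"], "<->", row["similar_name"])
```

With these sample rows, the two pairs scoring 0.92 come first, and the same-type Event pair is placed ahead of the Person and ProfilePage pair.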


How This Fits with Other Health Reports

Used together with the other Knowledge Graph Health Reports, this report helps you move from having a Knowledge Graph to maintaining a high-quality one.


Summary

The Data Item Similarity report helps you:

  • Detect duplicates and near-duplicates

  • Improve modeling precision

  • Strengthen Knowledge Graph integrity

If importance tells you what matters, similarity tells you where quality is at risk.
