feat(metrics): add new metric and semantic model entities #18134

Open
ani-malgari wants to merge 1 commit into master from feature/cat-2540_implement-new-entity-models-for-metric-and-semantic-model

@ani-malgari

Contributor

Summary

Introduces metric and semanticModel as first-class DataHub entities. This is the
schema-layer foundation for the upcoming Metrics Catalog feature (already scaffolded
in the frontend behind the metricsEnabled feature flag).

The model follows the "lean OSI-core" shape from the internal Metrics RFC: OSI-shaped
core fields plus a vendor-namespaced customExtensions bag for anything
platform-specific, rather than promoting every platform quirk to a first-class field.
Governance and lineage reuse existing native aspects (ownership, domains,
upstreamLineage, etc.).

Entities

metric — key (platform, id) → urn:li:metric:(dbt,total_revenue).

Aspects:

  • metricInfo — name, description, semantic-model ref, multi-dialect SQL expression,
    AI context, recoverability, vendor extensions.
  • metricRelationships — parentMetric (glossary-style tree pointer, powers the
    sidebar), derivedFrom (metric→metric lineage via Edge with isLineage: true),
    relatedMetrics (curated "see also" edges, non-lineage).
  • Native aspects: upstreamLineage, ownership, domains, globalTags,
    glossaryTerms, institutionalMemory, structuredProperties, status,
    deprecation, lifecycleStage, dataPlatformInstance, subTypes, forms,
    testResults, documentation, browsePaths, browsePathsV2, applications,
    container, incidentsSummary, displayProperties, assetSettings,
    versionProperties, access.

semanticModel — key (platform, id) → urn:li:semanticModel:(dbt,sales_orders).

Aspects:

  • semanticModelInfo — name, description, sourcePlatform (URN, mirrors
    datasetKey.platform), AI context, datasets[] (with fields and dimensions),
    join relationships[], vendor extensions.
  • Same governance aspects as above (minus lineage/incident-oriented ones).

Both keys treat platform as required and immutable (part of the URN). If the source
platform is unknown at ingest time, the ingestion layer is expected to assign a stable
fallback (e.g. datahub for native / SDK-authored metrics).
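
As a rough illustration of the key shape described above, the sketch below builds and parses `(platform, id)` URNs in plain Python. The helper names (`resolve_platform`, `make_entity_urn`, `parse_entity_urn`) are hypothetical and are not part of the DataHub SDK; only the two example URNs and the `datahub` fallback come from this description.

```python
import re
from typing import Optional, Tuple

# Hypothetical helpers for illustration only; not part of the DataHub SDK.
_URN_RE = re.compile(r"^urn:li:(metric|semanticModel):\(([^,()]+),([^,()]+)\)$")


def resolve_platform(platform: Optional[str]) -> str:
    """Return the given platform, or the stable fallback when it is unknown."""
    return platform if platform else "datahub"


def make_entity_urn(entity_type: str, platform: Optional[str], entity_id: str) -> str:
    """Build a urn:li:<entityType>:(<platform>,<id>) string."""
    if entity_type not in ("metric", "semanticModel"):
        raise ValueError(f"unsupported entity type: {entity_type!r}")
    if not entity_id:
        raise ValueError("id is required")
    return f"urn:li:{entity_type}:({resolve_platform(platform)},{entity_id})"


def parse_entity_urn(urn: str) -> Tuple[str, str, str]:
    """Split a URN back into (entity_type, platform, id)."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a metric or semanticModel URN: {urn!r}")
    return match.group(1), match.group(2), match.group(3)


metric_urn = make_entity_urn("metric", "dbt", "total_revenue")
model_urn = make_entity_urn("semanticModel", "dbt", "sales_orders")
native_urn = make_entity_urn("metric", None, "total_revenue")
```

Because the platform is part of the key, `metric_urn` and `native_urn` are different entities even though they share an id, which matches the no-dedup decision mentioned in the design notes.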

Design notes

  • platform on the key, not dataPlatformInstance. Platform is a structural
    identity partition ("this is a Snowflake metric vs. a dbt metric — they are
    distinct entities per the MVP no-dedup decision"), whereas
    dataPlatformInstance is a decorative aspect for search faceting and policy
    scoping. Same distinction datasets already make.
  • Plain Urn fields, not typed URN typerefs. Cross-entity references
    (MetricInfo.semanticModel, MetricRelationships.parentMetric, etc.) use plain
    Urn with @Relationship.entityTypes doing the type gating. Matches the
    Domain / DataProduct pattern; can promote to typed URNs later without an
    aspect version bump.
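
The type gating described in the second note can be pictured with a small stand-alone check. This is only a sketch of the idea, assuming references are plain URN strings; `entity_type_of` and `check_reference` are hypothetical names, and the real enforcement is done by the `@Relationship.entityTypes` annotation, not by code like this.

```python
from typing import Sequence


def entity_type_of(urn: str) -> str:
    """Extract the entity type from a urn:li:<entityType>:<key> string."""
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "li" or not parts[2]:
        raise ValueError(f"not a DataHub-style URN: {urn!r}")
    return parts[2]


def check_reference(urn: str, allowed_types: Sequence[str]) -> str:
    """Return the URN unchanged if its entity type is allowed, else raise."""
    entity_type = entity_type_of(urn)
    if entity_type not in allowed_types:
        raise ValueError(f"{entity_type!r} is not one of {list(allowed_types)}")
    return urn


# MetricInfo.semanticModel would only accept semanticModel URNs.
ok = check_reference("urn:li:semanticModel:(dbt,sales_orders)", ["semanticModel"])
```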

Out of scope (follow-up PRs)

  • GraphQL types + resolvers, EntityType enum entries, EntityTypeMapper /
    UrnToEntityMapper wiring.
  • Frontend MetricEntity / SemanticModelEntity registered in
    buildEntityRegistryV2. Stub MetricsPage.tsx already tracks this.
  • Ingestion sources (dbt semantic layer, Snowflake semantic views, Databricks
    metric views, OSI upload).
  • SQL compiler / recoverability computation.
  • Policy / access-control specifics.

Checklist

  • PR conforms to the Contributing Guideline (PR Title Format)
  • No related public issue; internal RFC drives the design
  • Tests: N/A (schema-only; validated via codegen and registry build)
  • Docs added: metadata-models/docs/entities/metric.md and
    metadata-models/docs/entities/semanticModel.md
  • Not a breaking change (net-new entities; metricsEnabled flag defaults false)

@github-actions (bot) added the ingestion label (PR or Issue related to the ingestion of metadata) on Jul 2, 2026
@ani-malgari requested a review from gabe-lyons on July 2, 2026, 00:23
@codecov

codecov Bot commented Jul 2, 2026

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 23888 | 1 | 23887 | 142 |
View the top 2 failed test(s) by shortest run time
com.linkedin.metadata.graph.LineageGraphFiltersTest::testForEntityType
Stack Traces | 0.099s run time
java.lang.AssertionError: Sets differ: expected [mlModelGroup, dataProcess, dataJob, mlModel, mlPrimaryKey, dataProcessInstance, mlFeature, chart, dashboard, dataset] but got [mlModelGroup, metric, dataProcess, dataJob, mlModel, mlFeature, dataProcessInstance, dataset, chart, dashboard, mlPrimaryKey]
	at org.testng.Assert.fail(Assert.java:111)
	at org.testng.Assert.assertEquals(Assert.java:2037)
	at org.testng.Assert.assertEquals(Assert.java:1964)
	at com.linkedin.metadata.graph.LineageGraphFiltersTest.testForEntityType(LineageGraphFiltersTest.java:38)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:664)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:227)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:957)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:200)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:148)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at org.testng.TestRunner.privateRun(TestRunner.java:848)
	at org.testng.TestRunner.run(TestRunner.java:621)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:443)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:437)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:397)
	at org.testng.SuiteRunner.run(SuiteRunner.java:336)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1280)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1200)
	at org.testng.TestNG.runSuites(TestNG.java:1114)
	at org.testng.TestNG.run(TestNG.java:1082)
	at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.runTests(TestNGTestClassProcessor.java:153)
	at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:95)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:63)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:92)
	at jdk.proxy2/jdk.proxy2.$Proxy6.stop(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker$3.run(TestWorker.java:200)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:132)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:103)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:63)
	at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:122)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:72)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
siblings-v2/siblings.spec.ts::siblings › will combine results in search
Stack Traces | 16.3s run time
expect(locator).toBeVisible() failed
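
The `testForEntityType` assertion message lists the two sets in different orders, which makes the actual discrepancy hard to spot. A quick set comparison, with the values copied from that message, suggests the only difference is the new `metric` entity type. Whether the test's expected set should be updated is for the author to confirm.

```python
# Values copied verbatim from the AssertionError message above.
expected = {
    "mlModelGroup", "dataProcess", "dataJob", "mlModel", "mlPrimaryKey",
    "dataProcessInstance", "mlFeature", "chart", "dashboard", "dataset",
}
actual = {
    "mlModelGroup", "metric", "dataProcess", "dataJob", "mlModel", "mlFeature",
    "dataProcessInstance", "dataset", "chart", "dashboard", "mlPrimaryKey",
}

unexpected = actual - expected  # entity types the test did not expect
missing = expected - actual     # entity types the test expected but did not get

print(sorted(unexpected))  # ['metric']
print(sorted(missing))     # []
```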


Comment on lines +11 to +19
* The platform or semantic layer that owns this metric (e.g. "dbt",
* "snowflake"). Used to group metrics by platform in search facets.
*
* REQUIRED. This field is part of the metric URN and therefore
* immutable once written. Callers must always populate it; if the
* source platform is unknown at ingest time the ingestion layer is
* responsible for assigning a stable fallback value (e.g. "datahub"
* for native / SDK-authored metrics)
*/

Contributor Author

this is the part i need alignment for. @gabe-lyons @asikowitz @chriscollins3456 - do you see any issues with this? Having platform preserves stateless idempotent ingestion and easy sync.


Metrics are identified by two fields:

- **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`).

Contributor

imo if we are having people truly say dbt or snowflake we should at least support an optional platform URN. it feels wrong to put in platforms as unstructured text


- **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`).
Searchable as a keyword so the sidebar can group metrics by platform.
- **`id`** — the metric name within that namespace (e.g. `total_revenue`, `daily_active_users`).

Contributor

how are we handling nested metrics? are we letting metrics be nested? thinking about how we deal with glossary, where we support multi-layer nesting, it feels odd to not support that for metrics. is there a particular reason to not support multi layer nesting for metrics?

Contributor Author

we will be supporting nested metrics, it's handled in the MetricRelationships.pdl
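
For illustration only, multi-layer nesting through a single parent pointer can be sketched as below. It assumes each metric stores at most one `parentMetric` URN, as the PR description states; the `ancestor_path` helper and the `revenue` and `finance` metrics are hypothetical.

```python
from typing import Dict, List, Optional


def ancestor_path(urn: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return [root, ..., urn] by following parent pointers; raise on cycles."""
    path = [urn]
    seen = {urn}
    current = parent_of.get(urn)
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        path.append(current)
        current = parent_of.get(current)
    path.reverse()
    return path


# Hypothetical three-level tree: finance > revenue > total_revenue.
parent_of = {
    "urn:li:metric:(dbt,total_revenue)": "urn:li:metric:(dbt,revenue)",
    "urn:li:metric:(dbt,revenue)": "urn:li:metric:(dbt,finance)",
    "urn:li:metric:(dbt,finance)": None,
}
path = ancestor_path("urn:li:metric:(dbt,total_revenue)", parent_of)
```

A single pointer per metric is enough to express a tree of any depth, much like glossary nodes; the cycle guard matters because nothing in a plain URN field prevents a metric from being set as its own ancestor.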

| IsPartOf | outbound | `metric` | `metricRelationships` |
| DerivedFrom | outbound | `metric` | `metricRelationships` |
| RelatedTo | outbound | `metric` | `metricRelationships` |
| Upstream data | outbound | `dataset` | `upstreamLineage` |

Contributor

generally this would be schemaField right, not datasets?

Contributor Author

it will be datasets as well as SchemaField. Ex: upstreamLineage: { upstreams: [orders], fineGrained: [orders.amount] }
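
Spelled out with full URNs, the shorthand above might look like the following. This is a sketch that keeps the reply's abbreviated key names (`upstreams`, `fineGrained`) rather than the exact aspect schema, and it assumes a `snowflake` platform and `PROD` environment for the `orders` table purely for the example.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in the standard (platform, name, env) form."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def schema_field_urn(dataset: str, field_path: str) -> str:
    """Build a schemaField URN that points at one column of a dataset."""
    return f"urn:li:schemaField:({dataset},{field_path})"


orders = dataset_urn("snowflake", "orders")  # platform and env are assumptions

# Key names follow the shorthand in the reply, not the real aspect schema.
lineage = {
    "upstreams": [orders],                               # table-level lineage
    "fineGrained": [schema_field_urn(orders, "amount")],  # column-level lineage
}
```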

- **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`).
- **`id`** — the model name within that namespace (e.g. `orders_model`, `customer_360`).

An example URN: `urn:li:semanticModel:(dbt,orders_model)`.

Contributor

i think we either should remove data platforms like dbt or snowflake from namespace examples or use platform urns with data platform instances. this middle ground is worst of both worlds.

/**
* Unique identifier for the metric within its platform.
*/
id: string

Contributor

see my comment above

@Searchable = {
"fieldType": "KEYWORD"
}
platform: string

Contributor

this should align with datasets because semantic models should always be linked to a platform

@Searchable = {
"fieldType": "KEYWORD"
}
platform: string

Contributor

we can do something like platform: Urn and have one option be logical, similar to what we did with logical datasets, but then we'd need another namespace string

* not expressible in the OSI core schema (Proposal B §B.5).
* The content field must be a valid JSON string.
*/
record CustomExtension {

Contributor

why not just use structured properties? this seems like an overly complex 1-off

@ani-malgari Jul 2, 2026
Contributor Author

we're using CustomExtension for 1:1 parity with the OSI model, and for ease of maintenance. Also, CustomExtensions are embedded inside multiple nested fields within the SemanticModel (Datasets, relationships, Fields, metrics). We want to support bidirectional sync where DataHub is the source of truth, and maintaining the OSI structure will make that easy to handle. Also for ingestion, most of the OSI-supported platforms have connectors which will help get to this structure.
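
To make the shape under discussion concrete, here is a minimal sketch of a vendor-namespaced extension whose `content` must be a valid JSON string, as the record's doc comment requires. The `vendor` field name, the `parsed` helper, and the example payload are assumptions for illustration and are not taken from the PDL in this PR.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class CustomExtension:
    vendor: str   # hypothetical field name for the vendor namespace
    content: str  # must be a valid JSON string

    def __post_init__(self) -> None:
        if not self.vendor:
            raise ValueError("vendor namespace is required")
        try:
            json.loads(self.content)
        except json.JSONDecodeError as exc:
            raise ValueError(f"content is not valid JSON: {exc}") from exc

    def parsed(self) -> Any:
        """Decode the JSON payload on demand."""
        return json.loads(self.content)


# Hypothetical payload; the keys inside content are opaque to the catalog.
ext = CustomExtension(vendor="snowflake", content=json.dumps({"synonyms": ["rev"]}))
```

Keeping the payload as an opaque string is what allows it to round-trip unchanged to and from the source platform, which is the bidirectional-sync argument made above; the trade-off raised by the reviewer is that such blobs are not searchable or typed the way structured properties are.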

Comment on lines +6 to +13
enum Dialect {
SNOWFLAKE
DATABRICKS
DBT
ANSI_SQL
DATAHUB
UNKNOWN
}

Contributor

why do we need both platform and dialect, this seems redundant

Contributor

imagine now we'll need to link dialects to icons, dialects to human readable names, etc, this is the same set as data platforms.

@ani-malgari Jul 2, 2026
Contributor Author

dialect again is part of OSI schema and relates to the metric expression. A metric's expression can be written in any of the above formats, and this enum only relates to the expression (no linking to icons or platform metadata).
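
One way to picture the distinction being drawn: the platform identifies where a metric lives, while the dialect tags each individual expression. The sketch below mirrors the enum values from the diff; the `pick_expression` helper, its fallback to `ANSI_SQL`, and the sample expressions are assumptions for illustration rather than behaviour defined in this PR.

```python
from enum import Enum
from typing import Dict, Optional


class Dialect(Enum):
    SNOWFLAKE = "SNOWFLAKE"
    DATABRICKS = "DATABRICKS"
    DBT = "DBT"
    ANSI_SQL = "ANSI_SQL"
    DATAHUB = "DATAHUB"
    UNKNOWN = "UNKNOWN"


def pick_expression(expressions: Dict[Dialect, str], preferred: Dialect) -> Optional[str]:
    """Return the expression for the preferred dialect, else ANSI_SQL, else None."""
    for dialect in (preferred, Dialect.ANSI_SQL):
        if dialect in expressions:
            return expressions[dialect]
    return None


# A single metric carrying the same logic in two dialects (sample SQL only).
total_revenue = {
    Dialect.ANSI_SQL: "SUM(amount)",
    Dialect.SNOWFLAKE: "SUM(amount::NUMBER(38, 2))",
}
chosen = pick_expression(total_revenue, Dialect.SNOWFLAKE)
fallback = pick_expression(total_revenue, Dialect.DATABRICKS)
```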

/**
* The dialect in which this expression is written.
*/
dialect: Dialect

Contributor

see comment above; this can just be platform

@Aspect = {
"name": "metricRelationships"
}
record MetricRelationships {

Contributor

i'm concerned about us baking in the set of possible relationships. @asikowitz is doing some work to enable custom relationship types, this work is probably just a subset of that.

Contributor Author

sure, i'm happy to hear more from @asikowitz. The purpose of this aspect is to support nested metrics (tree), derived metrics (lineage) and support edge metadata between metrics.

/**
* A named field (measure or dimension) defined inside a SemanticModel dataset.
*/
record Field {

Contributor

i notice this doesn't seem to have a type? like numeric, string, time, etc?

Contributor

will people want to add glossary terms, tags, etc to this?

Contributor

I'm concerned we're rebuilding schemaField object here and will need to eventually keep adding all the stuff we have in that model.

Contributor Author

Good callout. The Field here is the OSI field which is a dimension or a fact: a scalar, row-level attribute defined by its expression over one or more physical columns.
Ex. a derived dimension DATE_TRUNC('month', orders.order_date) or first_name || ' ' || last_name.

we don't duplicate SchemaField: a field carries only the semantic enrichment (expression, dimension/is_time, label, ai_context) and not the physical attributes. Those stay on SchemaField.
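
A small sketch of what "semantic enrichment only" could look like, using the two example expressions from this reply. The `SemanticField` class and its attribute names are hypothetical stand-ins for the attributes listed above (expression, is_time, label, ai_context) and do not reproduce the actual `Field` record in the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SemanticField:
    """Semantic-layer attributes only; physical column details stay on SchemaField."""
    name: str
    expression: str                   # expression over one or more physical columns
    is_time: bool = False             # marks a time dimension
    label: Optional[str] = None       # human-friendly display name
    ai_context: Optional[str] = None  # free-text hints for AI consumers


order_month = SemanticField(
    name="order_month",
    expression="DATE_TRUNC('month', orders.order_date)",
    is_time=True,
)
full_name = SemanticField(
    name="full_name",
    expression="first_name || ' ' || last_name",
    label="Full name",
)
```

Note that neither example carries a native data type or nullability, which is the gap the reviewer asked about; under this reading those would be looked up on the underlying schema fields.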

* A logical dataset referenced by a SemanticModel, together with its schema
* and primary/unique key definitions.
*/
record ModelDataset {

Contributor

whoa why is this not just a datasetUrn?

Contributor Author

ModelDataset contains semantic enrichment (alias name, dimensions etc), and its source of truth is the semantic model.


/**
* Join relationships between the datasets of this semantic model.
*/

Contributor

i think we already model join relationships

/**
* Vendor-namespaced extension blobs for platform-specific metadata.
*/
customExtensions: optional array[CustomExtension]

Contributor

again i think this should just be structured properties

Contributor

alternatively, this could also be the generic CustomProperties aspect that most entities "inherit" in the Info/Properties aspect

* A join relationship between two logical datasets defined within a SemanticModel.
* Named SemanticModelRelationship to avoid colliding with common relationship models.
*/
record SemanticModelRelationship {

Contributor

please confirm no other such model exists

@datahub-connector-tests

Connector Tests Results

All connector tests passed for commit 616b651

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

Contributor

for both MetricKey and SemanticModelKey, should env: FabricType also be included in the key, the same as we do for datasets?

Contributor

what about typical audit info such as createdAt or lastUpdatedAt?

Contributor

what about typical audit info such as createdAt or lastUpdatedAt?

@sgomezvillamor

Contributor

How will these new entities be linked to the existing entities?

E.g., should Upstream be updated to include SemanticModel entities too?

/**
* The upstream dataset the lineage points to
*/
@Relationship = {
"name": "DownstreamOf",
"entityTypes": [ "dataset" ],
"isLineage": true,
"createdOn": "upstreams/*/created/time"
"createdActor": "upstreams/*/created/actor"
"updatedOn": "upstreams/*/auditStamp/time"
"updatedActor": "upstreams/*/auditStamp/actor"
"properties": "upstreams/*/properties"
"via": "upstreams/*/query"
}
@Searchable = {
"fieldName": "upstreams",
"hasValuesFieldName": "hasUpstreams",
"fieldType": "URN",
"queryByDefault": false
}
dataset: DatasetUrn
/**

@maggiehays added the pending-submitter-response label (Issue/request has been reviewed but requires a response from the submitter) on Jul 2, 2026
Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter

4 participants