feat(metrics): add new metric and semantic model entities by ani-malgari · Pull Request #18134 · datahub-project/datahub

ani-malgari · 2026-07-02T00:21:02Z

Summary

Introduces metric and semanticModel as first-class DataHub entities. This is the
schema-layer foundation for the upcoming Metrics Catalog feature (already scaffolded
in the frontend behind the metricsEnabled feature flag).

The model follows the "lean OSI-core" shape from the internal Metrics RFC: OSI-shaped
core fields plus a vendor-namespaced customExtensions bag for anything
platform-specific, rather than promoting every platform quirk to a first-class field.
Governance and lineage reuse existing native aspects (ownership, domains,
upstreamLineage, etc.).

Entities

metric — key (platform, id) → urn:li:metric:(dbt,total_revenue).

Aspects:

metricInfo — name, description, semantic-model ref, multi-dialect SQL expression,
AI context, recoverability, vendor extensions.
metricRelationships — parentMetric (glossary-style tree pointer, powers the
sidebar), derivedFrom (metric→metric lineage via Edge with isLineage: true),
relatedMetrics (curated "see also" edges, non-lineage).
Native aspects: upstreamLineage, ownership, domains, globalTags,
glossaryTerms, institutionalMemory, structuredProperties, status,
deprecation, lifecycleStage, dataPlatformInstance, subTypes, forms,
testResults, documentation, browsePaths, browsePathsV2, applications,
container, incidentsSummary, displayProperties, assetSettings,
versionProperties, access.

semanticModel — key (platform, id) → urn:li:semanticModel:(dbt,sales_orders).

Aspects:

semanticModelInfo — name, description, sourcePlatform (URN, mirrors
datasetKey.platform), AI context, datasets[] (with fields and dimensions),
join relationships[], vendor extensions.
Same governance aspects as above (minus lineage/incident-oriented ones).

Both keys treat platform as required and immutable (part of the URN). If the source
platform is unknown at ingest time, the ingestion layer is expected to assign a stable
fallback (e.g. datahub for native / SDK-authored metrics).
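As a rough illustration of the key shape described above, the sketch below builds and parses URNs of the form shown in this description. The helper names (`make_urn`, `parse_urn`, `resolve_platform`) are hypothetical and are not part of DataHub; the `datahub` fallback mirrors the example given above, and the real URN encoding may differ once the key design is settled.

```python
import re
from typing import NamedTuple, Optional


class EntityKey(NamedTuple):
    entity_type: str
    platform: str
    id: str


# Matches only the two URN shapes shown in the PR description.
_URN_RE = re.compile(r"^urn:li:(metric|semanticModel):\(([^,()]+),([^()]+)\)$")


def resolve_platform(source_platform: Optional[str]) -> str:
    """Assumed fallback rule: unknown platforms become 'datahub'."""
    return source_platform or "datahub"


def make_urn(entity_type: str, platform: str, id_: str) -> str:
    """Build a URN from the (platform, id) key; platform is required."""
    if not platform:
        raise ValueError("platform is required and is part of the URN")
    return f"urn:li:{entity_type}:({platform},{id_})"


def parse_urn(urn: str) -> EntityKey:
    """Split a metric or semanticModel URN back into its key parts."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a metric/semanticModel URN: {urn}")
    return EntityKey(*match.groups())


metric_urn = make_urn("metric", resolve_platform("dbt"), "total_revenue")
native_urn = make_urn("metric", resolve_platform(None), "total_revenue")
```

Because the platform is embedded in the URN, the same metric name under `dbt` and under the fallback platform yields two distinct entities, which matches the no-dedup decision mentioned in the design notes.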

Design notes

platform on the key, not dataPlatformInstance. Platform is a structural
identity partition ("this is a Snowflake metric vs. a dbt metric — they are
distinct entities per the MVP no-dedup decision"), whereas
dataPlatformInstance is a decorative aspect for search faceting and policy
scoping. Same distinction datasets already make.
Plain Urn fields, not typed URN typerefs. Cross-entity references
(MetricInfo.semanticModel, MetricRelationships.parentMetric, etc.) use plain
Urn with @Relationship.entityTypes doing the type gating. Matches the
Domain / DataProduct pattern; can promote to typed URNs later without an
aspect version bump.
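To make the "plain Urn plus entityTypes gating" idea concrete, here is a minimal sketch of what such a check amounts to. It is not DataHub's validation code: the `ALLOWED_ENTITY_TYPES` table and both function names are hypothetical, and the allowed types per field are assumed from the field names rather than taken from the PDL.

```python
from typing import Dict, Set

# Hypothetical stand-in for what @Relationship.entityTypes declares per field.
ALLOWED_ENTITY_TYPES: Dict[str, Set[str]] = {
    "MetricInfo.semanticModel": {"semanticModel"},
    "MetricRelationships.parentMetric": {"metric"},
}


def entity_type_of(urn: str) -> str:
    """Return the entity type segment of a 'urn:li:<type>:...' string."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"malformed URN: {urn}")
    return parts[2]


def check_reference(field: str, urn: str) -> bool:
    """True if the URN's entity type is allowed for the given field."""
    return entity_type_of(urn) in ALLOWED_ENTITY_TYPES[field]


ok = check_reference("MetricInfo.semanticModel", "urn:li:semanticModel:(dbt,sales_orders)")
bad = check_reference(
    "MetricRelationships.parentMetric",
    "urn:li:dataset:(urn:li:dataPlatform:dbt,orders,PROD)",
)
```

The field itself stays an untyped string; only the annotation-driven check knows which entity types are acceptable, which is why a later move to typed URNs would not change the stored value.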

Out of scope (follow-up PRs)

GraphQL types + resolvers, EntityType enum entries, EntityTypeMapper /
UrnToEntityMapper wiring.
Frontend MetricEntity / SemanticModelEntity registered in
buildEntityRegistryV2. Stub MetricsPage.tsx already tracks this.
Ingestion sources (dbt semantic layer, Snowflake semantic views, Databricks
metric views, OSI upload).
SQL compiler / recoverability computation.
Policy / access-control specifics.

Checklist

PR conforms to the Contributing Guideline (PR Title Format)
No related public issue; internal RFC drives the design
Tests: N/A (schema-only; validated via codegen and registry build)
Docs added: metadata-models/docs/entities/metric.md and
metadata-models/docs/entities/semanticModel.md
Not a breaking change (net-new entities; metricsEnabled flag defaults false)

codecov · 2026-07-02T00:23:52Z

❌ 1 Tests Failed:

Tests completed	Failed	Passed	Skipped
23888	1	23887	142

View the top 2 failed test(s) by shortest run time

com.linkedin.metadata.graph.LineageGraphFiltersTest::testForEntityType

Stack Traces | 0.099s run time

java.lang.AssertionError: Sets differ: expected [mlModelGroup, dataProcess, dataJob, mlModel, mlPrimaryKey, dataProcessInstance, mlFeature, chart, dashboard, dataset] but got [mlModelGroup, metric, dataProcess, dataJob, mlModel, mlFeature, dataProcessInstance, dataset, chart, dashboard, mlPrimaryKey]
	at org.testng.Assert.fail(Assert.java:111)
	at org.testng.Assert.assertEquals(Assert.java:2037)
	at org.testng.Assert.assertEquals(Assert.java:1964)
	at com.linkedin.metadata.graph.LineageGraphFiltersTest.testForEntityType(LineageGraphFiltersTest.java:38)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:664)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:227)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:957)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:200)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:148)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at org.testng.TestRunner.privateRun(TestRunner.java:848)
	at org.testng.TestRunner.run(TestRunner.java:621)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:443)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:437)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:397)
	at org.testng.SuiteRunner.run(SuiteRunner.java:336)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1280)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1200)
	at org.testng.TestNG.runSuites(TestNG.java:1114)
	at org.testng.TestNG.run(TestNG.java:1082)
	at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.runTests(TestNGTestClassProcessor.java:153)
	at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:95)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:63)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:92)
	at jdk.proxy2/jdk.proxy2.$Proxy6.stop(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker$3.run(TestWorker.java:200)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:132)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:103)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:63)
	at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:122)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:72)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
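The two sets in the assertion message are printed in different orders, which makes the difference hard to spot. A quick sketch that compares them, with the values copied from the message above:

```python
# Values copied from the AssertionError message of testForEntityType.
expected = {
    "mlModelGroup", "dataProcess", "dataJob", "mlModel", "mlPrimaryKey",
    "dataProcessInstance", "mlFeature", "chart", "dashboard", "dataset",
}
actual = {
    "mlModelGroup", "metric", "dataProcess", "dataJob", "mlModel", "mlFeature",
    "dataProcessInstance", "dataset", "chart", "dashboard", "mlPrimaryKey",
}

unexpected = actual - expected  # entity types the test did not anticipate
missing = expected - actual     # entity types the test expected but did not see

print("unexpected:", sorted(unexpected))
print("missing:", sorted(missing))
```

The only difference is the new `metric` entity type. That would be consistent with the lineage edges this PR declares for metrics, although the thread does not confirm the cause or whether the expected set in the test should be updated.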

siblings-v2/siblings.spec.ts::siblings › will combine results in search

Stack Traces | 16.3s run time

expect(locator).toBeVisible() failed


ani-malgari · 2026-07-02T00:28:46Z

+   * The platform or semantic layer that owns this metric (e.g. "dbt",
+   * "snowflake"). Used to group metrics by platform in search facets.
+   *
+   * REQUIRED. This field is part of the metric URN and therefore
+   * immutable once written. Callers must always populate it; if the
+   * source platform is unknown at ingest time the ingestion layer is
+   * responsible for assigning a stable fallback value (e.g. "datahub"
+   * for native / SDK-authored metrics)
+   */


this is the part i need alignment for. @gabe-lyons @asikowitz @chriscollins3456 - do you see any issues with this? Having platform preserves stateless idempotent ingestion and easy sync.

gabe-lyons · 2026-07-02T00:29:20Z

+
+Metrics are identified by two fields:
+
+- **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`).


imo if we are having people truly say dbt or snowflake we should at least support an optional platformurn. it feels wrong to put in platforms as unstructured text

gabe-lyons · 2026-07-02T00:31:23Z

+
+- **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`).
+  Searchable as a keyword so the sidebar can group metrics by platform.
+- **`id`** — the metric name within that namespace (e.g. `total_revenue`, `daily_active_users`).


how are we handling nested metrics? are we letting metrics be nested? thinking about how we deal with glossary, where we support multi-layer nesting, it feels odd to not support that for metrics. is there a particular reason to not support multi layer nesting for metrics?

we will be supporting nested metrics, it's handled in the MetricRelationships.pdl
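For readers following the nesting question, the sketch below shows how a parent pointer such as parentMetric is enough to support multi-layer nesting. It is an illustration only: `ancestor_path` and the `parent_of` mapping are hypothetical, and the example URNs are made up.

```python
from typing import Dict, List


def ancestor_path(urn: str, parent_of: Dict[str, str]) -> List[str]:
    """Return the chain from the root metric down to `urn`.

    `parent_of` maps a metric URN to its parent URN; root metrics are absent.
    Raises ValueError if the parent pointers form a cycle.
    """
    path = [urn]
    seen = {urn}
    current = parent_of.get(urn)
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent pointers at {current}")
        path.append(current)
        seen.add(current)
        current = parent_of.get(current)
    path.reverse()
    return path


# Made-up three-level tree: finance > revenue > total_revenue.
parents = {
    "urn:li:metric:(dbt,total_revenue)": "urn:li:metric:(dbt,revenue)",
    "urn:li:metric:(dbt,revenue)": "urn:li:metric:(dbt,finance)",
}
path = ancestor_path("urn:li:metric:(dbt,total_revenue)", parents)
```

A single parent pointer per metric gives a tree of arbitrary depth, similar in spirit to glossary nesting; the cycle check is included because a plain pointer field does not prevent cycles on its own.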

gabe-lyons · 2026-07-02T00:32:06Z

+| IsPartOf      | outbound  | `metric`        | `metricRelationships` |
+| DerivedFrom   | outbound  | `metric`        | `metricRelationships` |
+| RelatedTo     | outbound  | `metric`        | `metricRelationships` |
+| Upstream data | outbound  | `dataset`       | `upstreamLineage`     |


generally this would be schemaField right, not datasets?

it will be datasets as well as SchemaField. Ex: upstreamLineage: { upstreams: [orders], fineGrained: [orders.amount] }
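Spelled out a little further, the example in the reply could look like the sketch below. It is a simplified dictionary, not the full aspect: the `snowflake` platform and `PROD` environment are assumptions, and the key names follow my reading of the existing upstreamLineage aspect, so they should be checked against the PDL.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Dataset URN in the usual (platform, name, env) form."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def schema_field_urn(parent_dataset_urn: str, field_path: str) -> str:
    """SchemaField URN: the parent dataset URN plus a field path."""
    return f"urn:li:schemaField:({parent_dataset_urn},{field_path})"


orders = dataset_urn("snowflake", "orders")  # platform is an assumption

# Simplified view of a metric's upstreamLineage: coarse edges point at
# datasets, fine-grained edges point at individual columns.
upstream_lineage = {
    "upstreams": [{"dataset": orders}],
    "fineGrainedLineages": [
        {"upstreams": [schema_field_urn(orders, "amount")]},
    ],
}
```

In this reading, the table row for "Upstream data" targets `dataset` at the coarse level, while column-level detail for the same metric targets `schemaField`.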

gabe-lyons · 2026-07-02T00:42:39Z

+- **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`).
+- **`id`** — the model name within that namespace (e.g. `orders_model`, `customer_360`).
+
+An example URN: `urn:li:semanticModel:(dbt,orders_model)`.


i think we should either remove data platforms like dbt or snowflake from the namespace examples or use platform urns with data platform instances. this middle ground is the worst of both worlds.

gabe-lyons · 2026-07-02T00:42:55Z

+  /**
+  * Unique identifier for the metric within its platform.
+  */
+  id: string


see my comment above

gabe-lyons · 2026-07-02T00:43:47Z

+  @Searchable = {
+    "fieldType": "KEYWORD"
+  }
+  platform: string


this should align with datasets because semantic models should always be linked to a platform

gabe-lyons · 2026-07-02T00:44:31Z

+  @Searchable = {
+    "fieldType": "KEYWORD"
+  }
+  platform: string


we can do something like platform: Urn and have one option be logical, similar to what we did with logical datasets, but then we'd need another namespace string

gabe-lyons · 2026-07-02T00:45:12Z

+ * not expressible in the OSI core schema (Proposal B §B.5).
+ * The content field must be a valid JSON string.
+ */
+record CustomExtension {


why not just use structured properties? this seems like an overly complex 1-off

We're using CustomExtension for 1:1 parity with the OSI model and for ease of maintenance. CustomExtensions are also embedded in multiple nested fields within the SemanticModel (datasets, relationships, fields, metrics). We want to support bidirectional sync where DataHub is the source of truth, and keeping the OSI structure makes that easier to handle. For ingestion, most OSI-supported platforms have connectors that will help produce this structure.
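As an illustration of the constraint in the doc comment under review (the content field must be a valid JSON string), here is a minimal sketch of a vendor-namespaced extension record. The `vendor` field name and the `parsed` helper are hypothetical; only the JSON requirement on `content` comes from the snippet above.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class CustomExtension:
    vendor: str   # hypothetical name for the vendor namespace
    content: str  # opaque payload; must be a valid JSON string

    def __post_init__(self) -> None:
        # Reject payloads that are not valid JSON at construction time.
        try:
            json.loads(self.content)
        except json.JSONDecodeError as exc:
            raise ValueError(f"content is not valid JSON: {exc}") from exc

    def parsed(self) -> Any:
        """Decode the JSON payload for vendor-specific consumers."""
        return json.loads(self.content)


ext = CustomExtension(vendor="snowflake", content='{"warehouse": "WH_XS"}')
payload = ext.parsed()
```

Keeping the payload as an opaque JSON string is what allows the same record to be embedded at several levels of the model without the schema knowing vendor-specific keys, at the cost of no field-level search or validation.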

gabe-lyons · 2026-07-02T00:46:01Z

+enum Dialect {
+  SNOWFLAKE
+  DATABRICKS
+  DBT
+  ANSI_SQL
+  DATAHUB
+  UNKNOWN
+}


why do we need both platform and dialect, this seems redundant

imagine now we'll need to link dialects to icons, dialects to human readable names, etc, this is the same set as data platforms.

dialect again is part of OSI schema and relates to the metric expression. A metric's expression can be written in any of the above formats, and this enum only relates to the expression (no linking to icons or platform metadata).

gabe-lyons · 2026-07-02T00:46:38Z

+  /**
+   * The dialect in which this expression is written.
+   */
+  dialect: Dialect


see comment above this can just be platform

gabe-lyons · 2026-07-02T00:48:26Z

+@Aspect = {
+  "name": "metricRelationships"
+}
+record MetricRelationships {


im concerned about us baking in the set of possible relationships. @asikowitz is doing some work to enable custom relationship types, this work is probably just a subset of that.

sure, i'm happy to hear more from @asikowitz. The purpose of this aspect is to support nested metrics (tree), derived metrics (lineage) and support edge metadata between metrics.

gabe-lyons · 2026-07-02T00:49:23Z

+/**
+ * A named field (measure or dimension) defined inside a SemanticModel dataset.
+ */
+record Field {


i notice this doesn't seem to have a type? like numeric, string, time, etc?

will people want to add glossary terms, tags, etc to this?

I'm concerned we're rebuilding schemaField object here and will need to eventually keep adding all the stuff we have in that model.

Good callout. The Field here is the OSI field which is a dimension or a fact: a scalar, row-level attribute defined by its expression over one or more physical columns.
Ex. a derived dimension DATE_TRUNC('month', orders.order_date) or first_name || ' ' || last_name.

we don't duplicate SchemaField: a field carries only the semantic enrichment (expression, dimension/is_time, label, ai_context) and not the physical attributes. Those stay on SchemaField.

gabe-lyons · 2026-07-02T00:52:48Z

+ * A logical dataset referenced by a SemanticModel, together with its schema
+ * and primary/unique key definitions.
+ */
+record ModelDataset {


whoa why is this not just a datasetUrn?

ModelDataset contains semantic enrichment (alias name, dimensions, etc.), and its source of truth is the semantic model.

gabe-lyons · 2026-07-02T00:53:09Z

+
+  /**
+   * Join relationships between the datasets of this semantic model.
+   */


i think we already model join relationships

gabe-lyons · 2026-07-02T00:53:22Z

+  /**
+   * Vendor-namespaced extension blobs for platform-specific metadata.
+   */
+  customExtensions: optional array[CustomExtension]


again i think this should just be structured properties

alternatively, this could also be the generic CustomProperties aspect that most entities "inherit" in the Info/Properties aspect

gabe-lyons · 2026-07-02T00:53:41Z

+ * A join relationship between two logical datasets defined within a SemanticModel.
+ * Named SemanticModelRelationship to avoid colliding with common relationship models.
+ */
+record SemanticModelRelationship {


please confirm no other such model exists

datahub-connector-tests · 2026-07-02T01:00:43Z

Connector Tests Results

All connector tests passed for commit 616b651

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

sgomezvillamor · 2026-07-02T07:33:42Z

For both MetricKey and SemanticModelKey, should env: FabricType also be included in the key, the same as we do for datasets?

sgomezvillamor · 2026-07-02T07:37:46Z

what about typical audit info such as createdAt or lastUpdatedAt?

sgomezvillamor · 2026-07-02T07:38:02Z

what about typical audit info such as createdAt or lastUpdatedAt?

sgomezvillamor · 2026-07-02T07:44:13Z

How will these new entities be linked to the existing entities?

E.g., should Upstream be updated to include SemanticModel entities too?

datahub/metadata-models/src/main/pegasus/com/linkedin/dataset/Upstream.pdl

Lines 26 to 48 in 15462be

    
  /**
   * The upstream dataset the lineage points to
   */
  @Relationship = {
    "name": "DownstreamOf",
    "entityTypes": [ "dataset" ],
    "isLineage": true,
    "createdOn": "upstreams/*/created/time"
    "createdActor": "upstreams/*/created/actor"
    "updatedOn": "upstreams/*/auditStamp/time"
    "updatedActor": "upstreams/*/auditStamp/actor"
    "properties": "upstreams/*/properties"
    "via": "upstreams/*/query"
  }
  @Searchable = {
    "fieldName": "upstreams",
    "hasValuesFieldName": "hasUpstreams",
    "fieldType": "URN",
    "queryByDefault": false
  }
  dataset: DatasetUrn

  /**

feat(metrics): create new metric and semantic model entities

616b651

github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 2, 2026

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 00:22 View deployment

ani-malgari requested a review from gabe-lyons July 2, 2026 00:23

ani-malgari requested review from alexsku, asikowitz, chriscollins3456 and jjoyce0510 July 2, 2026 00:24

vercel Bot deployed to Preview July 2, 2026 00:31 View deployment

maggiehays added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Jul 2, 2026

