feat(metrics): add new metric and semantic model entities#18134
ani-malgari wants to merge 1 commit into
❌ 1 Tests Failed:
| * The platform or semantic layer that owns this metric (e.g. "dbt", | ||
| * "snowflake"). Used to group metrics by platform in search facets. | ||
| * | ||
| * REQUIRED. This field is part of the metric URN and therefore | ||
| * immutable once written. Callers must always populate it; if the | ||
| * source platform is unknown at ingest time the ingestion layer is | ||
| * responsible for assigning a stable fallback value (e.g. "datahub" | ||
| * for native / SDK-authored metrics) | ||
| */ |
this is the part i need alignment for. @gabe-lyons @asikowitz @chriscollins3456 - do you see any issues with this? Having platform preserves stateless idempotent ingestion and easy sync.
| Metrics are identified by two fields: | ||
| - **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`). |
imo if we are having people truly say dbt or snowflake we should at least support an optional platformurn. it feels wrong to put in platforms as unstructured text
| - **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`). | ||
| Searchable as a keyword so the sidebar can group metrics by platform. | ||
| - **`id`** — the metric name within that namespace (e.g. `total_revenue`, `daily_active_users`). |
how are we handling nested metrics? are we letting metrics be nested? thinking about how we deal with glossary, where we support multi-layer nesting, it feels odd to not support that for metrics. is there a particular reason to not support multi layer nesting for metrics?
we will be supporting nested metrics, it's handled in the MetricRelationships.pdl
| | IsPartOf | outbound | `metric` | `metricRelationships` | | ||
| | DerivedFrom | outbound | `metric` | `metricRelationships` | | ||
| | RelatedTo | outbound | `metric` | `metricRelationships` | | ||
| | Upstream data | outbound | `dataset` | `upstreamLineage` | |
generally this would be schemaField right, not datasets?
it will be datasets as well as SchemaField. Ex: upstreamLineage: { upstreams: [orders], fineGrained: [orders.amount] }
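The reply above can be sketched as plain Python, assuming the standard DataHub URN shapes for datasets and schema fields. The `snowflake` platform, the `PROD` environment, and the dictionary keys (which mirror the shorthand in the comment, not the actual `upstreamLineage` aspect schema) are illustrative assumptions.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN; platform and env values here are assumptions."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def schema_field_urn(dataset: str, field_path: str) -> str:
    """Build a schemaField URN that points at one column of a dataset."""
    return f"urn:li:schemaField:({dataset},{field_path})"


# A metric can point at whole datasets and at individual columns.
orders = dataset_urn("snowflake", "orders")
lineage = {
    "upstreams": [orders],                               # dataset-level edge
    "fineGrained": [schema_field_urn(orders, "amount")],  # column-level edge
}
```

Here the dataset-level entry answers "which tables feed this metric", while the fine-grained entry narrows that to the specific column.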
| - **`namespace`** — typically the platform or project name (e.g. `dbt`, `snowflake`, `my_project`). | ||
| - **`id`** — the model name within that namespace (e.g. `orders_model`, `customer_360`). | ||
| An example URN: `urn:li:semanticModel:(dbt,orders_model)`. |
i think we either should remove data platforms like dbt or snowflake from namespace examples or use platform urns with data platform instances. this middle ground is worst of both worlds.
| /** | ||
| * Unique identifier for the metric within its platform. | ||
| */ | ||
| id: string |
| @Searchable = { | ||
| "fieldType": "KEYWORD" | ||
| } | ||
| platform: string |
this should align with datasets because semantic models should always be linked to a platform
| @Searchable = { | ||
| "fieldType": "KEYWORD" | ||
| } | ||
| platform: string |
we can do something like platform: Urn and have one option be logical, similar to what we did with logical datasets, but then we'd need another namespace string
| * not expressible in the OSI core schema (Proposal B §B.5). | ||
| * The content field must be a valid JSON string. | ||
| */ | ||
| record CustomExtension { |
why not just use structured properties? this seems like an overly complex 1-off
we're using CustomExtension for 1:1 parity with the OSI model and for ease of maintenance. CustomExtensions are also embedded inside multiple nested fields within the SemanticModel (datasets, relationships, fields, metrics). We want to support bidirectional sync where DataHub is the source of truth, and keeping the OSI structure makes that easy to handle. For ingestion, most of the OSI-supported platforms also have connectors that will help produce this structure.
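As a rough illustration of the constraint in the doc comment ("the content field must be a valid JSON string"), a validation sketch might look like the following. Only `content` comes from the diff; the `vendor` field name and the `parsed` helper are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class CustomExtension:
    vendor: str   # hypothetical name for the vendor namespace
    content: str  # must hold a valid JSON document, per the PDL comment

    def __post_init__(self) -> None:
        # Reject blobs that are not parseable JSON as early as possible.
        try:
            json.loads(self.content)
        except (TypeError, json.JSONDecodeError) as exc:
            raise ValueError(f"content is not valid JSON: {exc}") from exc

    def parsed(self) -> Any:
        """Return the decoded JSON payload."""
        return json.loads(self.content)
```

Validating at construction time keeps malformed blobs out of every nested location where an extension can appear.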
| enum Dialect { | ||
| SNOWFLAKE | ||
| DATABRICKS | ||
| DBT | ||
| ANSI_SQL | ||
| DATAHUB | ||
| UNKNOWN | ||
| } |
why do we need both platform and dialect, this seems redundant
imagine now we'll need to link dialects to icons, dialects to human readable names, etc, this is the same set as data platforms.
dialect again is part of OSI schema and relates to the metric expression. A metric's expression can be written in any of the above formats, and this enum only relates to the expression (no linking to icons or platform metadata).
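To make the distinction concrete, here is a minimal sketch of a metric that carries one expression per dialect. The enum members are copied from the diff; the `pick_expression` helper and its fallback to `ANSI_SQL` are assumptions for illustration, not behavior defined by this PR.

```python
from enum import Enum
from typing import Dict


class Dialect(Enum):
    SNOWFLAKE = "SNOWFLAKE"
    DATABRICKS = "DATABRICKS"
    DBT = "DBT"
    ANSI_SQL = "ANSI_SQL"
    DATAHUB = "DATAHUB"
    UNKNOWN = "UNKNOWN"


def pick_expression(expressions: Dict[Dialect, str], wanted: Dialect) -> str:
    """Return the expression for `wanted`, else the ANSI_SQL one (assumed fallback)."""
    if wanted in expressions:
        return expressions[wanted]
    if Dialect.ANSI_SQL in expressions:
        return expressions[Dialect.ANSI_SQL]
    raise KeyError(f"no expression available for {wanted.name}")


# One metric, several ways of writing the same calculation.
total_revenue = {
    Dialect.ANSI_SQL: "SUM(amount)",
    Dialect.SNOWFLAKE: "SUM(amount::NUMBER(38, 2))",
}
```

The dialect only selects which expression text to use; it says nothing about which platform owns the metric.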
| /** | ||
| * The dialect in which this expression is written. | ||
| */ | ||
| dialect: Dialect |
see comment above this can just be platform
| @Aspect = { | ||
| "name": "metricRelationships" | ||
| } | ||
| record MetricRelationships { |
im concerned about us baking in the set of possible relationships. @asikowitz is doing some work to enable custom relationship types, this work is probably just a subset of that.
sure, i'm happy to hear more from @asikowitz. The purpose of this aspect is to support nested metrics (tree), derived metrics (lineage) and support edge metadata between metrics.
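The nested-metric (tree) case can be sketched as a walk over parent pointers. This assumes each metric has at most one parent, as the `parentMetric` pointer in the PR summary suggests; the `parent_of` mapping and every metric name except `total_revenue` are made up for the example.

```python
from typing import Dict, List


def ancestor_path(urn: str, parent_of: Dict[str, str]) -> List[str]:
    """Return the chain from the root metric down to `urn`.

    `parent_of` maps a metric URN to its parent metric URN (hypothetical
    in-memory stand-in for the parent pointer on each metric).
    """
    path = [urn]
    seen = {urn}
    current = urn
    while current in parent_of:
        parent = parent_of[current]
        if parent in seen:
            # A parent pointer that loops back would never terminate.
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
        seen.add(parent)
        current = parent
    path.reverse()
    return path


parent_of = {
    "urn:li:metric:(dbt,net_revenue)": "urn:li:metric:(dbt,total_revenue)",
    "urn:li:metric:(dbt,total_revenue)": "urn:li:metric:(dbt,finance)",
}
```

A breadcrumb for a nested metric is then just `ancestor_path(metric_urn, parent_of)`.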
| /** | ||
| * A named field (measure or dimension) defined inside a SemanticModel dataset. | ||
| */ | ||
| record Field { |
i notice this doesn't seem to have a type? like numeric, string, time, etc?
will people want to add glossary terms, tags, etc to this?
I'm concerned we're rebuilding schemaField object here and will need to eventually keep adding all the stuff we have in that model.
Good callout. The Field here is the OSI field which is a dimension or a fact: a scalar, row-level attribute defined by its expression over one or more physical columns.
Ex. a derived dimension `DATE_TRUNC('month', orders.order_date)` or `first_name || ' ' || last_name`.
we don't duplicate SchemaField: a field carries only the semantic enrichment (expression, dimension/is_time, label, ai_context) and not the physical attributes. Those stay on SchemaField.
| * A logical dataset referenced by a SemanticModel, together with its schema | ||
| * and primary/unique key definitions. | ||
| */ | ||
| record ModelDataset { |
whoa why is this not just a datasetUrn?
ModelDataset contains semantic enrichment (alias name, dimensions, etc.), and its source of truth is the semantic model.
| /** | ||
| * Join relationships between the datasets of this semantic model. | ||
| */ |
i think we already model join relationships
| /** | ||
| * Vendor-namespaced extension blobs for platform-specific metadata. | ||
| */ | ||
| customExtensions: optional array[CustomExtension] |
again i think this should just be structured properties
alternatively, this could also be the generic CustomProperties aspect that most entities "inherit" in the Info/Properties aspect
| * A join relationship between two logical datasets defined within a SemanticModel. | ||
| * Named SemanticModelRelationship to avoid colliding with common relationship models. | ||
| */ | ||
| record SemanticModelRelationship { |
please confirm no other such model exists
Connector Tests Results: all connector tests passed. (Autogenerated by the connector-tests CI pipeline.)
for both MetricKey and SemanticModelKey, should `env: FabricType` also be included in the key, the same as we do for datasets?
what about typical audit info such as createdAt or lastUpdatedAt?
How will these new entities be linked to the existing entities? E.g., should
Summary
Introduces `metric` and `semanticModel` as first-class DataHub entities. This is the schema-layer foundation for the upcoming Metrics Catalog feature (already scaffolded in the frontend behind the `metricsEnabled` feature flag).

The model follows the "lean OSI-core" shape from the internal Metrics RFC: OSI-shaped core fields plus a vendor-namespaced `customExtensions` bag for anything platform-specific, rather than promoting every platform quirk to a first-class field. Governance and lineage reuse existing native aspects (`ownership`, `domains`, `upstreamLineage`, etc.).

Entities
`metric`: key `(platform, id)` → `urn:li:metric:(dbt,total_revenue)`.

Aspects:
- `metricInfo`: name, description, semantic-model ref, multi-dialect SQL expression, AI context, recoverability, vendor extensions.
- `metricRelationships`: `parentMetric` (glossary-style tree pointer, powers the sidebar), `derivedFrom` (metric→metric lineage via `Edge` with `isLineage: true`), `relatedMetrics` (curated "see also" edges, non-lineage).
- `upstreamLineage`, `ownership`, `domains`, `globalTags`, `glossaryTerms`, `institutionalMemory`, `structuredProperties`, `status`, `deprecation`, `lifecycleStage`, `dataPlatformInstance`, `subTypes`, `forms`, `testResults`, `documentation`, `browsePaths`, `browsePathsV2`, `applications`, `container`, `incidentsSummary`, `displayProperties`, `assetSettings`, `versionProperties`, `access`.

`semanticModel`: key `(platform, id)` → `urn:li:semanticModel:(dbt,sales_orders)`.

Aspects:
- `semanticModelInfo`: name, description, `sourcePlatform` (URN, mirrors `datasetKey.platform`), AI context, `datasets[]` (with fields and dimensions), join `relationships[]`, vendor extensions.

Both keys treat `platform` as required and immutable (part of the URN). If the source platform is unknown at ingest time, the ingestion layer is expected to assign a stable fallback (e.g. `datahub` for native / SDK-authored metrics).

Design notes
- `platform` on the key, not `dataPlatformInstance`. Platform is a structural identity partition ("this is a Snowflake metric vs. a dbt metric; they are distinct entities per the MVP no-dedup decision"), whereas `dataPlatformInstance` is a decorative aspect for search faceting and policy scoping. Same distinction datasets already make.
- `Urn` fields, not typed URN typerefs. Cross-entity references (`MetricInfo.semanticModel`, `MetricRelationships.parentMetric`, etc.) use plain `Urn` with `@Relationship.entityTypes` doing the type gating. Matches the `Domain`/`DataProduct` pattern; can promote to typed URNs later without an aspect version bump.
Out of scope (follow-up PRs)
- `EntityType` enum entries, `EntityTypeMapper`/`UrnToEntityMapper` wiring.
- `MetricEntity`/`SemanticModelEntity` registered in `buildEntityRegistryV2`. Stub `MetricsPage.tsx` already tracks this.
- metric views, OSI upload).
Checklist
- `metadata-models/docs/entities/metric.md` and `metadata-models/docs/entities/semanticModel.md`
- `metricsEnabled` flag defaults `false`)