From Batch to Real-Time: How Incremental Refresh, CDC & Event Streams Power Enterprise Semantic Models in Microsoft Fabric
Introduction — Why Real-Time + Large Models Matter
In asset-heavy operations — water utilities, infrastructure networks, transport assets, energy grids, environmental monitoring — data arrives continuously. Meters tick. Sensors report states. Work orders get created. Asset failures occur unexpectedly.
Traditional BI pipelines refresh once a day — but operations don’t wait 24 hours.
This is why modern data platforms increasingly rely on:
- Incremental refresh
- Change Data Capture (CDC)
- Event streams
- Lakehouse pipelines and streaming ingestion
- Direct Lake + semantic models
Microsoft Fabric merges all these capabilities into one unified architecture, where real-time data engineering and enterprise semantic models work together seamlessly.
In this blog, we explore how these components interact, what technical choices matter, and how platform engineers actually implement them in production.
1. The Challenge — Keeping Large Semantic Models Fresh Without Breaking Them
A large semantic model may contain:
- Billions of transactional or sensor rows
- Years of asset history
- High-velocity telemetry
- Multiple domains (assets, operations, licensing, financials)
- Hundreds of users
- Dozens of report pages
- Real-time dashboards and alerts
A full refresh of such a model is usually:
❌ Too slow
❌ Too expensive
❌ Too disruptive
❌ Too fragile (schema drift will break everything)
Combining incremental refresh with real-time ingestion solves this.
2. Incremental Refresh — The Backbone for Large Data Models
Incremental refresh is the first step toward scaling operational analytics. It works by refreshing only:
- Recent data (e.g., last 3 days/week/month)
- Data within a dynamic partition window
- Optionally, only partitions where a “detect data changes” column shows a change
How it works technically
When you configure incremental refresh in Power BI / Fabric:
- Power Query defines two parameters: RangeStart and RangeEnd.
- Data is partitioned (usually per day/month depending on volume).
- The engine automatically:
  - Rebuilds only partitions inside the “refresh window”
  - Loads new rows
  - Updates existing rows (if CDC style logic is enabled)
  - Retains historical partitions unchanged
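The partitioning behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not how the engine is implemented: the function names, the `event_time` column, and the daily granularity are assumptions made for the example.

```python
from datetime import date, datetime, timedelta


def plan_partitions(today: date, archive_days: int, refresh_days: int) -> dict[str, list[date]]:
    """Split the daily partitions of a rolling window into 'keep' and 'refresh'."""
    window_start = today - timedelta(days=archive_days - 1)
    refresh_start = today - timedelta(days=refresh_days - 1)
    plan: dict[str, list[date]] = {"keep": [], "refresh": []}
    day = window_start
    while day <= today:
        # Only partitions inside the refresh window are reprocessed.
        plan["refresh" if day >= refresh_start else "keep"].append(day)
        day += timedelta(days=1)
    return plan


def rows_in_range(rows: list[dict], range_start: datetime, range_end: datetime) -> list[dict]:
    """Filter in the RangeStart/RangeEnd style: inclusive start, exclusive end,
    so a row on a partition boundary lands in exactly one partition."""
    return [r for r in rows if range_start <= r["event_time"] < range_end]


if __name__ == "__main__":
    plan = plan_partitions(date(2024, 1, 10), archive_days=10, refresh_days=3)
    print("refresh:", plan["refresh"])
    print("keep:", len(plan["keep"]), "partitions untouched")
```

With a 10-day window and a 3-day refresh period, only the last three daily partitions are rebuilt and the other seven are left alone, which is where the saving in refresh time comes from.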
Why it matters for engineers
✔ Massive reduction in refresh time
✔ Better memory usage (important in Fabric capacities)
✔ More reliable
✔ Allows near real-time models
✔ Enables hybrid approaches (e.g. Direct Lake)
Practical tips for engineering teams
- Ensure fact tables have a date or datetime column suitable for partitioning
- Avoid partitioning on high-granularity timestamps (e.g., seconds); use date or hour
- Use “detect data changes” on a modified timestamp column if possible
- Monitor refresh using XMLA DMVs
- Split very large fact tables into multiple logical partitions/domains if refresh windows become too big
- Always test refresh policies in a dev model before pushing to prod
3. Introducing Real-Time: CDC, Event Streams & Streaming Pipelines
Incremental refresh handles historical data efficiently; real-time ingestion keeps the most recent data from going stale.
This is where Microsoft Fabric becomes very powerful for operational analytics.
3.1 Change Data Capture (CDC)
CDC allows you to capture the following directly from source systems:
- INSERTs
- UPDATEs
- DELETEs
In Fabric, you can implement CDC through:
- Lakehouse pipelines connecting to SQL databases
- Azure SQL CDC features
- Eventstreams capturing log-based changes
- External systems writing Delta Lake–compatible CDC logs
CDC → Lakehouse → Semantic Model Flow
- CDC captures new/changed rows from source
- Writes changes to the Lakehouse bronze layer
- A pipeline merges changes into silver/gold Delta tables
- Incremental refresh picks up those deltas
- Semantic model reflects the changes as soon as that refresh completes
Best practices
- If the source system supports CDC, always use it rather than reloading data
- Ensure Delta Lake merge logic handles duplicates, late-arriving data, and deletes
- Use surrogate keys in dimensions to handle type 2 SCD scenarios
- Include appropriate metadata: _commit_timestamp, _operation, _lsn
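To make the merge requirements above concrete, here is a minimal pure-Python sketch of applying a CDC batch to a keyed table. It is not the Delta Lake MERGE API; it only mirrors the semantics a merge would need. The `id` key, the `_deleted` tombstone flag, the operation values `INSERT`/`UPDATE`/`DELETE`, and ISO 8601 timestamps in a single fixed format are all assumptions for the example.

```python
from typing import Any

Row = dict[str, Any]


def _position(row: Row) -> tuple[str, int]:
    # Ordering key: commit time first, log sequence number as tie-breaker.
    return (row["_commit_timestamp"], row["_lsn"])


def apply_cdc(target: dict[Any, Row], changes: list[Row], key: str = "id") -> dict[Any, Row]:
    """Apply a CDC batch to `target` and return a new table (the input is not mutated)."""
    result = dict(target)
    for change in sorted(changes, key=_position):
        current = result.get(change[key])
        # Duplicates and late-arriving changes are skipped: only a change that is
        # strictly newer than what the target already holds is applied.
        if current is not None and _position(change) <= _position(current):
            continue
        row = {col: val for col, val in change.items() if col != "_operation"}
        # Deletes are kept as tombstones so an older update cannot resurrect the row.
        row["_deleted"] = change["_operation"] == "DELETE"
        result[change[key]] = row
    return result


def live_rows(table: dict[Any, Row]) -> list[Row]:
    """Rows that have not been deleted, ordered by key."""
    return [table[k] for k in sorted(table) if not table[k]["_deleted"]]
```

Keeping tombstones rather than removing rows outright is a design choice for this sketch: it lets the same comparison handle deletes and late-arriving updates.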
3.2 Event Streams (Real-Time Fabric)
Event Streams in Fabric provide a low-latency stream ingestion service for:
- Telemetry (IoT sensors, SCADA)
- Asset health metrics
- Environmental monitoring
- Vehicle pings
- Maintenance event triggers
- Alerts or notifications
Technically, Event Streams support multiple targets:
- KQL database
- Lakehouse Delta tables
- Power BI streaming datasets
- Event hubs / service buses
- Notebooks or transformations
Use Case Example — Water Utility Telemetry
Incoming sensor data (e.g., flow, pressure, turbidity) is captured every few seconds:
- Device → Event Hub → Event Stream
- Event Stream → Delta Lake table
- Delta Lake → Direct Lake semantic model
- Dashboard updates with second-level latency
Best practices
- Partition streaming Delta tables by date or hour
- Enforce schema with contracts, because streaming sources often drift
- Use KQL DB for ultra-fast queries, then materialize to lakehouse
- Avoid writing too many small files (use stream compaction in pipelines)
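Two of these practices, schema contracts and hourly partitioning, can be sketched in a few lines of Python. The contract below is hypothetical: the field names follow the water-utility example (flow, pressure, turbidity), while `device_id`, `ts`, and the `YYYY-MM-DD/HH` partition layout are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical contract for the telemetry events: field name -> expected type.
CONTRACT: dict[str, type] = {
    "device_id": str,
    "ts": str,
    "flow": float,
    "pressure": float,
    "turbidity": float,
}


def check_event(event: dict, contract: dict[str, type] = CONTRACT) -> list[str]:
    """Return a list of contract violations; an empty list means the event conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"wrong type for {field}: {type(event[field]).__name__}")
    # Fields the contract does not know about are reported as drift.
    for field in sorted(event.keys() - contract.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


def partition_key(ts: str) -> str:
    """Derive an hourly partition key from an ISO 8601 timestamp with an explicit offset."""
    return datetime.fromisoformat(ts).strftime("%Y-%m-%d/%H")
```

Events that fail the check could be routed to a quarantine table instead of the main Delta table, so a drifting device does not break downstream loads. Note that the type check is strict: an integer reading such as `12` would be rejected for a `float` field.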
3.3 Pipelines as the Orchestrator Layer
Pipelines in Fabric cover what Azure Data Factory (ADF) pipelines did, and more:
- Orchestrate batch + streaming + CDC refreshes
- Execute merge operations into gold Delta tables
- Trigger semantic model refreshes
- Apply validation rules
- Load dimensions using SCD logic
- Send alerts if job failures occur
4. When Real-Time + Incremental Meets Semantic Models
This is the most important architectural interaction.
A semantic model can now draw on:
1. Historical fact tables (imported or Direct Lake)
- Managed via incremental refresh
- Efficient and stable
2. Near-real-time Delta tables (Direct Lake)
- Updated via CDC, event streams, or pipelines
- No import-style “refresh dataset” is needed; the model picks up new Delta table versions without reloading the data
- Cached intelligently by the engine
3. Real-time dashboards (direct stream)
- For telemetry, alerts, and dashboards that require <5-sec latency
Together, this forms a hybrid semantic model:
| Layer | Purpose |
|---|---|
| Import | Stable, aggregated history |
| Direct Lake | Near real-time operational data |
| Streaming visuals | Instant telemetry/alerts |
Why this matters technically
- Import tables do not need constant refresh
- Direct Lake avoids full ingestion into VertiPaq
- File changes in Delta Lake are reflected immediately in the semantic model
- Refresh windows stay extremely small
- Capacity use is dramatically reduced
- DAX measures can reference both historical + real-time data simultaneously
5. Things That Fail — And How Engineers Fix Them
When working with real-time + incremental + semantic models, here are the failure points:
5.1 Schema drift
Streaming systems often add fields unexpectedly.
➡ Fix using schema validation pipelines
➡ Convert unsupported types in silver layer
5.2 Small file problem
Streaming ingestion → thousands of tiny Parquet files
➡ Use Fabric’s file compaction pipelines
➡ Optimise delta tables (OPTIMIZE in notebooks)
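As a rough illustration of what compaction does, the sketch below groups small files into batches that would each be rewritten as one larger file. In practice the OPTIMIZE command mentioned above does this work; the function here is a toy model, and the 128 MB target and 32 MB "small file" threshold are assumed values, not Fabric defaults.

```python
def plan_compaction(
    file_sizes_mb: dict[str, float],
    target_mb: float = 128.0,
    small_mb: float = 32.0,
) -> list[list[str]]:
    """Greedily group files smaller than `small_mb` into batches of at most `target_mb`."""
    small = sorted(name for name, size in file_sizes_mb.items() if size < small_mb)
    groups: list[list[str]] = []
    current: list[str] = []
    current_size = 0.0
    for name in small:
        size = file_sizes_mb[name]
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    # A batch with a single file gains nothing from being rewritten.
    return [group for group in groups if len(group) > 1]


if __name__ == "__main__":
    files = {"a": 10, "b": 20, "c": 200, "d": 30, "e": 25, "f": 5}
    print(plan_compaction(files, target_mb=60, small_mb=32))
```

The large file `c` is left alone, and the five small files collapse into two batches, which is the effect compaction has on file counts.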
5.3 Late-arriving CDC updates
Updates sometimes arrive out of order.
➡ Use merge logic sorted by _commit_timestamp
➡ Deduplicate using row hash keys
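A minimal sketch of both fixes, assuming each change record is a flat dictionary and that `_commit_timestamp` values share one sortable format. The hash covers the whole record, so it removes re-delivered copies of the same change without dropping a legitimate return to an earlier value. The function names are hypothetical.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable SHA-256 hash of a record, independent of key order."""
    encoded = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def dedupe_and_order(changes: list[dict]) -> list[dict]:
    """Drop exact duplicate records, then sort the rest by commit timestamp."""
    seen: set[str] = set()
    unique = []
    for change in changes:
        digest = row_hash(change)
        if digest not in seen:
            seen.add(digest)
            unique.append(change)
    return sorted(unique, key=lambda c: c["_commit_timestamp"])


if __name__ == "__main__":
    batch = [
        {"id": 7, "status": "closed", "_commit_timestamp": "2024-03-01T10:05:00Z"},
        {"id": 7, "status": "open", "_commit_timestamp": "2024-03-01T10:00:00Z"},
        {"id": 7, "status": "closed", "_commit_timestamp": "2024-03-01T10:05:00Z"},
    ]
    for change in dedupe_and_order(batch):
        print(change["_commit_timestamp"], change["status"])
```

After this step the batch is safe to feed into a merge in order, with the last change per key winning.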
5.4 Semantic model refresh failures
Likely caused by:
- Partition too large
- Memory pressure
- Cardinality spikes
➡ Reduce incremental window
➡ Create smaller partitions
➡ Use Date-Hour partitioning
➡ Move hot data to Direct Lake instead of Import
5.5 Direct Lake fallback to DirectQuery
If Delta tables are poorly optimised:
- Too many small files
- Unsupported data types
- Missing V-Order / Z-Order
➡ Refactor pipeline to enforce performant Delta Lake tables
6. Practical Guidance — For Fabric Engineers
For Fact Tables
-
Partition by date
-
Store in Delta format
-
Compact files daily/hourly
-
Keep history in Import mode
-
Use Direct Lake for hot data window
For Dimension Tables
- Use SCD2 logic for asset lifecycle
- Use surrogate keys
- Version dimensions monthly
- Apply OLS/RLS at dimension level
For Pipelines
- Build simple reusable components (CDC merge, compaction, SCD load)
- Trigger semantic model refresh only when gold tables update
- Use alerts and failure pipelines
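One way to refresh only when gold tables have changed is to compare each table's current Delta version with the version seen at the last refresh. The sketch below assumes you already have those version numbers (Delta tables expose an increasing commit version in their history); `trigger_refresh` is a hypothetical callback standing in for whatever starts the refresh in your pipeline.

```python
from typing import Callable


def tables_needing_refresh(current: dict[str, int], last_refreshed: dict[str, int]) -> list[str]:
    """Gold tables whose Delta version advanced since the last model refresh."""
    return sorted(
        table for table, version in current.items()
        if version > last_refreshed.get(table, -1)
    )


def refresh_if_changed(
    current: dict[str, int],
    last_refreshed: dict[str, int],
    trigger_refresh: Callable[[list[str]], None],
) -> list[str]:
    """Call `trigger_refresh` only when something changed, then record the new versions."""
    changed = tables_needing_refresh(current, last_refreshed)
    if changed:
        trigger_refresh(changed)  # hypothetical: start the semantic model refresh
        last_refreshed.update({table: current[table] for table in changed})
    return changed


if __name__ == "__main__":
    state = {"fact_work_orders": 41, "dim_asset": 7}
    latest = {"fact_work_orders": 43, "dim_asset": 7}
    refresh_if_changed(latest, state, lambda tables: print("refreshing for:", tables))
```

Running the step a second time with the same versions does nothing, which keeps capacity use down when no new data has landed.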
For Semantic Models
- Use hybrid Import + Direct Lake
- Use incremental refresh for historical partitions
- Turn on “Large semantic model” format
- Avoid snowflake schemas; use a star schema
- Use Tabular Editor for governance and CI/CD
7. Conclusion — The New Standard for Operational Analytics
Microsoft Fabric + Power BI semantic models deliver something no past Microsoft stack could:
▶ Batch + Streaming + CDC + Real-Time + Lakehouse + BI in one integrated ecosystem.
For asset-heavy operations, this architecture enables:
- Near-real-time dashboards
- Operational alerts
- 24/7 monitoring
- Up-to-date KPIs
- Scalable history analytics
- Robust, governed semantic models
- Lower cost and higher reliability
This is the future of enterprise analytics — not nightly refreshes, but continuously updated semantic models feeding the entire organisation.