Meetings generate a lot of text and not much clarity. Even when you have a transcript, you still don't have minutes: the abstract summary, the action items, the decisions, the problems raised, and the progress updates--cleanly separated and easy to scan.
This project tackles that gap with two focused NLP models:
1. A meeting-segment classifier that labels each utterance as one of:
   abstract, actions, decisions, problems, progress
2. A topic-shift detector that flags where the meeting switches to a new topic (useful for segmentation and agenda structure)
Both models are trained using a modern transformer backbone (microsoft/deberta-v3-base) with practical training choices that matter in real pipelines: class imbalance handling, early stopping, and context-aware inputs.
The Problem: Transcripts Aren't Minutes
A transcript is a chronological dump. Minutes are structured artifacts:
- Abstract: the "what this meeting was about"
- Actions: tasks and assignments
- Decisions: agreed outcomes
- Problems: blockers and risks
- Progress: updates on ongoing work
If you can reliably label utterances into these buckets, you can auto-generate minutes that are actually usable--especially when combined with topic segmentation.
Data: AMI + ICSI Meeting Corpora (JSON Utterances)
The training data is assembled from two well-known meeting datasets:
- AMI meeting corpus
- ICSI meeting corpus
Utterances are loaded from JSON files organized by categories. Each record keeps:
- meeting/utterance id
- raw text
- label/type (filtered to the five valid classes)
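As a rough sketch of this loading step (the exact JSON field names and directory layout are assumptions here, not the project's actual schema), filtering to the five valid classes might look like:

```python
import json
from pathlib import Path

# The five labels the classifier is trained on.
VALID_LABELS = {"abstract", "actions", "decisions", "problems", "progress"}

def filter_utterances(records):
    """Keep only records whose label is one of the five valid classes
    and which carry non-empty text. Field names are illustrative."""
    kept = []
    for rec in records:
        label = str(rec.get("label", "")).lower()
        if label in VALID_LABELS and rec.get("text"):
            kept.append({
                "meeting_id": rec.get("meeting_id"),
                "utterance_id": rec.get("utterance_id"),
                "text": rec["text"],
                "label": label,
            })
    return kept

def load_corpus(root):
    """Walk a directory of per-category JSON files (hypothetical layout)."""
    utterances = []
    for path in Path(root).glob("**/*.json"):
        with open(path, encoding="utf-8") as f:
            utterances.extend(filter_utterances(json.load(f)))
    return utterances
```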
The goal is not "summarize everything" immediately. The goal is to create clean intermediate structure (labels + segments) that downstream summarization can use reliably.
Text Prep: Simple, Controlled, Repeatable
The preprocessing is intentionally conservative:
- lowercase normalization
- whitespace cleanup
- punctuation/special character stripping
This reduces noise without doing aggressive transformations that might destroy meaning.
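The three steps above fit in a few lines of regex (a minimal sketch; the project's actual cleanup may differ in details such as which characters it keeps):

```python
import re

def preprocess(text: str) -> str:
    """Conservative cleanup: lowercase, strip punctuation/special
    characters, collapse whitespace. No stemming, no stopword removal."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation/special chars
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text
```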
There's also optional feature engineering (counts for question marks, exclamations, and keyword hits like will/should/need for actions, agreed/decide for decisions, issue/risk for problems). Even if the transformer does most of the heavy lifting, these features are useful for diagnostics and can support hybrid baselines if needed.
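A sketch of those optional features (the keyword lists mirror the ones named above; the exact lists and feature names in the project may be broader). Note it counts punctuation on the raw text, before the lowercasing/stripping step removes it:

```python
import re

# Illustrative keyword lists, taken from the examples in the text.
ACTION_WORDS = {"will", "should", "need"}
DECISION_WORDS = {"agreed", "decide"}
PROBLEM_WORDS = {"issue", "risk"}

def utterance_features(text: str) -> dict:
    """Simple count features for diagnostics / hybrid baselines."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "n_question": text.count("?"),
        "n_exclaim": text.count("!"),
        "n_action_kw": sum(t in ACTION_WORDS for t in tokens),
        "n_decision_kw": sum(t in DECISION_WORDS for t in tokens),
        "n_problem_kw": sum(t in PROBLEM_WORDS for t in tokens),
    }
```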
Model 1: Multi-Class Utterance Classification (DeBERTa v3)
Why DeBERTa v3?
Distilled models are fast, but they tend to lose accuracy on nuanced distinctions (especially "decision vs. action vs. progress"). DeBERTa v3 is a solid accuracy-first baseline without stepping up to giant, expensive models.
Key training choices
- MODEL: microsoft/deberta-v3-base
- MAX_LENGTH: 256 (more context per utterance)
- LR: 1e-5 (stable convergence for a stronger encoder)
- EPOCHS: up to 10 with early stopping
- Mixed precision (fp16) when GPU is available
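One plausible way these choices map onto Hugging Face `TrainingArguments` (a hedged sketch: batch size, patience, and the best-model metric are assumptions not stated above, and argument names follow the `transformers` API):

```python
import torch
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./models/enhanced_action_decision_classifier",
    learning_rate=1e-5,                 # stable convergence for a stronger encoder
    num_train_epochs=10,                # upper bound; early stopping cuts it short
    per_device_train_batch_size=16,     # assumption: not stated in the text
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_macro_f1",  # assumption: macro F1 as the target
    fp16=torch.cuda.is_available(),     # mixed precision only when a GPU exists
)

# Patience value is illustrative.
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
```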
Handling class imbalance (this matters)
Meeting labels are rarely balanced (e.g., "actions" can be sparse; "progress" might dominate depending on dataset). This project does two things:
1. Dataset balancing via downsampling/upsampling per class
2. Class-weighted loss via a custom WeightedTrainer that applies CrossEntropyLoss(weight=...)
That combination reduces the "always predict the most common label" failure mode.
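The weight vector itself is easy to sketch. A common choice (an assumption here, not necessarily the project's exact formula) is inverse-frequency weights, normalized so the mean weight is 1; the result would be handed to `torch.nn.CrossEntropyLoss(weight=torch.tensor(w))` inside the WeightedTrainer's `compute_loss` override:

```python
from collections import Counter

def class_weights(labels, classes):
    """Inverse-frequency class weights: rare classes get larger weight.
    Normalized so the mean weight across classes is 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    raw = [n / (k * counts[c]) for c in classes]
    mean = sum(raw) / k
    return [w / mean for w in raw]
```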
Evaluation metrics
The pipeline computes:
- accuracy
- weighted precision/recall/F1
- macro F1 (important when class imbalance exists)
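In practice these typically come from sklearn's `precision_recall_fscore_support` (with `average="weighted"` and `average="macro"`), but the macro-F1 logic is simple enough to sketch in pure Python, which makes clear why it punishes a collapsed minority class:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: every class counts equally,
    so a model that ignores a rare class is penalized hard."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(classes)
```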
The model is saved to:
- ./models/enhanced_action_decision_classifier (Despite the folder name, the classifier is trained over the full multi-class set: abstract/actions/decisions/problems/progress.)
Model 2: Topic Shift Detection (Binary Classification)
Minutes are more readable when grouped by topic. So the second part builds a topic-shift dataset and trains a binary classifier:
- label = 1: a topic shift is detected
- label = 0: the same topic continues
Creating labels (heuristics-based)
Instead of relying on manual boundary annotations, the dataset is generated using practical heuristics:
- Lexical overlap between previous and current utterance
  - low overlap suggests a topic jump
- Length ratio changes
  - sudden changes can correlate with transitions (e.g., short pivot phrases)
- Shift indicator phrases
  - e.g., "moving on", "next topic", "different matter", "new agenda"
Context-aware input
Each training example includes context using a separator token:
prev_text + " [SEP] " + curr_text
This is important: topic shift is not purely about the current line--it's about how it relates to what came right before it.
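The input construction is trivial but worth pinning down (the explicit string form matches the line above; an equivalent alternative, noted in the comment, is to pass the two texts as a pair and let the tokenizer insert the model's own separator):

```python
def make_pair_input(prev_text: str, curr_text: str) -> str:
    """Concatenate context and current utterance around an explicit [SEP]."""
    return f"{prev_text} [SEP] {curr_text}"

# With a Hugging Face tokenizer, the pair-encoding alternative would be
#   tokenizer(prev_text, curr_text, truncation=True, max_length=256)
# which inserts the model's native separator token automatically.
```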
The model is saved to:
- ./models/enhanced_topic_shift_detector
What You Get: A Clean "Minutes Substrate"
Once you have:
- utterance-level labels (abstract/actions/decisions/problems/progress)
- topic boundaries (shift/no-shift)
you can generate minutes in a structured way:
1. Segment transcript by topic shifts
2. Within each segment, group utterances by label
3. Render minutes:
- Abstract (top)
- Decisions (bulleted)
- Actions (task list format)
- Problems/Risks
- Progress updates
This approach avoids the common summarization failure where a model blends everything into a vague paragraph.
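Once labels and boundaries exist, the rendering step is just grouping and templating. A hypothetical sketch (section order and titles follow the list above; the input shape is an assumption):

```python
SECTION_ORDER = ["abstract", "decisions", "actions", "problems", "progress"]
SECTION_TITLES = {
    "abstract": "Abstract",
    "decisions": "Decisions",
    "actions": "Action Items",
    "problems": "Problems / Risks",
    "progress": "Progress Updates",
}

def render_minutes(segments):
    """segments: list of topic segments, each a list of (label, text) tuples."""
    lines = []
    for i, seg in enumerate(segments, 1):
        lines.append(f"Topic {i}")
        for label in SECTION_ORDER:
            items = [text for lbl, text in seg if lbl == label]
            if items:
                lines.append(f"  {SECTION_TITLES[label]}:")
                lines.extend(f"    - {t}" for t in items)
    return "\n".join(lines)
```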
Practical Integration: Where This Fits in a Real Pipeline
A realistic end-to-end flow looks like:
1. Speech-to-text transcription (ASR)
2. Utterance splitting (timestamp-based or punctuation-based)
3. Topic shift detection to segment the meeting
4. Utterance classification inside segments
5. Summarize per segment + compile minutes template
Even if you swap out the summarizer later, the structure from steps (3-4) remains valuable and stable.
Limitations (Be Honest About What's Real)
- Topic-shift labels are heuristic-generated, not ground truth. They're useful, but you should expect noise.
- Using only local context (prev [SEP] curr) is a baseline; some transitions require a longer window.
- Meeting corpora vary in style; generalization to your organization's meeting style will usually require some fine-tuning.
Next Steps That Actually Improve This
If you want this to be production-grade (or at least "reliably good"):
- Add longer context windows (e.g., rolling 3-5 utterances)
- Use confidence thresholds + "needs review" bucket
- Train topic shifts with a small manually labeled set (even a few hundred boundaries helps a lot)
- Multi-task learning: classify label + detect shift jointly
- Add speaker/turn features (speaker changes often correlate with topic transitions)
Summary
This project builds a strong foundation for automated meeting minutes by focusing on structure first:
- A DeBERTa-based classifier turns raw utterances into meaningful categories.
- A second DeBERTa model detects topic boundaries to make minutes readable and organized.
- Class imbalance is addressed both by balancing and class-weighted loss.
- Models are saved as reusable artifacts, ready to plug into a larger meeting-to-minutes workflow.