
From Raw Meeting Transcripts to Structured Minutes: Training DeBERTa for 'What Happened?' and 'What Changed?'


A practical pipeline for turning transcripts into structured minutes using DeBERTa classifiers.

By MuFaw Team
21 Jan 2026

Tags: NLP, DeBERTa, Meeting Minutes, Classification, Topic Segmentation, Transformers

Meetings generate a lot of text and not much clarity. Even when you have a transcript, you still don't have minutes: the abstract summary, the action items, the decisions, the problems raised, and the progress updates, cleanly separated and easy to scan.

This project tackles that gap with two focused NLP models:

1. A meeting-segment classifier that labels each utterance as one of: abstract, actions, decisions, problems, progress

2. A topic-shift detector that flags where the meeting switches to a new topic (useful for segmentation and agenda structure)

Both models are trained using a modern transformer backbone (microsoft/deberta-v3-base) with practical training choices that matter in real pipelines: class imbalance handling, early stopping, and context-aware inputs.

The Problem: Transcripts Aren't Minutes

A transcript is a chronological dump. Minutes are structured artifacts:

  • Abstract: the "what this meeting was about"
  • Actions: tasks and assignments
  • Decisions: agreed outcomes
  • Problems: blockers and risks
  • Progress: updates on ongoing work

If you can reliably label utterances into these buckets, you can auto-generate minutes that are actually usable, especially when combined with topic segmentation.

Data: AMI + ICSI Meeting Corpora (JSON Utterances)

The training data is assembled from two well-known meeting datasets:

  • AMI meeting corpus
  • ICSI meeting corpus

Utterances are loaded from JSON files organized by categories. Each record keeps:

  • meeting/utterance id
  • raw text
  • label/type (filtered to the five valid classes)

The goal is not "summarize everything" immediately. The goal is to create clean intermediate structure (labels + segments) that downstream summarization can use reliably.

Text Prep: Simple, Controlled, Repeatable

The preprocessing is intentionally conservative:

  • lowercase normalization
  • whitespace cleanup
  • punctuation/special character stripping

This reduces noise without doing aggressive transformations that might destroy meaning.
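As a sketch, that conservative cleanup can be a single function. The exact character rules are an assumption on my part; this version keeps `?` and `!` because the optional feature engineering described below counts them:

```python
import re

def clean_utterance(text: str) -> str:
    """Conservative cleanup: lowercase, drop special characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s'.,?!]", " ", text)  # strip special characters
    text = re.sub(r"\s+", " ", text).strip()       # normalize whitespace
    return text
```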

There's also optional feature engineering (counts for question marks, exclamations, and keyword hits like will/should/need for actions, agreed/decide for decisions, issue/risk for problems). Even if the transformer does most of the heavy lifting, these features are useful for diagnostics and can support hybrid baselines if needed.
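Those hand-crafted counts are cheap to compute. A minimal sketch, using the keyword lists named in the post (the real keyword sets are likely longer):

```python
def utterance_features(text: str) -> dict:
    """Count simple diagnostic signals: punctuation and class-keyword hits."""
    action_kw = {"will", "should", "need"}      # action cues from the post
    decision_kw = {"agreed", "decide"}          # decision cues
    problem_kw = {"issue", "risk"}              # problem cues
    tokens = [t.strip(".,?!") for t in text.lower().split()]
    return {
        "question_marks": text.count("?"),
        "exclamations": text.count("!"),
        "action_hits": sum(t in action_kw for t in tokens),
        "decision_hits": sum(t in decision_kw for t in tokens),
        "problem_hits": sum(t in problem_kw for t in tokens),
    }
```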

Model 1: Multi-Class Utterance Classification (DeBERTa v3)

Why DeBERTa v3?

Distilled models are fast, but they tend to drop accuracy on nuanced classification (especially "decision vs action vs progress"). DeBERTa v3 is a solid accuracy-first baseline without going into giant, expensive models.

Key training choices

  • MODEL: microsoft/deberta-v3-base
  • MAX_LENGTH: 256 (more context per utterance)
  • LR: 1e-5 (stable convergence for a stronger encoder)
  • EPOCHS: up to 10 with early stopping
  • Mixed precision (fp16) when GPU is available
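With standard Hugging Face tooling, these choices map onto a `TrainingArguments` configuration roughly like the following. Batch size and early-stopping patience are assumptions, since the post doesn't state them:

```python
import torch
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./models/enhanced_action_decision_classifier",
    learning_rate=1e-5,              # stable convergence for deberta-v3-base
    num_train_epochs=10,             # upper bound; early stopping cuts it short
    per_device_train_batch_size=16,  # assumed; not stated in the post
    eval_strategy="epoch",           # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

# Stop when eval loss fails to improve for 2 consecutive epochs (patience assumed).
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```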

Handling class imbalance (this matters)

Meeting labels are rarely balanced (e.g., "actions" can be sparse; "progress" might dominate depending on dataset). This project does two things:

  1. Dataset balancing via downsampling/upsampling per class
  2. Class-weighted loss via a custom WeightedTrainer that applies CrossEntropyLoss(weight=...)

That combination reduces the "always predict the most common label" failure mode.
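A sketch of what such a WeightedTrainer can look like. The weight computation shown (inverse class frequency) is one common heuristic and an assumption here, not necessarily the post's exact formula:

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Hugging Face Trainer with class-weighted cross-entropy loss."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # tensor of shape (num_labels,)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

def inverse_frequency_weights(labels, num_classes=5):
    """Weight each class inversely to its frequency in the training labels."""
    counts = torch.bincount(torch.as_tensor(labels), minlength=num_classes).float()
    return counts.sum() / (num_classes * counts.clamp(min=1))
```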

Evaluation metrics

The pipeline computes:

  • accuracy
  • weighted precision/recall/F1
  • macro F1 (important when class imbalance exists)
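In a Hugging Face pipeline these are typically computed in a `compute_metrics` callback; a minimal sketch with scikit-learn (metric key names are my choice):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Accuracy, weighted P/R/F1, and macro F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy_score(labels, preds),
        "weighted_precision": precision,
        "weighted_recall": recall,
        "weighted_f1": f1,
        "macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
    }
```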

The model is saved to:

  • ./models/enhanced_action_decision_classifier (Despite the folder name, the classifier is trained over the full multi-class set: abstract/actions/decisions/problems/progress.)

Model 2: Topic Shift Detection (Binary Classification)

Minutes are more readable when grouped by topic. So the second part builds a topic-shift dataset and trains a binary classifier:

  • label = 1: topic shift detected
  • label = 0: same topic continues

Creating labels (heuristics-based)

Instead of relying on manual boundary annotations, the dataset is generated using practical heuristics:

  • Lexical overlap between previous and current utterance
    • low overlap suggests a topic jump
  • Length ratio changes
    • sudden changes can correlate with transitions (e.g., short pivot phrases)
  • Shift indicator phrases
    • e.g., "moving on", "next topic", "different matter", "new agenda"
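The three heuristics combine into a simple labeling rule. This sketch uses Jaccard overlap and illustrative thresholds; the post's exact cutoffs and phrase list are assumptions here:

```python
SHIFT_PHRASES = ("moving on", "next topic", "different matter", "new agenda")

def label_topic_shift(prev_text: str, curr_text: str,
                      overlap_threshold: float = 0.1,       # assumed cutoff
                      length_ratio_threshold: float = 3.0,  # assumed cutoff
                      ) -> int:
    """Heuristic boundary label: 1 = topic shift, 0 = same topic continues."""
    curr_lower = curr_text.lower()
    # Explicit pivot phrases are the strongest signal.
    if any(p in curr_lower for p in SHIFT_PHRASES):
        return 1
    prev_tokens = set(prev_text.lower().split())
    curr_tokens = set(curr_lower.split())
    if not prev_tokens or not curr_tokens:
        return 0
    # Jaccard overlap: low overlap suggests a topic jump.
    overlap = len(prev_tokens & curr_tokens) / len(prev_tokens | curr_tokens)
    # Sudden length changes can correlate with transitions.
    ratio = max(len(prev_tokens), len(curr_tokens)) / min(len(prev_tokens), len(curr_tokens))
    return 1 if (overlap < overlap_threshold and ratio > length_ratio_threshold) else 0
```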

Context-aware input

Each training example includes context using a separator token:

prev_text + " [SEP] " + curr_text

This is important: topic shift is not purely about the current line; it's about how it relates to what came right before it.
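Building the paired inputs is a simple sliding window over consecutive utterances. A sketch, assuming a parallel list of boundary labels (the field names here are my choice):

```python
def build_shift_examples(utterances, shift_labels):
    """Pair each utterance with its predecessor via a [SEP] separator.

    shift_labels[i] = 1 if utterance i starts a new topic, else 0
    (hypothetical input format).
    """
    examples = []
    for i in range(1, len(utterances)):
        examples.append({
            "text": utterances[i - 1] + " [SEP] " + utterances[i],
            "label": shift_labels[i],
        })
    return examples
```

Note that Hugging Face tokenizers can also take `(prev, curr)` as a sentence pair and insert the model's own separator token; the explicit string concatenation above mirrors the post's description.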

The model is saved to:

  • ./models/enhanced_topic_shift_detector

What You Get: A Clean "Minutes Substrate"

Once you have:

  • utterance-level labels (abstract/actions/decisions/problems/progress)
  • topic boundaries (shift/no-shift)

You can generate minutes in a structured way:

  1. Segment the transcript by topic shifts
  2. Within each segment, group utterances by label
  3. Render minutes:
  • Abstract (top)
  • Decisions (bulleted)
  • Actions (task list format)
  • Problems/Risks
  • Progress updates

This approach avoids the common summarization failure where a model blends everything into a vague paragraph.
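Those three rendering steps can be sketched end to end. The input format (dicts with `text`, `label`, `is_shift`) is a hypothetical shape for the two models' combined output, not something the post specifies:

```python
from collections import defaultdict

SECTION_ORDER = ["abstract", "decisions", "actions", "problems", "progress"]

def render_minutes(utterances):
    """Render structured minutes from labeled, shift-annotated utterances."""
    # 1. Segment the transcript at topic shifts.
    segments, current = [], []
    for u in utterances:
        if u["is_shift"] and current:
            segments.append(current)
            current = []
        current.append(u)
    if current:
        segments.append(current)

    # 2-3. Group by label inside each segment and render sections in order.
    lines = []
    for i, seg in enumerate(segments, 1):
        lines.append(f"## Topic {i}")
        by_label = defaultdict(list)
        for u in seg:
            by_label[u["label"]].append(u["text"])
        for label in SECTION_ORDER:
            if by_label[label]:
                lines.append(f"### {label.title()}")
                lines.extend(f"- {t}" for t in by_label[label])
    return "\n".join(lines)
```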

Practical Integration: Where This Fits in a Real Pipeline

A realistic end-to-end flow looks like:

  1. Speech-to-text transcription (ASR)
  2. Utterance splitting (timestamp-based or punctuation-based)
  3. Topic shift detection to segment the meeting
  4. Utterance classification inside segments
  5. Summarize per segment + compile the minutes template

Even if you swap out the summarizer later, the structure from steps (3-4) remains valuable and stable.
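At the orchestration level, that flow is just function composition; every component below is a stand-in you would supply, which is exactly what makes the summarizer swappable:

```python
def transcript_to_minutes(audio_path, asr, splitter, shift_model, label_model, summarizer):
    """End-to-end sketch; each argument is a placeholder for a real component."""
    transcript = asr(audio_path)                   # 1. speech-to-text
    utterances = splitter(transcript)              # 2. utterance splitting
    shifts = shift_model(utterances)               # 3. topic boundaries
    labels = label_model(utterances)               # 4. per-utterance classes
    return summarizer(utterances, shifts, labels)  # 5. compile minutes
```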

Limitations (Be Honest About What's Real)

  • Topic-shift labels are heuristic-generated, not ground truth. They're useful, but you should expect noise.
  • Using only local context (prev [SEP] curr) is a baseline; some transitions require a longer window.
  • Meeting corpora vary in style; generalization to your organization's meeting style will usually require some fine-tuning.

Next Steps That Actually Improve This

If you want this to be production-grade (or at least "reliably good"):

  • Add longer context windows (e.g., rolling 3-5 utterances)
  • Use confidence thresholds + "needs review" bucket
  • Train topic shifts with a small manually labeled set (even a few hundred boundaries helps a lot)
  • Multi-task learning: classify label + detect shift jointly
  • Add speaker/turn features (speaker changes often correlate with topic transitions)

Summary

This project builds a strong foundation for automated meeting minutes by focusing on structure first:

  • A DeBERTa-based classifier turns raw utterances into meaningful categories.
  • A second DeBERTa model detects topic boundaries to make minutes readable and organized.
  • Class imbalance is addressed both by balancing and class-weighted loss.
  • Models are saved as reusable artifacts, ready to plug into a larger meeting-to-minutes workflow.
