Meetings generate a lot of text and not much clarity. Even when you have a transcript, you still don't have minutes: the abstract summary, the action items, the decisions, the problems raised, and the progress updates--cleanly separated and easy to scan.
This project tackles that gap with two focused NLP models:
1. A meeting-segment classifier that labels each utterance as one of:
   abstract, actions, decisions, problems, progress
2. A topic-shift detector that flags where the meeting switches to a new topic (useful for segmentation and agenda structure)
Both models are trained using a modern transformer backbone (microsoft/deberta-v3-base) with practical training choices that matter in real pipelines: class imbalance handling, early stopping, and context-aware inputs.
The Problem: Transcripts Aren't Minutes
A transcript is a chronological dump. Minutes are structured artifacts:
- Abstract: the "what this meeting was about"
- Actions: tasks and assignments
- Decisions: agreed outcomes
- Problems: blockers and risks
- Progress: updates on ongoing work
If you can reliably label utterances into these buckets, you can auto-generate minutes that are actually usable--especially when combined with topic segmentation.
Data: AMI + ICSI Meeting Corpora (JSON Utterances)
The training data is assembled from two well-known meeting datasets:
- AMI meeting corpus
- ICSI meeting corpus
Utterances are loaded from JSON files organized by categories. Each record keeps:
- meeting/utterance id
- raw text
- label/type (filtered to the five valid classes)
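As a rough sketch of this loading step (the exact JSON field names and directory layout are assumptions here, not the project's actual schema), filtering to the five valid classes might look like:

```python
import json
from pathlib import Path

# The five labels the classifier is trained on.
VALID_LABELS = {"abstract", "actions", "decisions", "problems", "progress"}

def filter_utterances(records):
    """Keep only records whose label is one of the five valid classes
    and which carry non-empty text. Field names are illustrative."""
    kept = []
    for rec in records:
        label = str(rec.get("label", "")).lower()
        if label in VALID_LABELS and rec.get("text"):
            kept.append({
                "meeting_id": rec.get("meeting_id"),
                "utterance_id": rec.get("utterance_id"),
                "text": rec["text"],
                "label": label,
            })
    return kept

def load_corpus(root):
    """Walk a directory of per-category JSON files (hypothetical layout)."""
    utterances = []
    for path in Path(root).glob("**/*.json"):
        with open(path, encoding="utf-8") as f:
            utterances.extend(filter_utterances(json.load(f)))
    return utterances
```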
The goal is not "summarize everything" immediately. The goal is to create clean intermediate structure (labels + segments) that downstream summarization can use reliably.
Text Prep: Simple, Controlled, Repeatable
The preprocessing is intentionally conservative:
- lowercase normalization
- whitespace cleanup
- punctuation/special character stripping
This reduces noise without doing aggressive transformations that might destroy meaning.
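The three steps above fit in a few lines of regex (a minimal sketch; the project's actual cleanup may differ in details such as which characters it keeps):

```python
import re

def preprocess(text: str) -> str:
    """Conservative cleanup: lowercase, strip punctuation/special
    characters, collapse whitespace. No stemming, no stopword removal."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation/special chars
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text
```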
There's also optional feature engineering (counts for question marks, exclamations, and keyword hits like will/should/need for actions, agreed/decide for decisions, issue/risk for problems). Even if the transformer does most of the heavy lifting, these features are useful for diagnostics and can support hybrid baselines if needed.
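A sketch of those optional features (the keyword lists mirror the ones named above; the exact lists and feature names in the project may be broader). Note it counts punctuation on the raw text, before the lowercasing/stripping step removes it:

```python
import re

# Illustrative keyword lists, taken from the examples in the text.
ACTION_WORDS = {"will", "should", "need"}
DECISION_WORDS = {"agreed", "decide"}
PROBLEM_WORDS = {"issue", "risk"}

def utterance_features(text: str) -> dict:
    """Simple count features for diagnostics / hybrid baselines."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "n_question": text.count("?"),
        "n_exclaim": text.count("!"),
        "n_action_kw": sum(t in ACTION_WORDS for t in tokens),
        "n_decision_kw": sum(t in DECISION_WORDS for t in tokens),
        "n_problem_kw": sum(t in PROBLEM_WORDS for t in tokens),
    }
```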
Model 1: Multi-Class Utterance Classification (DeBERTa v3)
Why DeBERTa v3?
Distilled models are fast, but they tend to lose accuracy on nuanced distinctions (especially "decision vs. action vs. progress"). DeBERTa v3 is a solid accuracy-first baseline without stepping up to giant, expensive models.
Key training choices
- MODEL: microsoft/deberta-v3-base
- MAX_LENGTH: 256 (more context per utterance)
- LR: 1e-5 (stable convergence for a stronger encoder)
- EPOCHS: up to 10 with early stopping
- Mixed precision (fp16) when GPU is available
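One plausible way these choices map onto Hugging Face `TrainingArguments` (a hedged sketch: batch size, patience, and the best-model metric are assumptions not stated above, and argument names follow the `transformers` API):

```python
import torch
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./models/enhanced_action_decision_classifier",
    learning_rate=1e-5,                 # stable convergence for a stronger encoder
    num_train_epochs=10,                # upper bound; early stopping cuts it short
    per_device_train_batch_size=16,     # assumption: not stated in the text
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_macro_f1",  # assumption: macro F1 as the target
    fp16=torch.cuda.is_available(),     # mixed precision only when a GPU exists
)

# Patience value is illustrative.
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
```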
Handling class imbalance (this matters)
Meeting labels are rarely balanced (e.g., "actions" can be sparse; "progress" might dominate depending on dataset). This project does two things:
1. Dataset balancing via downsampling/upsampling per class
2. Class-weighted loss via a custom WeightedTrainer that applies CrossEntropyLoss(weight=...)
That combination reduces the "always predict the most common label" failure mode.
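The weight vector itself is easy to sketch. A common choice (an assumption here, not necessarily the project's exact formula) is inverse-frequency weights, normalized so the mean weight is 1; the result would be handed to `torch.nn.CrossEntropyLoss(weight=torch.tensor(w))` inside the WeightedTrainer's `compute_loss` override:

```python
from collections import Counter

def class_weights(labels, classes):
    """Inverse-frequency class weights: rare classes get larger weight.
    Normalized so the mean weight across classes is 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    raw = [n / (k * counts[c]) for c in classes]
    mean = sum(raw) / k
    return [w / mean for w in raw]
```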
Evaluation metrics
The pipeline computes:
- accuracy
- weighted precision/recall/F1
- macro F1 (important when class imbalance exists)
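In practice these typically come from sklearn's `precision_recall_fscore_support` (with `average="weighted"` and `average="macro"`), but the macro-F1 logic is simple enough to sketch in pure Python, which makes clear why it punishes a collapsed minority class:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: every class counts equally,
    so a model that ignores a rare class is penalized hard."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(classes)
```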
The model is saved to:
- ./models/enhanced_action_decision_classifier (Despite the folder name, the classifier is trained over the full multi-class set: abstract/actions/decisions/problems/progress.)
Model 2: Topic Shift Detection (Binary Classification)
Minutes are more readable when grouped by topic. So the second part builds a topic-shift dataset and trains a binary classifier:
- label = 1: a topic shift is detected
- label = 0: the same topic continues
Creating labels (heuristics-based)
Instead of relying on manual boundary annotations, the dataset is generated using practical heuristics:
- Lexical overlap between previous and current utterance
  - low overlap suggests a topic jump
- Length ratio changes
  - sudden changes can correlate with transitions (e.g., short pivot phrases)
- Shift indicator phrases
  - e.g., "moving on", "next topic", "different matter", "new agenda"
Context-aware input
Each training example includes context using a separator token:
prev_text + " [SEP] " + curr_text
This is important: topic shift is not purely about the current line--it's about how it relates to what came right before it.
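The input construction is trivial but worth pinning down (the explicit string form matches the line above; an equivalent alternative, noted in the comment, is to pass the two texts as a pair and let the tokenizer insert the model's own separator):

```python
def make_pair_input(prev_text: str, curr_text: str) -> str:
    """Concatenate context and current utterance around an explicit [SEP]."""
    return f"{prev_text} [SEP] {curr_text}"

# With a Hugging Face tokenizer, the pair-encoding alternative would be
#   tokenizer(prev_text, curr_text, truncation=True, max_length=256)
# which inserts the model's native separator token automatically.
```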
The model is saved to:
- ./models/enhanced_topic_shift_detector
What You Get: A Clean "Minutes Substrate"
Once you have:
- utterance-level labels (abstract/actions/decisions/problems/progress)
- topic boundaries (shift/no-shift)
you can generate minutes in a structured way:
1. Segment transcript by topic shifts
2. Within each segment, group utterances by label
3. Render minutes:
- Abstract (top)
- Decisions (bulleted)
- Actions (task list format)
- Problems/Risks
- Progress updates
This approach avoids the common summarization failure where a model blends everything into a vague paragraph.
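Once labels and boundaries exist, the rendering step is just grouping and templating. A hypothetical sketch (section order and titles follow the list above; the input shape is an assumption):

```python
SECTION_ORDER = ["abstract", "decisions", "actions", "problems", "progress"]
SECTION_TITLES = {
    "abstract": "Abstract",
    "decisions": "Decisions",
    "actions": "Action Items",
    "problems": "Problems / Risks",
    "progress": "Progress Updates",
}

def render_minutes(segments):
    """segments: list of topic segments, each a list of (label, text) tuples."""
    lines = []
    for i, seg in enumerate(segments, 1):
        lines.append(f"Topic {i}")
        for label in SECTION_ORDER:
            items = [text for lbl, text in seg if lbl == label]
            if items:
                lines.append(f"  {SECTION_TITLES[label]}:")
                lines.extend(f"    - {t}" for t in items)
    return "\n".join(lines)
```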
Practical Integration: Where This Fits in a Real Pipeline
A realistic end-to-end flow looks like:
1. Speech-to-text transcription (ASR)
2. Utterance splitting (timestamp-based or punctuation-based)
3. Topic shift detection to segment the meeting
4. Utterance classification inside segments
5. Summarize per segment + compile minutes template
Even if you swap out the summarizer later, the structure from steps (3-4) remains valuable and stable.
Limitations (Be Honest About What's Real)
- Topic-shift labels are heuristic-generated, not ground truth. They're useful, but you should expect noise.
- Using only local context (prev [SEP] curr) is a baseline; some transitions require a longer window.
- Meeting corpora vary in style; generalization to your organization's meeting style will usually require some fine-tuning.
Next Steps That Actually Improve This
If you want this to be production-grade (or at least "reliably good"):
- Add longer context windows (e.g., rolling 3-5 utterances)
- Use confidence thresholds + "needs review" bucket
- Train topic shifts with a small manually labeled set (even a few hundred boundaries helps a lot)
- Multi-task learning: classify label + detect shift jointly
- Add speaker/turn features (speaker changes often correlate with topic transitions)
Summary
This project builds a strong foundation for automated meeting minutes by focusing on structure first:
- A DeBERTa-based classifier turns raw utterances into meaningful categories.
- A second DeBERTa model detects topic boundaries to make minutes readable and organized.
- Class imbalance is addressed both by balancing and class-weighted loss.
- Models are saved as reusable artifacts, ready to plug into a larger meeting-to-minutes workflow.