ML/NLP / 2026-03
IAB 3.0 Classification and Language Detection Pipeline Proposal
This proposal lays out how to replace an external classification API with in-house language detection and IAB 3.0 classification pipelines while preserving the existing data lakehouse architecture. The goal is to scale classification coverage across a 1.75 billion item media corpus.
1.14B items without IAB 3.0 coverage
Close language-detection and IAB 3.0 classification coverage gaps across a large media corpus without depending on external API calls.
The proposal defines two staged pipelines: a language detection cascade and a multilingual IAB 3.0 classifier trained on historical labels.
The plan maps 1.6B additional language detections and 1.14B additional IAB classifications into the existing lakehouse pattern.
- AWS Glue
- Iceberg
- S3
- Celery
- GCLD3
- FastText
- XLM-RoBERTa
- IAB 3.0
Current architecture
The data lakehouse follows a staging, core, and curated pattern. Staging tables hold raw ingested data and may contain duplicates. Core tables are deduplicated through AWS Glue and contain one row per entity. Curated tables provide enriched and filtered downstream views.
- The core content table contains 1.75B rows and 57 columns.
- The core publisher table contains 13.7M publisher entities.
- The transcript table contains 83.7M transcript rows.
- The curated content view contains roughly 960M rows.
Coverage gap
| Field | Coverage | Source |
|---|---|---|
| Legacy IAB categories | 1.739B items / 99.6% | Existing NLP pipeline |
| IAB 3.0 categories | 600.951M items / 34.4% | External classification API |
| Primary IAB 3.0 category | 600.944M items / 34.4% | External classification API |
| NLP topics | 604.563M items / 34.6% | External classification API |
| Detected text language | 141M items / 8.0% | Current detection |
Proposed pipelines
- Language Detection Scheduler: a Celery-based cascade to detect text language for 1.6B additional content items.
- In-House IAB 3.0 Classification: train a multilingual model on 601M historical labels and serve local inference.
- Topic Extraction Model: separate topic prediction for the most frequent topics first.
- Publisher Rollup at Scale: run existing rollup logic across all 13.7M publisher entities.
Expected impact
- Language detection coverage moves from roughly 8% toward 95%.
- IAB 3.0 coverage moves from 34.4% toward full corpus coverage.
- About 1.14B additional content items gain IAB 3.0 classification.
- External classification API cost and dependency are removed for future classification.
Detailed methodology and results
Data Engineering Team
Data Lakehouse Architecture
Medallion Architecture: Staging → Core → Curated
- Staging: raw ingested data. May contain duplicates. One table per pipeline source.
- Core: deduplicated via AWS Glue. One row per entity. Merges fields from all staging sources.
- Curated: enriched, filtered views for downstream consumers. Subset of core with renamed fields.
1. Celery workers: fetch from APIs
2. S3 Parquet: .append() writes
3. Accumulator: append to Iceberg
4. Glue merge: dedupe into core
5. Curated views: filtered + renamed
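The staging-to-core step above can be sketched in plain Python. This is a minimal illustration, not the actual Glue job: the `entity_id` and `ingested_at` field names and the "latest non-null value wins" merge rule are assumptions made for the example.

```python
from typing import Any


def merge_staging_rows(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Collapse staging rows into one core row per entity.

    Rows are applied oldest first, so a later non-null value overwrites an
    earlier one, while a null never erases a value that is already present.
    """
    core: dict[str, dict[str, Any]] = {}
    for row in sorted(rows, key=lambda r: r["ingested_at"]):
        merged = core.setdefault(row["entity_id"], {})
        for field, value in row.items():
            if value is not None:
                merged[field] = value
    return core


# Hypothetical staging rows: duplicates for entity "a1", one row for "b2".
staging = [
    {"entity_id": "a1", "ingested_at": 1, "title": "Old title", "lang": None},
    {"entity_id": "a1", "ingested_at": 2, "title": "New title", "lang": None},
    {"entity_id": "a1", "ingested_at": 3, "title": None, "lang": "en"},
    {"entity_id": "b2", "ingested_at": 1, "title": "Other", "lang": "hi"},
]
core = merge_staging_rows(staging)
```

The point of the sketch is only that core ends up with one row per entity while fields from different staging rows are combined.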
Current Content Classification Coverage
Two generations of IAB categories exist in the content corpus
- 1.75B content items
- 13.7M publishers
- 83.7M transcripts
- 101 languages detected
Two IAB classification systems on the content corpus
| Field | Taxonomy | Content items covered | Coverage | Source |
|---|---|---|---|---|
| iab_categories | Legacy IAB | 1,738,628,314 | 99.6% | Existing NLP pipeline |
| iab_categories_3 | IAB 3.0 | 600,950,789 | 34.4% | external classification API |
| primary_category_3 | IAB 3.0 | 600,944,111 | 34.4% | external classification API |
| nlp_topics | Topics | 604,562,957 | 34.6% | external classification API |
The Two Gaps to Close
Gap 1: Language Detection
nlp_text_lang_code on content_items. Only 8% of 1.75B content items have a detected text language.
Gap 2: IAB 3.0 Classification
iab_categories_3 on content_items. The external classification API has classified 34.4%. The remaining 1.14B need coverage.
These gaps are linked.
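The size of each gap can be checked with simple arithmetic. The exact corpus row count is not stated, so this sketch estimates it from the legacy IAB row of the coverage table (1,738,628,314 items at 99.6%). That estimate is an assumption, and the results are approximate.

```python
# Counts taken from the coverage tables in this proposal.
LEGACY_IAB_COVERED = 1_738_628_314
LEGACY_IAB_COVERAGE = 0.996          # rounded percentage, so the total is an estimate
IAB3_COVERED = 600_950_789
LANG_COVERED = 141_000_000           # rounded count of items with nlp_text_lang_code

# Assumption: back out the total row count from the legacy coverage figure.
estimated_total = round(LEGACY_IAB_COVERED / LEGACY_IAB_COVERAGE)

iab3_gap = estimated_total - IAB3_COVERED
lang_gap = estimated_total - LANG_COVERED

print(f"Estimated corpus size: {estimated_total / 1e9:.2f}B")
print(f"IAB 3.0 coverage:      {100 * IAB3_COVERED / estimated_total:.1f}%")
print(f"IAB 3.0 gap:           {iab3_gap / 1e9:.2f}B items")
print(f"Language coverage:     {100 * LANG_COVERED / estimated_total:.1f}%")
print(f"Language gap:          {lang_gap / 1e9:.2f}B items")
```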
Why Does the Language Gap Exist?
Three compounding factors
1. Language Detection is Coupled to Content Suitability
2. No Fallback for Text Detection
3. Short Text Problem
Current flow: content item text (title + description) → GCLD3 at a 90% confidence threshold.
- Threshold met: nlp_text_lang_code is set.
- Threshold not met: the field stays NULL forever. There is no fallback (no FastText, no langid).
GCLD3 Failures: Fixable or Not?
Text length comparison for the external classification API subset (948K content items)
Comparing text length (title + description) where GCLD3 succeeded (detected) vs. failed (undetected):
- 61% of undetected content items have 90+ chars.
- 39% have fewer than 90 chars.
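A split like the one above could be reproduced with a simple length bucket over undetected items. This is a sketch under assumptions: the record layout (`title`, `description`, `nlp_text_lang_code` keys) is hypothetical, and length is taken as the plain sum of the two character counts.

```python
from typing import Optional


def text_length(title: Optional[str], description: Optional[str]) -> int:
    """Combined character count of title and description (None counts as empty)."""
    return len(title or "") + len(description or "")


def share_long_undetected(items: list[dict], min_chars: int = 90) -> float:
    """Fraction of undetected items whose text has at least min_chars characters."""
    undetected = [i for i in items if i["nlp_text_lang_code"] is None]
    if not undetected:
        return 0.0
    long_enough = [
        i for i in undetected
        if text_length(i["title"], i["description"]) >= min_chars
    ]
    return len(long_enough) / len(undetected)


# Made-up sample records, for illustration only.
sample = [
    {"title": "a" * 40, "description": "b" * 60, "nlp_text_lang_code": None},
    {"title": "a" * 95, "description": None, "nlp_text_lang_code": None},
    {"title": "short", "description": "", "nlp_text_lang_code": None},
    {"title": "a" * 200, "description": "", "nlp_text_lang_code": "en"},
]
share = share_long_undetected(sample)
```

Items in the long bucket are the natural candidates for a fallback detector, since they carry enough text for a second attempt.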
The Language Mismatch Problem
Creator-declared language often differs from actual text language
Two language fields exist on content_items:
| Field | Source | Coverage |
|---|---|---|
| primary_lang_code | content item platform API / GCLD3 / FastText | 94.8% (1.65B) |
| nlp_text_lang_code | GCLD3 only (no fallback) | 8.1% (141M) |
primary_lang_code answers: "What language is this content item meant to be in?" nlp_text_lang_code answers: "What language is the actual text written in?"
For IAB classification, the system needs to know what language the text is actually written in.
When they disagree (labeled subset: 119K content items)
| Actual Text | Declared | Count |
|---|---|---|
| English | Hindi | 36,709 |
| English | Telugu | 9,731 |
| English | Urdu | 6,282 |
| English | Tamil | 6,043 |
| English | Bengali | 5,155 |
| Spanish | English | 3,183 |
| Arabic | English | 2,793 |
Pattern: Indian creators write titles/descriptions in English but set their publisher language to Hindi, Telugu, etc.
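A disagreement table like the one above can be built by counting (actual, declared) pairs. This is a minimal sketch: the two field names come from this proposal, but the record layout and the use of short language codes as values are assumptions.

```python
from collections import Counter


def mismatch_counts(items: list[dict]) -> Counter:
    """Count (actual, declared) pairs where both languages are known and differ."""
    pairs: Counter = Counter()
    for item in items:
        actual = item.get("nlp_text_lang_code")
        declared = item.get("primary_lang_code")
        if actual and declared and actual != declared:
            pairs[(actual, declared)] += 1
    return pairs


# Made-up sample records, for illustration only.
sample = [
    {"nlp_text_lang_code": "en", "primary_lang_code": "hi"},
    {"nlp_text_lang_code": "en", "primary_lang_code": "hi"},
    {"nlp_text_lang_code": "es", "primary_lang_code": "en"},
    {"nlp_text_lang_code": "en", "primary_lang_code": "en"},  # agreement, skipped
    {"nlp_text_lang_code": None, "primary_lang_code": "ta"},  # no detected language, skipped
]
counts = mismatch_counts(sample)
for (actual, declared), n in counts.most_common():
    print(f"actual={actual} declared={declared} count={n}")
```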
How the External Classification API Works Today
The pipeline that produced 601M IAB 3.0 classifications
Content item (title + description + tags) → external classification API (external NLP service) → IAB 3.0 categories + topics + confidence scores
What external classification API produces per content item
- Primary IAB 3.0 category: e.g. "Food & Drink > Cooking"
- All IAB 3.0 categories: multi-label, hierarchical
- Topics: Wikipedia-linked concepts with scores
- Confidence scores: per category and topic
DLH pipeline flow
1. content_classification_scheduler: find unprocessed content items
2. content_classification_worker: call the external classification API
3. Accumulator: merge into content_items
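The scheduler, worker, and accumulator roles above can be sketched as plain functions. In production these would presumably be Celery tasks writing to staging tables. Here the function names, the `id` and `text` keys, and the injected `classify` callable (standing in for the external API call) are all hypothetical.

```python
from typing import Callable, Iterable, Iterator


def schedule(items: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Scheduler: yield batches of items that have no IAB 3.0 labels yet."""
    batch: list[dict] = []
    for item in items:
        if item.get("iab_categories_3") is None:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def work(batch: list[dict], classify: Callable[[str], list[str]]) -> list[dict]:
    """Worker: classify each item and emit one staging row per item."""
    return [
        {"id": item["id"], "iab_categories_3": classify(item["text"])}
        for item in batch
    ]


def accumulate(content_items: dict[str, dict], staging_rows: list[dict]) -> None:
    """Accumulator: merge staging rows back into content_items by id."""
    for row in staging_rows:
        content_items[row["id"]]["iab_categories_3"] = row["iab_categories_3"]


# Made-up data and a fake classifier, for illustration only.
content_items = {
    "v1": {"id": "v1", "text": "pasta recipe", "iab_categories_3": None},
    "v2": {"id": "v2", "text": "match highlights", "iab_categories_3": ["Sports"]},
    "v3": {"id": "v3", "text": "car review", "iab_categories_3": None},
}
calls: list[str] = []


def fake_classify(text: str) -> list[str]:
    calls.append(text)
    return ["Automotive"] if "car" in text else ["Food & Drink"]


# Batches are materialized first so scheduling does not interleave with merging.
for batch in list(schedule(content_items.values(), batch_size=2)):
    accumulate(content_items, work(batch, fake_classify))
```

Because the classifier is injected, swapping the external call for local inference leaves the scheduler and accumulator untouched in this sketch.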
Current staging tables
| Table | Rows |
|---|---|
| staging_classifier_labels | 947,774 |
| staging_classifier_labels_confidence | 947,774 |
| staging_publisher_classifier | 1 |
948K in staging
Training Data for an In-House Model
External classification has produced a large labeled dataset
Training data available: 601M labeled content items, 948K rows in staging, 101 languages.
Top IAB 3.0 categories (by content item count)
- Entertainment Music
- Sports
- Religion & Spirituality
- Video Gaming
- Technology & Computing
- Food & Drink
- Automotive
Language Distribution (content corpus)
101 languages detected across 141M content items with nlp_text_lang_code
Languages shown, in chart order: English, Hindi, Spanish, Arabic, Portuguese, Bengali, Chinese, Russian, Japanese, Vietnamese, Korean, Thai, French, Turkish, Telugu, Tamil, German, Marathi, Italian, and 81 other languages.
Only 35% is English
The Solution: Two New Pipelines
Built on the same DLH patterns already in production
Pipeline 1
Language Detection Scheduler
Detect text language for 1.6B content items with NULL nlp_text_lang_code. Multi-tier cascade: GCLD3 → FastText → langid.
Pipeline 2
In-House IAB 3.0 Classification Model
Train a multilingual transformer on 601M content items labeled by the external classification API. Replace the external API with local inference.
Reuse
Publisher Rollup (Existing Logic)
Weighted aggregation of content item-level scores to publisher level. Already built. Just swap the input source.
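Dependency: Pipeline 2 depends on Pipeline 1, because classification needs the language the text is actually written in.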
Pipeline 1: Language Detection Scheduler
Same Celery patterns, new multi-tier detection cascade
3-Tier Detection Cascade
- Tier 1, GCLD3: same detector, lowered threshold (50%)
- Tier 2, FastText: lid.176.bin, works on 10+ chars
- Tier 3, langid: lightweight final fallback
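The cascade logic can be sketched independently of the detector libraries. In this illustration each tier is a callable returning a language code and a confidence in [0, 1]. The stub detectors stand in for thin wrappers around GCLD3, FastText, and langid, and those wrappers are assumed, not shown. The 50% Tier 1 threshold and the 10-character Tier 2 minimum come from the tiers above. The Tier 2 and Tier 3 confidence thresholds are placeholder assumptions.

```python
from typing import Callable, NamedTuple, Optional


class Detection(NamedTuple):
    lang_code: str
    confidence: float
    method: str


# A detector takes text and returns (lang_code, confidence in [0, 1]).
Detector = Callable[[str], tuple[str, float]]


class Tier(NamedTuple):
    method: str
    detect: Detector
    min_confidence: float
    min_chars: int = 0


def detect_language(text: str, tiers: list[Tier]) -> Optional[Detection]:
    """Return the first tier result that clears its own threshold, else None."""
    text = text.strip()
    for tier in tiers:
        if len(text) < tier.min_chars:
            continue  # text too short for this tier, try the next one
        lang_code, confidence = tier.detect(text)
        if confidence >= tier.min_confidence:
            return Detection(lang_code, confidence, tier.method)
    return None


# Stub detectors with fixed outputs, for illustration only.
def stub_gcld3(text: str) -> tuple[str, float]:
    return ("en", 0.42)  # below the 50% Tier 1 threshold


def stub_fasttext(text: str) -> tuple[str, float]:
    return ("hi", 0.91)


def stub_langid(text: str) -> tuple[str, float]:
    return ("es", 0.60)


TIERS = [
    Tier("gcld3", stub_gcld3, min_confidence=0.50),
    Tier("fasttext", stub_fasttext, min_confidence=0.50, min_chars=10),  # threshold assumed
    Tier("langid", stub_langid, min_confidence=0.50),                    # threshold assumed
]

long_result = detect_language("namaste duniya, kaise ho", TIERS)
short_result = detect_language("hola", TIERS)
```

With these stubs, the longer text falls through Tier 1 and is resolved by Tier 2, while the four-character text skips Tier 2 and is resolved by Tier 3. Returning the method with each result is what lets every detection be tagged downstream.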
New staging table schema
Each detection is tagged with its method and confidence
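One possible row shape for the staging table is sketched below. Only the idea of tagging each detection with its method and confidence comes from this proposal. The class name, column names, and validation rules are hypothetical.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

ALLOWED_METHODS = {"gcld3", "fasttext", "langid"}


@dataclass(frozen=True)
class LanguageDetectionRow:
    """One staging row per detection attempt that produced a language."""

    content_item_id: str
    nlp_text_lang_code: str
    detection_method: str   # which cascade tier produced the result
    confidence: float       # detector confidence, expected in [0, 1]
    detected_at: datetime

    def __post_init__(self) -> None:
        if self.detection_method not in ALLOWED_METHODS:
            raise ValueError(f"unknown detection method: {self.detection_method}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


row = LanguageDetectionRow(
    content_item_id="v1",
    nlp_text_lang_code="hi",
    detection_method="fasttext",
    confidence=0.91,
    detected_at=datetime.now(timezone.utc),
)
record = asdict(row)  # plain dict, ready to be written as a Parquet record
```

Keeping the method on every row makes it possible to audit or re-run a single tier later without touching detections from the others.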
Pipeline 2: In-House IAB 3.0 Classification
Train on external classification API labels, serve via local inference
Model approach
- Multilingual transformer (e.g. XLM-RoBERTa) as base model
- Multi-label classification for IAB 3.0 categories
- 601M labeled samples from external classification API for training
- 101 languages in training data
- Confidence scores per label (same as external classification API output)
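Multi-label output with per-label confidence is commonly produced by applying an independent sigmoid to each label's logit and keeping labels above a threshold. The sketch below shows only that decoding step in plain Python. The 0.5 threshold, the example logits, and taking the highest-scoring label as the primary category are assumptions, not confirmed behavior of the external API or of the proposed model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_labels(logits: dict[str, float], threshold: float = 0.5) -> dict[str, float]:
    """Keep every label whose independent probability clears the threshold."""
    scores = {label: sigmoid(z) for label, z in logits.items()}
    kept = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {label: p for label, p in kept if p >= threshold}


def primary_category(logits: dict[str, float]) -> str:
    """Highest-scoring label (assumed rule for the primary category)."""
    return max(logits, key=lambda label: logits[label])


# Made-up logits for one content item.
logits = {"Food & Drink": 2.2, "Cooking": 1.1, "Sports": -3.0}
labels = decode_labels(logits)
primary = primary_category(logits)
```

Unlike a softmax, the per-label sigmoid lets a parent and child category both be predicted for the same item, which suits a hierarchical multi-label taxonomy.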
Pipeline integration (same DLH pattern)
What changes vs. current external classification API pipeline
| | External classification API (current) | In-House (new) |
|---|---|---|
| Inference | External API call | Local model |
| Cost | Per-call API fee | $0 per call |
| Latency | Network round-trip | Local GPU |
| Languages | 19 supported | 101 detected |
| Control | External dependency | Full ownership |
Key point:
End-to-End: Current vs. New Pipeline
Current flow (external classification API)
content_items (title + desc + tags) → external classification API (external paid service) → staging_classifier_labels → Glue merge
New flow (in-house, two pipelines)
- content_items (1.6B with NULL lang) → Pipeline 1 (GCLD3 → FastText → langid) → staging_language_detection → Glue merge (nlp_text_lang_code)
- content_items (1.14B with no IAB 3.0) → Pipeline 2 (local model inference) → staging_iab_labels → Glue merge (iab_categories_3)
- content_items (content item IAB 3.0 data) → Publisher Rollup (existing weighted logic) → staging_publisher_iab → Glue merge (publishers)
Every step uses the same DLH patterns
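The publisher rollup is described only as a weighted aggregation of content item-level scores. One plausible reading is a weighted mean per publisher and category, sketched below. The existing rollup code is not shown in this proposal, so the `weight` field, the record layout, and the choice to treat a missing category as a score of 0 are all assumptions.

```python
from collections import defaultdict


def rollup(items: list[dict]) -> dict[str, dict[str, float]]:
    """Weighted mean score per publisher and category.

    The denominator is the publisher's total item weight, so an item that
    lacks a category contributes a score of 0 for that category.
    """
    weighted: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    total_weight: dict[str, float] = defaultdict(float)
    for item in items:
        publisher = item["publisher_id"]
        total_weight[publisher] += item["weight"]
        for category, score in item["scores"].items():
            weighted[publisher][category] += item["weight"] * score
    return {
        publisher: {cat: s / total_weight[publisher] for cat, s in cats.items()}
        for publisher, cats in weighted.items()
        if total_weight[publisher] > 0
    }


# Made-up items for one publisher.
items = [
    {"publisher_id": "p1", "weight": 3.0, "scores": {"Sports": 0.9}},
    {"publisher_id": "p1", "weight": 1.0, "scores": {"Sports": 0.5, "Automotive": 0.8}},
]
publisher_scores = rollup(items)
```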
Before vs. After: current content corpus compared with the corpus after both pipelines.
Implementation Roadmap
| Phase | Deliverable | Impact | DLH Components |
|---|---|---|---|
| Phase 1 | Language Detection Scheduler | Detect text language for 1.6B content items. 3-tier cascade. Unblocks classification. | New IcebergTable, scheduler task, worker task, accumulator, Glue job |
| Phase 2 | IAB 3.0 Classification Model | Train multilingual transformer on 601M external classification API labels. Multi-label IAB 3.0. | Model training (offline), new auditor class, replaces classifier worker task |
| Phase 3 | Topic Extraction Model | Separate model for topic prediction. Scope to top-N frequent topics initially. | New model, extends Phase 2 worker or separate worker task |
| Phase 4 | Publisher Rollup at Scale | Run existing rollup across all 13.7M publishers (currently only 1 done). | Existing publisher classifier task + auditor. No code change needed. |
Each phase is independently deployable
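Scoping Phase 3 to the most frequent topics could start with a frequency count over the existing topic labels. This is a sketch under assumptions: topics are represented as lists of strings per content item, and each topic is counted once per item.

```python
from collections import Counter


def top_n_topics(item_topics: list[list[str]], n: int) -> list[str]:
    """Return the n topics that appear on the most content items."""
    counts = Counter(topic for topics in item_topics for topic in set(topics))
    return [topic for topic, _ in counts.most_common(n)]


def restrict_to_vocabulary(topics: list[str], vocabulary: set[str]) -> list[str]:
    """Drop topics outside the chosen top-N vocabulary, keeping order."""
    return [t for t in topics if t in vocabulary]


# Made-up topic labels, for illustration only.
labeled = [
    ["cricket", "music"],
    ["cricket", "recipes"],
    ["cricket", "music"],
    ["gaming"],
]
vocabulary = top_n_topics(labeled, n=2)
training_targets = [restrict_to_vocabulary(t, set(vocabulary)) for t in labeled]
```

Items whose topics all fall outside the vocabulary end up with an empty target list, which a first model version could treat as "no frequent topic".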
Expected Gains
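- 1.6B additional language detections
- 1.14B additional IAB 3.0 classifications
- 13.7M publishers covered by rollup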
What to build
- Language detection scheduler (Celery + 3-tier cascade)
- In-house IAB 3.0 model (multilingual transformer)
- Topic extraction model (Phase 3)
- New staging tables + Glue merge jobs
What to retain
- Entire DLH Medallion Architecture (Staging → Core → Curated)
- Celery task patterns (scheduler → worker → accumulator)
- S3 + Iceberg + Glue pipeline
- Publisher rollup logic (existing code, no changes)
- Redis locking, queue management, all infrastructure
Bottom line: