ML/NLP / 2026-03

IAB 3.0 Classification and Language Detection Pipeline Proposal

This proposal lays out how to replace an external classification API with in-house language detection and IAB 3.0 classification pipelines while preserving the existing data lakehouse architecture. The goal is to scale classification coverage across a 1.75 billion item media corpus.


Core items 1.75B

Publishers 13.7M

Transcripts 83.7M

IAB 3.0 gap 65.6% (items without IAB 3.0 coverage)

Problem

Close language-detection and IAB 3.0 classification coverage gaps across a large media corpus without depending on external API calls.

Approach

Proposed two staged pipelines: a language detection cascade and a multilingual IAB 3.0 classifier trained on historical labels.

Result

The plan maps 1.6B additional language detections and 1.14B additional IAB classifications into the existing lakehouse pattern.

Technologies

AWS Glue
Iceberg
S3
Celery
GCLD3
FastText
XLM-RoBERTa
IAB 3.0

Current architecture

The data lakehouse follows a staging, core, and curated pattern. Staging tables hold raw ingested data and may contain duplicates. Core tables are deduplicated through AWS Glue and contain one row per entity. Curated tables provide enriched and filtered downstream views.

The core content table contains 1.75B rows and 57 columns.
The core publisher table contains 13.7M publisher entities.
The transcript table contains 83.7M transcript rows.
The curated content view contains roughly 960M rows.

Coverage gap

Field	Coverage	Source
Legacy IAB categories	1.739B items / 99.6%	Existing NLP pipeline
IAB 3.0 categories	600.951M items / 34.4%	External classification API
Primary IAB 3.0 category	600.944M items / 34.4%	External classification API
NLP topics	604.563M items / 34.6%	External classification API
Detected text language	141M items / 8.0%	Current detection

Proposed pipelines

Language Detection Scheduler: a Celery-based cascade to detect text language for 1.6B additional content items.
In-House IAB 3.0 Classification: train a multilingual model on 601M historical labels and serve local inference.
Topic Extraction Model: separate topic prediction for the most frequent topics first.
Publisher Rollup at Scale: run existing rollup logic across all 13.7M publisher entities.

Expected impact

Language detection coverage moves from roughly 8% toward 95%.
IAB 3.0 coverage moves from 34.4% toward full corpus coverage.
About 1.14B additional content items gain IAB 3.0 classification.
External classification API cost and dependency are removed for future classification.

Detailed methodology and results


Data Engineering Team

Data Lakehouse Architecture

Medallion Architecture: Staging → Core → Curated

Staging: Raw ingested data. May contain duplicates. One table per pipeline source.

Core: Deduplicated via AWS Glue. One row per entity. Merges fields from all staging sources.

Curated: Enriched, filtered views for downstream consumers. Subset of core with renamed fields.

Celery Workers (fetch from APIs) → S3 Parquet (.append() writes) → Accumulator (append to Iceberg) → Glue Merge (dedupe into core) → Curated Views (filtered + renamed)
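The staging-to-core step can be pictured as a keyed merge. The sketch below is a minimal in-memory illustration, not the Glue job itself: the `entity_id` and `ingested_at` field names and the "newest non-null value wins" rule are assumptions made for this example.

```python
from typing import Any, Iterable


def merge_to_core(staging_rows: Iterable[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Collapse staging rows into one row per entity.

    Assumed rule (hypothetical): rows are applied oldest to newest, and a
    non-null field from a newer row overwrites the older value.
    """
    core: dict[str, dict[str, Any]] = {}
    for row in sorted(staging_rows, key=lambda r: r["ingested_at"]):
        merged = core.setdefault(row["entity_id"], {})
        for field, value in row.items():
            if value is not None:
                merged[field] = value
    return core


# Two staging rows for entity "a1" carry different fields; "b2" has one row.
staging = [
    {"entity_id": "a1", "ingested_at": 1, "title": "Pasta basics", "iab_categories_3": None},
    {"entity_id": "a1", "ingested_at": 2, "title": None, "iab_categories_3": ["Food & Drink"]},
    {"entity_id": "b2", "ingested_at": 1, "title": "Match recap", "iab_categories_3": None},
]
core = merge_to_core(staging)
```

Here `core` holds two entities, and `a1` keeps its title from the first row while picking up its categories from the second. A production merge would likely need an explicit conflict policy per column rather than one global rule.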

Current Content Classification Coverage

Two generations of IAB categories exist in the content corpus

Core items 1.75B

Publishers 13.7M

Transcripts 83.7M

Languages detected 101

Two IAB classification systems on the content corpus

Field	Taxonomy	Content items covered	Coverage	Source
iab_categories	Legacy IAB	1,738,628,314	99.6%	Existing NLP pipeline
iab_categories_3	IAB 3.0	600,950,789	34.4%	External classification API
primary_category_3	IAB 3.0	600,944,111	34.4%	External classification API
nlp_topics	Topics	604,562,957	34.6%	External classification API
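As a cross-check on the headline figures, the gap can be recomputed from the exact counts in the table. The table does not state the exact corpus size, so the sketch below back-computes it from the rounded 99.6% legacy coverage; that estimate is an assumption, not a reported number.

```python
# Exact counts from the coverage table.
LEGACY_IAB_ITEMS = 1_738_628_314   # reported as 99.6% coverage
IAB3_ITEMS = 600_950_789           # reported as 34.4% coverage

# Assumption: corpus size is estimated from the rounded 99.6% figure,
# because the exact total is not given in the table.
total_items_est = round(LEGACY_IAB_ITEMS / 0.996)

iab3_coverage_pct = 100 * IAB3_ITEMS / total_items_est
iab3_gap_pct = 100 - iab3_coverage_pct
iab3_missing_items = total_items_est - IAB3_ITEMS

print(f"estimated corpus size: {total_items_est / 1e9:.2f}B")
print(f"IAB 3.0 coverage: {iab3_coverage_pct:.1f}%")
print(f"IAB 3.0 gap: {iab3_gap_pct:.1f}% ({iab3_missing_items / 1e9:.2f}B items)")
```

Under this estimate the results line up with the reported 34.4% coverage, 65.6% gap, and roughly 1.14B uncovered items.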

Figure: a Legacy IAB example next to an IAB 3.0 example (external classification API).

The Two Gaps to Close

Gap 1: Language Detection

nlp_text_lang_code on content_items. Only 8% of 1.75B content items have a detected text language.

Gap 2: IAB 3.0 Classification

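iab_categories_3 on content_items. The external classification API has classified 34.4%. The remaining 1.14B need coverage.

These gaps are linked: IAB classification needs to know the language the text is actually written in, so closing the language gap unblocks classification.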

Why Does the Language Gap Exist?

Three compounding factors

1. Language Detection is Coupled to Content Suitability

2. No Fallback for Text Detection

3. Short Text Problem

Current detection flow: content item text (title + description) → GCLD3 → confidence of at least 90%?

Yes: nlp_text_lang_code is set.

No: the field stays NULL forever. There is no fallback: no FastText, no langid.

GCLD3 Failures: Fixable or Not?

Text length comparison for the external classification API subset (948K content items)

Comparing text length (title + description) where GCLD3 succeeded (detected) vs. failed (undetected):

61% of undetected content items have 90+ chars

39% have fewer than 90 chars

The Language Mismatch Problem

Creator-declared language often differs from actual text language

Two language fields exist on content_items:

Field	Source	Coverage
primary_lang_code	content item platform API / GCLD3 / FastText	94.8% (1.65B)
nlp_text_lang_code	GCLD3 only (no fallback)	8.1% (141M)

primary_lang_code answers: "What language is this content item meant to be in?" nlp_text_lang_code answers: "What language is the actual text written in?"

For IAB classification, the system needs to know what language the text actually is, which is the question nlp_text_lang_code answers.

When they disagree (labeled subset: 119K content items)

Actual Text	Declared	Count
English	Hindi	36,709
English	Telugu	9,731
English	Urdu	6,282
English	Tamil	6,043
English	Bengali	5,155
Spanish	English	3,183
Arabic	English	2,793

Pattern: Indian creators write titles/descriptions in English but set their publisher language to Hindi, Telugu, etc.
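A disagreement table like the one above can be produced by counting (actual, declared) pairs wherever both fields are populated and differ. The rows below are made-up sample data, and treating the two fields as plain language codes is an assumption for this sketch.

```python
from collections import Counter

# Hypothetical sample rows: (primary_lang_code, nlp_text_lang_code).
rows = [
    ("hi", "en"),
    ("hi", "en"),
    ("te", "en"),
    ("en", "en"),
    ("en", "es"),
    ("hi", None),   # text language not detected yet, so it cannot be compared
]


def count_mismatches(pairs):
    """Count (actual_text, declared) pairs where both codes exist and differ."""
    counts = Counter()
    for declared, actual in pairs:
        if declared and actual and declared != actual:
            counts[(actual, declared)] += 1
    return counts


mismatches = count_mismatches(rows)
for (actual, declared), n in mismatches.most_common():
    print(f"actual={actual} declared={declared} count={n}")
```

Rows with a missing nlp_text_lang_code are skipped rather than counted as mismatches, which is why such a comparison is only possible on the subset where text language has already been detected.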

How the External Classification API Works Today

The pipeline that produced 601M IAB 3.0 classifications

Content item (title + description + tags) → external classification API (external NLP service) → IAB 3.0 categories + topics + confidence scores

What the external classification API produces per content item

Primary IAB 3.0 category: e.g. "Food & Drink > Cooking"
All IAB 3.0 categories: multi-label, hierarchical
Topics: Wikipedia-linked concepts with scores
Confidence scores: per category and topic

DLH pipeline flow

content_classification_scheduler (find unprocessed content items) → content_classification_worker (call external classification API) → Accumulator (merge into content_items)
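The scheduler half of this flow amounts to selecting items that still lack a classification and handing them out in batches. The sketch below uses plain Python in place of Celery; the `id` field, the batch size, and both helper names are hypothetical.

```python
from typing import Iterable, Iterator


def find_unprocessed(items: Iterable[dict]) -> Iterator[str]:
    """Yield ids of content items that have no IAB 3.0 categories yet."""
    for item in items:
        if not item.get("iab_categories_3"):
            yield item["id"]


def batched(ids: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Group ids into lists of at most batch_size, one list per worker task."""
    batch: list[str] = []
    for item_id in ids:
        batch.append(item_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


content_items = [
    {"id": "c1", "iab_categories_3": ["Sports"]},
    {"id": "c2", "iab_categories_3": None},
    {"id": "c3"},
    {"id": "c4", "iab_categories_3": []},
]
batches = list(batched(find_unprocessed(content_items), batch_size=2))
```

In a Celery deployment each batch would typically be enqueued as one worker task, which keeps the scheduler cheap and lets worker throughput be tuned through the batch size.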

Current staging tables

Table	Rows
staging_classifier_labels	947,774
staging_classifier_labels_confidence	947,774
staging_publisher_classifier	1

948K in staging

Training Data for an In-House Model

External classification has produced a large labeled dataset

Labeled content items 601M

Labeled rows in staging 948K

Languages 101

Training data available

Top IAB 3.0 categories (by content item count)

Entertainment Music
Sports
Religion & Spirituality
Video Gaming
Technology & Computing
Food & Drink
Automotive

Language Distribution (content corpus)

101 languages detected across 141M content items with nlp_text_lang_code

Languages shown: English, Hindi, Spanish, Arabic, Portuguese, Bengali, Chinese, Russian, Japanese, Vietnamese, Korean, Thai, French, Turkish, Telugu, Tamil, German, Marathi, Italian, and 81 other languages.

Only 35% is English

The Solution: Two New Pipelines

Built on the same DLH patterns already in production

Pipeline 1

Language Detection Scheduler

Detect text language for 1.6B content items with NULL nlp_text_lang_code. Multi-tier cascade: GCLD3 → FastText → langid.

Pipeline 2

In-House IAB 3.0 Classification Model

Train a multilingual transformer on 601M content items labeled by the external classification API. Replace the external API with local inference.

Reuse

Publisher Rollup (Existing Logic)

Weighted aggregation of content item-level scores to publisher level. Already built. Just swap the input source.
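The rollup can be read as a weighted mean of item-level category scores. The source does not say what the weights are, so the sketch below treats `weight` as an opaque per-item number; the function name, the record layout, and the rule that a missing category counts as 0 are all assumptions.

```python
from collections import defaultdict


def rollup_publisher(items: list[dict]) -> dict[str, float]:
    """Weighted mean of per-category confidence across a publisher's items.

    Each item is {"weight": float, "scores": {category: confidence}}.
    Assumption: an item that lacks a category contributes 0 for it.
    """
    total_weight = sum(item["weight"] for item in items)
    if total_weight <= 0:
        return {}
    weighted = defaultdict(float)
    for item in items:
        for category, score in item["scores"].items():
            weighted[category] += item["weight"] * score
    return {category: value / total_weight for category, value in weighted.items()}


publisher_items = [
    {"weight": 3.0, "scores": {"Sports": 0.9}},
    {"weight": 1.0, "scores": {"Sports": 0.5, "Automotive": 0.8}},
]
publisher_scores = rollup_publisher(publisher_items)
```

Because the function only reads item-level scores, swapping the input from API labels to in-house model output would not change the aggregation itself.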

Dependency: Pipeline 2 builds on Pipeline 1, since IAB classification needs the actual text language.

Pipeline 1: Language Detection Scheduler

Same Celery patterns, new multi-tier detection cascade

3-Tier Detection Cascade

TIER 1: GCLD3. Same detector, lowered threshold (50%).
TIER 2: FastText. lid.176.bin works on 10+ chars.
TIER 3: langid. Lightweight final fallback.
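The cascade logic can be sketched independently of the detectors themselves. In the example below the three detectors are stubs standing in for GCLD3, FastText, and langid, so no detector library is called. The 50% tier 1 threshold and the 10-character minimum for tier 2 come from the text above; the remaining thresholds and all class and function names are assumptions.

```python
from typing import Callable, NamedTuple, Optional

# A detector takes text and returns (language_code, confidence in [0, 1]).
Detector = Callable[[str], tuple[str, float]]


class Tier(NamedTuple):
    method: str
    detect: Detector
    min_confidence: float
    min_chars: int = 0


class Detection(NamedTuple):
    lang_code: str
    method: str
    confidence: float


def detect_language(text: str, tiers: list[Tier]) -> Optional[Detection]:
    """Try each tier in order; return the first result that clears its threshold."""
    text = text.strip()
    if not text:
        return None
    for tier in tiers:
        if len(text) < tier.min_chars:
            continue  # text too short for this tier, fall through
        lang_code, confidence = tier.detect(text)
        if confidence >= tier.min_confidence:
            return Detection(lang_code, tier.method, confidence)
    return None  # every tier declined, so the field stays NULL


# Stub detectors with fixed outputs (hypothetical values).
def stub_gcld3(text: str) -> tuple[str, float]:
    return ("en", 0.40)  # below the 50% threshold, so tier 1 declines


def stub_fasttext(text: str) -> tuple[str, float]:
    return ("en", 0.85)


def stub_langid(text: str) -> tuple[str, float]:
    return ("en", 0.60)


tiers = [
    Tier("gcld3", stub_gcld3, min_confidence=0.50),
    Tier("fasttext", stub_fasttext, min_confidence=0.50, min_chars=10),  # threshold assumed
    Tier("langid", stub_langid, min_confidence=0.0),                     # threshold assumed
]

result = detect_language("How to cook pasta at home", tiers)
short_result = detect_language("Pasta", tiers)
```

With these stubs the longer title resolves at the FastText tier, while the five-character title skips FastText and falls through to langid. Returning the method alongside the code is what lets downstream consumers filter by detector.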

New staging table schema

Each detection is tagged with its method and confidence
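One possible shape for a staged detection row is sketched below as a dataclass. Only nlp_text_lang_code is a field named in this proposal; every other field name, and the validation rules, are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

ALLOWED_METHODS = {"gcld3", "fasttext", "langid"}


@dataclass(frozen=True)
class LanguageDetectionRow:
    """One staged detection. Field names other than nlp_text_lang_code are hypothetical."""
    content_item_id: str
    nlp_text_lang_code: str   # e.g. "en"
    detection_method: str     # which tier produced the result
    confidence: float         # detector score in [0, 1]
    detected_at: datetime

    def __post_init__(self) -> None:
        if self.detection_method not in ALLOWED_METHODS:
            raise ValueError(f"unknown detection method: {self.detection_method}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


row = LanguageDetectionRow(
    content_item_id="c42",
    nlp_text_lang_code="en",
    detection_method="fasttext",
    confidence=0.85,
    detected_at=datetime(2026, 3, 1, tzinfo=timezone.utc),
)
record = asdict(row)  # plain dict, one entry per column
```

Keeping method and confidence on every row makes it possible to re-run only low-confidence detections later without touching the rest.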

Pipeline 2: In-House IAB 3.0 Classification

Train on external classification API labels, serve via local inference

Model approach

Multilingual transformer (e.g. XLM-RoBERTa) as base model
Multi-label classification for IAB 3.0 categories
601M labeled samples from external classification API for training
101 languages in training data
Confidence scores per label (same as external classification API output)
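Multi-label classification differs from single-label in both the training target and the decoding step. The sketch below shows both with hand-written logits in place of a transformer head; the three-label set and the 0.5 threshold are assumptions for illustration.

```python
import math

# Hypothetical label set; the real IAB 3.0 taxonomy is much larger.
LABELS = ["Food & Drink", "Sports", "Automotive"]


def to_multi_hot(item_labels: set[str]) -> list[float]:
    """Training target: 1.0 for each label the item carries, else 0.0."""
    return [1.0 if label in item_labels else 0.0 for label in LABELS]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits: list[float], threshold: float = 0.5) -> dict[str, float]:
    """Apply an independent sigmoid per label and keep labels at or above the threshold."""
    scores = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    return {label: score for label, score in scores.items() if score >= threshold}


target = to_multi_hot({"Food & Drink", "Automotive"})
predicted = decode([2.0, -3.0, 0.0])
```

Each label gets its own sigmoid rather than sharing a softmax, so an item can carry several categories at once and every kept label comes with its own confidence score.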

Pipeline integration (same DLH pattern)

What changes vs. current external classification API pipeline

	external classification API (current)	In-House (new)
Inference	External API call	Local model
Cost	Per-call API fee	$0 per call
Latency	Network round-trip	Local GPU
Languages	19 supported	101 detected
Control	External dependency	Full ownership

Key point:

End-to-End: Current vs. New Pipeline

Current flow (external classification API)

content_items (title + desc + tags) → external classification API (external paid service) → staging_classifier_labels → Glue merge

New flow (in-house, two pipelines)

content_items (1.6B with NULL lang) → Pipeline 1 (GCLD3 → FastText → langid) → staging_language_detection → Glue merge (nlp_text_lang_code)

content_items (1.14B with no IAB 3.0) → Pipeline 2 (local model inference) → staging_iab_labels → Glue merge (iab_categories_3)

content_items (item-level IAB 3.0 data) → Publisher Rollup (existing weighted logic) → staging_publisher_iab → Glue merge (publishers)

Every step uses the same DLH patterns

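Before vs. After

Current content corpus: text language detected for about 8% of items; IAB 3.0 coverage at 34.4%.

After both pipelines: language detection coverage toward 95%; IAB 3.0 coverage toward full corpus coverage.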

Implementation Roadmap

Phase	Deliverable	Impact	DLH Components
Phase 1	Language Detection Scheduler	Detect text language for 1.6B content items. 3-tier cascade. Unblocks classification.	New IcebergTable, scheduler task, worker task, accumulator, Glue job
Phase 2	IAB 3.0 Classification Model	Train multilingual transformer on 601M external classification API labels. Multi-label IAB 3.0.	Model training (offline), new auditor class, replaces classifier worker task
Phase 3	Topic Extraction Model	Separate model for topic prediction. Scope to top-N frequent topics initially.	New model, extends Phase 2 worker or separate worker task
Phase 4	Publisher Rollup at Scale	Run existing rollup across all 13.7M publishers (currently only 1 done).	Existing publisher classifier task + auditor. No code change needed.

Each phase is independently deployable

Expected Gains

1.6B additional language detections

1.14B additional IAB 3.0 classifications

13.7M publishers covered by the rollup

What to build

Language detection scheduler (Celery + 3-tier cascade)
In-house IAB 3.0 model (multilingual transformer)
Topic extraction model (Phase 3)
New staging tables + Glue merge jobs

What to retain

Entire DLH Medallion Architecture (Staging → Core → Curated)
Celery task patterns (scheduler → worker → accumulator)
S3 + Iceberg + Glue pipeline
Publisher rollup logic (existing code, no changes)
Redis locking, queue management, all infrastructure

Bottom line: