CLDR ↔ Fedora Language Alignment¶
Purpose¶
This notebook prepares CLDR-based language alignment for Fedora localization analysis, including speaker-based and region-based analysis with visualizations.
Key Concepts¶
CLDR as Reference Universe: The Unicode Common Locale Data Repository (CLDR) provides authoritative language and territory data, including speaker populations. We use CLDR as the canonical "language universe" against which Fedora's localization coverage is measured.
Fedora Coverage Projection: Fedora's translated languages (detected from stats/fXX/*.csv files) are projected onto the CLDR universe. This allows us to:
- Identify which CLDR languages Fedora covers
- Find gaps (CLDR languages without Fedora translations)
- Discover Fedora-specific locales not in CLDR
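The projection is plain set algebra over language codes. A minimal sketch with illustrative codes (not taken from the actual data):

```python
# Hypothetical language sets; the real sets come from CLDR JSON and stats CSVs.
cldr_langs = {"en", "fr", "de", "yo", "qu"}
fedora_langs = {"en", "fr", "de", "pt_BR"}

covered = cldr_langs & fedora_langs      # CLDR languages Fedora covers
gaps = cldr_langs - fedora_langs         # CLDR languages without Fedora translations
fedora_only = fedora_langs - cldr_langs  # Fedora-specific locales not in CLDR

print(sorted(covered))      # ['de', 'en', 'fr']
print(sorted(gaps))         # ['qu', 'yo']
print(sorted(fedora_only))  # ['pt_BR']
```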
Intentional Missing Languages: Languages missing from either dataset are preserved intentionally—they represent opportunities for expansion or edge cases worth investigating.
Abstract Voronoi Diagram: The Voronoi visualization in this notebook is conceptual, NOT geographic. It uses abstract 2D coordinates derived from speaker counts and regions to show relative language importance and coverage patterns. This helps visualize language diversity beyond English dominance.
Analysis Components¶
- Speaker-based analysis: Languages bucketed by speaker count (<1M, 1–10M, 10–100M, >100M)
- Region-based analysis: Languages grouped by macro-region (Africa, Americas, Asia, Europe, Oceania)
- Visualizations: Bar charts and scatter plots (grayscale + hatching) plus an abstract Voronoi diagram (green/gray cell fills)
Setup and Imports¶
import json
import hashlib
from pathlib import Path
from typing import Optional, Dict, List, Set, Tuple
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.collections import PolyCollection
from scipy.spatial import Voronoi
# Configure display options
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', None)
# Matplotlib defaults for accessibility (grayscale-friendly)
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=['#333333', '#666666', '#999999', '#CCCCCC'])
Configuration¶
Define paths to CLDR data and Fedora statistics. All paths are resolved relative to the notebook's working directory.
# Base paths
NOTEBOOK_DIR = Path(".").resolve()
PROJECT_ROOT = NOTEBOOK_DIR.parent
# CLDR data paths (committed files in data/CLDR-raw)
CLDR_DIR = PROJECT_ROOT / "data" / "CLDR-raw"
TERRITORY_INFO_PATH = CLDR_DIR / "territoryInfo.json"
LANGUAGE_DATA_PATH = CLDR_DIR / "languageData.json"
TERRITORIES_PATH = CLDR_DIR / "territories.json"
# Fedora stats path (may or may not exist)
FEDORA_STATS_DIR = NOTEBOOK_DIR / "stats"
DEFAULT_FEDORA_RELEASE = "f43" # Most recent release to scan
# Region definitions based on CLDR territory containment (UN M49 codes)
# These are hardcoded to avoid dependency on territoryContainment.json
REGION_MAPPING = {
# Africa (002)
'Africa': {'DZ', 'EG', 'LY', 'MA', 'SD', 'TN', 'EH', 'BJ', 'BF', 'CV', 'CI', 'GM', 'GH', 'GN', 'GW', 'LR', 'ML', 'MR', 'NE', 'NG', 'SN', 'SL', 'TG', 'AO', 'CM', 'CF', 'TD', 'CG', 'CD', 'GQ', 'GA', 'ST', 'BI', 'KM', 'DJ', 'ER', 'ET', 'KE', 'MG', 'MW', 'MU', 'YT', 'MZ', 'RE', 'RW', 'SC', 'SO', 'SS', 'TZ', 'UG', 'ZM', 'ZW', 'BW', 'SZ', 'LS', 'NA', 'ZA'},
# Americas (019)
'Americas': {'AI', 'AG', 'AW', 'BS', 'BB', 'BQ', 'VG', 'KY', 'CU', 'CW', 'DM', 'DO', 'GD', 'GP', 'HT', 'JM', 'MQ', 'MS', 'PR', 'BL', 'KN', 'LC', 'MF', 'PM', 'VC', 'SX', 'TT', 'TC', 'VI', 'BZ', 'CR', 'SV', 'GT', 'HN', 'MX', 'NI', 'PA', 'AR', 'BO', 'BV', 'BR', 'CL', 'CO', 'EC', 'FK', 'GF', 'GY', 'PY', 'PE', 'GS', 'SR', 'UY', 'VE', 'BM', 'CA', 'GL', 'US', 'UM'},
# Asia (142)
'Asia': {'KZ', 'KG', 'TJ', 'TM', 'UZ', 'CN', 'HK', 'MO', 'KP', 'JP', 'MN', 'KR', 'TW', 'AF', 'BD', 'BT', 'IN', 'IR', 'MV', 'NP', 'PK', 'LK', 'BN', 'KH', 'ID', 'LA', 'MY', 'MM', 'PH', 'SG', 'TH', 'TL', 'VN', 'AM', 'AZ', 'BH', 'CY', 'GE', 'IQ', 'IL', 'JO', 'KW', 'LB', 'OM', 'PS', 'QA', 'SA', 'SY', 'TR', 'AE', 'YE'},
# Europe (150)
'Europe': {'BY', 'BG', 'CZ', 'HU', 'MD', 'PL', 'RO', 'RU', 'SK', 'UA', 'AX', 'DK', 'EE', 'FO', 'FI', 'GG', 'IS', 'IE', 'IM', 'JE', 'LV', 'LT', 'NO', 'SJ', 'SE', 'GB', 'AL', 'AD', 'BA', 'HR', 'GI', 'GR', 'VA', 'IT', 'MT', 'ME', 'MK', 'PT', 'SM', 'RS', 'SI', 'ES', 'AT', 'BE', 'FR', 'DE', 'LI', 'LU', 'MC', 'NL', 'CH'},
# Oceania (009)
'Oceania': {'AU', 'CX', 'CC', 'HM', 'NZ', 'NF', 'FJ', 'NC', 'PG', 'SB', 'VU', 'GU', 'KI', 'MH', 'FM', 'NR', 'MP', 'PW', 'AS', 'CK', 'PF', 'NU', 'PN', 'WS', 'TK', 'TO', 'TV', 'WF'}
}
print(f"Project root: {PROJECT_ROOT}")
print(f"CLDR directory: {CLDR_DIR}")
print(f"CLDR directory exists: {CLDR_DIR.exists()}")
def load_cldr_data(cldr_dir: Path) -> Tuple[Dict, Dict]:
"""
Load CLDR JSON files from the specified directory.
Args:
cldr_dir: Path to the CLDR-raw directory
Returns:
Tuple of (territory_info_data, language_data)
Raises:
FileNotFoundError: If required CLDR files don't exist
"""
territory_info_path = cldr_dir / "territoryInfo.json"
language_data_path = cldr_dir / "languageData.json"
if not territory_info_path.exists():
raise FileNotFoundError(f"CLDR file not found: {territory_info_path}")
if not language_data_path.exists():
raise FileNotFoundError(f"CLDR file not found: {language_data_path}")
with open(territory_info_path, 'r', encoding='utf-8') as f:
territory_info_data = json.load(f)
with open(language_data_path, 'r', encoding='utf-8') as f:
language_data = json.load(f)
return territory_info_data, language_data
def extract_cldr_languages(territory_info_data: Dict, language_data: Dict) -> Set[str]:
"""
Extract the complete set of CLDR language codes from both data sources.
Combines languages from:
- languageData.json (script/writing system info)
- territoryInfo.json (population data)
Filters out alternate forms (e.g., 'aa-alt-secondary').
Args:
territory_info_data: Parsed territoryInfo.json
language_data: Parsed languageData.json
Returns:
Set of primary language codes
"""
# From languageData.json
lang_data_section = language_data.get('supplemental', {}).get('languageData', {})
languages_from_lang_data = {
lang for lang in lang_data_section.keys()
if '-alt-' not in lang
}
# From territoryInfo.json
territory_info = territory_info_data.get('supplemental', {}).get('territoryInfo', {})
languages_from_territory = set()
for territory_data in territory_info.values():
lang_pop = territory_data.get('languagePopulation', {})
languages_from_territory.update(lang_pop.keys())
return languages_from_lang_data | languages_from_territory
def compute_speaker_estimates(territory_info_data: Dict) -> Tuple[Dict[str, int], Dict[str, List[str]]]:
"""
Calculate estimated speaker counts and territory mappings per language.
Formula: speakers(lang, territory) = population × (language_percent / 100)
Args:
territory_info_data: Parsed territoryInfo.json
Returns:
Tuple of (speaker_counts dict, language_territories dict)
"""
territory_info = territory_info_data.get('supplemental', {}).get('territoryInfo', {})
speaker_counts = {}
lang_territories = {}
for territory_code, territory_data in territory_info.items():
# Get territory population
try:
territory_population = int(territory_data.get('_population', '0'))
except (ValueError, TypeError):
continue
# Process each language in this territory
lang_pop = territory_data.get('languagePopulation', {})
for lang_code, lang_info in lang_pop.items():
# Get population percentage
try:
pop_percent = float(lang_info.get('_populationPercent', '0'))
except (ValueError, TypeError):
continue
# Calculate speakers
speakers = int(territory_population * pop_percent / 100)
speaker_counts[lang_code] = speaker_counts.get(lang_code, 0) + speakers
# Track territories
if lang_code not in lang_territories:
lang_territories[lang_code] = []
lang_territories[lang_code].append(territory_code)
# Sort territory lists
for lang_code in lang_territories:
lang_territories[lang_code].sort()
return speaker_counts, lang_territories
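The per-territory formula above can be checked by hand. A worked example with made-up numbers (a territory of 10 million people where 8.5% report the language):

```python
# Illustrative values only; real figures come from territoryInfo.json.
territory_population = 10_000_000
population_percent = 8.5

# speakers(lang, territory) = population × (language_percent / 100)
speakers = int(territory_population * population_percent / 100)
print(speakers)  # 850000
```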
def detect_fedora_languages(stats_dir: Path, release: str = "f43") -> Tuple[Set[str], bool]:
"""
Detect Fedora languages by scanning CSV files in the stats directory.
Args:
stats_dir: Base stats directory
release: Fedora release to scan (default: 'f43')
Returns:
Tuple of (set of language codes, bool indicating if folder exists)
"""
release_dir = stats_dir / release
if not release_dir.exists():
return set(), False
# Exclude special files
special_files = {'distribution', 'error'}
languages = set()
for csv_file in release_dir.glob('*.csv'):
lang_code = csv_file.stem
if lang_code not in special_files:
languages.add(lang_code)
return languages, True
def get_primary_region(territories: List[str], region_mapping: Dict[str, Set[str]]) -> str:
"""
Determine the primary region for a language based on its territories.
Returns the region with the most territories for this language.
Args:
territories: List of territory codes where the language is spoken
region_mapping: Dict mapping region names to sets of territory codes
Returns:
Region name or 'Unknown' if no match
"""
if not territories:
return 'Unknown'
region_counts = {region: 0 for region in region_mapping}
for territory in territories:
for region, region_territories in region_mapping.items():
if territory in region_territories:
region_counts[region] += 1
break
# Return region with highest count, or 'Unknown' if all zeros
max_region = max(region_counts, key=region_counts.get)
return max_region if region_counts[max_region] > 0 else 'Unknown'
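The same majority-vote logic can be written more compactly with collections.Counter; this is an alternative sketch, and the territory-to-region table here is a tiny illustrative subset, not the full REGION_MAPPING:

```python
from collections import Counter

# Hypothetical inverse mapping: territory code -> region name.
region_of = {"FR": "Europe", "DE": "Europe", "CA": "Americas"}

territories = ["FR", "DE", "CA"]
counts = Counter(region_of.get(t, "Unknown") for t in territories)
primary, _ = counts.most_common(1)[0]
print(primary)  # Europe
```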
def get_speaker_bucket(speakers: Optional[int]) -> str:
"""
Categorize speaker count into buckets.
Args:
speakers: Estimated speaker count (may be None)
Returns:
Bucket label: '<1M', '1–10M', '10–100M', '>100M', or 'Unknown'
"""
if speakers is None or pd.isna(speakers):
return 'Unknown'
elif speakers < 1_000_000:
return '<1M'
elif speakers < 10_000_000:
return '1–10M'
elif speakers < 100_000_000:
return '10–100M'
else:
return '>100M'
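For whole columns, the same bucketing can be done vectorized with pd.cut; a sketch whose bin edges mirror the thresholds in get_speaker_bucket (right-open intervals to match the `<` comparisons):

```python
import pandas as pd

speakers = pd.Series([500_000, 5_000_000, 50_000_000, 500_000_000])
buckets = pd.cut(
    speakers,
    bins=[0, 1_000_000, 10_000_000, 100_000_000, float("inf")],
    labels=["<1M", "1–10M", "10–100M", ">100M"],
    right=False,  # intervals [0, 1M), [1M, 10M), ...
)
print(list(buckets))  # ['<1M', '1–10M', '10–100M', '>100M']
```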
def build_alignment_dataframe(
cldr_langs: Set[str],
fedora_langs: Set[str],
speaker_estimates: Dict[str, int],
territory_mapping: Dict[str, List[str]],
region_mapping: Dict[str, Set[str]]
) -> pd.DataFrame:
"""
Build the CLDR ↔ Fedora language alignment DataFrame with enriched columns.
Columns:
- language_code: Language identifier
- in_cldr: Boolean - in CLDR?
- in_fedora: Boolean - in Fedora?
- estimated_speakers: Total estimated speakers
- territories: Comma-separated territory list
- log_speakers: log10(speakers), NaN-safe
- speaker_bucket: <1M, 1–10M, 10–100M, >100M, Unknown
- region: Primary region (Africa, Americas, Asia, Europe, Oceania, Unknown)
Args:
cldr_langs: Set of CLDR language codes
fedora_langs: Set of Fedora language codes
speaker_estimates: Dict of language -> speaker count
territory_mapping: Dict of language -> territory list
region_mapping: Dict of region -> territory set
Returns:
Enriched alignment DataFrame
"""
all_languages = cldr_langs | fedora_langs
records = []
for lang_code in sorted(all_languages):
territories = territory_mapping.get(lang_code, [])
speakers = speaker_estimates.get(lang_code)
# Calculate log_speakers (NaN-safe)
if speakers is not None and speakers > 0:
log_speakers = np.log10(speakers)
else:
log_speakers = np.nan
records.append({
'language_code': lang_code,
'in_cldr': lang_code in cldr_langs,
'in_fedora': lang_code in fedora_langs,
'estimated_speakers': speakers,
'territories': ', '.join(territories) if territories else None,
'log_speakers': log_speakers,
'speaker_bucket': get_speaker_bucket(speakers),
'region': get_primary_region(territories, region_mapping)
})
df = pd.DataFrame(records)
# Set appropriate dtypes
df['in_cldr'] = df['in_cldr'].astype(bool)
df['in_fedora'] = df['in_fedora'].astype(bool)
df['estimated_speakers'] = pd.to_numeric(df['estimated_speakers'], errors='coerce').astype('Int64')
# Make speaker_bucket categorical with proper order
bucket_order = ['<1M', '1–10M', '10–100M', '>100M', 'Unknown']
df['speaker_bucket'] = pd.Categorical(df['speaker_bucket'], categories=bucket_order, ordered=True)
# Make region categorical
region_order = ['Africa', 'Americas', 'Asia', 'Europe', 'Oceania', 'Unknown']
df['region'] = pd.Categorical(df['region'], categories=region_order, ordered=True)
return df
# Load CLDR data
territory_info_data, language_data = load_cldr_data(CLDR_DIR)
print(f"Loaded CLDR version: {territory_info_data['supplemental']['version']['_cldrVersion']}")
# Extract CLDR languages
cldr_languages = extract_cldr_languages(territory_info_data, language_data)
print(f"CLDR languages: {len(cldr_languages)}")
# Compute speaker estimates and territory mapping
speaker_estimates, language_territories = compute_speaker_estimates(territory_info_data)
print(f"Languages with speaker data: {len(speaker_estimates)}")
# Detect Fedora languages
fedora_languages, stats_found = detect_fedora_languages(FEDORA_STATS_DIR, DEFAULT_FEDORA_RELEASE)
if stats_found:
print(f"Fedora languages (from {DEFAULT_FEDORA_RELEASE}): {len(fedora_languages)}")
else:
print("ℹ️ Stats folder not found. Running in DEMO MODE with empty Fedora set.")
# Build enriched alignment DataFrame
alignment_df = build_alignment_dataframe(
cldr_languages,
fedora_languages,
speaker_estimates,
language_territories,
REGION_MAPPING
)
print(f"\nAlignment DataFrame: {len(alignment_df)} languages, {len(alignment_df.columns)} columns")
print(f"Columns: {list(alignment_df.columns)}")
# Preview the enriched DataFrame
print("Sample data (first 15 rows):")
alignment_df.head(15)
Coverage Summary¶
def summarize_coverage(df: pd.DataFrame, stats_found: bool) -> None:
"""
Print a comprehensive coverage summary.
"""
total = len(df)
cldr_total = len(df[df['in_cldr']])
fedora_total = len(df[df['in_fedora']])
in_both = len(df[df['in_cldr'] & df['in_fedora']])
cldr_only = len(df[df['in_cldr'] & ~df['in_fedora']])
fedora_only = len(df[~df['in_cldr'] & df['in_fedora']])
print("=" * 55)
print("CLDR ↔ Fedora Language Coverage Summary")
print("=" * 55)
print(f"Total unique languages: {total:>6}")
print(f"Languages in CLDR: {cldr_total:>6}")
print(f"Languages in Fedora: {fedora_total:>6}")
print(f" → In both (overlap): {in_both:>6}")
print(f" → CLDR only (Fedora gaps): {cldr_only:>6}")
print(f" → Fedora only (not in CLDR): {fedora_only:>6}")
print("=" * 55)
if cldr_total > 0 and stats_found:
coverage_pct = (in_both / cldr_total) * 100
print(f"\nFedora covers {coverage_pct:.1f}% of CLDR languages")
# Speaker-weighted coverage
total_speakers = df[df['in_cldr']]['estimated_speakers'].sum()
covered_speakers = df[df['in_cldr'] & df['in_fedora']]['estimated_speakers'].sum()
if total_speakers and total_speakers > 0:
speaker_coverage = (covered_speakers / total_speakers) * 100
print(f"Fedora covers {speaker_coverage:.1f}% of estimated speakers")
summarize_coverage(alignment_df, stats_found)
def analyze_by_speaker_bucket(df: pd.DataFrame) -> pd.DataFrame:
"""
Create analysis table by speaker bucket.
"""
# Filter to CLDR languages only for this analysis
cldr_df = df[df['in_cldr']].copy()
# Group by speaker bucket
bucket_stats = cldr_df.groupby('speaker_bucket', observed=True).agg(
cldr_count=('language_code', 'count'),
fedora_count=('in_fedora', 'sum'),
total_speakers=('estimated_speakers', 'sum')
).reset_index()
# Calculate coverage ratio
bucket_stats['coverage_ratio'] = (
bucket_stats['fedora_count'] / bucket_stats['cldr_count'] * 100
).round(1)
bucket_stats['fedora_count'] = bucket_stats['fedora_count'].astype(int)
return bucket_stats
speaker_bucket_analysis = analyze_by_speaker_bucket(alignment_df)
print("CLDR Language Count and Fedora Coverage by Speaker Bucket:")
print("(Coverage ratio = Fedora languages / CLDR languages × 100)")
print()
speaker_bucket_analysis
Region Analysis¶
How are languages distributed across geographic regions, and what's Fedora's coverage?
def analyze_by_region(df: pd.DataFrame) -> pd.DataFrame:
"""
Create analysis table by region.
"""
# Filter to CLDR languages only
cldr_df = df[df['in_cldr']].copy()
# Group by region
region_stats = cldr_df.groupby('region', observed=True).agg(
cldr_count=('language_code', 'count'),
fedora_count=('in_fedora', 'sum'),
total_speakers=('estimated_speakers', 'sum')
).reset_index()
# Calculate coverage ratio
region_stats['coverage_ratio'] = (
region_stats['fedora_count'] / region_stats['cldr_count'] * 100
).round(1)
region_stats['fedora_count'] = region_stats['fedora_count'].astype(int)
return region_stats
region_analysis = analyze_by_region(alignment_df)
print("CLDR Language Count and Fedora Coverage by Region:")
print("(Coverage ratio = Fedora languages / CLDR languages × 100)")
print()
region_analysis
def plot_speaker_bucket_comparison(bucket_df: pd.DataFrame) -> None:
"""
Create a grouped bar chart comparing CLDR and Fedora language counts by speaker bucket.
Uses grayscale colors and hatching for accessibility.
"""
fig, ax = plt.subplots(figsize=(10, 6))
# Filter out Unknown bucket for cleaner visualization
plot_df = bucket_df[bucket_df['speaker_bucket'] != 'Unknown'].copy()
x = np.arange(len(plot_df))
width = 0.35
# CLDR bars (solid gray)
bars1 = ax.bar(x - width/2, plot_df['cldr_count'], width,
label='CLDR', color='#888888', edgecolor='black')
# Fedora bars (hatched)
bars2 = ax.bar(x + width/2, plot_df['fedora_count'], width,
label='Fedora', color='#CCCCCC', edgecolor='black', hatch='//')
# Add coverage ratio annotations
for i, (_, row) in enumerate(plot_df.iterrows()):
ax.annotate(f"{row['coverage_ratio']:.0f}%",
xy=(x[i] + width/2, row['fedora_count']),
ha='center', va='bottom', fontsize=9)
ax.set_xlabel('Speaker Bucket', fontsize=12)
ax.set_ylabel('Number of Languages', fontsize=12)
ax.set_title('CLDR vs Fedora Languages by Estimated Speaker Count', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(plot_df['speaker_bucket'].astype(str))
ax.legend()
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
plot_speaker_bucket_comparison(speaker_bucket_analysis)
Bar Chart: CLDR vs Fedora by Region¶
def plot_region_comparison(region_df: pd.DataFrame) -> None:
"""
Create a grouped bar chart comparing CLDR and Fedora language counts by region.
Uses grayscale colors and hatching for accessibility.
"""
fig, ax = plt.subplots(figsize=(12, 6))
# Filter out Unknown region
plot_df = region_df[region_df['region'] != 'Unknown'].copy()
x = np.arange(len(plot_df))
width = 0.35
# CLDR bars (solid gray)
bars1 = ax.bar(x - width/2, plot_df['cldr_count'], width,
label='CLDR', color='#888888', edgecolor='black')
# Fedora bars (hatched)
bars2 = ax.bar(x + width/2, plot_df['fedora_count'], width,
label='Fedora', color='#CCCCCC', edgecolor='black', hatch='//')
# Add coverage ratio annotations
for i, (_, row) in enumerate(plot_df.iterrows()):
ax.annotate(f"{row['coverage_ratio']:.0f}%",
xy=(x[i] + width/2, row['fedora_count']),
ha='center', va='bottom', fontsize=9)
ax.set_xlabel('Region', fontsize=12)
ax.set_ylabel('Number of Languages', fontsize=12)
ax.set_title('CLDR vs Fedora Languages by Geographic Region', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(plot_df['region'].astype(str))
ax.legend()
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
plot_region_comparison(region_analysis)
Scatter Plot: Speakers vs Fedora Coverage¶
def plot_speakers_vs_coverage(df: pd.DataFrame) -> None:
"""
Scatter plot showing estimated speakers vs Fedora coverage status.
Languages with Fedora translations are marked differently.
"""
fig, ax = plt.subplots(figsize=(12, 8))
# Filter to CLDR languages with speaker data
plot_df = df[df['in_cldr'] & df['log_speakers'].notna()].copy()
# Separate covered and not covered
covered = plot_df[plot_df['in_fedora']].copy()
not_covered = plot_df[~plot_df['in_fedora']].copy()
# Create jittered y-axis for visibility (based on region)
region_y = {'Africa': 1, 'Americas': 2, 'Asia': 3, 'Europe': 4, 'Oceania': 5, 'Unknown': 0}
# Add jitter - convert categorical to string first for mapping
np.random.seed(42) # Reproducibility
covered_y = covered['region'].astype(str).map(region_y).values + np.random.uniform(-0.3, 0.3, len(covered))
not_covered_y = not_covered['region'].astype(str).map(region_y).values + np.random.uniform(-0.3, 0.3, len(not_covered))
# Plot not covered (empty circles)
ax.scatter(not_covered['log_speakers'], not_covered_y,
s=50, c='white', edgecolors='#666666', linewidth=1.5,
label='Not in Fedora', alpha=0.7)
# Plot covered (filled circles with hatch-like pattern)
ax.scatter(covered['log_speakers'], covered_y,
s=50, c='#333333', edgecolors='black', linewidth=1,
label='In Fedora', alpha=0.8)
# Customize axes
ax.set_xlabel('Log₁₀(Estimated Speakers)', fontsize=12)
ax.set_ylabel('Region (jittered)', fontsize=12)
ax.set_title('Language Distribution: Speakers vs Region\n(Fedora coverage indicated by fill)', fontsize=14)
ax.set_yticks(list(region_y.values()))
ax.set_yticklabels(list(region_y.keys()))
ax.set_ylim(-0.5, 5.5)
ax.legend(loc='upper left')
ax.grid(alpha=0.3)
# Add reference lines for speaker thresholds
for threshold, label in [(6, '1M'), (7, '10M'), (8, '100M')]:
ax.axvline(x=threshold, color='#AAAAAA', linestyle='--', linewidth=1, alpha=0.5)
ax.text(threshold, 5.3, label, ha='center', fontsize=9, color='#666666')
plt.tight_layout()
plt.show()
plot_speakers_vs_coverage(alignment_df)
Abstract Voronoi Diagram¶
Important Note on Interpretation¶
This is NOT a geographic map. The Voronoi diagram below is a conceptual visualization that:
Uses abstract coordinates: X-axis represents log₁₀(speakers), Y-axis represents region (as a numeric index with deterministic jitter based on language code hash)
Shows relative importance: Languages with more speakers appear further right; regions are separated vertically
Highlights coverage patterns: Green cells = language has Fedora translations; gray cells = no Fedora translations
Purpose: Visualize language diversity and Fedora's coverage beyond the dominance of English, showing that many languages exist with significant speaker populations that may or may not have Fedora support
def generate_voronoi_coordinates(df: pd.DataFrame) -> pd.DataFrame:
"""
Generate deterministic 2D coordinates for Voronoi diagram.
X-axis: log_speakers (with imputation for missing values)
Y-axis: region index + deterministic jitter based on language code hash
Args:
df: Alignment DataFrame with log_speakers and region columns
Returns:
DataFrame with added 'voronoi_x' and 'voronoi_y' columns
"""
result = df.copy()
# Region to Y mapping
region_y = {'Africa': 1, 'Americas': 2, 'Asia': 3, 'Europe': 4, 'Oceania': 5, 'Unknown': 0}
# X-coordinate: log_speakers with imputation
# For missing speakers, use minimum log_speakers minus 1
min_log = result['log_speakers'].min()
if pd.isna(min_log):
min_log = 3 # Default fallback (~1000 speakers)
result['voronoi_x'] = result['log_speakers'].fillna(min_log - 1)
# Y-coordinate: region + deterministic jitter
def get_jitter(lang_code: str) -> float:
"""Generate deterministic jitter from language code hash."""
hash_val = int(hashlib.md5(lang_code.encode()).hexdigest()[:8], 16)
return (hash_val % 1000) / 1000 * 0.8 - 0.4 # Range: [-0.4, 0.4]
result['voronoi_y'] = result.apply(
lambda row: region_y.get(str(row['region']), 0) + get_jitter(row['language_code']),
axis=1
)
return result
# Generate coordinates
voronoi_df = generate_voronoi_coordinates(alignment_df[alignment_df['in_cldr']].copy())
print(f"Prepared {len(voronoi_df)} CLDR languages for Voronoi diagram")
print(f"X range: {voronoi_df['voronoi_x'].min():.2f} to {voronoi_df['voronoi_x'].max():.2f}")
print(f"Y range: {voronoi_df['voronoi_y'].min():.2f} to {voronoi_df['voronoi_y'].max():.2f}")
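The hash-based jitter is what makes the layout reproducible: the same language code always maps to the same vertical offset, unlike np.random jitter. A standalone demo of the same formula:

```python
import hashlib

def jitter(lang_code: str) -> float:
    # Same construction as get_jitter above: first 8 hex digits of the
    # MD5 digest, reduced to an offset in [-0.4, 0.4).
    hash_val = int(hashlib.md5(lang_code.encode()).hexdigest()[:8], 16)
    return (hash_val % 1000) / 1000 * 0.8 - 0.4

print(jitter("fr") == jitter("fr"))  # True: deterministic across runs
print(-0.4 <= jitter("fr") < 0.4)    # True: offset stays within the band
```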
def plot_abstract_voronoi(df: pd.DataFrame) -> None:
"""
Create an improved abstract Voronoi diagram showing language coverage.
Features:
- Each cell represents a CLDR language
- X-axis: log10(speakers) - larger = more speakers
- Y-axis: region bands (Africa, Americas, Asia, Europe, Oceania)
- Green cells: language has Fedora translations
- Light gray cells: language missing from Fedora
- Labels for top languages by speaker count
"""
fig, ax = plt.subplots(figsize=(16, 12))
# Extract points
points = df[['voronoi_x', 'voronoi_y']].values
# Need at least 4 points for Voronoi
if len(points) < 4:
ax.text(0.5, 0.5, 'Insufficient data for Voronoi diagram',
ha='center', va='center', transform=ax.transAxes, fontsize=14)
plt.show()
return
# Define cleaner bounds with padding
x_min, x_max = points[:, 0].min() - 0.5, points[:, 0].max() + 0.5
y_min, y_max = -0.6, 5.6 # Fixed bounds for region strips
# Add far-away boundary points for bounded Voronoi cells
boundary_points = np.array([
[x_min - 20, y_min - 20],
[x_min - 20, y_max + 20],
[x_max + 20, y_min - 20],
[x_max + 20, y_max + 20],
[(x_min + x_max)/2, y_min - 20],
[(x_min + x_max)/2, y_max + 20],
[x_min - 20, (y_min + y_max)/2],
[x_max + 20, (y_min + y_max)/2]
])
all_points = np.vstack([points, boundary_points])
# Compute Voronoi tessellation
vor = Voronoi(all_points)
# Get data for coloring and labeling
in_fedora = df['in_fedora'].values
lang_codes = df['language_code'].values
speakers = df['estimated_speakers'].fillna(0).values
# Clamp polygon vertices to bounds
def clip_polygon(polygon, x_min, x_max, y_min, y_max):
"""Clamp polygon vertices to rectangular bounds (not true polygon clipping)."""
clipped = []
for x, y in polygon:
cx = max(x_min, min(x_max, x))
cy = max(y_min, min(y_max, y))
clipped.append([cx, cy])
return np.array(clipped)
# Draw region background strips for visual separation
region_colors = ['#F8F8F8', '#FFFFFF']
for i in range(6):
rect = plt.Rectangle((x_min, i - 0.5), x_max - x_min, 1,
facecolor=region_colors[i % 2], edgecolor='none', alpha=0.5, zorder=0)
ax.add_patch(rect)
# Draw Voronoi regions
for idx in range(len(points)):
region_idx = vor.point_region[idx]
if region_idx == -1:
continue
region = vor.regions[region_idx]
if not region or -1 in region:
continue
# Get polygon vertices
polygon = np.array([vor.vertices[i] for i in region])
# Clip polygon to visible bounds
polygon = clip_polygon(polygon, x_min, x_max, y_min, y_max)
# Skip if polygon is degenerate
if len(polygon) < 3:
continue
# Choose style based on Fedora coverage - use distinct colors
if in_fedora[idx]:
# Green tint for Fedora languages
poly = plt.Polygon(polygon, facecolor='#90EE90', edgecolor='#2E8B57',
linewidth=0.8, alpha=0.7, zorder=1)
else:
# Light gray for missing languages
poly = plt.Polygon(polygon, facecolor='#E8E8E8', edgecolor='#AAAAAA',
linewidth=0.5, alpha=0.6, zorder=1)
ax.add_patch(poly)
# Plot points with better visibility
fedora_mask = df['in_fedora'].values
# Non-Fedora points (hollow circles)
ax.scatter(points[~fedora_mask, 0], points[~fedora_mask, 1],
s=30, c='white', edgecolors='#888888', linewidth=1,
zorder=4, label='Not in Fedora', marker='o')
# Fedora points (filled circles)
ax.scatter(points[fedora_mask, 0], points[fedora_mask, 1],
s=40, c='#228B22', edgecolors='#145214', linewidth=1,
zorder=5, label='In Fedora', marker='o')
# Add labels for top languages (by speaker count) to help interpretation
# Get top 15 languages overall
top_indices = np.argsort(speakers)[-15:]
for idx in top_indices:
if speakers[idx] > 0:
x, y = points[idx]
label = lang_codes[idx]
# Color based on Fedora coverage
color = '#145214' if in_fedora[idx] else '#666666'
fontweight = 'bold' if in_fedora[idx] else 'normal'
# Add label with offset
ax.annotate(label, (x, y),
xytext=(5, 5), textcoords='offset points',
fontsize=8, color=color, fontweight=fontweight,
zorder=6, alpha=0.9)
# Set axis limits
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)
# Draw region separator lines
for y_val in [0.5, 1.5, 2.5, 3.5, 4.5]:
ax.axhline(y=y_val, color='#CCCCCC', linestyle='-', linewidth=0.5, zorder=2)
# Labels and title
ax.set_xlabel('Estimated Speakers (log₁₀ scale) →\nSmaller ← → Larger', fontsize=12, labelpad=10)
ax.set_ylabel('Geographic Region', fontsize=12)
title_text = ('Abstract Voronoi Diagram: Language Coverage Landscape\n'
'Each cell = one language | Green = Fedora translated | Gray = Missing')
ax.set_title(title_text, fontsize=14, fontweight='bold', pad=15)
# Add region labels on y-axis
region_labels = ['Unknown', 'Africa', 'Americas', 'Asia', 'Europe', 'Oceania']
ax.set_yticks([0, 1, 2, 3, 4, 5])
ax.set_yticklabels(region_labels, fontsize=11)
# Add speaker threshold reference lines with better visibility
speaker_thresholds = [
(4, '10K', '#DDDDDD'),
(5, '100K', '#CCCCCC'),
(6, '1M', '#AAAAAA'),
(7, '10M', '#888888'),
(8, '100M', '#666666'),
(9, '1B', '#444444')
]
for threshold, label, color in speaker_thresholds:
if x_min < threshold < x_max:
ax.axvline(x=threshold, color=color, linestyle='--', linewidth=1.5, zorder=3, alpha=0.7)
ax.text(threshold, y_max + 0.15, label, ha='center', fontsize=10,
color=color, fontweight='bold')
# Create custom legend
legend_elements = [
mpatches.Patch(facecolor='#90EE90', edgecolor='#2E8B57', linewidth=1.5,
label='✓ In Fedora (translated)'),
mpatches.Patch(facecolor='#E8E8E8', edgecolor='#AAAAAA', linewidth=1,
label='✗ Not in Fedora (gap)'),
plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='#228B22',
markeredgecolor='#145214', markersize=10, label='Language point (Fedora)'),
plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='white',
markeredgecolor='#888888', markersize=10, label='Language point (Missing)'),
]
ax.legend(handles=legend_elements, loc='upper left', fontsize=10,
framealpha=0.95, edgecolor='#CCCCCC')
# Add explanatory note box
explanation = (
"HOW TO READ THIS DIAGRAM:\n"
"• Each CELL represents one language from CLDR\n"
"• Cell POSITION: X = speaker count, Y = region\n"
"• Cell SIZE: larger cells = more 'unique' in the landscape\n"
"• GREEN = Fedora has translations\n"
"• GRAY = potential translation opportunity\n"
"• Labels show top 15 languages by speakers"
)
ax.text(0.98, 0.02, explanation,
transform=ax.transAxes, fontsize=9, verticalalignment='bottom',
horizontalalignment='right', family='monospace',
bbox=dict(boxstyle='round,pad=0.5', facecolor='#FFFFEE',
edgecolor='#CCCC99', alpha=0.95))
# Add "NOT A MAP" warning
ax.text(0.5, 0.98, '⚠️ CONCEPTUAL DIAGRAM - NOT A GEOGRAPHIC MAP ⚠️',
transform=ax.transAxes, fontsize=11, ha='center', va='top',
color='#CC6600', fontweight='bold',
bbox=dict(boxstyle='round', facecolor='#FFF8E7', edgecolor='#FFCC80'))
plt.tight_layout()
plt.show()
# Print summary statistics
total = len(df)
fedora_count = in_fedora.sum()
print(f"\n📊 Voronoi Summary: {fedora_count}/{total} languages ({fedora_count/total*100:.1f}%) have Fedora translations")
plot_abstract_voronoi(voronoi_df)
Top Missing Languages (Expansion Opportunities)¶
Languages in CLDR with significant speaker populations that Fedora doesn't currently support.
# Top CLDR languages missing from Fedora
missing_from_fedora = alignment_df[
alignment_df['in_cldr'] &
~alignment_df['in_fedora'] &
alignment_df['estimated_speakers'].notna()
].sort_values('estimated_speakers', ascending=False)
if len(missing_from_fedora) > 0:
print(f"Top 20 CLDR languages missing from Fedora (by estimated speakers):")
print()
display_cols = ['language_code', 'estimated_speakers', 'speaker_bucket', 'region']
print(missing_from_fedora[display_cols].head(20).to_string(index=False))
else:
if not stats_found:
print("ℹ️ Running in demo mode — no Fedora data to compare.")
else:
print("✓ Fedora covers all CLDR languages with speaker estimates!")
Summary¶
This notebook provides:
Data Alignment¶
- ✅ CLDR data loading (load_cldr_data): Loads territoryInfo.json and languageData.json
- ✅ Language extraction (extract_cldr_languages): Combines both CLDR sources
- ✅ Speaker estimation (compute_speaker_estimates): Calculates from territory populations
- ✅ Fedora detection (detect_fedora_languages): Scans stats folder with graceful fallback
- ✅ Alignment DataFrame (build_alignment_dataframe): Enriched with log_speakers, speaker_bucket, region
Analysis¶
- ✅ Coverage summary (language count, speaker-weighted)
- ✅ Speaker bucket analysis table
- ✅ Region analysis table
Visualizations (Grayscale + Hatching)¶
- ✅ Bar chart: CLDR vs Fedora by speaker bucket
- ✅ Bar chart: CLDR vs Fedora by region
- ✅ Scatter plot: Speakers vs coverage by region
- ✅ Abstract Voronoi diagram (green/gray fills): Conceptual visualization of language landscape
Key Insight¶
The Voronoi diagram helps visualize that:
- Language diversity extends far beyond major languages
- Many languages with millions of speakers lack Fedora support
- Regional coverage varies significantly
Next Steps: Use the alignment_df DataFrame for further analysis, filtering, or export.
# Final summary
print("="*60)
print("NOTEBOOK COMPLETE")
print("="*60)
print(f"\nDataFrame 'alignment_df' ready with {len(alignment_df)} languages")
print(f"Columns: {list(alignment_df.columns)}")
print(f"\nSample:")
alignment_df[alignment_df['in_fedora']].sample(min(5, len(alignment_df[alignment_df['in_fedora']])), random_state=42)