Introduction

Languages are the primary medium through which human beings establish mutual understanding, yet the question “what fraction of humanity could I actually communicate with?” has no easily accessible answer. Existing resources each provide a piece of this puzzle: Ethnologue (Eberhard et al. 2025) catalogues over seven thousand living languages with speaker estimates, the Unicode CLDR (Unicode Consortium, n.d.) maps languages to territories with population percentages, and the CIA World Factbook (Central Intelligence Agency, n.d.) publishes prose descriptions of each country’s linguistic composition. However, few tools synthesise these inputs into an interactive, user-friendly experience that lets a person select their languages and immediately see the result on a world map.

Speakrow was built to fill this gap. It is a publicly accessible web application in which a user selects one or more languages and receives, in real time, an estimate of the percentage of humanity they can communicate with, visualised as a colour-coded world map accompanied by per-country, per-continent, and global statistics. The application thus addresses a personal question: how far does my linguistic repertoire reach?

Building such a system requires solving three interconnected problems. First, a reliable dataset must be assembled that maps languages to countries with population percentages: a task complicated by the absence of any single authoritative source and by systematic biases in the sources that do exist. Second, a computation model must handle the fundamental challenge of double-counting: in a country where 100% speak Language A and 80% also speak Language B, selecting both languages does not grant access to 180% of the population. Third, the user interface must make the results legible and engaging, with instantaneous feedback on every interaction, so that the experience feels exploratory rather than analytical.

Data Collection and Processing

Primary Source: Unicode CLDR

The Unicode Common Locale Data Repository (CLDR) is maintained by the Unicode Consortium with contributions from Apple, Google, Microsoft, and IBM, among others. Its territoryInfo.json file encodes, for each of the world’s territories, an estimated total population and a languagePopulation object listing every language spoken there. Each entry includes a populationPercent field, the estimated share of the territory’s population with meaningful proficiency in the language, and an officialStatus field indicating whether the language holds official, de facto official, or regional official status. This file served as the foundation for all country-language relationship data in the project.

Three supplementary files were used. Two were drawn from the same CLDR repository: languages.json (English display names for ISO 639 language codes) and territories.json (English display names for ISO 3166 territory codes). The third was the ISO 3166 country table maintained by Duncalfe (Duncalfe, n.d.), which supplied alpha-3 codes, numeric codes, and the regional classifications used for continent assignment.

Country Scope

The dataset was restricted to 197 entities: the 193 United Nations member states plus Palestine (PS), the Holy See (VA), Kosovo (XK), and Taiwan (TW). Overseas territories, dependent territories, and special administrative regions were excluded as independent entries, but their language data was consolidated into the records of their sovereign parent countries.

Continent assignment was derived from the ISO 3166 regional classification using a three-level priority scheme: intermediate-region (most specific) was consulted first, then sub-region, then top-level region as a fallback. This placed all 197 countries into one of six continents: Africa, Asia, Europe, North America, South America, and Oceania, without requiring any manual assignments.
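The priority scheme can be illustrated with a minimal sketch. The record shape, the field names, and the label-to-continent lookup below are assumptions made for illustration; in particular, the lookup is deliberately incomplete and the placement of Central America and the Caribbean under North America is a guess at how the six-continent split was obtained.

```typescript
// Hypothetical row shape; the real ISO 3166 table uses its own column names.
interface IsoRegionRow {
  alpha2: string;
  region: string;             // top-level region, e.g. "Americas"
  subRegion: string;          // e.g. "Latin America and the Caribbean"
  intermediateRegion: string; // most specific level, may be empty
}

type Continent =
  | "Africa" | "Asia" | "Europe"
  | "North America" | "South America" | "Oceania";

// Illustrative, incomplete lookup from classification labels to continents.
const CONTINENT_BY_LABEL: Record<string, Continent> = {
  "South America": "South America",
  "Central America": "North America", // assumption
  "Caribbean": "North America",       // assumption
  "Northern America": "North America",
  "Africa": "Africa",
  "Asia": "Asia",
  "Europe": "Europe",
  "Oceania": "Oceania",
};

function assignContinent(row: IsoRegionRow): Continent | undefined {
  // Consult the most specific label first, then fall back to coarser ones.
  for (const label of [row.intermediateRegion, row.subRegion, row.region]) {
    const continent = CONTINENT_BY_LABEL[label];
    if (continent !== undefined) return continent;
  }
  return undefined; // would signal that a manual assignment is needed
}
```

Under this sketch, a row whose intermediate region is "South America" resolves at the first level, while a European row with an empty intermediate region and an unlisted sub-region falls through to the top-level region.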

Language Code Normalisation

CLDR uses language codes that may include script-variant suffixes (e.g., zh-Hans for Simplified Chinese, pa-Arab for Punjabi in Arabic script). Before statistical processing, all codes were normalised to BCP 47 conventions and a controlled vocabulary of thirteen script-variant-to-base-code mappings was applied. Norwegian Bokmål (nb) and Nynorsk (nn) were merged into a generic Norwegian code (no), Chinese simplified and traditional script variants were merged into zh, and analogous merges were applied to variants of Cantonese, Punjabi, Azerbaijani, Hausa, Kazakh, Kurdish, Mongolian, Malay, and Sindhi.

When two rows for the same country and base language code arose after merging, their population percentages were summed (capped at 100%) and the most official of the two status values was retained.
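A minimal sketch of the normalisation and merge step follows. The row shape, the subset of mappings shown, the status labels, and the numeric ranking of "most official" are all assumptions introduced for illustration rather than the project's actual tables.

```typescript
type OfficialStatus = "official" | "de_facto_official" | "official_regional" | null;

interface LangRow {
  country: string;
  code: string;
  pct: number;
  status: OfficialStatus;
}

// Illustrative subset of the variant-to-base mappings (the full table has thirteen).
const BASE_CODE: Record<string, string> = {
  "zh-Hans": "zh",
  "zh-Hant": "zh",
  "nb": "no",
  "nn": "no",
  "pa-Arab": "pa",
};

// Assumed ranking: a higher number means "more official".
const STATUS_RANK = new Map<OfficialStatus, number>([
  [null, 0],
  ["official_regional", 1],
  ["de_facto_official", 2],
  ["official", 3],
]);

function normaliseAndMerge(rows: LangRow[]): LangRow[] {
  const merged = new Map<string, LangRow>();
  for (const row of rows) {
    const code = BASE_CODE[row.code] ?? row.code;
    const key = `${row.country}|${code}`;
    const prev = merged.get(key);
    if (prev === undefined) {
      merged.set(key, { ...row, code });
      continue;
    }
    // Same country and base code: sum percentages (capped at 100%)
    // and keep whichever status ranks as more official.
    prev.pct = Math.min(prev.pct + row.pct, 100);
    const rowRank = STATUS_RANK.get(row.status) ?? 0;
    const prevRank = STATUS_RANK.get(prev.status) ?? 0;
    if (rowRank > prevRank) prev.status = row.status;
  }
  return Array.from(merged.values());
}
```

Keying the intermediate map on the country and base code together means the merge needs only a single pass over the rows.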

Territory Consolidation

Because the application’s country scope excludes territories but CLDR contains language data for them, the population-weighted language distributions of excluded territories were folded into the records of their sovereign parent countries before filtering. For a territory with population $P_t$ where language $L$ is spoken by $p_t$% of residents, the contribution to the parent country (population $P_c$) was:

$$\Delta_L = \frac{P_t \times p_t}{P_c \times 100} \times 100$$

yielding the number of territory speakers of $L$ expressed as a percentage of the parent country’s population. This consolidation was applied to all 66 mapped territory-parent relationships, encompassing British Overseas Territories, French overseas departments, Dutch Caribbean territories, Danish autonomous territories, Norwegian territories, United States territories, Australian external territories, New Zealand associated states, and the Chinese special administrative regions.
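As a worked illustration, the contribution formula reduces algebraically to $P_t \times p_t / P_c$, which the sketch below computes directly. The function name and the example figures are hypothetical and do not come from the dataset.

```typescript
// Percentage points added to the parent country's figure for one language.
// territoryPct is on a 0-100 scale, as is the returned value.
function territoryContribution(
  territoryPop: number,
  territoryPct: number,
  parentPop: number,
): number {
  // (P_t * p_t / 100) speakers, divided by P_c, times 100: the factors of 100 cancel.
  return (territoryPop * territoryPct) / parentPop;
}

// Hypothetical example: a territory of 50,000 people, 80% of whom speak L,
// attached to a parent country of 1,000,000 people.
const delta = territoryContribution(50000, 80, 1000000); // about 4 percentage points
```

In this made-up case the territory's 40,000 speakers amount to roughly four percentage points of the parent country's population.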

Manual Corrections for Known Biases

A systematic review identified eleven country-language pairs where CLDR values were demonstrably inconsistent with the scholarly consensus. These were corrected via a hardcoded override table applied during processing. Table 1 summarises the most consequential corrections.

Table 1. Selected manual overrides applied to CLDR data. The “Source” column gives the approximate original CLDR value; the “Corrected” column gives the value used in the final dataset.
Country	Language	Source	Corrected	Rationale
PK	en	50%	15%	L2 over-estimate
CH	en	45%	30%	L2 over-estimate
AE	en	50%	40%	L2 over-estimate
GB	fr	17%	8%	L2 over-estimate
GB	de	9%	4%	L2 over-estimate
CH	de	63%	20%	Swiss German conflation
KZ	ru	low	84%	Under-reported
NO	en	n/a	90%	Missing from CLDR
IS	en	n/a	85%	Missing from CLDR

The most consequential class of corrections addressed CLDR’s over-estimation of English proficiency. In Pakistan, CLDR reported approximately 50% English-speaking prevalence, a figure that conflates literacy in English-medium institutions with functional conversational ability; the corrected value of 15% is more consistent with estimates from the British Council and academic surveys. In Norway and Iceland, CLDR omitted English entirely, even though both countries rank among the most English-proficient populations globally in the EF English Proficiency Index.
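One plausible shape for such an override table is sketched below. The key format, the row type, and the choice to insert rows that are missing from the input are assumptions; only the four percentage values are taken from Table 1.

```typescript
interface PctRow {
  country: string;
  code: string;
  pct: number;
}

// Keys are "COUNTRY|language"; values are corrected percentages from Table 1.
const OVERRIDES: Record<string, number> = {
  "PK|en": 15,
  "CH|en": 30,
  "NO|en": 90,
  "IS|en": 85,
};

function applyOverrides(rows: PctRow[]): PctRow[] {
  const seen = new Set<string>();
  // Replace the percentage of any row that has an override.
  const out = rows.map((row) => {
    const key = `${row.country}|${row.code}`;
    seen.add(key);
    const override = OVERRIDES[key];
    return override === undefined ? row : { ...row, pct: override };
  });
  // Add pairs that the source omitted entirely (e.g. English in Norway).
  for (const key of Object.keys(OVERRIDES)) {
    if (seen.has(key)) continue;
    const [country, code] = key.split("|");
    out.push({ country, code, pct: OVERRIDES[key] });
  }
  return out;
}
```

Handling replacement and insertion in the same function lets one table cover both over-estimates and outright omissions.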

Coverage Computation Model

The coverage statistics presented to users are computed entirely in the browser at interaction time. The computation proceeds in three stages.

Country-Level Coverage

When a user selects a set of languages $S$, the system iterates over all 197 countries. For each country $c$, the raw coverage is the sum of the population percentages of all selected languages present in that country:

$$r_c = \sum_{l \in S} \texttt{pct}(c, l)$$

where $\texttt{pct}(c, l)$ returns the population percentage of language $l$ in country $c$, or zero if the pair does not exist in the dataset. The capped coverage is then:

$$\hat{r}_c = \min(r_c,\; 100)$$

This cap reflects the constraint that a single person can be counted at most once regardless of how many selected languages they speak. The estimated reachable population in country $c$ is:

$$n_c = \left\lfloor \frac{\hat{r}_c \times P_c}{100} + 0.5 \right\rfloor$$

where $P_c$ is the country’s population.
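The three quantities $r_c$, $\hat{r}_c$, and $n_c$ can be computed together, as in the following sketch. The function name and the use of a per-country Map from language code to percentage are assumptions for illustration.

```typescript
interface CountryCoverage {
  raw: number;       // r_c, may exceed 100
  capped: number;    // r-hat_c, at most 100
  reachable: number; // n_c, rounded to the nearest person
}

function countryCoverage(
  pctByLanguage: Map<string, number>,
  selected: string[],
  population: number,
): CountryCoverage {
  let raw = 0;
  for (const code of selected) {
    // pct(c, l) is zero when the pair is absent from the dataset.
    raw += pctByLanguage.get(code) ?? 0;
  }
  const capped = Math.min(raw, 100);
  // floor(x + 0.5) rounds half up, matching the formula for n_c.
  const reachable = Math.floor((capped * population) / 100 + 0.5);
  return { raw, capped, reachable };
}

// A country where 100% speak language "a" and 80% also speak "b":
// raw coverage is 180, but capped coverage is 100.
const example = countryCoverage(new Map([["a", 100], ["b", 80]]), ["a", "b"], 1000000);
```

In the example, the raw sum of 180 is retained alongside the capped value of 100, since later statistics may need either.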

Global Statistics

Global reach is the total reachable population expressed as a percentage of the world population:

$$G = \frac{\sum_{c} n_c}{W} \times 100$$

where $W = 8.1 \times 10^9$ is a fixed world population constant. The number of “countries reached” counts only those with $\hat{r}_c \geq 10\%$, a threshold chosen to exclude countries where a language is technically present but spoken by too small a fraction to constitute meaningful communicative reach.
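The aggregation can be sketched as follows, assuming per-country results are already available; the type and function names are hypothetical.

```typescript
const WORLD_POPULATION = 8.1e9; // fixed constant W
const REACHED_THRESHOLD = 10;   // capped coverage, in percent

interface CountryResult {
  capped: number;    // r-hat_c
  reachable: number; // n_c
}

function globalStats(results: CountryResult[]): { globalReach: number; countriesReached: number } {
  let totalReachable = 0;
  let countriesReached = 0;
  for (const result of results) {
    totalReachable += result.reachable;
    if (result.capped >= REACHED_THRESHOLD) countriesReached += 1;
  }
  return {
    globalReach: (totalReachable / WORLD_POPULATION) * 100,
    countriesReached,
  };
}
```

Note that a country below the 10% threshold still contributes its reachable population to $G$; the threshold affects only the count of countries reached.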

Survival Score

The survival score counts countries where $r_c \geq 40\%$ (computed before the 100% cap, since the uncapped sum better reflects the depth of language coverage). This threshold is calibrated as a rough approximation of the minimum share of a population that, if reachable, would allow a traveller to navigate most everyday situations.

Three distinct thresholds thus address three distinct questions: 10% for “is this language meaningfully present?”, 25% for continent-level regional significance, and 40% for “could you survive here?”
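The survival score differs from the countries-reached count only in its threshold and in using the uncapped sum. A minimal sketch, with a hypothetical function name:

```typescript
const SURVIVAL_THRESHOLD = 40; // percent, applied to the uncapped coverage r_c

// rawCoverages holds one uncapped r_c value per country.
function survivalScore(rawCoverages: number[]): number {
  return rawCoverages.filter((raw) => raw >= SURVIVAL_THRESHOLD).length;
}
```

Because any uncapped value above 100 also passes the 40% test once capped, the choice between capped and uncapped input does not change this particular count; the distinction would matter only if the score were extended to weight countries by depth of coverage.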

System Architecture

Technology Stack

Speakrow is built on Next.js 16 (App Router) with React 19, TypeScript, and Tailwind CSS v4. The interactive map uses react-simple-maps 3.0 with a Mercator projection rendering TopoJSON country geometries from the world-atlas package at 110m resolution. Animations are implemented with Framer Motion 12, and milestone celebrations use canvas-confetti. The sole backend is Supabase (hosted PostgreSQL with Row-Level Security), accessed directly from the browser via the client SDK.

Client-Only Architecture

A defining architectural decision is the complete absence of server-side routes. The application is a single-route, fully client-rendered page. All data is fetched from Supabase on mount via a single parallel Promise.all call, and every subsequent computation executes in the browser. This architecture maximises interaction responsiveness: once the initial data load completes (861 country–language pairs, 408 languages, 197 countries, 80 fun facts), every language toggle produces instantaneous visual feedback with zero network latency.

The trade-off is that the initial load requires a Supabase fetch before any meaningful UI is shown. This is mitigated by the small dataset size and is considered acceptable because users orient themselves to the interface during the loading period.

State Management

The application uses a two-layer state model. DataProvider (a React context) is the single source of truth for raw data, exposing four pre-built lookup Map structures optimised for the computation layer’s access patterns. useLanguageSelection (a custom hook) owns the sole piece of mutable state, the array of selected language codes, and derives every downstream statistic via useMemo with selectedCodes as the dependency.

This design ensures that a language toggle triggers exactly one state update and exactly one round of memoised recomputation, after which React re-renders only the affected components. The most performance-critical data structure is countryLanguagesMap, which groups all country–language pairs by language code so that the computation layer iterates only the relevant countries per selected language, reducing the inner loop from $O(N \times M)$ to $O(|S| \times \bar{k})$, where $|S|$ is the number of selected languages and $\bar{k}$ is the average number of countries per language.
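The grouping idea can be sketched outside React as two plain functions. The names, the pair shape, and the sample percentages in the usage note are illustrative assumptions; only the principle of indexing pairs by language code is taken from the description above.

```typescript
interface Pair {
  country: string;
  code: string;
  pct: number;
}

// Build once after loading: language code -> all pairs for that language.
function buildCountryLanguagesMap(pairs: Pair[]): Map<string, Pair[]> {
  const byLanguage = new Map<string, Pair[]>();
  for (const pair of pairs) {
    const bucket = byLanguage.get(pair.code);
    if (bucket) bucket.push(pair);
    else byLanguage.set(pair.code, [pair]);
  }
  return byLanguage;
}

// Per toggle: visit only the countries where a selected language occurs.
function rawCoverageByCountry(
  byLanguage: Map<string, Pair[]>,
  selected: string[],
): Map<string, number> {
  const raw = new Map<string, number>();
  for (const code of selected) {
    for (const pair of byLanguage.get(code) ?? []) {
      raw.set(pair.country, (raw.get(pair.country) ?? 0) + pair.pct);
    }
  }
  return raw;
}
```

With made-up pairs such as French at 97% in one country and 40% in a second, plus Dutch at 60% in that second country, selecting both languages touches three pairs in total and yields an uncapped sum of 100 for the shared country. Countries with no selected language never appear in the result, which is what keeps the work proportional to $|S| \times \bar{k}$.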

Database Schema

Five PostgreSQL tables are hosted on Supabase: countries (197 rows; alpha-2 primary key, alpha-3, numeric code, name, continent, population), languages (408 rows; ISO 639 code as primary key, name, family, sub-family, total speakers), country_languages (861 rows; foreign keys to both parent tables, population percentage, official status), fun_facts (80 rows; language trivia), and bug_reports (user feedback with automatically collected browser context). Row-Level Security policies ensure that the public data tables allow anonymous reads while bug reports require authenticated access.

User Interface and Interaction Design

Map Encoding

The world map encodes both which languages are present and how strongly they are represented. Countries with no selected language present are rendered in neutral dark grey. Countries where exactly one selected language is present receive that language’s colour at an opacity computed as $\min(0.3 + 0.7 \times \hat{r}_c / 100,\; 1.0)$, so that even faintly covered countries are visibly tinted while high-coverage countries appear richly saturated. Countries where multiple selected languages are present receive an SVG diagonal stripe pattern interleaving the colours of the contributing languages, with stripe width adapting from 4 pixels for two languages to 3 pixels for three or more.
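The opacity and stripe-width rules translate directly into two small helpers. The function names are hypothetical, and the stripe helper assumes it is only called for countries with at least two contributing languages.

```typescript
// Opacity for a single-language country, from capped coverage in percent.
function fillOpacity(cappedCoverage: number): number {
  return Math.min(0.3 + (0.7 * cappedCoverage) / 100, 1.0);
}

// Stripe width in pixels for a multi-language country.
function stripeWidth(languageCount: number): number {
  return languageCount >= 3 ? 3 : 4;
}
```

The floor of 0.3 is what keeps a country with very low coverage visibly tinted, while coverage of 50% lands at about 0.65.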

Hovering over any country displays a tooltip showing the country name, capped coverage percentage, and a list of contributing languages with their individual percentages.

Responsive Layout

On desktop viewports ($\geq$ 1024 px), the interface presents a three-column layout: a 320 px fixed sidebar housing the language selector, a main content area with the stats panel and map, and a lower row with continent bars and fun facts. On mobile viewports, the sidebar is replaced by a bottom sheet that shows a compact summary bar (global reach percentage, country count, language count) in its collapsed state and expands to a tabbed interface providing full access to statistics, language selection, and fun facts. Map panning is disabled on mobile to avoid gesture conflicts with the bottom sheet, while zoom remains accessible via on-screen controls.

Cross-Check Validation

Following the initial database seeding, a systematic cross-check was conducted against three independent sources to identify and correct residual errors.

CIA World Factbook

For 75 of the most-populated countries, the language percentage field was parsed from the Factbook’s structured JSON (Central Intelligence Agency, n.d.) using regular expression extraction. Matched language-country pairs where the absolute difference exceeded 15 percentage points were flagged. Only pairs with discrepancies exceeding 20 percentage points were automatically corrected, using the arithmetic mean of the CLDR and Factbook values as a conservative split. Pairs protected by manual overrides were excluded.
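The flag-and-correct rule can be expressed as a small decision function. The names and the four action labels are hypothetical, and the regular-expression parsing step is omitted because the exact Factbook field format is not reproduced here.

```typescript
type FactbookAction = "ok" | "flag" | "correct" | "protected";

interface FactbookDecision {
  action: FactbookAction;
  value: number; // percentage to keep in the dataset
}

const FLAG_THRESHOLD = 15;    // percentage points
const CORRECT_THRESHOLD = 20; // percentage points

function compareWithFactbook(
  cldrPct: number,
  factbookPct: number,
  hasManualOverride: boolean,
): FactbookDecision {
  // Manually overridden pairs are never touched by the cross-check.
  if (hasManualOverride) return { action: "protected", value: cldrPct };
  const diff = Math.abs(cldrPct - factbookPct);
  if (diff > CORRECT_THRESHOLD) {
    // Split the difference: arithmetic mean of the two sources.
    return { action: "correct", value: (cldrPct + factbookPct) / 2 };
  }
  if (diff > FLAG_THRESHOLD) return { action: "flag", value: cldrPct };
  return { action: "ok", value: cldrPct };
}
```

For instance, values of 80 and 50 differ by 30 points and would be corrected to 65, whereas values of 80 and 62 differ by 18 points and would only be flagged.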

Ethnologue via Wikipedia

The Wikipedia article “List of languages by total number of speakers” was parsed to extract speaker count estimates sourced primarily from Ethnologue. For languages with at least 10 million speakers, relative differences exceeding 30% triggered an automatic correction using the Wikipedia/Ethnologue value, on the grounds that Ethnologue’s estimates for large languages are more refined than the sum of CLDR’s per-country percentages.
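The speaker-count rule can be sketched in the same style. Two details are assumptions here because the text leaves them open: the 10 million floor is applied to the reference figure, and the relative difference is measured against the reference figure.

```typescript
const MIN_SPEAKERS = 1e7;       // only large languages are reconciled
const MAX_RELATIVE_DIFF = 0.3;  // 30%

// cldrTotal: speakers implied by summing per-country percentages.
// referenceTotal: the Wikipedia/Ethnologue estimate.
function reconcileSpeakers(cldrTotal: number, referenceTotal: number): number {
  if (referenceTotal < MIN_SPEAKERS) return cldrTotal;
  const relativeDiff = Math.abs(cldrTotal - referenceTotal) / referenceTotal;
  return relativeDiff > MAX_RELATIVE_DIFF ? referenceTotal : cldrTotal;
}
```

With a hypothetical reference figure of 100 million, a derived total of 60 million (40% lower) would be replaced, while a derived total of 80 million (20% lower) would be kept.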

REST Countries API

The REST Countries API (Clavijo, n.d.) was used for two purposes: country populations differing by more than 10% from our figures were updated, and languages listed as official by the API but missing official status in our records were upgraded.

Results

The cross-check pass raised 124 flags in total. Of these, 56 resulted in automatic corrections: 39 population updates, 14 official status upgrades, and 3 percentage corrections. The remaining 68 flags were retained as documented discrepancies, the majority representing country–language pairs present in external sources but absent from the database entirely — a condition requiring manual resolution since no reliable population_pct could be inferred from the external sources alone.

Limitations

Several limitations warrant explicit acknowledgement.

The population_pct values represent proficiency prevalence rather than exclusive primary-language affiliation, and the underlying data does not encode which individuals speak more than one of the selected languages. The 100% cap prevents the summed coverage from exceeding a country’s population, but it cannot remove overlap below that ceiling: if 30% of a country speaks English and the same 30% also speaks French, selecting both languages reports 60% coverage. The cap is therefore a population-level approximation rather than an individual-level accounting, and summed coverage should be read as an upper bound.

The fixed world population constant of $8.1 \times 10^9$ does not reflect real-time demographic change, and individual country populations are point-in-time estimates. The total_speakers field reflects speakers within the 197-country scope only, excluding diaspora populations in excluded territories and speakers in countries where the language falls below the 0.5% threshold.

CLDR’s characterisation of L2 proficiency is not methodologically uniform across countries: in some cases it reflects official-language-in-education policies, in others self-reported census responses, and in others contributor estimates. The manual overrides and cross-check corrections address the most egregious known instances, but they cannot eliminate the underlying heterogeneity.

Finally, the three coverage thresholds (10%, 25%, 40%) are heuristic rather than empirically calibrated. They encode reasonable intuitions about communicative utility but should not be interpreted as precise sociolinguistic boundaries.

Users should therefore interpret the coverage percentages as order-of-magnitude estimates of communicative reach rather than precise demographic measurements.

Speakrow.com