Full Application

Your Linguistic Reach
Speakrow.com is a web app that shows what share of the world's population you can communicate with based on the languages you speak. You select the languages you know, and the app instantly updates an interactive world map to show where you can communicate. It also calculates your overall global coverage, provides a survival score, and breaks down your reach by continent with simple summary statistics.
Languages are the primary medium through which human beings establish mutual understanding, yet the question “what fraction of humanity could I actually communicate with?” has no easily accessible answer. Existing resources provide pieces of this puzzle: Ethnologue (Eberhard et al. 2025) catalogues over seven thousand living languages with speaker estimates, the Unicode CLDR (Unicode Consortium, n.d.) maps languages to territories with population percentages, and the CIA World Factbook (Central Intelligence Agency, n.d.) publishes prose descriptions of each country’s linguistic composition. However, few tools synthesise these inputs into an interactive, user-friendly experience that lets a person select their languages and immediately see the result on a world map.
Speakrow was built to fill this gap. It is a publicly accessible web application in which a user selects one or more languages and receives, in real time, an estimate of the percentage of humanity they can communicate with, visualised as a colour-coded world map accompanied by per-country, per-continent, and global statistics. The application addresses a personal question: how far does my linguistic repertoire reach?
Building such a system requires solving three interconnected problems. First, a reliable dataset must be assembled that maps languages to countries with population percentages: a task complicated by the absence of any single authoritative source and by systematic biases in the sources that do exist. Second, a computation model must handle the fundamental challenge of double-counting: in a country where 100% speak Language A and 80% also speak Language B, selecting both languages does not grant access to 180% of the population. Third, the user interface must make the results legible and engaging, with instantaneous feedback on every interaction, so that the experience feels exploratory rather than analytical.
The Unicode Common Locale Data Repository (CLDR) is maintained by the Unicode Consortium
with contributions from Apple, Google, Microsoft, and IBM, among others.
Its territoryInfo.json file encodes, for each of the
world’s territories, an estimated total population and a
languagePopulation object listing every language spoken
there. Each entry includes a populationPercent field, the
estimated share of the territory’s population with meaningful
proficiency in the language, and an officialStatus field
indicating whether the language holds official, de facto
official, or regional official status. This file served as the
foundation for all country-language relationship data in the
project.
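As a rough illustration of how such a file can be flattened into country-language rows, the sketch below assumes the cldr-json layout in which attribute keys are underscore-prefixed (`_population`, `_populationPercent`, `_officialStatus`) and numeric values are stored as strings. The interface and function names are hypothetical and the sample object is hand-written, not real CLDR data.

```typescript
// Assumed shape of one territory entry; verify the key names against the
// actual territoryInfo.json before relying on them.
interface CldrLanguageEntry {
  _populationPercent: string;
  _officialStatus?: string;
}

interface CldrTerritoryEntry {
  _population: string;
  languagePopulation?: Record<string, CldrLanguageEntry>;
}

// Hypothetical flat row used by the rest of a processing pipeline.
interface CountryLanguageRow {
  countryCode: string;
  languageCode: string;
  populationPct: number;
  officialStatus: string | null;
}

// Turn the nested territory -> language structure into one row per pair.
function flattenTerritoryInfo(
  territoryInfo: Record<string, CldrTerritoryEntry>
): CountryLanguageRow[] {
  const rows: CountryLanguageRow[] = [];
  for (const countryCode of Object.keys(territoryInfo)) {
    const languages = territoryInfo[countryCode].languagePopulation ?? {};
    for (const languageCode of Object.keys(languages)) {
      const entry = languages[languageCode];
      rows.push({
        countryCode,
        languageCode,
        populationPct: Number(entry._populationPercent),
        officialStatus: entry._officialStatus ?? null,
      });
    }
  }
  return rows;
}

// Hand-written sample with a placeholder territory code and made-up values.
const sampleTerritoryInfo: Record<string, CldrTerritoryEntry> = {
  XX: {
    _population: "1000",
    languagePopulation: {
      en: { _populationPercent: "80", _officialStatus: "official" },
      fr: { _populationPercent: "12.5" },
    },
  },
};
const sampleRows = flattenTerritoryInfo(sampleTerritoryInfo);
```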
Three supplementary files were used. Two came from the same CLDR repository: languages.json
(English display names for ISO 639 language codes) and
territories.json (English display names for ISO 3166
territory codes). The third was the ISO 3166 country table maintained by
Duncalfe (n.d.), which supplied alpha-3 codes, numeric codes, and the
regional classifications used for continent assignment.
The dataset was restricted to 197 entities: the 193 United Nations member states plus Palestine (PS), the Holy See (VA), Kosovo (XK), and Taiwan (TW). All overseas territories, dependent territories, and special administrative regions were excluded as independent entries but their language data was consolidated into the records of their sovereign parent countries.
Continent assignment was derived from the ISO 3166 regional classification using a three-level priority scheme: intermediate-region (most specific) was consulted first, then sub-region, then top-level region as a fallback. This placed all 197 countries into one of six continents: Africa, Asia, Europe, North America, South America, and Oceania, without requiring any manual assignments.
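The priority scheme can be sketched as a lookup that tries the most specific label first. The label strings and the lookup table below are assumptions modelled on the region, sub-region, and intermediate-region columns of the ISO 3166 table. In particular, mapping "Central America" and "Caribbean" to North America is an illustrative choice, not something the text confirms.

```typescript
type Continent =
  | "Africa"
  | "Asia"
  | "Europe"
  | "North America"
  | "South America"
  | "Oceania";

// One row of the regional classification (field names are hypothetical).
interface IsoRegionRow {
  region: string;
  subRegion: string;
  intermediateRegion: string;
}

// Assumed label-to-continent table. Labels not listed here simply fall
// through to the next, less specific level.
const CONTINENT_BY_LABEL: Record<string, Continent> = {
  "South America": "South America",
  "Central America": "North America",
  "Caribbean": "North America",
  "Northern America": "North America",
  "Africa": "Africa",
  "Asia": "Asia",
  "Europe": "Europe",
  "Oceania": "Oceania",
};

// Consult intermediate-region first, then sub-region, then region.
function assignContinent(row: IsoRegionRow): Continent | undefined {
  const candidates = [row.intermediateRegion, row.subRegion, row.region];
  for (const label of candidates) {
    const continent = CONTINENT_BY_LABEL[label];
    if (continent !== undefined) return continent;
  }
  return undefined;
}

const brazilContinent = assignContinent({
  region: "Americas",
  subRegion: "Latin America and the Caribbean",
  intermediateRegion: "South America",
});
const kenyaContinent = assignContinent({
  region: "Africa",
  subRegion: "Sub-Saharan Africa",
  intermediateRegion: "Eastern Africa",
});
```

In this sketch Brazil resolves at the intermediate-region level, while Kenya falls through to the top-level region because neither of its more specific labels appears in the table.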
CLDR uses language codes that may
include script-variant suffixes (e.g., zh-Hans for
Simplified Chinese, pa-Arab for Punjabi in Arabic script).
Before statistical processing, all codes were normalised to BCP 47
conventions and a controlled vocabulary of thirteen
script-variant-to-base-code mappings was applied. Norwegian Bokmål
(nb) and Nynorsk (nn) were merged into a
generic Norwegian code (no), Chinese simplified and
traditional script variants were merged into zh, and
analogous merges were applied to variants of Cantonese, Punjabi,
Azerbaijani, Hausa, Kazakh, Kurdish, Mongolian, Malay, and Sindhi.
When two rows for the same country and base language code arose after merging, their population percentages were summed (capped at 100%) and the most official of the two status values was retained.
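A minimal sketch of the normalise-then-merge step follows. Only a handful of the thirteen mappings are shown, and both the mapping entries and the status ranking are assumptions, since the text does not list them in full. All names are hypothetical.

```typescript
// Illustrative subset of the variant-to-base mappings (assumed entries).
const BASE_CODE_BY_VARIANT: Record<string, string> = {
  "zh-Hans": "zh",
  "zh-Hant": "zh",
  "pa-Arab": "pa",
  nb: "no",
  nn: "no",
};

// Assumed ordering of status values, most official first. The labels mimic
// CLDR-style status strings and should be checked against the data.
const STATUS_RANK: Record<string, number> = {
  official: 3,
  de_facto_official: 2,
  official_regional: 1,
};

interface VariantRow {
  country: string;
  language: string;
  pct: number;
  officialStatus: string | null;
}

function statusRank(value: string | null): number {
  return value === null ? 0 : (STATUS_RANK[value] ?? 0);
}

// Normalise codes, then merge rows that collapse onto the same
// (country, base language) key: sum percentages, cap at 100, and keep
// the more official of the two status values.
function mergeVariantRows(rows: VariantRow[]): VariantRow[] {
  const merged = new Map<string, VariantRow>();
  for (const row of rows) {
    const tag = row.language.replace(/_/g, "-"); // BCP 47 uses hyphens
    const language = BASE_CODE_BY_VARIANT[tag] ?? tag;
    const key = `${row.country}|${language}`;
    const existing = merged.get(key);
    if (existing === undefined) {
      merged.set(key, { ...row, language });
      continue;
    }
    existing.pct = Math.min(100, existing.pct + row.pct);
    if (statusRank(row.officialStatus) > statusRank(existing.officialStatus)) {
      existing.officialStatus = row.officialStatus;
    }
  }
  return Array.from(merged.values());
}
```

The cap matters in cases such as Bokmål and Nynorsk, where many speakers are counted under both variants and a plain sum could exceed the country's population.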
Because the application’s country scope excludes territories but CLDR contains language data for them, population-weighted language distributions of excluded territories were folded into the records of their sovereign parent countries before filtering. For a territory $t$ with population $P_t$ where language $l$ is spoken by $p(t, l)$% of residents, the contribution to the parent country $c$ (population $P_c$) was:
$$\Delta p(c, l) = \frac{p(t, l) \cdot P_t}{P_c}$$
yielding the number of territory speakers of $l$ expressed as a percentage of the parent country’s population. This consolidation was applied to all 66 mapped territory-parent relationships, encompassing British Overseas Territories, French overseas departments, Dutch Caribbean territories, Danish autonomous territories, Norwegian territories, United States territories, Australian external territories, New Zealand associated states, and the Chinese special administrative regions.
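The folding step amounts to re-expressing territory speakers as a share of the parent population. The helper below is a sketch of that arithmetic only; the function name and the numbers in the example are hypothetical.

```typescript
// Percentage points added to the parent country's figure for one language:
// (territoryPct / 100 * territoryPopulation) speakers, expressed as a
// percentage of parentPopulation.
function territoryContributionPct(
  territoryPct: number,
  territoryPopulation: number,
  parentPopulation: number
): number {
  return (territoryPct * territoryPopulation) / parentPopulation;
}

// Made-up example: a territory of 100,000 people where 90% speak the
// language, folded into a parent country of 10,000,000 people.
const exampleContribution = territoryContributionPct(90, 100000, 10000000);
```

Because territories are usually small relative to their parent, the contribution is typically a fraction of a percentage point, as in this example.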
A systematic review identified eleven country-language pairs where CLDR values were demonstrably inconsistent with the scholarly consensus. These were corrected via a hardcoded override table applied during processing. Table 1 summarises the most consequential corrections.
| Country | Language | CLDR value | Corrected value | Rationale |
|---|---|---|---|---|
| PK | en | 50% | 15% | L2 over-estimate |
| CH | en | 45% | 30% | L2 over-estimate |
| AE | en | 50% | 40% | L2 over-estimate |
| GB | fr | 17% | 8% | L2 over-estimate |
| GB | de | 9% | 4% | L2 over-estimate |
| CH | de | 63% | 20% | Swiss German conflation |
| KZ | ru | low | 84% | Under-reported |
| NO | en | — | 90% | Missing from CLDR |
| IS | en | — | 85% | Missing from CLDR |
The most consequential class of corrections addressed CLDR’s over-estimation of English proficiency. In Pakistan, CLDR reported approximately 50% English-speaking prevalence, a figure that conflates literacy in English-medium institutions with functional conversational ability; the corrected value of 15% is more consistent with estimates from the British Council and academic surveys. In Norway and Iceland, CLDR omitted English entirely despite both countries ranking among the highest English proficiency populations globally in the EF English Proficiency Index.
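One way to apply such an override table is to key rows by country and language and let overrides replace or insert entries. This is a sketch under that assumption; the names are hypothetical, and only two of the corrections from Table 1 are included.

```typescript
interface PctRow {
  country: string;
  language: string;
  pct: number;
}

// Two entries taken from Table 1: a replacement (PK/en) and an insertion
// of a pair that was missing from the source (NO/en).
const OVERRIDES: PctRow[] = [
  { country: "PK", language: "en", pct: 15 },
  { country: "NO", language: "en", pct: 90 },
];

// Overrides win over source rows; pairs absent from the source are added.
function applyOverrides(rows: PctRow[], overrides: PctRow[]): PctRow[] {
  const byKey = new Map<string, PctRow>();
  for (const row of rows) {
    byKey.set(`${row.country}|${row.language}`, { ...row });
  }
  for (const override of overrides) {
    byKey.set(`${override.country}|${override.language}`, { ...override });
  }
  return Array.from(byKey.values());
}

// The source row uses the approximate CLDR figure quoted for Pakistan.
const correctedRows = applyOverrides(
  [{ country: "PK", language: "en", pct: 50 }],
  OVERRIDES
);
```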
The coverage statistics presented to users are computed entirely in the browser at interaction time. The computation proceeds in three stages.
When a user selects a set of languages $S$, the system iterates over all 197 countries. For each country $c$, the raw coverage is the sum of the population percentages of all selected languages present in that country:
$$r_c = \sum_{l \in S} p(c, l)$$
where $p(c, l)$ returns the population percentage of language $l$ in country $c$, or zero if the pair does not exist in the dataset. The capped coverage is then:
$$\hat{r}_c = \min(r_c, 100)$$
This cap reflects the constraint that a single person can be counted at most once regardless of how many selected languages they speak. The estimated reachable population in country $c$ is:
$$R_c = \frac{\hat{r}_c}{100} \cdot P_c$$
where $P_c$ is the country’s population.
Global reach is:
$$G = \frac{\sum_c R_c}{P_{\text{world}}} \times 100$$
where $P_{\text{world}}$ is a fixed world population constant. The number of “countries reached” counts only those with $\hat{r}_c \geq 10\%$, a threshold chosen to exclude countries where a language is technically present but spoken by too small a fraction to constitute meaningful communicative reach.
The survival score counts countries where $r_c \geq 40\%$ (computed before the 100% cap, since the uncapped sum better reflects the depth of language coverage). This threshold is calibrated as a rough approximation of the minimum share of a population that, if reachable, would allow a traveller to navigate most everyday situations.
Three distinct thresholds thus address three distinct questions: 10% for “is this language meaningfully present?”, 25% for continent-level regional significance, and 40% for “could you survive here?”
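The stages described above can be sketched as a single pass over the countries. The types and names are hypothetical, the world population is passed in as a parameter rather than hardcoded, and treating both thresholds as inclusive (at least 10%, at least 40%) is an assumption.

```typescript
interface CountryRecord {
  code: string;
  population: number;
  // Population percentage per language code present in the country.
  languagePct: Record<string, number>;
}

interface ReachStats {
  globalReachPct: number;
  countriesReached: number;
  survivalScore: number;
}

function computeReach(
  countries: CountryRecord[],
  selected: string[],
  worldPopulation: number
): ReachStats {
  let reachablePeople = 0;
  let countriesReached = 0;
  let survivalScore = 0;
  for (const country of countries) {
    // Raw coverage: plain sum over the selected languages.
    let raw = 0;
    for (const lang of selected) raw += country.languagePct[lang] ?? 0;
    // Capped coverage: nobody is counted more than once.
    const capped = Math.min(raw, 100);
    reachablePeople += (capped / 100) * country.population;
    if (capped >= 10) countriesReached++; // "meaningfully present"
    if (raw >= 40) survivalScore++; // uses the uncapped sum
  }
  return {
    globalReachPct: (reachablePeople / worldPopulation) * 100,
    countriesReached,
    survivalScore,
  };
}

// Toy data with made-up populations, echoing the 100% + 80% example.
const toyCountries: CountryRecord[] = [
  { code: "AA", population: 1000, languagePct: { a: 100, b: 80 } },
  { code: "BB", population: 1000, languagePct: { a: 25 } },
];
const toyStats = computeReach(toyCountries, ["a", "b"], 2000);
```

In the toy data the first country has a raw sum of 180% but contributes only its full population once, which is the double-counting guard the cap provides.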
Speakrow is built on Next.js 16 (App
Router) with React 19, TypeScript, and Tailwind CSS v4. The interactive
map uses react-simple-maps 3.0 with a Mercator projection
rendering TopoJSON country geometries from the world-atlas
package at 110m resolution. Animations are implemented with
Framer Motion 12, and milestone celebrations use
canvas-confetti. The sole backend is Supabase (hosted
PostgreSQL with Row-Level Security), accessed directly from the browser
via the client SDK.
A defining architectural decision is the complete absence of
server-side routes. The application is a single-route, fully
client-rendered page. All data is fetched from Supabase on mount via a
single parallel Promise.all call, and every subsequent
computation executes in the browser. This architecture maximises
interaction responsiveness: once the initial data load completes (861
country–language pairs, 408 languages, 197 countries, 80 fun facts),
every language toggle produces instantaneous visual feedback with zero
network latency.
The trade-off is that the initial load requires a Supabase fetch before any meaningful UI is shown. This is mitigated by the small dataset size and is considered acceptable because users orient themselves to the interface during the loading period.
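The parallel initial load might look roughly like the sketch below. To keep it self-contained, the Supabase client is replaced by a hypothetical `fetchTable` callback; the table names are assumed to match the database tables, and the actual query code is not shown in the text.

```typescript
// Hypothetical stand-in for a per-table query made through the client SDK.
interface TableFetcher {
  (table: string): Promise<unknown[]>;
}

// Start all four reads at once and wait for them together, so the total
// wait is roughly that of the slowest request rather than the sum.
async function loadInitialData(fetchTable: TableFetcher) {
  const [countries, languages, countryLanguages, funFacts] = await Promise.all([
    fetchTable("countries"),
    fetchTable("languages"),
    fetchTable("country_languages"),
    fetchTable("fun_facts"),
  ]);
  return { countries, languages, countryLanguages, funFacts };
}
```

Note that `Promise.all` rejects as soon as any one request fails, so a real loader would need an error state for the case where a single table cannot be fetched.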
The application uses a two-layer state model.
DataProvider (a React context) is the single source of truth
for raw data, exposing four pre-built lookup Map structures
optimised for the computation layer’s access patterns.
useLanguageSelection (a custom hook) owns the sole piece of
mutable state, the array of selected language codes, and derives every
downstream statistic via useMemo with
selectedCodes as the dependency.
This design ensures that a language toggle triggers exactly one state
update and exactly one round of memoised recomputation, after which
React re-renders only the affected components. The most
performance-critical data structure is countryLanguagesMap,
which groups all country–language pairs by language code so that the
computation layer iterates only the relevant countries per selected
language, reducing the inner loop to roughly $O(k \cdot m)$ steps, where $k$ is the
number of selected languages and $m$ is the average number of
countries per language.
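A language-keyed index of this kind can be sketched as follows. The row shape and function names are hypothetical, and the real structure may carry more fields.

```typescript
interface PairRow {
  countryCode: string;
  languageCode: string;
  populationPct: number;
}

// Group every country-language pair under its language code.
function buildCountryLanguagesMap(pairs: PairRow[]): Map<string, PairRow[]> {
  const byLanguage = new Map<string, PairRow[]>();
  for (const pair of pairs) {
    const bucket = byLanguage.get(pair.languageCode);
    if (bucket) {
      bucket.push(pair);
    } else {
      byLanguage.set(pair.languageCode, [pair]);
    }
  }
  return byLanguage;
}

// Sum raw coverage per country, visiting only the pairs that belong to
// the selected languages instead of scanning the whole dataset.
function rawCoverageByCountry(
  index: Map<string, PairRow[]>,
  selectedCodes: string[]
): Map<string, number> {
  const raw = new Map<string, number>();
  for (const code of selectedCodes) {
    for (const pair of index.get(code) ?? []) {
      raw.set(pair.countryCode, (raw.get(pair.countryCode) ?? 0) + pair.populationPct);
    }
  }
  return raw;
}
```

Countries that contain none of the selected languages never appear in the result, so callers would treat a missing key as zero coverage.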
Five PostgreSQL tables are hosted on Supabase: countries
(197 rows; alpha-2 primary key, alpha-3, numeric code, name, continent,
population), languages (408 rows; ISO 639 code as primary
key, name, family, sub-family, total speakers),
country_languages (861 rows; foreign keys to both parent
tables, population percentage, official status), fun_facts
(80 rows; language trivia), and bug_reports (user feedback
with automatically collected browser context). Row-Level Security
policies ensure that the public data tables allow anonymous reads while
bug reports require authenticated access.
The world map encodes both which languages are present and how strongly they are represented. Countries with no selected language present are rendered in neutral dark grey. Single-language countries receive their language’s colour at an opacity that increases with coverage, so that even faintly covered countries are visibly tinted while high-coverage countries appear richly saturated. Countries where multiple selected languages are present receive an SVG diagonal stripe pattern interleaving the colours of the contributing languages, with stripe width adapting from 4 pixels for two languages to 3 pixels for three or more.
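The stripe rule can be illustrated with a small helper that builds an SVG pattern string. This is a sketch only: the 45-degree rotation, the pattern layout, and the function names are assumptions, and the application may construct its patterns differently (for instance as React elements).

```typescript
// 4 px stripes for two languages, 3 px for three or more.
function stripeWidthPx(languageCount: number): number {
  return languageCount >= 3 ? 3 : 4;
}

// Build a <pattern> of vertical bands, one per colour, rotated so the
// bands read as diagonal stripes when used as a fill.
function stripePatternSvg(patternId: string, colours: string[]): string {
  const width = stripeWidthPx(colours.length);
  const size = width * colours.length;
  const bands = colours
    .map(
      (colour, i) =>
        `<rect x="${i * width}" y="0" width="${width}" height="${size}" fill="${colour}"/>`
    )
    .join("");
  return (
    `<pattern id="${patternId}" patternUnits="userSpaceOnUse" ` +
    `width="${size}" height="${size}" patternTransform="rotate(45)">` +
    `${bands}</pattern>`
  );
}

const twoLanguagePattern = stripePatternSvg("stripes-example", ["#e63946", "#457b9d"]);
```

A country path would then reference the pattern with a fill of the form `url(#stripes-example)`.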
Hovering over any country displays a tooltip showing the country name, capped coverage percentage, and a list of contributing languages with their individual percentages.
On desktop viewports (≥ 1024 px), the interface presents a three-column layout: a 320 px fixed sidebar housing the language selector, a main content area with the stats panel and map, and a lower row with continent bars and fun facts. On mobile viewports, the sidebar is replaced by a bottom sheet that shows a compact summary bar (global reach percentage, country count, language count) in its collapsed state and expands to a tabbed interface providing full access to statistics, language selection, and fun facts. Map panning is disabled on mobile to avoid gesture conflicts with the bottom sheet, while zoom remains accessible via on-screen controls.
Following the initial database seeding, a systematic cross-check was conducted against three independent sources to identify and correct residual errors.
For 75 of the most-populated countries, the language percentage field was parsed from the Factbook’s structured JSON (Central Intelligence Agency, n.d.) using regular expression extraction. Matched language-country pairs where the absolute difference exceeded 15 percentage points were flagged. Only pairs with discrepancies exceeding 20 percentage points were automatically corrected, using the arithmetic mean of the CLDR and Factbook values as a conservative split. Pairs protected by manual overrides were excluded.
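The two-threshold rule can be expressed as a small decision function. The sketch below assumes that override-protected pairs are skipped entirely and that both thresholds are strict; the names are hypothetical.

```typescript
type CheckAction = "none" | "flag" | "correct";

interface CheckResult {
  action: CheckAction;
  value: number; // percentage to keep after the check
}

function crossCheckPct(
  cldrPct: number,
  factbookPct: number,
  hasManualOverride: boolean
): CheckResult {
  const diff = Math.abs(cldrPct - factbookPct);
  // Protected pairs and small discrepancies are left untouched.
  if (hasManualOverride || diff <= 15) {
    return { action: "none", value: cldrPct };
  }
  // Large discrepancies are split down the middle.
  if (diff > 20) {
    return { action: "correct", value: (cldrPct + factbookPct) / 2 };
  }
  // Between 15 and 20 points: record the disagreement but keep the value.
  return { action: "flag", value: cldrPct };
}
```

For example, values of 70% and 40% differ by 30 points and would be replaced by their mean of 55%, whereas 60% and 42% differ by 18 points and would only be flagged.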
The Wikipedia article “List of languages by total number of speakers” was parsed to extract speaker count estimates sourced primarily from Ethnologue. For languages with at least 10 million speakers, relative differences exceeding 30% triggered an automatic correction using the Wikipedia/Ethnologue value, on the grounds that Ethnologue’s estimates for large languages are more refined than the sum of CLDR’s per-country percentages.
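A sketch of this rule is given below. It assumes that the 10 million cut-off is applied to the reference figure and that the relative difference is measured against the reference value; neither detail is stated in the text, and the function name is hypothetical.

```typescript
// Returns the speaker total to keep after comparing with the reference.
function reconcileTotalSpeakers(ourTotal: number, referenceTotal: number): number {
  // Small languages are left alone (assumed: threshold on the reference).
  if (referenceTotal < 10000000) return ourTotal;
  const relativeDiff = Math.abs(ourTotal - referenceTotal) / referenceTotal;
  // Strictly more than 30% apart: adopt the reference figure.
  return relativeDiff > 0.3 ? referenceTotal : ourTotal;
}
```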
The REST Countries API (Clavijo, n.d.) was used for two purposes: country populations differing by more than 10% from our figures were updated, and languages listed as official by the API but missing official status in our records were upgraded.
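Both checks can be sketched in one comparison function. The data shapes here are hypothetical simplifications: the real API response has a richer structure, and the sketch does not verify that a language to be upgraded already exists as a country-language pair in the database.

```typescript
interface CountrySnapshot {
  population: number;
  officialLanguages: string[]; // language codes with official status
}

interface CountryReconciliation {
  population: number;
  upgradeToOfficial: string[];
}

function reconcileWithApi(
  ours: CountrySnapshot,
  api: CountrySnapshot
): CountryReconciliation {
  // Relative difference measured against our own figure.
  const relativeDiff = Math.abs(api.population - ours.population) / ours.population;
  const population = relativeDiff > 0.1 ? api.population : ours.population;
  // Languages the API treats as official that we do not yet mark as such.
  const upgradeToOfficial = api.officialLanguages.filter(
    (code) => ours.officialLanguages.indexOf(code) === -1
  );
  return { population, upgradeToOfficial };
}
```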
The cross-check pass raised 124 flags in total. Of these, 56 resulted
in automatic corrections: 39 population updates, 14 official status
upgrades, and 3 percentage corrections. The remaining 68 flags were
retained as documented discrepancies, the majority representing
country–language pairs present in external sources but absent from the
database entirely — a condition requiring manual resolution since no
reliable population_pct could be inferred from the external
sources alone.
Several limitations warrant explicit acknowledgement.
The population_pct values represent proficiency
prevalence rather than exclusive primary-language affiliation. A person
counted under both English and French for a given country is not
double-counted in the coverage computation (due to the 100% cap), but
the underlying data does not encode which individuals overlap. The cap
is therefore a population-level approximation rather than an
individual-level accounting.
The fixed world population constant does not reflect
real-time demographic change, and individual country populations are
point-in-time estimates. The total_speakers field reflects
speakers within the 197-country scope only, excluding diaspora
populations in excluded territories and speakers in countries where the
language falls below the 0.5% threshold.
CLDR’s characterisation of L2 proficiency is not methodologically uniform across countries: in some cases it reflects official-language-in-education policies, in others self-reported census responses, and in others contributor estimates. The manual overrides and cross-check corrections address the most egregious known instances, but they cannot eliminate the underlying heterogeneity.
Finally, the three coverage thresholds (10%, 25%, 40%) are heuristic rather than empirically calibrated. They encode reasonable intuitions about communicative utility but should not be interpreted as precise sociolinguistic boundaries.
Users should therefore interpret the coverage percentages as order-of-magnitude estimates of communicative reach rather than precise demographic measurements.