Methodology

This vignette walks through the complete interpElections methodology step by step. Computationally heavy steps show pre-rendered outputs and figures. Small toy examples (eval = TRUE) can be run interactively.

Part I: Motivation and Pipeline Overview

1. Why Sub-Municipal Electoral Geography Matters

Electoral analysis in Brazil faces a fundamental data gap. Votes are counted at polling stations — points on a map, not geographic zones. The Superior Electoral Court (TSE) publishes detailed results per station: how many votes each candidate received, turnout statistics, the age and gender composition of registered voters. But polling stations do not have defined geographic boundaries. A voter who lives in tract A and walks past tract B may vote at a station located in tract C.

Meanwhile, the Census Bureau (IBGE) organizes demographic data by census tracts — small geographic polygons with known population counts by age, gender, education, and income. These tracts are the finest spatial unit at which demographic data is available.

The question is: how did each neighborhood vote? To answer this, we need to bridge two incompatible data systems: point-level electoral data and polygon-level census data. That is what this package does.

Age structure as a proxy for socioeconomic status

The bridge between these two systems is the age structure of the population. Age brackets are observed at both levels: the census records how many people of each age live in each tract, and the TSE records how many voters of each age voted at each station.

But why is age structure useful? Because it encodes far more than just demographics. In Brazilian cities, the age composition of a neighborhood is strongly correlated with its socioeconomic profile. Consider two neighborhoods in Rio de Janeiro:

Rocinha, one of Brazil’s largest favelas, has a young population: large cohorts in the 18–29 age range, tapering off sharply after 50. Copacabana, one of Rio’s wealthiest neighborhoods, shows the opposite pattern: a narrower base of young adults and a large proportion of residents over 50. This difference reflects fertility rates, life expectancy, and migration patterns that are tightly linked to income, education, and living conditions.

Census tract codes. Neighborhoods are defined by aggregating IBGE 2022 census tracts whose centroids fall within the 2010 neighborhood boundaries (via geobr::read_neighborhood()). Rocinha comprises 115 tracts (codes starting with 330455705330, subdistrict 0533); Copacabana comprises 420 tracts (codes starting with 330455705100, subdistrict 0510). Tracts shown in green on the map have zero population and are excluded from the interpolation.

Because these age signatures differ systematically across neighborhoods, they carry information about socioeconomic composition — and, by extension, about voting behavior. The method we describe exploits this: it finds weights that reproduce the census age structure at each tract when applied to station-level voter data. If the weights are correct for demographics, they should also produce good estimates for vote counts.

2. The Problem: Many-to-Many Spatial Disaggregation

Imagine you are a researcher who wants to know how Copacabana voted in the 2022 presidential election. The TSE can tell you how each polling station in the area voted. But several things make this harder than it seems:

One station, many neighborhoods: a school in Botafogo may receive voters from Botafogo, Laranjeiras, and even from the Santa Marta favela up the hill. The station’s vote total is a mix of all these populations.
One neighborhood, many stations: Copacabana voters may be assigned to polling stations in Copacabana itself, in neighboring Ipanema, or in Leme. The neighborhood’s voters are split across multiple stations.
No lookup table: unlike in some countries, there is no official mapping from residential addresses to polling stations. The relationship is many-to-many, and there is no administrative data to resolve it.

This is not a standard areal interpolation problem. In areal interpolation, both source and target are polygons, and the overlap area provides natural weights. Here, the sources are points (stations), not polygons. Area-weighted interpolation does not apply. Nor is it a simple point-in-polygon assignment: since the mapping is many-to-many, assigning each tract to its nearest station (Voronoi) would lose information and produce poor estimates (we demonstrate this in Section 17).

The approach: estimating the association structure

Our solution estimates the association structure between tracts and stations — a weight matrix $W$ of dimension $[N \times M]$ where each entry $W_{ij}$ captures how much station $j$ “serves” tract $i$ .

Two fundamental properties are enforced:

Source conservation (column constraint): each station distributes exactly 100% of its data. The total votes in the municipality are preserved exactly: $\sum_i W_{ij} = 1$ for all $j$ .
Population proportionality (row constraint): each tract receives total weight proportional to its population share: $\sum_j W_{ij} \propto \text{pop}_i$ .

The method does not assume a fixed assignment of voters to stations. Instead, it estimates a smooth probabilistic mapping, calibrated against known demographic data.

Walking to vote: a behavioral assumption

The key behavioral assumption is that voters walk to their polling stations. This grounds the spatial model: closer stations receive more weight because voters are more likely to walk to them.

This assumption is supported by empirical evidence. Pereira et al. (2023) studied the effect of free public transit on election day in Brazilian cities, finding that it increased turnout — implying that transportation costs (including walking distance) are a meaningful barrier. Brazilian electoral law requires that voters be assigned to polling stations near their registered address, reinforcing the spatial proximity assumption.

Pereira, R. H. M., Braga, C. K. V., Serra, B., & Nadalin, V. (2023). “Free public transit and voter turnout.” Available at: https://www.urbandemographics.org/files/2023_Pereira_et_al_free_public_transit_voter_turnout.pdf

A second important assumption is that voters reside in the municipality where they are registered to vote, which is the same municipality captured by the census. This residential assumption is what allows us to use census population data as the target distribution: the age structure of voters at a polling station should reflect the age structure of the surrounding population. Without this assumption, the calibration step would not be valid.

We use realistic walking travel times computed on the OpenStreetMap road network, not straight-line distances. Mountains, rivers, and the street layout all affect the actual travel time and are captured by the routing engine (see Section 9).

3. Pipeline Overview

The full interpolation pipeline has six steps:

Census data: population counts by age bracket per tract (br_prepare_population())
Tract geometries: census tract polygons from IBGE (br_prepare_tracts())
Electoral data: votes, turnout, and voter demographics per station (br_prepare_electoral())
Travel times: walking routes from tract representative points to stations (compute_travel_times())
Optimization: find per-tract decay parameters $\alpha$ that minimize demographic mismatch (optimize_alpha())
Interpolation: use the column-normalized weight matrix from optimization (or rebuild with compute_weight_matrix()) and apply it to any station-level variable ( $\hat{X} = W \times V$ )

The convenience function interpolate_election_br() wraps all six steps into a single call for Brazilian municipalities, handling data downloads, geocoding, and default parameter selection automatically.

The figure above illustrates the spatial data layers involved. Starting from the bottom: the municipality boundary, census tracts colored by population, polling station locations, and the final interpolated results at the tract level.

Part II: Data

4. Census Population Data

pop_data <- br_prepare_population("3170701", year = 2022)
head(pop_data[, 1:10])

  code_tract    code_muni pop_00_04 pop_05_09 pop_10_14 pop_15_19 pop_20_24 pop_25_29 pop_30_39 pop_40_49
1 3170701xxxxx  3170701      68        68        62        88        68        58       144       129
2 3170701xxxxx  3170701      37        36        47        38        47        44        83        95
...

The data frame also includes calibration columns that cross gender (male/female) × 7 age brackets. These calibration variables bridge census tracts and polling stations.

Age bracket	Census column	TSE equivalent
18–19	`pop_hom_18_19`, `pop_mul_18_19`	`vot_hom_18_19`, `vot_mul_18_19`
20–24	`pop_hom_20_24`, `pop_mul_20_24`	`vot_hom_20_24`, `vot_mul_20_24`
25–29	`pop_hom_25_29`, `pop_mul_25_29`	`vot_hom_25_29`, `vot_mul_25_29`
30–39	`pop_hom_30_39`, `pop_mul_30_39`	`vot_hom_30_39`, `vot_mul_30_39`
40–49	`pop_hom_40_49`, `pop_mul_40_49`	`vot_hom_40_49`, `vot_mul_40_49`
50–59	`pop_hom_50_59`, `pop_mul_50_59`	`vot_hom_50_59`, `vot_mul_50_59`
60–69	`pop_hom_60_69`, `pop_mul_60_69`	`vot_hom_60_69`, `vot_mul_60_69`

How the 7 brackets are built. The TSE voter profile data records voters in 13 finer age bins (18, 19, 20, 21–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65–69). br_prepare_electoral() aggregates these into the 7 groups above so they align with the census age brackets. For the 2022 Census — which reports a 15–19 group instead of 18–20 — the calibration matching step applies a $2/5$ proxy to approximate the 18–19 voting-age share from the census tract population data.

These 7 matched brackets are the key insight: they are observed at both levels (census tracts and polling stations), enabling calibration. The calibration crosses gender × 7 age groups (14 brackets), providing a strong spatial signal because the optimizer must simultaneously match the spatial distribution of each gender–age subgroup.

5. Census Tract Geometries

tracts_sf <- br_prepare_tracts(3170701, pop_data, year = 2022)

This downloads tract geometries from geobr, transforms to EPSG:5880 (equal-area projection for accurate spatial operations), and joins the population columns. Tracts with very low population can be filtered later via the min_tract_pop parameter in interpolate_election().

Note the urban-rural gradient: small, densely populated tracts in the city center; large, sparsely populated tracts on the periphery.

6. Electoral Data at Polling Stations

electoral <- br_prepare_electoral(
  code_muni_ibge = "3170701",
  code_muni_tse  = "54135",
  uf = "MG", year = 2022,
  cargo = "presidente", turno = 1,
  what = c("candidates", "turnout")
)

The result contains one row per voting location with:

Vote columns (CAND_13, CAND_22, …): votes per candidate (ballot number as suffix; 95 = blank, 96 = null)
Turnout (QT_COMPARECIMENTO): total voters who showed up
Voter age profiles (votantes_18_19, …, votantes_65_69): TSE’s fine-grained age registration brackets, aggregated to match the 7 census brackets
Coordinates (lat, long): geocoded polling station locations
Column dictionary: attr(electoral, "column_dictionary") contains metadata for each column (type, cargo, candidate name, party)

The what parameter controls what gets included:

"candidates": one column per candidate
"parties": aggregated by party (PARTY_PT, PARTY_PL, …)
"turnout": QT_COMPARECIMENTO, QT_APTOS, QT_ABSTENCOES
"demographics": GENERO_* and EDUC_* columns

7. Calibration Variables: Comparing Source and Target

The calibration variables are the bridge between census tracts and polling stations. The same age brackets are observed at both levels:

Census: “How many people aged 18–19 live in this tract?” (from IBGE)
Electoral: “How many registered voters aged 18–19 voted at this location?” (from TSE)

These are the same age brackets we saw distinguishing Rocinha from Copacabana in Section 1. The demographic signature that varies across neighborhoods is precisely what the method uses to calibrate the weights.

The shapes are similar but totals differ: the census counts residents, the TSE counts registered voters. Not everyone of voting age is registered, and registration rates vary by age bracket.

The optimization finds alpha values so that the interpolated voter age profiles match the census age profiles at each tract. If the weights are correct for demographics, they should also be correct for vote counts.

Part III: Geography

8. Representative Points: From Tracts to Routable Origins

Routing engines need point-to-point queries, but tracts are polygons. We need a single representative point per tract. Three methods are available:

Method	Function	Guarantee
`"centroid"`	`sf::st_centroid()`	Fast. May fall outside concave polygons.
`"point_on_surface"`	`sf::st_point_on_surface()`	Guaranteed inside. Default.
`"pop_weighted"`	WorldPop raster + `terra::patches()`	Cluster-based: largest population cluster with road-proximity refinement. Best for large tracts.

The choice of representative point matters most for large, irregularly shaped tracts in rural and peri-urban areas. Consider the southern periphery of São Paulo, where census tracts extend over several square kilometers of mixed urban and forested land:

The three-panel comparison shows the same set of large tracts in south São Paulo. The red dots (centroids) frequently fall in uninhabited areas — in the middle of Mata Atlântica forest or in empty fields. The blue dots (point on surface) are guaranteed to be inside the polygon, but may still be far from where people actually live. The green dots (pop-weighted) use the WorldPop population raster to find the most densely populated cell within each tract, placing the representative point where residents actually are.

Zooming into the largest tract, the centroid falls in dense forest while the pop-weighted point correctly identifies the settlement area. This difference can affect travel time calculations by tens of minutes.

pts_pos <- compute_representative_points(tracts_sf,
  tract_id = "code_tract", method = "point_on_surface")
pts_pop <- compute_representative_points(tracts_sf,
  tract_id = "code_tract", method = "pop_weighted")

The pop_weighted method downloads the WorldPop Brazil Constrained 2020 raster (~48 MB, cached) and uses a multi-step algorithm:

Identify connected clusters of populated cells via terra::patches()
Select the largest cluster by total population
If OSM road data is available, apply tiered road proximity refinement (cells within 200 m of roads are preferred)
Within the selected cluster (and road-proximate subset), choose the cell with the highest population density

The min_area_for_pop_weight parameter (default: 1 km²) controls the threshold: tracts smaller than this use point_on_surface regardless of the chosen method, since small urban tracts are dense enough that any interior point is a reasonable proxy.

9. Travel Times: Building the Distance Matrix

9.1 Why travel time?

In Brazil, voters are assigned to polling stations near their registered address. On election day, most voters walk to their polling station. This is not merely an assumption of convenience — it is grounded in the geographic design of the Brazilian electoral system and in empirical evidence.

Pereira et al. (2023) studied the introduction of free public transit on election days in Brazilian cities. Their finding that free transit significantly increased turnout implies that the cost of getting to the polls — primarily walking distance — is a real barrier that affects participation. If distance matters for whether people vote at all, it certainly matters for where they vote.

We therefore use realistic travel times computed on the OpenStreetMap road network, not Euclidean (straight-line) distances. The routing engine (r5r) accounts for the actual street layout, elevation, one-way streets, pedestrian paths, and physical barriers. The default AUTO mode attempts walking first and falls back to car routing if needed.

9.2 Euclidean distance vs. walking routes

Why not just use straight-line distance? Because geography creates dramatic discrepancies between Euclidean distance and actual walking time. Two examples:

Example 1: Mountain barrier (Rio de Janeiro)

A point in the Botafogo neighborhood lies at the base of the Corcovado/Sumaré hills. A polling station in Laranjeiras sits on the other side. In a straight line, the distance is modest — but the walking route must navigate around the steep terrain:

The solid blue line shows the actual walking route computed by r5r on the OSM road network. The dotted red line shows the straight-line distance. In mountainous terrain, these can differ by a factor of 3–5x.

Example 2: Water barrier (Recife)

A point in the Boa Vista neighborhood and a destination in the Brasília Teimosa peninsula are separated by Recife’s waterways. The walking route must navigate around the estuary and cross bridges, adding significant distance compared to the straight line.

Example 3: Water barrier (Florianópolis)

Florianópolis, capital of Santa Catarina, is split between the mainland and an island in the Atlantic. A point in Coqueiros (mainland) and a destination at Costeira do Pirajubaé (island) are separated by the bay. The walking route must detour to one of the bridges connecting the two sides:

These examples illustrate why Euclidean distance is a poor proxy for actual accessibility — and why the routing engine matters.

9.3 Building the travel time matrix

time_matrix <- compute_travel_times(
  tracts_sf = tracts_sf,
  points_sf = electoral_sf,
  network_path = network_path,
  tract_id = "code_tract",
  point_id = "id",
  routing = routing_control()
)

This builds an [N tracts × M stations] matrix of travel times in minutes, using the r5r routing engine on OpenStreetMap data.

Origins: representative points (from Section 8)
Destinations: voting location coordinates
Mode: AUTO (default) — attempts walking first, then falls back to car routing if more than 1% of pairs are unreachable by foot
Max trip duration: 120 minutes (default); pairs beyond this threshold are NA (unreachable) and receive exactly zero weight

The OSM data is downloaded via download_r5r_data(), which fetches state-level .pbf files and clips them to the municipality bounding box using osmium.

The heatmap shows spatial structure: a block-diagonal pattern where nearby tract-station pairs have short travel times.

Tracts in the center have short travel times to the nearest station; peripheral tracts are farther away.

The offset: before applying the power-law IDW kernel (the default), we add 1 to all travel times (time_matrix + 1) to avoid singularity when $t = 0$ . The exponential kernel does not need this offset (see Section 10).

Part IV: The Weight Matrix

10. The IDW Kernel: Turning Distance into Influence

Intuition: gravitational pull

Think of each polling station as exerting a “gravitational pull” on nearby tracts. Closer stations pull harder. The IDW (Inverse Distance Weighting) kernel translates this intuition into numbers: given a travel time between a tract and a station, it produces a weight that decreases with distance.

A concrete example

Suppose a tract has three nearby stations at 5, 12, and 30 minutes walking distance. How much should each station contribute to this tract’s estimate? With a decay parameter $\alpha = 2$ :

tt <- c(5, 12, 30)  # travel times in minutes
alpha <- 2
raw_weights <- (tt + 1)^(-alpha)
normalized <- raw_weights / sum(raw_weights)
cat("Travel times:", tt, "minutes\n")
#> Travel times: 5 12 30 minutes
cat("Raw weights: ", round(raw_weights, 5), "\n")
#> Raw weights:  0.02778 0.00592 0.00104
cat("Normalized:  ", round(normalized, 3), "\n")
#> Normalized:   0.8 0.17 0.03
cat("The 5-min station gets", round(normalized[1] * 100),
    "% of this tract's weight\n")
#> The 5-min station gets 80 % of this tract's weight

The nearest station (5 min) gets the lion’s share of the weight. The 30-minute station contributes very little.

The formula

The package supports two kernel functions, selected via optim_control(kernel = ...):

Power-law kernel (default): $K_{ij} = (t_{ij} + 1)^{-\alpha_i}$

Exponential kernel: $K_{ij} = \exp(-\alpha_i \cdot t_{ij})$

Both share the unified form $\log K_{ij} = -\alpha_i \cdot \phi(t_{ij})$ , where $\phi(t) = \log(t + 1)$ for the power kernel and $\phi(t) = t$ for the exponential. Everything downstream (column normalization, Poisson deviance, gradient computation) is kernel-agnostic.

The kernels differ in tail behavior. For the power kernel, the relative decay $K(2t)/K(t) \approx 2^{-\alpha}$ is approximately constant regardless of distance (heavy tail). For the exponential kernel, the relative decay $K(2t)/K(t) = \exp(-\alpha \cdot t)$ increases with distance (light tail). These are structural differences in shape, not claims about which kernel produces more localized weights — that depends on what $\alpha$ the optimizer finds.

Note that $\alpha$ has different units across kernels: it is dimensionless for the power kernel but has units of $\text{min}^{-1}$ for the exponential. The optimizer finds the appropriate scale automatically, so comparing kernels at the “same alpha” is meaningless.

For the default power kernel, the raw weight between tract $i$ and station $j$ is:

$W^{\text{raw}}_{ij} = (t_{ij} + 1)^{-\alpha_i}$

where:

$t_{ij}$ : travel time from tract $i$ to station $j$ (minutes)
$\alpha_i$ : decay parameter for tract $i$ (to be optimized)

How alpha controls the decay

tt_range <- 0:60
alphas <- c(0.5, 1, 2, 5, 10)
decay_df <- do.call(rbind, lapply(alphas, function(a) {
  data.frame(time = tt_range, weight = (tt_range + 1)^(-a),
             alpha = paste0("alpha = ", a))
}))
decay_df$alpha <- factor(decay_df$alpha,
                          levels = paste0("alpha = ", alphas))

library(ggplot2)
ggplot(decay_df, aes(x = time, y = weight, color = alpha)) +
  geom_line(linewidth = 0.8) +
  scale_y_log10() +
  labs(title = "Weight Decay Curves",
       subtitle = "Higher alpha = steeper decay = more concentrated weights",
       x = "Travel time (minutes)", y = "Weight (log scale)",
       color = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "right")

$\alpha = 0.5$ : nearly flat — all stations contribute almost equally
$\alpha = 2$ : moderate — stations within 10 minutes dominate
$\alpha = 10$ : steep cliff — only the nearest station matters

Why per-tract alpha?

A single global $\alpha$ cannot capture the heterogeneity of urban geography. Urban centers have many overlapping station catchments — these tracts need a low $\alpha$ to spread weight across several stations. Isolated peripheral tracts with one nearby station need a high $\alpha$ to concentrate weight. The optimizer finds the best $\alpha$ for each tract independently.

11. Why Single-Sided Standardization Fails

Before describing the column normalization approach used by the package, let us understand why simpler row-based approaches do not work. We need to normalize the raw weight matrix so that the weights sum to meaningful quantities. There are two natural choices — and only column normalization provides the guarantee we need.

11.1 Column-only standardization

The first approach: divide each column by its sum so that each station distributes 100% of its data:

$W^{\text{col}}_{ij} = \frac{W^{\text{raw}}_{ij}}{\sum_i W^{\text{raw}}_{ij}}$

Column sums = 1 (source conservation: every vote is distributed exactly once). But there is no constraint on row sums.

# Toy example: 5 tracts, 3 stations
tt_toy <- matrix(
  c(2,  10, 25, 30, 50,   # station 1: close to tract 1
    15,  3, 12, 28, 45,   # station 2: close to tract 2
    40, 35, 20,  5,  8),  # station 3: close to tracts 4-5
  nrow = 5, ncol = 3
)
pop_toy <- c(100, 300, 150, 200, 250)  # population per tract

alpha_toy <- rep(2, 5)
W_raw <- (tt_toy + 1)^(-alpha_toy)

# Column standardize
W_col <- t(t(W_raw) / colSums(W_raw))

cat("Column sums (should be 1 — OK):\n")
#> Column sums (should be 1 — OK):
round(colSums(W_col), 4)
#> [1] 1 1 1

cat("\nRow sums (uncontrolled — BAD):\n")
#> 
#> Row sums (uncontrolled — BAD):
round(rowSums(W_col), 4)
#> [1] 0.9751 0.9300 0.1439 0.6594 0.2917

cat("\nPopulation shares (what rows should look like):\n")
#> 
#> Population shares (what rows should look like):
round(pop_toy / sum(pop_toy) * 3, 4)
#> [1] 0.30 0.90 0.45 0.60 0.75

Tract 1 has only 10% of the population but receives a disproportionate share because it is close to station 1. Municipal totals are preserved, but the distribution across tracts is wrong.

11.2 Row-only standardization

The opposite approach: normalize rows to match population shares.

# Row standardize to population-proportional targets
m <- ncol(tt_toy)
row_targets_toy <- pop_toy / sum(pop_toy) * m

W_row <- W_raw * (row_targets_toy / rowSums(W_raw))

cat("Row sums (should match population targets — OK):\n")
#> Row sums (should match population targets — OK):
round(rowSums(W_row), 4)
#> [1] 0.30 0.90 0.45 0.60 0.75

cat("\nColumn sums (should be 1 — BAD):\n")
#> 
#> Column sums (should be 1 — BAD):
round(colSums(W_row), 4)
#> [1] 0.5038 1.1226 1.3736

Now the rows are correct, but the column sums are wrong. Some stations “distribute” more than 100% of their votes, others less than 100%. The municipal total is no longer conserved: if we multiply these weights by vote counts, we get more (or fewer) total votes than actually existed.

The choice: column normalization

Column-only: preserves totals (source conservation) — every vote is distributed exactly once. The spatial distribution across tracts is governed by the optimized alpha parameters, which are calibrated against census demographics.
Row-only: distributes correctly but does not preserve totals.

Column normalization is the right choice because source conservation is the non-negotiable constraint: we must not create or destroy votes. The alpha optimization (Section 13) takes care of making the row-side distribution match census demographics as closely as possible.

12. Column Normalization: Source Conservation

12.1 The idea: preserving station totals

The core constraint is source conservation: each polling station must distribute exactly 100% of its data. If a station recorded 500 votes, the interpolation must assign exactly 500 votes across all tracts — no more, no less.

Column normalization enforces this directly: divide each column of the raw weight matrix by its sum so that each station’s weights sum to 1.

$W_{ij} = \frac{K_{ij}}{\sum_i K_{ij}}$

where $K_{ij} = (t_{ij} + 1)^{-\alpha_i}$ is the IDW kernel from Section 10.

12.2 Worked example

# Using the toy example from Section 11
# W_raw is already defined: 5 tracts, 3 stations
W_col <- t(t(W_raw) / colSums(W_raw))

cat("Column sums (should be 1 — source conservation):\n")
#> Column sums (should be 1 — source conservation):
round(colSums(W_col), 4)
#> [1] 1 1 1

cat("\nRow sums (governed by alpha — not explicitly constrained):\n")
#> 
#> Row sums (governed by alpha — not explicitly constrained):
round(rowSums(W_col), 4)
#> [1] 0.9751 0.9300 0.1439 0.6594 0.2917

cat("\nWeight matrix:\n")
#> 
#> Weight matrix:
print(round(W_col, 4))
#>        [,1]   [,2]   [,3]
#> [1,] 0.9087 0.0528 0.0136
#> [2,] 0.0676 0.8448 0.0176
#> [3,] 0.0121 0.0800 0.0518
#> [4,] 0.0085 0.0161 0.6348
#> [5,] 0.0031 0.0064 0.2821

Each station distributes exactly 100% of its data. The distribution across tracts is determined by the IDW kernel: tracts closer to a station receive more weight. The alpha optimization (Section 13) adjusts the decay parameters so that the resulting tract-level estimates match census demographics as closely as possible.

Key properties:

Source conservation: column sums are exactly 1
Non-negative: all weights are non-negative
Deterministic: no iterative procedure, just a single division
Unreachable tracts (all-zero rows) remain zero and are flagged

12.3 Per-Bracket Column Normalization

The problem: age groups have different geographies

Think about the spatial distribution of voters by age. Young adults (18–24) cluster near universities, cheap rental apartments, and transit corridors. Elderly voters (60–69) concentrate in established neighborhoods, retirement communities, and areas with healthcare facilities. These spatial patterns differ substantially — yet a single weight matrix must allocate all age groups simultaneously.

With aggregate column normalization, the weight matrix has one set of weights for all age brackets. The optimizer must find one set of weights that simultaneously matches multiple different demographic distributions. This is an inherent compromise: weights that correctly distribute 18–20 year-olds may systematically misallocate 60–69 year-olds, and vice versa.

The per-bracket extension

Per-bracket column normalization addresses this by building a separate column-normalized weight matrix for each demographic bracket $k = 1, \ldots, K$ . Instead of one weight matrix $W$ , we build $K$ bracket-specific matrices $W^{(1)}, \ldots, W^{(K)}$ , each column-normalized independently:

For each bracket $k$ :

Compute the IDW kernel: $K^{(k)}_{ij} = (t_{ij} + 1)^{-\alpha_{ik}}$ (per-tract, per-bracket decay parameter)
Column-normalize: $W^{(k)}_{ij} = K^{(k)}_{ij} / \sum_{i'} K^{(k)}_{i'j}$

Aggregate (voter-weighted):

$W_{\text{agg},ij} = \frac{\sum_{k=1}^{K} W^{(k)}_{ij} \cdot c_{kj}}{\sum_{k=1}^{K} c_{kj}}$

where $c_{kj}$ is the voter count for bracket $k$ at station $j$ . Each bracket’s contribution is weighted by the number of voters it represents at that station, so brackets with more voters have proportionally more influence on the aggregated weights.

Why this works

Each bracket’s column-normalized weights distribute that bracket’s station-level data across tracts proportionally to the IDW kernel. The alpha optimization (Section 13) adjusts the per-tract, per-bracket decay parameters so that the resulting interpolated demographics match the census demographics across all brackets simultaneously. The per-bracket structure allows different demographic groups to have different spatial reaches — young voters may draw from more distant stations than elderly voters.

# Bracket 1 ("young"): concentrated near station 1
# Bracket 2 ("old"): concentrated near station 3
pop_young <- c(60, 100, 50, 30, 20)   # young pop per tract
pop_old   <- c(40, 200, 100, 170, 230) # old pop per tract

src_young <- c(150, 70, 40)  # young voters per station
src_old   <- c(100, 330, 310) # old voters per station

# --- Per-bracket alpha values (different spatial reach) ---
alpha_young <- 1.5  # young voters: broader reach
alpha_old   <- 3.0  # old voters: more localized

# Build separate IDW kernels with different decay
K_young <- (tt_toy + 1)^(-alpha_young)
K_old   <- (tt_toy + 1)^(-alpha_old)

# Column-normalize each bracket independently
W_young <- t(t(K_young) / colSums(K_young))
W_old   <- t(t(K_old)   / colSums(K_old))

# Interpolate each bracket with its own weight matrix
young_hat <- as.numeric(W_young %*% src_young)
old_hat   <- as.numeric(W_old   %*% src_old)

# Voter-weighted aggregation: W_agg = sum(W_k * c_k) / sum(c_k)
c_total <- src_young + src_old
W_agg <- (W_young * rep(src_young, each = nrow(W_young)) +
           W_old * rep(src_old, each = nrow(W_old))) /
          rep(c_total, each = nrow(W_young))

poisson_dev <- function(obs, fit) {
  2 * sum(fit - obs * log(ifelse(obs > 0, fit, 1)) +
          ifelse(obs > 0, obs * log(obs), 0) - obs)
}

cat("=== Young bracket (alpha = 1.5, broader reach) ===\n")
#> === Young bracket (alpha = 1.5, broader reach) ===
cat("Census truth:      ", pop_young, "\n")
#> Census truth:       60 100 50 30 20
cat("Per-bracket interp:", round(young_hat), "\n")
#> Per-bracket interp: 130 70 17 28 15
cat("Deviance:          ", round(poisson_dev(pop_young, young_hat)), "\n")
#> Deviance:           103

cat("\n=== Old bracket (alpha = 3.0, more localized) ===\n")
#> 
#> === Old bracket (alpha = 3.0, more localized) ===
cat("Census truth:      ", pop_old, "\n")
#> Census truth:       40 200 100 170 230
cat("Per-bracket interp:", round(old_hat), "\n")
#> Per-bracket interp: 103 318 15 234 69
cat("Deviance:          ", round(poisson_dev(pop_old, old_hat)), "\n")
#> Deviance:           563

Different alpha values give each demographic group a different spatial reach. The optimizer finds per-tract, per-bracket decay parameters that minimize the total Poisson deviance across all brackets simultaneously.

# Visual comparison: census truth vs per-bracket estimate
comparison_df <- data.frame(
  tract = rep(1:5, 4),
  value = c(pop_young, pop_old, young_hat, old_hat),
  bracket = rep(c("Young", "Old", "Young", "Old"), each = 5),
  source = rep(c("Census truth", "Census truth",
                  "Per-bracket estimate",
                  "Per-bracket estimate"),
               each = 5)
)

ggplot(comparison_df,
       aes(x = factor(tract), y = value, fill = source)) +
  geom_col(position = "dodge", width = 0.7) +
  facet_wrap(~bracket, scales = "free_y") +
  scale_fill_manual(values = c("Census truth" = "#4575b4",
                                "Per-bracket estimate" = "#fdae61")) +
  labs(title = "Per-Bracket Column Normalization: Recovery of Age-Group Profiles",
       subtitle = "Each bracket has its own alpha and weight matrix",
       x = "Tract", y = "Population", fill = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")

Connection to the optimizer

During optimization (Section 14), the per-bracket column normalization is performed on 3D tensors of shape [ka x n x m], where ka is the number of active demographic brackets. This vectorizes all $K$ separate normalization operations into a single batched computation on GPU or CPU.

After optimization, the weight matrix $W$ is returned directly from optimize_alpha() as result$W. It can also be recomputed from pre-existing alpha values using compute_weight_matrix():

# W is already in the optimization result
W <- optim_result$W

# Or recompute from alpha
W <- compute_weight_matrix(time_matrix, alpha,
  pop_matrix = pop_matrix,
  source_matrix = source_matrix)

When pop_matrix and source_matrix are provided, the function automatically uses per-bracket mode.

Part V: Calibration and Optimization

13. The Calibration Objective

13.1 The key insight

We have all the ingredients: travel times, an IDW kernel parametrized by $\alpha$ , and column normalization to enforce source conservation. Now we need to find the right $\alpha$ .

Here is the calibration logic: if the weights are correct, then interpolating voter age profiles from stations should recover the census age profiles at each tract.

Consider a concrete example. Census data says tract X has 200 people aged 30–39. If our weights are good, then multiplying the weight row for tract X by the station-level voter counts for the 30–39 age bracket should give us approximately 200. If it gives 150 or 300, the weights are wrong — the geographic allocation is biased.

The optimization finds $\alpha$ values that minimize this mismatch across all tracts and all age brackets simultaneously.

13.2 The iterative procedure

The optimization is a loop:

In words:

Initialize: set $\alpha_i = 2$ for all tracts (moderate decay)
Build the IDW kernel: compute raw weights from travel times and current $\alpha$
Column-normalize: enforce source conservation (column sums = 1)
Interpolate demographics: multiply normalized weights by station-level age profiles
Compare with census: compute the Poisson deviance between interpolated and census age profiles
Adjust $\alpha$ : use the gradient (via automatic differentiation) to nudge $\alpha$ in the direction that reduces the loss
Repeat until the loss converges

Each iteration of this loop evaluates the complete forward pass: $\alpha \to \text{kernel} \to \text{column-normalize} \to \text{interpolation} \to \text{loss}$ .

13.3 The data-fit term: Poisson deviance

Voter counts are non-negative integers. The Poisson deviance is the natural goodness-of-fit measure for count data: it penalizes relative deviations rather than absolute ones. A tract with 10 predicted voters vs. 20 observed is penalized more heavily than a tract with 1,000 predicted vs. 1,010 observed — even though both differ by 10.

$D(\alpha) = 2 \sum_{i=1}^{N} \sum_{k=1}^{K} \left[\hat{V}_{ik} - P_{ik}\log\hat{V}_{ik} + P_{ik}\log P_{ik} - P_{ik}\right]$

where:

$\hat{V} = W(\alpha) \times S$ — interpolated voter demographics per tract, with $W$ built via per-bracket column normalization (Section 12.3)
$P$ — census demographics per tract
$k$ = demographic bracket index ( $K = 14$ for full gender $\times$ age calibration)

When $\hat{V}_{ik} = P_{ik}$ for all entries, $D = 0$ . The constant terms $P_{ik}\log P_{ik} - P_{ik}$ are precomputed once and do not depend on $\alpha$ .

13.4 Regularization: barrier, entropy, and quadratic penalties

Poisson deviance alone can produce degenerate solutions: some tracts might receive zero predicted voters (collapsed allocation), or weights might be too diffuse or too concentrated. Three penalty terms address these pathologies.

Log-barrier penalty (prevents zero-voter tracts):

$B(\alpha) = -\mu_B \sum_{i=1}^{N} \log\hat{V}_{i,\text{total}}$

where $\hat{V}_{i,\text{total}} = \sum_k \hat{V}_{ik}$ is the total predicted population in tract $i$ . The barrier diverges as any tract’s total approaches zero, preventing collapsed allocations. Default strength $\mu_B = 1$ .

Entropy penalty (controls weight concentration):

$E(\alpha) = \mu_E \sum_{i=1}^{N} H_i, \quad H_i = -\sum_{j=1}^{M} p_{ij}\log p_{ij}$

where $p_{ij}$ are the row-normalized aggregated weights. $H_i$ is the Shannon entropy of tract $i$ ’s weight distribution — higher values mean more stations contribute. The strength $\mu_E$ can be set manually or tuned automatically via dual ascent (Section 14.5).

Per-tract quadratic penalty (augmented Lagrangian):

$Q(\alpha) = \frac{\rho}{2}\sum_{i=1}^{N} \left(H_i - h_i^*\right)^2$

where $h_i^*$ is the per-tract entropy target (Section 14.5). This quadratic term works alongside the linear entropy penalty in an augmented Lagrangian framework, driving each tract’s entropy toward its individual target. Active only in adaptive mode (target_eff_src set).

13.5 The total loss

$L(\alpha) = D(\alpha) + B(\alpha) + E(\alpha) + Q(\alpha)$

The Poisson deviance $D$ dominates in magnitude; the barrier $B$ is a safety net (typically small); and the entropy terms $E + Q$ implement entropy control. In the default mode (no target_eff_src), $E = Q = 0$ and the optimizer minimizes $D + B$ only.

Figure @ref(fig:fig-loss-decomposition) shows the relative contribution of each loss component at convergence across 26 Brazilian state capitals. Deviance dominates in all cities, but Norte/Centro-Oeste capitals with structural infeasibility (Macapá, Manaus, Brasília, Rio de Janeiro) show elevated entropy penalties as the dual ascent works harder to achieve the entropy targets.

Loss decomposition at convergence across 26 state capitals. Deviance dominates, but cities with sparse road networks (Norte) show elevated entropy penalties.

# Simplified case: all tracts share the same alpha
tt_adj <- tt_toy + 1
pop_matrix_toy <- matrix(pop_toy, ncol = 1)
src_matrix_toy <- matrix(c(250, 400, 350), ncol = 1)

alpha_range <- seq(0.1, 8, by = 0.1)
deviance_values <- sapply(alpha_range, function(a) {
  # Compute column-normalized weights and Poisson deviance
  K <- tt_adj ^ (-a)
  K[!is.finite(K)] <- 0
  W <- t(t(K) / colSums(K))  # column-normalize
  v_hat <- as.numeric(W %*% src_matrix_toy)
  p <- as.numeric(pop_matrix_toy)
  # Poisson deviance: 2 * sum(v_hat - p*log(v_hat) + p*log(p) - p)
  v_safe <- ifelse(v_hat > 0, v_hat, 1e-30)
  2 * sum(v_safe - p * log(ifelse(p > 0, v_safe, 1))
          + ifelse(p > 0, p * log(p), 0) - p)
})

ggplot(data.frame(alpha = alpha_range, deviance = deviance_values),
       aes(x = alpha, y = deviance)) +
  geom_line(color = "#4575b4", linewidth = 0.8) +
  geom_vline(xintercept = alpha_range[which.min(deviance_values)],
             linetype = "dashed", color = "#d73027") +
  labs(title = "Objective Function Landscape",
       subtitle = sprintf(
         "Minimum at alpha = %.1f (shared across all tracts)",
         alpha_range[which.min(deviance_values)]),
       x = expression(alpha), y = "Poisson Deviance") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

The landscape has a clear minimum. In practice, each tract has its own alpha (per bracket), creating a high-dimensional optimization problem.

14. Optimization: Per-Bracket Gradient Descent with Column Normalization

14.1 Why automatic differentiation?

The objective function involves column normalization of the IDW kernel, followed by matrix multiplication and Poisson deviance loss. Writing the analytical gradient by hand is straightforward for simple column normalization, but becomes involved when combined with the per-bracket structure, softplus reparameterization, and multiple penalty terms.

For small problems, torch’s automatic differentiation handles this: we define the forward pass (kernel -> column-normalize -> interpolation -> loss), and torch computes the gradient of the loss with respect to $\theta$ automatically by recording every operation and applying the chain rule backward. For large problems, the package derives and uses analytical gradients that are mathematically equivalent but computationally faster (Section 14.8).

14.2 Log-domain computation for numerical stability

When $\alpha$ is large and travel times vary widely, some weights become astronomically small:

# Demonstrate the underflow problem
tt_demo <- c(5, 20, 60, 120, 200)  # travel times (minutes)
alpha_high <- 8

raw <- (tt_demo + 1)^(-alpha_high)
cat("Raw weights at alpha = 8:\n")
#> Raw weights at alpha = 8:
cat(formatC(raw, format = "e", digits = 2), "\n")
#> 5.95e-07 2.64e-11 5.22e-15 2.18e-17 3.75e-19
cat("\nIn float32, values below ~1.2e-38 become exactly 0\n")
#> 
#> In float32, values below ~1.2e-38 become exactly 0
cat("Zero weights cause division by zero in column normalization\n")
#> Zero weights cause division by zero in column normalization

In 32-bit floating point (standard for GPU computation), numbers smaller than about $1.2 \times 10^{-38}$ underflow to zero. When a weight becomes exactly zero, the column normalization produces $0/0 = \text{NaN}$ , and the optimization collapses.

The log-domain solution: the IDW kernel is computed in log space as $\log K_{ij} = -\alpha_i \cdot \log(t_{ij} + 1)$ , and the column normalization uses the logsumexp trick for numerical stability. The logsumexp function $\log(\sum_j \exp(x_j))$ is a numerically stable primitive available in torch.

14.3 The column normalization algorithm

The optimizer performs per-bracket column normalization on 3D tensors. The tensor shapes involved:

Symbol	Shape	Meaning
`log_t`	$[n \times m]$	Log travel times (precomputed once)
`alpha`	$[n \times k_a]$	Per-tract-per-bracket decay parameters
`P`	$[n \times k_a]$	Census population per tract and bracket
`V`	$[m \times k_a]$	Voter counts per station and bracket
`log_K`	$[k_a \times n \times m]$	Per-bracket log-kernels
`W_all`	$[k_a \times n \times m]$	Per-bracket weight tensors

The first dimension ( $k_a$ ) indexes demographic brackets. Each bracket uses its own alpha values, producing bracket-specific kernels. Column normalization is applied independently to each bracket slice.

Input: log_t[n,m], alpha[n,ka], P[n,ka], V[m,ka]

# Log-kernel (per bracket)
log_K[b,i,j] = -alpha[i,b] * log_t[i,j]               # [ka x n x m]

# Column normalization in log domain (per bracket)
log_colsum[b,j] = logsumexp_i(log_K[b,i,j])            # [ka x m]
W[b,i,j] = exp(log_K[b,i,j] - log_colsum[b,j])        # [ka x n x m]

# Interpolate: per-bracket weighted voter counts
V_hat[i,b] = sum_j W[b,i,j] * V[j,b]                  # [n x ka]
loss = poisson_deviance(V_hat, P) + barrier(V_hat) + entropy(W)

Every operation — log, exp, logsumexp, matrix multiply, Poisson deviance — has a defined backward pass in torch. The gradient $\nabla_\theta L$ flows backward through the column normalization automatically.

In practice, the optimizer implements the column normalization via nnf_softmax() (which uses logsumexp internally), while compute_weight_matrix() uses linear-space normalization with a clamp guard for safety.

14.4 Softplus reparameterization

The decay parameters $\alpha$ must satisfy $\alpha_{ib} \geq \alpha_{\min}$ (default $\alpha_{\min} = 1$ for the power kernel). Rather than clamping or projecting, the optimizer works with unconstrained parameters $\theta$ and maps:

$\alpha_{ib} = \alpha_{\min} + \text{softplus}(\theta_{ib}) = \alpha_{\min} + \log(1 + e^{\theta_{ib}})$

The softplus function $\log(1 + e^x)$ is a smooth approximation to $\max(0, x)$ . Key properties:

$\alpha$ is always strictly above $\alpha_{\min}$ (softplus $> 0$ )
Gradient $\partial\alpha/\partial\theta = \sigma(\theta)$ , always in $(0, 1)$ — no saturation at any boundary
No clamping, no projection — smooth and differentiable everywhere
Corner solutions due to hard boundary constraints cannot occur

Per-tract, per-bracket alpha: the $\alpha$ matrix has shape $[n \times k_a]$ , allowing different decay rates for different demographic groups within the same tract. A young-voter bracket might have a different spatial reach than an old-voter bracket.

Initialization: $\theta_0 = \text{softplus}^{-1}(\alpha_{\text{init}} - \alpha_{\min})$ , with a small random perturbation ( $\sigma = 0.1$ ) to break symmetry across tracts.

Figure @ref(fig:fig-alpha-bracket) shows how the optimized $\alpha$ values vary across the 14 demographic brackets in Belo Horizonte, illustrating that the per-bracket structure allows each age-gender group to learn its own spatial reach.

Distribution of optimized alpha by demographic bracket in Belo Horizonte (5,109 tracts). Each bracket learns a different decay rate, reflecting age- and gender-specific spatial mobility patterns.

14.5 Adaptive entropy targets and dual ascent

When the user specifies target_eff_src (e.g., target_eff_src = 5) in optim_control(), the optimizer automatically adapts the entropy penalty strength $\mu_E$ to achieve approximately that many effective sources per tract.

Per-tract targets

A global entropy target is infeasible for tracts with sparse or uniform travel times. Consider a peripheral tract in the Amazon where all reachable stations are at similar distances: no value of $\alpha$ can produce a weight distribution with low entropy, because the kernel values are nearly equal regardless of the decay rate.

The package computes per-tract adaptive targets:

$h_i^* = \max\!\left( \log\min(k_i - 1,\; T),\; H_i(\alpha_{\text{ref}}),\; \log 2 \right)$

where:

$k_i$ = number of reachable stations for tract $i$
$T$ = user’s target_eff_src
$H_i(\alpha_{\text{ref}})$ = Shannon entropy of the weight distribution at a reference decay ( $\alpha_{\text{ref}} = 7$ for power kernel, 50 for exponential). This captures the minimum achievable entropy given the actual travel time structure. Tracts where all stations are at similar distances get a higher floor.
$\log 2$ = absolute floor (at least 2 effective sources)

Why this matters: without adaptive targets, 9 of 26 Brazilian state capitals (all in Norte or Centro-Oeste — Amazonian and Cerrado regions) exhibit optimization instability. With per-tract adaptive targets, catastrophic optimization failure (crash-collapse) is prevented and all 26 Brazilian state capitals converge. However, cities with structurally infeasible constraints (sparse road networks with uniform travel times) achieve higher deviance than well-connected cities — the per-tract targets mitigate but cannot eliminate this structural limitation.

Figure @ref(fig:fig-eff-sources) shows the distribution of effective sources across tracts for four contrasting capitals. Palmas (Norte) is concentrated near 1–2 effective sources due to its sparse road network, while larger cities achieve broader distributions.

Distribution of effective sources across census tracts in four capitals. Dashed red line indicates the target. Palmas (sparse Norte city) concentrates near 1–2 sources; larger Sudeste/Nordeste cities achieve broader distributions.

Effective sources

The metric $\text{eff\_src}_i = \exp(H_i)$ measures how many stations effectively contribute to tract $i$ . When $H_i = \log M$ (uniform weights), all $M$ stations contribute equally; when $H_i = 0$ , a single station dominates.

Log-reparameterized dual ascent

The entropy multiplier is parameterized as $\mu_E = \exp(\log\mu)$ , which structurally ensures $\mu_E > 0$ (prevents collapse to zero). The dual variable is updated once per epoch via the augmented Lagrangian method:

$\log\mu \leftarrow \log\mu + d(t) \cdot \frac{\eta_d}{T_d} \cdot (\bar{H} - \bar{h}^*)$

where:

$d(t) = 1/\sqrt{1 + t/5000}$ is a gentle decay schedule
$\eta_d$ = dual step scaling (dual_eta, default 1.0)
$T_d = 500$ is a dampening constant
$\bar{H}$ = current mean entropy across tracts
$\bar{h}^*$ = mean of per-tract targets

An ALM bound caps $\log\mu \leq \log(3\rho)$ to prevent overshoot (Bertsekas, 1999). The quadratic penalty coefficient is initialized as $\rho = M / T$ where $M$ is the number of stations and $T$ is target_eff_src. The per-tract quadratic penalty $(\rho/2)\sum_i(H_i - h_i^*)^2$ does the heavy lifting of driving individual tracts toward their targets; the dual update ensures asymptotic exactness of the aggregate constraint.

The effective sources are smoothed via an exponential moving average ( $\alpha_{\text{ema}} = 0.05$ ) to filter noise from dual updates.

14.6 ADAM optimizer and learning rate schedule

We use ADAM (Adaptive Moment Estimation) with betas $(0.9, 0.99)$ as the underlying optimizer, with per-bracket column normalization. At each gradient step:

Build the IDW kernel from current $\theta$ values (via softplus)
Per-bracket column normalization on 3D tensors
Compute loss: Poisson deviance + barrier + entropy + quadratic
Backward + Adam step on unconstrained $\theta$

At the end of each epoch, the true objective is evaluated on the full dataset (no gradient) for convergence monitoring.

Learning rate schedule — two modes:

Non-adaptive mode (no target_eff_src): SGDR cosine annealing with warm restarts (Loshchilov & Hutter, 2017). The learning rate oscillates between lr_init and lr_init * 0.01, with the cycle length doubling after each restart ( $T_0 = 500$ , $T_{\text{mult}} = 2$ ). The maximum learning rate decays by a factor of 0.7 per restart. At warm restarts, ADAM’s momentum buffers are cleared to allow fresh exploration. This deterministic schedule is decoupled from loss values.

Adaptive mode (target_eff_src set): constant learning rate throughout. The dual ascent updates change the loss landscape every epoch, making LR decay counterproductive. ADAM’s per-parameter adaptive second moment already handles oscillations.

Gradient clipping: max_norm = 1.0 prevents exploding gradients.

optim_result <- optimize_alpha(
  time_matrix   = time_matrix,
  pop_matrix    = pop_matrix,
  source_matrix = source_matrix,
  row_targets   = row_targets,
  optim = optim_control(
    lr_init        = 0.05,        # initial learning rate
    target_eff_src = 5,           # enable dual ascent (5 effective sources)
    use_gpu        = FALSE        # CPU for small problems
  )
)

The epoch loss curve shows the total loss (deviance + barrier + entropy + quadratic) evaluated at each epoch. The package tracks the best model and returns the corresponding alpha values.

Figure @ref(fig:fig-convergence-trajectories) compares convergence behavior across four contrasting capitals: Aracaju converges in 2,687 epochs, while Palmas (sparse Norte city) requires 19,383. The entropy multiplier $\mu_E$ panel reveals how the dual ascent adapts penalty strength — healthy cities see $\mu_E$ rise and settle quickly, while structurally difficult cities show prolonged oscillations.

Convergence trajectories for four contrasting capitals. Left: Poisson deviance (log scale). Right: entropy multiplier. Healthy cities (Aracaju, Belo Horizonte) converge quickly; sparse Norte cities (Palmas, Manaus) require many more epochs and show prolonged dual variable dynamics.

Key tuning parameters (via optim_control()):

Parameter	Default	Effect
`max_epochs`	50000	Maximum epochs (convergence usually earlier).
`lr_init`	0.05	Initial learning rate. SGDR (non-adaptive) or constant (adaptive).
`convergence_tol`	1e-5	Relative improvement threshold for convergence.
`patience`	100	Lookback window (epochs) for convergence check.
`barrier_mu`	1	Log-barrier strength. 0 to disable.
`entropy_mu`	0	Manual entropy penalty strength. Superseded by `target_eff_src`.
`target_eff_src`	NULL	Target effective sources per tract. Enables dual ascent.
`dual_eta`	1.0	Dual step scaling factor.
`alpha_init`	2	Initial guess for alpha (scalar, vector, or matrix).
`alpha_min`	1 (power)	Lower bound via softplus reparameterization.
`kernel`	“power”	“power” or “exponential”.
`dtype`	“float32”	“float64” for more precision (2x memory).
`use_gpu`	NULL	GPU acceleration (auto-detected when NULL).

14.7 Convergence criteria

The optimizer uses a two-tier convergence system, never declaring convergence before $\max(3 \times \text{patience}, 500)$ epochs:

Tier 1 — Gradient-based (true stationarity). Requires 10 consecutive epochs satisfying all of:

Relative gradient norm: $\|\nabla_\theta L\| / (1 + \|\theta\|) < 10^{-4}$
Effective sources within 5% of adjusted target (adaptive mode)
Dual variable $\mu_E$ stable over patience epochs
Total loss stable over patience epochs
Deviance not rising: mean deviance over the last 100 epochs must not exceed mean deviance from 400–500 epochs ago by more than 1%

Tier 2 — Window-based (diminishing returns). Triggered when the loss metric improves by less than convergence_tol over the last patience epochs, provided:

Effective sources within 20% of target (coarser than Tier 1)
Dual variable and total loss both stable
In refinement phase: adaptive mode (always), or non-adaptive mode with LR below 10% of initial

In adaptive mode, the convergence metric is Poisson deviance alone; in non-adaptive mode, total loss (deviance + barrier) is used.

Best model tracking: in adaptive mode, the optimizer tracks the model with the lowest Poisson deviance among epochs where effective sources are within 20% of the adjusted target. At convergence (Tier 1 or 2), the final-epoch model (the ALM saddle point) is returned; when max_epochs is hit without convergence, the best tracked near-target model is used instead.

14.8 Station-chunked computation

For large problems where $k_a \times n \times m > 25 \times 10^6$ , the optimizer automatically switches to a station-chunked computation path. Instead of building the full $[k_a \times n \times m]$ tensor, the forward pass iterates over station chunks of size:

$m_{\text{chunk}} = \left\lceil \frac{25 \times 10^6}{k_a \times n}\right\rceil$

capped by available GPU memory (60% of VRAM).

Two-pass structure:

Pass 1 (no gradient): loops over station chunks, accumulating the predicted values $\hat{V}$ ( $n \times k_a$ ) and the aggregated weight matrix $W_{\text{agg}}$ ( $n \times m$ ). Computes the loss value and entropy/barrier gradient signals.
Pass 2 (analytical gradient): computes $\partial L / \partial\alpha$ directly using the derived softmax Jacobian formula, without autograd backward:

$\frac{\partial L}{\partial \alpha_{ib}} = \sum_j c_{bj}\, W^{(b)}_{ij}\, \phi(t_{ij}) \left(\overline{dL}_{bj} - dL_{ib}\right)$

where $\overline{dL}_{bj} = \sum_{i'} W^{(b)}_{i'j}\, dL_{i'b}$ is the weight-averaged loss gradient at station $j$ (see Section 20 for the full derivation).

The entropy gradient uses a one-epoch-delayed signal: the gradient $\partial(E+Q)/\partial W_{\text{agg}}$ from the previous epoch is propagated through the same softmax Jacobian, adding the corresponding terms. This avoids rebuilding the computational graph while introducing negligible lag.

Speedup: the analytical path achieves 2–3x speedup over autograd backward for large cities (e.g., 2.62 s/epoch vs. 7.77 s/epoch for S0e3o Paulo with 26,611 tracts and 2,028 stations).

Part VI: Results and Validation

15. The Effect of Alpha: Spatial Interpretation

Most alphas cluster in a moderate range, with tails extending toward 0 (uniform allocation) and large values (nearest-station-only).

Low alpha (blue, urban core): weight is spread across many nearby stations. Multiple stations overlap, and the optimizer distributes influence broadly.
High alpha (red, periphery): weight is concentrated on the nearest station. Isolated tracts with one dominant station.

# How alpha changes the weight distribution for a fixed tract
tt_one_tract <- c(3, 8, 15, 25, 40)
alphas_demo <- c(1, 3, 5, 10)
wt_df <- do.call(rbind, lapply(alphas_demo, function(a) {
  w <- (tt_one_tract + 1)^(-a)
  w <- w / sum(w)
  data.frame(station = 1:5, weight = w,
             alpha = paste0("alpha = ", a))
}))
wt_df$alpha <- factor(wt_df$alpha,
                        levels = paste0("alpha = ", alphas_demo))

ggplot(wt_df, aes(x = factor(station), y = weight, fill = alpha)) +
  geom_col(position = "dodge") +
  labs(title = "Weight Distribution as Alpha Increases",
       subtitle = "Fixed tract, 5 stations at 3, 8, 15, 25, 40 min",
       x = "Station (sorted by distance)",
       y = "Normalized weight", fill = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

At $\alpha = 1$ , weights are nearly uniform. At $\alpha = 10$ , the nearest station dominates with ~85% of the weight.

Interpretation guide:

$\alpha < 1$ : draws information almost equally from all stations. Unusual — check for data issues.
$\alpha \in [1, 3]$ : several nearby stations contribute. Typical for dense urban areas.
$\alpha \in [3, 6]$ : a few stations dominate. Typical for suburban areas.
$\alpha > 6$ : one station dominates almost entirely. Typical for isolated tracts.

16. The Interpolation: From Weights to Tract-Level Estimates

The weight matrix $W$ encodes the geographic relationship between tracts and stations. Multiplying by any station-level variable produces tract-level estimates:

$\hat{X} = W \times V$

where $W$ is [N × M], $V$ is [M × 1], and $\hat{X}$ is [N × 1]. For multiple variables: $\hat{X} = W \times V_{\text{matrix}}$ where $V_{\text{matrix}}$ is [M × P].

# W is returned directly from optimization
W <- optim_result$W

electoral_data <- as.matrix(
  sources[, c("CAND_13", "CAND_22", "QT_COMPARECIMENTO")]
)
interpolated <- W %*% electoral_data

pct_lula <- interpolated[, "CAND_13"] /
  interpolated[, "QT_COMPARECIMENTO"] * 100

Geographic polarization becomes visible at the tract level — a spatial pattern that was previously impossible to observe with station-level data alone.

Calibration validation: interpolated demographics should match census demographics (the optimization objective).

Points cluster tightly around the 45-degree line, confirming that the optimization found good weights.

The reuse principle: the weight matrix $W$ is a property of the geography and the calibration. Once computed, it can interpolate any variable measured at polling stations — candidates, parties, turnout, gender composition, education level — without re-optimizing alpha.

17. Validation: How Well Did We Recover the Age Structure?

17.1 Conservation properties

Column conservation (source conservation):

$\sum_i W_{ij} = 1 \quad \forall j$

Each station distributes exactly 100% of its data. Municipal totals are preserved exactly.

votes <- c(500, 800, 700)
interpolated_votes <- as.numeric(W_col %*% votes)

cat("Source total:", sum(votes), "\n")
#> Source total: 2000
cat("Interpolated total:", round(sum(interpolated_votes)), "\n")
#> Interpolated total: 2000
cat("Per-tract:", round(interpolated_votes), "\n")
#> Per-tract: 506 722 106 461 204

Calibration residuals:

Residuals should be centered near zero with small magnitude. Outliers may indicate boundary effects (voters crossing municipality borders) or data quality issues.

17.2 Age pyramid recovery: Rocinha and Copacabana revisited

In Section 1, we showed that Rocinha and Copacabana have strikingly different age structures. Does the interpolation recover these patterns?

The blue bars show the census truth (population by age bracket) and the yellow bars show the IDW-interpolated voter profiles. The interpolation closely recovers the distinctive age patterns of both neighborhoods: Rocinha’s young skew and Copacabana’s aging profile.

This is a demanding test. The method does not know anything about Rocinha or Copacabana as neighborhoods — it only sees census tracts and polling stations. Yet the calibration against age brackets produces weights that correctly recover the neighborhood-level demographic signatures.

17.3 Comparison with Voronoi assignment

The simplest alternative to our method is Voronoi assignment: assign each tract to its nearest polling station and use that station’s data directly. This is equivalent to constructing Voronoi (Thiessen) polygons around each station and assigning each tract 100% of the weight from one station.

We compare both methods across four Brazilian state capitals with diverse urban morphologies: Palmas (TO), Cuiabá (MT), Manaus (AM), and Natal (RN).

In the Voronoi approach, each tract inherits the complete demographic and electoral profile of a single station. This ignores the many-to-many structure: a tract near the boundary between two Voronoi cells gets none of the influence from the station just across the boundary.

How well do the two methods recover the census age structure? The following figure compares predicted versus observed demographic profiles across all four cities:

IDW points cluster tightly around the 45-degree line in every city. Voronoi points show much more scatter — many tracts have large prediction errors because they were assigned to a station with a very different demographic composition.

The residual boxplots confirm the pattern across all four cities: IDW residuals are tightly centered around zero, while Voronoi residuals have much larger spread.

The aggregate SSE ratio quantifies the advantage:

In every city, the IDW method achieves dramatically lower squared error than Voronoi assignment — by a factor of 275x (Cuiaba) to 1,624x (Manaus), with Palmas at 601x and Natal at 639x. The advantage is consistent across cities of different sizes and urban layouts, confirming that smooth, calibrated weighting is fundamentally superior to hard nearest-station assignment.

As a final validation exercise, we examine whether the interpolated voting patterns exhibit expected ecological correlations. If the spatial weights are correct, interpolated vote shares should correlate with tract-level socioeconomic indicators from independent data sources.

We correlate the interpolated percentage of votes for Lula (PT) in each census tract with the average monthly income of household heads, obtained from the 2022 Census (via censobr::read_tracts(2022, "ResponsavelRenda")). We show this correlation for four cities with very different socioeconomic profiles: São Paulo (SP), Salvador (BA), Porto Alegre (RS), and Cuiabá (MT).

The negative correlation is consistent across all four cities and with well-established patterns in Brazilian electoral geography: lower-income areas tend to vote more heavily for PT candidates, while higher-income areas lean toward other parties. The fact that our interpolated tract-level estimates reproduce this pattern — using vote data that was originally measured at polling stations, not at tracts — provides external validation that the spatial weights are geographically meaningful.

The income data is log-transformed ( $\log(x+1)$ ) to reduce the influence of extreme values and improve linearity.

Part VII: Practical

18. Performance and Tuning

All timings below assume default optimization parameters: max_epochs = 50000, dtype = "float32". The optimizer typically converges well before max_epochs due to the two-tier convergence system (Section 14.7).

Problem size (tracts)	Recommendation	Typical optimization time
< 200	CPU	10–30 seconds
200–1,000	CPU or GPU	30s – 2 min
1,000–5,000	GPU recommended	1 – 5 min
> 5,000	GPU strongly recommended	5–15 min

18.1 Automatic station chunking

For large problems where $k_a \times n \times m > 25 \times 10^6$ , the optimizer automatically switches to the station-chunked computation path (Section 14.8). This reduces both per-epoch time and peak GPU memory:

Municipality	Tracts	Stations	Auto-chunks	~s/epoch (GPU)
Varginha	279	37	1 (monolithic)	~0.05
Belo Horizonte	5,109	407	2	~0.085
Rio de Janeiro	13,354	1,392	11	~0.52
S0e3o Paulo	26,611	2,028	30	~0.96

The chunk size is computed as $m_{\text{chunk}} = \lceil 25\text{M} / (k_a \times n) \rceil$ , capped by available GPU memory. Users can override via force_chunked and chunk_size in optim_control(), but the defaults work well for the tested range.

18.2 Practical runtimes

Runtimes for the optimization step only (GPU, RTX 3060 8GB, convergence or max_epochs):

Igrejinha (85 tracts, 17 stations): ~24s CPU
Varginha (279 tracts, 37 stations): ~52s CPU
Palmas (660 tracts, 71 stations): ~135s CPU
Niter0f3i (1,169 tracts, 135 stations): ~17s GPU
Belo Horizonte (5,109 tracts, 407 stations): ~105s GPU
Rio de Janeiro (13,370 tracts, 1,392 stations): ~10 min GPU

Note that total wall-clock time for interpolate_election_br() also includes data downloads, travel time computation (via r5r), and weight matrix construction, which can dominate for larger cities.

18.3 Memory estimation

GPU memory during optimization depends on the computation path:

Monolithic path ( $k_a \times n \times m \leq 25\text{M}$ ): the full 3D tensor $[k_a \times n \times m]$ is held in memory, plus autograd intermediate tensors. Peak usage $\approx 6 \times k_a \times n \times m \times \text{bytes}$ .

Station-chunked path ( $k_a \times n \times m > 25\text{M}$ ): only one chunk $[k_a \times n \times m_{\text{chunk}}]$ is in memory at a time, reducing peak usage to $O(k_a \times n \times m_{\text{chunk}})$ . No autograd graph is built (analytical gradients).

Municipality	Tracts	Stations	$k_a$	Path	Estimated GPU memory
Varginha	279	37	14	Monolithic	~28 MB
Niteroi	1,169	135	14	Monolithic	~242 MB
Belo Horizonte	5,109	407	14	Chunked (2)	~400 MB
Rio de Janeiro	13,370	1,392	14	Chunked (11)	~1.5 GB

For very large cities, dtype = "float32" (the default) halves memory compared to "float64".

RStudio: torch inside RStudio IDE requires subprocess execution via callr (automatic and transparent to the user). Standalone R scripts run directly.

19. Setup: Torch, Java, and r5r

Torch (required for optimization):

setup_torch()    # installs torch + libtorch/lantern binaries
use_gpu(TRUE)    # enable GPU globally

GPU support: Windows (CUDA), macOS (MPS for Apple Silicon), Linux (CUDA). CPU always works as fallback.

Java + r5r (required for travel time computation):

setup_java()     # downloads and installs Java 21

Memory: options(java.parameters = "-Xmx8g") before loading r5r for large municipalities.

Cache management:

interpElections_cache()                             # list cached files
interpElections_cache("dir")                        # cache directory path
interpElections_cache("clean", category = "votes")  # clear by category

20. Mathematical Appendix

Notation

$N$ = number of census tracts, $M$ = number of polling stations
$K$ = number of calibration brackets (14 for full gender $\times$ age)
$k_a$ = number of active brackets (those with nonzero population and voters)
$t \in \mathbb{R}^{N \times M}$ : travel time matrix (minutes)
$\alpha \in \mathbb{R}^{N \times k_a}$ : per-tract-per-bracket decay parameters
$\theta \in \mathbb{R}^{N \times k_a}$ : unconstrained optimizer variables
$P \in \mathbb{R}^{N \times k_a}$ : census population by bracket
$S \in \mathbb{R}^{M \times k_a}$ : voter counts by bracket at stations
$c_{kj} = S_{jk}$ : voter count for bracket $k$ at station $j$
$\hat{V}_{ik}$ : predicted population for tract $i$ , bracket $k$
$W^{(k)}_{ij}$ : per-bracket column-normalized weight
$W_{\text{agg},ij}$ : voter-weighted aggregated weight
$H_i$ : Shannon entropy of tract $i$ ’s weight distribution
$h_i^*$ : per-tract entropy target
$\mu_B$ : barrier penalty strength (default 1)
$\mu_E$ : entropy penalty multiplier (dual variable in adaptive mode)
$\rho$ : quadratic penalty coefficient

IDW kernel

Both kernels share the form $\log K_{ij} = -\alpha_{ib} \cdot \phi(t_{ij})$ :

Kernel	$K_{ij}$	$\phi(t)$	Tail
Power (default)	$(t_{ij} + 1)^{-\alpha_{ib}}$	$\log(t_{ij} + 1)$	Heavy
Exponential	$\exp(-\alpha_{ib} \cdot t_{ij})$	$t_{ij}$	Light

Log-domain kernel

$\log K_{ij} = -\alpha_{ib} \cdot \phi(t_{ij})$

Column normalization

$W_{ij} = \frac{K_{ij}}{\sum_{i'} K_{i'j}}$

In log domain (for numerical stability):

$W_{ij} = \exp\!\left(\log K_{ij} - \text{logsumexp}_i(\log K_{ij})\right)$

Per-bracket column normalization

For each demographic bracket $k = 1, \ldots, k_a$ :

Compute kernel: $K^{(k)}_{ij} = \exp(-\alpha_{ik} \cdot \phi(t_{ij}))$
Column-normalize: $W^{(k)}_{ij} = K^{(k)}_{ij} / \sum_{i'} K^{(k)}_{i'j}$

Aggregated weight matrix

The voter-weighted aggregation across brackets:

$W_{\text{agg},ij} = \frac{\sum_{k=1}^{k_a} W^{(k)}_{ij} \cdot c_{kj}}{\sum_{k=1}^{k_a} c_{kj}}$

This is the weight matrix used for entropy computation and for interpolating non-demographic variables after optimization.

Interpolation

$\hat{V}_{ik} = \sum_{j=1}^{M} W^{(k)}_{ij} \cdot c_{kj}$

Softplus reparameterization

The decay parameters $\alpha$ must satisfy $\alpha_{ib} \geq \alpha_{\min}$ (default $\alpha_{\min} = 1$ for the power kernel). Rather than clamping, the optimizer works with unconstrained $\theta$ and maps:

$\alpha_{ib} = \alpha_{\min} + \text{softplus}(\theta_{ib}) = \alpha_{\min} + \log(1 + e^{\theta_{ib}})$

Gradient through softplus:

$\frac{\partial \alpha}{\partial \theta} = \sigma(\theta) = \frac{1}{1 + e^{-\theta}}$

Inverse (for initialization):

$\theta = \log\!\left(e^{\alpha - \alpha_{\min}} - 1\right)$

Poisson deviance (data loss)

$D(\alpha) = 2 \sum_{i=1}^{N} \sum_{k=1}^{k_a} \left[\hat{V}_{ik} - P_{ik}\log\hat{V}_{ik} + P_{ik}\log P_{ik} - P_{ik}\right]$

When $\hat{V}_{ik} = P_{ik}$ for all entries, $D = 0$ . The constant terms $P_{ik}\log P_{ik} - P_{ik}$ are precomputed once.

Gradient with respect to predicted values:

$\frac{\partial D}{\partial \hat{V}_{ik}} = 2\!\left(1 - \frac{P_{ik}}{\hat{V}_{ik}}\right)$

Safe-log convention: where $P_{ik} = 0$ , the term $P_{ik}\log\hat{V}_{ik}$ is treated as zero (the contribution vanishes regardless of $\hat{V}$ ).

Log-barrier penalty

Prevents any census tract from receiving zero total predicted voters:

$B(\alpha) = -\mu_B \sum_{i=1}^{N} \log\hat{V}_{i,\text{total}}, \quad \hat{V}_{i,\text{total}} = \sum_{k=1}^{k_a} \hat{V}_{ik}$

Gradient:

$\frac{\partial B}{\partial \hat{V}_{ik}} = -\frac{\mu_B}{\hat{V}_{i,\text{total}}}$

This signal is the same for all brackets $k$ within a tract, gently pushing all brackets away from zero.

Entropy penalty

Operates on the row-normalized aggregated weights:

$p_{ij} = \frac{W_{\text{agg},ij}}{\sum_{j'} W_{\text{agg},ij'}}, \quad H_i = -\sum_{j=1}^{M} p_{ij}\log p_{ij}$

$E(\alpha) = \mu_E \sum_{i=1}^{N} H_i$

The effective number of sources for tract $i$ is $\exp(H_i)$ . When $H_i = \log M$ (uniform weights), all stations contribute equally; when $H_i = 0$ , a single station dominates.

Quadratic penalty (augmented Lagrangian)

Active only in adaptive mode (target_eff_src set):

$Q(\alpha) = \frac{\rho}{2}\sum_{i=1}^{N}\left(H_i - h_i^*\right)^2$

The quadratic term drives each tract’s entropy toward its individual target. Together with the linear entropy penalty $E$ , this forms the augmented Lagrangian:

$\mathcal{L}_{\text{aug}} = D + B + \mu_E\sum_i H_i + \frac{\rho}{2}\sum_i(H_i - h_i^*)^2$

Per-tract adaptive entropy targets

A global entropy target is infeasible for tracts with sparse or uniform travel times. Per-tract targets account for local geography:

$h_i^* = \max\!\left( \log\min(k_i - 1,\; T),\; H_i(\alpha_{\text{ref}}),\; \log 2 \right)$

where:

$k_i$ = number of reachable stations for tract $i$
$T$ = user-specified target_eff_src
$H_i(\alpha_{\text{ref}})$ = Shannon entropy of the softmax weights at a reference decay ( $\alpha_{\text{ref}} = 7$ for power kernel, 50 for exponential), capturing the minimum achievable entropy given the actual travel time distribution
$\log 2$ = absolute floor (at least 2 effective sources)

The aggregate target used for the dual update is $\bar{h}^* = \frac{1}{N}\sum_i h_i^*$ .

Total loss

$L(\alpha) = D(\alpha) + B(\alpha) + E(\alpha) + Q(\alpha)$

In the default mode (no target_eff_src), $E = Q = 0$ and the optimizer minimizes $D + B$ .

Dual ascent update

The entropy multiplier $\mu_E$ is reparameterized as $\mu_E = \exp(\log\mu)$ , ensuring strict positivity. The dual variable is updated once per epoch:

$\log\mu \leftarrow \log\mu + d(t) \cdot \frac{\eta_d}{T_d} \cdot (\bar{H} - \bar{h}^*)$

where:

$d(t) = 1/\sqrt{1 + t/5000}$ is a gentle decay schedule
$\eta_d$ = dual step scaling (dual_eta, default 1.0)
$T_d = 500$ is a dampening constant
$\bar{H} = \frac{1}{N}\sum_i H_i$ = current mean entropy

ALM bound: $\log\mu \leq \log(3\rho)$ prevents overshoot (Bertsekas, 1999).

SGDR cosine annealing with warm restarts

The primal learning rate (non-adaptive mode) follows Loshchilov & Hutter (2017):

$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min}) \left(1 + \cos\!\left(\frac{T_{\text{cur}}}{T_i}\pi\right)\right)$

where $T_0 = 500$ , $T_{\text{mult}} = 2$ (cycle length doubles after each restart), $\eta_{\min} = 0.01 \cdot \eta_{\max}$ , and $\eta_{\max}$ decays by a factor of 0.7 per restart. In adaptive mode, the learning rate is held constant (dual updates make the loss landscape non-stationary).

Analytical gradient (station-chunked path)

For large problems ( $k_a \times N \times M > 25 \times 10^6$ ), the gradient is computed analytically instead of via autograd. The derivative of the total loss with respect to $\alpha_{ib}$ , through the softmax column normalization, is:

$\frac{\partial L}{\partial \alpha_{ib}} = \sum_{j=1}^{M} c_{kj}\, W^{(b)}_{ij}\, \phi(t_{ij}) \left(\overline{dL}_{bj} - dL_{ib}\right)$

where:

$dL_{ib} = \partial L / \partial\hat{V}_{ib}$ is the loss gradient with respect to predicted values (deviance + barrier contributions)
$\overline{dL}_{bj} = \sum_{i'} W^{(b)}_{i'j}\, dL_{i'b}$ is the weight-averaged gradient signal at station $j$
$\phi(t_{ij})$ is the kernel basis function ( $\log(t_{ij}+1)$ for power, $t_{ij}$ for exponential)

This formula arises from the Jacobian of the softmax (column normalization) and avoids building the full $k_a \times N \times M$ computational graph. The chain rule through softplus gives the final parameter gradient:

$\frac{\partial L}{\partial \theta_{ib}} = \frac{\partial L}{\partial \alpha_{ib}} \cdot \sigma(\theta_{ib})$

The entropy gradient on $W_{\text{agg}}$ is incorporated via a one-epoch-delayed signal: the gradient $\partial(E + Q)/\partial W_{\text{agg}}$ from the previous epoch is propagated through the same softmax Jacobian formula, adding the corresponding terms to $\partial L / \partial\alpha_{ib}$ .

Convergence criteria

Tier 1 — Gradient-based (true stationarity). Requires 10 consecutive epochs satisfying all of:

Relative gradient: $\|\nabla_\theta L\| / (1 + \|\theta\|) < 10^{-4}$
Effective sources within 5% of adjusted target (adaptive mode)
Dual variable $\mu_E$ stable over patience epochs
Total loss stable over patience epochs
Deviance not rising (mean of last 100 epochs $\leq 1.01 \times$ mean of epochs 400–500 ago)

Tier 2 — Window-based (diminishing returns). Triggered when the loss metric improves by less than convergence_tol over the last patience epochs, provided effective sources are within 20% of target. In adaptive mode the metric is Poisson deviance; in non-adaptive mode it is total loss (deviance + barrier).

Both tiers require a minimum of $\max(3 \times \text{patience}, 500)$ epochs before declaring convergence.

Part I: Motivation and Pipeline Overview

1. Why Sub-Municipal Electoral Geography Matters

Age structure as a proxy for socioeconomic status

2. The Problem: Many-to-Many Spatial Disaggregation

The approach: estimating the association structure

Walking to vote: a behavioral assumption

3. Pipeline Overview

Part II: Data

4. Census Population Data

5. Census Tract Geometries

6. Electoral Data at Polling Stations

7. Calibration Variables: Comparing Source and Target

Part III: Geography

8. Representative Points: From Tracts to Routable Origins

9. Travel Times: Building the Distance Matrix

9.1 Why travel time?

9.2 Euclidean distance vs. walking routes

9.3 Building the travel time matrix

Part IV: The Weight Matrix

10. The IDW Kernel: Turning Distance into Influence

Intuition: gravitational pull

A concrete example

The formula

How alpha controls the decay

Why per-tract alpha?

11. Why Single-Sided Standardization Fails

11.1 Column-only standardization

11.2 Row-only standardization

The choice: column normalization

12. Column Normalization: Source Conservation

12.1 The idea: preserving station totals

12.2 Worked example

12.3 Per-Bracket Column Normalization

The problem: age groups have different geographies

The per-bracket extension

Why this works

Connection to the optimizer

Part V: Calibration and Optimization

13. The Calibration Objective

13.1 The key insight

13.2 The iterative procedure

13.3 The data-fit term: Poisson deviance

13.4 Regularization: barrier, entropy, and quadratic penalties

13.5 The total loss

14. Optimization: Per-Bracket Gradient Descent with Column Normalization

14.1 Why automatic differentiation?

14.2 Log-domain computation for numerical stability

14.3 The column normalization algorithm

14.4 Softplus reparameterization

14.5 Adaptive entropy targets and dual ascent

Per-tract targets

Effective sources

Log-reparameterized dual ascent

14.6 ADAM optimizer and learning rate schedule

14.7 Convergence criteria

14.8 Station-chunked computation

Part VI: Results and Validation

15. The Effect of Alpha: Spatial Interpretation

16. The Interpolation: From Weights to Tract-Level Estimates

17. Validation: How Well Did We Recover the Age Structure?

17.1 Conservation properties

17.2 Age pyramid recovery: Rocinha and Copacabana revisited

17.3 Comparison with Voronoi assignment

17.4 Ecological correlation: Lula vote share and household income

Part VII: Practical

18. Performance and Tuning

18.1 Automatic station chunking

18.2 Practical runtimes

18.3 Memory estimation

19. Setup: Torch, Java, and r5r

20. Mathematical Appendix

Notation

IDW kernel

Log-domain kernel

Column normalization

Per-bracket column normalization

Aggregated weight matrix

Interpolation

Softplus reparameterization