Spatial Smoothing with the Network Lasso
========================================

This guide walks through a small-area-estimation problem: we have noisy
per-county survey estimates of (log) median household income across the state
of Georgia, and we want a smoothed map that borrows strength between
neighboring counties. The natural tool is gamdist's
:doc:`categorical feature <../api>` with **network lasso** regularization on
the county-adjacency graph, which adds an L1 penalty to coefficient
differences along graph edges and therefore *fuses* neighbors to identical
values rather than merely shrinking them toward each other.

By the end you will have produced a three-panel choropleth contrasting the raw
observations, the network-lasso fit, and the (simulated) ground truth, and you
will have seen how to obtain real county geometries with a couple of lines of
:mod:`geopandas` --- no hand-curated SVG required.

Extra dependencies
------------------

The example uses two libraries that are *not* runtime dependencies of gamdist
itself:

- `geopandas <https://geopandas.org/>`_ for reading the county shapefile and
  rendering the choropleth.
- `libpysal <https://pysal.org/libpysal/>`_ for the bundled Georgia counties
  dataset and for building the Queen-contiguity adjacency graph.

Install them into your environment alongside gamdist:

.. code-block:: bash

   pip install geopandas libpysal

The same workflow generalizes to any other shapefile or GeoJSON of geographic
units --- US counties from the Census TIGER/Line files, ZIP-code tabulation
areas, school districts, electoral precincts, and so on. We use Georgia's 159
counties because the geometries are bundled with libpysal and the example runs
in a few seconds.

Loading the geometry
--------------------

:mod:`libpysal.examples` ships a small library of spatial datasets that load
without an internet round-trip. The ``georgia`` example contains a shapefile
of Georgia's 159 counties together with 1990 socio-economic attributes. We
need only the geometries:

.. code-block:: python

   import geopandas as gpd
   from libpysal.examples import load_example

   ex = load_example('georgia')
   counties = gpd.read_file(ex.get_path('G_utm.shp'))
   counties['county_id'] = counties['AreaKey'].astype(str)

   print(counties.shape)
   # (159, 18)
   print(sorted(counties.columns)[:6])
   # ['AREA', 'AreaKey', 'G_UTM_', 'G_UTM_ID', 'Latitude', 'Longitud']

A :class:`geopandas.GeoDataFrame` is a regular :class:`pandas.DataFrame` with
one extra column --- ``geometry`` --- holding a Shapely polygon per row.
``AreaKey`` is the FIPS code; we promote it to a string so it can serve as our
categorical level.

Building the adjacency graph
----------------------------

Network lasso needs an *edge list*: a DataFrame with three columns ``node1``,
``node2``, and ``weight`` describing which categories should be encouraged to
share a coefficient. For spatial data the natural choice is **Queen
contiguity**: two polygons are neighbors if they share any boundary point
(vertex or edge). :mod:`libpysal.weights` has this built in:

.. code-block:: python

   import pandas as pd
   from libpysal.weights import Queen

   w = Queen.from_dataframe(counties, use_index=False)

   pairs = []
   for i, neighbors in w.neighbors.items():
       for j in neighbors:
           if i < j:  # each undirected edge once
               pairs.append((counties.iloc[i]['county_id'],
                             counties.iloc[j]['county_id']))

   edges = pd.DataFrame(pairs, columns=['node1', 'node2'])
   edges['weight'] = 1.0
   print(len(edges))
   # 431

Each row of ``edges`` says "these two counties are neighbors and should
contribute a penalty term ``weight * |β_node1 - β_node2|`` to the fit". A
larger ``weight`` makes the corresponding pair more strongly fused; uniform
1.0 weights are the typical starting point. Use ``Rook.from_dataframe``
instead if you want only edge-sharing neighbors (i.e., excluding pairs that
meet at a single corner).
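A quick sanity check on the edge list is worthwhile before fitting: at large
:math:`\lambda` the network lasso collapses each *connected component* of the
graph to a single value, so a graph that silently splits into pieces smooths
differently than you might expect. A minimal sketch, using plain tuples as
stand-ins for the rows of ``edges``:

```python
from collections import defaultdict, deque

# Toy stand-in for the edges DataFrame: (node1, node2) pairs.
toy_edges = [('A', 'B'), ('B', 'C'), ('D', 'E')]  # two components

def connected_components(edge_pairs):
    """Count connected components of an undirected edge list via BFS."""
    adj = defaultdict(set)
    for a, b in edge_pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nbr in adj[node] - seen:
                seen.add(nbr)
                queue.append(nbr)
    return components

print(connected_components(toy_edges))  # 2
```

For the real graph, pass ``zip(edges['node1'], edges['node2'])``; Georgia's
counties tile a single landmass, so the Queen-contiguity graph should come
back as one component.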
Simulating the response
-----------------------

To demonstrate the smoother we'll synthesize a smooth-along-the-graph *truth*
and corrupt it with sampling noise. The true log-income declines radially from
a peak near metro Atlanta:

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)

   atlanta_lat, atlanta_lon = 33.75, -84.39
   dist_to_atl = np.sqrt(
       (counties['Latitude'] - atlanta_lat) ** 2
       + (counties['Longitud'] - atlanta_lon) ** 2
   )
   counties['true_log_income'] = 10.8 - 0.25 * dist_to_atl

We then draw five noisy observations per county, as if from a survey with five
respondents in each:

.. code-block:: python

   obs_rows = []
   for _, row in counties.iterrows():
       for _ in range(5):
           obs_rows.append({
               'county_id': row['county_id'],
               'log_income': row['true_log_income'] + rng.normal(0, 0.30),
           })
   obs = pd.DataFrame(obs_rows)
   print(len(obs))
   # 795

The unsmoothed estimate for each county is just the per-county sample mean:

.. code-block:: python

   counties['observed_mean'] = (
       counties['county_id']
       .map(obs.groupby('county_id')['log_income'].mean())
   )

Fitting the GAM
---------------

The model has a single feature --- ``county_id``, treated as a categorical
with network-lasso regularization. The ``regularization`` dict pairs the
coefficient :math:`\lambda` with the edge list. We start from
:math:`\lambda = 0.25`, which a moment of GCV-driven tuning (see "Choosing the
smoothing strength" below) confirms is a good default for this dataset:

.. code-block:: python

   from gamdist import GAM

   mdl = GAM(family='normal')
   mdl.add_feature(
       name='county_id',
       type='categorical',
       regularization={'network_lasso': {'coef': 0.25, 'edges': edges}},
   )
   mdl.fit(obs[['county_id']], obs['log_income'].to_numpy())
   mdl.summary()
   # Model Statistics
   # ----------------
   # phi:      0.0890
   # edof:     51
   # Deviance: 66.2
   # AIC:      848
   # AICc:     855
   # BIC:      1091
   # R^2:      0.408
   # GCV:      0.0951

The fitted county effects --- one per category --- are obtained by calling
:meth:`~gamdist.GAM.predict` with a one-row-per-county DataFrame:

.. code-block:: python

   counties['fitted_log_income'] = mdl.predict(counties[['county_id']])

The reported effective degrees of freedom (edof = 51) are far below the 159
categories: the network lasso has fused most neighboring counties to identical
coefficients, and ``edof`` reflects the actual number of distinct fitted
values rather than the unconstrained category count. That correctly-shrunk
``edof`` is what allows the AIC, BIC, and GCV numbers above to be compared
meaningfully across :math:`\lambda` values.

Drawing the choropleth
----------------------

:meth:`geopandas.GeoDataFrame.plot` colors each polygon by the value in a
chosen column. We render the raw observations, the network-lasso fit, and the
underlying truth side by side, sharing a color scale so the panels are
directly comparable:

.. code-block:: python

   import matplotlib.pyplot as plt

   fig, axes = plt.subplots(1, 3, figsize=(15, 5), constrained_layout=True)

   cols = ['observed_mean', 'fitted_log_income', 'true_log_income']
   titles = ['Observed (noisy survey mean)', 'Network-lasso fit', 'Truth']
   vmin = counties[cols].min().min()
   vmax = counties[cols].max().max()

   for ax, col, title in zip(axes, cols, titles):
       counties.plot(column=col, cmap='viridis', vmin=vmin, vmax=vmax,
                     edgecolor='white', linewidth=0.2, ax=ax)
       ax.set_axis_off()
       ax.set_title(title)

   fig.suptitle('Simulated log median household income, Georgia counties')
   plt.show()

.. figure:: ../_static/network_lasso_choropleth.png
   :alt: Three-panel choropleth of Georgia counties: noisy survey means on
         the left, the network-lasso fit in the middle, and the smooth truth
         on the right.
   :align: center

   Network-lasso smoothing of a simulated per-county survey of log median
   household income at the GCV-optimal :math:`\lambda = 0.25`. The middle
   panel collapses 159 counties to 50 distinct fitted values; most internal
   boundaries disappear as neighbors fuse to identical coefficients. The left
   panel is mottled --- noise dominates the per-county means.
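To put a number on how much the smoother recovers, a root-mean-square-error
comparison against the truth is a natural companion to the figure. The sketch
below uses stand-in arrays rather than re-running the fit: ``0.30 / sqrt(5)``
is the standard error of a five-respondent county mean from the simulation,
while the ``0.06`` noise scale for the fit is purely hypothetical. On the real
data you would compare the three ``counties`` columns directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 159  # one value per county

truth = 10.8 - 0.25 * rng.uniform(0.0, 4.0, size=n)            # stand-in truth
observed = truth + rng.normal(0.0, 0.30 / np.sqrt(5), size=n)  # noisy means
fitted = truth + rng.normal(0.0, 0.06, size=n)                 # smoothed fit

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(f'observed vs truth: {rmse(observed, truth):.3f}')
print(f'fitted   vs truth: {rmse(fitted, truth):.3f}')
```

On the real columns the calls would be
``rmse(counties['fitted_log_income'], counties['true_log_income'])`` versus
``rmse(counties['observed_mean'], counties['true_log_income'])``.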
The middle panel shows the smoothed estimate: most county boundaries have
*vanished* because the network lasso has fused neighboring counties to
identical coefficients, leaving a small number of contiguous regions of
constant value. The right panel is the smooth truth that the simulation drew
from. Visually, the middle panel is much closer to the right than the left.

Choosing the smoothing strength
-------------------------------

The single tuning knob is :math:`\lambda` (passed as ``coef``). At
:math:`\lambda = 0` the categorical feature is unregularized and each county
gets its own free coefficient (= the per-county sample mean, modulo the
implicit zero-sum centering). As :math:`\lambda \to \infty` all coefficients
along any connected component of the graph collapse to the same value.

The principled way to choose :math:`\lambda` is to minimize **generalized
cross-validation** (GCV), an efficient leave-one-out CV approximation reported
by :meth:`~gamdist.GAM.summary` on every fit:

.. code-block:: python

   for lam in [0.001, 0.05, 0.1, 0.25, 0.5, 1.0]:
       mdl_l = GAM(family='normal')
       mdl_l.add_feature(
           name='county_id',
           type='categorical',
           regularization={'network_lasso': {'coef': lam, 'edges': edges}},
       )
       mdl_l.fit(obs[['county_id']], obs['log_income'].to_numpy())

       fit = mdl_l.predict(counties[['county_id']])
       n_unique = np.unique(np.round(fit, 4)).size
       print(f'lambda={lam:>6}: edof={mdl_l.dof():>5.1f}'
             f' GCV={mdl_l.gcv():.4f} unique levels={n_unique:3d}')

   # lambda= 0.001: edof=159.0 GCV=0.1146 unique levels=157
   # lambda=  0.05: edof=124.0 GCV=0.1044 unique levels=123
   # lambda=   0.1: edof=106.0 GCV=0.1020 unique levels=105
   # lambda=  0.25: edof= 51.0 GCV=0.0951 unique levels= 50
   # lambda=   0.5: edof= 34.0 GCV=0.0981 unique levels= 34
   # lambda=   1.0: edof= 11.0 GCV=0.1009 unique levels= 11

GCV bottoms out near :math:`\lambda = 0.25`, where the network lasso has fused
Georgia's 159 counties into 50 contiguous regions of constant value.
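The reported statistics tie together arithmetically, which makes a handy
sanity check when sweeping :math:`\lambda`: for the normal family, GCV is the
mean deviance inflated by :math:`(1 - \mathrm{edof}/n)^{-2}`, and ``phi`` is
the usual deviance-based dispersion estimate, deviance divided by the residual
degrees of freedom. Plugging in the numbers reported at
:math:`\lambda = 0.25`:

```python
n, edof, deviance = 795, 51, 66.2  # from mdl.summary() at lambda = 0.25

gcv = (deviance / n) / (1 - edof / n) ** 2  # GCV inflation of mean deviance
phi = deviance / (n - edof)                 # deviance-based dispersion

print(round(gcv, 4))  # 0.0951, matching the reported GCV
print(round(phi, 4))  # 0.089, matching the reported phi = 0.0890
```

Both recomputed values agree with the summary table, so the fused-count
``edof`` really is the quantity entering the GCV denominator.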
A negligible :math:`\lambda` leaves nearly every county with its own value
(157 distinct levels out of 159) and the fit barely improves on the raw
per-county sample means. Push :math:`\lambda` to 1.0 and the state is carved
into 11 large regions: a clearer summary, but now over-smoothed where the
truth is genuinely heterogeneous.

The ``edof`` column makes this concrete: it is the number of *independent*
coefficients the fit is actually using, after fusion. For ``network_lasso``,
:meth:`~gamdist.GAM.dof` counts distinct fitted values rather than the
unconstrained category count, so :math:`\lambda = 0.25` and
:math:`\lambda = 0.5` both report ``edof`` far below 159 even though every
county still has a (fused) prediction. That is what makes GCV --- which
weights the deviance by :math:`(1 - \mathrm{edof}/n)^{-2}` --- a usable
model-selection criterion here.

For a tighter optimum, sweep more :math:`\lambda` values around the minimum
and pick the one with the smallest GCV. Holding out part of the data and
choosing :math:`\lambda` by validation deviance gives the same answer up to
noise, and is the standard fallback when you mistrust the GCV approximation
(e.g., for very small samples or non-Gaussian families with a fixed
dispersion).

When to reach for ``network_lasso``
-----------------------------------

Network lasso is the right tool when:

- You have a categorical feature with a known graph structure (spatial
  adjacency, a hierarchy, a friend graph, a related-products graph) and you
  believe many connected categories share a coefficient.
- You want a *piecewise-constant* answer --- a small number of homogeneous
  clusters --- rather than a smoothly varying surface.

If you instead want smooth shrinkage along the graph (a quadratic penalty
:math:`\lambda \sum_{(i,j) \in E} w_{ij} (\beta_i - \beta_j)^2`, which never
collapses neighbors to identical values), swap ``network_lasso`` for
``network_ridge`` on the same edges DataFrame --- the rest of the API is
unchanged.
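The fuse-versus-shrink contrast has a clean closed form in the smallest
possible case: two nodes joined by one unit-weight edge. For the normal family
the subproblem is
:math:`\min_{\beta_1, \beta_2} (\beta_1 - y_1)^2 + (\beta_2 - y_2)^2 +
\lambda |\beta_1 - \beta_2|`, whose solution soft-thresholds the gap, so the
pair fuses exactly when :math:`|y_1 - y_2| \le \lambda`. This is an
illustrative derivation, not gamdist's actual solver:

```python
import math

def fused_pair(y1, y2, lam):
    """Closed-form minimizer of (b1-y1)^2 + (b2-y2)^2 + lam*|b1-b2|."""
    m, d = (y1 + y2) / 2, (y1 - y2) / 2
    # Soft-threshold half the gap at lam/2; delta == 0 means the pair fuses.
    delta = math.copysign(max(abs(d) - lam / 2, 0.0), d)
    return m + delta, m - delta

print(fused_pair(10.0, 10.25, 0.25))  # (10.125, 10.125): gap <= lam, fused
print(fused_pair(10.0, 11.0, 0.25))   # (10.125, 10.875): shrunk, not fused
```

The ridge analogue replaces the absolute value with a square, whose minimizer
only ever scales the gap down --- it never reaches zero, which is why
``network_ridge`` smooths without fusing.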