Every city in the United States publishes building permit requirements on its website. Zoning setbacks, required inspections, permit fees, contractor license requirements - it is all there, updated by planning departments, usually freely accessible without authentication. The problem is that manually reading 19,000+ city websites to extract this data is not a human-scale task. The solution is a pipeline that combines URL pattern discovery, targeted HTML crawling through a proxy, PDF extraction via PDF.js, and LLM-based structured extraction - all running in a browser-based tool with per-city status tracking.

This guide covers every layer of that pipeline with the exact code patterns and decision logic that make it work reliably at scale. "Scale" here means thousands of cities, not dozens - the techniques that work for 50 cities will break at 5,000 unless you account for timeout handling, batch pacing, failure logging, and data freshness.

Finding City Government URLs

The first challenge is that US municipalities do not follow a uniform URL convention. A small Texas city might be at haltomcitytx.gov while a neighboring one is at ci.grand-prairie.tx.us. You cannot look them all up by hand for 19,000 cities. Instead, try a ranked list of URL patterns using HEAD requests with a 3-second timeout, and take the first one that resolves.

The pattern priority order, from most to least common:

  1. {city}{state}.gov (e.g., austintx.gov)
  2. www.{city}{state}.gov
  3. ci.{city}.{state}.us (e.g., ci.austin.tx.us)
  4. cityof{city}.com
  5. {city}{state}.com
  6. {city}.org
  7. www.{city}.org
  8. {city}city.com
  9. city{city}.com
  10. {city}gov.com

When building the city slug, normalize to lowercase and strip spaces, hyphens, and apostrophes: "St. Paul" becomes "stpaul", "Winston-Salem" becomes "winstonsalem". State abbreviations go lowercase: "tx", "ca", "ny".
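Putting the slug rules and the pattern list together, a minimal sketch (function names are illustrative; the 3-second timeout and pattern order follow the list above):

```javascript
// Build a URL slug from a city name: lowercase, strip spaces, hyphens,
// apostrophes, and periods ("St. Paul" -> "stpaul").
function citySlug(name) {
  return name.toLowerCase().replace(/[\s\-'.]/g, '');
}

// Generate candidate URLs in priority order for a city/state pair.
function candidateUrls(city, state) {
  const c = citySlug(city);
  const s = state.toLowerCase();
  return [
    `https://${c}${s}.gov`,
    `https://www.${c}${s}.gov`,
    `https://ci.${c}.${s}.us`,
    `https://cityof${c}.com`,
    `https://${c}${s}.com`,
    `https://${c}.org`,
    `https://www.${c}.org`,
    `https://${c}city.com`,
    `https://city${c}.com`,
    `https://${c}gov.com`,
  ];
}

// Probe candidates with HEAD requests; return the first that resolves
// within 3 seconds, or null so the caller can mark url_not_found.
async function findCityUrl(city, state) {
  for (const url of candidateUrls(city, state)) {
    try {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), 3000);
      const res = await fetch(url, { method: 'HEAD', signal: controller.signal });
      clearTimeout(timer);
      if (res.ok) return url;
    } catch {
      // DNS failure, timeout, or network error - try the next pattern
    }
  }
  return null;
}
```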

| Pattern | Example | Success rate (est.) | Notes |
| --- | --- | --- | --- |
| {city}{state}.gov | austintx.gov | ~38% | Best for large cities |
| www.{city}{state}.gov | www.austintx.gov | ~12% | Often a redirect from pattern 1 |
| ci.{city}.{state}.us | ci.fresno.ca.us | ~10% | Common in CA, TX, WA |
| cityof{city}.com | cityofgarland.com | ~8% | Smaller TX/OK cities |
| {city}{state}.com | mesquitenm.com | ~6% | Rural municipalities |
| {city}.org | scottsdale.org | ~5% | Medium cities |
| www.{city}.org | www.chandler.org | ~4% | Same sites, www prefix |
| {city}city.com | gardencityny.com | ~3% | Disambiguation from state |
| city{city}.com | citytucson.com | ~2% | Less common |
| {city}az.gov (state-specific) | tempeaz.gov | ~3% | AZ pattern is common |
| {city}nc.gov | raleighnc.gov | ~3% | NC is consistent |
| {city}tx.gov | houstontx.gov | ~4% | TX cities are uniform |
| {city}ca.gov | lakedca.gov | ~3% | CA small cities |
| {city}fl.gov | orlandofl.gov | ~3% | FL cities |
| {city}mo.gov | kcmo.gov | ~1% | Abbreviation-style slugs |
| muni.{city}.{state}.us | muni.anchorage.ak.us | ~1% | Borough/municipality |
| townof{city}.com | townofhempstead.com | ~2% | Incorporated towns (NE) |
| villageof{city}.com | villageofelmwood.com | ~1% | IL/OH villages |
| {city}township.org | cranberrytownship.org | ~1% | PA townships |
| {county}county.gov | cookcounty.gov | ~1% | Unincorporated areas |

If none of the patterns resolve within the timeout, mark the city as url_not_found and move on. Do not block the queue on a single city. Log the city name and state so you can manually investigate the top failures later - many will be tiny unincorporated communities with no web presence.

Crawling for Permit-Related Pages

Once you have a valid homepage URL, fetch it and extract all anchor tags. You are not indexing the entire site - you are looking for the 5-10 pages most likely to contain permit and municipal services data.

Filter links by matching against two signals - the anchor text and the URL path - using a keyword list:

const PERMIT_KEYWORDS = [
  'zoning', 'ordinance', 'permit', 'building', 'construction',
  'trash', 'garbage', 'recycling', 'waste', 'bulk pickup',
  'tax', 'assessor', 'property tax', 'utilities', 'water',
  'planning', 'development', 'inspection', 'contractor',
  'code enforcement', 'variance', 'setback'
];

Normalize both the anchor text and the URL path to lowercase before matching. A link whose text reads "Building & Safety" and one whose path is /departments/community-development/permits both qualify. Collect up to 50 links per city - more than that and you are wasting LLM calls on navigation pages and press releases.

Always resolve relative URLs against the homepage before storing them. A link like /services/permits needs to become https://austintx.gov/services/permits. Use the URL constructor: new URL(href, homepageUrl).href - this handles all edge cases including protocol-relative URLs.

Do not follow external links. Check that the resolved URL's hostname matches (or is a subdomain of) the original city hostname. Some cities redirect their permit portal to a third-party SaaS product like Accela or Tyler Technologies - those are worth following one level deep since they often contain fee schedules.
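The filtering, resolution, and same-host rules above can be sketched as one function (the anchor-object shape and function name are assumptions; the keyword list here is a subset of the full `PERMIT_KEYWORDS` for brevity):

```javascript
// Subset of the full keyword list, for illustration.
const PERMIT_KEYWORDS = ['zoning', 'permit', 'building', 'trash', 'recycling',
  'tax', 'utilities', 'planning', 'inspection', 'contractor', 'setback'];

// Given anchors scraped from a homepage as [{ href, text }], return up to
// `limit` absolute, same-site URLs whose text or path matches a keyword.
function selectPermitLinks(anchors, homepageUrl, limit = 50) {
  const home = new URL(homepageUrl);
  const seen = new Set();
  const out = [];
  for (const { href, text } of anchors) {
    let resolved;
    try {
      // Handles relative and protocol-relative hrefs against the homepage.
      resolved = new URL(href, homepageUrl);
    } catch {
      continue; // malformed href
    }
    if (!/^https?:$/.test(resolved.protocol)) continue; // skip mailto:, javascript:, etc.
    // Same host or subdomain of the city site only.
    const host = resolved.hostname;
    if (host !== home.hostname && !host.endsWith('.' + home.hostname)) continue;
    const haystack = (text + ' ' + resolved.pathname).toLowerCase();
    if (!PERMIT_KEYWORDS.some(k => haystack.includes(k))) continue;
    if (seen.has(resolved.href)) continue;
    seen.add(resolved.href);
    out.push(resolved.href);
    if (out.length >= limit) break;
  }
  return out;
}
```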

Respect robots.txt. Fetch it once per domain, cache it in memory for the session, and check it before fetching each page. The robots-parser npm package handles this correctly if you are running server-side. For a browser-based tool, a simple pattern check against Disallow: lines is usually sufficient since municipal robots.txt files are rarely complex.

Fetching HTML Through a CORS Proxy

Browser-based tools cannot fetch arbitrary third-party URLs due to CORS restrictions. Government websites rarely set permissive CORS headers - they were not designed with third-party access in mind. You need a proxy.

The pattern for the Homeowner.wiki local proxy (or any compatible CORS proxy):

async function fetchViaProxy(targetUrl) {
  const proxyBase = 'http://localhost:3456/proxy';
  const url = `${proxyBase}?url=${encodeURIComponent(targetUrl)}`;

  const res = await fetch(url, {
    headers: {
      'User-Agent': 'HomeownerWiki/1.0 (+https://homeowner.wiki/bot.html)'
    }
  });

  if (!res.ok) {
    throw new Error(`Proxy fetch failed: ${res.status} for ${targetUrl}`);
  }

  // Check content type before assuming HTML
  const contentType = res.headers.get('content-type') || '';
  if (contentType.includes('application/pdf')) {
    return { type: 'pdf', buffer: await res.arrayBuffer() };
  }

  return { type: 'html', text: await res.text() };
}

Always include a descriptive User-Agent header with a contact URL. Small city IT staff will occasionally see bot traffic in their logs. A transparent User-Agent string with a way to contact you is the difference between getting blocked and getting a curious email.

Add a 2-second delay between page fetches within the same city. Municipal servers are often shared hosting with minimal capacity. Hammering a small town's web server to extract its fence permit page is a bad look and risks getting your IP blacklisted. The delay costs almost nothing at scale relative to LLM processing time.

Server-side tools (Node.js, Python) can skip the proxy entirely. Use node-fetch or Python's httpx with the same User-Agent header and delay logic. Add a 10-second request timeout - some municipal servers respond extremely slowly.

Handling PDFs

A significant minority of cities - particularly smaller ones and those that have not updated their web presence since the early 2010s - publish their building codes and permit fee schedules as PDF documents. You will encounter these when the proxy returns a Content-Type: application/pdf response instead of HTML.

For client-side extraction, PDF.js (the same library Firefox uses internally) works well and is available from a CDN:

import * as pdfjsLib from 'https://cdn.jsdelivr.net/npm/pdfjs-dist@4.0.379/build/pdf.min.mjs';
pdfjsLib.GlobalWorkerOptions.workerSrc =
  'https://cdn.jsdelivr.net/npm/pdfjs-dist@4.0.379/build/pdf.worker.min.mjs';

async function extractPdfText(arrayBuffer) {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  let text = '';

  // Most permit info is in the first 8 pages
  const pagesToRead = Math.min(pdf.numPages, 8);

  for (let i = 1; i <= pagesToRead; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    text += content.items.map(item => item.str).join(' ') + '\n\n';

    if (text.length > 15000) break; // Truncate to keep LLM context manageable
  }

  return text.slice(0, 15000).trim(); // enforce the 15,000-char cap
}

Truncate to 15,000 characters. This is enough for any permit fee schedule or zoning ordinance summary. Sending an entire 200-page building code to an LLM is expensive and the model will not extract better data from it - permit fees and setback requirements are always in the first chapter or in a summary table near the front.

When you extract text from a PDF, note in your stored data that the source was a PDF rather than an HTML page. PDF text extraction can produce odd spacing and merged words, so LLM extraction prompts should account for this by instructing the model to look for numeric patterns (fees) and dimensional patterns (setback measurements in feet) even if the surrounding text is garbled.

LLM Extraction Prompts That Actually Work

The quality of structured extraction depends almost entirely on the prompt. Three principles matter most: (1) tell the model to return JSON only with no surrounding text, (2) specify exactly what to return for missing values - null, not "N/A" or "unknown" or "not mentioned", and (3) use temperature 0.1 to minimize hallucination on factual extraction tasks.

Fence permit extraction prompt:

You are a data extraction assistant. Extract fence permit information from the text below.
Return ONLY valid JSON with no explanation, no markdown code blocks, no surrounding text.
If a value is not found in the text, return null for that field.
Do not guess or infer - only extract explicitly stated information.

{
  "permit_required": boolean or null,
  "permit_cost": number or null,
  "cost_notes": string or null,
  "max_height_front_yard_ft": number or null,
  "max_height_backyard_ft": number or null,
  "setback_from_property_line_ft": number or null,
  "materials_allowed": string or null,
  "processing_days": number or null,
  "contractor_license_required": boolean or null,
  "source_url": "{SOURCE_URL}"
}

TEXT:
{HTML_TEXT}

Zoning rules extraction prompt (similar structure but different fields):

{
  "residential_zone": string or null,
  "min_lot_size_sqft": number or null,
  "max_building_height_ft": number or null,
  "front_setback_ft": number or null,
  "rear_setback_ft": number or null,
  "side_setback_ft": number or null,
  "max_lot_coverage_pct": number or null,
  "adu_allowed": boolean or null,
  "adu_notes": string or null
}

Trash and recycling schedule prompt:

{
  "trash_pickup_days": array of strings or null,
  "recycling_pickup_days": array of strings or null,
  "bulk_pickup_frequency": string or null,
  "bulk_pickup_schedule_notes": string or null,
  "hazardous_waste_dropoff": string or null,
  "holiday_delays": boolean or null
}

Always strip HTML tags before sending to the LLM. The model does not need navigation markup, script tags, or boilerplate - it makes the context window longer and the extraction worse. A simple regex-based strip is sufficient: html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim(). Truncate to 8,000 characters for HTML sources and 15,000 for PDF text.
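The stripping and truncation rules, plus the `{SOURCE_URL}` / `{HTML_TEXT}` placeholder substitution used in the prompts above, fit in two small helpers (function names are illustrative):

```javascript
// Strip tags and collapse whitespace before sending page text to the LLM.
function stripHtml(html) {
  return html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Fill the prompt template's placeholders, truncating to 8,000 chars
// for HTML sources and 15,000 for PDF text.
function buildPrompt(template, sourceUrl, text, sourceType = 'html') {
  const limit = sourceType === 'pdf' ? 15000 : 8000;
  return template
    .replace('{SOURCE_URL}', sourceUrl)
    .replace('{HTML_TEXT}', text.slice(0, limit));
}
```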

Validating and Normalizing the Output

Even at temperature 0.1, LLMs occasionally return malformed JSON - a trailing comma, a missing closing brace, or sometimes a brief apology sentence before the JSON block. Your extraction pipeline must handle these gracefully without crashing.

function parseExtractionResult(rawOutput) {
  // Try direct parse first
  try {
    return { success: true, data: JSON.parse(rawOutput) };
  } catch (e) {
    // Try to extract JSON block from response
    const match = rawOutput.match(/\{[\s\S]*\}/);
    if (match) {
      try {
        return { success: true, data: JSON.parse(match[0]) };
      } catch (e2) {
        // Fall through to failure
      }
    }
    return { success: false, raw: rawOutput, error: e.message };
  }
}

function validatePermitData(data) {
  const errors = [];

  if (data.permit_required !== null && typeof data.permit_required !== 'boolean') {
    errors.push('permit_required must be boolean or null');
    data.permit_required = null; // Sanitize
  }

  if (data.permit_cost !== null && typeof data.permit_cost !== 'number') {
    // Try to coerce "75.00" to 75
    const coerced = parseFloat(data.permit_cost);
    data.permit_cost = isNaN(coerced) ? null : coerced;
  }

  if (data.processing_days !== null && typeof data.processing_days !== 'number') {
    const coerced = parseInt(data.processing_days, 10);
    data.processing_days = isNaN(coerced) ? null : coerced;
  }

  return { data, errors };
}

Store the raw LLM output string alongside the parsed and validated object. This is critical. When you improve your extraction prompt three months from now, you can re-run the extraction against the already-stored raw output without re-scraping any city websites. Re-scraping is slow and puts load on city servers. Re-running extraction against cached raw output is fast and free (or near-free).

Storing City Data in IndexedDB

IndexedDB is the right storage layer for this data. You are dealing with potentially tens of thousands of city records, each with multiple data types. localStorage's 5MB limit would be hit within the first few hundred cities.

Key cities by a canonical state/city string in lowercase with hyphens replacing spaces: "tx/austin", "ca/los-angeles", "ny/new-york". This makes lookups fast and predictable.
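A small helper keeps key construction in one place (the function name is an assumption):

```javascript
// Canonical IndexedDB key: "state/city", lowercase, spaces replaced by hyphens.
function cityKey(state, city) {
  const norm = s => s.toLowerCase().trim().replace(/\s+/g, '-');
  return `${norm(state)}/${norm(city)}`;
}
```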

const CITY_SCHEMA = {
  key: 'tx/austin',         // state/city slug
  source_url: string,       // resolved homepage URL
  scraped_at: ISO8601,      // when the HTML was fetched
  extraction_model: string, // e.g. "claude-3-5-haiku"
  raw_html_length: number,  // character count before truncation
  permit_pages: [           // array of crawled sub-pages
    {
      url: string,
      content_type: 'html' | 'pdf',
      raw_text_length: number,
      llm_raw_output: string,    // store the raw string
      extracted: {               // parsed and validated object
        fence_permit: {...},
        zoning: {...},
        trash: {...}
      },
      extracted_at: ISO8601
    }
  ],
  status: 'complete' | 'partial' | 'url_not_found' | 'error',
  error_message: string | null
};

Include a freshness check before scraping: if scraped_at exists and is less than 90 days ago, skip the city and use the cached data. Permit requirements do not change frequently - annual budget cycles might update fees, and major zoning changes get news coverage. A 90-day TTL is conservative enough to stay reasonably current without constant re-scraping.
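The TTL check is a one-liner worth isolating so every code path uses the same window (names are illustrative):

```javascript
const FRESHNESS_DAYS = 90;

// True if a record scraped at `scrapedAt` (ISO 8601 string) is still
// within the TTL; missing timestamps always count as stale.
function isFresh(scrapedAt, now = Date.now()) {
  if (!scrapedAt) return false;
  const ageMs = now - Date.parse(scrapedAt);
  return ageMs < FRESHNESS_DAYS * 24 * 60 * 60 * 1000;
}
```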

Scaling to Thousands of Cities

Processing 19,000 cities sequentially at 15 seconds per city (scrape + LLM) takes 79 hours. In practice you want to process in parallel batches of 10, with a 2-second delay between batch starts, which gets you to roughly 8 hours for a full run. But the architecture matters more than the parallelism factor.

async function processCityBatch(cities, batchSize = 10) {
  let stopRequested = false;
  // Hook this to a visible Stop button in your UI
  window.stopProcessing = () => { stopRequested = true; };

  for (let i = 0; i < cities.length; i += batchSize) {
    if (stopRequested) {
      logEvent(`Stopped at city ${i} of ${cities.length}. Progress saved.`);
      break;
    }

    const batch = cities.slice(i, i + batchSize);
    await Promise.allSettled(batch.map(city => processSingleCity(city)));

    // Update progress indicator
    updateProgressUI(Math.min(i + batchSize, cities.length), cities.length);

    // Delay between batches
    await new Promise(r => setTimeout(r, 2000));
  }
}

The stopRequested flag checked at each batch boundary is essential. Users need to be able to pause a multi-hour job, close their laptop, and resume later. Since you are writing each city result to IndexedDB immediately on completion, all completed work survives the pause. The next run skips cities with scraped_at timestamps within the freshness window.

Log every failure with a structured error type so you can retry selectively:

  • url_not_found - none of the URL patterns resolved
  • fetch_error - proxy returned non-200 or timed out
  • parse_error - HTML could not be parsed for links
  • llm_error - LLM API returned an error
  • llm_json_error - LLM returned unparseable JSON
  • cloudflare_block - 403 with Cloudflare challenge page

Export failed cities as CSV with their error type. url_not_found failures can be reviewed manually in bulk - often 20-30 minutes of manual URL research resolves the majority. cloudflare_block failures can be retried with a longer delay or processed manually. llm_json_error failures are worth re-running after improving the prompt.
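A sketch of that export, assuming each stored record carries a hypothetical `error_type` field holding one of the structured types listed above:

```javascript
// Turn failed city records into a CSV string for manual review.
// Records without an error_type (i.e. successes) are skipped.
function failuresToCsv(records) {
  const escape = v => `"${String(v ?? '').replace(/"/g, '""')}"`;
  const rows = records
    .filter(r => r.error_type)
    .map(r => [r.key, r.error_type, r.error_message].map(escape).join(','));
  return ['key,error_type,error_message', ...rows].join('\n');
}
```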

Legal Considerations and Scraping Etiquette

Content published by the federal government is public domain under 17 U.S.C. § 105, which excludes works of the US government from copyright protection. State and local government works occupy a grayer area - most states follow a similar doctrine, but a handful have asserted copyright over government publications. In practice, cities have shown no appetite for pursuing data aggregators that display permit requirements. The data is meant to be public.

That said, responsible scraping means:

  • Respect robots.txt - if the city disallows crawling, honor it and flag the city for manual review
  • Identify yourself - the User-Agent string should include your project name and a contact URL
  • Cache aggressively - re-scraping a city every day when permit requirements change maybe twice a year is pointless load on their servers
  • Honor rate limits - the 2-second inter-request delay protects small municipal servers running on shared hosting
  • Watch for Cloudflare - if you get consistent 403 responses with a Cloudflare challenge, add a longer delay (10-30 seconds) or skip that city and add it to a manual review queue

A city clerk who notices bot traffic and finds your User-Agent string helpful ("HomeownerWiki - helping homeowners find permit info") is far less likely to block you than one who sees a generic Python requests user agent hitting the site 100 times a minute.

Homeowner.wiki's Municipal Scraper handles URL discovery, proxy requests, PDF extraction, and LLM parsing in one workflow - with per-city status badges and editable prompt templates. You can inspect the raw LLM output for any city, re-run extraction with an updated prompt, and export your data at any point.

For more context on how this data powers local SEO content, see how to build a local SEO site using government data and the complete guide to building fence permit guides for all 50 states.

Access the Full Municipal Scraping Workflow

Join the waitlist to access Homeowner.wiki's complete scraping pipeline - URL discovery, proxy fetching, PDF extraction, and LLM parsing in one browser-based tool.

Join the Waitlist