Project 11/03/2026

Tracking event schedules across 500+ Instagram accounts. I built a headless, self-hosted pipeline.

Manual data entry burns cycles. The end goal of this architecture is an aggregated event tracker website that alerts users exactly when specific ustadz and masjids hold events. However, my 500+ target accounts publish exclusively on Instagram, where the critical scheduling data is locked inside heavily compressed JPEGs and ephemeral 24-hour stories. It is a hostile environment for structured data extraction, and I refuse to click through timelines manually. So I built a self-hosted, headless pipeline: it rips data from the backend APIs, extracts embedded text using machine vision, and normalizes it into a PostgreSQL database.

System Architecture

The stack is modular and heavily biased toward self-hosting to minimize dependencies on external SaaS providers. I control the hardware and the data flow.

  • Core Brain: n8n. This is the integration layer. It executes API calls, manages webhooks, and orchestrates the sub-workflows. I run it self-hosted to avoid execution limits.
  • Ingestion Vector: RapidAPI Instagram endpoints. Writing a Selenium scraper is a losing battle against DOM obfuscation. This API provides direct access to the backend JSON responses.
  • Machine Vision Model: PaddleOCR. Deployed in a customized Docker container on a Hetzner VPS. I provisioned this specifically for local inference. It bypasses third-party API costs.
  • Storage Layer: Hetzner S3 Object Storage handles the raw media binaries. Supabase (PostgreSQL) stores the structured metadata, state flags, and OCR string outputs.
  • Content Delivery: Bunny CDN connected directly to the Hetzner S3 bucket. It handles edge-cached image serving on the frontend review interface.

The Three-Pillar Workflow Logic

A single monolithic script is brittle. The integration logic is decoupled into three primary, isolated n8n workflows. If one phase fails, it does not cascade and drop the entire execution queue.

1. The Feed Scraper

This sequence targets permanent grid data. The RapidAPI endpoint imposes strict limits: a maximum rate of 1 request per second and a hard daily cap of 3,000 requests. Violating the rate triggers HTTP 429 responses and soft bans. Across our 500 targets, we have not hit the 3,000-request ceiling, but burst frequency is a threat. I injected a hard 3-second wait node between scraping iterations. This throttles the execution loop and maintains a clean traffic signature.
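The wait-node throttle is simple enough to sketch outside n8n. This is a minimal Python equivalent of the sequential loop, not the actual workflow; `fetch_fn` stands in for the real RapidAPI call.

```python
import time

def scrape_sequentially(uids, fetch_fn, wait_seconds=3.0):
    """Iterate over target UIDs one at a time, pausing between
    requests to stay well under the 1 req/s API limit. This mirrors
    the n8n wait node; fetch_fn is a placeholder for the API call."""
    results = {}
    for i, uid in enumerate(uids):
        results[uid] = fetch_fn(uid)
        if i < len(uids) - 1:  # no pointless sleep after the last target
            time.sleep(wait_seconds)
    return results
```

One request every 4 seconds (1s call + 3s wait) keeps the daily total for 500 targets far below the 3,000-request cap, with zero bursts.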

The feed cron schedule fires at 09:00 UTC+7 every 48 hours. The extraction sequence queries PostgreSQL for active target UIDs. An iterator cycles through them sequentially. I explicitly avoided batch or parallel processing here. Uploading binary payloads—whether highly compressed JPEGs or MP4s—to the Hetzner S3 bucket represents our primary point of failure. Sequential execution isolates upload timeouts. If an S3 write stream breaks, it corrupts a single record rather than dropping an entire parallel batch array.

A custom Switch node parses the media_type string. Standard images download directly. Carousels trigger a nested array loop. Video binaries scrape the thumbnail. All binary files pipeline directly to Hetzner S3.
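The Switch node's routing logic can be expressed as a small dispatch function. This is an illustrative sketch: the field names (`media_type` as a string, `image_url`, `carousel_media`, `thumbnail_url`) are stand-ins, not the exact RapidAPI response keys.

```python
def collect_binary_urls(post: dict) -> list[str]:
    """Mirror of the n8n Switch node: decide which binaries get
    pushed to S3 based on the post's media type."""
    media_type = post.get("media_type")
    if media_type == "image":
        return [post["image_url"]]
    if media_type == "carousel":
        # Nested loop over the carousel's child items
        return [item["image_url"] for item in post["carousel_media"]]
    if media_type == "video":
        # Videos contribute only their thumbnail frame
        return [post["thumbnail_url"]]
    return []  # unknown types are skipped, not fatal
```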

2. The Story Scraper

Stories represent volatile data. They self-destruct after 24 hours, so a 48-hour polling cycle guarantees data loss. The Story Scraper workflow polls the target UIDs via cron at 21:00 UTC+7 daily. This timing is deliberate: 9 PM marks the end of the standard posting cycle, when most accounts have concluded their outbound event updates for the day.

The sequence checks the target's activity index for live story objects. If one exists, it pulls the specific media URLs and downloads the binary asset directly to Hetzner S3, bypassing the standard feed logic. Once the file is secure on Hetzner hardware, the original source asset's expiration is structurally irrelevant. We own the copy. The payload immediately triggers the OCR extraction sequence.
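The validate-then-download step reduces to building a task list from whatever story objects are live. A hedged sketch; the keys (`stories`, `pk`, `media_url`) are illustrative stand-ins for the real API response shape.

```python
def live_story_tasks(profile: dict) -> list[dict]:
    """Turn a profile's active story reel into S3 download tasks.
    Malformed story objects (missing media URLs) are skipped so one
    bad entry cannot break the whole reel."""
    tasks = []
    for story in profile.get("stories", []):
        if not story.get("media_url"):
            continue
        tasks.append({
            "s3_key": f"stories/{story['pk']}.jpg",
            "source_url": story["media_url"],
        })
    return tasks
```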

3. The AI Classifier

The entire pipeline exists to convert pixels into searchable strings. Processing 500 target accounts generates massive image volume. Defaulting to cloud-based LLM vision models introduces unacceptable latency spikes and financial overhead per request. Scanning hundreds of event flyers requires localized execution.

I deployed a Docker container running PaddleOCR on the Hetzner VPS. PaddleOCR excels at structured text extraction in high-noise environments. Event flyers have terrible contrast, complex typography, and heavy artifacting from compression. It maps bounding boxes over the image matrix. It outputs raw, unformatted string data. It operates extremely quickly at a fraction of a cent per megabyte processed.

The AI Classifier acts as the data structurer. It executes on demand, querying Supabase for historical records lacking classification values. For each record it isolates the S3 path and pulls the image directly from the Hetzner bucket, bypassing the Instagram API entirely to eliminate ban risk. The cached image runs through PaddleOCR, the required data points are parsed from the raw OCR blocks, and an upsert lands the structured result in PostgreSQL. Historical data is salvaged without external API dependency.
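The parsing step is where raw OCR blocks become fields. This is one simplified way to do it, using the example string from the normalization schema below; the two regexes are an assumption for illustration, and a real flyer corpus needs a far larger pattern set.

```python
import re

def parse_ocr_blocks(ocr_text: str) -> dict:
    """Extract the event time and speaker from a raw OCR string.
    Simplified sketch: matches 'HH:MM WIB' times and a
    'PEMATERI USTADZ <name>' speaker line."""
    time_match = re.search(r"(\d{1,2}[:.]\d{2})\s*WIB", ocr_text)
    speaker_match = re.search(r"PEMATERI\s+(USTADZ\s+\w+)", ocr_text)
    return {
        "event_time": time_match.group(1) if time_match else None,
        "speaker": speaker_match.group(1) if speaker_match else None,
    }
```

Fields that fail to parse stay `None`, so the record drops into the manual review queue instead of shipping bad data.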

Error Handling and State Management

Networks drop. External APIs time out mid-execution. A sequential workflow can fail halfway down a 500-user UID list. State management handles recovery at the atomic level.

I built explicit error handling nets into the sub-workflows. At every discrete stage—API fetching, S3 uploading, OCR inference—the executing node logs its terminal status. The Supabase schema maintains a strict status column for every media ID generated. The acceptable values are finished, failed, or error.

Currently, there is no automatic resume built into the pipeline logic. System recovery requires manual intervention: I filter the SQL table for the failed or error flags and manually re-trigger the n8n ingestion payload against those specific UIDs. This maintains absolute control over the execution state. It also prevents infinite retry loops from draining our RapidAPI monthly credits in the background.
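The manual recovery pass boils down to one filter over the status column. A sketch of that logic in Python (in production this is a SQL query against Supabase, not an in-memory scan):

```python
def uids_needing_retry(records: list[dict]) -> list[str]:
    """Collect the distinct target UIDs whose media rows are flagged
    'failed' or 'error', preserving first-seen order, so the n8n
    ingestion payload can be re-triggered against exactly those."""
    seen = []
    for row in records:
        if row["status"] in ("failed", "error") and row["uid"] not in seen:
            seen.append(row["uid"])
    return seen
```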

Validation and The Review UI

System trust must be verified. PaddleOCR is robust, but scanning 500 unpredictable sources guarantees edge cases: bad lighting and obscure fonts produce false positives. Every extracted record lands in Supabase with a default status value, preventing unchecked data from hitting the public frontend.

// Normalization Schema
{
  "username": "masjid_pusat_01",
  "post_pk": "31415926535",
  "caption": "Jadwal kajian ahad ini...",
  "ocr_text": "TABLIGH AKBAR 15:30 WIB PEMATERI USTADZ FULAN",
  "media_type": "carousel",
  "s3_path": "hetzner-core/31415926535.jpg",
  "status": "pending_approval"
}

I deployed a custom AI-generated photo viewer dashboard to act as the gatekeeper. The UI hooks directly into the Supabase backend. The architectural layout splits visually down the center.

The left pane renders the target source image. This image stream routes through Bunny CDN, edge-caching our Hetzner S3 bucket to ensure ultra-low latency rendering for the reviewer. The right pane displays the extracted PostgreSQL metadata and PaddleOCR string outputs.

I review the image adjacent to the OCR output. I verify the scheduled dates and ustadz spelling. I trigger an "Approve" HTTP request. The status flag updates to active, pushing the structured data out to the public event tracking site.

Architectural Limits and Scaling

The current infrastructure handles 500 target accounts reliably. Projecting scale to 5,000 accounts breaks the foundational assumptions.

Sequential processing forms an immediate bottleneck. Executing consecutive 3-second wait nodes across 5,000 targets guarantees the execution cycle bleeds outside the daily cron windows. I would need to strip out the sequential iterations. The replacement requires parallel sub-workflow processing or heavy chunked batching.
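The arithmetic behind that bottleneck, plus a chunking sketch for the batched replacement. The 4-seconds-per-target figure is my assumption (roughly 1s of request time plus the 3s wait); the batch size is arbitrary.

```python
def chunked(uids: list[str], size: int) -> list[list[str]]:
    """Split the target list into batches, each of which could run
    as an independent parallel n8n sub-workflow."""
    return [uids[i:i + size] for i in range(0, len(uids), size)]

def cycle_hours(n_targets: int, seconds_per_target: float = 4.0) -> float:
    """Back-of-envelope runtime for one fully sequential pass."""
    return n_targets * seconds_per_target / 3600
```

At 500 targets one sequential pass takes roughly half an hour; at 5,000 it stretches past five hours before accounting for carousel children, retries, and upload time.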

The RapidAPI 3,000 daily limit would instantly cap out. The ingestion vector requires a complete overhaul. We must migrate to enterprise API tiers or heavily rotated residential proxy networks.

The localized PaddleOCR instance on the Hetzner VPS would bottleneck compute resources. A single container cannot handle concurrent inferences effectively at that scale. I would provision a Docker Swarm cluster to distribute the image processing workload across multiple inference workers to maintain pipeline throughput.

For now, the system replaces hours of manual scrolling with absolute structural consistency. The infrastructure handles the volatile APIs, the strict pagination logic, the media encoding, and the machine vision. I simply verify the text strings. The pipeline executes.