Project · November 30, 2025

DailyOCR Pipeline: Automating Document Processing from Dropbox to the Cloud

How I built a fully automated, fault-tolerant OCR pipeline in Go at Telekomindo — turning hundreds of scanned PDFs into searchable documents and syncing them to cloud storage every single day, without human intervention.


The Problem

At Telekomindo, the operations team dealt with a recurring headache: hundreds of PDF documents — invoices, reports, signed contracts — were being uploaded daily to a shared Dropbox folder. Most of these were scanned images, meaning they were completely unsearchable. Finding a specific document meant opening files one by one and manually scanning through pages.

On top of that, there was a compliance requirement to archive all processed documents to an S3-compatible cloud storage provider (DigitalOcean Spaces) for long-term retention and backup. The team was doing all of this by hand: download from Dropbox, run OCR on a local machine, re-upload the result to Dropbox, then upload it again to Spaces. It was slow, error-prone, and a massive waste of time.

I was tasked with building an end-to-end solution that would automate this entire workflow — reliably, efficiently, and with zero daily oversight.

The Solution: DailyOCR Pipeline

I built DailyOCR — a CLI tool written entirely in Go — that acts as a two-stage sequential pipeline, designed to run as a daily cron job. Drop PDFs into a Dropbox folder, and the pipeline takes care of the rest: OCR processing, archival, deduplication, and cloud sync.

High-Level Flow

STAGE 1: OCR Processing & Archival
───────────────────────────────────
Dropbox/Input
  └─► Download to temp
       └─► SHA-256 hash
            └─► SQLite lookup ── Duplicate? → Skip
                 └─► New file
                      ├─► Digitally signed? → Skip OCR, preserve signature
                      └─► Unsigned → Run ocrmypdf
                           └─► Upload to Dropbox/Final
                                └─► Archive original → Dropbox/Archive
                                     └─► Record hash in SQLite


STAGE 2: Cloud Sync
───────────────────
Dropbox/Final
  └─► For each file (4 concurrent workers)
       └─► S3 HeadObject check ── Already exists? → Skip
            └─► Missing → Stream: Dropbox → Memory → S3
                 (No local disk buffering)

Why Go?

The decision to use Go was deliberate and driven by several factors unique to this project:

  • Single binary deployment — The tool runs on a Windows server via Windows Task Scheduler. Go compiles to a single static binary (DailyOCR.exe), meaning zero runtime dependencies. No Python virtual environments, no Node.js installations. Just drop the binary and a .env file and it runs.
  • Native concurrency — Both stages benefit from parallelism. Go's goroutines and channels gave me a natural and efficient way to implement worker pools without the complexity of threading libraries.
  • Low memory footprint — The server wasn't a beefy machine. Go's ability to stream data without buffering entire files in memory made it a perfect fit.
  • Strong standard library — OS signal handling, context cancellation, file I/O, hashing — Go's standard library covered 80% of what I needed out of the box.

Architecture Deep Dive

The project follows a clean modular structure using Go's internal/ package convention, which prevents external packages from importing internal logic:

DailyOCR/
├── main.go                               # Entry point & orchestration
├── internal/
│   ├── config/config.go                  # Env loading & directory resolution
│   ├── storage/
│   │   ├── types.go                      # PipelineStorage interface
│   │   ├── storage_dropbox.go            # Dropbox API v2 implementation
│   │   ├── dropbox_oauth.go              # OAuth2 refresh token client
│   │   ├── dropbox_oauth_interactive.go  # One-time browser auth flow
│   │   └── dropbox_oauth_helpers.go      # Token exchange & .env persistence
│   ├── pipeline/
│   │   ├── pipeline.go                   # OCR worker pool & file processing
│   │   └── pdf_utils.go                  # Digital signature detection
│   ├── cloud/
│   │   └── s3_stream.go                  # Dropbox→S3 streaming & existence check
│   └── history/
│       └── history.go                    # SQLite deduplication engine
└── go.mod

The Storage Abstraction Layer

One of the first design decisions I made was to abstract storage behind a PipelineStorage interface. Even though the current backend is strictly Dropbox, I wanted the architecture to be extensible — if we ever needed to swap to a local filesystem or Google Drive, the pipeline logic wouldn't need to change at all.

type PipelineStorage interface {
    ListPDFNames(inputDir string) ([]string, error)
    GetInputLocalPath(inputDir, name string) (localPath string, cleanup func() error, err error)
    SaveOutput(outputDir, name, localPath string) (string, error)
    ArchiveOriginal(inputDir, archiveDir, srcName, dstName string) error
    EnsureDir(path string) error
}

The Dropbox implementation (DropboxStorage) uses the official Dropbox SDK for Go. It communicates with the Dropbox API v2 to list files, download them to a local temp directory for OCR processing, upload results, and move originals to an archive folder — all through the API, no local Dropbox sync client needed.

Design Decision: Dependency Injection — The DropboxStorage instance is created in main.go and injected into the Config struct via its Storage field. The pipeline package has no direct dependency on Dropbox — it only knows about the PipelineStorage interface. This makes unit testing significantly easier and keeps the packages decoupled.
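
In condensed form, the wiring looks something like this. The constructor and field names here (NewDropboxStorage, pipeline.Run, InputDir, and friends) are illustrative, not the exact identifiers in the repo:

// internal/config/config.go (illustrative)
type Config struct {
    InputDir   string
    FinalDir   string
    ArchiveDir string

    // Storage is injected from main.go. The pipeline packages only ever
    // see this interface, never the concrete DropboxStorage type.
    Storage storage.PipelineStorage
}

// main.go (illustrative)
store := storage.NewDropboxStorage(appKey, appSecret, refreshToken)
cfg := &config.Config{
    InputDir:   os.Getenv("OCR_INPUT_DIR"),
    FinalDir:   os.Getenv("OCR_FINAL_DIR"),
    ArchiveDir: os.Getenv("OCR_ARCHIVE_DIR"),
    Storage:    store,
}
pipeline.Run(ctx, cfg) // depends only on cfg.Storage's interface methods

Swapping in a local-filesystem or Google Drive backend would then mean writing another type that satisfies PipelineStorage and changing one line in main.go.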

OAuth2 with Automatic Token Refresh

Dropbox deprecated long-lived access tokens in favor of short-lived tokens with refresh tokens. I built a complete OAuth2 flow that handles this transparently. The newDropboxOAuthClient function wraps Go's golang.org/x/oauth2 library to create an HTTP client that automatically refreshes expired tokens behind the scenes:

// newDropboxOAuthClient returns an *http.Client whose transport injects a
// valid access token on every request and refreshes it when it expires.
func newDropboxOAuthClient(appKey, appSecret, refreshToken string) *http.Client {
    cfg := &oauth2.Config{
        ClientID:     appKey,
        ClientSecret: appSecret,
        Endpoint: oauth2.Endpoint{
            TokenURL: "https://api.dropboxapi.com/oauth2/token",
        },
    }
    // Seed the token source with only the refresh token; short-lived access
    // tokens are fetched and renewed on demand.
    seed := &oauth2.Token{RefreshToken: refreshToken}
    ts := cfg.TokenSource(context.Background(), seed)
    return oauth2.NewClient(context.Background(), ts)
}

For first-time setup, I also built an interactive OAuth flow that spins up a temporary localhost HTTP server, opens the system browser for Dropbox authorization, captures the auth code via callback, exchanges it for a refresh token, and automatically persists it back to the .env file. This means onboarding is a single command — no manual token copying required.
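
Stripped of the browser launching, state verification, and the .env write-back, the interactive flow reduces to a temporary callback server plus a token exchange. A simplified sketch (the port and handler path are arbitrary choices here, not the tool's actual ones):

func runInteractiveAuth(cfg *oauth2.Config) (refreshToken string, err error) {
    cfg.Endpoint.AuthURL = "https://www.dropbox.com/oauth2/authorize"
    cfg.RedirectURL = "http://localhost:8080/callback"

    // Temporary server that captures the auth code from the redirect.
    codeCh := make(chan string, 1)
    mux := http.NewServeMux()
    mux.HandleFunc("/callback", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "Authorization received. You can close this tab.")
        codeCh <- r.URL.Query().Get("code")
    })
    srv := &http.Server{Addr: ":8080", Handler: mux}
    go srv.ListenAndServe()
    defer srv.Shutdown(context.Background())

    // token_access_type=offline is what makes Dropbox return a refresh token.
    authURL := cfg.AuthCodeURL("state",
        oauth2.SetAuthURLParam("token_access_type", "offline"))
    fmt.Println("Open this URL in your browser:\n" + authURL)

    code := <-codeCh
    tok, err := cfg.Exchange(context.Background(), code)
    if err != nil {
        return "", fmt.Errorf("token exchange: %w", err)
    }
    return tok.RefreshToken, nil
}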

The Workflow Logic: Step by Step

Stage 1: OCR Processing & Archival

When the pipeline starts, Stage 1 kicks off by listing all PDF files in the configured OCR_INPUT_DIR on Dropbox. It then spins up a pool of workers (equal to the number of CPU cores) that process files concurrently. Here's exactly what happens for each file:

  1. Download to Temp — The file is downloaded from Dropbox to a local temporary directory via the Dropbox API. This is necessary because ocrmypdf operates on local files. The download is wrapped in an exponential backoff retry mechanism (3 attempts) to handle transient network errors gracefully.
  2. SHA-256 Hash Calculation — A cryptographic hash of the file is computed. This hash acts as the unique fingerprint for the file's content.
  3. Deduplication Check (SQLite) — The hash is looked up in a local SQLite database. If it already exists, the file has been processed before — the pipeline skips it immediately. This is the first layer of deduplication, and it's incredibly fast since it's a primary key lookup.
  4. Digital Signature Detection — Before running OCR, the pipeline scans the PDF's raw bytes in 32KB chunks for signature dictionary markers (/ByteRange, /Type /Sig, /FT /Sig). If a digital signature is detected, OCR is skipped entirely to preserve the signature's validity. Running OCR on a signed PDF would invalidate the signature — a compliance disaster. A minimal sketch of this check appears right after this list.
  5. OCR Processing — For unsigned PDFs, the pipeline invokes ocrmypdf as an external process with a configurable 30-minute timeout per file. It uses Tesseract under the hood and adds a searchable text layer on top of the scanned images without altering the original visuals.
  6. Upload Result — The OCR'd file is uploaded to OCR_FINAL_DIR on Dropbox, with collision handling (timestamp suffixes) to prevent overwrites.
  7. Archive Original — The original file is moved from OCR_INPUT_DIR to OCR_ARCHIVE_DIR on Dropbox, keeping the input folder clean for the next run.
  8. Record in DB — The hash, original filename, and timestamp are saved to the SQLite database for future deduplication.
  9. Cleanup — The temporary local file is deleted via the cleanup function returned by GetInputLocalPath.
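
To make step 4 concrete, here is a minimal sketch of the chunked scan. The real pdf_utils.go may differ in details; the overlap handling below is one way to catch a marker that straddles two chunks:

// hasDigitalSignature reports whether the PDF at path contains any of the
// signature dictionary markers, reading the file in 32KB chunks so memory
// stays flat even for very large documents.
func hasDigitalSignature(path string) (bool, error) {
    markers := [][]byte{
        []byte("/ByteRange"),
        []byte("/Type /Sig"),
        []byte("/FT /Sig"),
    }

    f, err := os.Open(path)
    if err != nil {
        return false, err
    }
    defer f.Close()

    const chunkSize = 32 * 1024
    const overlap = 32 // longer than any marker, so none can be split unseen
    buf := make([]byte, chunkSize+overlap)
    carry := 0

    for {
        n, err := f.Read(buf[carry:])
        if n > 0 {
            window := buf[:carry+n]
            for _, m := range markers {
                if bytes.Contains(window, m) {
                    return true, nil
                }
            }
            // Keep the tail of this window so a marker split across two
            // reads is still found on the next pass.
            carry = copy(buf, window[max(0, len(window)-overlap):])
        }
        if err == io.EOF {
            return false, nil
        }
        if err != nil {
            return false, err
        }
    }
}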

Why SHA-256 for deduplication instead of filename? — Files can be renamed but contain the same content. Conversely, different files can share the same name. Content-based hashing guarantees that we never process the same document twice, regardless of what it's called.
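
In code, the fingerprint-and-lookup amounts to very little. A sketch, assuming a history table keyed on the hash (the actual schema in history.go may name things differently):

// fileSHA256 streams the file through the hasher, so memory use is constant
// regardless of file size.
func fileSHA256(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

// alreadyProcessed is a primary-key lookup, so it stays fast even with years
// of history in the table.
func alreadyProcessed(db *sql.DB, hash string) (bool, error) {
    var one int
    err := db.QueryRow(`SELECT 1 FROM history WHERE hash = ?`, hash).Scan(&one)
    if errors.Is(err, sql.ErrNoRows) {
        return false, nil
    }
    if err != nil {
        return false, err
    }
    return true, nil
}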

Stage 2: Cloud Sync to DigitalOcean Spaces

After Stage 1 completes (and assuming no shutdown signal was received), Stage 2 begins. Its job is simple but critical: ensure every file in OCR_FINAL_DIR on Dropbox has a copy in the S3 bucket.

  1. List Final Directory — All PDFs in OCR_FINAL_DIR on Dropbox are listed.
  2. S3 HeadObject Check — For each file, an S3 HeadObject request checks if the corresponding key already exists in the bucket. This is the second layer of deduplication — even if the SQLite DB was lost, we'd never re-upload a file that's already in S3.
  3. Cloud-to-Cloud Streaming — If the file is missing from S3, it's streamed directly from Dropbox to DigitalOcean Spaces. The file passes through the application's memory as an io.Reader — it's never written to local disk.
func StreamToSpaces(ctx context.Context, dbxStore *storage.DropboxStorage,
    s3Client *s3.Client, dbxPath, bucket, s3Key string) error {

    // Open a download stream from Dropbox
    r, err := dbxStore.OpenStream(dbxPath)
    if err != nil {
        return fmt.Errorf("dropbox open stream: %w", err)
    }
    defer r.Close()

    // Pipe directly to S3 PutObject — no disk I/O
    _, err = s3Client.PutObject(ctx, &s3.PutObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(s3Key),
        Body:   r,
        ACL:    types.ObjectCannedACLPrivate,
    })
    if err != nil {
        return fmt.Errorf("s3 put object: %w", err)
    }
    return nil
}

Stage 2 runs with a 4-worker goroutine pool using Go channels. Each worker pulls file names from a buffered channel, checks S3, and streams if needed. Atomic counters track processed and failed files for the final summary.
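
Put together, the Stage 2 pool is standard Go fare: a buffered channel as the job queue, a WaitGroup for synchronization, and atomic counters for the summary. A condensed sketch with illustrative names (syncFinalDir, finalDir) and the logging trimmed out:

func syncFinalDir(ctx context.Context, dbxStore *storage.DropboxStorage,
    s3Client *s3.Client, bucket, finalDir string, names []string) (processed, failed int64) {

    jobs := make(chan string, len(names))
    for _, n := range names {
        jobs <- n
    }
    close(jobs)

    var wg sync.WaitGroup
    for i := 0; i < 4; i++ { // 4 concurrent workers
        wg.Add(1)
        go func() {
            defer wg.Done()
            for name := range jobs {
                if ctx.Err() != nil {
                    return // shutdown requested: stop picking up new work
                }

                // Layer-2 dedup: HeadObject tells us if the key already exists.
                _, err := s3Client.HeadObject(ctx, &s3.HeadObjectInput{
                    Bucket: aws.String(bucket),
                    Key:    aws.String(name),
                })
                if err == nil {
                    continue // already in the bucket, nothing to do
                }
                var notFound *types.NotFound
                if !errors.As(err, &notFound) {
                    atomic.AddInt64(&failed, 1) // a real error, not just "missing"
                    continue
                }

                if err := StreamToSpaces(ctx, dbxStore, s3Client,
                    finalDir+"/"+name, bucket, name); err != nil {
                    atomic.AddInt64(&failed, 1)
                    continue
                }
                atomic.AddInt64(&processed, 1)
            }
        }()
    }
    wg.Wait()
    return processed, failed
}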

Resilience & Reliability Features

  • Exponential Backoff Retries — Network operations (Dropbox download, S3 upload) are wrapped in a retry function with exponential backoff. Transient failures don't crash the pipeline. A sketch of the retry helper follows this list.
  • Graceful Shutdown — The pipeline listens for SIGINT/SIGTERM via Go's signal.NotifyContext. Workers check ctx.Err() between operations and exit cleanly. History is always saved on shutdown via defer.
  • Multi-Layer Deduplication — Layer 1: SHA-256 hash in SQLite (fast, local). Layer 2: S3 HeadObject check (authoritative, remote). Even a full DB wipe won't cause re-uploads.
  • Soft Failure Handling — Locked files, corrupted PDFs, or OCR errors are logged and skipped — they don't block the remaining files. Each file's processing is independent.
  • Legacy DB Migration — When I moved the DB location from the executable's directory to %APPDATA%, I built an auto-migration that copies the old DB and renames it to .bak. Zero manual intervention during upgrades.
  • Dry Run Mode — Setting OCR_DRY_RUN=1 lets you preview what the pipeline would do without actually processing or uploading anything.
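
The backoff helper mentioned in the first bullet is only a dozen lines. A sketch (the delays and attempt counts here are illustrative):

// withRetry runs op up to attempts times, sleeping 1s, 2s, 4s, ... between
// tries, and gives up early if the context is cancelled.
func withRetry(ctx context.Context, attempts int, op func() error) error {
    var err error
    delay := time.Second
    for i := 0; i < attempts; i++ {
        if err = op(); err == nil {
            return nil
        }
        if i == attempts-1 {
            break
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(delay):
            delay *= 2
        }
    }
    return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

Every Dropbox download and S3 upload goes through a wrapper like this, so a transient network blip costs a few seconds of waiting rather than a failed run.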

Key Technical Decisions & Trade-offs

Why SQLite over PostgreSQL?

The pipeline runs as a standalone scheduled task on a single server. There's no concurrent access from multiple instances and no need for a network-accessible database. SQLite gives us ACID transactions, zero configuration, and the entire database is a single portable file. I used modernc.org/sqlite — a pure-Go SQLite implementation — which means no CGo dependency and a clean cross-compilation story.
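
Opening the history database is plain database/sql; the only modernc.org/sqlite-specific parts are the blank import and the "sqlite" driver name. A sketch with an illustrative schema (the real history.go may name its columns differently):

import (
    "database/sql"

    _ "modernc.org/sqlite" // pure-Go driver, registers as "sqlite"
)

func openHistory(path string) (*sql.DB, error) {
    db, err := sql.Open("sqlite", path)
    if err != nil {
        return nil, err
    }
    _, err = db.Exec(`
        CREATE TABLE IF NOT EXISTS history (
            hash      TEXT PRIMARY KEY,  -- SHA-256 of the file content
            filename  TEXT NOT NULL,
            processed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
        )`)
    if err != nil {
        db.Close()
        return nil, err
    }
    return db, nil
}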

Why ocrmypdf as an External Process?

I evaluated several Go-native OCR libraries, but none came close to the quality and maturity of ocrmypdf. It handles page rotation, deskewing, image optimization, and Tesseract orchestration out of the box. Shelling out to it via os/exec with a per-file timeout context is simple and reliable. The trade-off is that ocrmypdf and its dependencies (Tesseract, Ghostscript) must be pre-installed on the host, but for a server deployment this is a one-time setup.
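
The invocation itself is a few lines around exec.CommandContext. A sketch of how the per-file timeout works (any extra ocrmypdf flags are omitted here):

// runOCR shells out to ocrmypdf with a hard per-file timeout. If the process
// overruns, CommandContext kills it and ctx.Err() explains why.
func runOCR(parent context.Context, inPath, outPath string, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(parent, timeout)
    defer cancel()

    cmd := exec.CommandContext(ctx, "ocrmypdf", inPath, outPath)
    out, err := cmd.CombinedOutput()
    if err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            return fmt.Errorf("ocrmypdf timed out after %s", timeout)
        }
        return fmt.Errorf("ocrmypdf: %w: %s", err, out)
    }
    return nil
}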

Why DigitalOcean Spaces over AWS S3?

DigitalOcean Spaces is S3-compatible, so the code uses the standard AWS SDK for Go (aws-sdk-go-v2). The choice was driven by cost (predictable flat pricing) and the fact that the rest of the company's infrastructure was already on DigitalOcean. Switching to actual AWS S3 would require changing only the endpoint URL and credentials — zero code changes.
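
Concretely, the only Spaces-specific code is the endpoint override when constructing the client. A sketch (the region and endpoint values are examples, not the production configuration):

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    awsconfig "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/credentials"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func newSpacesClient(ctx context.Context, key, secret, region, endpoint string) (*s3.Client, error) {
    cfg, err := awsconfig.LoadDefaultConfig(ctx,
        awsconfig.WithRegion(region),
        awsconfig.WithCredentialsProvider(
            credentials.NewStaticCredentialsProvider(key, secret, "")),
    )
    if err != nil {
        return nil, err
    }
    // e.g. endpoint = "https://sgp1.digitaloceanspaces.com"
    // Pointing back at real AWS S3 means dropping this override.
    return s3.NewFromConfig(cfg, func(o *s3.Options) {
        o.BaseEndpoint = aws.String(endpoint)
    }), nil
}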

Why Streaming Instead of Download-Then-Upload?

The server had limited disk space. Some batch runs involved thousands of files totalling several gigabytes. By streaming directly — opening a Dropbox download as an io.Reader and feeding it straight into S3's PutObject — the memory footprint stays constant regardless of file size.

Running in Production

The pipeline runs as a Windows Task Scheduler job, triggered once daily at 2:00 AM. A typical run processes 50–200 documents in under 15 minutes. The terminal output shows real-time progress bars and a final summary table:

╔════════════════════════════════════════════╗
║          OCR Automation Pipeline           ║
╚════════════════════════════════════════════╝

 Step 1: OCR Processing & Archival
 INFO  Found 127 files to process.
 ████████████████████████████████████ 127/127

 Step 2: Cloud Sync (S3)
 INFO  Found 342 candidate files in Output.
 ████████████████████████████████████ 342/342

┌────────────────┬───────────┬────────────────┬────────┐
│ Stage          │ Processed │ Failed/Skipped │ Status │
├────────────────┼───────────┼────────────────┼────────┤
│ OCR Processing │    119    │       8        │ Done   │
│ Cloud Sync     │    127    │       0        │ Done   │
└────────────────┴───────────┴────────────────┴────────┘

 INFO  Total Duration: 12m 34s

All activity is also logged to a persistent log file at %APPDATA%/DailyOCR/daily_ocr_log.txt for audit trail and debugging.

What I Learned

  • Design for failure first. The retries, graceful shutdown, multi-layer dedup, and soft failure handling weren't afterthoughts — they were in the initial design. In a pipeline that runs unattended, every edge case will happen eventually.
  • Interfaces pay dividends. The PipelineStorage interface took 10 minutes to write but saved hours of refactoring when requirements evolved.
  • Go's concurrency model is a joy for data pipelines. Channels as job queues, WaitGroups for synchronization, atomic counters for metrics — the patterns map naturally to batch processing workloads.
  • Don't fight the ecosystem. Using ocrmypdf as an external tool instead of building a mediocre Go-native OCR was the right call. Leverage the best tool for each job.

This project was a great example of solving a real, unglamorous business problem with clean engineering. No microservices, no Kubernetes, no over-architecture — just a single Go binary, a cron job, and a well-thought-out pipeline that's been running reliably in production since December 2025 without a single manual intervention.

Built at Telekomindo. Written in Go 1.21+. Copyright © 2025 Telekomindo. All rights reserved.

This article reflects work done as part of my role at Telekomindo. The project, its source code, and intellectual property belong to the company.