Case study

BarkScan

A search engine for internet-connected devices — built to scan, fingerprint, and index the live internet on a continuous cycle.

Visit BarkScan →

The problem

Internet-wide scanning is a category dominated by a few entrenched players: Shodan, Censys, ZoomEye. Each controls a significant share of the available scan data, and the market has converged on per-query pricing models that make exploratory work cost-prohibitive. For security researchers, threat hunters, and infrastructure teams, getting timely visibility into the global internet — what services are exposed where, what software versions are running, what changes occurred since yesterday — is harder than it should be.

BarkScan was built to address that gap: a new internet-wide scanner with continuous coverage, fast fingerprinting, and a query interface designed for researchers. The hard parts were scanning at scale without becoming a network nuisance, fingerprinting accurately across heterogeneous services, and serving the resulting dataset with sub-second query latency on a budget that did not require enterprise pricing.

Our approach

The scanner is a Go-based distributed system that performs internet-wide TCP probes followed by service-specific protocol fingerprinting. The probe layer is rate-limited per source IP, per destination AS, and globally — both to be a good network citizen and to avoid triggering automated blocking. Fingerprinting is layered: cheap protocol detection first, expensive deep banner grabs only when justified by what the cheap layer found.
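
As a sketch of what that layering can look like in Go, the snippet below nests a per-source-IP limiter, a per-destination-AS limiter, and a global limiter using golang.org/x/time/rate. The limits, the example ASN, and the struct shape are illustrative assumptions, not BarkScan's actual configuration.

```go
// Illustrative sketch: a probe goes out only after the per-source-IP,
// per-destination-AS, and global limiters all grant a token.
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

type LayeredLimiter struct {
	global *rate.Limiter // total probes/sec across the whole scanner
	asPPS  rate.Limit    // cap per destination autonomous system
	srcPPS rate.Limit    // cap per local source address
	mu     sync.Mutex
	perAS  map[uint32]*rate.Limiter
	perSrc map[string]*rate.Limiter
}

func NewLayeredLimiter(globalPPS, asPPS, srcPPS float64) *LayeredLimiter {
	return &LayeredLimiter{
		global: rate.NewLimiter(rate.Limit(globalPPS), int(globalPPS)),
		asPPS:  rate.Limit(asPPS),
		srcPPS: rate.Limit(srcPPS),
		perAS:  map[uint32]*rate.Limiter{},
		perSrc: map[string]*rate.Limiter{},
	}
}

// Wait blocks until all three layers allow one more probe, most specific
// first, so a single hot AS cannot drain the global budget.
func (l *LayeredLimiter) Wait(ctx context.Context, srcIP string, dstASN uint32) error {
	l.mu.Lock()
	as, ok := l.perAS[dstASN]
	if !ok {
		as = rate.NewLimiter(l.asPPS, 1)
		l.perAS[dstASN] = as
	}
	src, ok := l.perSrc[srcIP]
	if !ok {
		src = rate.NewLimiter(l.srcPPS, 1)
		l.perSrc[srcIP] = src
	}
	l.mu.Unlock()

	if err := src.Wait(ctx); err != nil {
		return err
	}
	if err := as.Wait(ctx); err != nil {
		return err
	}
	return l.global.Wait(ctx)
}

func main() {
	// Illustrative limits: 50k pps globally, 10 pps into any one AS,
	// 25k pps out of any one source address.
	lim := NewLayeredLimiter(50000, 10, 25000)
	for i := 0; i < 3; i++ {
		if err := lim.Wait(context.Background(), "198.51.100.7", 64496); err != nil {
			return
		}
		fmt.Println("probe", i, "cleared all three layers")
	}
}
```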

Storage and query are the parts that benefit most from architectural care. Banner data and historical scans live in object storage (Cloudflare R2) for durability and cost, while the queryable fingerprint dataset lives in ClickHouse — which handles the high-cardinality, time-series, full-text-ish queries that internet scan data demands. Postgres serves the metadata and user accounts; the heavy lifting is in ClickHouse.
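
To make the split concrete, the sketch below defines a hypothetical fingerprint table and inserts one row through database/sql with the clickhouse-go driver. The schema, column names, DSN, and driver wiring are assumptions for the example, not BarkScan's actual layout; the point is that the ordering key is chosen to match the common filters at ingestion time, and raw banners are referenced by hash rather than stored in the index.

```go
// Illustrative only: a hypothetical queryable fingerprint table whose
// ordering key mirrors the common filters (port, protocol, ip, time).
// Driver, DSN, schema, and column names are assumptions for the sketch.
package main

import (
	"database/sql"
	"log"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" database/sql driver
)

const createFingerprints = `
CREATE TABLE IF NOT EXISTS fingerprints (
    scanned_at  DateTime,
    ip          IPv4,
    port        UInt16,
    protocol    LowCardinality(String),
    product     LowCardinality(String),
    version     String,
    banner_sha  FixedString(64),  -- full banner lives in object storage, keyed by this hash
    asn         UInt32,
    country     LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(scanned_at)
ORDER BY (port, protocol, ip, scanned_at)`

func main() {
	db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(createFingerprints); err != nil {
		log.Fatal(err)
	}

	// One row per (ip, port, scan pass); the raw banner is referenced by its
	// hash, not stored inline, which keeps the queryable index lean.
	_, err = db.Exec(`INSERT INTO fingerprints VALUES (
	    now(), '203.0.113.10', 443, 'tls/http', 'nginx', '1.24.0',
	    'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855',
	    64496, 'US')`)
	if err != nil {
		log.Fatal(err)
	}
}
```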

Continuous scanning is the operational hard part. The internet is large; one full pass takes time even with parallelism, and you have to decide what to re-scan more often versus what to leave for the next full cycle. We split the scanning workload by importance: core infrastructure ports get re-scanned weekly, esoteric ports monthly. Change detection runs continuously against the existing fingerprint dataset, so newly exposed services surface within hours.
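
The tiering itself can be as simple as a cadence lookup keyed by port. The sketch below is illustrative; the port tiers and intervals are assumptions, not BarkScan's actual schedule.

```go
// Illustrative sketch of tiered re-scan cadence: core-infrastructure ports
// are swept on a short cycle, everything else waits for the next full pass.
package main

import (
	"fmt"
	"time"
)

// Ports treated as "core infrastructure" for scheduling purposes (illustrative).
var corePorts = map[uint16]bool{
	22: true, 25: true, 53: true, 80: true, 123: true,
	443: true, 3306: true, 3389: true, 5432: true, 8080: true,
}

// rescanInterval returns how long a port's fingerprints may age before the
// scheduler queues the port for another sweep.
func rescanInterval(port uint16) time.Duration {
	if corePorts[port] {
		return 7 * 24 * time.Hour // core ports: weekly
	}
	return 30 * 24 * time.Hour // everything else: roughly monthly
}

// due reports whether a port should be re-queued given its last sweep time.
func due(port uint16, lastSweep time.Time) bool {
	return time.Since(lastSweep) >= rescanInterval(port)
}

func main() {
	last := time.Now().Add(-10 * 24 * time.Hour) // last swept ten days ago
	for _, p := range []uint16{443, 6379, 11211} {
		fmt.Printf("port %d due for re-scan: %v\n", p, due(p, last))
	}
}
```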

Stack

Scanner

  • Go (high concurrency, low memory)
  • Rate-limited per AS/IP/global
  • Protocol-aware fingerprinting

Backend

  • Go
  • PostgreSQL (metadata)
  • ClickHouse (queryable index)
  • Cloudflare R2 (banner archives)

Streaming

  • NATS for scan job dispatch
  • Async fingerprint pipeline
  • Worker pool autoscaling

Query

  • REST + OpenAPI
  • GraphQL subset
  • Saved-search alerts
  • Bulk export

Frontend

  • TypeScript
  • React + Next.js
  • Map visualization
  • Time-series charts

Infrastructure

  • Kubernetes
  • Multi-region scanner pool
  • CDN-fronted query layer

Outcome

BarkScan provides continuous internet-wide scan coverage with a query interface designed for researchers and security teams. The platform handles the operational realities of internet-wide scanning — rate limiting, courteous probing, distributed scanner pools — that determine whether a scanner can sustain coverage long-term or gets blackholed by upstream networks.

Query latency stays low (single-digit milliseconds for the common patterns) because the fingerprint dataset is structured for ClickHouse from the ingestion side, not retrofitted afterwards. Banner archives in object storage stay cheap to retain indefinitely without inflating the queryable dataset.
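
As a rough illustration, reusing the hypothetical schema from the ingestion sketch above, a "common pattern" query filters on a prefix of the ordering key plus a narrow time range, which is what lets ClickHouse prune most data parts before reading anything. The query shape and table are assumptions for the example.

```go
// Illustrative only: the filters line up with the assumed ordering key
// (port, protocol, ip, scanned_at), so most granules are skipped.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "Which nginx builds showed up on 443 in the last day?" — a typical
	// researcher query shape: sort-key prefix plus a narrow time window.
	rows, err := db.Query(`
		SELECT toString(ip), product, version
		FROM fingerprints
		WHERE port = 443
		  AND protocol = 'tls/http'
		  AND scanned_at >= now() - INTERVAL 1 DAY
		LIMIT 10`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var ip, product, version string
		if err := rows.Scan(&ip, &product, &version); err != nil {
			log.Fatal(err)
		}
		fmt.Println(ip, product, version)
	}
}
```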

The scan-and-index pipeline composes well: new fingerprint rules can be deployed without re-scanning, change detection runs continuously, and the queryable surface area can be extended without invalidating prior data.
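
A fingerprint rule in this kind of pipeline can be as small as a named pattern applied to archived banner text, which is why new rules can be backfilled over stored banners without another scan pass. The rule format below is a hypothetical illustration, not BarkScan's actual rule engine.

```go
// Illustrative sketch: rules run against already-archived banner text, so
// shipping a new rule is a backfill job, not a re-scan.
package main

import (
	"fmt"
	"regexp"
)

// Rule maps a banner pattern to a product fingerprint with a captured version.
type Rule struct {
	Name    string
	Pattern *regexp.Regexp // applied to stored banner text
}

var rules = []Rule{
	{"openssh", regexp.MustCompile(`^SSH-2\.0-OpenSSH_([\w.]+)`)},
	{"nginx", regexp.MustCompile(`Server: nginx/([\d.]+)`)},
}

// match runs every rule against one archived banner and returns
// (product, version) pairs; in a real pipeline this loop would run as a
// backfill job over the object-storage archive rather than in memory.
func match(banner string) (out [][2]string) {
	for _, r := range rules {
		if m := r.Pattern.FindStringSubmatch(banner); m != nil {
			out = append(out, [2]string{r.Name, m[1]})
		}
	}
	return out
}

func main() {
	banner := "SSH-2.0-OpenSSH_9.6p1 Ubuntu-3ubuntu13"
	fmt.Println(match(banner)) // [[openssh 9.6p1]]
}
```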

What we learned

  • For data-platform work, the storage layer determines what queries are possible. ClickHouse for analytical query patterns; Postgres for the canonical metadata.
  • Object storage is the right place for raw artifacts that you might want later but rarely query. Keep the queryable index lean.
  • Distributed scanning is a network-citizenship problem as much as a throughput problem. Rate limit at multiple layers; assume your traffic is suspect by default.
  • Continuous-scan platforms compound value over time. Historical change detection is one of the most valuable features, and you only get it by running continuously from the start.
  • Internet-wide scanning at scale requires explicit cost discipline. Bandwidth, storage, and compute all grow with coverage; the architecture has to assume cost scales with success.

Have a project that needs the same standards?

Email us a paragraph about what you are building. We respond within one business day.

[email protected]