7Block Labs
Blockchain Technology

By AUJay

Blockchain intelligence for Web3: Building an in‑house analytics stack

Short summary: A practical, up-to-date blueprint for decision‑makers to build a high-signal, multi‑chain blockchain intelligence stack in-house—covering node choices, blob sidecar retention, Solana Geyser streaming, OP Stack fault‑proof realities, MEV-aware telemetry, schemas, and dbt-driven modeling—with concrete configs, APIs, and vendor alternatives you can implement this quarter.


Why build in‑house now (and what’s changed in 2024–2025)

  • Ethereum’s Dencun upgrade (EIP‑4844) introduced “blob” transactions with bounded retention (≈4096 epochs, ~18 days) and a per‑block target/max that materially lowers L2 DA costs—but forces teams that care about complete history to archive blob sidecars themselves. Each blob is 4,096 field elements × 32 bytes = 131,072 bytes (128 KiB); target blob gas per block ≈ 393,216 and max ≈ 786,432, i.e., roughly 3 target blobs and 6 max per block at launch parameters. (eips.ethereum.org)
  • Ethereum finality remains ~15 minutes today; single‑slot finality is still roadmap research. Translation: your analytics and alerting should reflect probabilistic vs finalized states explicitly. (ethereum.org)
  • Optimism activated permissionless fault proofs on OP Mainnet (June 10, 2024), moving OP Stack rollups to “Stage 1” decentralization while still keeping a Security Council fallback. Challenge windows around withdrawals remain central to how you model capital latency. (docs.optimism.io)
  • Solana’s state compression and cNFTs moved from novelty to common practice, changing how you index NFT/state data (Merkle proofs + read APIs rather than token accounts alone). (solana.com)
  • PBS/MEV realities: block building is already dominated by specialized builders and relays (mev‑boost), so “block provenance” becomes a useful analytic dimension in your warehouse. (ethereum.org)
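The blob parameters above can be sanity-checked with a few lines of arithmetic (constants are the EIP-4844 launch values quoted in the first bullet; this is a quick check, not production code):

```python
# EIP-4844 launch parameters (from the spec values quoted above).
FIELD_ELEMENTS_PER_BLOB = 4096
BYTES_PER_FIELD_ELEMENT = 32
GAS_PER_BLOB = 2**17                      # blob gas consumed per blob
TARGET_BLOB_GAS_PER_BLOCK = 393_216
MAX_BLOB_GAS_PER_BLOCK = 786_432
MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS = 4096
SLOTS_PER_EPOCH, SECONDS_PER_SLOT = 32, 12

blob_bytes = FIELD_ELEMENTS_PER_BLOB * BYTES_PER_FIELD_ELEMENT    # 128 KiB
target_blobs = TARGET_BLOB_GAS_PER_BLOCK // GAS_PER_BLOB          # per block
max_blobs = MAX_BLOB_GAS_PER_BLOCK // GAS_PER_BLOB                # per block
retention_days = (MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS
                  * SLOTS_PER_EPOCH * SECONDS_PER_SLOT) / 86_400

print(blob_bytes, target_blobs, max_blobs, round(retention_days, 1))
# 131072 3 6 18.2
```

The ~18-day figure is why the sidecar-archiving advice below is not optional for teams that need replayable L2 history.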

Bottom line: costs for DA fell, speed rose, but reliable, long‑horizon intelligence now requires capturing multiple sources (execution, consensus/DA, builders) and normalizing them fast.


Reference architecture (pragmatic and proven)

Think in layers you can staff, test, and cost-control independently.

  1. Source of truth (nodes and DA)
  • Ethereum EL client optimized for analytics traces: Erigon (fast, low storage, Parity-style trace_* APIs) or Reth (modern, fast, Geth- and Parity-style tracing). For Erigon, plan ~920 GB for full mode, ~1.77 TB for archive as of Sep 2025; minimal mode ~350 GB. NVMe required. (docs.erigon.tech)
  • Consensus client to fetch blob sidecars and finality: expose /eth/v1/beacon/blob_sidecars and persist; blob sidecars are gossiped and must be retained beyond the ~18‑day network availability window if you need history. (eips.ethereum.org)
  • Solana validator with Geyser plugins to decouple indexing from RPC load; stream accounts/tx/slot status to Kafka or PostgreSQL. (docs.solanalabs.com)
  2. Ingestion and streaming
  • EVM execution: enable Erigon rpcdaemon with eth, debug, trace modules to support debug_traceTransaction and trace_filter; tune concurrency for bursts.
    Example flags (adjust for your infra):
    rpcdaemon --http.api=eth,erigon,web3,net,debug,trace,txpool --rpc.batch.concurrency=64 --db.read.concurrency=64
    (github.com)
  • Consensus/DA: poll beacon REST for blob sidecars by block_id, and optionally mirror via S3; Coinbase/Base’s blob-archiver is a solid open implementation to crib from. (quicknode.com)
  • Solana: ship via Blockdaemon’s AccountsDB→Kafka plugin or Solana’s reference Postgres plugin; partition by slot to support rewind and reorg handling. (github.com)
  3. Raw zone and lake/warehouse
  • Store raw JSON/SSZ/Parquet by chain, role (execution/consensus), and day; partition by block_date and by chain_id to keep pruning cheap.
  • For EVM, land familiar entities: blocks, transactions, receipts, logs, traces, state_diffs (optional), plus “builders/relays” when available via mev‑boost relay APIs. For quick starts use Blockchain‑ETL tooling and schemas. (github.com)
  4. Transformations and semantic models
  • Use dbt (any warehouse) for tested, incremental models. If you prefer not to run ingestion infra, Dune’s Datashare exports curated raw/derived tables into Snowflake/BigQuery/Databricks; for teams already on Snowflake, Flipside’s zero‑copy Shares are a robust alternative. (dune.com)
  5. Serving and BI
  • Keep two entry points: “fast” serving tables for dashboards and “deep” tables for investigations (traces, blobs, geyser streams). Connect notebooks/BI via Trino/Trino‑compatible connectors or native warehouse drivers. (docs.dune.com)

Exact configurations and APIs you will actually use

Ethereum execution: traces that scale

  • Reth supports Geth‑style debug_* and Parity‑style trace_* endpoints. Prefer Reth for mixed tracer parity and speed‑sensitive workloads, and normalize trace outputs across clients in your ingestion code so downstream models see a single shape. Key endpoints: debug_traceTransaction, debug_traceCall, trace_filter, trace_replayTransaction. (reth.rs)
  • Erigon exposes the trace module (trace_filter, trace_block, trace_get, trace_transaction) and works well for range queries over addresses. (docs.erigon.tech)
  • In Python, web3.py’s tracing namespace now supports trace_* calls against Erigon/Nethermind straight out of the box. (web3py.readthedocs.io)

Example: fetch all traces touching a DeFi router over a block window

{"method":"trace_filter","params":[
  {"fromBlock":"0x16F14E0","toBlock":"0x16F9FA0",
   "toAddress":["0xE592427A0AEce92De3Edee1F18E0157C05861564"]}],
 "id":1,"jsonrpc":"2.0"}

(alchemy.com)
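The same request can be issued from Python over plain JSON-RPC with no client library. A sketch, assuming an Erigon or Reth rpcdaemon listening on localhost:8545 with the trace module enabled (endpoint and block window are illustrative):

```python
import json
import urllib.request

def build_trace_filter(from_block: int, to_block: int, to_address: list) -> dict:
    """Assemble a Parity-style trace_filter request body."""
    return {
        "jsonrpc": "2.0", "id": 1, "method": "trace_filter",
        "params": [{
            "fromBlock": hex(from_block),
            "toBlock": hex(to_block),
            "toAddress": to_address,
        }],
    }

def rpc_call(url: str, payload: dict) -> dict:
    """POST a JSON-RPC payload and return the decoded `result` field."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

# Usage against a local node (not executed here):
# traces = rpc_call("http://localhost:8545",
#                   build_trace_filter(0x16F14E0, 0x16F9FA0,
#                       ["0xE592427A0AEce92De3Edee1F18E0157C05861564"]))
```

Keeping the payload builder separate from the transport makes the query logic unit-testable without a node.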

Ethereum consensus: blob sidecars are part of your truth set

  • Fetch blobs via
    /eth/v1/beacon/blob_sidecars/{block_id}
    and archive to S3/NVMe; EIP‑4844 only requires clients to serve blob sidecars for ~18 days (MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS = 4096 epochs). If you do not mirror them, historic L2 batch payloads may become incomplete for analyses that reconstruct L2 blocks. (eips.ethereum.org)
  • Reference the Deneb spec when validating inclusion proofs and versioned hashes; consensus layer is responsible for DA, not the execution layer. (ethereum.github.io)
  • If you need a ready service, Base’s blob‑archiver shows how to implement a long‑term storage/API layer for sidecars. (github.com)
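A minimal archiver loop follows directly from that endpoint. The part worth getting exactly right is the versioned-hash check: per EIP-4844, a versioned hash is the 0x01 version byte followed by the last 31 bytes of sha256(kzg_commitment). Sketch below; the beacon URL and the `store` callback are assumptions you'd replace with your S3 writer:

```python
import hashlib
import json
import urllib.request

VERSIONED_HASH_VERSION_KZG = b"\x01"

def kzg_to_versioned_hash(commitment: bytes) -> bytes:
    """EIP-4844: versioned_hash = 0x01 || sha256(commitment)[1:]."""
    return VERSIONED_HASH_VERSION_KZG + hashlib.sha256(commitment).digest()[1:]

def fetch_blob_sidecars(beacon_url: str, block_id: str) -> list:
    """GET /eth/v1/beacon/blob_sidecars/{block_id} from a consensus client."""
    with urllib.request.urlopen(
            f"{beacon_url}/eth/v1/beacon/blob_sidecars/{block_id}") as resp:
        return json.loads(resp.read())["data"]

def archive_block(beacon_url: str, block_id: str, store) -> None:
    """Persist each sidecar keyed by its versioned hash; `store` is any
    callable like `lambda key, sidecar: ...` that writes to S3/NVMe."""
    for sidecar in fetch_blob_sidecars(beacon_url, block_id):
        commitment = bytes.fromhex(sidecar["kzg_commitment"][2:])
        store(kzg_to_versioned_hash(commitment).hex(), sidecar)
```

Keying storage by versioned hash lets you join archived blobs directly against the blob_versioned_hashes field of EL transactions later.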

OP Stack rollups: fault‑proofs, challenge windows, and finality

  • Model withdrawal latency explicitly: OP Stack uses a seven‑day challenge window (two‑step withdrawals prove→wait→finalize). Your risk dashboards should track “proven_at,” “challenge_expires_at,” and “finalized_at.” (specs.optimism.io)
  • Fault proofs are activated on OP Mainnet (June 10, 2024), Stage 1; Security Council remains as a safety net (important for risk notes). (docs.optimism.io)
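Those three timestamps reduce to a small helper. A sketch: the state names are our own, and in production the seven-day window should be read from the chain's deployed config rather than hard-coded:

```python
from datetime import datetime, timedelta, timezone

CHALLENGE_WINDOW = timedelta(days=7)  # OP Mainnet default; confirm per chain

def withdrawal_state(proven_at, finalized_at, now):
    """Classify a two-step OP Stack withdrawal for a risk panel.

    Returns the proven_at / challenge_expires_at / finalized_at columns
    suggested above, plus a coarse state for alerting.
    """
    expires = proven_at + CHALLENGE_WINDOW
    if finalized_at is not None:
        state = "finalized"
    elif now < expires:
        state = "in_challenge_window"
    else:
        state = "finalizable"
    return {"proven_at": proven_at,
            "challenge_expires_at": expires,
            "finalized_at": finalized_at,
            "state": state}
```

The "finalizable but not finalized" state is the one worth alerting on: capital is claimable but still sitting in the bridge.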

Arbitrum: post‑4844 reality

  • ArbOS 20 “Atlas” made Arbitrum chains 4844‑aware; batch posters can (and should) submit blobs instead of calldata to cut costs. If you run Nitro infra or analytics, make sure your DA consumer can read both calldata and blob modes. (blog.arbitrum.io)
  • Arbitrum docs note that Rollup‑mode data may be posted as calldata or blobs; AnyTrust mode uses DACerts (index them separately). (docs.arbitrum.io)

Solana: Geyser streaming and state compression

  • Use the Geyser plugin to stream accounts/tx and slot status to Kafka/Postgres; then decode program‑specific events downstream. (docs.solanalabs.com)
  • For compressed assets, index via Read APIs and verify via Merkle proofs (tree root on‑chain). You’ll need to store: asset id, proof nodes, tree address, and canopy depth to build valid transactions and analytics. (solana.com)

Minimal Geyser→Kafka config snippet:

{
  "libpath": "target/release/libsolana_accountsdb_plugin_kafka.so",
  "kafka": { "bootstrap.servers": "kafka-1:9092" },
  "filters":[{
    "publish_all_accounts": true,
    "transaction_topic":"solana.mainnet.transactions",
    "update_account_topic":"solana.mainnet.accounts",
    "slot_status_topic":"solana.mainnet.slots"
  }]
}

(github.com)
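Downstream of those topics, slot-status handling is where rewinds bite. A deliberately simplified fold (our own toy state machine, not part of the plugin) illustrates the invariant your consumer must preserve: a dead slot invalidates itself and everything after it:

```python
def apply_slot_status(store: dict, slot: int, status: str) -> dict:
    """Fold one slot-status message into per-slot state.

    Statuses progress processed -> confirmed -> rooted; a 'dead' slot
    triggers a rewind that drops that slot and all later ones.
    """
    if status == "dead":
        return {s: st for s, st in store.items() if s < slot}
    updated = dict(store)
    updated[slot] = status
    return updated

# Example: slot 101 is orphaned, so the store rewinds to slot 100.
store = {}
for slot, status in [(100, "processed"), (100, "confirmed"),
                     (101, "processed"), (101, "dead")]:
    store = apply_slot_status(store, slot, status)
```

Partitioning the Kafka topics by slot (as suggested earlier) makes this rewind cheap: you replay only the affected partitions.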


Schemas that won’t paint you into a corner

  • EVM star schema
    • Fact tables: transactions (by block_number, tx_hash), logs (tx_hash, log_index, topic0..3), traces (trace_address[], type, action, result), blob_refs (block_number, blob_idx, kzg_commitment, versioned_hash).
    • Dimensions: addresses (checksum, EOA/contract, first_seen, label_id), contracts (creation_tx, runtime_code_hash, verified_source), relays/builders (name, endpoint).
  • Map function and event selectors early. Use:
    • 4byte.directory API or the Sourcify‑backed OpenChain successor to resolve selector→signature reliably. Persist collisions and frequency. (4byte.directory)
  • External labels (with lineage):
    • Etherscan’s v2 Nametag/Labels API and CSV export feed are now available; store source and last_updated for auditability. (docs.etherscan.io)

Transformations and modeling: keep it incremental

  • Use dbt incremental models keyed on (block_number, tx_hash) and (slot) to minimize refresh costs.
  • If you don’t want to operate ingestion infra, push transformations to where the data already lives:
    • Dune Datashare pipes raw and curated datasets directly into Snowflake/BigQuery/Databricks.
    • Dune’s new dbt Connector lets you run dbt “in place” on Dune’s warehouse with Git‑based CI (useful if you need curated DeFi/NFT models without managing compute). (dune.com)
    • Flipside’s Snowflake Shares offer free, zero‑copy access across 30–35+ chains for enterprises standardizing on Snowflake. (flipside-3887e776.mintlify.app)

MEV‑aware telemetry: builders, relays, blobs per block

  • Track who built your block (relay/builder) and whether the payload included blobs. Expose “builder”, “relay”, “num_blobs”, and “value_eth” as columns; use relay dashboards/APIs for cross‑checks. (relayscan.io)
  • PBS enshrinement isn’t live yet; mev‑boost sidecars still mediate in production. Keep a watchlist of relays you trust and rotate if needed. (ethereum.org)
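For the relay cross-checks, the mev-boost relay data API exposes delivered payloads per slot at a standard path. A small URL builder keeps the query logic testable; the relay hostname is illustrative, and the parameters shown are the commonly supported ones:

```python
def payloads_delivered_url(relay: str, slot=None, limit: int = 100) -> str:
    """Build a relay data-API query for delivered payloads
    (GET /relay/v1/data/bidtraces/proposer_payload_delivered)."""
    base = relay.rstrip("/") + "/relay/v1/data/bidtraces/proposer_payload_delivered"
    params = [f"limit={limit}"]
    if slot is not None:
        params.append(f"slot={slot}")
    return base + "?" + "&".join(params)

# e.g. payloads_delivered_url("https://relay.example.org", slot=9_000_000)
```

Joining the returned bid traces against your EL blocks by slot/block hash yields the builder, relay, and value_eth columns described above.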

Costing and sizing you can present to finance

  • Ethereum EL (Erigon v3.x + Caplin) guidance, Sep 2025 figures:
    • Minimal: ~350 GB
    • Full: ~920 GB
    • Archive: ~1.77 TB
    • Erigon 3.1 reduced indexes significantly (e.g., Polygon archive −930 GB compared to 3.0). Use NVMe SSDs; plan 2–4 TB for growth headroom and reindexing. (docs.erigon.tech)
  • Bandwidth: 5–25 Mbps for non‑staking nodes as a floor; higher for fast sync and tracing. (docs.erigon.tech)
  • Blob storage: expect continuous growth; keep S3 lifecycle policies (warm → glacier) for sidecars older than N days while maintaining an API for historical queries. Use the Base blob‑archiver pattern. (github.com)
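A lifecycle rule along these lines implements the warm-to-glacier tiering; the bucket prefix and day thresholds are placeholders to tune against your historical-query patterns:

```json
{
  "Rules": [
    {
      "ID": "blob-sidecar-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "blob_sidecars/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

Keep the API layer in front of the bucket so that Glacier restores are an implementation detail, not a consumer-visible outage.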

Practical examples: end‑to‑end, with new details

  1. Reconstruct an L2 batch from L1 using blobs
  • From the EL block, extract blob versioned hashes for a batch posting tx.
  • Fetch sidecars via consensus REST:
    /eth/v1/beacon/blob_sidecars/{block_id}
    ; validate KZG commitments vs versioned hashes; store blob body, index. (quicknode.com)
  • Parse rollup‑specific batch format (e.g., Arbitrum Nitro) into L2 tx frames; enrich with L2 receipts via your L2 node. Ensure your consumer handles both calldata and blob modes post‑Dencun. (docs.arbitrum.io)
  2. Solana cNFT lifecycle analytics
  • Stream Geyser account updates to Kafka; derive an “asset_changes” table.
  • For each compressed asset, fetch Read API proof, verify against Merkle tree root (on‑chain), and materialize “cnft_current_owner” with proof metadata (depth, canopy). (solana.com)
  3. EVM signature enrichment with collision handling
  • For each log topic0 and input selector, query 4byte/Sourcify dataset; if multiple candidates, record all candidates with frequency_rank and mark ambiguous=true. Re‑score with verified source from Sourcify where available. (docs.sourcify.dev)
  4. OP Stack withdrawal risk panel
  • Build a model that listens for OptimismPortal WithdrawalProven events on L1, computes challenge_expires_at = proven_at + 7 days, then marks finalization upon WithdrawalFinalized events. Alert when a large withdrawal is within T hours of expiry with no disputes. (specs.optimism.io)
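The collision-handling rule in example 3 reduces to a small ranking function. Field names match the enrichment record suggested earlier; frequencies are illustrative numbers you'd source from your 4byte/OpenChain lookups. The classic 0xa9059cbb collision makes a good fixture:

```python
def resolve_selector(candidates):
    """Rank candidate signatures for one 4-byte selector.

    candidates: list of (signature, observed_frequency) pairs.
    Returns the highest-frequency guess, an ambiguity flag, and all
    candidates annotated with their frequency_rank.
    """
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    return {
        "best_guess": ordered[0][0] if ordered else None,
        "ambiguous": len(ordered) > 1,
        "candidates": [
            {"signature": sig, "frequency_rank": rank}
            for rank, (sig, _freq) in enumerate(ordered, start=1)
        ],
    }

# 0xa9059cbb maps to both of these in public selector databases
# (frequencies here are made up for illustration):
record = resolve_selector([
    ("transfer(address,uint256)", 98_000),
    ("many_msg_babbage(bytes1)", 3),
])
```

Re-scoring against Sourcify-verified source, as the example suggests, lets you flip ambiguous=false once a contract's actual ABI is known.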

Emerging best practices we recommend in 2025

  • Archive blobs beyond the ~18‑day network window if you care about “replayable” L2 history. Mirroring sidecars to S3 with a small API is now table stakes for enterprise‑grade Ethereum analytics. (eips.ethereum.org)
  • Run heterogeneous EL clients for resilience and better coverage of tracer differences (e.g., Reth + Erigon). Track edge cases where debug/prestate tracers differ (documented issues exist). (github.com)
  • Treat builder/relay metadata as first‑class dimensions; PBS is not enshrined yet, but builder concentration affects price impact and censorship analysis. (ethereum.org)
  • Prefer authoritative selector sources and verified code for ABI mapping to reduce mis-decoding (“selector collisions” are real). (docs.sourcify.dev)
  • If you’re on Snowflake/BigQuery, evaluate vendor‑maintained feeds (Dune Datashare, Flipside Shares) for “coverage chains” where you don’t want to run infra; keep in‑house only the chains/components that differentiate your business (e.g., your L2, your NFT vertical). (dune.com)

Build vs. buy: a realistic split

  • Build in‑house when:
    • You must control retention (blobs > 18 days), have custom tracers, or need sub‑minute SLAs on alerts.
    • You run proprietary strategies (MEV monitoring, counter‑party risk) that shouldn’t leave your VPC.
  • Buy/augment when:
    • You need broad multi‑chain coverage fast (Snowflake Shares via Flipside; Dune Datashare) and can accept vendor SLAs.
    • You want curated DeFi/NFT tables and to run dbt without owning ingestion (Dune dbt Connector). (flipside-3887e776.mintlify.app)

Implementation checklist (30/60/90 days)

  • Day 0–30
    • Stand up Erigon (full) + Reth (for debug/trace) and a consensus client; enable rpcdaemon with trace/debug.
    • Deploy sidecar archiver (Base blob‑archiver fork) to S3 with lifecycle rules.
    • Spin Solana validator with Geyser→Kafka; validate topics and partitions. (docs.erigon.tech)
  • Day 31–60
    • Land raw Parquet in lake; bootstrap dbt project; implement core models (tx, logs, traces, blob_refs, cnft_assets).
    • Wire selector/ABI lookup via 4byte/Sourcify; add etherscan label imports (record source and timestamp). (docs.sourcify.dev)
  • Day 61–90
    • Add OP Stack and Arbitrum DA parsers (calldata+blobs).
    • Add MEV relay/builder telemetry ingest and dashboards.
    • Pilot alerts for “finalized vs head” drift, blob archiver lag, and challenge‑window expiries. (relayscan.io)

Appendix: handy commands and queries

  • Erigon trace filter for an address range:
curl -s -X POST localhost:8545 \
 -H 'content-type: application/json' \
 --data '{"jsonrpc":"2.0","id":1,"method":"trace_filter","params":[
   {"fromBlock":"0x1700000","toBlock":"0x170FFFF",
    "fromAddress":["0x..."],"toAddress":["0x..."]}]}'

(alchemy.com)

  • Beacon blobs by head:
curl -s https://<your-beacon>/eth/v1/beacon/blob_sidecars/head

(quicknode.com)

  • Check OP Stack withdrawal flow reference (model your events on this doc): (specs.optimism.io)

  • Arbitrum batch poster config to enable 4844 blobs (Rollup mode only): see “post‑4844‑blobs” config. (docs.arbitrum.io)


Final take

The winning pattern in 2025 is hybrid: own the parts that are uniquely “you” (blob retention, chain‑specific decoders, MEV‑aware telemetry) and rent the rest (broad multi‑chain coverage via Datashare/Shares). With the configs, endpoints, and schemas above, your team can ship a credible, resilient blockchain intelligence stack in one quarter—and iterate with confidence as Ethereum moves toward SSF and PBS enshrinement. (ethereum.org)



7Block Labs is a trading name of JAYANTH TECHNOLOGIES LIMITED.

Registered in England and Wales (Company No. 16589283).

Registered Office address: Office 13536, 182-184 High Street North, East Ham, London, E6 2JA.

© 2025 7BlockLabs. All rights reserved.