v2.1 · Diego Parra · CrimsonVector

Contents

1. Abstract
2. Prior Art & What's New
3. Conceptual Foundation
4. Infrastructure & Data Source
5. Discovery Pathways
6. Pipeline Enhancements
7. Validation Framework
8. Infrastructure Mapping & Pivot Queries
9. Cross-Pathway Confidence Scoring
10. Operator Fingerprint Taxonomy
11. Hunt Workflow — Complete Operational Cycle
12. Coverage Envelope Diagnostic Template
13. Limitations & Failure Modes
14. Findings Summary Table
15. Generalizable Search Patterns
16. Case Studies
Appendix: Quick-Reference Query Cheat Sheet

CT Behavioral Fingerprinting: A Multi-Pathway Methodology for Mapping Threat Actor Infrastructure Through Certificate Transparency Log Artifacts

CrimsonVector Methodology Document, v2.1
Author: Diego Parra
Date: 2026-05-17
Classification: TLP:CLEAR (for publication)
Accompanies: UE26 Case Study Presentation

1. Abstract

CT Behavioral Fingerprinting is a novel OSINT technique for mapping threat actor infrastructure across unrelated domains by exploiting operator-embedded behavioral artifacts in Certificate Transparency (CT) log data. The technique operates through five complementary discovery pathways, each with its own coverage envelope and blind spots, to surface operator fingerprints invisible to standard CT search tools.

The methodology exploits a structural property of the CT ecosystem: automated certificate-request tooling embeds consistent strings (personal handles, configuration artifacts, campaign identifiers, brand-impersonation labels, batch-generation patterns) into subdomain labels and SAN lists during ACME validation. Because CT logs are public, permanent, and append-only, these strings become irrevocable behavioral fingerprints. Standard CT search tools (crt.sh, Censys) cannot find them because they do not perform substring search within deeply nested subdomain structures.

Two rounds of testing (May 9-13 and May 14-15, 2026) against ~246M domain rows and ~69M certificate rows surfaced 15+ validated operator fingerprints spanning Russian e-commerce phishing, Chinese brand impersonation, industrial Chinese gambling, German financial-services phishing, Japanese parasitic SEO, and multi-vertical scam portfolios — none previously reported by any threat intelligence provider.

The technique is best understood as a set of complementary lenses, not a single algorithm.

2. Prior Art & What's New

2.1 Established CT Techniques (Tool-Level)

#	Technique	Tools	Limitation
1	Subdomain enumeration	crt.sh, ct-exposer, CertStream, Censys	Requires knowing the target domain first. Cannot discover cross-domain relationships.
2	Phishing detection	phishing_catcher (x0rz), streamingphish, Phicious (RAID 2022), nettfiske	Pattern-based. Requires predefined target lists. Does not identify the operator.
3	Infrastructure clustering via shared certificates	Hunt.io, JA4X, Censys	Clusters certificates, not operators. Shared Let's Encrypt issuance proves nothing about shared operation.
4	SAN diversity analysis	Gigamon Blog (Oct 2022)	Identifies hosting platforms, not operators. Does not search within subdomain labels for behavioral artifacts.

2.2 Adjacent Academic & Industry Research

Three bodies of work are directly adjacent to this technique and must be acknowledged:

2.2.1 Intra-Label Content Analysis — Roberts & Levin (WPES 2019)

Paper: "When Certificate Transparency Is Too Transparent: Analyzing Information Leakage in HTTPS Domain Names." Proceedings of the 18th ACM Workshop on Privacy in the Electronic Society, 2019.

What it does: Demonstrates that subdomain labels within CT logs contain information-rich content that reveals organizational structure, internal project names, and infrastructure topology. The foundational observation (that CT-logged FQDNs contain analyzable content within subdomain labels, not just at the registered-domain level) is shared with this technique.

Key distinction: Roberts & Levin analyze intra-label content from a defensive privacy perspective: what do CT logs leak about the certificate requester's own organization? The analytical goal is privacy impact assessment. This technique inverts the lens — analyzing intra-label content offensively to map threat actor infrastructure across unrelated domains. The same signal surface, applied to a fundamentally different analytical question: not "what does my org leak?" but "what does the operator's tooling reveal about their infrastructure footprint?"

What Roberts & Levin would NOT catch: Cross-domain operator mapping. Their analysis characterizes leakage within a single organization's certificate portfolio. The novel contribution here is using intra-label patterns to connect infrastructure across hundreds of unrelated parent domains — sbermegamarket appearing on 322 domains owned by different entities reveals an operator relationship that single-org privacy analysis cannot surface.

2.2.2 Unsupervised Anomaly Detection on CT Attributes — Ostertág (2024)

Paper: "Anomaly Detection in Certificate Transparency Logs." arXiv:2405.05206, May 2024.

What it does: Applies Isolation Forest (unsupervised ML) to certificate metadata attributes — issuer patterns, validity periods, key usage fields, certificate chain properties — to detect anomalous certificates that may indicate misissuance, compliance violations, or operational problems.

Key distinction: Ostertág operates on an entirely different signal surface: certificate metadata (structural attributes of the X.509 object itself). This technique operates on domain-name content (the strings humans and automation embed in FQDNs within SAN fields). Ostertág's Isolation Forest would flag a certificate with an unusual validity period or key size; it would not detect that sbermegamarket appears as a subdomain label across 322 unrelated hosts, because that information lives in the SAN content, not the certificate's structural attributes.

Where Ostertág partially overlaps: Pathway C (SAN-list clustering) shares the intuition that certificates with structurally unusual properties deserve scrutiny. Ostertág's approach could complement this technique's Pathway C by flagging certificates with anomalous SAN-list sizes for further content analysis — an unsupervised pre-filter feeding into content-based inspection.

2.2.3 Graph-Theoretic SAN Co-occurrence Clustering — Infoblox (March 2026)

Publication: "Using SSL Certificates and Graph Theory to Uncover Threat Actors." Infoblox Blog, March 2026.

What it does: Constructs a graph where domains are nodes and shared certificate appearances (SAN co-occurrence) create edges. Connected components identify infrastructure under common control. Uses hierarchical "graph of graphs" for complex actor networks. Reports 135% more malicious domains discovered through cluster expansion beyond seed indicators.

Key distinction: Infoblox clusters domains that share certificates — if two domains appear together in the same certificate's SAN list, they are likely co-controlled. This is a powerful technique for expanding known infrastructure (given one bad domain, find others in its cert cluster). However, it fundamentally requires that domains share a certificate.

What Infoblox would NOT catch: Operator fingerprints embedded in subdomain labels across domains with different certificates. sbermegamarket spans 322 parent domains, but these domains do NOT share certificates — they share only a subdomain-label string. Infoblox's graph has no edge between madrid777.com and largewood666.fscp.ru because they never co-occur in the same cert's SAN list. The behavioral fingerprint connecting them is invisible to graph-theoretic SAN co-occurrence.

Where Infoblox directly overlaps: This technique's §8.2 (SAN co-occurrence analysis) and elements of Pathway C are a simplified version of Infoblox's graph approach. When this technique examines multi-SAN certificates containing a fingerprint to discover what other domains the operator bundles in the same cert, it is performing a local version of Infoblox's co-occurrence graph traversal. The overlap is acknowledged; Infoblox's formalization is more rigorous for that specific sub-task.

2.3 Novelty Claim — Refined

Given the adjacent work above, the novelty of CT Behavioral Fingerprinting rests on a specific combination that no prior work achieves:

Intra-label content analysis applied offensively for cross-domain operator mapping — Roberts & Levin (2019) established that subdomain labels contain analyzable content, but applied it defensively. No prior work uses intra-label behavioral artifacts to connect unrelated domains to a common operator.
Multi-pathway architecture combining content, structure, and timing — Ostertág (2024) detects anomalous certificate attributes; Infoblox (2026) clusters SAN co-occurrence; this technique combines intra-label content analysis (Pathways A, B, E), SAN-structure analysis (Pathway C), and issuance-timing analysis (Pathway D) into complementary discovery pathways with explicit coverage envelopes.
Operator mapping, not just infrastructure clustering — Infoblox expands known malicious infrastructure through certificate relationships. This technique discovers previously unknown operators from behavioral artifacts — the fingerprint identifies the operator before any domain is known to be malicious.
The permanence exploitation — while all CT-based techniques benefit from log immutability, this technique specifically exploits the fact that operator tooling artifacts are inadvertently and irrevocably recorded. The operator cannot retroactively scrub their handle from CT logs.

What this technique does NOT claim novelty for:
- SAN co-occurrence analysis (Infoblox's graph approach is more rigorous for that sub-task)
- The observation that CT logs contain information in subdomain labels (Roberts & Levin, 2019)
- Anomaly detection on certificate metadata (Ostertág, 2024)
- Subdomain enumeration, phishing detection, or basic CT monitoring (well-established tooling)

2.4 The Permanence Advantage

CT logs are append-only and public by design. An operator who requested a certificate with their handle embedded in a subdomain label created a permanent, irrevocable record. Few other data sources in the threat intelligence space share this property: a CT log entry cannot be deleted, and revoking the certificate does not remove it. The operator may not even realize the string is being logged.

3. Conceptual Foundation

3.1 Why CT Logs Are Uniquely Suited

Certificate Transparency logs are:
- Public: anyone can read any log
- Immutable: entries cannot be modified or deleted after submission
- Near real-time: new certificates typically become visible within seconds to minutes of issuance
- Comprehensive: browser CT policies (Chrome's among them) require publicly trusted certificates to be logged, so in practice nearly all of them are
- Operator-generated: the content reflects choices made by the certificate requester's tooling

When an operator's automation requests a Let's Encrypt certificate for avito.sber.sbermegamarket.youla.madrid777.com, the full FQDN — including all subdomain labels — is permanently recorded in multiple CT logs. The operator cannot suppress this without abandoning HTTPS entirely.

3.2 Why Standard Tools Miss This

crt.sh — the standard public CT search interface — supports identity searches (%.domain.com) and wildcard searches (%keyword%). However, it indexes certificate SAN values as complete entries. When an operator's fingerprint is embedded inside a subdomain label (e.g., 07znegeulfluxsisilafamille.smtpmail.radio-center.ru), the string znegeulfluxsisilafamille is not a separate SAN entry — it's a component of a longer FQDN. crt.sh does not perform substring search within these components.

Tested: all available crt.sh query patterns for known fingerprints returned zero results. The fingerprints are only discoverable through raw CT stream analysis with substring search capability.

3.3 The Dual-Discovery Architecture

Testing revealed a structural property: the analysis pipeline contains two fundamentally distinct discovery modes operating in parallel:

Scale-based detection — catches operators who reuse identifiers at volume (Pathways A, B, C)
Structure-based detection — catches operators whose patterns reveal coordination even when no individual identifier clears volume thresholds (Pathways D, E)

Neither alone is sufficient. The combination is what makes the methodology productive:
- The Chinese brand cluster (1028cc1028gg, 38 parents) was caught by structure (batch-issuance timestamp), not scale
- sbermegamarket (322 parents) was caught by scale, not structure
- 198901* industrial gambling (434 SANs) was caught by both

4. Infrastructure & Data Source

4.1 Ingestion Pipeline

certstream (WebSocket) → CT monitor → daily CSV → DuckDB/Parquet

Implementation: SIGIL-lite running on a local Ubuntu desktop (scribe01). Certstream captures the global CT stream from all log operators (Google, Cloudflare, Let's Encrypt, etc.).

4.2 Storage & Query Infrastructure

Component	Detail
Server	scribe01 (Ubuntu 26.04, local desktop)
Python env	`/opt/sigil/query-env/bin/python3` (duckdb, httpx, dnspython)
Data path	`/opt/sigil/data/parquet/YYYY-MM-DD/`
Domain table	`domains.parquet` — columns: `fqdn:ID(Domain)`, `tld`, `source`, `status`, `first_seen`, `last_seen`
Certificate table	`certificates.parquet` — columns: `fingerprint_sha256:ID(Certificate)`, `issuer`, `subject_cn`, `san_list`, `not_before`, `not_after`, `source`
Scale	~5-8M certificate records/day, ~24-25M domain rows/day
Storage	~1.5 GB Parquet per day
Hardware	16GB RAM desktop (sufficient for days-to-weeks of data)

4.3 Temporal Windowing

Standard analysis window: rolling 5 days. The pipeline (ATALAYA) runs nightly over the trailing 5 days of CT data, providing sufficient volume for threshold-based detection while keeping the publication cadence fresh.
Delta analysis: newly-surfaced fingerprints (not present in the prior night's run) are flagged as first-seen on the current run. Previously-surfaced fingerprints update their last_seen_dt and growth metrics.
Full backfill: CT log archives (Google Argon, Cloudflare Nimbus) for historical coverage — significant data engineering effort, not part of the standard cycle.
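The rolling window and the nightly delta step can be sketched in Python as follows. This is a minimal illustration, not the ATALAYA implementation: the function names `window_partitions` and `classify_delta` are hypothetical, the window is assumed to include the run date itself, and only the `/opt/sigil/data/parquet/YYYY-MM-DD/` layout is taken from §4.2.

```python
from datetime import date, timedelta

PARQUET_ROOT = "/opt/sigil/data/parquet"  # data path from section 4.2


def window_partitions(run_date: date, days: int = 5) -> list[str]:
    """Daily partition directories for the trailing window, oldest first.

    Assumption: the window ends on run_date, inclusive.
    """
    return [
        f"{PARQUET_ROOT}/{(run_date - timedelta(days=offset)).isoformat()}/"
        for offset in range(days - 1, -1, -1)
    ]


def classify_delta(previous: set[str], current: set[str]) -> dict[str, set[str]]:
    """Split tonight's fingerprints into first-seen and previously surfaced."""
    return {
        "first_seen": current - previous,  # not present in the prior run
        "updated": current & previous,     # refresh last_seen_dt and growth
    }


paths = window_partitions(date(2026, 5, 15))
delta = classify_delta(
    previous={"sbermegamarket"},
    current={"sbermegamarket", "1028cc1028gg"},
)
```

Keeping the window as an explicit list of partition paths, rather than a `*` glob over the whole data directory, is one way to keep a nightly run from scanning days outside the 5-day window.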

4.4 OPSEC

All discovery queries run locally on scribe01 against stored Parquet data — no external connections
External verification (WHOIS, Shodan, crt.sh) uses Mullvad WireGuard tunnel bound to 10.67.232.55
httpx.HTTPTransport(local_address="10.67.232.55") for Python requests
Never directly connect to discovered operator infrastructure

4.5 Implementation Note (AI-assisted development disclosure)

The substring indexing pipeline, validation tooling, and analysis scripts in this work were developed with LLM-assisted implementation (Claude, Anthropic). The methodology design, pathway architecture, threshold selection, manual validation calls, and attribution conclusions are the author's own. This disclosure is included because the field is in a moment of evolving norms around AI-assisted research tooling, and methodological transparency about tools should run alongside transparency about methods.

5. Discovery Pathways

Pathway preamble — Exclusion list as configuration

All discovery pathways must exclude already-validated fingerprints to prevent double-counting during new-discovery hunts. Rather than hard-coding exclusions inline in each query (AND first_label NOT LIKE '%znegeulflux%'), maintain the exclusion list in a configuration file:

# config/excluded_fingerprints.yaml
# Validated fingerprints excluded from new-discovery hunts to prevent
# double-counting. Each entry is matched as a substring (case-insensitive)
# against first_label / candidate / san_list as appropriate.

excluded:
  - znegeulflux
  - sbermegamarket
  - bfqde2023llsplde12qd27qdl
  - taiyangchengyulecheng
  - 1028cc1028gg
  - verifizierung-traderepublic
  - berufsunfhigkeitsversicherung
  # ... extended as fingerprints accumulate

excluded_platforms:
  # Legitimate platform fingerprints (not operator infrastructure)
  - betting-widgets
  - simulateur-obseques
  - gemeinsam-trauern

The query layer reads this file at run time and injects the NOT LIKE clauses dynamically. This pattern scales as the catalog grows beyond the handful currently hardcoded.
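A minimal sketch of that injection step is shown here, assuming the YAML file has already been parsed into a plain Python list. The helper name `exclusion_clauses` is hypothetical and not part of the published pipeline. Terms are restricted to hostname characters instead of being escaped, which is an assumption made to keep the generated SQL safe; the column name is expected to come from trusted code, not from the config file.

```python
import re

# Hostname-safe characters only; anything else is rejected rather than escaped.
_SAFE_TERM = re.compile(r"[a-z0-9.-]+")


def exclusion_clauses(column: str, terms: list[str]) -> str:
    """Build one 'AND lower(col) NOT LIKE ...' line per excluded term."""
    lines = []
    for term in terms:
        term = term.strip().lower()  # config entries match case-insensitively
        if not _SAFE_TERM.fullmatch(term):
            raise ValueError(f"unsafe exclusion term: {term!r}")
        lines.append(f"  AND lower({column}) NOT LIKE '%{term}%'")
    return "\n".join(lines)


# In practice this list would be loaded from config/excluded_fingerprints.yaml.
excluded = ["znegeulflux", "sbermegamarket", "1028cc1028gg"]
clause = exclusion_clauses("first_label", excluded)
```

The resulting string can be substituted for the `-- (exclusions injected from config/excluded_fingerprints.yaml)` placeholder that appears in the pathway queries.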

Pathway A — Threshold-Based First-Label Aggregation

Purpose: Count unique first labels across all FQDNs in the analysis window and identify those that appear across anomalously many parent domains.

Coverage envelope: Operations above the parent-count threshold where the operator reuses a consistent leading subdomain label. The query itself admits any label seen on 3 or more parents; in the current data the top-100 cutoff works out to roughly 36 or more parents.

Known blind spots:
- Operators using random per-domain suffixes (e.g., huawei-ndfjejfh-e09f0, which yields a distinct first_label per deployment)
- Operations with fewer parents than the threshold (<36 in current data)
- Punycode/IDN strings that look different at the byte level vs. the visual level

Executable query:

-- PATHWAY A: First-label aggregation across parent domains
-- Finds consistent subdomain labels deployed on many unrelated hosts
WITH extracted AS (
    SELECT
        "fqdn:ID(Domain)" as fqdn,
        regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
        split_part("fqdn:ID(Domain)", '.', 1) as first_label
    FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
    WHERE length("fqdn:ID(Domain)") > 40
)
SELECT
    first_label,
    count(DISTINCT parent_domain) as domain_count,
    count(*) as total_fqdns,
    list(DISTINCT parent_domain ORDER BY parent_domain)[:5] as sample_domains
FROM extracted
WHERE length(first_label) >= 12
  -- (exclusions injected from config/excluded_fingerprints.yaml)
  AND first_label NOT LIKE 'www%'
  AND first_label NOT LIKE 'mail%'
  AND first_label NOT LIKE 'autodiscover%'
  AND first_label NOT LIKE '%notexists%'
GROUP BY first_label
HAVING domain_count >= 3
ORDER BY domain_count DESC
LIMIT 100

Signal/noise heuristics:
- Filter: WHM panel encodings (*whm, *mvd), UUID identifiers, wildcard detection strings
- Filter: Common infrastructure labels (www, mail, autodiscover, webmail, cpanel)
- Keep: Non-dictionary strings 15+ chars, strings with unexpected language for TLD distribution

Example findings:
- sbermegamarket: 322 parent domains, Russian phishing infrastructure
- bfqde2023llsplde12qd27qdl: 67 parent domains, unknown operator handle
- ai-assistant: 42 parent domains, German SEO exploitation
- taiyangchengyulecheng: 96 parent domains, Chinese gambling SEO
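For readers who want to test the aggregation logic on a small sample without DuckDB, the same idea can be sketched in plain Python. The function name `first_label_counts` and the sample FQDNs are illustrative assumptions. Like the query, the sketch treats the last two labels as the parent domain and does no public-suffix handling; unlike the query, it omits the 40-character FQDN length filter.

```python
from collections import defaultdict


def first_label_counts(fqdns, min_label_len=12, min_parents=3):
    """Rank first labels by the number of distinct parent domains they appear on."""
    parents_by_label = defaultdict(set)
    for fqdn in fqdns:
        labels = fqdn.lower().rstrip(".").split(".")
        if len(labels) < 3:  # need at least one label above the parent domain
            continue
        if len(labels[0]) >= min_label_len:
            parents_by_label[labels[0]].add(".".join(labels[-2:]))
    ranked = [
        (label, len(parents))
        for label, parents in parents_by_label.items()
        if len(parents) >= min_parents
    ]
    # Highest parent count first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Synthetic sample data, not taken from the CT corpus.
sample = [
    "sbermegamarket.shop.example.com",
    "sbermegamarket.example.net",
    "sbermegamarket.a.example.org",
    "sbermegamarket.b.example.org",
    "longoperatorhandle.example.com",
    "www.example.com",
]
ranking = first_label_counts(sample)
```

Counting distinct parents, not raw FQDN rows, is what separates a label reused across unrelated hosts from one that is merely repeated many times under a single domain.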

Pathway B — Substring Frequency Analysis (14+ chars)

Purpose: Search for any long substring (14+ characters) occurring across many parent domains, catching operator artifacts embedded anywhere in the FQDN structure (not just leading label).

Coverage envelope: Substrings above a length threshold (14+ chars) and above a domain-count threshold, regardless of position in the FQDN.

Known blind spots:
- Shorter operator identifiers (510qq at 5 chars, 7K at 2 chars, h5. at 2 chars)
- Results dominated by noise patterns (alphabet enumeration, WHM panel artifacts) that crowd out meaningful results in top-N
- Diacritic-stripped multilingual strings require normalization to match (see §6.2)

Executable query:

-- PATHWAY B: Substring frequency analysis
-- Finds repeated unusual substrings (14+ chars) across domains
SELECT
    -- First run of 14-30 lowercase letters. Matching {14,30} directly (rather
    -- than {12,30} plus a length filter) avoids discarding an FQDN whose first
    -- 12-13 letter run precedes a longer one.
    regexp_extract("fqdn:ID(Domain)", '([a-z]{14,30})', 1) as candidate,
    count(DISTINCT regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1)) as domain_count,
    count(*) as total
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE length("fqdn:ID(Domain)") > 50
  -- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY candidate
-- The length check also drops FQDNs with no match (empty candidate)
HAVING domain_count >= 5 AND length(candidate) >= 14
ORDER BY domain_count DESC
LIMIT 100

Signal/noise heuristics:
- Filter: Known CMS patterns, CDN slugs, ACME client defaults
- Filter: Dictionary words (infrastructure, communications, administration)
- Filter: Reversed-alphabet sequences (hgfedcbaupdate, jihgfedcbaupdate)
- Keep: Non-dictionary strings in unexpected languages, strings with consistent structure across deployments

Example findings:
- berufsunfhigkeitsversicherung: 20 parent domains, German insurance phishing (ä stripped)
- sbermegamarket: also surfaced here (cross-pathway confirmation)
- znegeulfluxsisilafamille: 141 parent domains (the original case study)
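The alphabet-enumeration noise named in the heuristics (hgfedcbaupdate, jihgfedcbaupdate) lends itself to a mechanical filter. The sketch below is one possible approach, not the pipeline's actual filter: the function names and the default `min_run=6` are assumptions chosen for illustration.

```python
def longest_alphabet_run(s: str) -> int:
    """Length of the longest run of alphabetically consecutive letters,
    in either ascending (abc...) or descending (cba...) order."""
    best = run = 1 if s else 0
    direction = 0
    for prev, cur in zip(s, s[1:]):
        step = ord(cur) - ord(prev)
        if step in (1, -1) and prev.isalpha() and cur.isalpha():
            # Extend the run if it continues in the same direction,
            # otherwise start a new two-letter run.
            run = run + 1 if step == direction else 2
            direction = step
        else:
            run, direction = 1, 0
        best = max(best, run)
    return best


def is_alphabet_noise(candidate: str, min_run: int = 6) -> bool:
    """Flag candidates dominated by an alphabet enumeration sequence."""
    return longest_alphabet_run(candidate.lower()) >= min_run


noise = [c for c in ["hgfedcbaupdate", "jihgfedcbaupdate", "sbermegamarket"]
         if is_alphabet_noise(c)]
```

Applied to Pathway B candidates before ranking, a filter like this would keep enumeration artifacts from occupying top-N slots.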

Pathway C — SAN-List Pattern Clustering

Purpose: Identify certificates with unusually large or structured SAN lists, revealing industrial-scale operator infrastructure that bundles many domains in single certificate issuances.

Coverage envelope: Operators using batch certificate issuance with large SAN bundles (50-500+ domains per cert).

Known blind spots:
- Operations using per-domain individual certificates rather than SAN bundling
- Legitimate brand defensive portfolios (MercadoLibre, CSL Behring, Kaiser Permanente, CBRE) that look structurally identical to scam portfolios
- Operations bundled with distractor/legitimate SANs that camouflage operator content

Executable query:

-- PATHWAY C: SAN-list pattern clustering
-- Finds certificates with unusually large/repeated SAN lists
SELECT
    "san_list",
    count(*) as cert_count,
    length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE length("san_list") > 200
  -- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY "san_list"
HAVING cert_count >= 3
ORDER BY cert_count DESC
LIMIT 50

Signal/noise heuristics:
- Filter: Known legitimate brand portfolios (see §6.5 allowlist)
- Filter: AWS/Google test infrastructure (sadbirds.aws.dev, certsbridge.com)
- Keep: SAN lists with TLD-sweep patterns, sequential numeric domains, random-prefix clusters
- Flag: Multi-handle structure (3+ distinct sequential families in one cert, see §6.6)

Example findings:
- 198901* multi-handle cluster: 434 SANs, industrial Chinese gambling
- .garden TLD cluster: 191 SANs, coordinated registration
- 510qq: 100 SANs per cert, 900+ certificates, industrial gambling
- betting-widgets platform: 1,114 parent domains
- endofobama9 political/financial scam portfolio: 93 SANs per cert (DigiCert)
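One of the "keep" signals above, the TLD-sweep pattern, can be checked per certificate with a few lines of Python. This is a sketch under stated assumptions: `san_list` is taken to be a comma-separated string (as the Pathway C query implies), the last label is treated as the TLD with no public-suffix handling, and the function name `tld_sweeps` and the `min_tlds=3` cutoff are illustrative.

```python
from collections import defaultdict


def tld_sweeps(san_list: str, min_tlds: int = 3) -> dict[str, list[str]]:
    """Find second-level names that repeat across several TLDs in one SAN list."""
    tlds_by_name = defaultdict(set)
    for san in san_list.lower().split(","):
        # Drop whitespace and any leading wildcard ("*.") before splitting.
        labels = san.strip().lstrip("*.").split(".")
        if len(labels) >= 2:
            tlds_by_name[labels[-2]].add(labels[-1])
    return {
        name: sorted(tlds)
        for name, tlds in tlds_by_name.items()
        if len(tlds) >= min_tlds
    }


# Synthetic SAN list, not taken from the CT corpus.
san_list = "shop-demo.cc, shop-demo.vip, www.shop-demo.mobi, *.shop-demo.win, portal.example.com"
san_count = len(san_list.split(","))
sweeps = tld_sweeps(san_list)
```

A certificate whose SAN list yields one or more sweep names is a candidate for manual review; the allowlist in §6.5 still applies, since defensive brand registrations produce the same shape.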

Pathway D — Batch Issuance Timestamp Clustering

Purpose: Group certificates by issuance timestamp (approximated in the query by the not_before field) at minute-level granularity and identify anomalous co-occurrences that reveal coordinated operator infrastructure, even when individual cert SANs are not unusually large.

Coverage envelope: Operations that batch-issue certificates in coordinated runs, regardless of SAN-list size or subdomain-label consistency.

Known blind spots:
- Operators who spread issuance across hours or days to avoid timestamp clustering
- Let's Encrypt rate limiting may naturally distribute issuance for some operators
- High-volume legitimate services (CDNs, hosting panels) create baseline noise at minute granularity

Executable query:

-- PATHWAY D: Batch issuance timestamp clustering
-- Identifies anomalous cert-issuance bursts at minute granularity
WITH cert_minutes AS (
    SELECT
        date_trunc('minute', CAST("not_before" AS TIMESTAMP)) as issue_minute,
        "issuer",
        "san_list",
        "fingerprint_sha256:ID(Certificate)" as cert_id
    FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
    WHERE "not_before" IS NOT NULL
)
SELECT
    issue_minute,
    issuer,
    count(*) as certs_in_minute,
    count(DISTINCT san_list) as unique_san_lists,
    list(san_list)[:3] as sample_sans
FROM cert_minutes
GROUP BY issue_minute, issuer
HAVING certs_in_minute >= 10
ORDER BY certs_in_minute DESC
LIMIT 50

Structural analysis follow-up (for flagged minute-buckets):

-- For a flagged minute bucket, extract and analyze all SANs
WITH flagged_certs AS (
    SELECT "san_list"
    FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
    WHERE date_trunc('minute', CAST("not_before" AS TIMESTAMP)) = '{flagged_minute}'
      AND "issuer" LIKE '%Let''s Encrypt%'
),
san_entries AS (
    -- Unnest in its own CTE so the outer query groups on a plain column
    SELECT unnest(string_split(san_list, ',')) as san_entry
    FROM flagged_certs
)
SELECT
    san_entry,
    count(*) as occurrences
FROM san_entries
GROUP BY san_entry
ORDER BY occurrences DESC
LIMIT 200

Signal/noise heuristics:
- Filter: Known high-volume legitimate issuers at expected times (CDN renewals, hosting panel auto-renewals)
- Keep: Let's Encrypt bursts with multiple distinct SAN lists (indicates distinct certs issued simultaneously)
- Flag: Minute-buckets where SANs share common parent-domain substrings or coordinated brand prefixes

Example findings:
- Chinese brand cluster (1028cc1028gg): all certificates issued at 2026-05-14 16:02 UTC, 10 brand subdomains across operator-generated .cc/.vip/.mobi/.win domains
- This pathway found the highest-quality TEST_02 findings that threshold-based pathways missed
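The minute-bucketing step itself is simple enough to prototype outside DuckDB. The sketch below mirrors the first Pathway D query on in-memory `(not_before, issuer)` pairs. The function name `minute_buckets` is hypothetical and the sample timestamps are synthetic.

```python
from collections import Counter
from datetime import datetime


def minute_buckets(certs, min_certs=10):
    """Count certificates per (issuance minute, issuer) and keep bursts only.

    certs is an iterable of (not_before: datetime, issuer: str) pairs.
    """
    counts = Counter(
        (not_before.replace(second=0, microsecond=0), issuer)
        for not_before, issuer in certs
    )
    # most_common() yields the largest buckets first.
    return [(key, n) for key, n in counts.most_common() if n >= min_certs]


# Synthetic data: a 12-certificate burst in one minute plus background noise.
certs = [(datetime(2026, 1, 1, 0, 0, s), "Let's Encrypt") for s in range(12)]
certs += [
    (datetime(2026, 1, 1, 0, 3, 1), "Let's Encrypt"),
    (datetime(2026, 1, 1, 0, 3, 30), "Other CA"),
]
bursts = minute_buckets(certs)
```

Bucketing by issuer as well as by minute keeps a burst from one CA from being diluted or inflated by unrelated issuance elsewhere in the same minute.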

Pathway E — Targeted Brand/Keyword Sweep

Purpose: Search CT data for specific brand names, financial-service terms, or operator-relevant keywords embedded in subdomain labels. Catches brand-impersonation phishing below other thresholds.

Coverage envelope: Operations that embed recognizable brand names or service terms in subdomain labels, regardless of scale.

Known blind spots:
- Operators not using recognizable brand names (opaque handles like bfqde2023llsplde12qd27qdl)
- Brand names used legitimately (requires validation step)
- Keyword corpus is never complete, so novel targets will be missed

Executable query:

-- PATHWAY E: Targeted brand/keyword sweep
-- Parameterized: replace {keyword} with target brand/term
SELECT
    "fqdn:ID(Domain)" as fqdn,
    regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
    split_part("fqdn:ID(Domain)", '.', 1) as first_label
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" ILIKE '%{keyword}%'
  AND regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) NOT ILIKE '%{keyword}%'
      -- exclude cases where keyword IS the registered domain
GROUP BY fqdn, parent_domain, first_label
ORDER BY parent_domain

Keyword corpus (maintained and expanded per region):

Region	Keywords
German financial	traderepublic, sparkasse, volksbank, commerzbank, verifizierung, legitimation, tan-verfahren
Russian e-commerce	sber, sbermegamarket, avito, ozon, yandex, cdek, pochta, blablacar
Chinese tech/telecom	huawei, xiaomi, taobao, jingdong, tianmao, yidong, dianxin, liantong
Global financial	paypal, coinbase, binance, kraken, revolut
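Insurance (DE)	berufsunfähigkeit, haftpflicht, versicherung, altersvorsorge (FQDNs in CT data are ASCII, so umlaut terms must be searched in their folded forms, e.g. berufsunfhigkeit; see §6.2)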

Signal/noise heuristics:
- Filter: Results where the keyword IS the registered domain (legitimate brand sites)
- Filter: Known CDN/hosting subdomains for those brands
- Keep: Keywords appearing as subdomain labels on unrelated parent domains
- Flag: Cross-country deployment pattern (same keyword on domains in 5+ countries = automated deployment)

Example findings:
- verifizierung-traderepublic: 14 compromised parent domains in 12 countries
- berufsunfhigkeitsversicherung: 20 parent domains (German insurance phishing)
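The core filter of Pathway E (keyword present in the FQDN but absent from the registered domain) can be expressed compactly in Python for spot checks. This is a sketch, not the production sweep: the function name `keyword_hits` is hypothetical, the sample FQDNs are synthetic, and the parent domain is approximated as the last two labels, as in the query.

```python
def keyword_hits(fqdns, keyword):
    """Return sorted (parent_domain, fqdn) pairs where the keyword appears
    in the FQDN but not in the registered (parent) domain."""
    keyword = keyword.lower()
    hits = set()
    for fqdn in fqdns:
        name = fqdn.lower().rstrip(".")
        labels = name.split(".")
        if len(labels) < 3:  # no subdomain label to inspect
            continue
        parent = ".".join(labels[-2:])
        if keyword in name and keyword not in parent:
            hits.add((parent, name))
    return sorted(hits)


# Synthetic sample data, not taken from the CT corpus.
sample = [
    "login.paypal.example.org",
    "secure-paypal.example.net",
    "www.paypal.com",
    "paypal.com",
    "blog.example.org",
]
hits = keyword_hits(sample, "PayPal")
```

Using a set before sorting plays the same deduplicating role as the `GROUP BY fqdn, parent_domain, first_label` clause in the query.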

6. Pipeline Enhancements

6.1 Codify Pathway D (batch-issuance timestamp clustering)

Status: Now codified as a first-class pathway (see §5, Pathway D above).

Rationale: This pathway found the Chinese brand cluster (1028cc1028gg) — the highest-quality TEST_02 finding — which was invisible to all threshold-based queries because its 38 parent domains were distributed across 10 brand prefixes with random suffixes.

What it captures that Pathways A-C don't: Operators who use random per-domain suffixes (so no first_label aggregates), but batch-issue certs in coordinated runs.

6.2 Multilingual / Diacritic-Aware String Normalization (Pathway B Enhancement)

Rationale: The berufsunfhigkeitsversicherung finding (German "occupational disability insurance," ä stripped to nothing) revealed that operator automation normalizes diacritics inconsistently. The same artifact applies to French (é/è), Spanish (ñ), Portuguese (ã/ç), Czech (č/š/ž), Turkish (ı/ş/ğ).

Implementation:

-- Generate ASCII-folded variants for Pathway B candidates
-- Approach: strip diacritics, then also check digraph transliteration
WITH candidates AS (
    -- ... standard Pathway B query ...
),
normalized AS (
    SELECT
        candidate,
        -- NFD decomposition + strip combining marks (Python-side)
        -- Also generate: ä→ae, ö→oe, ü→ue, ß→ss transliterations
        regexp_replace(candidate, '[äáàâã]', 'a', 'g') as folded_a,
        regexp_replace(candidate, '[öóòô]', 'o', 'g') as folded_o,
        regexp_replace(candidate, '[üúùû]', 'u', 'g') as folded_u
    FROM candidates
)
SELECT * FROM normalized

Full implementation requires Python-side processing:

import unicodedata

def generate_normalization_variants(s):
    """Generate all plausible diacritic-normalization variants."""
    # Variant 1: Fold to base letters via NFD (ä→a, é→e)
    nfd = unicodedata.normalize('NFD', s)
    stripped = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')

    # Variant 2: German transliteration (ä→ae, ö→oe, ü→ue, ß→ss)
    german = s.replace('ä', 'ae').replace('ö', 'oe').replace('ü', 'ue').replace('ß', 'ss')

    # Variant 3: Delete non-ASCII characters outright (operator's observed
    # choice: berufsunfähigkeit → berufsunfhigkeit)
    deleted = ''.join(c for c in s if c.isascii())

    # Preserve order, drop duplicates (pure-ASCII input yields a single entry)
    return list(dict.fromkeys([s, stripped, german, deleted]))

Coverage gap closed: Multilingual phishing operations whose automation handles diacritics in ways the analyst doesn't anticipate.

6.3 IDN/Punycode-Aware Aggregation (Pathway A Enhancement)

Rationale: The Japanese parasitic SEO cluster (xn--n8jub3cxopfw59v90r725esqg = "失敗しないカニ通販") was found through batch inspection, not Pathway A, because punycode strings don't aggregate intuitively at the byte level.

Implementation:

import encodings.idna

def decode_idn_label(label):
    """Decode punycode first-label to Unicode for aggregation."""
    if label.lower().startswith('xn--'):
        try:
            # Labels the built-in 'idna' codec rejects are returned unchanged
            return label.encode('ascii').decode('idna')
        except UnicodeError:  # also covers UnicodeDecodeError
            return label
    return label

# In Pathway A post-processing:
# 1. For every first_label starting with xn--, decode to Unicode
# 2. Aggregate also by decoded form
# 3. Flag clusters where multiple xn-- strings appear on overlapping parent-domain sets

Enhanced Pathway A query (IDN-aware):

-- Select punycode first-labels for the decode-and-reaggregate step
SELECT
    first_label,
    domain_count,
    'IDN_DECODE_NEEDED' as label_type
FROM (
    -- ... standard Pathway A query, run without its LIMIT 100 so that
    -- lower-ranked xn-- labels are not cut off before this filter ...
)
WHERE first_label LIKE 'xn--%'
ORDER BY domain_count DESC

Coverage gap closed: SEO poisoning, parasitic SEO, and IDN-encoded brand impersonation targeting non-Latin-script markets (Japanese, Chinese, Korean, Cyrillic, Arabic).
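The three post-processing steps listed in the comments above (decode, re-aggregate by decoded form, flag overlapping parent sets) could be wired together roughly as follows. This is a sketch with hypothetical function names and synthetic input. It relies on Python's built-in `idna` codec, and labels that codec rejects are kept in their raw `xn--` form.

```python
from collections import defaultdict
from itertools import combinations


def decode_label(label: str) -> str:
    """Decode an xn-- label to Unicode; return it unchanged on failure."""
    label = label.lower()
    if not label.startswith("xn--"):
        return label
    try:
        return label.encode("ascii").decode("idna")
    except UnicodeError:
        return label


def aggregate_idn(pairs):
    """Group (first_label, parent_domain) pairs by decoded label, xn-- labels only."""
    parents = defaultdict(set)
    for first_label, parent in pairs:
        if first_label.lower().startswith("xn--"):
            parents[decode_label(first_label)].add(parent)
    return {label: sorted(p) for label, p in parents.items()}


def overlapping_labels(parents_by_label, min_shared=1):
    """Flag pairs of decoded labels that share parent domains."""
    flagged = []
    for (a, pa), (b, pb) in combinations(sorted(parents_by_label.items()), 2):
        shared = sorted(set(pa) & set(pb))
        if len(shared) >= min_shared:
            flagged.append((a, b, shared))
    return flagged


# Synthetic pairs using well-known punycode examples.
pairs = [
    ("xn--mnchen-3ya", "example.com"),
    ("xn--bcher-kva", "example.com"),
    ("XN--BCHER-KVA", "example.net"),
    ("mail", "example.com"),
]
by_label = aggregate_idn(pairs)
flags = overlapping_labels(by_label)
```

Aggregating on the decoded form also merges case variants of the same A-label, which would otherwise be counted as separate first labels.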

6.4 Lower-Threshold Sweep with Signature Filters

Rationale: Several high-value findings were below the standard top-100 threshold (verifizierung-traderepublic at 14 parents, Chinese brand cluster at 38 with distributed prefixes). The current threshold is optimized for noise filtering but may be too aggressive.

Implementation:

-- LOWER-THRESHOLD SWEEP: longer strings, stricter signature filters
-- Runs as secondary pass with separate review queue
WITH extracted AS (
    SELECT
        "fqdn:ID(Domain)" as fqdn,
        regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
        split_part("fqdn:ID(Domain)", '.', 1) as first_label
    FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
    WHERE length("fqdn:ID(Domain)") > 40
)
SELECT
    first_label,
    count(DISTINCT parent_domain) as domain_count,
    count(*) as total_fqdns
FROM extracted
WHERE length(first_label) >= 20  -- longer = more likely operator signature
  AND (
    first_label LIKE '%verifizierung%'
    OR first_label LIKE '%confirm%'
    OR first_label LIKE '%secure%'
    OR first_label LIKE '%update%'
    OR first_label LIKE 'xn--%'  -- punycode
    OR regexp_matches(first_label, '[a-z]{20,}')  -- 20+ consecutive alpha chars anywhere in the label (DuckDB's ~ is a full-string match)
  )
  -- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY first_label
HAVING domain_count >= 10  -- much lower threshold
ORDER BY domain_count DESC
LIMIT 200

Coverage gap closed: Smaller-scale operations that are nevertheless meaningful (active phishing targeting actual users) but invisible to top-100 filtering.

6.5 Legitimate-Brand Portfolio Allowlist

Rationale: MercadoLibre (250-cert SAN bundle with 200+ defensive registrations), CSL Behring, Kaiser Permanente, CBRE, Tata Motors all surface in Pathway C as massive SAN bundles structurally identical to scam portfolios.

Implementation:

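# Maintained allowlist of known legitimate large brand portfolios
BRAND_ALLOWLIST = {
    'mercadolibre': ['mercadolibre.com', 'mercadopago.com', 'melilink.com'],
    'csl_behring': ['cslbehring.com'],
    'kaiser': ['kaiserpermanente.org', 'kp.org'],
    'cbre': ['cbre.com'],
    'sabre': ['sabre.com', 'radixx.com'],
    # ... extend as discovered
}

def is_likely_brand_portfolio(san_list):
    """Return the brand key if the SAN list contains an allowlisted domain.

    san_list is the comma-separated SAN string from certificates.parquet.
    A SAN matches only if it equals a marker or is a subdomain of it, so a
    lookalike that merely contains the marker text is not allowlisted.
    """
    sans = [s.strip().lower() for s in san_list.split(',') if s.strip()]
    for brand, markers in BRAND_ALLOWLIST.items():
        for marker in markers:
            if any(s == marker or s.endswith('.' + marker) for s in sans):
                return brand
    return None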

Coverage gap closed: Recurring false positives. Frees analyst attention for genuine operator infrastructure.

6.6 Multi-Handle SAN Structure Detection

Rationale: The 198901* finding's most distinctive feature is not any individual handle prefix — it's the parallel-handle structure. A single certificate bundling six distinct sequential-prefix families (k9gj9*, s66om*, tmgj9*, w6h987*, su78993*, 198901*) is a fingerprint of industrial domain-generation tooling.

Implementation:

-- MULTI-HANDLE DETECTION: Find certs with multiple sequential prefix families
WITH san_entries AS (
    SELECT
        "fingerprint_sha256:ID(Certificate)" as cert_id,
        unnest(string_split("san_list", ',')) as san_entry
    FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
    WHERE length("san_list") > 1000  -- large SAN lists only
),
prefixes AS (
    SELECT
        cert_id,
        regexp_extract(san_entry, '^([a-z]+)', 1) as alpha_prefix,
        san_entry
    FROM san_entries
    WHERE regexp_matches(san_entry, '^[a-z]+[0-9]+\.')  -- alpha prefix + numeric suffix pattern (partial match; DuckDB's ~ is a full-string match)
)
SELECT
    cert_id,
    count(DISTINCT alpha_prefix) as distinct_prefixes,
    list(DISTINCT alpha_prefix)[:10] as prefix_families,
    count(*) as total_sans
FROM prefixes
GROUP BY cert_id
HAVING distinct_prefixes >= 3  -- 3+ distinct sequential families
ORDER BY distinct_prefixes DESC
LIMIT 50

Coverage gap closed: Industrial domain-generation toolkit patterns (510qq, 198901*) detected more robustly than per-handle aggregation.
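
For analysts who want to inspect a single SAN list outside DuckDB, the family-counting idea can be approximated in Python. Note the deliberate difference from the SQL above: this sketch groups first labels by a fixed-length leading prefix rather than by an alphabetic prefix, so that digit-leading families such as 198901* are also grouped. The function names, `prefix_len`, `min_members`, and `min_families` are assumptions for illustration, and the SAN names in the usage example are synthetic.

```python
from collections import Counter

def prefix_families(san_list, prefix_len=4, min_members=3):
    """Group SAN first labels by their leading prefix_len characters and
    return the prefixes that occur at least min_members times."""
    counts = Counter()
    for san in san_list.split(','):
        first_label = san.strip().lower().split('.')[0]
        # Labels no longer than the prefix carry no suffix to vary; skip them.
        if len(first_label) > prefix_len:
            counts[first_label[:prefix_len]] += 1
    return sorted(p for p, n in counts.items() if n >= min_members)

def is_multi_handle(san_list, min_families=3, **kwargs):
    """True when the SAN list bundles min_families or more prefix families."""
    return len(prefix_families(san_list, **kwargs)) >= min_families

# Synthetic SAN list built from three of the prefix families named above.
sans = ','.join(f'{p}{i}.top' for p in ('k9gj9', 's66om', '198901') for i in range(1, 5))
print(prefix_families(sans), is_multi_handle(sans))
```

A fixed prefix length is crude: it will also group unrelated labels that happen to share four leading characters, so hits still need manual review.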

7. Validation Framework

7.1 Phase 2 Validation (from discovery to confirmed fingerprint)

For each candidate string from Phase 1 (any pathway):

Hosting provider artifact check — does the string appear only on domains hosted by one provider? (Could be hosting-side injection rather than operator-side)
Known software pattern check — does it match CMS slugs, ACME client defaults, control panel artifacts (WHM, cPanel, Plesk)?
Cross-hosting-provider presence — if the same string appears on TIMEWEB, JSC Datacenter, and EVANZO hosting, it's almost certainly operator-side
Operational context — does the string appear alongside operational indicators (phishing subdomains, admin panels, exfiltration channels, VPN impersonation)?
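
The hosting-provider checks above can be expressed as a small helper once each parent domain has been resolved to a provider. This is a sketch only: how the provider is looked up (for example via ASN data) is outside its scope, the input mapping is hypothetical, and the `min_providers=3` threshold is an assumption taken from the three-provider example in the list.

```python
from collections import Counter

def assess_hosting_spread(provider_by_parent, min_providers=3):
    """Classify a candidate string by how many hosting providers its
    parent domains are spread across."""
    if not provider_by_parent:
        return 'no data'
    distribution = Counter(provider_by_parent.values())
    if len(distribution) == 1:
        return 'single-provider: possible hosting-side injection'
    if len(distribution) >= min_providers:
        return 'cross-provider: consistent with operator-side origin'
    return 'inconclusive'

spread = {'a.example': 'TIMEWEB', 'b.example': 'JSC Datacenter', 'c.example': 'EVANZO'}
print(assess_hosting_spread(spread))
```

The result is an input to analyst judgment, not a verdict; the software-pattern and operational-context checks still apply.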

7.2 Infrastructure Stratification

Once validated, classify the parent domain set into operational strata (demonstrated on sbermegamarket's 322 domains):

Stratum	Characteristics	Example
Operator-controlled	Shared registrar, coordinated naming, operator-themed content	Casino fronts (`1win-ggg8.xyz`), scam portfolios (`shield`), disposable throwaways (`order-NNNNN.shop`)
Compromised	Diverse registrars, diverse creation dates, legitimate prior history	Small businesses, expired domains, `.ru` businesses
TLD-adjacent / DNS-service	Wildcard DNS, dynamic DNS, subdomain-registrar services	`sslip.io`, `ddnsfree.com`, `de.com`

Stratification query:

-- Extract parent domain set for stratification
SELECT
    regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
    regexp_extract("fqdn:ID(Domain)", '\.([^.]+)$', 1) as tld,
    count(*) as fqdn_count,
    count(DISTINCT split_part("fqdn:ID(Domain)", '.', 1)) as unique_first_labels
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%{fingerprint_string}%'
GROUP BY parent_domain, tld
ORDER BY fqdn_count DESC
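
Once the parent-domain set is extracted, a first-pass stratum label could be assigned along these lines. This is a rough sketch under strong assumptions: registrar data comes from a separate WHOIS/RDAP lookup (the input mapping is hypothetical), `cluster_min` is an arbitrary threshold, and registrar clustering is only one of the signals in the table above. Creation dates and naming coordination are not modeled, so the labels are provisional.

```python
from collections import Counter

# Example DNS-service parents taken from the stratification table.
DNS_SERVICE_PARENTS = {'sslip.io', 'ddnsfree.com', 'de.com'}

def stratify(registrar_by_parent, cluster_min=5):
    """Assign each parent domain a provisional stratum label."""
    counts = Counter(
        registrar for parent, registrar in registrar_by_parent.items()
        if parent not in DNS_SERVICE_PARENTS
    )
    strata = {}
    for parent, registrar in registrar_by_parent.items():
        if parent in DNS_SERVICE_PARENTS:
            strata[parent] = 'dns-service'
        elif counts[registrar] >= cluster_min:
            strata[parent] = 'operator-controlled?'  # registrar cluster; needs review
        else:
            strata[parent] = 'compromised?'          # no clustering signal
    return strata
```

A very popular registrar will form a cluster by volume alone, so the `operator-controlled?` label should be confirmed against naming patterns and creation dates before it is relied on.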

7.3 Brand-Portfolio False-Positive Filtering

Legitimate brand defensive-portfolio registrations can look structurally identical to scam portfolios. Typical markers:
- SAN bundles with 100+ domains covering typos, country variants, product extensions
- All domains traceable to a single known corporation
- Consistent naming pattern (brand + variant suffix)

Mitigation: Cross-reference against corporate registries and company databases (Crunchbase, EGRUL, Companies House) before operator attribution. Maintain the allowlist from §6.5.

8. Infrastructure Mapping & Pivot Queries

8.1 Fingerprint Expansion (map all infrastructure)

-- Map all domains touched by a validated fingerprint
SELECT
    regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
    count(*) as fqdn_count,
    list(DISTINCT split_part("fqdn:ID(Domain)", '.', 1))[:10] as sample_first_labels
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%{fingerprint_string}%'
GROUP BY parent_domain
ORDER BY fqdn_count DESC

8.2 SAN Co-occurrence Analysis

-- Find multi-SAN certificates containing the fingerprint
-- Reveals what other infrastructure the operator bundles alongside the fingerprint
SELECT
    "fingerprint_sha256:ID(Certificate)" as cert_id,
    "san_list",
    "not_before",
    "issuer",
    length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE "san_list" LIKE '%{fingerprint_string}%'
  AND "san_list" LIKE '%,%'  -- multi-SAN certs only
ORDER BY san_count DESC

8.3 Parent Domain Classification

After extracting the parent domain set, classify via:

Classification	Signal
Operator-controlled	Shared registrar, shared DNS (e.g., Yandex NS), registrar clustering, coordinated creation dates, Russian registrar on non-.ru TLD
Compromised	Diverse registrars, diverse creation dates, no clustering — consistent with WHM/cPanel brute-force
Development/testing	High FQDN count with fuzzing patterns, GitLab paths, alphabet sequences

8.4 Attribution Pivot Order

From mapped domains, pivot through (in decreasing reliability):

WHOIS/RDAP — registrant names, organizations, taxpayer IDs (INN), emails
Corporate registries — EGRUL (Russia), Companies House (UK), state SOS databases (US)
DNS records — shared nameservers, MX records, SPF/DKIM keys
Shodan/Censys — shared service profiles, exposed databases, control panels
Certificate timeline — issuance timestamps correlate campaigns to operational windows

9. Cross-Pathway Confidence Scoring

Confidence	Criteria	Action
HIGH	Confirmed by 2+ independent pathways	Report as validated operator fingerprint
MODERATE	Single pathway only, passes validation framework	Flag for further investigation, report with caveat
LOW	Single pathway, ambiguous validation	Hold for additional data windows

Examples from testing:

Finding	Pathways	Confidence
sbermegamarket	A + B + C	HIGH
198901* cluster	C + D	HIGH
1028cc1028gg	D (batch-timestamp)	HIGH (elevated: structural evidence overwhelming)
berufsunfhigkeitsversicherung	B	HIGH (elevated: operational context clear)
verifizierung-traderepublic	E (keyword)	HIGH (elevated: cross-country deployment pattern)
ai-assistant	A	MODERATE
.garden cluster	C	MODERATE (purpose TBD)

Note: Single-pathway findings can be elevated to HIGH when structural evidence (batch-issuance timing, cross-country deployment, multi-brand coordination) provides the equivalent of independent confirmation.
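
The rubric and the elevation note can be read as a small decision function. The sketch below is one possible encoding, not the author's tooling: the function name and parameters are hypothetical, and it assumes that elevation applies only to findings that have already passed the validation framework.

```python
def assign_confidence(pathways, passes_validation, structural_evidence=False):
    """Map a finding to a confidence level following the table above.

    pathways: iterable of pathway letters that surfaced the finding.
    passes_validation: outcome of the Phase 2 validation framework.
    structural_evidence: batch-issuance timing, cross-country deployment,
        or multi-brand coordination judged equivalent to a second pathway.
    """
    distinct = len(set(pathways))
    if distinct >= 2:
        return 'HIGH'
    if distinct == 1 and passes_validation:
        return 'HIGH (elevated)' if structural_evidence else 'MODERATE'
    return 'LOW'

print(assign_confidence({'A', 'B', 'C'}, True))                     # sbermegamarket
print(assign_confidence({'D'}, True, structural_evidence=True))     # 1028cc1028gg
print(assign_confidence({'A'}, True))                               # ai-assistant
```

Whether structural evidence is strong enough to justify elevation remains an analyst decision; the flag only records that decision.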

10. Operator Fingerprint Taxonomy

10.1 Russian E-Commerce Phishing

Archetype: sbermegamarket (322→473 parents across 2 test windows)

Characteristics:
- Russian brand impersonation as deeply nested subdomain labels (Sber, Avito, Ozon, CDEK, Yandex, Blablacar)
- Deployed on compromised infrastructure (85%) + operator-controlled disposables (13%) + DNS-service noise (2%)
- Multi-vertical: same operator runs gambling fronts, US-targeted scams (*shield, COVID, 401k), and e-commerce phishing
- Growth rate: ~75 new parent domains/day

10.2 Chinese Brand Impersonation

Archetype: 1028cc1028gg (38 parents, batch-issued)

Characteristics:
- Chinese tech/telecom brand names as subdomain prefixes (Huawei, Vivo, Xiaomi, Samsung, Taobao, Tmall, JD, China Mobile/Telecom/Unicom)
- Random per-domain suffixes (campaign/build identifiers)
- Operator-generated parent domains with embedded identifier: {seq4}{hex}{1028cc1028gg}.{tld}
- TLD rotation (.cc, .vip, .mobi, .win)
- Batch certificate issuance (all at single timestamp)
- Targets Chinese consumers

10.3 Industrial Chinese Gambling

Archetypes: 510qq (500+ parents), 198901* (434 SANs), 1994901* (192 SANs), a-a-game TLD-sweep, h5.* mobile cluster

Characteristics:
- Industrial-scale domain generation (sequential numbering, random alphanumeric prefixes)
- 100-SAN certificates with exclusively random-prefix domains (.top, .com)
- Let's Encrypt issuer exclusively
- Multiple distinct handle-prefix families bundled in single certificates (multi-handle pattern)
- h5. prefix = HTML5 mobile-optimized landing pages (Chinese mobile gambling convention)
- TLD-sweep registration (10+ TLDs for same label = a-a-game)
- QQ references (Chinese messaging platform)

10.4 German Financial-Services Phishing

Archetypes: verifizierung-traderepublic (14 parents), berufsunfhigkeitsversicherung (20 parents)

Characteristics:
- German-language subdomain labels as phishing lures
- Deployed on compromised international small-business sites (12+ countries)
- Diacritic-stripping artifact (ä → nothing, not ä → ae) reveals automation characteristics
- Likely single operator running parallel verticals (neobank verification + insurance)
- Let's Encrypt certificates, May 2026 issuance window
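
To make the diacritic-stripping artifact concrete, the sketch below generates three ASCII renderings of a German term: conventional transliteration (ä → ae), base-letter reduction (ä → a), and outright removal (ä → nothing). Only the last form matches the observed fingerprint. The function name and the transliteration table are assumptions for illustration; the operator's actual tooling is unknown.

```python
import unicodedata

# Conventional German transliteration; assumed here for illustration.
TRANSLITERATE = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'}

def diacritic_variants(term):
    """Return the ASCII variants a term may take under different tooling."""
    term = unicodedata.normalize('NFC', term.lower())
    transliterated = ''.join(TRANSLITERATE.get(c, c) for c in term)
    # Base-letter form: decompose, then drop combining marks (ä -> a).
    # Characters without a decomposition, such as ß, pass through unchanged.
    decomposed = unicodedata.normalize('NFD', term)
    base_letter = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # Dropped form: remove every non-ASCII character outright (ä -> nothing).
    dropped = ''.join(c for c in term if c.isascii())
    return {'transliterated': transliterated, 'base_letter': base_letter, 'dropped': dropped}

print(diacritic_variants('Berufsunfähigkeitsversicherung'))
```

Searching for all three variants of a lure term covers tooling that handles non-ASCII input differently, at the cost of a larger keyword list.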

10.5 Japanese Parasitic SEO

Archetype: IDN cluster (xn--n8jub3cxopfw59v90r725esqg = "失敗しないカニ通販")

Characteristics:
- Punycode-encoded Japanese consumer search terms as subdomain labels
- Targets Japanese search queries ("is Scalp-D fake?", "crab mail-order reviews")
- Deployed on compromised .jp domains + international sites
- Creates HTTPS-valid landing pages for search engine manipulation
- Multiple related IDN strings on overlapping parent-domain sets (single operator)

10.6 Multi-Vertical Operator Portfolios

Archetype: sbermegamarket operator (gambling + scam + phishing sharing infrastructure)

Characteristics:
- Single operator running Russian e-commerce phishing, casino-brand fronts (22 gambling domains), US-targeted English scams (*shield, COVID, 401k; 15 domains), and disposable e-commerce throwaways
- Shared certificate-issuance infrastructure across verticals
- Multi-market (Russia, US, EU) operation

10.7 Platform/SaaS Fingerprints (Legitimate)

Archetypes: betting-widgets (1,114 parents), French funeral SaaS (simulateur-obseques), German funeral SaaS (gemeinsam-trauern)

Characteristics:
- Consistent subdomain patterns across client websites
- Not malicious but demonstrate technique's ability to map SaaS provider customer bases
- Useful for market intelligence, not threat intelligence
- Must be filtered from operator-fingerprint results

10.8 Operator Handle Fingerprints

Archetypes: bfqde2023llsplde12qd27qdl (67 parents), znegeulfluxsisilafamille (141 parents)

Characteristics:
- Long opaque strings (24-25 chars) that are clearly personal identifiers or tooling artifacts
- Non-dictionary, high-entropy
- Cross-TLD, cross-hosting-provider distribution
- Co-occur with operational infrastructure (typosquats, admin panels, WHM access)
- May contain embedded date markers (2023), language artifacts (French slang)
- Highest attribution potential (handle → registrant → corporate registry → individual)

11. Hunt Workflow — Complete Operational Cycle

Step 1: Run All Five Pathways in Parallel

Execute Pathways A-E against the analysis window's CT data. Each produces its own ranked output file.

# Run all pathways via Python/DuckDB on scribe01
cat << 'EOF' | ssh scribe01 '/opt/sigil/query-env/bin/python3'
import duckdb

con = duckdb.connect()

# Pathway A
pathway_a = con.execute("""...""").fetchdf()
pathway_a.to_json('/opt/sigil/data/investigations/hunt_YYYY-MM-DD/pathway_a.json')

# Pathway B
pathway_b = con.execute("""...""").fetchdf()
pathway_b.to_json('/opt/sigil/data/investigations/hunt_YYYY-MM-DD/pathway_b.json')

# ... Pathways C, D, E similarly
EOF

Step 2: Cross-Pathway Validation

Any finding from one pathway is cross-checked against the others:
- Does the Pathway A candidate also appear in Pathway C SAN bundles?
- Does the Pathway D timestamp cluster contain strings that would clear Pathway B?
- Findings confirmed by 2+ pathways → HIGH CONFIDENCE
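
A minimal sketch of this cross-check, assuming the string-producing pathways have been reduced to lists of candidate strings and Pathway C to a list of raw SAN-list strings (both input shapes, and the function name, are hypothetical):

```python
from collections import defaultdict

def cross_check(candidates_by_pathway, san_bundles):
    """Map each candidate string to the set of pathways that surfaced it,
    adding 'C' when the string occurs inside any Pathway C SAN bundle."""
    found = defaultdict(set)
    for pathway, candidates in candidates_by_pathway.items():
        for candidate in candidates:
            found[candidate.lower()].add(pathway)
    bundles = [bundle.lower() for bundle in san_bundles]
    for candidate in found:
        if any(candidate in bundle for bundle in bundles):
            found[candidate].add('C')
    return dict(found)

results = {'A': ['sbermegamarket', 'ai-assistant'], 'B': ['sbermegamarket']}
bundles = ['sbermegamarket.shop1.example,www.shop1.example']  # synthetic SAN list
print(cross_check(results, bundles))
```

Candidates whose pathway set has two or more members meet the 2+ pathway criterion; the rest stay in the single-pathway review queue.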

Step 3: Multilingual + IDN Normalization Pass

Run the diacritic-normalization variants (§6.2) against Pathway B results. Decode all xn-- first_labels in Pathway A results (§6.3). Check for cross-linguistic operator artifacts.

Step 4: Batch-Timestamp Anomaly Review

Review Pathway D output for anomalous minute-buckets. For each flagged bucket:
- Extract the union of all SANs
- Check for shared parent-domain substrings
- Check for repeated subdomain prefixes (brand impersonation pattern)
- Apply structural analysis: TLD distribution, prefix entropy, parent-domain co-occurrence
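
Part of this per-bucket review can be automated. The sketch below covers the SAN union, TLD distribution, repeated prefixes, and a prefix-entropy figure; it does not attempt the shared-substring or co-occurrence checks. "Prefix entropy" is interpreted here as character-level Shannon entropy of each first label, which is an assumption, as are the function names. The SAN values in the usage example are synthetic.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())

def summarize_bucket(san_lists):
    """Summarize the union of SANs from all certificates in one minute-bucket."""
    sans = {s.strip().lower() for sl in san_lists for s in sl.split(',') if s.strip()}
    prefixes = Counter(s.split('.')[0] for s in sans)
    entropies = [shannon_entropy(p) for p in prefixes]
    return {
        'san_count': len(sans),
        'tld_distribution': dict(Counter(s.rsplit('.', 1)[-1] for s in sans)),
        'repeated_prefixes': {p: n for p, n in prefixes.items() if n > 1},
        'mean_prefix_entropy': sum(entropies) / len(entropies) if entropies else 0.0,
    }

print(summarize_bucket(['huawei.ab12x.cc,vivo.ab12x.cc', 'huawei.cd34y.vip']))
```

Brand-impersonation buckets tend to show a few prefixes repeated across many parents, which is what `repeated_prefixes` is meant to surface.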

Step 5: Legitimate-Brand Allowlist Filter

Apply the brand-portfolio allowlist (§6.5) to Pathway C output. Auto-tag matching entries as "likely legitimate brand defense" and exclude from the operator-fingerprint review queue.

Step 6: Multi-Handle SAN Structure Flag

Run the multi-handle detection query (§6.6) against Pathway C output. Flag any certificate with 3+ distinct sequential-prefix families.

Step 7: Coverage Envelope Diagnostic

Append the coverage envelope analysis (§12) to the hunt report. Document which pathway caught each finding and what the pipeline could NOT have found.

Step 8: Manual Operator-Validation Step

Apply the validation framework (§7) to the consolidated multi-pathway output:
- Confirm cross-hosting-provider distribution
- Verify operational context (phishing targets, admin panels, etc.)
- Stratify parent domains (operator-controlled vs. compromised vs. noise)
- Assign confidence levels

Output Format

/opt/sigil/data/investigations/hunt_YYYY-MM-DD/
├── pathway_a.json      # Raw Pathway A results
├── pathway_b.json      # Raw Pathway B results
├── pathway_c.json      # Raw Pathway C results
├── pathway_d.json      # Raw Pathway D results
├── pathway_e.json      # Raw Pathway E results
├── candidates.json     # Consolidated candidates post-filtering
├── validated.json      # Validated fingerprints post-Phase-2
└── report.md           # Narrative summary with assessments

12. Coverage Envelope Diagnostic Template

12.1 Post-Hunt Attribution Matrix

For each new fingerprint surfaced, document which pathway(s) caught it:

Finding	Pathway A	Pathway B	Pathway C	Pathway D	Pathway E
(finding name)	✅/❌	✅/❌	✅/❌	✅/❌	✅/❌

12.2 "What We Could Not Have Found" (standing discipline)

Append to every hunt report:

"This hunt's pipeline would not have surfaced:
- Operators using single-cert-per-domain issuance (Pathway C blind)
- Operators with <10-parent infrastructure (all thresholds blind)
- Operators using diacritic-transliteration normalization differently from our normalization variants
- Operators using IDN strings in scripts not covered by our decode step
- Operators who spread certificate issuance across 24+ hours (Pathway D blind)
- Operators using short identifiers (<12 chars) without matching our keyword corpus (Pathway B/E blind)
- Operators rotating identifiers per-campaign with no consistent string across deployments"

12.3 Coverage Matrix Reference

Attribute	Captured by
High parent-domain count (>36)	Pathway A, B
Subdomain-prefix consistency	Pathway A
Long unusual substring (14+ chars)	Pathway B
Large SAN-bundle structure	Pathway C
Multi-handle industrial generation	Pathway C + §6.6
Batch-issuance coordination	Pathway D
Brand-keyword presence	Pathway E
Punycode/IDN	§6.3 enhancement
Diacritic variants	§6.2 enhancement

13. Limitations & Failure Modes

13.1 Structural Limitations

The technique operates through multiple complementary discovery pathways, each with its own coverage envelope. No single pathway alone is sufficient, and even the combination has documented blind spots:

Operators using random per-domain suffixes — invisible to first-label aggregation (Pathway A)
Operators using single-cert-per-domain issuance — invisible to SAN-bundle inspection (Pathway C)
Operators using non-Latin-script identifiers — invisible to byte-level substring search without IDN decoding
Operators spreading issuance over time — invisible to batch-timestamp clustering (Pathway D)
Operators using short or rotating identifiers — invisible to all threshold-based detection
Operations below 10 parent domains — below all practical thresholds
Pathway B captures only the first 12-30 character alphabetic substring per FQDN — regexp_extract with capture group 1 returns the first match only. Operator strings appearing later in the FQDN structure are not captured by this pathway unless they happen to be the first alphabetic run. Cross-pathway confirmation (Pathway A's first-label aggregation, Pathway C's SAN inspection) compensates for this, but it is a documented blind spot of Pathway B in isolation.
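
The first-match behavior is easy to reproduce. The sketch below uses Python's `re` module as a stand-in for DuckDB's `regexp_extract` (an assumption about equivalence that holds for this simple pattern) and a synthetic FQDN in which an ordinary 13-letter label precedes the operator string.

```python
import re

CANDIDATE = re.compile(r'[a-z]{12,30}')

def first_candidate(fqdn):
    """Mimic Pathway B: return only the first 12-30 char alphabetic run."""
    match = CANDIDATE.search(fqdn)
    return match.group(0) if match else ''

def all_candidates(fqdn):
    """Return every non-overlapping alphabetic run, not just the first."""
    return CANDIDATE.findall(fqdn)

fqdn = 'autodiscovery.znegeulfluxsisilafamille.example.com'
print(first_candidate(fqdn))  # the leading ordinary label, not the handle
print(all_candidates(fqdn))   # includes the operator handle as well
```

In the cheat-sheet version of the query the 13-character first match would then also fail the 14-character length filter, so an FQDN like this one contributes nothing to Pathway B. Extracting all matches instead would close the gap at the cost of a larger candidate set.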

The honest framing: the technique is best understood as a set of complementary lenses, not a single algorithm. Each lens reveals a class of operator infrastructure that the others miss.

13.2 Operational Limitations

Opportunistic, not universal. Only works if the operator's tooling embeds a consistent or structurally distinctive artifact. Many operators don't make this mistake.
Requires raw CT stream ingestion. A Parquet-based DuckDB setup on modest hardware (16GB RAM desktop) is sufficient for days-to-weeks of data, but you need the pipeline.
False positive risk with short strings. Strings under 10 characters produce too many matches. Most reliable with 15+ character strings.
Temporal coverage is bounded. Local CT data covers the collection window only. Full historical coverage requires backfilling from CT log archives — significant data engineering effort.
Detectable by the operator. Publication allows operators to audit and patch their tooling. However, historical CT log entries are immutable — past fingerprints cannot be erased.
Not a silver bullet for attribution. The fingerprint maps infrastructure, not identity. The path from fingerprint → domain → WHOIS → corporate registry → named individual requires separate investigation steps, each with its own confidence level.
Volume bias toward Let's Encrypt. Free, automated issuance means Let's Encrypt dominates CT volume and makes the technique most productive against operators using Let's Encrypt automation. Operators using paid CAs leave less volume but stronger attribution signals (DigiCert → payment trail).
Signal-to-noise ratio. Discovery queries surface 200-250 raw candidates; validation filters these to 5-15 genuine fingerprints. The validation step requires significant analyst expertise.

13.3 Legitimate-Brand False Positives

Defensive registration portfolios from large corporations (MercadoLibre: 200+ typo variants, CSL Behring, Kaiser Permanente, CBRE, UFSCar) produce SAN bundles structurally identical to scam portfolios. Validation MUST include brand-portfolio cross-reference against major enterprises.

14. Findings Summary Table

All validated fingerprints across TEST_01 (May 9-13) and TEST_02 (May 14-15):

Fingerprint	Type	Parents	Pathway(s)	Confidence	Status
`sbermegamarket`	Russian phishing	322→473	A, B, C	HIGH	GROWING
`betting-widgets-static/gql/scoreboard-gql`	Platform (gambling)	1,114	A, C	HIGH	GROWING
`bfqde2023llsplde12qd27qdl`	Operator handle	67	A, B	HIGH	STABLE
`510qq`	Chinese gambling	500+	C	HIGH	ABSENT (May 14-15)
`taiyangchengyulecheng`	Chinese gambling SEO	96	A, B	HIGH	STABLE
`endofobama9` portfolio	Political/financial scam	93 (SAN bundle)	C	MODERATE	STABLE
`simulateur-obseques`	Platform (French funeral)	76-112	A	MODERATE	STABLE
`1028cc1028gg`	Chinese brand impersonation	38	D	HIGH	NEW
`verifizierung-traderepublic`	German neobank phishing	14	E	HIGH	NEW
Japanese IDN cluster (3 strings)	Parasitic SEO	10-21 each	D (batch)	MODERATE	NEW
`198901*` multi-handle	Chinese gambling (.com)	434 SANs	C, D	HIGH	NEW
`1994901` / `19949a`	Chinese gambling (.com)	192 SANs	C	HIGH	NEW
`.garden` TLD cluster	Unknown (infrastructure)	191 SANs	C	HIGH	NEW
`berufsunfhigkeitsversicherung`	German insurance phishing	20	B	HIGH	NEW
`ai-assistant` + `beste-de-*`	German SEO exploitation	42	A	MODERATE	NEW
`a-a-game` TLD-sweep	Chinese gambling	500 certs	C	HIGH	NEW
`h5.*` mobile gambling	Chinese gambling	336 certs	C	HIGH	NEW
`gemeinsam-trauern`	Platform (German funeral)	29	A	MODERATE	NEW
`znegeulfluxsisilafamille`	Operator handle	141	A, B	HIGH	KNOWN (excluded)

15. Generalizable Search Patterns

Beyond the specific findings above, look for these categories of operator artifacts in subdomain labels:

Category	Pattern	Example
Personal handles	Non-dictionary strings 15+ chars, mixed language	`znegeulfluxsisilafamille` (French + opaque)
Date-stamped test markers	`YYYY-MM-DD` + consistent string	`2022-12-23znegeulflux...`
Tool configuration artifacts	Software versions, build IDs, CI/CD markers	`bfqde2023llsplde12qd27qdl` (contains "2023")
Campaign identifiers	Codenames, project names, client refs	`1028cc1028gg` (operator infrastructure ID)
Language-specific artifacts	Strings in unexpected languages for TLD	French strings on Russian domains
Brand impersonation	Known brand + action word (`verifizierung-`)	`verifizierung-traderepublic`
Sequential numbering	Handle + incremental suffix	`510qq1` through `510qq600`
Alphabet/fuzzing patterns	Reverse-alphabet, sequential test strings	`hgfedcbaupdate`, `jihgfedcbaupdate`
Pinyin transliterations	Chinese terms in Latin alphabet	`taiyangchengyulecheng` (太阳城娱乐城)
Diacritic artifacts	Missing/stripped diacritics in European terms	`berufsunfhigkeitsversicherung` (ä → ∅)
IDN/Punycode	`xn--` prefixes encoding non-Latin terms	`xn--n8jub3cxopfw59v90r725esqg` (Japanese)
TLD-sweep registration	Same label across 5+ TLDs	`a-a-game.{biz.id,cfd,click,cyou,icu,...}`
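
A few of the categories in the table above lend themselves to mechanical tagging. The sketch below covers three of them (IDN/punycode, date-stamped markers, reverse-alphabet fuzzing runs). The function names, tag names, and the `min_run` threshold are assumptions for illustration; the remaining categories need corpus or context knowledge that a per-label check cannot supply.

```python
import re

DATE_PREFIX = re.compile(r'^\d{4}-\d{2}-\d{2}')

def longest_descending_run(label):
    """Length of the longest run of letters that step down the alphabet
    one at a time (e.g. 'hgfedcba' gives 8)."""
    if not label:
        return 0
    best = run = 1
    for prev, cur in zip(label, label[1:]):
        if prev.isalpha() and cur.isalpha() and ord(prev) - ord(cur) == 1:
            run += 1
        else:
            run = 1
        best = max(best, run)
    return best

def tag_label(label, min_run=6):
    """Return the pattern categories a subdomain label appears to match."""
    tags = []
    if label.startswith('xn--'):
        tags.append('idn')
    if DATE_PREFIX.match(label):
        tags.append('date-stamped')
    if longest_descending_run(label) >= min_run:
        tags.append('reverse-alphabet')
    return tags

print(tag_label('hgfedcbaupdate'), tag_label('2022-12-23znegeulflux'))
```

Tags are hints for prioritizing review, not validation results; a tagged label still has to pass the validation framework in §7.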

16. Case Studies

16.1 `znegeulfluxsisilafamille` — Operator Handle to Named Individual

String: 24 characters — znegeulflux (opaque handle) + sisilafamille (French slang, "yes yes the family")
Discovery: Manual analysis during phishing platform investigation
Scale: 1,447 FQDNs across 141 parent domains, 15+ countries
crt.sh detection: Zero results (all query approaches failed)
Attribution chain: Fingerprint → tlkregion.ru (WHOIS) → INN 7730648020 (EGRUL) → LLC TLK Region → Maksim G. Ermolaev
Infrastructure: 2 operating domains, 1 dev server, ~35 operator-controlled domains, ~96 compromised domains
Key insight: A single operator handle embedded in automated tooling permanently mapped an entire infrastructure invisible to standard OSINT

16.2 `sbermegamarket` — Scale and Multi-Vertical Operations

String: 14 characters — Russian brand (Sber Megamarket)
Discovery: Pathway A (first-label aggregation, top-3 result)
Scale: 322 parent domains (May 9-13) → 473 parent domains (May 14-15), growth rate ~75/day
Stratification: 13% operator-owned (gambling fronts + US scams + disposables), 85% compromised, 2% DNS-service noise
Multi-vertical: Same operator runs Russian e-commerce phishing, casino-affiliate fronts (22 domains), US-targeted scams (*shield/COVID/401k — 15 domains), and disposable throwaways
Key insight: Infrastructure stratification reveals a single operation spanning 3+ scam verticals across 2+ markets sharing certificate-issuance infrastructure

16.3 `1028cc1028gg` — Batch-Issuance Discovery

String: 12 characters — operator infrastructure identifier embedded in parent domains
Discovery: Pathway D (batch-issuance timestamp clustering) — all certs at 2026-05-14 16:02 UTC
Scale: 10 Chinese brand impersonations × 38 parent domains
Why threshold-based missed it: 38 parents distributed across 10 brand prefixes with random suffixes — no individual first_label aggregates above threshold
Key insight: Batch-timestamp clustering catches operator coordination invisible to all volume-based methods

16.4 `verifizierung-traderepublic` — Brand-Keyword Discovery

String: 27 characters — German ("verification Trade Republic")
Discovery: Pathway E (targeted brand/keyword sweep)
Scale: 14 compromised parent domains in 12 countries
Why threshold-based missed it: 14 parents — well below top-100 threshold
Connection: Likely same operator as berufsunfhigkeitsversicherung (German insurance phishing, 20 parents) — both target German financial consumers, deploy on compromised international hosts, appear in same window
Key insight: Targeted keyword sweeps catch active phishing campaigns too small for statistical detection but operationally significant

Appendix: Quick-Reference Query Cheat Sheet

-- ═══════════════════════════════════════════════════════════════
-- PATHWAY A: First-label aggregation
-- ═══════════════════════════════════════════════════════════════
WITH extracted AS (
    SELECT
        "fqdn:ID(Domain)" as fqdn,
        regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
        split_part("fqdn:ID(Domain)", '.', 1) as first_label
    FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
    WHERE length("fqdn:ID(Domain)") > 40
)
SELECT first_label, count(DISTINCT parent_domain) as domain_count, count(*) as total_fqdns,
    list(DISTINCT parent_domain ORDER BY parent_domain)[:5] as sample_domains
FROM extracted
WHERE length(first_label) >= 12
  -- (exclusions injected from config/excluded_fingerprints.yaml)
  AND first_label NOT LIKE 'www%' AND first_label NOT LIKE 'mail%'
  AND first_label NOT LIKE 'autodiscover%' AND first_label NOT LIKE '%notexists%'
GROUP BY first_label HAVING domain_count >= 3
ORDER BY domain_count DESC LIMIT 100;

-- ═══════════════════════════════════════════════════════════════
-- PATHWAY B: Substring frequency (14+ chars)
-- ═══════════════════════════════════════════════════════════════
SELECT
    regexp_extract("fqdn:ID(Domain)", '([a-z]{12,30})', 1) as candidate,
    count(DISTINCT regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1)) as domain_count,
    count(*) as total
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE length("fqdn:ID(Domain)") > 50
  -- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY candidate HAVING domain_count >= 5 AND length(candidate) >= 14
ORDER BY domain_count DESC LIMIT 100;

-- ═══════════════════════════════════════════════════════════════
-- PATHWAY C: SAN-list pattern clustering
-- ═══════════════════════════════════════════════════════════════
SELECT "san_list", count(*) as cert_count,
    length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE length("san_list") > 200
  -- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY "san_list" HAVING cert_count >= 3
ORDER BY cert_count DESC LIMIT 50;

-- ═══════════════════════════════════════════════════════════════
-- PATHWAY D: Batch-issuance timestamp clustering
-- ═══════════════════════════════════════════════════════════════
WITH cert_minutes AS (
    SELECT date_trunc('minute', CAST("not_before" AS TIMESTAMP)) as issue_minute,
        "issuer", "san_list", "fingerprint_sha256:ID(Certificate)" as cert_id
    FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
    WHERE "not_before" IS NOT NULL
)
SELECT issue_minute, issuer, count(*) as certs_in_minute,
    count(DISTINCT san_list) as unique_san_lists
FROM cert_minutes
GROUP BY issue_minute, issuer HAVING certs_in_minute >= 10
ORDER BY certs_in_minute DESC LIMIT 50;

-- ═══════════════════════════════════════════════════════════════
-- PATHWAY E: Targeted brand/keyword sweep
-- Replace {keyword} with target term
-- ═══════════════════════════════════════════════════════════════
SELECT "fqdn:ID(Domain)" as fqdn,
    regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
    split_part("fqdn:ID(Domain)", '.', 1) as first_label
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" ILIKE '%{keyword}%'
  AND regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) NOT ILIKE '%{keyword}%'
GROUP BY fqdn, parent_domain, first_label
ORDER BY parent_domain;

-- ═══════════════════════════════════════════════════════════════
-- EXPANSION: Map all infrastructure for validated fingerprint
-- ═══════════════════════════════════════════════════════════════
SELECT regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
    count(*) as fqdn_count,
    list(DISTINCT split_part("fqdn:ID(Domain)", '.', 1))[:10] as sample_labels
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%{fingerprint_string}%'
GROUP BY parent_domain ORDER BY fqdn_count DESC;

-- ═══════════════════════════════════════════════════════════════
-- SAN CO-OCCURRENCE: Multi-SAN certs with fingerprint
-- ═══════════════════════════════════════════════════════════════
SELECT "fingerprint_sha256:ID(Certificate)" as cert_id, "san_list", "not_before", "issuer",
    length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE "san_list" LIKE '%{fingerprint_string}%'
  AND "san_list" LIKE '%,%'
ORDER BY san_count DESC;

-- ═══════════════════════════════════════════════════════════════
-- KNOWN STRING SEARCH: Direct lookup
-- ═══════════════════════════════════════════════════════════════
SELECT "fqdn:ID(Domain)" as fqdn
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%suspectedstring%';

This methodology document synthesizes CrimsonVector investigations CV-INV-05 ("si si la famille"), TEST_01 (May 9-13 2026), and TEST_02 (May 14-15 2026). It accompanies the UE26 case study presentation on CT Behavioral Fingerprinting.

