v2.1 · Diego Parra · CrimsonVector
- 1. Abstract
- 2. Prior Art & What's New
- 3. Conceptual Foundation
- 4. Infrastructure & Data Source
- 5. Discovery Pathways
- 6. Pipeline Enhancements
- 6.1 Codify Pathway D (batch-issuance timestamp clustering)
- 6.2 Multilingual / Diacritic-Aware String Normalization (Pathway B Enhancement)
- 6.3 IDN/Punycode-Aware Aggregation (Pathway A Enhancement)
- 6.4 Lower-Threshold Sweep with Signature Filters
- 6.5 Legitimate-Brand Portfolio Allowlist
- 6.6 Multi-Handle SAN Structure Detection
- 7. Validation Framework
- 8. Infrastructure Mapping & Pivot Queries
- 9. Cross-Pathway Confidence Scoring
- 10. Operator Fingerprint Taxonomy
- 11. Hunt Workflow — Complete Operational Cycle
- Step 1: Run All Five Pathways in Parallel
- Step 2: Cross-Pathway Validation
- Step 3: Multilingual + IDN Normalization Pass
- Step 4: Batch-Timestamp Anomaly Review
- Step 5: Legitimate-Brand Allowlist Filter
- Step 6: Multi-Handle SAN Structure Flag
- Step 7: Coverage Envelope Diagnostic
- Step 8: Manual Operator-Validation Step
- Output Format
- 12. Coverage Envelope Diagnostic Template
- 13. Limitations & Failure Modes
- 14. Findings Summary Table
- 15. Generalizable Search Patterns
- 16. Case Studies
- Appendix: Quick-Reference Query Cheat Sheet
CT Behavioral Fingerprinting: A Multi-Pathway Methodology for Mapping Threat Actor Infrastructure Through Certificate Transparency Log Artifacts
CrimsonVector Methodology Document, v2.1
Author: Diego Parra
Date: 2026-05-17
Classification: TLP:CLEAR (for publication)
Accompanies: UE26 Case Study Presentation
1. Abstract
CT Behavioral Fingerprinting is a novel OSINT technique for mapping threat actor infrastructure across unrelated domains by exploiting operator-embedded behavioral artifacts in Certificate Transparency (CT) log data. The technique operates through five complementary discovery pathways, each with its own coverage envelope and blind spots, to surface operator fingerprints invisible to standard CT search tools.
The methodology exploits a structural property of the CT ecosystem: automated certificate-request tooling embeds consistent strings (personal handles, configuration artifacts, campaign identifiers, brand-impersonation labels, batch-generation patterns) into subdomain labels and SAN lists during ACME validation. Because CT logs are public, permanent, and append-only, these strings become irrevocable behavioral fingerprints. Standard CT search tools (crt.sh, Censys) cannot find them because they do not perform substring search within deeply nested subdomain structures.
Two rounds of testing (May 9-13 and May 14-15, 2026) against ~246M domain rows and ~69M certificate rows surfaced 15+ validated operator fingerprints spanning Russian e-commerce phishing, Chinese brand impersonation, industrial Chinese gambling, German financial-services phishing, Japanese parasitic SEO, and multi-vertical scam portfolios — none previously reported by any threat intelligence provider.
The technique is best understood as a set of complementary lenses, not a single algorithm.
2. Prior Art & What's New
2.1 Established CT Techniques (Tool-Level)
| # | Technique | Tools | Limitation |
|---|---|---|---|
| 1 | Subdomain enumeration | crt.sh, ct-exposer, CertStream, Censys | Requires knowing the target domain first. Cannot discover cross-domain relationships. |
| 2 | Phishing detection | phishing_catcher (x0rz), streamingphish, Phicious (RAID 2022), nettfiske | Pattern-based. Requires predefined target lists. Does not identify the operator. |
| 3 | Infrastructure clustering via shared certificates | Hunt.io, JA4X, Censys | Clusters certificates, not operators. Shared Let's Encrypt issuance proves nothing about shared operation. |
| 4 | SAN diversity analysis | Gigamon Blog (Oct 2022) | Identifies hosting platforms, not operators. Does not search within subdomain labels for behavioral artifacts. |
2.2 Adjacent Academic & Industry Research
Three bodies of work are directly adjacent to this technique and must be acknowledged:
2.2.1 Intra-Label Content Analysis — Roberts & Levin (WPES 2019)
Paper: "When Certificate Transparency Is Too Transparent: Analyzing Information Leakage in HTTPS Domain Names." Proceedings of the 18th ACM Workshop on Privacy in the Electronic Society, 2019.
What it does: Demonstrates that subdomain labels within CT logs contain information-rich content that reveals organizational structure, internal project names, and infrastructure topology. The foundational observation (that CT-logged FQDNs contain analyzable content within subdomain labels, not just at the registered-domain level) is shared with this technique.
Key distinction: Roberts & Levin analyze intra-label content from a defensive privacy perspective: what do CT logs leak about the certificate requester's own organization? The analytical goal is privacy impact assessment. This technique inverts the lens — analyzing intra-label content offensively to map threat actor infrastructure across unrelated domains. The same signal surface, applied to a fundamentally different analytical question: not "what does my org leak?" but "what does the operator's tooling reveal about their infrastructure footprint?"
What Roberts & Levin would NOT catch: Cross-domain operator mapping. Their analysis characterizes leakage within a single organization's certificate portfolio. The novel contribution here is using intra-label patterns to connect infrastructure across hundreds of unrelated parent domains — sbermegamarket appearing on 322 domains owned by different entities reveals an operator relationship that single-org privacy analysis cannot surface.
2.2.2 Unsupervised Anomaly Detection on CT Attributes — Ostertág (2024)
Paper: "Anomaly Detection in Certificate Transparency Logs." arXiv:2405.05206, May 2024.
What it does: Applies Isolation Forest (unsupervised ML) to certificate metadata attributes — issuer patterns, validity periods, key usage fields, certificate chain properties — to detect anomalous certificates that may indicate misissuance, compliance violations, or operational problems.
Key distinction: Ostertág operates on an entirely different signal surface: certificate metadata (structural attributes of the X.509 object itself). This technique operates on domain-name content (the strings humans and automation embed in FQDNs within SAN fields). Ostertág's Isolation Forest would flag a certificate with an unusual validity period or key size; it would not detect that sbermegamarket appears as a subdomain label across 322 unrelated hosts, because that information lives in the SAN content, not the certificate's structural attributes.
Where Ostertág partially overlaps: Pathway C (SAN-list clustering) shares the intuition that certificates with structurally unusual properties deserve scrutiny. Ostertág's approach could complement this technique's Pathway C by flagging certificates with anomalous SAN-list sizes for further content analysis — an unsupervised pre-filter feeding into content-based inspection.
2.2.3 Graph-Theoretic SAN Co-occurrence Clustering — Infoblox (March 2026)
Publication: "Using SSL Certificates and Graph Theory to Uncover Threat Actors." Infoblox Blog, March 2026.
What it does: Constructs a graph where domains are nodes and shared certificate appearances (SAN co-occurrence) create edges. Connected components identify infrastructure under common control. Uses hierarchical "graph of graphs" for complex actor networks. Reports 135% more malicious domains discovered through cluster expansion beyond seed indicators.
Key distinction: Infoblox clusters domains that share certificates — if two domains appear together in the same certificate's SAN list, they are likely co-controlled. This is a powerful technique for expanding known infrastructure (given one bad domain, find others in its cert cluster). However, it fundamentally requires that domains share a certificate.
What Infoblox would NOT catch: Operator fingerprints embedded in subdomain labels across domains with different certificates. sbermegamarket spans 322 parent domains, but these domains do NOT share certificates — they share only a subdomain-label string. Infoblox's graph has no edge between madrid777.com and largewood666.fscp.ru because they never co-occur in the same cert's SAN list. The behavioral fingerprint connecting them is invisible to graph-theoretic SAN co-occurrence.
Where Infoblox directly overlaps: This technique's §8.2 (SAN co-occurrence analysis) and elements of Pathway C are a simplified version of Infoblox's graph approach. When this technique examines multi-SAN certificates containing a fingerprint to discover what other domains the operator bundles in the same cert, it is performing a local version of Infoblox's co-occurrence graph traversal. The overlap is acknowledged; Infoblox's formalization is more rigorous for that specific sub-task.
2.3 Novelty Claim — Refined
Given the adjacent work above, the novelty of CT Behavioral Fingerprinting rests on a specific combination that no prior work achieves:
- Intra-label content analysis applied offensively for cross-domain operator mapping: Roberts & Levin (2019) established that subdomain labels contain analyzable content, but applied it defensively. No prior work uses intra-label behavioral artifacts to connect unrelated domains to a common operator.
- Multi-pathway architecture combining content, structure, and timing: Ostertág (2024) detects anomalous certificate attributes; Infoblox (2026) clusters SAN co-occurrence; this technique combines intra-label content analysis (Pathways A, B, E), SAN-structure analysis (Pathway C), and issuance-timing analysis (Pathway D) into complementary discovery pathways with explicit coverage envelopes.
- Operator mapping, not just infrastructure clustering: Infoblox expands known malicious infrastructure through certificate relationships. This technique discovers previously unknown operators from behavioral artifacts; the fingerprint identifies the operator before any domain is known to be malicious.
- The permanence exploitation: while all CT-based techniques benefit from log immutability, this technique specifically exploits the fact that operator tooling artifacts are inadvertently and irrevocably recorded. The operator cannot retroactively scrub their handle from CT logs.
What this technique does NOT claim novelty for:
- SAN co-occurrence analysis (Infoblox's graph approach is more rigorous for that sub-task)
- The observation that CT logs contain information in subdomain labels (Roberts & Levin, 2019)
- Anomaly detection on certificate metadata (Ostertág, 2024)
- Subdomain enumeration, phishing detection, or basic CT monitoring (well-established tooling)
2.4 The Permanence Advantage
CT logs are append-only and public by design. An operator who requested a certificate with their handle embedded in a subdomain label created a permanent, irrevocable record. This is a fundamentally different property than any other data source in the threat intelligence space — you cannot delete a CT log entry. The operator may not even realize the string is being logged.
3. Conceptual Foundation
3.1 Why CT Logs Are Uniquely Suited
Certificate Transparency logs are:
- Public: anyone can read any log
- Immutable: entries cannot be modified or deleted after submission
- Near real-time: entries typically appear within seconds to minutes of issuance
- Comprehensive: publicly trusted certificates must be logged to be accepted by CT-enforcing browsers (Chrome CT policy)
- Operator-generated: the content reflects choices made by the certificate requester's tooling
When an operator's automation requests a Let's Encrypt certificate for avito.sber.sbermegamarket.youla.madrid777.com, the full FQDN — including all subdomain labels — is permanently recorded in multiple CT logs. The operator cannot suppress this without abandoning HTTPS entirely.
3.2 Why Standard Tools Miss This
crt.sh — the standard public CT search interface — supports identity searches (%.domain.com) and wildcard searches (%keyword%). However, it indexes certificate SAN values as complete entries. When an operator's fingerprint is embedded inside a subdomain label (e.g., 07znegeulfluxsisilafamille.smtpmail.radio-center.ru), the string znegeulfluxsisilafamille is not a separate SAN entry — it's a component of a longer FQDN. crt.sh does not perform substring search within these components.
Tested: all available crt.sh query patterns for known fingerprints returned zero results. The fingerprints are only discoverable through raw CT stream analysis with substring search capability.
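To make the distinction concrete, the following minimal sketch splits an FQDN into its subdomain labels and checks each label for an embedded fingerprint. It is illustrative only: the helper names are hypothetical, and the last-two-labels rule for the parent domain is an assumption that mirrors the regex used in the queries later in this document (it ignores multi-part public suffixes such as co.uk).

```python
def split_fqdn(fqdn):
    """Return (subdomain_labels, parent_domain) using a naive last-two-labels rule."""
    labels = fqdn.lower().rstrip(".").split(".")
    return labels[:-2], ".".join(labels[-2:])


def labels_containing(fqdn, fingerprint):
    """Return the subdomain labels that contain `fingerprint` as a substring."""
    sub_labels, _parent = split_fqdn(fqdn)
    return [label for label in sub_labels if fingerprint in label]


if __name__ == "__main__":
    # The fingerprint sits inside the first label, behind a numeric prefix.
    fqdn = "07znegeulfluxsisilafamille.smtpmail.radio-center.ru"
    print(split_fqdn(fqdn))
    print(labels_containing(fqdn, "znegeulfluxsisilafamille"))
```

Matching on substrings of individual labels, rather than on whole SAN values, is what lets a fingerprint be found regardless of prefixes, suffixes, or its depth in the name.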
3.3 The Dual-Discovery Architecture
Testing revealed a structural property: the analysis pipeline contains two fundamentally distinct discovery modes operating in parallel:
- Scale-based detection — catches operators who reuse identifiers at volume (Pathways A, B, C)
- Structure-based detection — catches operators whose patterns reveal coordination even when no individual identifier clears volume thresholds (Pathways D, E)
Neither alone is sufficient. The combination is what makes the methodology productive:
- The Chinese brand cluster (1028cc1028gg, 38 parents) was caught by structure (batch-issuance timestamp), not scale
- sbermegamarket (322 parents) was caught by scale, not structure
- 198901* industrial gambling (434 SANs) was caught by both
4. Infrastructure & Data Source
4.1 Ingestion Pipeline
certstream (WebSocket) → CT monitor → daily CSV → DuckDB/Parquet
Implementation: SIGIL-lite running on a local Ubuntu desktop (scribe01). Certstream captures the global CT stream from all log operators (Google, Cloudflare, Let's Encrypt, etc.).
4.2 Storage & Query Infrastructure
| Component | Detail |
|---|---|
| Server | scribe01 (Ubuntu 26.04, local desktop) |
| Python env | /opt/sigil/query-env/bin/python3 (duckdb, httpx, dnspython) |
| Data path | /opt/sigil/data/parquet/YYYY-MM-DD/ |
| Domain table | domains.parquet — columns: fqdn:ID(Domain), tld, source, status, first_seen, last_seen |
| Certificate table | certificates.parquet — columns: fingerprint_sha256:ID(Certificate), issuer, subject_cn, san_list, not_before, not_after, source |
| Scale | ~5-8M certificate records/day, ~24-25M domain rows/day |
| Storage | ~1.5 GB Parquet per day |
| Hardware | 16GB RAM desktop (sufficient for days-to-weeks of data) |
4.3 Temporal Windowing
- Standard analysis window: rolling 5 days. The pipeline (ATALAYA) runs nightly over the trailing 5 days of CT data, providing sufficient volume for threshold-based detection while keeping the publication cadence fresh.
- Delta analysis: newly-surfaced fingerprints (not present in the prior night's run) are flagged as first-seen on the current run. Previously-surfaced fingerprints update their last_seen_dt and growth metrics.
- Full backfill: CT log archives (Google Argon, Cloudflare Nimbus) for historical coverage. This is a significant data engineering effort and not part of the standard cycle.
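The delta step can be sketched roughly as follows. This is not the ATALAYA implementation: the catalog layout, the function name, and every field except last_seen_dt are assumptions made for illustration, and "growth" is taken here to mean the change in parent-domain count since the previous run.

```python
from datetime import date


def apply_delta(catalog, tonight, run_date):
    """Update `catalog` in place and return fingerprints first seen on this run.

    catalog: dict mapping fingerprint -> record (hypothetical layout)
    tonight: dict mapping fingerprint -> parent-domain count from tonight's run
    """
    first_seen = []
    for fingerprint, parents in tonight.items():
        record = catalog.get(fingerprint)
        if record is None:
            # Not present in the prior run: flag as first-seen tonight.
            catalog[fingerprint] = {
                "first_seen_dt": run_date,
                "last_seen_dt": run_date,
                "parents": parents,
                "growth": 0,
            }
            first_seen.append(fingerprint)
        else:
            # Already known: refresh last_seen_dt and the growth metric.
            record["growth"] = parents - record["parents"]
            record["parents"] = parents
            record["last_seen_dt"] = run_date
    return first_seen
```

Keeping the catalog keyed by fingerprint makes the nightly comparison a dictionary lookup, independent of how many days the rolling window covers.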
4.4 OPSEC
- All discovery queries run locally on scribe01 against stored Parquet data — no external connections
- External verification (WHOIS, Shodan, crt.sh) uses a Mullvad WireGuard tunnel bound to 10.67.232.55
- httpx.HTTPTransport(local_address="10.67.232.55") for Python requests
- Never directly connect to discovered operator infrastructure
4.5 Implementation Note (AI-assisted development disclosure)
The substring indexing pipeline, validation tooling, and analysis scripts in this work were developed with LLM-assisted implementation (Claude, Anthropic). The methodology design, pathway architecture, threshold selection, manual validation calls, and attribution conclusions are the author's own. This disclosure is included because the field is in a moment of evolving norms around AI-assisted research tooling, and methodological transparency about tools should run alongside transparency about methods.
5. Discovery Pathways
Pathway preamble — Exclusion list as configuration
All discovery pathways must exclude already-validated fingerprints to prevent double-counting during new-discovery hunts. Rather than hard-coding exclusions inline in each query (AND first_label NOT LIKE '%znegeulflux%'), maintain the exclusion list in a configuration file:
# config/excluded_fingerprints.yaml
# Validated fingerprints excluded from new-discovery hunts to prevent
# double-counting. Each entry is matched as a substring (case-insensitive)
# against first_label / candidate / san_list as appropriate.
excluded:
- znegeulflux
- sbermegamarket
- bfqde2023llsplde12qd27qdl
- taiyangchengyulecheng
- 1028cc1028gg
- verifizierung-traderepublic
- berufsunfhigkeitsversicherung
# ... extended as fingerprints accumulate
excluded_platforms:
# Legitimate platform fingerprints (not operator infrastructure)
- betting-widgets
- simulateur-obseques
- gemeinsam-trauern
The query layer reads this file at run time and injects the NOT LIKE clauses dynamically. This pattern scales as the catalog grows beyond the handful currently hardcoded.
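A minimal sketch of the injection step is shown below. It assumes the YAML file has already been parsed into a dict with the two keys shown above (the parsing itself, for example with a YAML library, is omitted), and the function names are hypothetical. ILIKE is used because the config comment specifies case-insensitive matching; LIKE wildcards and quotes inside entries are escaped so each entry matches literally.

```python
def build_exclusion_clauses(column, fingerprints):
    """Return one 'AND <column> NOT ILIKE ...' line per excluded fingerprint."""
    clauses = []
    for fp in fingerprints:
        # Escape the escape character, LIKE wildcards, and single quotes.
        literal = (
            fp.replace("\\", "\\\\")
            .replace("%", "\\%")
            .replace("_", "\\_")
            .replace("'", "''")
        )
        clauses.append(f"AND {column} NOT ILIKE '%{literal}%' ESCAPE '\\'")
    return "\n".join(clauses)


def all_excluded(config):
    """Combine operator fingerprints and legitimate-platform fingerprints."""
    return list(config.get("excluded", [])) + list(config.get("excluded_platforms", []))


if __name__ == "__main__":
    config = {"excluded": ["znegeulflux", "sbermegamarket"],
              "excluded_platforms": ["betting-widgets"]}
    print(build_exclusion_clauses("first_label", all_excluded(config)))
```

The returned text would replace the "(exclusions injected from config/excluded_fingerprints.yaml)" placeholder in each pathway query, with the column name chosen per pathway (first_label, candidate, or san_list).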
Pathway A — Threshold-Based First-Label Aggregation
Purpose: Count unique first-label substrings across all FQDNs in the analysis window, identify strings appearing across anomalously many parent domains.
Coverage envelope: Operations above the parent-count threshold (currently top-100 ≈ ≥36 parents) where the operator reuses a consistent leading subdomain label.
Known blind spots:
- Operators using random per-domain suffixes (e.g., huawei-ndfjejfh-e09f0 — distinct first_labels per deployment)
- Operations with fewer than threshold parents (<36 in current data)
- Punycode/IDN strings that look different at the byte level vs. the visual level
Executable query:
-- PATHWAY A: First-label aggregation across parent domains
-- Finds consistent subdomain labels deployed on many unrelated hosts
WITH extracted AS (
SELECT
"fqdn:ID(Domain)" as fqdn,
regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
split_part("fqdn:ID(Domain)", '.', 1) as first_label
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE length("fqdn:ID(Domain)") > 40
)
SELECT
first_label,
count(DISTINCT parent_domain) as domain_count,
count(*) as total_fqdns,
list(DISTINCT parent_domain ORDER BY parent_domain)[:5] as sample_domains
FROM extracted
WHERE length(first_label) >= 12
-- (exclusions injected from config/excluded_fingerprints.yaml)
AND first_label NOT LIKE 'www%'
AND first_label NOT LIKE 'mail%'
AND first_label NOT LIKE 'autodiscover%'
AND first_label NOT LIKE '%notexists%'
GROUP BY first_label
HAVING domain_count >= 3
ORDER BY domain_count DESC
LIMIT 100
Signal/noise heuristics:
- Filter: WHM panel encodings (*whm, *mvd), UUID identifiers, wildcard detection strings
- Filter: Common infrastructure labels (www, mail, autodiscover, webmail, cpanel)
- Keep: Non-dictionary strings 15+ chars, strings with unexpected language for TLD distribution
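The filter rules above could be applied as a post-processing pass along these lines. This is a sketch under assumptions: the function name is hypothetical, the panel encodings are treated as label suffixes (reading *whm and *mvd as "ends with"), and "notexists" stands in for wildcard-detection strings as in the query. The dictionary and language checks from the "Keep" rule are not modeled.

```python
import re

# UUIDs with or without hyphens.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12}$"
)
INFRA_PREFIXES = ("www", "mail", "autodiscover", "webmail", "cpanel")
PANEL_SUFFIXES = ("whm", "mvd")


def is_noise_label(label):
    """Return True if a first label matches one of the Pathway A noise filters."""
    label = label.lower()
    if UUID_RE.match(label):
        return True
    if label.startswith(INFRA_PREFIXES):
        return True
    if label.endswith(PANEL_SUFFIXES):
        return True
    if "notexists" in label:
        return True
    return False
```

Prefix matching is deliberately broad, as in the SQL (NOT LIKE 'mail%'), so a label such as "mailing-list" is also dropped; tightening it to exact matches is a trade-off between noise and recall.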
Example findings:
- sbermegamarket — 322 parent domains, Russian phishing infrastructure
- bfqde2023llsplde12qd27qdl — 67 parent domains, unknown operator handle
- ai-assistant — 42 parent domains, German SEO exploitation
- taiyangchengyulecheng — 96 parent domains, Chinese gambling SEO
Pathway B — Substring Frequency Analysis (14+ chars)
Purpose: Search for any long substring (14+ characters) occurring across many parent domains, catching operator artifacts embedded anywhere in the FQDN structure (not just leading label).
Coverage envelope: Substrings above a length threshold (14+ chars) and above a domain-count threshold, regardless of position in the FQDN.
Known blind spots:
- Shorter operator identifiers (510qq at 5 chars, 7K at 2 chars, h5. at 2 chars)
- Results dominated by noise patterns (alphabet enumeration, WHM panel artifacts) that crowd out meaningful results in top-N
- Diacritic-stripped multilingual strings require normalization to match (see §6.2)
Executable query:
-- PATHWAY B: Substring frequency analysis
-- Finds repeated unusual substrings (14+ chars) across domains
-- Note: regexp_extract returns only the first qualifying run in each FQDN,
-- so the pattern starts at 14 chars to avoid stopping on a shorter earlier run
SELECT
    regexp_extract("fqdn:ID(Domain)", '([a-z]{14,30})', 1) as candidate,
    count(DISTINCT regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1)) as domain_count,
    count(*) as total
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE length("fqdn:ID(Domain)") > 50
    -- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY candidate
HAVING domain_count >= 5 AND length(candidate) >= 14
ORDER BY domain_count DESC
LIMIT 100
Signal/noise heuristics:
- Filter: Known CMS patterns, CDN slugs, ACME client defaults
- Filter: Dictionary words (infrastructure, communications, administration)
- Filter: Reversed-alphabet sequences (hgfedcbaupdate, jihgfedcbaupdate)
- Keep: Non-dictionary strings in unexpected languages, strings with consistent structure across deployments
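The reversed-alphabet filter can be expressed as a check for long runs of adjacent letters. The sketch below is one possible way to do it, not the pipeline's actual filter; the function names and the min_run default of 6 are assumptions.

```python
def longest_alphabet_run(s):
    """Length of the longest run of letters stepping by +1 or -1 in one direction."""
    best = run = 1 if s else 0
    direction = 0
    for prev, cur in zip(s, s[1:]):
        step = ord(cur) - ord(prev)
        if step in (1, -1) and prev.isalpha() and cur.isalpha():
            if step == direction:
                run += 1
            else:
                # A new run starts with this pair of adjacent letters.
                run = 2
                direction = step
        else:
            run = 1
            direction = 0
        best = max(best, run)
    return best


def is_alphabet_enumeration(candidate, min_run=6):
    """Flag candidates such as hgfedcbaupdate that embed an alphabet sequence."""
    return longest_alphabet_run(candidate) >= min_run
```

Both ascending and descending runs are caught, so the same check covers forward alphabet enumeration as well as the reversed sequences listed above.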
Example findings:
- berufsunfhigkeitsversicherung — 20 parent domains, German insurance phishing (ä stripped)
- sbermegamarket — also surfaced here (cross-pathway confirmation)
- znegeulfluxsisilafamille — 141 parent domains (the original case study)
Pathway C — SAN-List Pattern Clustering
Purpose: Identify certificates with unusually large or structured SAN lists, revealing industrial-scale operator infrastructure that bundles many domains in single certificate issuances.
Coverage envelope: Operators using batch certificate issuance with large SAN bundles (50-500+ domains per cert).
Known blind spots:
- Operations using per-domain individual certificates rather than SAN bundling
- Legitimate brand defensive portfolios (MercadoLibre, CSL Behring, Kaiser Permanente, CBRE) that look structurally identical to scam portfolios
- Operations bundled with distractor/legitimate SANs that camouflage operator content
Executable query:
-- PATHWAY C: SAN-list pattern clustering
-- Finds certificates with unusually large/repeated SAN lists
SELECT
"san_list",
count(*) as cert_count,
length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE length("san_list") > 200
-- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY "san_list"
HAVING cert_count >= 3
ORDER BY cert_count DESC
LIMIT 50
Signal/noise heuristics:
- Filter: Known legitimate brand portfolios (see §6.5 allowlist)
- Filter: AWS/Google test infrastructure (sadbirds.aws.dev, certsbridge.com)
- Keep: SAN lists with TLD-sweep patterns, sequential numeric domains, random-prefix clusters
- Flag: Multi-handle structure (3+ distinct sequential families in one cert — see §6.6)
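As an illustration of the TLD-sweep heuristic, the sketch below groups the SANs of one certificate by second-level label and reports labels that appear under many TLDs. It assumes san_list is a comma-separated string (as the queries in this document imply); the function name, the min_tlds default, and the naive last-two-labels split are assumptions.

```python
from collections import defaultdict


def tld_sweep_families(san_list, min_tlds=5):
    """Return {second_level_label: sorted TLDs} for labels seen under >= min_tlds TLDs."""
    tlds_by_label = defaultdict(set)
    for san in san_list.split(","):
        # Normalize and drop a leading wildcard ("*.") if present.
        labels = san.strip().lower().lstrip("*.").split(".")
        if len(labels) >= 2:
            tlds_by_label[labels[-2]].add(labels[-1])
    return {
        label: sorted(tlds)
        for label, tlds in tlds_by_label.items()
        if len(tlds) >= min_tlds
    }
```

A non-empty result marks the certificate for review; it does not by itself separate scam portfolios from defensive brand registrations, which is why the allowlist in §6.5 is applied as well.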
Example findings:
- 198901* multi-handle cluster — 434 SANs, industrial Chinese gambling
- .garden TLD cluster — 191 SANs, coordinated registration
- 510qq — 100 SANs per cert, 900+ certificates, industrial gambling
- betting-widgets platform — 1,114 parent domains
- endofobama9 political/financial scam portfolio — 93 SANs per cert (DigiCert)
Pathway D — Batch Issuance Timestamp Clustering
Purpose: Group certificates by issuance timestamp at minute-level granularity, identify anomalous co-occurrences that reveal coordinated operator infrastructure — even when individual cert SANs are not unusually large.
Coverage envelope: Operations that batch-issue certificates in coordinated runs, regardless of SAN-list size or subdomain-label consistency.
Known blind spots:
- Operators who spread issuance across hours or days to avoid timestamp clustering
- Let's Encrypt rate limiting may naturally distribute issuance for some operators
- High-volume legitimate services (CDNs, hosting panels) create baseline noise at minute granularity
Executable query:
-- PATHWAY D: Batch issuance timestamp clustering
-- Identifies anomalous cert-issuance bursts at minute granularity
WITH cert_minutes AS (
SELECT
date_trunc('minute', CAST("not_before" AS TIMESTAMP)) as issue_minute,
"issuer",
"san_list",
"fingerprint_sha256:ID(Certificate)" as cert_id
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE "not_before" IS NOT NULL
)
SELECT
issue_minute,
issuer,
count(*) as certs_in_minute,
count(DISTINCT san_list) as unique_san_lists,
list(san_list)[:3] as sample_sans
FROM cert_minutes
GROUP BY issue_minute, issuer
HAVING certs_in_minute >= 10
ORDER BY certs_in_minute DESC
LIMIT 50
Structural analysis follow-up (for flagged minute-buckets):
-- For a flagged minute bucket, extract and analyze all SANs
WITH flagged_certs AS (
    SELECT "san_list"
    FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
    WHERE date_trunc('minute', CAST("not_before" AS TIMESTAMP)) = '{flagged_minute}'
        AND "issuer" LIKE '%Let''s Encrypt%'
),
san_entries AS (
    -- Unnest in its own step so the aggregate below groups plain values
    SELECT unnest(string_split(san_list, ',')) as raw_entry
    FROM flagged_certs
)
SELECT
    trim(raw_entry) as san_entry,
    count(*) as occurrences
FROM san_entries
GROUP BY san_entry
ORDER BY occurrences DESC
LIMIT 200
Signal/noise heuristics:
- Filter: Known high-volume legitimate issuers at expected times (CDN renewals, hosting panel auto-renewals)
- Keep: Let's Encrypt bursts with multiple distinct SAN lists (indicates distinct certs issued simultaneously)
- Flag: Minute-buckets where SANs share common parent-domain substrings or coordinated brand prefixes
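The "coordinated brand prefixes" flag can be approximated by comparing the leading alphabetic token of each first label across the SANs of a flagged minute-bucket. The sketch below is one possible approach under assumptions: the function name and min_parents default are hypothetical, and the token rule (letters up to the first non-letter) is chosen to match labels shaped like huawei-ndfjejfh-e09f0.

```python
import re
from collections import defaultdict

LEADING_TOKEN = re.compile(r"^[a-z]+")


def shared_brand_prefixes(sans, min_parents=3):
    """Return {prefix: parent_count} for leading tokens seen on >= min_parents parents."""
    parents_by_prefix = defaultdict(set)
    for san in sans:
        labels = san.strip().lower().split(".")
        if len(labels) < 3:
            continue  # no subdomain label to inspect
        match = LEADING_TOKEN.match(labels[0])
        if match:
            # Naive parent domain: the last two labels.
            parents_by_prefix[match.group()].add(".".join(labels[-2:]))
    return {
        prefix: len(parents)
        for prefix, parents in parents_by_prefix.items()
        if len(parents) >= min_parents
    }
```

Common infrastructure labels such as www will also surface here, so the output is best read alongside the Pathway A noise filters rather than on its own.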
Example findings:
- Chinese brand cluster (1028cc1028gg) — all certificates issued at 2026-05-14 16:02 UTC, 10 brand subdomains across operator-generated .cc/.vip/.mobi/.win domains
- This pathway found the highest-quality TEST_02 findings that threshold-based pathways missed
Pathway E — Targeted Brand/Keyword Sweep
Purpose: Search CT data for specific brand names, financial-service terms, or operator-relevant keywords embedded in subdomain labels. Catches brand-impersonation phishing below other thresholds.
Coverage envelope: Operations that embed recognizable brand names or service terms in subdomain labels, regardless of scale.
Known blind spots:
- Operators not using recognizable brand names (opaque handles like bfqde2023llsplde12qd27qdl)
- Brand names used legitimately (requires validation step)
- Keyword corpus is never complete — novel targets will be missed
Executable query:
-- PATHWAY E: Targeted brand/keyword sweep
-- Parameterized: replace {keyword} with target brand/term
SELECT
"fqdn:ID(Domain)" as fqdn,
regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
split_part("fqdn:ID(Domain)", '.', 1) as first_label
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" ILIKE '%{keyword}%'
AND regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) NOT ILIKE '%{keyword}%'
-- exclude cases where keyword IS the registered domain
GROUP BY fqdn, parent_domain, first_label
ORDER BY parent_domain
Keyword corpus (maintained and expanded per region):
| Region | Keywords |
|---|---|
| German financial | traderepublic, sparkasse, volksbank, commerzbank, verifizierung, legitimation, tan-verfahren |
| Russian e-commerce | sber, sbermegamarket, avito, ozon, yandex, cdek, pochta, blablacar |
| Chinese tech/telecom | huawei, xiaomi, taobao, jingdong, tianmao, yidong, dianxin, liantong |
| Global financial | paypal, coinbase, binance, kraken, revolut |
| Insurance (DE) | berufsunfähigkeit, haftpflicht, versicherung, altersvorsorge |
Signal/noise heuristics:
- Filter: Results where the keyword IS the registered domain (legitimate brand sites)
- Filter: Known CDN/hosting subdomains for those brands
- Keep: Keywords appearing as subdomain labels on unrelated parent domains
- Flag: Cross-country deployment pattern (same keyword on domains in 5+ countries = automated deployment)
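A rough sketch of the cross-country flag follows. How country is determined is not specified here, so this example assumes the parent domain's two-letter TLD as a proxy, which is a loud simplification: generic TLDs are ignored, and ccTLDs that are used generically (.co, .io, .cc) will be miscounted. The function name is hypothetical.

```python
from collections import defaultdict


def cross_country_keywords(rows, min_countries=5):
    """rows: iterable of (keyword, parent_domain) pairs from a Pathway E sweep.

    Returns {keyword: sorted ccTLDs} for keywords spanning >= min_countries ccTLDs.
    """
    cctlds_by_keyword = defaultdict(set)
    for keyword, parent in rows:
        tld = parent.lower().rsplit(".", 1)[-1]
        # Assumption: a two-letter alphabetic TLD stands in for a country.
        if len(tld) == 2 and tld.isalpha():
            cctlds_by_keyword[keyword].add(tld)
    return {
        keyword: sorted(cctlds)
        for keyword, cctlds in cctlds_by_keyword.items()
        if len(cctlds) >= min_countries
    }
```

Hosting-country or registrant data from the external verification step would be a stronger signal where available; the TLD proxy only has the advantage of needing no lookups.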
Example findings:
- verifizierung-traderepublic — 14 compromised parent domains in 12 countries
- berufsunfhigkeitsversicherung — 20 parent domains (German insurance phishing)
6. Pipeline Enhancements
6.1 Codify Pathway D (batch-issuance timestamp clustering)
Status: Now codified as a first-class pathway (see §5, Pathway D above).
Rationale: This pathway found the Chinese brand cluster (1028cc1028gg) — the highest-quality TEST_02 finding — which was invisible to all threshold-based queries because its 38 parent domains were distributed across 10 brand prefixes with random suffixes.
What it captures that Pathways A-C don't: Operators who use random per-domain suffixes (so no first_label aggregates), but batch-issue certs in coordinated runs.
6.2 Multilingual / Diacritic-Aware String Normalization (Pathway B Enhancement)
Rationale: The berufsunfhigkeitsversicherung finding (German "occupational disability insurance," ä stripped to nothing) revealed that operator automation normalizes diacritics inconsistently. The same artifact applies to French (é/è), Spanish (ñ), Portuguese (ã/ç), Czech (č/š/ž), Turkish (ı/ş/ğ).
Implementation:
-- Generate ASCII-folded variants for Pathway B candidates
-- Approach: strip diacritics, then also check digraph transliteration
-- Note: FQDNs in CT data are ASCII (IDNs appear as xn-- labels) and the
-- Pathway B regex admits only [a-z], so these folds only take effect on
-- IDN-decoded labels or on a Unicode keyword corpus, not on raw candidates
WITH candidates AS (
    -- ... standard Pathway B query ...
),
normalized AS (
    SELECT
        candidate,
        -- NFD decomposition + strip combining marks (Python-side)
        -- Also generate: ä→ae, ö→oe, ü→ue, ß→ss transliterations
        regexp_replace(candidate, '[äáàâã]', 'a', 'g') as folded_a,
        regexp_replace(candidate, '[öóòô]', 'o', 'g') as folded_o,
        regexp_replace(candidate, '[üúùû]', 'u', 'g') as folded_u
    FROM candidates
)
SELECT * FROM normalized
Full implementation requires Python-side processing:
import unicodedata

def generate_normalization_variants(s):
    """Generate plausible diacritic-normalization variants of a keyword."""
    # Variant 1: Fold to base letters (ä→a) via NFD + drop combining marks
    nfd = unicodedata.normalize('NFD', s)
    folded = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
    # Variant 2: German transliteration (ä→ae, ö→oe, ü→ue, ß→ss)
    nfc = unicodedata.normalize('NFC', s)
    german = nfc.replace('ä', 'ae').replace('ö', 'oe').replace('ü', 'ue').replace('ß', 'ss')
    # Variant 3: Delete non-ASCII characters outright (ä→''), the form
    # observed in berufsunfhigkeitsversicherung
    deleted = nfc.encode('ascii', 'ignore').decode('ascii')
    # Deduplicate while preserving order
    return list(dict.fromkeys([s, folded, german, deleted]))
Coverage gap closed: Multilingual phishing operations whose automation handles diacritics in ways the analyst doesn't anticipate.
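As a worked example, the sketch below checks which normalization strategy explains an observed label, using the German insurance keyword from the corpus. The helper names are hypothetical and the three strategies are the ones discussed in this section; note that only outright deletion of the non-ASCII character reproduces the observed string.

```python
import unicodedata


def keyword_variants(keyword):
    """Map each normalization strategy to the form it produces for `keyword`."""
    nfc = unicodedata.normalize("NFC", keyword)
    nfd = unicodedata.normalize("NFD", keyword)
    return {
        # ä -> a: decompose, then drop combining marks
        "folded": "".join(c for c in nfd if unicodedata.category(c) != "Mn"),
        # ä -> ae: German digraph transliteration
        "transliterated": nfc.replace("ä", "ae").replace("ö", "oe")
                             .replace("ü", "ue").replace("ß", "ss"),
        # ä -> '': non-ASCII characters removed entirely
        "deleted": nfc.encode("ascii", "ignore").decode("ascii"),
    }


def matched_strategies(label, keyword):
    """Return the strategies whose output appears as a substring of `label`."""
    return [
        name
        for name, form in keyword_variants(keyword).items()
        if form and form in label
    ]


if __name__ == "__main__":
    print(matched_strategies("berufsunfhigkeitsversicherung", "berufsunfähigkeit"))
```

Running every corpus keyword through all strategies before the Pathway E sweep means the analyst does not need to guess in advance which one a given operator's tooling uses.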
6.3 IDN/Punycode-Aware Aggregation (Pathway A Enhancement)
Rationale: The Japanese parasitic SEO cluster (xn--n8jub3cxopfw59v90r725esqg = "失敗しないカニ通販") was found through batch inspection, not Pathway A, because punycode strings don't aggregate intuitively at the byte level.
Implementation:
import encodings.idna

def decode_idn_label(label):
    """Decode punycode first-label to Unicode for aggregation."""
    if label.startswith('xn--'):
        try:
            return label.encode('ascii').decode('idna')
        except UnicodeError:  # also covers UnicodeDecodeError
            return label
    return label
# In Pathway A post-processing:
# 1. For every first_label starting with xn--, decode to Unicode
# 2. Aggregate also by decoded form
# 3. Flag clusters where multiple xn-- strings appear on overlapping parent-domain sets
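Step 3 above could look roughly like the following sketch, which finds pairs of punycode first labels deployed on overlapping parent-domain sets and reports them in decoded form. The function names and the min_shared default are assumptions; the decoder mirrors the one shown above so that the block runs on its own.

```python
from collections import defaultdict
from itertools import combinations


def decode_label(label):
    """Decode an xn-- label to Unicode, falling back to the raw label."""
    if label.startswith("xn--"):
        try:
            return label.encode("ascii").decode("idna")
        except UnicodeError:
            return label
    return label


def overlapping_idn_labels(rows, min_shared=2):
    """rows: iterable of (first_label, parent_domain) pairs.

    Returns (decoded_a, decoded_b, shared_parents) for each pair of xn--
    labels that co-occur on at least `min_shared` parent domains.
    """
    parents = defaultdict(set)
    for label, parent in rows:
        label = label.lower()
        if label.startswith("xn--"):
            parents[label].add(parent)
    flagged = []
    for a, b in combinations(sorted(parents), 2):
        shared = parents[a] & parents[b]
        if len(shared) >= min_shared:
            flagged.append((decode_label(a), decode_label(b), sorted(shared)))
    return flagged
```

The pairwise comparison is quadratic in the number of distinct xn-- labels, which is acceptable for a post-processing pass over Pathway A output but would need an inverted index (parent domain to labels) at full-dataset scale.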
Enhanced Pathway A query (IDN-aware):
-- Flag punycode first-labels for decode-and-reaggregate step
SELECT
    first_label,
    domain_count,
    CASE WHEN first_label LIKE 'xn--%' THEN 'IDN_DECODE_NEEDED' ELSE 'ASCII' END as label_type
FROM (
    -- ... standard Pathway A query ...
)
-- No WHERE filter on xn--: keeping ASCII rows is what makes label_type informative.
-- To export only the decode queue, filter on label_type = 'IDN_DECODE_NEEDED'.
ORDER BY domain_count DESC
Coverage gap closed: SEO poisoning, parasitic SEO, and IDN-encoded brand impersonation targeting non-Latin-script markets (Japanese, Chinese, Korean, Cyrillic, Arabic).
6.4 Lower-Threshold Sweep with Signature Filters
Rationale: Several high-value findings were below the standard top-100 threshold (verifizierung-traderepublic at 14 parents, Chinese brand cluster at 38 with distributed prefixes). The current threshold is optimized for noise filtering but may be too aggressive.
Implementation:
-- LOWER-THRESHOLD SWEEP: longer strings, stricter signature filters
-- Runs as secondary pass with separate review queue
WITH extracted AS (
SELECT
"fqdn:ID(Domain)" as fqdn,
regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
split_part("fqdn:ID(Domain)", '.', 1) as first_label
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE length("fqdn:ID(Domain)") > 40
)
SELECT
first_label,
count(DISTINCT parent_domain) as domain_count,
count(*) as total_fqdns
FROM extracted
WHERE length(first_label) >= 20 -- longer = more likely operator signature
AND (
first_label LIKE '%verifizierung%'
OR first_label LIKE '%confirm%'
OR first_label LIKE '%secure%'
OR first_label LIKE '%update%'
OR first_label LIKE 'xn--%' -- punycode
OR regexp_matches(first_label, '[a-z]{20,}') -- 20+ consecutive alpha chars anywhere in the label
)
-- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY first_label
HAVING domain_count >= 10 -- much lower threshold
ORDER BY domain_count DESC
LIMIT 200
Coverage gap closed: Smaller-scale operations that are nevertheless meaningful (active phishing targeting actual users) but invisible to top-100 filtering.
6.5 Legitimate-Brand Portfolio Allowlist
Rationale: MercadoLibre (250-cert SAN bundle with 200+ defensive registrations), CSL Behring, Kaiser Permanente, CBRE, Tata Motors all surface in Pathway C as massive SAN bundles structurally identical to scam portfolios.
Implementation:
# Maintained allowlist of known legitimate large brand portfolios
BRAND_ALLOWLIST = {
    'mercadolibre': ['mercadolibre.com', 'mercadopago.com', 'melilink.com'],
    'csl_behring': ['cslbehring.com'],
    'kaiser': ['kaiserpermanente.org', 'kp.org'],
    'cbre': ['cbre.com'],
    'sabre': ['sabre.com', 'radixx.com'],
    # ... extend as discovered
}

def is_likely_brand_portfolio(san_list):
    """Return the brand whose domains appear in a comma-separated SAN string, else None."""
    entries = [e.strip().lower() for e in san_list.split(',')]
    for brand, markers in BRAND_ALLOWLIST.items():
        for marker in markers:
            # Match the domain itself or its subdomains, not arbitrary substrings
            # (a plain substring test would let 'kp.org' match 'backup.org').
            if any(e == marker or e.endswith('.' + marker) for e in entries):
                return brand
    return None
Coverage gap closed: Recurring false positives. Frees analyst attention for genuine operator infrastructure.
6.6 Multi-Handle SAN Structure Detection
Rationale: The 198901* finding's most distinctive feature is not any individual handle prefix — it's the parallel-handle structure. A single certificate bundling six distinct sequential-prefix families (k9gj9*, s66om*, tmgj9*, w6h987*, su78993*, 198901*) is a fingerprint of industrial domain-generation tooling.
Implementation:
-- MULTI-HANDLE DETECTION: Find certs with multiple sequential prefix families
WITH san_entries AS (
SELECT
"fingerprint_sha256:ID(Certificate)" as cert_id,
unnest(string_split("san_list", ',')) as san_entry
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE length("san_list") > 1000 -- large SAN lists only
),
prefixes AS (
SELECT
cert_id,
regexp_extract(san_entry, '^([a-z]+)', 1) as alpha_prefix,
san_entry
FROM san_entries
WHERE regexp_matches(san_entry, '^[a-z]+[0-9]+\.') -- alpha prefix + numeric suffix pattern (partial match; ~ would require a full-string match)
)
SELECT
cert_id,
count(DISTINCT alpha_prefix) as distinct_prefixes,
list(DISTINCT alpha_prefix)[:10] as prefix_families,
count(*) as total_sans
FROM prefixes
GROUP BY cert_id
HAVING distinct_prefixes >= 3 -- 3+ distinct sequential families
ORDER BY distinct_prefixes DESC
LIMIT 50
Coverage gap closed: Industrial domain-generation toolkit patterns (510qq, 198901*) detected more robustly than per-handle aggregation.
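For spot-checking a single certificate outside DuckDB, the same prefix-family logic can be approximated in Python. This is a sketch under stated assumptions: prefix_families, is_multi_handle, and the min_members parameter (requiring at least two members before a prefix counts as a family) are introduced here for illustration and are not part of the query above.

```python
import re
from collections import Counter

# Same shape as the SQL filter: letters, then digits, then a dot.
SEQ_LABEL = re.compile(r"^([a-z]+)[0-9]+\.")

def prefix_families(san_list, min_members=2):
    """Count alpha-prefix families in a comma-separated SAN string.

    min_members is an assumption of this sketch: a prefix seen only once
    is not treated as a sequential family.
    """
    counts = Counter()
    for entry in san_list.split(","):
        match = SEQ_LABEL.match(entry.strip().lower())
        if match:
            counts[match.group(1)] += 1
    return {prefix: n for prefix, n in counts.items() if n >= min_members}

def is_multi_handle(san_list, min_families=3):
    """True when the SAN string bundles min_families or more prefix families."""
    return len(prefix_families(san_list)) >= min_families
```

Note that a letters-then-digits pattern does not match handles that begin with digits or interleave digits and letters (for example 198901* or k9gj9*), so a broader pattern may be needed to cover every family named in the rationale.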
7. Validation Framework
7.1 Phase 2 Validation (from discovery to confirmed fingerprint)
For each candidate string from Phase 1 (any pathway):
- Hosting provider artifact check — does the string appear only on domains hosted by one provider? (Could be hosting-side injection rather than operator-side)
- Known software pattern check — does it match CMS slugs, ACME client defaults, control panel artifacts (WHM, cPanel, Plesk)?
- Cross-hosting-provider presence — if the same string appears on TIMEWEB, JSC Datacenter, and EVANZO hosting, it's almost certainly operator-side
- Operational context — does the string appear alongside operational indicators (phishing subdomains, admin panels, exfiltration channels, VPN impersonation)?
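The first and third checks above both reduce to counting how many distinct hosting providers carry the candidate string. A minimal sketch, assuming the analyst has already resolved each parent domain to a provider name (the domain_to_provider mapping, the verdict labels, and the three-provider threshold are hypothetical choices, the last one taken from the three-provider example in the checklist):

```python
def provider_spread(domain_to_provider, min_providers=3):
    """Summarize hosting-provider diversity for one candidate string.

    domain_to_provider: dict of parent domain -> provider name (or None
    when unresolved). Returns (distinct_provider_count, verdict).
    """
    providers = {p for p in domain_to_provider.values() if p}
    if len(providers) >= min_providers:
        verdict = "likely operator-side"
    elif len(providers) == 1:
        verdict = "possible hosting-side artifact"
    else:
        verdict = "inconclusive"
    return len(providers), verdict
```

A single-provider result does not prove hosting-side injection; it only marks the candidate for the known-software and operational-context checks before it is accepted.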
7.2 Infrastructure Stratification
Once validated, classify the parent domain set into operational strata (demonstrated on sbermegamarket's 322 domains):
| Stratum | Characteristics | Example |
|---|---|---|
| Operator-controlled | Shared registrar, coordinated naming, operator-themed content | Casino fronts (1win-ggg8.xyz), scam portfolios (*shield*), disposable throwaways (order-NNNNN.shop) |
| Compromised | Diverse registrars, diverse creation dates, legitimate prior history | Small businesses, expired domains, .ru businesses |
| TLD-adjacent / DNS-service | Wildcard DNS, dynamic DNS, subdomain-registrar services | sslip.io, ddnsfree.com, de.com |
Stratification query:
-- Extract parent domain set for stratification
SELECT
regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
regexp_extract("fqdn:ID(Domain)", '\.([^.]+)$', 1) as tld,
count(*) as fqdn_count,
count(DISTINCT split_part("fqdn:ID(Domain)", '.', 1)) as unique_first_labels
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%{fingerprint_string}%'
GROUP BY parent_domain, tld
ORDER BY fqdn_count DESC
7.3 Brand-Portfolio False-Positive Filtering
Legitimate brand defensive-portfolio registrations can look structurally identical to scam portfolios. They typically show:
- SAN bundles with 100+ domains covering typos, country variants, product extensions
- All domains traceable to a single known corporation
- Consistent naming pattern (brand + variant suffix)
Mitigation: Cross-reference against corporate registry (Crunchbase, EGRUL, Companies House) before operator attribution. Maintain the allowlist from §6.5.
8. Infrastructure Mapping & Pivot Queries
8.1 Fingerprint Expansion (map all infrastructure)
-- Map all domains touched by a validated fingerprint
SELECT
regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
count(*) as fqdn_count,
list(DISTINCT split_part("fqdn:ID(Domain)", '.', 1))[:10] as sample_first_labels
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%{fingerprint_string}%'
GROUP BY parent_domain
ORDER BY fqdn_count DESC
8.2 SAN Co-occurrence Analysis
-- Find multi-SAN certificates containing the fingerprint
-- Reveals what other infrastructure the operator bundles alongside the fingerprint
SELECT
"fingerprint_sha256:ID(Certificate)" as cert_id,
"san_list",
"not_before",
"issuer",
length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE "san_list" LIKE '%{fingerprint_string}%'
AND "san_list" LIKE '%,%' -- multi-SAN certs only
ORDER BY san_count DESC
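Once the query returns the matching SAN lists, the co-occurring infrastructure can be tallied per parent domain. The sketch below is illustrative only: parent_of approximates the last-two-labels extraction used in the queries, and cooccurring_parents is a hypothetical helper name.

```python
from collections import Counter

def parent_of(fqdn):
    """Approximate the queries' parent-domain extraction: keep the last two labels."""
    parts = fqdn.strip().lower().rstrip(".").split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else fqdn

def cooccurring_parents(san_lists, fingerprint):
    """Count parent domains bundled alongside fingerprint-bearing SAN entries.

    san_lists: iterable of comma-separated SAN strings. Each certificate
    contributes at most one count per co-occurring parent domain.
    """
    counts = Counter()
    for san_list in san_lists:
        entries = [e.strip() for e in san_list.split(",") if e.strip()]
        if len(entries) < 2 or not any(fingerprint in e for e in entries):
            continue  # single-SAN certs and non-matching certs are skipped
        counts.update({parent_of(e) for e in entries if fingerprint not in e})
    return counts.most_common()
```

Parent domains that recur across many fingerprint-bearing certificates are the strongest pivot candidates.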
8.3 Parent Domain Classification
After extracting the parent domain set, classify via:
| Classification | Signal |
|---|---|
| Operator-controlled | Shared registrar, shared DNS (e.g., Yandex NS), registrar clustering, coordinated creation dates, Russian registrar on non-.ru TLD |
| Compromised | Diverse registrars, diverse creation dates, no clustering — consistent with WHM/cPanel brute-force |
| Development/testing | High FQDN count with fuzzing patterns, GitLab paths, alphabet sequences |
8.4 Attribution Pivot Order
From mapped domains, pivot through (in decreasing reliability):
- WHOIS/RDAP — registrant names, organizations, taxpayer IDs (INN), emails
- Corporate registries — EGRUL (Russia), Companies House (UK), state SOS databases (US)
- DNS records — shared nameservers, MX records, SPF/DKIM keys
- Shodan/Censys — shared service profiles, exposed databases, control panels
- Certificate timeline — issuance timestamps correlate campaigns to operational windows
9. Cross-Pathway Confidence Scoring
| Confidence | Criteria | Action |
|---|---|---|
| HIGH | Confirmed by 2+ independent pathways | Report as validated operator fingerprint |
| MODERATE | Single pathway only, passes validation framework | Flag for further investigation, report with caveat |
| LOW | Single pathway, ambiguous validation | Hold for additional data windows |
Examples from testing:
| Finding | Pathways | Confidence |
|---|---|---|
| sbermegamarket | A + B + C | HIGH |
| 198901* cluster | C + D | HIGH |
| 1028cc1028gg | D (batch-timestamp) | HIGH (elevated: structural evidence overwhelming) |
| berufsunfhigkeitsversicherung | B | HIGH (elevated: operational context clear) |
| verifizierung-traderepublic | E (keyword) | HIGH (elevated: cross-country deployment pattern) |
| ai-assistant | A | MODERATE |
| .garden cluster | C | MODERATE (purpose TBD) |
Note: Single-pathway findings can be elevated to HIGH when structural evidence (batch-issuance timing, cross-country deployment, multi-brand coordination) provides the equivalent of independent confirmation.
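The scoring table and the elevation note can be read as a small decision rule. The function below is one possible encoding, not the pipeline's implementation: assign_confidence and its parameters are hypothetical, and treating a finding with no pathway hits as LOW, as well as requiring validation before elevation, are assumptions of this sketch.

```python
def assign_confidence(pathways, passes_validation, structural_evidence=False):
    """Map pathway hits and validation outcome to a confidence level.

    pathways: iterable of pathway letters that surfaced the finding.
    passes_validation: outcome of the validation framework.
    structural_evidence: batch-issuance timing, cross-country deployment,
        or multi-brand coordination strong enough to stand in for a
        second pathway.
    """
    distinct = len(set(pathways))
    if distinct >= 2:
        return "HIGH"
    if distinct == 1 and passes_validation:
        return "HIGH" if structural_evidence else "MODERATE"
    return "LOW"
```

For example, a finding surfaced only by Pathway D that passes validation and shows batch-issuance coordination would be scored HIGH, while a Pathway A-only finding without such evidence stays MODERATE.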
10. Operator Fingerprint Taxonomy
10.1 Russian E-Commerce Phishing
Archetype: sbermegamarket (322→473 parents across 2 test windows)
Characteristics:
- Russian brand impersonation as deeply nested subdomain labels (Sber, Avito, Ozon, CDEK, Yandex, Blablacar)
- Deployed on compromised infrastructure (85%) + operator-controlled disposables (13%) + DNS-service noise (2%)
- Multi-vertical: same operator runs gambling fronts, US-targeted scams (*shield, COVID, 401k), and e-commerce phishing
- Growth rate: ~75 new parent domains/day
10.2 Chinese Brand Impersonation
Archetype: 1028cc1028gg (38 parents, batch-issued)
Characteristics:
- Chinese tech/telecom brand names as subdomain prefixes (Huawei, Vivo, Xiaomi, Samsung, Taobao, Tmall, JD, China Mobile/Telecom/Unicom)
- Random per-domain suffixes (campaign/build identifiers)
- Operator-generated parent domains with embedded identifier: {seq4}{hex}{1028cc1028gg}.{tld}
- TLD rotation (.cc, .vip, .mobi, .win)
- Batch certificate issuance (all at single timestamp)
- Targets Chinese consumers
10.3 Industrial Chinese Gambling
Archetypes: 510qq (500+ parents), 198901* (434 SANs), 1994901* (192 SANs), a-a-game TLD-sweep, h5.* mobile cluster
Characteristics:
- Industrial-scale domain generation (sequential numbering, random alphanumeric prefixes)
- 100-SAN certificates with exclusively random-prefix domains (.top, .com)
- Let's Encrypt issuer exclusively
- Multiple distinct handle-prefix families bundled in single certificates (multi-handle pattern)
- h5. prefix = HTML5 mobile-optimized landing pages (Chinese mobile gambling convention)
- TLD-sweep registration (10+ TLDs for same label = a-a-game)
- QQ references (Chinese messaging platform)
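The TLD-sweep characteristic (one label registered under many suffixes) lends itself to a simple grouping check. This is a sketch under assumptions: tld_sweeps is a hypothetical helper, its input is assumed to be registrable parent domains (label plus suffix), and the default of 10 suffixes follows the bullet above.

```python
from collections import defaultdict

def tld_sweeps(parent_domains, min_tlds=10):
    """Return {label: sorted suffixes} for labels seen under min_tlds+ suffixes.

    Everything after the first dot is treated as the suffix, so
    multi-part suffixes such as biz.id are kept intact.
    """
    suffixes = defaultdict(set)
    for domain in parent_domains:
        label, _, suffix = domain.strip().lower().partition(".")
        if suffix:
            suffixes[label].add(suffix)
    return {label: sorted(s) for label, s in suffixes.items() if len(s) >= min_tlds}
```

Because the split is on the first dot, the input must already be reduced to parent domains; passing full FQDNs would group by subdomain label instead.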
10.4 German Financial-Services Phishing
Archetypes: verifizierung-traderepublic (14 parents), berufsunfhigkeitsversicherung (20 parents)
Characteristics:
- German-language subdomain labels as phishing lures
- Deployed on compromised international small-business sites (12+ countries)
- Diacritic-stripping artifact (ä → nothing, not ä → ae) reveals automation characteristics
- Likely single operator running parallel verticals (neobank verification + insurance)
- Let's Encrypt certificates, May 2026 issuance window
10.5 Japanese Parasitic SEO
Archetype: IDN cluster (xn--n8jub3cxopfw59v90r725esqg = "失敗しないカニ通販")
Characteristics:
- Punycode-encoded Japanese consumer search terms as subdomain labels
- Targets Japanese search queries ("is Scalp-D fake?", "crab mail-order reviews")
- Deployed on compromised .jp domains + international sites
- Creates HTTPS-valid landing pages for search engine manipulation
- Multiple related IDN strings on overlapping parent-domain sets (single operator)
10.6 Multi-Vertical Operator Portfolios
Archetype: sbermegamarket operator (gambling + scam + phishing sharing infrastructure)
Characteristics:
- Single operator running Russian e-commerce phishing, casino-brand fronts (22 gambling domains), US-targeted English scams (*shield, COVID, 401k; 15 domains), and disposable e-commerce throwaways
- Shared certificate-issuance infrastructure across verticals
- Multi-market (Russia, US, EU) operation
10.7 Platform/SaaS Fingerprints (Legitimate)
Archetypes: betting-widgets (1,114 parents), French funeral SaaS (simulateur-obseques), German funeral SaaS (gemeinsam-trauern)
Characteristics:
- Consistent subdomain patterns across client websites
- Not malicious but demonstrate technique's ability to map SaaS provider customer bases
- Useful for market intelligence, not threat intelligence
- Must be filtered from operator-fingerprint results
10.8 Operator Handle Fingerprints
Archetypes: bfqde2023llsplde12qd27qdl (67 parents), znegeulfluxsisilafamille (141 parents)
Characteristics:
- Long opaque strings (24-25 chars) that are clearly personal identifiers or tooling artifacts
- Non-dictionary, high-entropy
- Cross-TLD, cross-hosting-provider distribution
- Co-occur with operational infrastructure (typosquats, admin panels, WHM access)
- May contain embedded date markers (2023), language artifacts (French slang)
- Highest attribution potential (handle → registrant → corporate registry → individual)
11. Hunt Workflow — Complete Operational Cycle
Step 1: Run All Five Pathways in Parallel
Execute Pathways A-E against the analysis window's CT data. Each produces its own ranked output file.
# Run all pathways via Python/DuckDB on scribe01
cat << 'EOF' | ssh scribe01 '/opt/sigil/query-env/bin/python3'
import duckdb
con = duckdb.connect()
# Pathway A
pathway_a = con.execute("""...""").fetchdf()
pathway_a.to_json('/opt/sigil/data/investigations/hunt_YYYY-MM-DD/pathway_a.json')
# Pathway B
pathway_b = con.execute("""...""").fetchdf()
pathway_b.to_json('/opt/sigil/data/investigations/hunt_YYYY-MM-DD/pathway_b.json')
# ... Pathways C, D, E similarly
EOF
Step 2: Cross-Pathway Validation
Any finding from one pathway is cross-checked against the others:
- Does the Pathway A candidate also appear in Pathway C SAN bundles?
- Does the Pathway D timestamp cluster contain strings that would clear Pathway B?
- Findings confirmed by 2+ pathways → HIGH CONFIDENCE
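One way to mechanize this cross-check is to record which pathways surfaced each candidate string and to credit Pathway C whenever the string appears inside a SAN bundle. The sketch below assumes each pathway's output has been reduced to a list of candidate strings; cross_validate and its argument names are hypothetical.

```python
from collections import defaultdict

def cross_validate(label_hits, san_bundles, min_pathways=2):
    """Return candidates confirmed by at least min_pathways pathways.

    label_hits: dict such as {"A": [...], "B": [...]} of candidate strings.
    san_bundles: Pathway C SAN-list strings; a candidate found inside any
        bundle is credited to pathway "C".
    """
    seen = defaultdict(set)
    for pathway, candidates in label_hits.items():
        for candidate in candidates:
            seen[candidate.lower()].add(pathway)
    for candidate, pathways in seen.items():
        if any(candidate in bundle.lower() for bundle in san_bundles):
            pathways.add("C")
    return {c: sorted(p) for c, p in seen.items() if len(p) >= min_pathways}
```

Exact string equality across pathways is a simplification; candidates that differ only by diacritic handling or punycode encoding would need the normalization pass in Step 3 before they line up.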
Step 3: Multilingual + IDN Normalization Pass
Run the diacritic-normalization variants (§6.2) against Pathway B results. Decode all xn-- first_labels in Pathway A results (§6.3). Check for cross-linguistic operator artifacts.
Step 4: Batch-Timestamp Anomaly Review
Review Pathway D output for anomalous minute-buckets. For each flagged bucket:
- Extract the union of all SANs
- Check for shared parent-domain substrings
- Check for repeated subdomain prefixes (brand impersonation pattern)
- Apply structural analysis: TLD distribution, prefix entropy, parent-domain co-occurrence
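The per-bucket checks can be gathered into a single summary to speed up review. The following is a rough sketch, not the pipeline's code: summarize_bucket, the returned field names, and the use of character-level Shannon entropy as the "prefix entropy" measure are assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return abs(-sum((n / total) * math.log2(n / total) for n in counts.values()))

def summarize_bucket(san_lists):
    """Summarize one flagged minute-bucket from its comma-separated SAN strings."""
    sans = set()
    for san_list in san_lists:
        sans.update(e.strip().lower() for e in san_list.split(",") if e.strip())
    tlds = Counter(s.rsplit(".", 1)[-1] for s in sans)
    prefixes = Counter(s.split(".", 1)[0] for s in sans)
    mean_entropy = (
        sum(shannon_entropy(p) for p in prefixes) / len(prefixes) if prefixes else 0.0
    )
    return {
        "san_count": len(sans),                       # union of all SANs
        "tld_distribution": dict(tlds),               # TLD rotation signal
        "repeated_prefixes": {p: n for p, n in prefixes.items() if n > 1},
        "mean_prefix_entropy": mean_entropy,          # high values suggest generated labels
    }
```

Repeated prefixes that are brand names, combined with a spread of TLDs, match the brand-impersonation pattern described for the batch-issued clusters.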
Step 5: Legitimate-Brand Allowlist Filter
Apply the brand-portfolio allowlist (§6.5) to Pathway C output. Auto-tag matching entries as "likely legitimate brand defense" and exclude from the operator-fingerprint review queue.
Step 6: Multi-Handle SAN Structure Flag
Run the multi-handle detection query (§6.6) against Pathway C output. Flag any certificate with 3+ distinct sequential-prefix families.
Step 7: Coverage Envelope Diagnostic
Append the coverage envelope analysis (§12) to the hunt report. Document which pathway caught each finding and what the pipeline could NOT have found.
Step 8: Manual Operator-Validation Step
Apply the validation framework (§7) to the consolidated multi-pathway output:
- Confirm cross-hosting-provider distribution
- Verify operational context (phishing targets, admin panels, etc.)
- Stratify parent domains (operator-controlled vs. compromised vs. noise)
- Assign confidence levels
Output Format
/opt/sigil/data/investigations/hunt_YYYY-MM-DD/
├── pathway_a.json # Raw Pathway A results
├── pathway_b.json # Raw Pathway B results
├── pathway_c.json # Raw Pathway C results
├── pathway_d.json # Raw Pathway D results
├── pathway_e.json # Raw Pathway E results
├── candidates.json # Consolidated candidates post-filtering
├── validated.json # Validated fingerprints post-Phase-2
└── report.md # Narrative summary with assessments
12. Coverage Envelope Diagnostic Template
12.1 Post-Hunt Attribution Matrix
For each new fingerprint surfaced, document which pathway(s) caught it:
| Finding | Pathway A | Pathway B | Pathway C | Pathway D | Pathway E |
|---|---|---|---|---|---|
| (finding name) | ✅/❌ | ✅/❌ | ✅/❌ | ✅/❌ | ✅/❌ |
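Rows for this matrix can be generated from the per-finding pathway sets instead of being filled in by hand. A minimal sketch, in which attribution_row and the PATHWAYS ordering are illustrative names introduced here:

```python
PATHWAYS = ["A", "B", "C", "D", "E"]

def attribution_row(finding, caught_by):
    """Render one matrix row; caught_by is the set of pathway letters that hit."""
    cells = ["✅" if pathway in caught_by else "❌" for pathway in PATHWAYS]
    return "| " + " | ".join([finding] + cells) + " |"

# Example using a finding from the confidence table (pathways C + D):
print(attribution_row("198901* cluster", {"C", "D"}))
```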
12.2 "What We Could Not Have Found" (standing discipline)
Append to every hunt report:
"This hunt's pipeline would not have surfaced:
- Operators using single-cert-per-domain issuance (Pathway C blind)
- Operators with <10-parent infrastructure (all thresholds blind)
- Operators using diacritic-transliteration normalization differently from our normalization variants
- Operators using IDN strings in scripts not covered by our decode step
- Operators who spread certificate issuance across 24+ hours (Pathway D blind)
- Operators using short identifiers (<12 chars) without matching our keyword corpus (Pathway B/E blind)
- Operators rotating identifiers per-campaign with no consistent string across deployments"
12.3 Coverage Matrix Reference
| Attribute | Captured by |
|---|---|
| High parent-domain count (>36) | Pathway A, B |
| Subdomain-prefix consistency | Pathway A |
| Long unusual substring (14+ chars) | Pathway B |
| Large SAN-bundle structure | Pathway C |
| Multi-handle industrial generation | Pathway C + §6.6 |
| Batch-issuance coordination | Pathway D |
| Brand-keyword presence | Pathway E |
| Punycode/IDN | §6.3 enhancement |
| Diacritic variants | §6.2 enhancement |
13. Limitations & Failure Modes
13.1 Structural Limitations
The technique operates through multiple complementary discovery pathways, each with its own coverage envelope. No single pathway alone is sufficient, and even the combination has documented blind spots:
- Operators using random per-domain suffixes — invisible to first-label aggregation (Pathway A)
- Operators using single-cert-per-domain issuance — invisible to SAN-bundle inspection (Pathway C)
- Operators using non-Latin-script identifiers — invisible to byte-level substring search without IDN decoding
- Operators spreading issuance over time — invisible to batch-timestamp clustering (Pathway D)
- Operators using short or rotating identifiers — invisible to all threshold-based detection
- Operations below 10 parent domains — below all practical thresholds
- Pathway B captures only the first 12-30 character alphabetic substring per FQDN: regexp_extract with capture group 1 returns the first match only. Operator strings appearing later in the FQDN structure are not captured by this pathway unless they happen to be the first alphabetic run. Cross-pathway confirmation (Pathway A's first-label aggregation, Pathway C's SAN inspection) compensates for this, but it is a documented blind spot of Pathway B in isolation.
The honest framing: the technique is best understood as a set of complementary lenses, not a single algorithm. Each lens reveals a class of operator infrastructure that the others miss.
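The Pathway B first-match blind spot is easy to reproduce with Python's re module, which follows the same leftmost-match behavior for a single search. The sketch below is illustrative only; the example FQDN is made up, and first_candidate / all_candidates are hypothetical helper names.

```python
import re

# Same character class and length bounds as the Pathway B pattern.
CANDIDATE = re.compile(r"[a-z]{12,30}")

def first_candidate(fqdn):
    """Return only the leftmost 12-30 letter run, as a first-match extraction does."""
    match = CANDIDATE.search(fqdn)
    return match.group(0) if match else None

def all_candidates(fqdn):
    """Return every non-overlapping 12-30 letter run in the FQDN."""
    return CANDIDATE.findall(fqdn)

fqdn = "mailservicegateway.sbermegamarket.shop.example"
print(first_candidate(fqdn))  # the operator string is not the first run
print(all_candidates(fqdn))   # both runs are recovered
```

Extracting all runs per FQDN would close this gap at the cost of more candidate rows to aggregate; whether that trade-off is worthwhile depends on data volume.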
13.2 Operational Limitations
- Opportunistic, not universal. Only works if the operator's tooling embeds a consistent or structurally distinctive artifact. Many operators don't make this mistake.
- Requires raw CT stream ingestion. A Parquet-based DuckDB setup on modest hardware (16GB RAM desktop) is sufficient for days-to-weeks of data, but you need the pipeline.
- False positive risk with short strings. Strings under 10 characters produce too many matches. Most reliable with 15+ character strings.
- Temporal coverage is bounded. Local CT data covers the collection window only. Full historical coverage requires backfilling from CT log archives, a significant data engineering effort.
- Detectable by the operator. Publication allows operators to audit and patch their tooling. However, historical CT log entries are immutable; past fingerprints cannot be erased.
- Not a silver bullet for attribution. The fingerprint maps infrastructure, not identity. The path from fingerprint → domain → WHOIS → corporate registry → named individual requires separate investigation steps, each with its own confidence level.
- Volume bias toward Let's Encrypt. Free, automated issuance means Let's Encrypt dominates CT volume and makes the technique most productive against operators using Let's Encrypt automation. Operators using paid CAs leave less volume but stronger attribution signals (DigiCert → payment trail).
- Signal-to-noise ratio. Discovery queries surface 200-250 raw candidates; validation filters these to 5-15 genuine fingerprints. The validation step requires significant analyst expertise.
13.3 Legitimate-Brand False Positives
Defensive registration portfolios from large corporations (MercadoLibre: 200+ typo variants, CSL Behring, Kaiser Permanente, CBRE, UFSCar) produce SAN bundles structurally identical to scam portfolios. Validation MUST include brand-portfolio cross-reference against major enterprises.
14. Findings Summary Table
All validated fingerprints across TEST_01 (May 9-13) and TEST_02 (May 14-15):
| Fingerprint | Type | Parents | Pathway(s) | Confidence | Status |
|---|---|---|---|---|---|
| sbermegamarket | Russian phishing | 322→473 | A, B, C | HIGH | GROWING |
| betting-widgets-static/gql/scoreboard-gql | Platform (gambling) | 1,114 | A, C | HIGH | GROWING |
| bfqde2023llsplde12qd27qdl | Operator handle | 67 | A, B | HIGH | STABLE |
| 510qq | Chinese gambling | 500+ | C | HIGH | ABSENT (May 14-15) |
| taiyangchengyulecheng | Chinese gambling SEO | 96 | A, B | HIGH | STABLE |
| endofobama9 portfolio | Political/financial scam | 93 (SAN bundle) | C | MODERATE | STABLE |
| simulateur-obseques | Platform (French funeral) | 76-112 | A | MODERATE | STABLE |
| 1028cc1028gg | Chinese brand impersonation | 38 | D | HIGH | NEW |
| verifizierung-traderepublic | German neobank phishing | 14 | E | HIGH | NEW |
| Japanese IDN cluster (3 strings) | Parasitic SEO | 10-21 each | D (batch) | MODERATE | NEW |
| 198901* multi-handle | Chinese gambling (.com) | 434 SANs | C, D | HIGH | NEW |
| 1994901* / 19949a* | Chinese gambling (.com) | 192 SANs | C | HIGH | NEW |
| .garden TLD cluster | Unknown (infrastructure) | 191 SANs | C | HIGH | NEW |
| berufsunfhigkeitsversicherung | German insurance phishing | 20 | B | HIGH | NEW |
| ai-assistant + beste-de-* | German SEO exploitation | 42 | A | MODERATE | NEW |
| a-a-game TLD-sweep | Chinese gambling | 500 certs | C | HIGH | NEW |
| h5.* mobile gambling | Chinese gambling | 336 certs | C | HIGH | NEW |
| gemeinsam-trauern | Platform (German funeral) | 29 | A | MODERATE | NEW |
| znegeulfluxsisilafamille | Operator handle | 141 | A, B | HIGH | KNOWN (excluded) |
15. Generalizable Search Patterns
Beyond the specific findings above, look for these categories of operator artifacts in subdomain labels:
| Category | Pattern | Example |
|---|---|---|
| Personal handles | Non-dictionary strings 15+ chars, mixed language | znegeulfluxsisilafamille (French + opaque) |
| Date-stamped test markers | YYYY-MM-DD + consistent string | 2022-12-23znegeulflux... |
| Tool configuration artifacts | Software versions, build IDs, CI/CD markers | bfqde2023llsplde12qd27qdl (contains "2023") |
| Campaign identifiers | Codenames, project names, client refs | 1028cc1028gg (operator infrastructure ID) |
| Language-specific artifacts | Strings in unexpected languages for TLD | French strings on Russian domains |
| Brand impersonation | Known brand + action word (verifizierung-) | verifizierung-traderepublic |
| Sequential numbering | Handle + incremental suffix | 510qq1 through 510qq600 |
| Alphabet/fuzzing patterns | Reverse-alphabet, sequential test strings | hgfedcbaupdate, jihgfedcbaupdate |
| Pinyin transliterations | Chinese terms in Latin alphabet | taiyangchengyulecheng (太阳城娱乐城) |
| Diacritic artifacts | Missing/stripped diacritics in European terms | berufsunfhigkeitsversicherung (ä → ∅) |
| IDN/Punycode | xn-- prefixes encoding non-Latin terms | xn--n8jub3cxopfw59v90r725esqg (Japanese) |
| TLD-sweep registration | Same label across 5+ TLDs | a-a-game.{biz.id,cfd,click,cyou,icu,...} |
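As one example of turning a row of this table into a check, the sequential-numbering pattern (a handle followed by an incrementing suffix) can be detected by grouping labels on their non-numeric stem. This is a sketch only; sequential_runs and the min_run threshold are assumptions, not part of the documented pipeline.

```python
import re
from collections import defaultdict

# Stem ending in a non-digit, followed by a trailing run of digits.
HANDLE_SEQ = re.compile(r"^(.*?\D)(\d+)$")

def sequential_runs(labels, min_run=3):
    """Return {handle: sorted numeric suffixes} for handles with min_run+ suffixes."""
    suffixes = defaultdict(set)
    for label in labels:
        match = HANDLE_SEQ.match(label.lower())
        if match:
            suffixes[match.group(1)].add(int(match.group(2)))
    return {h: sorted(n) for h, n in suffixes.items() if len(n) >= min_run}
```

Labels made up entirely of digits have no non-numeric stem and are skipped by this pattern, so all-numeric families would need separate handling.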
16. Case Studies
16.1 znegeulfluxsisilafamille — Operator Handle to Named Individual
- String: 24 characters: znegeulflux (opaque handle) + sisilafamille (French slang, "yes yes the family")
- Discovery: Manual analysis during phishing platform investigation
- Scale: 1,447 FQDNs across 141 parent domains, 15+ countries
- crt.sh detection: Zero results (all query approaches failed)
- Attribution chain: Fingerprint → tlkregion.ru (WHOIS) → INN 7730648020 (EGRUL) → LLC TLK Region → Maksim G. Ermolaev
- Infrastructure: 2 operating domains, 1 dev server, ~35 operator-controlled domains, ~96 compromised domains
- Key insight: A single operator handle embedded in automated tooling permanently mapped an entire infrastructure invisible to standard OSINT
16.2 sbermegamarket — Scale and Multi-Vertical Operations
- String: 14 characters — Russian brand (Sber Megamarket)
- Discovery: Pathway A (first-label aggregation, top-3 result)
- Scale: 322 parent domains (May 9-13) → 473 parent domains (May 14-15), growth rate ~75/day
- Stratification: 13% operator-owned (gambling fronts + US scams + disposables), 85% compromised, 2% DNS-service noise
- Multi-vertical: Same operator runs Russian e-commerce phishing, casino-affiliate fronts (22 domains), US-targeted scams (*shield/COVID/401k — 15 domains), and disposable throwaways
- Key insight: Infrastructure stratification reveals a single operation spanning 3+ scam verticals across 2+ markets sharing certificate-issuance infrastructure
16.3 1028cc1028gg — Batch-Issuance Discovery
- String: 12 characters — operator infrastructure identifier embedded in parent domains
- Discovery: Pathway D (batch-issuance timestamp clustering) — all certs at 2026-05-14 16:02 UTC
- Scale: 10 Chinese brand impersonations × 38 parent domains
- Why threshold-based missed it: 38 parents distributed across 10 brand prefixes with random suffixes — no individual first_label aggregates above threshold
- Key insight: Batch-timestamp clustering catches operator coordination invisible to all volume-based methods
16.4 verifizierung-traderepublic — Brand-Keyword Discovery
- String: 27 characters — German ("verification Trade Republic")
- Discovery: Pathway E (targeted brand/keyword sweep)
- Scale: 14 compromised parent domains in 12 countries
- Why threshold-based missed it: 14 parents — well below top-100 threshold
- Connection: Likely same operator as berufsunfhigkeitsversicherung (German insurance phishing, 20 parents); both target German financial consumers, deploy on compromised international hosts, and appear in the same window
- Key insight: Targeted keyword sweeps catch active phishing campaigns too small for statistical detection but operationally significant
Appendix: Quick-Reference Query Cheat Sheet
-- ═══════════════════════════════════════════════════════════════
-- PATHWAY A: First-label aggregation
-- ═══════════════════════════════════════════════════════════════
WITH extracted AS (
SELECT
"fqdn:ID(Domain)" as fqdn,
regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
split_part("fqdn:ID(Domain)", '.', 1) as first_label
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE length("fqdn:ID(Domain)") > 40
)
SELECT first_label, count(DISTINCT parent_domain) as domain_count, count(*) as total_fqdns,
list(DISTINCT parent_domain ORDER BY parent_domain)[:5] as sample_domains
FROM extracted
WHERE length(first_label) >= 12
-- (exclusions injected from config/excluded_fingerprints.yaml)
AND first_label NOT LIKE 'www%' AND first_label NOT LIKE 'mail%'
AND first_label NOT LIKE 'autodiscover%' AND first_label NOT LIKE '%notexists%'
GROUP BY first_label HAVING domain_count >= 3
ORDER BY domain_count DESC LIMIT 100;
-- ═══════════════════════════════════════════════════════════════
-- PATHWAY B: Substring frequency (14+ chars)
-- ═══════════════════════════════════════════════════════════════
SELECT
regexp_extract("fqdn:ID(Domain)", '([a-z]{12,30})', 1) as candidate,
count(DISTINCT regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1)) as domain_count,
count(*) as total
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE length("fqdn:ID(Domain)") > 50
-- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY candidate HAVING domain_count >= 5 AND length(candidate) >= 14
ORDER BY domain_count DESC LIMIT 100;
-- ═══════════════════════════════════════════════════════════════
-- PATHWAY C: SAN-list pattern clustering
-- ═══════════════════════════════════════════════════════════════
SELECT "san_list", count(*) as cert_count,
length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE length("san_list") > 200
-- (exclusions injected from config/excluded_fingerprints.yaml)
GROUP BY "san_list" HAVING cert_count >= 3
ORDER BY cert_count DESC LIMIT 50;
-- ═══════════════════════════════════════════════════════════════
-- PATHWAY D: Batch-issuance timestamp clustering
-- ═══════════════════════════════════════════════════════════════
WITH cert_minutes AS (
SELECT date_trunc('minute', CAST("not_before" AS TIMESTAMP)) as issue_minute,
"issuer", "san_list", "fingerprint_sha256:ID(Certificate)" as cert_id
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE "not_before" IS NOT NULL
)
SELECT issue_minute, issuer, count(*) as certs_in_minute,
count(DISTINCT san_list) as unique_san_lists
FROM cert_minutes
GROUP BY issue_minute, issuer HAVING certs_in_minute >= 10
ORDER BY certs_in_minute DESC LIMIT 50;
-- ═══════════════════════════════════════════════════════════════
-- PATHWAY E: Targeted brand/keyword sweep
-- Replace {keyword} with target term
-- ═══════════════════════════════════════════════════════════════
SELECT "fqdn:ID(Domain)" as fqdn,
regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
split_part("fqdn:ID(Domain)", '.', 1) as first_label
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" ILIKE '%{keyword}%'
AND regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) NOT ILIKE '%{keyword}%'
GROUP BY fqdn, parent_domain, first_label
ORDER BY parent_domain;
-- ═══════════════════════════════════════════════════════════════
-- EXPANSION: Map all infrastructure for validated fingerprint
-- ═══════════════════════════════════════════════════════════════
SELECT regexp_extract("fqdn:ID(Domain)", '\.([^.]+\.[^.]+)$', 1) as parent_domain,
count(*) as fqdn_count,
list(DISTINCT split_part("fqdn:ID(Domain)", '.', 1))[:10] as sample_labels
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%{fingerprint_string}%'
GROUP BY parent_domain ORDER BY fqdn_count DESC;
-- ═══════════════════════════════════════════════════════════════
-- SAN CO-OCCURRENCE: Multi-SAN certs with fingerprint
-- ═══════════════════════════════════════════════════════════════
SELECT "fingerprint_sha256:ID(Certificate)" as cert_id, "san_list", "not_before", "issuer",
length("san_list") - length(replace("san_list", ',', '')) + 1 as san_count
FROM read_parquet('/opt/sigil/data/parquet/*/certificates.parquet', union_by_name=true)
WHERE "san_list" LIKE '%{fingerprint_string}%'
AND "san_list" LIKE '%,%'
ORDER BY san_count DESC;
-- ═══════════════════════════════════════════════════════════════
-- KNOWN STRING SEARCH: Direct lookup
-- ═══════════════════════════════════════════════════════════════
SELECT "fqdn:ID(Domain)" as fqdn
FROM read_parquet('/opt/sigil/data/parquet/*/domains.parquet', union_by_name=true)
WHERE "fqdn:ID(Domain)" LIKE '%suspectedstring%';
This methodology document synthesizes CrimsonVector investigations CV-INV-05 ("si si la famille"), TEST_01 (May 9-13 2026), and TEST_02 (May 14-15 2026). It accompanies the UE26 case study presentation on CT Behavioral Fingerprinting.