We Dramatically Improved Our Machine Learning Model to Price Kalshi Earnings Contracts. It Went 11/13.
The new model combines historical frequency data with AI narrative analysis to find mispriced Kalshi contracts. Here's how it performed.
Coming off a whiff on our recent high-conviction play, we went back and rebuilt our machine learning model: we added AI-enhanced capabilities that provide context to the actual words spoken, and the new results are simply impressive.
CrowdStrike reported earnings on March 3rd. Kalshi listed 13 binary contracts on whether specific words would be mentioned during the call. Our apologies for not publishing ahead of time: with the new model weights, we were not comfortable releasing our predictions to readers before we could validate the new system.
The model nailed 11 out of 13 directional calls. Every high-conviction bet was correct. Every fade was right. The only misses were two event-driven surprises that no historical model could have predicted, and we caught one of those with a manual research overlay before the market opened.
Here’s exactly how we’ve improved the model, which we’ll be using going forward to provide readers with fair value estimates on contracts.
The Problem With How People Trade Mention Contracts
Most traders on Kalshi approach earnings mention contracts the same way: they pull up the last transcript, ctrl-F the keyword, and make a gut call. Maybe they check two or three transcripts if they’re diligent. The market price reflects this — a loose consensus of people eyeballing recent history.
That approach has two blind spots. First, it doesn’t weight recency properly. A keyword mentioned 8 of the last 8 quarters feels like a lock, but if 6 of those mentions were three years ago and only 2 were recent, the probability is very different than if all 8 were consecutive. Second, and more importantly, it treats all mentions equally. “We are investing heavily in our consolidation platform” and “as we’ve moved past the outage incident” are both keyword hits. But the first one is a CEO describing core strategy. The second is a CFO giving a one-sentence brush-off to an analyst’s question. One predicts future mentions. The other predicts silence.
We’ve made a tenfold technical improvement to our model, and the new system addresses both problems.
The Results
Here’s how the model’s calls broke down:
Correct YES calls (6/6):
Falcon Flex at 95¢
Consolidation at 93¢
AWS at 75¢
Hyperscaler at 73¢
Nvidia at 68¢
Acquisition at 60¢
Correct NO calls (5/5):
Outage at 45¢
Generative AI at 35¢
China at 28¢
Hack at 12¢
Shadow at 40¢
The two INCORRECT calls:
Both the Microsoft and Signal forecasts were incorrect. However, upon human review, we more likely than not would have published these strikes as “No-Action” (i.e., do not trade them).
The full technical breakdown of our new two-layer model is available below for paid subscribers: how the LLM context classifier works, the seven adjustment signals, and the complete strike-by-strike classifications.
Going forward, this is the system we’re using to generate fair value estimates on every Kalshi earnings mention contract. Paid subscribers receive our probability tables and trade signals before the market opens. Free subscribers will receive analysis on one or two keywords we decide to publish publicly.
We went 11/13 on CrowdStrike, and the next cycle is already loading.
This post was originally paywalled; the paywall has since been lifted temporarily.
Paid subscribers, once more, thank you for your patronage. Your subscriptions directly cover the costs of running our machine learning system (API calls, LLM tokens, etc.). In keeping with our commitment to providing a strong value proposition, please enjoy reading about how we’ve improved the model to help support your trades. And, in the interest of your time, a heads-up: the following is very technical. If you’re interested in how the model you pay for actually works, read on!
How the Model Works: Two Layers
Layer 1: The Frequency Model
We trained a random forest classifier on 3,144 historical earnings call outcomes across seven public companies: CrowdStrike, Google, Meta, AMD, Costco, Tesla, and Goldman Sachs. We are actively adding more relevant companies and earnings calls to the training set. The current training data spans roughly 12 years of transcripts.
For each company-keyword pair, we extracted 43 features from the historical record. The model learned which patterns predict a YES and which predict a NO. It achieved 80.4% accuracy in cross-validation with an AUC of 0.881 — meaning that, given a random YES outcome and a random NO outcome from the history, it ranks the YES higher 88% of the time.
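For readers who want to reproduce something similar, here is a minimal sketch of how a model of this shape could be trained and scored with scikit-learn. This is not our production pipeline; the file paths and hyperparameters are illustrative, and X/y stand in for the 3,144-row, 43-feature dataset described above.

```python
# Minimal sketch of the frequency-model layer (illustrative, not our
# production code). X: (n_samples, 43) feature matrix; y: binary labels
# (1 = keyword was mentioned that quarter).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.load("features.npy")  # hypothetical artifact paths
y = np.load("labels.npy")

model = RandomForestClassifier(n_estimators=500, random_state=42)

# Cross-validated accuracy and AUC, the two metrics quoted above.
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"accuracy={acc:.3f}  AUC={auc:.3f}")
```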
We tested calibration across every probability bucket. When the model says there’s an 80% chance, the actual hit rate is approximately 90%. When it says 20%, the hit rate is approximately 10%. If anything, the model is slightly underconfident at the extremes; the probabilities are trustworthy at face value.
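A bucket-level calibration check like the one described can be sketched in a few lines, assuming out-of-fold probabilities from the model above:

```python
# Sketch of the calibration check: bucket out-of-fold probabilities and
# compare each bucket's mean prediction to its realized hit rate.
from sklearn.model_selection import cross_val_predict

probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

for lo in (0.0, 0.2, 0.4, 0.6, 0.8):
    hi = lo + 0.2
    mask = (probs >= lo) & (probs < hi)
    if mask.any():
        print(f"{lo:.0%}-{hi:.0%}: predicted {probs[mask].mean():.0%}, "
              f"actual {y[mask].mean():.0%}, n={mask.sum()}")
```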
This layer alone is better than gut feel. But it has a ceiling. It counts mentions without understanding them. A keyword with a six-quarter hit streak gets a high probability regardless of whether those mentions were enthusiastic or dismissive.
Layer 2: The LLM Context Classifier
This is where we’ve improved the model after our miss on the Domino’s call.
Any high school math student can probably tell you that just counting the frequency of an event tells you nothing about the context in which it occurred.
To account for this, for every keyword mention in the last four to eight quarters of transcripts, we extract the surrounding passage — roughly 200 words of context — and send it to an LLM. The LLM reads the actual language and classifies each mention into one of four categories:
Structural — the keyword is part of the company’s core narrative. Management brings it up unprompted as a strategic priority. These mentions persist because they’re embedded in how the company explains itself. If a strike is classified as structural, it’s considered a very high-conviction YES.
Promotional — management is actively pushing a product name or initiative. They’re selling the story. These mentions persist as long as the product is new and growing, but they have a shelf life. In hindsight, Craveable would have been classified as promotional, and if we’d had the new model at the time, our confidence score would have been adjusted accordingly.
Reactive — management is responding to an analyst question or external event. They didn’t choose to say this word. These mentions are fragile — once the catalyst passes, they disappear. A good example of a reactive keyword is SNAP’s “Family Center.”
Legacy — a backward-looking reference. The keyword appears while management discusses transitions or history. The word is in the transcript, but the conversation has moved on.
Each mention also receives an intensity score from 1 to 3. A passing reference scores 1. A full paragraph scores 2. A major topic that dominates a section of the call scores 3.
The distribution of these categories — and how that distribution shifts over time — reveals whether a keyword’s frequency is sustainable or an illusion.
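To make the classification step concrete, here is a hedged sketch of what a single classification call might look like using the Anthropic Python SDK. The prompt wording and model id are illustrative assumptions, not our production prompt:

```python
# Hedged sketch: classify one keyword mention into a category and an
# intensity score. Prompt text and model id are illustrative assumptions.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify_mention(keyword: str, passage: str) -> dict:
    prompt = (
        f"The passage below is from a company earnings call. Classify the "
        f"mention of '{keyword}' as one of: structural, promotional, "
        f"reactive, legacy. Rate intensity 1-3 (1 = passing reference, "
        f"3 = major topic). Reply with bare JSON: "
        f'{{"category": ..., "intensity": ...}}\n\nPassage:\n{passage}'
    )
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.content[0].text)  # assumes a bare-JSON reply
```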
The Killer App: Outage
This strike is the best demonstration of why the context layer exists.
The frequency model said 85%. CrowdStrike had mentioned “outage” in six consecutive quarters following the July 2024 incident. The base rate was high, the streak was long, and the recency score was strong. Any trader looking at the numbers alone would price this above 80¢.
The context classifier told a different story. Across 12 classified mentions in the four most recent quarters, 92% were reactive and 8% were legacy. Zero structural. Zero promotional. Average intensity was 1.0 — the minimum possible. Every single mention was an analyst asking a question and management giving a one-sentence answer. Nobody was choosing to talk about the outage. They were being asked about it, and the answers were getting shorter.
The volume trend confirmed the fade: four mentions in Q1, five in Q2, down to just one in Q3.
The context-adjusted fair value was 45¢. The market had already priced it at similar levels. Outage did not hit on the March 3 call. The frequency model would have lost money buying YES at any price above 45¢. The context layer prevented that mistake.
Performance by Conviction Tier
The model’s accuracy correlated directly with conviction level:
Locks above 90¢ — two trades, both correct. Falcon Flex and Consolidation were identified as near-certainties based on structural keyword profiles. Both hit.
Strong YES between 60-89¢ — four trades, all correct. AWS, Hyperscaler, Nvidia, and Acquisition were identified as likely mentions with varying degrees of promotional vs. structural support. All hit.
Speculative zone between 30-50¢ — three strikes in this range (Microsoft, Shadow, Outage). Shadow and Outage were correctly called. Microsoft was the miss.
Strong NO below 30¢ — four trades, three correct. China, Generative AI, Signal, and Hack. Signal was the miss.
The pattern is clear: when frequency and context agree, the model is nearly perfect. When they diverge or when the probability falls into the speculative middle, variance increases. The takeaway for trading is straightforward — size your bets by conviction tier.
What We Got Wrong and Why
Microsoft was priced at 30¢ in our model. It hit. The context classifier correctly identified that Microsoft mentions had been declining for quarters — from 5 mentions per call down to zero for two consecutive quarters, with 100% promotional classification. The competitive win narrative against Microsoft had run its course.
What neither layer could see was that on February 18, CrowdStrike and Microsoft announced a major Azure Marketplace partnership. This injected a completely new narrative — not competition, but collaboration — that management was eager to discuss on the call. Our model saw a dead keyword. Reality saw a resurrected one.
Had we published a CrowdStrike call analysis based on the new model, we would have taken this partnership into consideration and advised ‘no-action.’
Signal was priced at 20¢. It hit. The model saw scattered mentions with no pattern across eight quarters. What it couldn’t know was that CrowdStrike’s $740 million acquisition of SGNL in January 2026 turned “signal” from a generic word into a product-specific term embedded in their identity security pitch.
Both misses share the same root cause: discrete corporate events that fundamentally change a keyword’s relationship to the business. These are not predictable from historical transcripts. They require a real-time news and event monitoring layer, which is why we would have advised ‘no-action’ or a potential lean toward YES.
Complete Technical Architecture
Training Data Construction
The frequency model was trained on a dataset of 3,144 historical quarter-keyword outcomes. We began by collecting earnings call transcripts for seven companies (CRWD, GOOGL, META, AMD, COST, TSLA, GS) through a transcript API, caching results locally to avoid redundant calls.
For each company, we defined a basket of keywords matching actual Kalshi contract strikes. The full basket covered 102 unique ticker-keyword pairs across 93 distinct keywords. We filtered to 75 pairs in the 5-95% historical hit rate range — pairs outside that range provide almost no variance for the model to learn from. A keyword that appears 100% of the time or 0% of the time teaches the classifier nothing.
Each row in the training set represents one historical quarter for one keyword. The label is binary: did the keyword appear on the call (1) or not (0). All features are computed using only data available before the target quarter — no future leakage.
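In sketch form, row construction looks like the following. The helper names are ours for illustration; the key property is that features for quarter t are computed only from quarters before t:

```python
# Sketch of leakage-free training-row construction. `quarters` is a
# chronological list of transcript strings; helper names are hypothetical.
def hit(transcript: str, keyword: str) -> int:
    return int(keyword.lower() in transcript.lower())

def extract_features(history: list[str], keyword: str) -> list[float]:
    hits = [hit(q, keyword) for q in history]
    return [
        sum(hits) / len(hits),               # overall hit rate
        sum(hits[-4:]) / min(4, len(hits)),  # last-4-quarter hit rate
        float(hits[-1]),                     # hit on most recent call
    ]

def build_rows(quarters: list[str], keyword: str) -> list[tuple]:
    rows = []
    for t in range(1, len(quarters)):
        features = extract_features(quarters[:t], keyword)  # past data only
        label = hit(quarters[t], keyword)                   # target quarter
        rows.append((features, label))
    return rows
```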
The 43 Features
The feature extractor computes features across nine groups:
Frequency features capture raw and windowed hit rates: overall hit rate, hits in the last 2, 4, and 8 quarters, and the rate over those windows. These are the baseline signals that any manual analysis approximates by eye.
Recency-weighted features apply exponential decay to historical outcomes. We compute three variants with decay factors of 0.75, 0.85, and 0.9, plus a piecewise weighting that gives the most recent quarter 4x the weight of quarters older than 8. The top-performing single feature in the model is the 0.75 exponential recency weight, which correctly captures rapid regime changes (see the sketch after this list).
Streak features track whether the keyword is currently on a hit or miss streak, the length of that streak, and a binary flag for whether the most recent call included the keyword.
Trend features measure the slope of a rolling hit rate, whether the trend is monotonically increasing or decreasing, and the ratio of recent to historical mention density.
Call-level features include average call word count, mention density per 1,000 words when the keyword is present, the standard deviation of mention counts, and the maximum mentions observed in any single call. These capture whether a keyword is a passing reference or a major discussion topic.
Mention count features track the average and maximum mention counts when present, the coefficient of variation (how volatile the count is), and the correlation between call length and mention presence.
Speaker features were designed to capture whether mentions come from the CEO, CFO, or other executives, but the transcript formatting did not support reliable speaker attribution. These features remained at zero and were excluded from the model. Speaker attribution depends largely on how the transcripts are published and whether they include the speaker’s title. We are actively working to make these features usable in a future version of the model.
After removing zero-variance features (8 sector/speaker columns) and filling 819 null values in the length-mention correlation feature, the final feature set contained 43 usable features.
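As promised above, here is a small sketch of the recency-weighted family. The decay factors match those quoted, and the worked example illustrates why the 0.75 weight catches regime changes that a raw hit rate misses:

```python
# Sketch of the exponential recency-weighted hit rate. `outcomes` is a
# chronological list of 0/1 quarterly outcomes, most recent last.
def recency_weighted_rate(outcomes: list[int], decay: float) -> float:
    weights = [decay ** age for age in range(len(outcomes) - 1, -1, -1)]
    return sum(w * o for w, o in zip(weights, outcomes)) / sum(weights)

# Six old hits followed by two recent misses: raw rate is 6/8 = 0.75,
# but the 0.75-decay weight drops it to roughly 0.51.
outcomes = [1, 1, 1, 1, 1, 1, 0, 0]
print(recency_weighted_rate(outcomes, 0.75))  # ~0.51
print(recency_weighted_rate(outcomes, 0.90))  # gentler decay, ~0.67
```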
LLM Context Classification
For each keyword mention in the transcript, we extract a 600-character window of surrounding text. This passage is sent to Claude’s API (Sonnet) with a classification prompt that asks for a category (structural, promotional, reactive, legacy) and an intensity score (1-3).
The quick scoring function scans the last 4 quarters by default, with an option to extend to 8 quarters for keywords with thin recent data. It aggregates classifications into a summary that includes category distribution percentages, average intensity, dominant category, and a narrative health assessment.
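Assuming each classified mention is a dict with 'category' and 'intensity' keys (as in the classifier sketch earlier), the aggregation might look like this:

```python
# Sketch of the quick-scoring aggregation: category distribution, average
# intensity, and dominant category across classified mentions.
from collections import Counter

def summarize(mentions: list[dict]) -> dict:
    counts = Counter(m["category"] for m in mentions)
    n = len(mentions)
    return {
        "distribution": {cat: counts[cat] / n for cat in
                         ("structural", "promotional", "reactive", "legacy")},
        "avg_intensity": sum(m["intensity"] for m in mentions) / n,
        "dominant": counts.most_common(1)[0][0],
    }
```

For the Outage profile quoted below, this summary would come back with dominant = "reactive" and avg_intensity = 1.0, exactly the shape the adjustment signals key on.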
The Seven Adjustment Signals
The context-adjusted probability starts from the frequency model’s output and applies seven sequential adjustments (a code sketch of the full chain follows this list):
Signal 1 — Recent silence. If the keyword had zero mentions in the most recent quarter, the probability is multiplied by 0.6. If zero mentions in the two most recent quarters, the probability is capped at 0.25. This was the single most impactful adjustment in backtesting.
Signal 2 — Volume trend (context-gated). We compare average mention count in the two most recent quarters against the two older quarters. If volume has declined by more than 70%, and the context shows legacy mentions above 15% or structural mentions below 20%, the probability is multiplied by 0.6. If volume declined by 40-70% under the same context conditions, multiplied by 0.8. If volume increased by 50% or more and current probability is below 70%, it is boosted by 30%. The context gate prevents false penalties on keywords like Falcon Flex, where raw mention volume declined from launch hype to steady-state but the narrative strengthened.
Signal 3 — Reactive dominance. If 70% or more of classified mentions are reactive, the probability is halved. If 50-70% are reactive, it is multiplied by 0.7. Reactive mentions depend on external catalysts that may not recur.
Signal 4 — Legacy migration. If 40% or more of mentions are legacy, multiply by 0.7. If 25-40% are legacy, multiply by 0.85. Legacy mentions indicate the keyword is being discussed in the past tense.
Signal 5 — Structural boost. If 70% or more of mentions are structural and average intensity is 1.5 or above, multiply by 1.1 (capped at 0.99). Structurally embedded keywords have the highest persistence.
Signal 6 — Thin data penalty. If total mentions across all scanned quarters are 2 or fewer and the probability exceeds 40%, multiply by 0.8. Small sample sizes inflate confidence.
Signal 7 — Intensity fade. If average intensity is 1.0 or below (minimum — passing mentions only) and the probability exceeds 50%, multiply by 0.8. Low-intensity mentions signal a keyword hanging on by a thread.
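Here is the hedged sketch promised above: the whole chain as a single function. The threshold values come from the list; the summary fields (quarters_silent, volume_decline, total_mentions) are hypothetical names for quantities our pipeline precomputes.

```python
# Sketch of the seven sequential adjustments. `p` is the frequency-model
# probability; `s` is a context summary (see the aggregation sketch) plus
# precomputed volume fields. Field names are hypothetical.
def context_adjust(p: float, s: dict) -> float:
    d = s["distribution"]
    # Signal 1: recent silence
    if s["quarters_silent"] >= 2:
        p = min(p, 0.25)
    elif s["quarters_silent"] == 1:
        p *= 0.6
    # Signal 2: volume trend, gated on context
    gated = d["legacy"] > 0.15 or d["structural"] < 0.20
    decline = s["volume_decline"]  # fractional drop, recent 2Q vs. older 2Q
    if decline > 0.70 and gated:
        p *= 0.6
    elif 0.40 <= decline <= 0.70 and gated:
        p *= 0.8
    elif decline <= -0.50 and p < 0.70:  # volume up 50%+
        p *= 1.3
    # Signal 3: reactive dominance
    if d["reactive"] >= 0.70:
        p *= 0.5
    elif d["reactive"] >= 0.50:
        p *= 0.7
    # Signal 4: legacy migration
    if d["legacy"] >= 0.40:
        p *= 0.7
    elif d["legacy"] >= 0.25:
        p *= 0.85
    # Signal 5: structural boost
    if d["structural"] >= 0.70 and s["avg_intensity"] >= 1.5:
        p = min(p * 1.1, 0.99)
    # Signal 6: thin data penalty
    if s["total_mentions"] <= 2 and p > 0.40:
        p *= 0.8
    # Signal 7: intensity fade
    if s["avg_intensity"] <= 1.0 and p > 0.50:
        p *= 0.8
    return p
```

For example, a keyword with 92% reactive mentions triggers the Signal 3 halving: 0.848 × 0.5 = 0.424, which then sits below the 50% trigger for the intensity fade.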
Strike-by-Strike Context Classifications
Falcon Flex — 20 mentions across 4 quarters. Category split: 45% structural, 55% promotional. Intensity: 2.1 (highest on the board). Quarter-by-quarter: Q4 FY2025 was 60% structural, shifting to 60% promotional by Q2 FY2026, then back to 60% structural in Q3 FY2026. This promotional-to-structural migration pattern indicates a product transitioning from launch narrative to permanent vocabulary. No adjustment applied — frequency model’s 96.9% stands. Fair value: 95¢.
Consolidation — 17 mentions across 4 quarters. Category: 88% structural, 12% reactive, zero legacy. All four quarters showed structural mentions. Q2 FY2026 was 100% structural across all 5 classified passages. Structural boost applied: 96.4% → 99%. Fair value: 93¢.
AWS — 11 mentions across 3 of 4 quarters (silent in Q2). Category: 64% promotional, 18% structural, 18% reactive. Most recent quarter was 100% promotional at intensity 2 — management actively namedropping the partnership. One silent quarter prevents a boost but the promotional energy is strong. No net adjustment. Fair value: 75¢.
Hyperscaler — 8 mentions across all 4 quarters. Category: 50% structural, 25% promotional, 12% legacy, 12% reactive. The structural foundation represents real revenue from hyperscaler customers. One legacy mention in Q2 didn’t repeat. No net adjustment. Fair value: 73¢.
Nvidia — 7 mentions across 3 of 4 quarters. Category: 100% promotional, intensity 1.9. CrowdStrike promoting the Nvidia AI security partnership. No structural anchor means this could vanish if management finds a new partnership to highlight, but recent momentum is strong. No net adjustment. Fair value: 68¢.
Acquisition — 14 mentions across 3 of 4 quarters (silent in Q1). Category: 29% structural, 29% promotional, 7% reactive, 36% legacy — the highest legacy rate on the board. Legacy penalty applied: 80.2% → 68.2%. The keyword is transitioning from active narrative to historical references. Fair value: 60¢.
Outage — 12 mentions across all 4 quarters. Category: 92% reactive, 8% legacy, zero structural, zero promotional. Intensity: 1.0 (minimum). Volume declining from 4-5 mentions to 1. Reactive penalty applied: 84.8% → 42.4%; the intensity fade never fired because the halved probability already sat below its 50% trigger. This was the model’s signature call — the frequency model was dangerously wrong, and the context layer corrected it. Fair value: 45¢.
Shadow — 2 mentions in 2 most recent quarters (emerging). Category: 50% structural, 50% promotional, intensity 2.0. New product or capability entering the vocabulary. Too thin for confidence in either direction. No adjustment from thin base. Fair value: 40¢.
Generative AI — 6 mentions across 3 of 4 quarters, with the most recent quarter silent. Category: 50% structural, 17% promotional, 33% reactive. The silence penalty fired: 51.8% → 31.1%. Management absorbed the concept into “agentic AI” without using the specific Kalshi strike phrase. Fair value: 35¢.
Microsoft — 2 mentions across 2 of 4 quarters, with 2 consecutive quarters of silence. Category: 100% promotional, zero structural. Double silence penalty applied: 57.5% → 15%. The competitive win narrative had clearly run its course by Q2 FY2026. Fair value: 30¢.
China — 2 mentions across 2 of 4 quarters. Category: 100% structural, intensity 2.0. Both mentions related to nation-state threat landscape. Structural quality is high but frequency is roughly annual. Small structural boost partially offset by thin data. Fair value: 28¢.
Signal — 1 mention across 4 quarters, with the most recent quarter silent. Category: 100% promotional, intensity 1.0. Silence penalty applied: 30.8% → 18.5%; the thin-data penalty never fired because the probability was already below its 40% trigger. Scattered appearances with no clustering or trend. Fair value: 20¢.
Hack — 1 mention across 4 quarters (only Q3 FY2026). Category: 100% structural, intensity 2.0. CrowdStrike is a cybersecurity company but actively avoids the word “hack” in favor of sanitized terms like “threat actor” and “adversary.” 48 quarters of history show near-zero usage. Thin data keeps it flat. Fair value: 12¢.
First Strike Research publishes data-driven analysis of prediction markets, short equity research, and investigative financial journalism.
Not investment advice. Trade at your own risk.