Two days ago I noticed something. The financial data backlog on invest-like - the queue of tickers waiting to be re-enriched from Financial Modeling Prep - had been sitting at ~4,500 rows for what felt like forever. I'd told myself the FMP backfill cron was clearing about 400 stocks a day, on track for a full refresh every month.
The actual number was 370. Not 400.
That 30-a-day delta doesn't sound like much. But the cron was supposed to run four times a day, processing 100 tickers each. Four hundred. I'd done that math in my head a dozen times. Where were the missing 30?
It turned out the cron was timing out. Mid-slice. Quietly. Every single run.
What the cron does
The cron is a few hundred lines of TypeScript that pull fresh data from FMP's /stable endpoints, one ticker at a time. Fourteen endpoints per ticker: profile, ratios TTM, key metrics, income / balance / cash flow statements (annual + quarterly), 5-year price history, insider trades, analyst grades, DCF. The whole lot gets transformed and upserted into Supabase across five tables.
Four times a day, Vercel hits the route at /api/agents/fmp-fresh-backfill/?chunk=100. The route picks the 100 oldest-synced tickers and processes them.
Or, well - picks the 100 oldest-synced tickers and tries to process them. Tries is doing a lot of work in that sentence.
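As a rough illustration of the "100 oldest-synced" selection, here is a minimal sketch. The row shape, the field names, and the choice to treat never-synced tickers as the oldest are all assumptions; the real route presumably does this ordering in its Supabase query rather than in memory.

```typescript
// Hypothetical row shape; the real table and column names are not shown in this post.
interface TickerRow {
  symbol: string;
  lastSyncedAt: string | null; // ISO timestamp; null means never synced
}

// Returns the `chunk` rows with the oldest sync time, never-synced rows first.
function pickOldestSynced(rows: TickerRow[], chunk: number): TickerRow[] {
  const sortKey = (row: TickerRow): number =>
    row.lastSyncedAt === null ? Number.NEGATIVE_INFINITY : Date.parse(row.lastSyncedAt);

  return [...rows]
    .sort((a, b) => {
      const ka = sortKey(a);
      const kb = sortKey(b);
      // Explicit comparison avoids NaN from subtracting two -Infinity keys.
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    })
    .slice(0, chunk);
}
```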
The math I forgot to do
Here's the math I should have done in February when I tuned chunk=100:
- FMP Starter plan: 250 calls per minute, hard ceiling.
- Per ticker: 14 API calls.
- Module-level rate gate in the FMP client: 250ms between request starts. This gives a sustained 240 calls/min - just under the limit, with a small safety margin.
- Chunk of 100 tickers = 1,400 API calls.
- 1,400 calls × 250ms = 350 seconds of API time.
Vercel function timeout: 300 seconds.
So every chunk-of-100 run was hitting 300s, getting killed by the runtime, and committing whatever it had processed up to that point. Empirically, that was ~70-85 tickers per run, depending on FMP latency variance and Supabase upsert speed. The naive math says 100 should fit. The real math, including the timeout, says 85 is the absolute ceiling: 300s ÷ (14 × 250ms) ≈ 85.7 tickers, and that is before any time spent on upserts.
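The gate and the arithmetic above can be sketched together. This is not the actual FMP client: `RateGate`, `apiTimeMs`, and `maxChunkWithinTimeout` are hypothetical names, and the sketch counts only gate-imposed spacing, ignoring response latency and Supabase upsert time.

```typescript
// Numbers quoted in the post; every identifier below is hypothetical.
const CALLS_PER_TICKER = 14;
const GAP_MS = 250; // minimum spacing between request starts
const FUNCTION_TIMEOUT_MS = 300_000;

// A module-level gate that spaces request *starts* at least gapMs apart.
class RateGate {
  private nextSlotMs = 0;
  private readonly gapMs: number;

  constructor(gapMs: number) {
    this.gapMs = gapMs;
  }

  // Reserves the next free start slot and returns how long the caller must wait.
  reserve(nowMs: number): number {
    const startMs = Math.max(nowMs, this.nextSlotMs);
    this.nextSlotMs = startMs + this.gapMs;
    return startMs - nowMs;
  }

  // Call before each upstream request.
  async wait(): Promise<void> {
    const delayMs = this.reserve(Date.now());
    if (delayMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Lower bound on run time for a chunk: every call occupies one gate slot.
function apiTimeMs(chunk: number): number {
  return chunk * CALLS_PER_TICKER * GAP_MS;
}

// Largest chunk whose gate-imposed API time still fits inside the timeout.
function maxChunkWithinTimeout(timeoutMs: number = FUNCTION_TIMEOUT_MS): number {
  return Math.floor(timeoutMs / (CALLS_PER_TICKER * GAP_MS));
}

console.log(apiTimeMs(100) / 1000); // 350
console.log(apiTimeMs(100) <= FUNCTION_TIMEOUT_MS); // false
```

Reserving the slot synchronously in `reserve` means concurrent callers each get their own start time instead of all waking up at once.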
I'd built the schedule, run it once, eyeballed the result, and shipped it.
The fix
Two changes:
Right-size the chunk. Drop to chunk=80. At 240 calls/min that's 14 × 80 × 250ms = 280s of API time, well inside the 300s budget with margin for variance. Single-cron throughput is now actually 80, not "maybe 75-85, depending on the wind."
Triple the cadence. Switch from 4 crons/day to 12 crons/day - one every two hours. Vercel Pro lets you schedule up to 40 cron jobs per project; I had 16 total, this brought it to 24. Each cron is serial (its own function invocation), so they don't fight for the same rate budget - they just step through, one slot at a time.
12 crons × 80 tickers = 960 tickers/day theoretical. ~900/day measured after the inevitable failed-ticker attrition.
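For illustration, the new schedule and its theoretical throughput can be generated like this. It is a sketch under assumptions: one cron entry per slot, each firing at minute zero of an evenly spaced hour, with the route path taken from earlier in the post and the chunk changed to 80. The `path` and `schedule` fields follow the shape of the `crons` array in vercel.json; `buildBackfillCrons` itself is a hypothetical helper.

```typescript
// Shape of one entry in the `crons` array of vercel.json.
interface CronEntry {
  path: string;
  schedule: string; // standard five-field cron expression
}

// Builds evenly spaced daily cron entries for the backfill route.
function buildBackfillCrons(runsPerDay: number, chunk: number): CronEntry[] {
  if (!Number.isInteger(runsPerDay) || runsPerDay < 1 || 24 % runsPerDay !== 0) {
    throw new Error("runsPerDay must divide 24 evenly");
  }
  const stepHours = 24 / runsPerDay;
  return Array.from({ length: runsPerDay }, (_, i) => ({
    path: `/api/agents/fmp-fresh-backfill/?chunk=${chunk}`,
    schedule: `0 ${i * stepHours} * * *`, // minute 0 is an assumption
  }));
}

// Upper bound on tickers per day, assuming every run completes its full chunk.
function theoreticalTickersPerDay(runsPerDay: number, chunk: number): number {
  return runsPerDay * chunk;
}

const crons = buildBackfillCrons(12, 80);
console.log(crons.length); // 12
console.log(crons[1].schedule); // "0 2 * * *"
console.log(theoreticalTickersPerDay(12, 80)); // 960
```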
I shipped both changes on a Sunday afternoon. By the next morning, the live status endpoint was reporting:
{
  "never_synced": 3155,
  "synced_last_24h": 1339,
  "days_to_clear": 2.36
}
1,339 sync events in the last 24 hours. The 4,500-row backlog - which I'd been telling myself would take "about two weeks" to clear - was actually going to be empty in 2.4 days.
Real throughput beat the predicted 960/day because some chunks finished faster than budget (FMP's median latency is lower than the 95th percentile I tuned against), so the 12-cron stack overlaps less and gets more work done per day than the napkin math.
What I learned
Run the timeout math, even when it feels redundant. I'd done the rate-limit math. I'd done the function-budget math. I'd not done the multiplication of the two against the chunk size. That's where the bug lived: in the product of two numbers I'd each verified individually.
Cron timeouts are silent. Vercel kills the function, logs nothing the developer dashboard surfaces by default, and the cron continues firing on schedule like nothing's wrong. No alert. No "your job exited early" email. The backlog just... doesn't move as fast as it should.
The fix to that was a 170-line read-only status endpoint at /api/agents/fmp-fresh-status/ that I now hit every morning. One Bearer-authed GET returns the never-synced count, the last-1h / last-6h / last-24h throughput, the timestamp of the most recent sync, and a days_to_clear projection. If the projection drifts above 5, I know the cron is timing out again without having to babysit Vercel logs.
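A sketch of the computation behind such a status payload, not the actual 170-line endpoint. The `never_synced`, `synced_last_24h`, and `days_to_clear` fields match the JSON shown earlier; `synced_last_1h`, `synced_last_6h`, `last_sync_at`, and both function names are hypothetical. It assumes the projection is simply never-synced rows divided by the last 24 hours of throughput, which is consistent with 3155 / 1339 ≈ 2.36.

```typescript
interface BackfillStatus {
  never_synced: number;
  synced_last_1h: number; // hypothetical field name
  synced_last_6h: number; // hypothetical field name
  synced_last_24h: number;
  last_sync_at: string | null; // hypothetical field name
  days_to_clear: number | null; // null when nothing synced in the last 24h
}

const HOUR_MS = 3_600_000;

// Builds the status summary from a backlog count and sync timestamps (epoch ms).
function buildStatus(neverSynced: number, syncedAtMs: number[], nowMs: number): BackfillStatus {
  const countSince = (windowMs: number): number =>
    syncedAtMs.filter((t) => t <= nowMs && nowMs - t < windowMs).length;

  const last24h = countSince(24 * HOUR_MS);
  const latestMs = syncedAtMs.reduce<number | null>(
    (max, t) => (max === null || t > max ? t : max),
    null,
  );

  return {
    never_synced: neverSynced,
    synced_last_1h: countSince(HOUR_MS),
    synced_last_6h: countSince(6 * HOUR_MS),
    synced_last_24h: last24h,
    last_sync_at: latestMs === null ? null : new Date(latestMs).toISOString(),
    // Rounded to two decimals, as in the payload shown earlier.
    days_to_clear: last24h > 0 ? Math.round((neverSynced / last24h) * 100) / 100 : null,
  };
}

// The morning check: a projection above maxDays suggests the cron is timing out again.
function looksStalled(status: BackfillStatus, maxDays: number = 5): boolean {
  if (status.days_to_clear === null) return status.never_synced > 0;
  return status.days_to_clear > maxDays;
}
```

Keeping the calculation in a pure function that takes `nowMs` as an argument makes the rolling windows easy to test without touching the database.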
Vercel Pro cron limits matter. 40 cron jobs sounds like a lot. It isn't, once you have multiple backfill paths and weekly digest jobs and content rebuilds. I'm at 24 now. There's a near-future world where I want a per-sector backfill at a higher cadence than the global one. Best to think about cron budget the same way I think about function-invocation budget - a finite resource that gets allocated by importance, not by "is this nice to have."
Tracking the right number. I'd been watching never_synced count drop. That's a lagging indicator. What I needed was synced_last_24h - the rolling throughput - and days_to_clear, the projection. Those tell you whether the worker is healthy right now, not whether it was healthy yesterday. Different teams will care about different rolling windows, but the principle is the same: optimize the lagging metric, instrument the leading one.
What I'd do differently if I started over
I'd build the status endpoint on day one, not on the day I discovered the timeout. Backfill workers are exactly the class of code where silent failures cost weeks before you notice. Five extra minutes per agent to wire a /api/agents/[name]-status/ route that returns whatever the worker promises to make true - backlog count, last successful run, throughput projection - saves a future-me from staring at a slow-moving counter and assuming the worker is fine.
I'd also commit the timeout math as a comment in the source file instead of leaving it in my head. The vercel.json cron config now has a comment block above the FMP entries explaining: if you change the chunk size, recompute 14 × chunk × 250ms and verify it's under 300s. The math has bitten me once. If a future-me with two months of distance from this problem tries to bump chunk back to 100 because "100 sounds neat," the comment is the friction layer that makes them double-check.
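A comment only helps if someone reads it. One complementary option, sketched here as an assumption rather than something the route is known to do, is a guard that refuses an oversized chunk outright. `assertChunkFitsBudget` and its parameter names are hypothetical; the defaults are the numbers from this post.

```typescript
// Throws if the gate-imposed API time for `chunk` tickers cannot fit the function timeout.
function assertChunkFitsBudget(
  chunk: number,
  callsPerTicker: number = 14,
  gapMs: number = 250,
  timeoutMs: number = 300_000,
): void {
  const neededMs = chunk * callsPerTicker * gapMs;
  if (neededMs > timeoutMs) {
    throw new Error(
      `chunk=${chunk} needs ${neededMs / 1000}s of API time, ` +
        `but the function timeout is ${timeoutMs / 1000}s`,
    );
  }
}

assertChunkFitsBudget(80); // passes: 280s of API time against a 300s timeout
```

Called at the top of the route handler, a check like this would turn a silent timeout into an immediate, visible failure.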
And I'd watch the leading indicator from day one. Not the dropping backlog. The flowing throughput. The dropping backlog only catches you when it's stopped dropping.
The whole change shipped Sunday afternoon as part of a broader internal-link mesh + SEO sprint - full notes are on the /changelog (v70). If you're running a similar Vercel-cron-fed worker against a rate-limited upstream, here's the question worth asking: how much of your function budget is actually getting used per run, vs how much is sitting unused because the chunk size doesn't divide cleanly into the timeout window? Mine was 17% under-allocated for two months. Worth checking yours.