You ran Lighthouse, hit a green 98, and shipped. Three weeks later Search Console flags the page as failing Core Web Vitals. If that has ever happened to you, you have run into the central problem with Lighthouse: the number it gives you and the number Google actually ranks on are measured in completely different worlds. Lighthouse is a useful diagnostic tool. It is a terrible scoreboard. Here is exactly where it misleads, and what to watch instead.

Lab data and field data are not the same thing

A Lighthouse run is a single synthetic page load on an emulated mid-range Android phone with throttled CPU and a simulated slow 4G connection. It happens once, in a clean environment, with no real user attached. That is lab data.

What Google uses for ranking is field data: the Chrome User Experience Report (CrUX), assembled from real Chrome users who actually visited your page. It is a 28-day rolling window, reported at the 75th percentile — meaning three out of four real visits must hit the target. The thresholds you are graded against are:

LCP (Largest Contentful Paint) ≤ 2.5 seconds
INP (Interaction to Next Paint) ≤ 200 milliseconds
CLS (Cumulative Layout Shift) ≤ 0.1

A page can score 100 in Lighthouse and still fail CrUX, and a page can score in the 60s and pass comfortably. The synthetic run does not know that 40% of your real traffic is on aging phones over patchy mobile networks, or that your visitors mostly land on a heavy archive page rather than the homepage you tested. The lab is a controlled experiment; the field is your actual audience. Only one of them affects rankings.
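To make the 75th-percentile rule concrete, here is a minimal sketch that grades a set of field samples against the thresholds above. The sample values are hypothetical, and the nearest-rank percentile is an assumption for illustration; CrUX's own aggregation may differ in detail.

```typescript
type FieldMetric = "LCP" | "INP" | "CLS";

// "Good" thresholds from the list above (LCP and INP in milliseconds).
const GOOD_THRESHOLD: Record<FieldMetric, number> = {
  LCP: 2500,
  INP: 200,
  CLS: 0.1,
};

// Nearest-rank percentile: the smallest sample that at least p% of samples do not exceed.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}

// A metric passes when its 75th-percentile value is at or below the threshold.
function passesAtP75(metric: FieldMetric, samples: number[]): boolean {
  return percentile(samples, 75) <= GOOD_THRESHOLD[metric];
}

// Hypothetical data: one clean lab run versus eight real visits.
const labLcpMs = 1300;
const fieldLcpMs = [1200, 1400, 1500, 1800, 2100, 2900, 3900, 5200];

console.log("lab passes:", passesAtP75("LCP", [labLcpMs]));
console.log("field p75 (ms):", percentile(fieldLcpMs, 75));
console.log("field passes:", passesAtP75("LCP", fieldLcpMs));
```

In this made-up data set more than half the visits are fast, yet the page still fails, because the slow quarter of visits is what the percentile picks up.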

The metric Google ranks on isn't even in your lab score

This is the single most important thing to understand. In March 2024, Google replaced First Input Delay with INP as a Core Web Vital. INP measures responsiveness across the whole visit: it times how long the page takes to visually respond after each tap, click, or key press, and reports the slowest of those interactions (on pages with many interactions, a few outliers are ignored).

A standard Lighthouse page-load run cannot measure INP. There are no real interactions during a synthetic load (nobody clicks anything), so there is nothing to time. Instead, Lighthouse reports Total Blocking Time (TBT) as a lab proxy for interactivity. TBT and INP are correlated but they are not the same, and the gap between them is exactly where sites get burned. You can drive TBT to near zero in the lab and still ship terrible INP in the field.
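The difference is easier to see side by side. The sketch below is a simplified model, not Lighthouse's or Chrome's actual implementation: TBT is taken as the sum of each main-thread task's time beyond 50 ms (Lighthouse only counts tasks in a specific window during load), and INP is approximated as the worst interaction latency after skipping one outlier per 50 interactions. All durations are hypothetical.

```typescript
// TBT (simplified): for every main-thread task, count only the part above 50 ms.
function totalBlockingTime(taskDurationsMs: number[]): number {
  return taskDurationsMs.reduce((sum, d) => sum + Math.max(0, d - 50), 0);
}

// INP (simplified): the slowest interaction, ignoring one outlier per 50 interactions.
// Returns undefined when nobody interacted, which is the lab situation.
function approximateInp(interactionLatenciesMs: number[]): number | undefined {
  if (interactionLatenciesMs.length === 0) return undefined;
  const slowestFirst = [...interactionLatenciesMs].sort((a, b) => b - a);
  const skip = Math.min(slowestFirst.length - 1, Math.floor(slowestFirst.length / 50));
  return slowestFirst[skip];
}

// Hypothetical page: every load-time task is short, so the lab sees no blocking.
const loadTasksMs = [30, 45, 40];
// Hypothetical real visit: the first tap is slow, later ones are fine.
const realInteractionsMs = [480, 90, 120];

console.log("lab TBT (ms):", totalBlockingTime(loadTasksMs));
console.log("lab INP:", approximateInp([]));
console.log("field INP (ms):", approximateInp(realInteractionsMs));
```

The two functions take different inputs. TBT only ever sees tasks that ran during the load; INP only ever sees interactions, and the lab has none.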

The "delay JavaScript until interaction" trap

Here is the most common way this happens on WordPress. Optimization plugins like WP Rocket, Perfmatters, and FlyingPress offer a feature usually called "Delay JavaScript Execution" — it holds back nearly all scripts (analytics, ad tags, chat widgets, sliders) until the user's first interaction.

In a Lighthouse run, no interaction ever happens, so none of that JavaScript executes. TBT plummets and your Performance score jumps, often by 20 or 30 points. It looks like a miracle fix.

Then a real visitor's first tap fires every deferred script at once. The main thread chokes processing all of it, and the response to that very first interaction — the one INP is most likely to record — is slow. Your lab score went up; your field INP got worse. The tool rewarded the exact behavior that hurts real users. This is not a bug in the plugins; used carefully (delaying only non-critical scripts) they help. It is a demonstration that the lab score and the ranked metric can move in opposite directions.
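A toy model shows the two numbers moving in opposite directions. It assumes each script is a single main-thread task, that delayed scripts all run ahead of the first tap's handler, and that the script names and costs (all hypothetical) are representative. Real plugins and browsers are messier than this.

```typescript
interface Script {
  name: string;
  costMs: number; // main-thread time the script needs to execute
}

interface Outcome {
  labTbtMs: number;           // blocking time seen during a load with no interaction
  firstInteractionMs: number; // latency of the first real tap
}

// Scripts that are not delayed run during load and count toward TBT (time above 50 ms each).
// Scripts that are delayed run when the first interaction arrives, in front of its handler.
function simulate(
  scripts: Script[],
  isDelayed: (s: Script) => boolean,
  handlerMs: number,
): Outcome {
  let labTbtMs = 0;
  let firstInteractionMs = handlerMs;
  for (const s of scripts) {
    if (isDelayed(s)) {
      firstInteractionMs += s.costMs;
    } else {
      labTbtMs += Math.max(0, s.costMs - 50);
    }
  }
  return { labTbtMs, firstInteractionMs };
}

// Hypothetical third-party scripts on a typical WordPress page.
const scripts: Script[] = [
  { name: "analytics", costMs: 80 },
  { name: "ads", costMs: 300 },
  { name: "chat", costMs: 220 },
  { name: "slider", costMs: 150 },
];

const noDelay = simulate(scripts, () => false, 40);
const delayAll = simulate(scripts, () => true, 40);

console.log("no delay:", noDelay);
console.log("delay everything:", delayAll);
```

Passing a predicate such as `(s) => s.name === "chat"` models the careful setup, where only one non-critical script is held back and the first tap pays for that script alone.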

The Performance score is a composite that hides the detail

The big number at the top is a weighted blend of five lab metrics. In Lighthouse 10 the weights are TBT 30%, LCP 25%, CLS 25%, and First Contentful Paint and Speed Index 10% each, so interactivity and LCP together account for more than half the score. Because it is a blend, two pages with the same score can have very different real problems: one might have great paint times but janky interactivity, the other the reverse. Worse, you can "fix the score" by improving whichever metric is cheapest to game rather than the one your users actually feel. Always scroll past the number and read the individual metrics. The score is a summary; the metrics are the truth.
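Here is a small sketch of how a blend hides the detail. It assumes the Lighthouse 10 weighting (TBT 30, LCP 25, CLS 25, FCP 10, Speed Index 10; check the scoring calculator for your version) and takes each metric's 0 to 100 sub-score as given, skipping the curve Lighthouse uses to turn raw timings into sub-scores. The two pages are hypothetical.

```typescript
type LabMetric = "FCP" | "SI" | "LCP" | "TBT" | "CLS";

// Assumed weights in percent; they sum to 100.
const WEIGHTS: Record<LabMetric, number> = {
  FCP: 10,
  SI: 10,
  LCP: 25,
  TBT: 30,
  CLS: 25,
};

// Weighted average of per-metric sub-scores (each 0 to 100), rounded to a whole number.
function performanceScore(subScores: Record<LabMetric, number>): number {
  let total = 0;
  for (const metric of Object.keys(WEIGHTS) as LabMetric[]) {
    total += WEIGHTS[metric] * subScores[metric];
  }
  return Math.round(total / 100);
}

// Hypothetical page A: paints fast, but the main thread is busy.
const jankyPage = { FCP: 100, SI: 100, LCP: 100, TBT: 50, CLS: 100 };
// Hypothetical page B: responsive, but the largest element arrives late.
const slowPaintPage = { FCP: 100, SI: 100, LCP: 40, TBT: 100, CLS: 100 };

console.log("page A:", performanceScore(jankyPage));
console.log("page B:", performanceScore(slowPaintPage));
```

Both pages land on the same overall number even though one needs JavaScript work and the other needs image or server work, which is why the individual metrics are the part worth reading.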

One run is not a measurement

Run Lighthouse three times on the same page and you will often see the score swing by 10 points or more. The simulated throttling, your machine's CPU contention, background tabs, and especially third-party scripts (ad networks and A/B tools serve different payloads each load) all introduce variance. A single run feels authoritative because it produces one tidy number, but it is one sample from a noisy distribution. If you must use the lab score to compare before/after, run it several times and look at the median — never trust a one-shot result.
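A minimal sketch of the median comparison, using made-up scores from five runs before and five runs after a change:

```typescript
// Median of a list of scores; for an even count, the mean of the two middle values.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("need at least one run");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  if (sorted.length % 2 === 1) return sorted[mid] as number;
  return ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2;
}

// Difference between the best and worst run: a rough gauge of the noise.
function spread(values: number[]): number {
  return Math.max(...values) - Math.min(...values);
}

// Hypothetical Performance scores from repeated runs on the same page.
const before = [78, 91, 84, 88, 95];
const after = [90, 82, 93, 86, 89];

console.log("before: median", median(before), "spread", spread(before));
console.log("after:  median", median(after), "spread", spread(after));
```

If the medians differ by less than the spread of either set, as they do in this invented data, the honest conclusion is that the lab cannot tell the two versions apart.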

You're probably comparing scores that aren't comparable

There are several ways to run Lighthouse, and they do not agree:

PageSpeed Insights runs on Google's servers with fixed mobile throttling (emulated mid-tier Android, slow 4G). It also shows real CrUX field data at the top — the part that actually matters.
Chrome DevTools runs locally. It applies throttling by default, but the result still depends on your machine's CPU, your network, and your browser extensions, so on a fast developer workstation scores tend to read higher than PSI.
The Lighthouse CLI and CI integrations use their own config and machine, producing yet another number.

When a developer's DevTools shows 95 and the client's PageSpeed Insights shows 68, nobody is lying — they ran two different tests. Pick one environment, ideally PSI for its field data, and stop comparing apples to oranges.

The WordPress-specific blind spots

A few more places where a clean lab score hides real-world trouble on WordPress:

Caching masks a slow host. WP Rocket, LiteSpeed Cache, and FlyingPress can make a cached page paint fast in the lab, but field LCP still leans heavily on server TTFB (Time to First Byte). Aim for TTFB under ~200ms (under 600ms at worst). On budget shared hosting, the first uncached request — which real crawlers and cold visitors hit — can be slow even when your test happened to land on a warm cache. Quality managed hosts (Kinsta, WP Engine, Cloudways, SiteGround's higher tiers) move this needle far more than any plugin setting.
Page builders add weight the lab tolerates. Elementor and Divi generate deep DOM trees and large CSS payloads. A fast test machine renders that bloat fine; a real mid-range phone does not, and it shows up as worse field LCP and occasional CLS that the single clean run didn't trigger.
Layout shift depends on what happens after the initial render. Lazy-loaded images without width and height, late-loading ad slots, and cookie banners cause CLS that often only fires once a real visitor scrolls or once late content arrives, which a synthetic load never waits around for.
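To see why a single clean load can under-report CLS, here is a sketch of how the metric is accumulated. It follows the published definition as I understand it: shifts are grouped into session windows (a new window starts after a 1 second gap or once a window reaches 5 seconds), shifts flagged as following recent user input are excluded, and CLS is the largest window total. The shift entries are hypothetical.

```typescript
interface Shift {
  startTime: number;       // milliseconds since navigation start
  value: number;           // layout shift score of this single shift
  hadRecentInput: boolean; // true when the shift closely follows a click, tap, or key press
}

// CLS is the largest sum of shift scores within any one session window.
function cumulativeLayoutShift(shifts: Shift[]): number {
  let worst = 0;
  let windowValue = 0;
  let windowStart = 0;
  let previousStart = 0;
  let windowOpen = false;

  for (const s of [...shifts].sort((a, b) => a.startTime - b.startTime)) {
    if (s.hadRecentInput) continue; // expected shifts after input do not count
    const startsNewWindow =
      !windowOpen ||
      s.startTime - previousStart >= 1000 ||
      s.startTime - windowStart >= 5000;
    if (startsNewWindow) {
      windowValue = 0;
      windowStart = s.startTime;
      windowOpen = true;
    }
    windowValue += s.value;
    previousStart = s.startTime;
    worst = Math.max(worst, windowValue);
  }
  return worst;
}

// Hypothetical lab load: one small shift while fonts swap, then the run ends.
const labShifts: Shift[] = [{ startTime: 800, value: 0.02, hadRecentInput: false }];

// Hypothetical real visit: the same shift, then an ad slot and a lazy image
// push content down while the visitor scrolls, then a shift right after a click.
const fieldShifts: Shift[] = [
  ...labShifts,
  { startTime: 6200, value: 0.09, hadRecentInput: false },
  { startTime: 6900, value: 0.06, hadRecentInput: false },
  { startTime: 9000, value: 0.2, hadRecentInput: true },
];

console.log("lab CLS:", cumulativeLayoutShift(labShifts));
console.log("field CLS:", cumulativeLayoutShift(fieldShifts).toFixed(2));
```

Note that the large shift right after the click is ignored. It is the scroll-time shifts from the ad slot and the unsized image that push this invented visit over the 0.1 threshold.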

What to actually use as your KPI

None of this means ignore Lighthouse. It means use it for what it is good at and stop treating the score as the goal.

Treat field data as the real KPI. Watch the Core Web Vitals report in Google Search Console and the field section at the top of PageSpeed Insights. That is the data tied to ranking and to real users.
Measure INP from real users. Install the web-vitals JavaScript library or a RUM tool so you see INP, LCP, and CLS from your actual audience, not a proxy. Many analytics setups can now report these directly.
Use Lighthouse's diagnostics, not its grade. The "Opportunities" and "Diagnostics" sections — render-blocking resources, oversized images, unused CSS, long main-thread tasks — are genuinely useful for debugging why a metric is slow.
Validate fixes in the field. After a change, give CrUX its rolling window before you call it a win. A lab score that improves instantly tells you almost nothing about what your users will feel four weeks from now.
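If you go the web-vitals route, the wiring is short. The sketch below keeps the browser-only part in comments so it runs anywhere: `toBeaconBody` is a hypothetical helper, `/vitals` is a hypothetical collection endpoint you would have to provide, and the `VitalSample` shape is a subset of the metric object the library passes to its callbacks.

```typescript
// Subset of the fields web-vitals reports for each metric.
interface VitalSample {
  name: string;    // "INP", "LCP" or "CLS"
  value: number;   // milliseconds for INP and LCP, unitless for CLS
  id: string;      // unique per metric instance, useful for de-duplication
  rating?: string; // "good", "needs-improvement" or "poor"
}

// Hypothetical helper: turn one sample into a compact JSON body for your own endpoint.
function toBeaconBody(sample: VitalSample, pagePath: string): string {
  const value =
    sample.name === "CLS"
      ? Math.round(sample.value * 1000) / 1000 // keep three decimals for CLS
      : Math.round(sample.value);              // whole milliseconds for timings
  return JSON.stringify({
    metric: sample.name,
    value,
    id: sample.id,
    rating: sample.rating ?? "unknown",
    page: pagePath,
  });
}

// Browser wiring, commented out so this sketch runs without the package installed:
//
// import { onCLS, onINP, onLCP } from "web-vitals";
// const send = (m: VitalSample) =>
//   navigator.sendBeacon("/vitals", toBeaconBody(m, location.pathname));
// onCLS(send);
// onINP(send);
// onLCP(send);

// Example with a made-up sample from a slow first tap on an archive page.
const example = toBeaconBody(
  { name: "INP", value: 312.4, id: "v-123", rating: "needs-improvement" },
  "/archive/",
);
console.log(example);
```

Recording the page path alongside each value lets you compute a 75th percentile per template later, which is how you find out that the archive page, not the homepage you tested, is the one failing.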

Chase the green number and you will optimize for a robot that never clicks anything. Chase your field Core Web Vitals and you will optimize for the people Google is actually measuring — which is the only audience that moves your rankings.

Why Lighthouse Scores Can Mislead You