Detecting a 0.005% regression means detecting that a 20s task now takes 20.001s.
It's not even easy to reliably detect such a small performance regression on a single thread on a single machine.
I suppose in theory having multiple machines could actually improve the situation, by letting them average out the noise? But on the other hand, it's not like you have identically distributed samples to work with - workloads have variance over time and space, so there's extra noise that isn't uniform across machines.
Color me a little skeptical, but it's super cool if actually true.
The .005% is a bit of a headline-grabber for sure, but the idea makes sense given the context: they're monitoring large, multi-tenant/multi-use-case systems with very large amounts of diverse traffic. In those cases the regression may be .005% of the overall cost, but you don't detect it at that level; you detect it as a 0.5% regression in a use case that accounts for 1% of the cost. They can and do slice data in various ways (group by endpoint, group by function, etc.) to improve the odds of detecting a regression.
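A rough sketch of that kind of slicing, on synthetic data with made-up numbers ("feed" and "search" are hypothetical endpoints, and this is not the paper's actual pipeline): a 0.5% slowdown confined to a slice that's 1% of traffic is invisible in the aggregate but obvious per slice.

```python
# Hypothetical sketch, not the paper's pipeline: slice per-request CPU samples
# by endpoint and compare two windows per slice. Endpoint names, traffic mix,
# and noise levels are all invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def sample_window(regressed: bool) -> pd.DataFrame:
    # 99% of traffic hits "feed", 1% hits "search"; a 0.5% slowdown confined
    # to "search" is therefore only ~0.005% of total CPU.
    n = 2_000_000
    endpoint = rng.choice(["feed", "search"], size=n, p=[0.99, 0.01])
    cpu_ms = rng.normal(10.0, 0.5, size=n)
    if regressed:
        cpu_ms[endpoint == "search"] *= 1.005
    return pd.DataFrame({"endpoint": endpoint, "cpu_ms": cpu_ms})

base, cur = sample_window(regressed=False), sample_window(regressed=True)

# Aggregate view: the shift is buried in window-to-window noise.
print(f"overall delta: {cur.cpu_ms.mean() / base.cpu_ms.mean() - 1:+.4%}")

# Sliced view: per-endpoint relative change; "search" typically shows ~+0.5%
# while "feed" stays near 0%.
delta = cur.groupby("endpoint").cpu_ms.mean() / base.groupby("endpoint").cpu_ms.mean() - 1
print((delta * 100).round(3))
```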
> They can and do slice data in various ways
If you're looking for statistical validity, this is not the way to go about it; if you perform 20 analyses, there's too high a chance that one of them will spuriously come out significant at the 95% level (see p-hacking).
They're not trying to publish the regressions they identify; they're trying to identify when regressions creep in so they can be fixed. Half the paper talks about all the ways they filter signals into things which are likely actionable. Sure, they'll still get some wrong, but as long as the precision is high enough for engineers to treat the regression reports seriously and the recall is high enough that costs stay down, that's all that matters.
I understand that; I was more pointing out that "check every possible statistical test" tanks your S/N ratio by firing way too many false positives.
"...measuring CPU usage at the subroutine level rather than at the overall service level. However, if this 0.005% regression originates from a single subroutine that consumes 0.1% of the total CPU, the relative change at the subroutine level is 0.005% / 0.1% = 5%, which is much more substantial. Consequently, small regressions are easier to detect at the subroutine level."
Now think about how much money detecting 1,000 0.005% regressions saves at Meta-scale.
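Back-of-the-envelope, following the quoted excerpt (the count of 1,000 is just the number from the comment above, and simple compounding is an assumption):

```python
# Back-of-the-envelope arithmetic only. The 0.1% subroutine share comes from
# the quoted excerpt; 1,000 regressions and plain compounding are assumptions.
fleet_regression = 0.005 / 100          # 0.005% of fleet CPU
subroutine_share = 0.1 / 100            # subroutine uses 0.1% of total CPU

relative_at_subroutine = fleet_regression / subroutine_share
print(f"relative change at the subroutine level: {relative_at_subroutine:.1%}")  # 5.0%

missed = (1 + fleet_regression) ** 1000 - 1
print(f"1,000 such regressions, compounded: {missed:.1%} extra fleet CPU")  # ~5.1%
```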
You're right to be skeptical. This entire space, as far as I can tell, is filled with people who overpromise, under-deliver, and use bad metrics to claim success. If you look at their false positive and false negative sections, they perform terribly but use words to claim it's actually good, and use flawed logic to extrapolate over missing data (e.g. assuming "our rates stay the same for non-responses" rather than "people are tired of our tickets and ignore our system"). And as follow-up work their solution is to keep tuning their parameters (i.e. keep fiddling to overfit past data). You can even tell how the system is perceived: they describe people not even bothering to interact with it during 2 of the 4 high-impact incidents examined, and blame the developer for one of them because "they didn't integrate the metrics". If a system provided a meaningful cost/benefit, teams would be clamoring to adjust their processes around it. Until that's demonstrated clearly, it's dressed-up numerology.
I saw a team at Oculus try this and fail, even for highly constrained, isolated environments with repeatable workloads and a much more conservative threshold (e.g. 1-10%). This paper is advocating filtering your data all to hell, to the point of overfitting.
I don't agree. This is basically an elaborate form of statistical process control, which has been proving itself useful and effective for nearly a century. We can quibble about the thresholds and false positive rates, but I think the idea of automating regression detection is perfectly sound.
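For anyone unfamiliar with SPC, here's a minimal Shewhart-style control chart sketch on synthetic data; it's not the paper's method, just an illustration of flagging "extraordinary" variation against limits learned from a stable baseline.

```python
# Minimal Shewhart-style control chart on synthetic data; not the paper's
# method, and the numbers are invented.
import numpy as np

rng = np.random.default_rng(1)

# Daily mean CPU-per-request for a service: stable for 60 days, then a +1%
# step regression ships.
baseline = rng.normal(100.0, 0.2, size=60)
after = rng.normal(101.0, 0.2, size=30)
series = np.concatenate([baseline, after])

center = baseline.mean()
sigma = baseline.std(ddof=1)
upper, lower = center + 3 * sigma, center - 3 * sigma   # roughly 100 +/- 0.6 here

out = np.flatnonzero((series > upper) | (series < lower))
print(f"control limits: [{lower:.2f}, {upper:.2f}]")
# The +1% step lands well outside the ~0.6-wide band, so the regression is
# flagged almost as soon as it ships (around day 60).
print("first out-of-control day:", int(out[0]) if out.size else None)
```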
Statistical process control is a sound idea in theory. But when you're talking about real-world complex dynamic systems, like operating systems and processes handling random load, as opposed to things like assembly lines, it's less clear that it rests on a solid mathematical foundation. What's happening in that paper clearly isn't starting from some principled idea; it's filtering out patterns deemed "noise" and adjusting the levels of those filters until they generate "results". And if you read between the lines, it's clear the engineers this tool is supposed to support aren't really relying on it, which tells you something.
SPC gets used on complex dynamic systems all the time. It takes more work and more nuance, but it's doable. I don't see a categorical error here, it's about fine-tuning the details.
Thanks, I hadn't read that far into the paper. But I have to say I had what I feel is a good reason to be skeptical before even reading a single word of the paper, honestly: Facebook never felt so... blazing fast, shall we say, to make me believe anyone even wanted to pay attention to tiny performance regressions, let alone had the drive and tooling to do so.
> blazing fast, shall we say, to make me believe anyone even wanted to pay attention to tiny performance regressions
Important to distinguish frontend vs backend performance of course. This is about backend performance, where they care about this stuff a lot because it multiplies at scale and starts costing them real money. Frontend performance has less of a direct impact on their numbers; the only data I know of on that is the oft-cited Google stuff claiming a direct correlation between latency and lost revenue (which I haven't seen anyone else bother to replicate to see if it holds up).
> Important to distinguish frontend vs backend performance of course.
I'm not sure what you're calling frontend in this context (client side? or client-facing "front-end" servers?), but I'm talking about their server side specifically. I've looked at their API calls and understand when slowness is coming from the client or the server. The client side is even slower for sure, but the server side also never felt like it was optimized to the point where deviations this small mattered.
I think the confusion arises because of the difference between optimization and control, which are superficially similar.
Having control lets you see if things changed. Optimization is changing things.
This team seems to be focused on control. I assume optimization is left to the service teams.
I think by control you mean observability?
I get that they're different, but the whole point is optimization here. They're not gathering performance metrics just to hang them up on a wall and marvel at the number of decimal points, right? They presumably invested all the effort into this infrastructure because they think this much precision has significant ROI on the optimization side.
I'm using "control" in the statistical process control sense, where it means "we can tell if variation is ordinary or extraordinary".
To me it seemed clear that the paper is about detecting regressions, which is control under my definition above. I still think of that as distinct from optimization.
Right, but what is the point of optimizing?
It often isn't to make things go faster for an individual user (oftentimes the driving factor in latency is not computation, but inter-system RPC latency, etc.). The value is to bin-pack more request processing into the same bucket of CPU.
That can have latency wins, but it may not in a lot of contexts.
Trying to do fine-grained regression detection in a controlled environment is indeed a fool's errand, but that's the opposite of what this paper is about.
Your claim is that doing fine-grained detection in a more chaotic and dynamic environment with unrepeatable inputs is easier than fine-grained detection in a controlled environment with consistent inputs? Not sure I follow.
Yes, it often is. In a "controlled" environment you control what you control, not the stuff that you can't control or don't even know about. It's tedious and sort of a chore to set up, and a source of ongoing woes afterwards. On the other hand, natural experiments abound when you have large scale. Just having a really large N on both arms of an experiment, where the only difference is the compiled program and there are no systematic biases in traffic patterns, will reveal smaller regressions that weren't found in the controlled load tests.
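A hedged illustration of the large-N point, on synthetic per-request latencies (the 10 ms mean, 10% noise, and +0.05% effect size are assumptions, not figures from the paper):

```python
# Sketch of "really large N on both arms": with enough samples, a tiny mean
# shift stands out from per-request noise. Synthetic data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1_000_000                                      # requests per arm

control = rng.normal(10.0, 1.0, size=n)            # old binary
treated = rng.normal(10.0 * 1.0005, 1.0, size=n)   # new binary: +0.05% slower

t, p = stats.ttest_ind(treated, control, equal_var=False)
print(f"observed delta: {treated.mean() / control.mean() - 1:+.3%}, p = {p:.2g}")

# The standard error of the difference in means is sigma * sqrt(2/n) ~ 0.0014 ms,
# i.e. ~0.014% of the mean, so a +0.05% shift is ~3.5 sigma and is usually
# detected; a controlled benchmark with a few hundred runs would not see it.
```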
This is the company that famously runs their main website on ... drum roll ... PHP. I know it's not off-the-shelf PHP, and they do a lot of clever stuff with it, but to worry about single digit milliseconds when you have such a barnacle as the centerpiece of your business is hilarious.
Hack is a rather absurd local maximum of a language, but the journey there makes sense. Circa 2005, PHP was a reasonable choice for quickly iterating on a website. The language was quite amateurish, but it's a web-first language, and its edit-and-refresh development loop was better than what most languages offered.
At some point, two things happened. Any change to code, especially heavily used code, became risky because of limited typechecking, and the scale of the site was large enough that running Zend got expensive. Compiling to C++ with HipHop addressed immediate performance concerns, but it's a fragile solution. At some point, you're faced with either rewriting the codebase in something sensible or migrating in-place with a language fork and a custom VM. Facebook opted to pay the ongoing tax of maintaining its own language and having a team that supports it.
An in-house language gives you freedoms like being able to make language changes driven by business needs such as privacy, but it lacks the scrutiny of a larger community. Some features are half-baked, and it's missing features most popular languages have.
PHP 7 cherry-picked enough Hack and HHVM features that it's better to just stick with PHP. There's no broader support for Hack, and it's not clear HHVM is still faster.
The next time you're involved in language bikeshedding, remember that Facebook went all-in on PHP, and it was a good-enough choice.
HHVM is an amazingly optimized VM for their workload. That PHP runs many times faster than off-the-shelf PHP does. I suspect that waving a magic wand to rewrite it in another language on any other widely available VM would result in something that performed significantly worse.
I think most of their performance-critical infrastructure is written in native languages like C++, and even the PHP (actually Hack) parts are significantly modified and optimized.
Not only is HHVM much faster than the earlier HipHop C++ compiler attempt, they also have plenty of C++ and Rust, alongside some less mainstream languages like Haskell and Erlang.
https://engineering.fb.com/2022/07/27/developer-tools/progra...
Hack superficially looks like PHP, but is a completely different beast under the hood.
I know it's popular to hate on PHP. But when you do it right, you can turn pages out fast enough that single digit milliseconds are important. I'm not a fan of Hack or how FB structures their PHP; but some of their pages are pretty quick. Personally, with just regular PHP, my baseline is about 10 ms for a minimum page, and 50 ms for something that's complex and maybe has a few database queries. That's not the quickest thing in the world, and not all of my PHP goes that fast, but single digit milliseconds are still significant at that scale.
They are extraordinarily different languages and runtime systems.
There is no PHP at Facebook/Meta