Netflix Engineering is probably far ahead of any competing streaming service, but I wonder what the ROI is on that effort and cost. As a user, I don’t see much difference in reliability between Netflix, Disney, HBO, Hulu, Peacock and Paramount+. They all error out every now and then. Maybe 5-10 years ago you needed to be much more sophisticated because of lower bandwidth and less mature tech. Ultimately, the only real difference that makes me go for one service over the other is the content.
All the money they spend still fails at basic UX, creating a finely crafted, polished turd of engineering perfection. I don't see the Netflix experience as superior in any way to the other big services; they all have their own issues that engineering hasn't solved properly, most likely just different ones.
An example (in this case ONE PIECE on the Netflix Japan region): my Netflix language is set to English, but I am not in an English-speaking country. All shows have an English title and description, and the episode lists and their descriptions are in English as well. Yet the program itself is not in English, and it has neither English audio options nor English subtitles. The UI does not indicate whether there are English subtitles until I play an episode and open the subtitles menu. This is made even worse by the many shows that are only half subtitled (different rights holders, I assume).

Why does the UI lie to me by showing everything about the series/movie in my native language except the actual content itself? This is really common in non-English-speaking regions, and it looks like a basic engineering failure: looking up the global catalog metadata in MY language while ignoring whether the actual CONTENT is available in my language. I suppose this could be a UX issue, but it also looks like an engineering one to me. And aren't those intertwined to some extent anyway?
Netflix has far better UI/UX and features than its competitors. I'm sure your specific issue is a legit one, but Netflix as a whole is much better.
I don't know if Netflix has caught up recently, but during the pandemic Disney+'s group viewing was a very good feature with good UX, and I haven't seen anything like it on other services.
Netflix Party launched in March 2020 for that purpose.
I agree with your example and that this functionality sucks. However, it cannot be an engineering issue, as it would be technically very easy to implement in a better way. There must be a product/UX/marketing kind of decision behind it, or just plain ignorance.
This. There are no technical limitations preventing an A-Z catalog experience; it's all decided by that VP of user experience who wants to squeeze the maximum screen/device time out of each user.
Am I really the only one noticing that, each year, our collective UX across consumer (SaaS) software gets worse?
I worked at HBO Max when it launched. Fewer than 300 engineers, backend services written in either Node or Go. Mostly plain REST calls between services; fancier stuff was only used when there was a performance bottleneck.
The engineering was very thoughtful and the internal tooling was really good.
One underappreciated aspect of simply engineered systems is that they are simple to think about, simple to scale up, and simple to debug when something goes wrong.
And one of the underestimated aspects is that even the best engineers won't make you a good product. They will produce the best architecture and probably constant refactoring, but they will fail to deliver things the user cares about or can see.
I notice their front-end engineering (and design). The app experience has always been noticeably better than every other streaming service, which matters to me
It's also possible that savings on infrastructure costs, reduced engineering hours spent on debugging, etc give them a quiet competitive advantage
Most engineering is not visible to the end-user
Lately, the user experience has gotten much worse for Windows users: In July they removed the ability to download content for offline viewing. So everyone who wants to watch content while travelling, for example, has no other option but to use a competitor like Prime Video.
I would also like to point out that those "savings on infrastructure costs" seemingly do not benefit their users: They have repeatedly increased prices over the past few years.
I'm also unsure whether they are using their competitive advantage properly. Anecdotally, it used to be that almost everyone I knew was only streaming on Netflix. These days, the streaming services that people use seem much more diverse.
Their front end is just plain garbage. It does everything besides helping discovery of content. And people look up to Netflix like it's some miracle thing.
It is well engineered and meticulously manicured garbage though. In your example, the harder it is to discover content, the harder it is to realize there is nothing new to watch (hence the rotating thumbnails, for a trivial example).
One thing I hate about Netflix is that category recommendations also include stuff from other categories like Trending/Recently Watched, etc. I just want to watch some classic movies, so please show me your whole catalog of such movies. I don't use Netflix often, so this issue is more prominent for me, at least, since I am not used to the UI bloat.
There are always a lot of good things to watch, but you have to dance around the idiotic recommendation system to find the hidden pearls. I'm sure there are a lot of good, talented people working there, but for what, when the frontend team kneecaps their work with the abysmal user interface?
And the best part... you chat with support and they try to force "how to use our webpage" shit on you when you want to report an annoying UI issue. It's really mind-boggling how a congregation of idiots with their heads up their asses runs that place.
That you don't like how you have to use the Netflix interface has nothing to do with how it's engineered.
All of the services you mention have large engineering/ops teams and similar monitoring systems in place. There is absolutely an ROI on spending on reliability.
Not an expert by any means, but streaming HQ video is pretty expensive (even more so for live content); it seems like the only providers that can do so profitably are YouTube and Netflix. I'm sure a big reason for that is the engineering (esp. the CDN).
This is actually not true nowadays. Streaming HQ video is pretty cheap (check out per-GB pricing from CloudFront or Fastly and divide that by 5-10 to get a realistic number).
The value of Netflix probably lies not only in its technical prowess and app experience; it also seems they are pretty involved in content direction through metrics. [1] [2]
Sources:
[1] https://youtu.be/xL58d1l-6tA
[2] https://youtu.be/uFpK_r-jEXg
Jury’s out on their content direction. They cancel great shows before the first season has even had a chance to permeate. It’s clear they have little interest in long term artistic investments.
It also shows that they have no interest in building a library of quality content. They could invest in shows that have proper endings but instead they pollute their library with unfinished works that are certain to either be avoided by netflix customers who know that the show was canceled before it concluded, or piss off any netflix customer who doesn't.
Netflix doesn't care though because they just want people watching the newest thing they shovel at us. They go out of their way to hide a lot of older content from users to keep people's attention on the new stuff. Half of the categories they show you are just new ways to show you the latest things they're pushing (recently added, trending, top 10, new on netflix, your next watch, top picks for you, we think you'll love these, etc.) and every other category just repeats the same shows.
Who cares if some show leaves you disappointed because it got you hooked but then was canceled after 3 weeks, Netflix will just push some other newer show to keep you watching until they cancel that too.
I’ve thought this for a while, and it’s sad because I want to reward good tech. I usually think about this while I’m waiting for Paramount+ to load for the third time because picture-in-picture has gone black again when I tried to go full screen.
Although I agree with the general point, I am thankful for the bit of extra work they do when I travel abroad. The quality and availability of Netflix is far superior (many other services don't even work outside the US).
> EVCache
EVCache is a disaster. The code base has no concept of a threading model. The code is almost completely untested* too. I was on call at least twice when EVCache blew up on us. I tried root-causing it and the code is a rat's nest. Avoid!
* https://github.com/Netflix/EVCache
(I work at Netflix on these Datastores)
EVCache definitely has some sharp edges and can be hard to use, which is one of the reasons we are putting it behind these gRPC abstractions like this Counter one or e.g. KeyValue [1], which offer CompletableFuture APIs with clean async and blocking modalities. We are also starting to add proper async APIs to EVCache itself, e.g. getAsync [2], which the abstractions are using under the hood.
At the same time, EVCache is the cheapest (by about 10x in our experiments) caching solution with global replication [3] and cache warming [4] we are aware of. Every time we've tried alternatives like Redis or managed services they either fail to scale (e.g. cannot leverage flash storage effectively [5]) or cost waaay too much at our scale.
I absolutely agree though EVCache is probably the wrong choice for most folks - most folks aren't doing 100 million operations / second with 4-region full-active replication and applications that expect p50 client-side latency <500us. Similar I think to how most folks should probably start with PostgreSQL and not Cassandra.
[1] https://netflixtechblog.com/introducing-netflixs-key-value-d...
[2] https://github.com/Netflix/EVCache/blob/11b47ecb4e15234ca99c...
[3] https://www.infoq.com/articles/netflix-global-cache/
[4] https://netflixtechblog.medium.com/cache-warming-leveraging-...
[5] https://netflixtechblog.com/evolution-of-application-data-ca...
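To make the async/blocking "modalities" point above concrete, here is a minimal, hypothetical Java sketch of a counter-style client that exposes the same operation both as a CompletableFuture and as a blocking wrapper around it. The class name, method names, and the in-memory backing map are all made up for illustration; this is not Netflix's actual KeyValue or Counter API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical client exposing both async and blocking modalities over one
// operation. The in-memory map stands in for the real backing store.
public class CounterClientSketch {
    private final ConcurrentHashMap<String, AtomicLong> backing = new ConcurrentHashMap<>();

    // Async modality: callers compose on the returned CompletableFuture.
    public CompletableFuture<Long> addAndGetAsync(String counter, long delta) {
        return CompletableFuture.supplyAsync(
            () -> backing.computeIfAbsent(counter, k -> new AtomicLong()).addAndGet(delta));
    }

    // Blocking modality: a thin wrapper that simply waits on the async call.
    public long addAndGet(String counter, long delta) {
        return addAndGetAsync(counter, delta).join();
    }

    public static void main(String[] args) {
        CounterClientSketch client = new CounterClientSketch();
        client.addAndGetAsync("views:some-title", 1)
              .thenAccept(v -> System.out.println("async count = " + v))
              .join();
        System.out.println("blocking count = " + client.addAndGet("views:some-title", 1)); // 2
    }
}
```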
Throwing out a clarification: EVCache is effectively a complex memcached client plus an internal ecosystem at Netflix. You can get much of its benefits with other systems (such as the memcached internal proxy: https://docs.memcached.org/features/proxy/).
For plugging into other apps they may only need a small slice of EVCache; just the fetch from local-then-far, copy sets to multiple zones, etc. A greenfield client with the same backing store could be trivial to do.
That all said, I wouldn't advise people to copy their method of expanding cache clusters: it's possible to add or remove one instance at a time without rebuilding and re-warming the whole thing.
Curious as to how you're getting <500us latencies. Connection pooling, gRPC?
Every zone has a copy, and clients always read their local zone's copy first (via pooled memcached connections), falling back only once to another zone on a miss. The key is staying in-zone and on the memcached protocol, plus super fast server latencies. It's been a little while since we measured, but memcached has a time to first byte of around 10us and then scales sublinearly with payload size [1]. Single-zone latency is variable but generally between 150 and 250us round trip; cross-AZ is terrible, at up to a millisecond [2].
So you add roughly 200us of network to a ~30us response time and get about 250us average latency. Of course the P99 tail is closer to a millisecond, and you have to do things like hedged requests to fight things like the hard-coded 200ms TCP retransmission timer (an eternity at these latencies)... but that's a whole other can of worms.
[1] https://github.com/Netflix-Skunkworks/service-capacity-model...
[2] https://jolynch.github.io/pdf/wlllb-apachecon-2022.pdf
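As a rough illustration of the read path described above (read the local zone's replica first, fall back a single time to another zone on a miss), here is a hedged Java sketch. The zone lookup function stands in for pooled memcached connections; none of these names come from the real EVCache client.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Random;
import java.util.function.Function;

// Sketch of "read the local zone's copy first, fall back once on miss".
public class ZoneAwareReadSketch {
    private final String localZone;
    private final List<String> otherZones;
    private final Function<String, Map<String, String>> zoneLookup; // zone -> key/value replica
    private final Random random = new Random();

    public ZoneAwareReadSketch(String localZone, List<String> otherZones,
                               Function<String, Map<String, String>> zoneLookup) {
        this.localZone = localZone;
        this.otherZones = otherZones;
        this.zoneLookup = zoneLookup;
    }

    public Optional<String> get(String key) {
        // Fast path: stay in-zone, keeping the round trip in the ~150-250us range.
        String local = zoneLookup.apply(localZone).get(key);
        if (local != null) return Optional.of(local);

        // Miss: fall back exactly once to one other zone's replica (cross-AZ, slower).
        if (otherZones.isEmpty()) return Optional.empty();
        String fallback = otherZones.get(random.nextInt(otherZones.size()));
        return Optional.ofNullable(zoneLookup.apply(fallback).get(key));
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> zones = Map.of(
            "us-east-1a", Map.of("k1", "local-value"),
            "us-east-1b", Map.of("k2", "remote-value"));
        ZoneAwareReadSketch reader =
            new ZoneAwareReadSketch("us-east-1a", List.of("us-east-1b"), zones::get);
        System.out.println(reader.get("k1")); // Optional[local-value]
        System.out.println(reader.get("k2")); // Optional[remote-value], via the one fallback
    }
}
```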
I'm surprised it's still there! It was built over a decade ago when I was still there. At the time there were no other good solutions.
But Momento exists now. It solves every problem EVCache was supposed to solve.
There are other options too. They should retire it by now.
Momento? This? https://www.gomomento.com/
Seems to be cloud hosted only.
Can you elaborate?
From the looks of it, each module has plenty of tests, and the codebase is written in a Spring Boot style, making it fairly intuitive to navigate.
It's a bit weird to not compare this to HyperLogLog & similar techniques that are designed to solve exactly this problem but much more cheaply (at least as far as I understand).
Added a note on why we chose EvCache for the "Best-Effort" use-case instead of probabilistic data structures like HLL/CMS. Appreciate the discussion.
Can you do a distributed HyperLogLog? Wouldn't you have to have a single instance of it somewhere?
HyperLogLog distributes perfectly. Each node keeps its own registers (the best "rank" seen per bucket), and to get the global estimate you just ask each node for its registers and merge them by taking the element-wise max.
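For readers who want to see why that merge is so easy, here is a minimal, illustrative HyperLogLog in Java (no bias or small/large-range corrections, so treat it as a sketch rather than a production implementation): each node keeps its own register array, and the global estimate comes from an element-wise max merge of everyone's registers.

```java
import java.util.List;

// Minimal HyperLogLog sketch. Each node keeps registers; merging is an
// element-wise max, which is what makes it trivially distributable.
public class HllSketch {
    private static final int P = 14;           // 2^14 = 16384 registers
    private static final int M = 1 << P;
    private final byte[] registers = new byte[M];

    // Callers pass in a well-mixed 64-bit hash of the item.
    public void add(long hash) {
        int idx = (int) (hash >>> (64 - P));   // top P bits choose a register
        long rest = hash << P;                 // remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1; // position of first set bit
        registers[idx] = (byte) Math.max(registers[idx], Math.min(rank, 64));
    }

    // Global view: ask every node for its registers and take the max per slot.
    public static HllSketch merge(List<HllSketch> nodes) {
        HllSketch merged = new HllSketch();
        for (HllSketch node : nodes)
            for (int i = 0; i < M; i++)
                merged.registers[i] = (byte) Math.max(merged.registers[i], node.registers[i]);
        return merged;
    }

    public double estimate() {
        double alpha = 0.7213 / (1 + 1.079 / M);
        double sum = 0;
        for (byte r : registers) sum += Math.pow(2, -r);
        return alpha * M * (double) M / sum;   // raw HLL estimate, no corrections
    }
}
```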
Querying each instance can lead to availability and latency challenges. Moreover, HLL is not suited for tasks like increments/decrements, TTLs on counts, and clearing of counts. Count-Min Sketch could work if we're willing to forgo certain requirements like TTLs, but relying solely on in-memory data structures on every instance isn't ideal (think about instance crashes, restarts, new code deployments etc.) Instead, using data stores that support HLL or Count-Min Sketch, like Redis, would offer better reliability. That said, we prefer to build on top of data stores we already operate. Also, the "Best-Effort" counter is literally 5 lines of code for EvCache. The bulk of the post focuses on "Eventually Consistent Accurate" counters, along with a way to track which systems sent the increments, which probabilistic counting is not ideal for.
Just to add, we are also trying to support idempotency wherever possible to enable safe hedging and retries. This is mentioned a bunch in the article on Accurate counters. So take that into consideration.
I came here to write the same thing. Getting an estimate accurate to at least 5 digits of all Netflix video watches worldwide could be done with intelligent sampling (like HyperLogLog) and likely one MacBook Air as the backend. And aside from the compute savings, the complexity and implementation time would be much lower too.
Fwiw, we didn't mention any probabilistic data structures because they don't satisfy some of the basic requirements we had for the Best-Effort counter. HyperLogLog is designed for cardinality estimation, not for incrementing or decrementing specific counts (which in our case could be any arbitrary +ve/-ve number per key). AFAIK, both Count-Min Sketch and HyperLogLog do not support clearing counts for specific keys. I believe Count-Min Sketch cannot support decrement as well. The core EvCache solution for the Best-Effort counter is like 5 lines of code. And EvCache can handle millions of operations/second relatively cheaply.
Including this in the blog would have been helpful, although I don't think the decrement problem is unsolvable: just keep a second field for decrements that is incremented whenever you want to decrement, and the final result is the first field minus the second.
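A minimal sketch of that two-field idea (in effect a PN-counter): decrements are recorded as increments of a second grow-only field, and the reported value is the difference of the two. Purely illustrative; a real deployment would keep both fields in the cache or data store rather than in process memory.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Two-field (PN-counter style) counter: separate grow-only increment and
// decrement tallies, with the current value being their difference.
public class TwoFieldCounterSketch {
    private final ConcurrentHashMap<String, LongAdder> increments = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, LongAdder> decrements = new ConcurrentHashMap<>();

    public void increment(String key, long amount) {
        increments.computeIfAbsent(key, k -> new LongAdder()).add(amount);
    }

    public void decrement(String key, long amount) {
        // A decrement is recorded as an increment of the second field.
        decrements.computeIfAbsent(key, k -> new LongAdder()).add(amount);
    }

    public long count(String key) {
        long inc = increments.getOrDefault(key, new LongAdder()).sum();
        long dec = decrements.getOrDefault(key, new LongAdder()).sum();
        return inc - dec; // reported value = increments minus decrements
    }

    public static void main(String[] args) {
        TwoFieldCounterSketch c = new TwoFieldCounterSketch();
        c.increment("plays", 5);
        c.decrement("plays", 2);
        System.out.println(c.count("plays")); // 3
    }
}
```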
True, you could do decrements that way. We trimmed this article as the post is already quite long, but considering the multiple threads on this, we might add a few lines. There is also something to be said about operating data stores that support HLL or similar probabilistic data structures. Our goal is to build taller on what we already operate and deploy (like EvCache).
I wonder how they're going to go about purging all the counters that end up unused once the employee and/or team leaves?
I can see someone setting up a huge number of counters then leaving...and in a hundred years their counters are taking up TB of space and thousands of requests-per-second.
There is a retention policy, so the raw events aren't kept very long. The rollups probably compress really well in their time series database, which I'm guessing also has a retention policy.
If you have high cardinality metrics, it can still be really painful, although I think you will feel the pain initially and it won't take years. Usually these systems have a way to inspect what metrics or counters are using the most resources and then they can be reduced or purged from time to time.
Yes, once the events are aggregated (and optionally moved to cost-effective storage for audits), we don't need them anymore in the primary storage. You can check the retention section in the article. The rollups themselves can have a TTL if users wish to set that on a namespace, although in that case they have to be fine with certain timing issues around when rollups expire relative to when new events are aggregated. We also have automation to truncate/delete namespaces.
I think the design could have been simpler with Kafka (which they touched on briefly):
- Write counter changes to a Kafka topic with many partitions. The partition key is derived from the counter name.
- Use Kafka connect to push all counter events to S3 for audit and analysis.
- Write a Kafka consumer that reads events in batches and updates a persistent store with the current count (see the sketch after this list).
- Pick a good Kafka message lifetime to ensure that topic size is kept under control, but data is not lost.
This gives us:
- Fast reads (count is precomputed, but potentially stale)
- Fast writes (Kafka)
- Correctness (every counter is assigned exactly one consumer)
- Durability (all state is in Kafka or the persistent store)
- Scalable storage and compute requirements over time
If I were to really go crazy with this, I would shard each counter further and use CRDTs to compute the total across all shards.
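For what it's worth, here is a rough Java sketch of the consumer step from that list, under stated assumptions (the topic name, the key/value layout of counter name to signed delta, and the persistent-store call are all made up): read events in batches, fold the deltas per counter, apply them to a store, then commit offsets.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Batches counter deltas from a Kafka topic and applies them to a store.
public class CounterRollupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "counter-rollup");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", LongDeserializer.class.getName());

        try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("counter-deltas")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, Long> batch = consumer.poll(Duration.ofMillis(500));
                Map<String, Long> deltas = new HashMap<>();
                for (ConsumerRecord<String, Long> record : batch) {
                    // Key = counter name, value = signed delta for that counter.
                    deltas.merge(record.key(), record.value(), Long::sum);
                }
                applyToStore(deltas);   // upsert into the persistent store
                consumer.commitSync();  // commit after applying: at-least-once semantics
            }
        }
    }

    // Placeholder for something like "UPDATE counters SET value = value + ? WHERE name = ?".
    static void applyToStore(Map<String, Long> deltas) {
        deltas.forEach((name, delta) -> System.out.printf("%s += %d%n", name, delta));
    }
}
```

Committing offsets only after applying the batch gives at-least-once semantics (a crash between apply and commit can double-count); getting closer to exactly-once would need idempotent or transactional writes, which ties into the idempotency discussion elsewhere in this thread.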
Yes, this is one of the approaches mentioned in the article and is indeed a valid approach. One thing to keep in mind is that we are already operating the TimeSeries service for a lot of other high ROI use cases within Netflix. There already exists a lot of automation to self-provision, configure, deploy and scale TimeSeries. There already exists automation to move data from Cassandra to S3/Iceberg. We somewhat get all that for free. The Counter service is really just the Rollup layer on top of it. The Rollup operational nuances are just to give it that extra edge when it comes to accuracy and reliability.
Did you also consider AWS managed services? Like a direct write to Dynamo?
Not for this use case. Other use cases at Netflix use AWS managed services when it makes sense from a use-case and cost perspective. In this case, using TimeSeries opens the door to a lot of other potential future use cases:
1. What was the count for counter X between times T1 and T2? (A toy sketch of this follows below.)
2. "I am going to re-run my batch job again from yesterday. Adjust the increments for this window and re-compute the final count."
Although the #2 use case requires a lot of other nuances around recounting, which we allude to but don't expand upon in the article (adjustable retention, multiple rollup checkpoints per counter, pushing back the accept-limit for backfills, etc.).
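As a toy illustration of use case #1, assuming increments are kept in fixed time buckets (which a TimeSeries-style store would allow), the count between T1 and T2 is just the sum of the buckets in that range. The in-memory map below stands in for the real storage layer and is not Netflix's implementation.

```java
import java.time.Instant;
import java.util.NavigableMap;
import java.util.TreeMap;

// Time-bucketed counter: answers "what was the count between T1 and T2?"
// by summing per-bucket increments. Query bounds round to bucket boundaries.
public class TimeRangedCounterSketch {
    // bucket start (epoch millis) -> total increments observed in that bucket
    private final NavigableMap<Long, Long> buckets = new TreeMap<>();
    private final long bucketMillis;

    public TimeRangedCounterSketch(long bucketMillis) { this.bucketMillis = bucketMillis; }

    public void add(Instant eventTime, long delta) {
        long bucket = (eventTime.toEpochMilli() / bucketMillis) * bucketMillis;
        buckets.merge(bucket, delta, Long::sum);
    }

    // "What was the count for counter X between times T1 and T2?"
    public long countBetween(Instant t1, Instant t2) {
        long from = (t1.toEpochMilli() / bucketMillis) * bucketMillis; // align to bucket start
        return buckets.subMap(from, true, t2.toEpochMilli(), false)
                      .values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        TimeRangedCounterSketch counter = new TimeRangedCounterSketch(60_000); // 1-minute buckets
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        counter.add(t0.plusSeconds(10), 3);   // lands in the 00:00 bucket
        counter.add(t0.plusSeconds(130), 4);  // lands in the 00:02 bucket
        System.out.println(counter.countBetween(t0, t0.plusSeconds(60)));  // 3
        System.out.println(counter.countBetween(t0, t0.plusSeconds(180))); // 7
    }
}
```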
This seems a bit overcooked to me.
I suspect if I were to recursively ask "why?", we may eventually wind up at some triviality (to the actual business/customer) that could have easily gone another way and obviated the need for this abstraction in the first place.
Just thinking about the raw information, I don't see how the average streaming media consumer produces more than a few hundred KB of useful data per month. I get that the delivery of the content is hard, but I don't see why we need to torture ourselves over the gathering of basic statistics.
Why would Netflix put their blog on Medium?
Better distribution. More people will read it there.
That was probably true 5 years ago, but today a large chunk of readers are going to be driven away by the intrusive popups and auth walls.
I didn't read it, and that was the reason I asked.
Because it's the least bad way to do a blog when you want to focus on the content.
Well okay, that's some neat implementation stuff.
But what in the world are they using a global counter for that needs "low millisecond latencies"? I don't see a single example in the entire article, and I can't think of any.
Use cases fetching counts directly in the path of Netflix users/streaming, e.g. user-personalization, feature-gating > what features are shown when you load the home page, dictated by how many times these have been shown before for a given device. The article hints at this in the beginning. Also, there were some initial use cases related to interactive titles, details of which can't be publicly shared [although that is winding down now]
> user-personalization, feature-gating > what features are shown when you load the home page, dictated by how many times these have been shown before for a given device
I don't see how any of those would suffer if the numbers took seconds to update instead of milliseconds.
We're talking about having updated numbers in milliseconds, right? Not just "the database responds in a reasonable amount of time" because that's been solved many many times over and the article specifically says "this category requires near-immediate access to the current count at low latencies".
> some initial use cases related to interactive titles
Maybe 1 second of latency for a group interaction? That's still orders of magnitude more slack. And I'd expect only moderate accuracy requirements.
For the Eventually Consistent counter, the low-millisecond requirement is for reads and writes, not for the convergence of counts. For this category, convergence is on the order of seconds (user-personalization and feature-gating fall in this category).
For the "Best-effort" category, there are some use cases that run experiments in a single region and need access to current counts at low latencies (they basically add increments and read the value back in the same call, i.e. AddAndGet), but are willing to sacrifice "some degree" of accuracy for it. See the table in the 2nd section; there are multiple dimensions in terms of read/write latency, staleness, global reads/writes etc. mapped to the two kinds of use cases. Maybe you are conflating a few things.
Finally, there is the experimental type of Accurate counters that can get the current count with a high degree of accuracy (but the latency there depends on a few things, as explained in the article). The last type is more about what can be done using this approach; there is no current use case for it.
Given the complexity of the system, I'm curious to know how many people maintain this service
5 people (who also maintain a lot of other services, like the linked TimeSeries service). The self-service to create new namespaces is pretty much autonomous (see the attached link in the article on "Provisioning"). The stateless layer auto-scales up and down based on attached CPU/network-based scaling policies. The exports to audit stores can be scheduled at a cadence. The only intervention is when we have to scale the storage layer (although parts of that are also automated using the same Provisioning workflow). I guess the other intervention is when we decide to change the configs (like the number of queues) and trigger a re-deploy. But that's about it. So far, we have spent a very small percentage of our support budget on this.
Looks a bit overengineered due to Netflix's own microservices nonsense.
I would be more interested in how a higher traffic video company like Pornhub handles things like this.
> Netflix's own microservices nonsense
How many times has Netflix been entirely down over the years?
Seems it's not "nonsense".
Netflix's technology is such nonsense that right now, for the boxing match, they are dropping the quality to 720p.
Have you worked with distributed counters before? It's a hard problem to solve. Typical tradeoffs are lower cardinality for exact counters.
The queue solution is pretty elegant.
The queuing reminds me of old "tote board" technology. I can't find a reference easily but these were machines used to automatically figure and display parimutuel payouts at horse tracks. One particular kind of them would have the cashiers' terminals emit ball bearings on a track back at the central (digital but electromechanical) counter for each betting pool. This arrangement allowed the cashiers to sell as fast as they liked without jamming the works, and then allowed the calculation/display of odds to be "eventually consistent".
How would you design it to support mentioned use cases?
HyperLogLog and PostgreSQL: https://github.com/citusdata/postgresql-hll
Or even simpler, Roaring Bitmaps: https://pncnmnp.github.io/blogs/roaring-bitmaps.html
https://blog.quastor.org/p/grab-rate-limiting
https://github.com/RoaringBitmap/CRoaring
HyperLogLog doesn't support exact counters though, does it? That seems to be one of the core requirements of the queue-based solution.
HyperLogLog (or even Count-Min Sketch) will not support some of the requirements of even the Best-Effort counter (clearing counts for specific keys, decrementing counts by any arbitrary number, having a TTL on counts etc.). For Accurate counters, we are trying to solve for multi-region read/write availability at low single-digit millisecond latency at cheap costs, using the infrastructure we already operate and deploy. There are also other requirements such as tracking the provenance of increments, which play a part.
If anyone wants a non-distributed but still very powerful counter service, I'd recommend Telegraf (from InfluxData) or https://vector.dev/.