> In a dream world of course, everything would load in < 1s
It's important you understand that "everything loading in <1s" would still be unacceptably slow - that is still an order of magnitude too slow.
That is not "a dream world" - not even close. A well built tool like this, meeting standard expectations (i.e. table stakes), would hit <50ms for the end user - the vast majority of the time. A "dream world" would be more like 10ms.
You should be targeting <200ms for 99% of user-facing interactions. That is the baseline standard/minimum expected.
This is why people are saying the company needs to make a major shift on this - you're not just out of the ballpark of table stakes here, you're barely in the same county!
It cannot be overstated how far off the mark you are here. There's a fundamental missetting of expectations and understanding of what is acceptable.
Do you have evidence that what you're asking for is possible? I'd be interested to see websites that hit the benchmark that you're aiming for.
I just tested a HN profile page (famously one of the lightest weight non-static websites) and it takes between 300ms and 600ms to load. I'm not saying that Jira can't improve, but if HN isn't hitting 250ms then I think telling the Jira guys that nothing less than <200ms is the minimum standard is unrealistic.
Look at GitHub pull requests. They load in under 200ms for me, and they are vastly more complex than HN in terms of both queries and UI; the content should be roughly equivalent to what Jira needs.
Jira is also much more interactive than HN. You're sitting with 10+ people in a room while some half-asleep scrum master opens the wrong issue, has to go back and open the correct one, then searches again for some related issue you thought was fixed last month. Then you refresh the board to make sure you didn't forget to fill in one field so it ends up in the wrong column, etc etc.
1 sec per click in a situation like this is a joke, and that's just their goal. Reality is 4sec+ as OP mentioned, often even more.
Assuming the FE resources are already cached on the user's machine, with careful optimisation, doing all of the rendering/fetching on the FE over a single connection, and with everything parallelised, it definitely is possible to load a new page well under 100ms with the key content being displayed.
When taking that kind of approach, you don't have to wait for the slowest thing to come in - eg with a normal BE render, you might need to pull up the user's profile and settings, A/B testing flags, the current footer config or whatever.
eg if you're on the page for viewing a single ticket, you can request the ticket data immediately, and render it as soon as it's available - even if other parts of the page aren't finished yet. True it may be more like 200-300ms to have the entire thing be 100% complete, but all parts of the page are not of equal importance and holding up the main content while loading the rest isn't necessary.
If you are doing a full BE render, it's still totally possible to hit that 100ms mark, but indeed dramatically more difficult.
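To make the "render each part as soon as it's available" idea concrete, here is a minimal sketch in Python using simulated requests. The part names and latencies are made-up assumptions, and `fetch` is a hypothetical stand-in for a real network call; the point is only that the key content is ready long before the slowest request finishes.

```python
import asyncio
import time

# Hypothetical per-request latencies in seconds (illustrative, not measured).
LATENCIES = {
    "ticket": 0.03,
    "profile": 0.08,
    "feature_flags": 0.12,
    "footer": 0.20,
}


async def fetch(name: str) -> str:
    """Stand-in for a network request: sleeps for the assumed latency."""
    await asyncio.sleep(LATENCIES[name])
    return name


async def load_page() -> dict:
    """Start every request at once and record when each part could render."""
    start = time.perf_counter()
    ready_at = {}
    tasks = [asyncio.create_task(fetch(name)) for name in LATENCIES]
    # as_completed yields results in completion order, so the ticket can be
    # rendered without waiting for the slowest request.
    for finished in asyncio.as_completed(tasks):
        name = await finished
        ready_at[name] = time.perf_counter() - start
    return ready_at


ready_at = asyncio.run(load_page())
for name, seconds in sorted(ready_at.items(), key=lambda item: item[1]):
    print(f"{name:>14} ready after {seconds * 1000:.0f} ms")
```

With these assumed numbers the ticket is ready after roughly 30ms, while the whole page completes at about the latency of the slowest part rather than the sum of all of them.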
You're right, I apologize for not being clear. We're targeting 1s for "Initial loads" on new tabs/new navigation, which I assume you're referring to. Our target for 'transitions' is different.
If however the numbers you're referring to are "initial load" numbers, then I'm not sure.
(edit: and action responses again are also a separate category. Our largest number of complaints are about 'page load times' in Confluence, so most conversations center around that)
As a first step, 1s would be better than nothing for sure, but you need to be working towards a much tighter goal on a 1-2 year timeframe.
New load, you should really be hitting 200ms as your 95th percentile - 300ms or so would be decent still. "Transitions" should hit 100ms 95th, 150ms would be decent.
If you did hit 100ms across the board, you'd be rewarded by your customers/users psychologically considering the interactions as being effectively instantaneous. So it really is worth setting a super high bar as your target here (esp given you need a bit of breathing room for future changes too).
Thank you for coming back and clarifying. Do you happen to have links to any public testing results of other tools, or guidance to this specificity - would love to use them to build a case internally
Most of what we've seen online is nowhere near this level of detail (X-ms for Y-%ile for Z-type of load)
On the other noted times, they're just a general range of what can be expected from a reasonably well-built tool of this nature. Obviously much simpler systems should be drastically faster, but project management tools do tend to be processing quite a bit of data and so do involve _some_ amount of inherent "weight", but that isn't an excuse for very poor perf.
That said, I imagine if your PMs do some research and go ahead and try using some of the common project management tools, you should get a good idea. ;) Keep in mind speeds to Australia (assuming Atlassian is operated mostly there?) will likely show them in a much worse light than typical perf experienced in the US/UK/EU areas.
The time to first load is derived from the fact that you're running essentially the equivalent of many "transition" type interactions, but they should be run almost entirely in parallel, so roughly 2x between "transition" and "new load" is a reasonable allowance.
Thanks for the link! Yes this is the general guidance we're using too (0.1/1/10s), and one that we're reinforcing at every level of the company. This link does have more detail than I've seen in other places though, so it's an interesting read.
However, I've not seen guidance on whether these should be P90, P95 or P99 measures, for example. We've selected something internally, but obviously the choice among those three 'measurement points' could drastically change the general user's experience.
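As a rough illustration of how much the choice of percentile moves the headline number, here is a small Python sketch. The samples are synthetic (a log-normal distribution picked only because it has a slow tail, which is an assumption and not real Jira or Confluence data), so only the relative gap between percentiles matters.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustrative sample is reproducible

# Hypothetical page-load samples in ms: most are fast, with a slow tail.
samples = [random.lognormvariate(5.0, 0.6) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 99 cut points p1..p99.
cuts = statistics.quantiles(samples, n=100)
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]

for label, value in [("p50", p50), ("p90", p90), ("p95", p95), ("p99", p99)]:
    print(f"{label}: {value:.0f} ms")
```

On a tail-heavy distribution like this one, the same target in milliseconds is a very different commitment depending on whether it is stated at P90, P95 or P99.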
(HN is throttling my replies so apologies for delay)
A big part is simply how far you are in your journey of getting good at performance - if your p50 is still garbage, there's not much point in focussing on your p99 measurements. You should be targeting the p99 long term, but focus on the p50/p90 for now.
It's super important to target and make long term decisions around the p99 though, because, e.g., making a 100x improvement is not possible through little iterative changes over 2-3 years. You need a base to work from where that 100x is fundamentally achievable, which requires thinking from first principles and slightly getting out of the typical product mindset.
I also find the typical product mindset tends to result in focussing a lot on the "this quarter/next quarter" goals, but neglecting the "8/12 quarters from now" as a result.
Beyond short term/long term goals, the choice is largely just down to what the product is/does. Even ignoring all current architectural choices, there are some fundamentals where certain things must always be faster/slower - e.g. sync writes will typically be a fair bit slower than reads but occur much less often, and complex dynamic queries which can't be pre-optimised require DB scanning but are also much less common.
For these kinds of tools, where most of the interaction is reads, mostly on predefined or predefined + small extra filtering, and reading/writing on individual resources (ie tickets), you can get p99 numbers trending towards the 100ms mark eventually - there's very little which truly can't get to that level with clever enough engineering.
---
Of course I imagine Google tends to be looking more at their p99.9/p99.99/pmax/etc(!), at least for their absolute highest volume systems.
None of us are going to be getting to that point, but it's often worth thinking about engineering principles against a super high bar - it often helps people to open their minds a bit more and think more outside the box when given a really dramatic goal magnitudes beyond their existing mindset.
Of course you're not expecting to really get to that level, but anchoring that way can achieve amazing things. I've done that with a lot of success at my company and we actually did manage to achieve a few goals originally thought to be totally unrealistic.
Not to be a jerk, but you guys don’t allow others to take your performance metrics, but you’re publicly soliciting performance data from other products at the same time? I’m assuming you’re taking it for granted they don’t have a ToS that bans you from doing this.
Sorry if that’s pointed, but it’s sort of meant to be incredulous (but hopefully not offensive).
Not offended - as an employee I have no specific insight into the actual headline term in the ToS - honestly I'm planning on tracking down someone in legal to help clarify this, since it seems that, as currently written (and as currently interpreted in the worst case), it unnecessarily impedes me from doing my job.
I would never encourage anyone to violate a ToS of another product and apologize for anyone that was considering doing it due to my ask.
I think these are other possibilities:
1) (as stated) other products don't have such ToS
2) other products may have published their own metrics and made them available for consumption
3) from a more legal in depth standpoint, maybe other companies have such ToS terms but have clarified them to some point that makes them more clear about when they apply and when they don't
Sorry you have to work on this thread on vacation dude (or girl). This thread has been an absolute beat down and you’ve treated it with utmost professionalism when it’s pretty clear you’re new to the team. It is Saturday night after all.
I appreciate the well wishes, honestly it means a lot - good guess too, I am in the US (maybe you knew, but most of Confluence Cloud is based out of the West Coast offices).
I don't know what the overlap is between Atlassian users and IT admins though - my previous job was on the vSphere UI and if you happen to know about the death of the Flash based client, this is not too far off.
Hopefully users stay willing to engage with us so we can improve the product as fast as possible.
I don't think I know exactly what that might mean - I know we have synthetic traffic generation tools (and thus, measurements generated from the synthetic traffic), but I think those exhibit the same variance as production -> the backend for them is the same cloud IaaS systems and SW, so there's no 'sandboxed from all outside variance'.
If it means something else then I'm not aware if we do it or not.
No, don't give in to this guy ... this is done over the net. The rate of transfer has to be taken into account. Unacceptable is a measure of comparison.
Unacceptable to whom? Do you have a faster provider for cheaper, with as many features???
I'm pretty sure he doesn't, because if he could he would go there. There are tradeoffs and Atlassian has many projects they are working on. They understand that there is room for improvement in performance. It's one of Atlassian's priorities; it is a tech company (a pretty good one I would say).
I guess one question is about server redundancy. Where is this guy loading from and where is the server he is loading from? Getting things below 1s is nearing the speed of the connection itself. Also at that speed there are diminishing returns. Something that happens at 1s vs .5s doesn't make you twice as fast when you don't even have the response time to move your mouse and click on the next item in .5s.
Sometimes techies just love to argue. You are doing great Atlassian and have tons of features. But maybe it is time to revisit and refactor some of your older tools.
> Getting things below 1s is nearing the speed of the connection itself
That is absolutely false. Internet latency is actually very low - even e.g. Paris to NZ is only about 270ms RTT, and you _do not_ need multiple full round trips to the application server for an encrypted connection - on the modern internet, connections are held open, and initial TLS termination is done at local PoPs.
Services like this - as they are sharded by customer tenancy - are usually located at least in the same vague area as the customer (e.g. within North America, Western Europe, APAC etc).
For most users of things like Atlassian products, that typically results in a base networking latency of <30ms, often even <10ms in good conditions.
Really well engineered products can even operate in multiple regions at once - offering that sort of latency globally.
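For a sanity check on the Paris to NZ figure, here is a back-of-the-envelope sketch in Python. It assumes light in fibre travels at roughly two thirds of c, uses approximate city coordinates, and takes Auckland as the NZ endpoint (my assumption); real routes are longer than the great circle, so this is only a lower bound.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: light in optical fibre travels at roughly 2/3 of c.
FIBRE_SPEED_KM_PER_S = 200_000.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rtt_floor_ms(distance_km):
    """Best-case round trip over a straight fibre of the given length."""
    return 2 * distance_km / FIBRE_SPEED_KM_PER_S * 1000


# Approximate coordinates, assumed for illustration.
PARIS = (48.86, 2.35)
AUCKLAND = (-36.85, 174.76)

distance_km = great_circle_km(*PARIS, *AUCKLAND)
floor_ms = rtt_floor_ms(distance_km)
print(f"great-circle distance: {distance_km:.0f} km")
print(f"theoretical RTT floor: {floor_ms:.0f} ms")
```

The floor comes out comfortably below the ~270ms quoted above, which is what you'd expect given that real cables don't follow the great circle and routers add some delay. Even for a near-antipodal path, a single round trip is a small fraction of one second.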
> Im pretty sure he doesnt because if he could he would go there
Yeah, we don't use any Atlassian products - partly for this reason. We use many Atlassian-comparable tools which have the featureset we want and which are drastically faster.
> when you dont even have the response time to move your mouse and click on the next item in .5s.
Not really, I have no particular investment in this - I don't use any Atlassian product, nor do I plan to even if they make massive perf improvements.
But I do have an objective grasp - for tools like this - of what's possible, what good looks like, and what user expectations look like.
> no dont give into this guy
I don't expect Atlassian is going to make any major decisions entirely based on my feedback here, but it is useful data/input for exploration, and I do feel it's right to point out that they're looking in the wrong ballpark when it comes to the scale of improvement needed.
To put things in perspective, the typical Jira 5-second page load time as reported by many people in this forum is equivalent to twice the round-trip time for light to the Moon!
It's the network latency equivalent of a million kilometres of fibre!
The internet is fast. Computers are fast. One second is enough time for my machine to download 10M data points and render them into an interactive plot.
In my mind, anyone doing UI development and seeing user interactions taking over 1 second should be asking themselves "did the user just try to operate on more than 10^6 of something?" and if the answer is no, start operating under the assumption that they've made a mistake.
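The Moon and fibre comparisons can be checked with a few lines of Python. This is just the arithmetic, using the average Earth-Moon distance and assuming light in fibre travels at about 200,000 km/s (roughly two thirds of c).

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum
MOON_DISTANCE_KM = 384_400         # average Earth-Moon distance
FIBRE_SPEED_KM_S = 200_000         # assumption: ~2/3 of c in glass

page_load_s = 5.0  # the page load time reported in the thread

# Time for light to go to the Moon and back once.
moon_round_trip_s = 2 * MOON_DISTANCE_KM / SPEED_OF_LIGHT_KM_S
# How many such round trips fit into one page load.
moon_round_trips = page_load_s / moon_round_trip_s
# How much fibre light would traverse in the same time.
fibre_km = page_load_s * FIBRE_SPEED_KM_S

print(f"Moon round trip: {moon_round_trip_s:.2f} s")
print(f"Round trips per page load: {moon_round_trips:.2f}")
print(f"Equivalent fibre length: {fibre_km:,.0f} km")
```

One round trip is about 2.56 seconds, so a 5-second load is a little under two round trips, and under the stated fibre-speed assumption it corresponds to one million kilometres of fibre.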
What gateways have you been using?! That's a long, long way off on the modern internet. Assuming you mean gateways as in the lines you'd see on a traceroute, more typical might be ~2-5ms on a home router, ~0.5-1.0ms upstream.
Ah nice, I didn't realise you meant application proxies/gateways. Network ones are so quick due to their ASICs etc!
I personally would still say 50ms is super, super slow for an application gateway - a well designed one using e.g. nginx/openresty, lambda@edge, or simply writing another application server etc can easily do that job with an addition of <0.1ms processing time (assuming no additional network calls or heavy work), and maybe 0.3ms for additional connection establishment if it hasn't been optimised to use persistent connections.
If it is e.g. making a DB request to check auth, I would highlight that this _is_ backend processing time, not inherent or unoptimisable overhead. e.g. it's totally feasible to do auth checks without making any async calls, just need a bit of crypto and to allocate some memory for tracking revoked tokens - does add a bit of complexity, but likely worth it for the super hot path.
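As a minimal sketch of what "a bit of crypto and some memory for revoked tokens" could look like, here is a Python example using an HMAC-signed token and an in-process revocation set. The token format, `SECRET_KEY`, `REVOKED_TOKEN_IDS`, `issue_token` and `verify_token` are all hypothetical names invented for illustration; a production system would more likely use an established format such as JWT and a vetted library.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"demo-secret"  # hypothetical; load from secure config in practice
REVOKED_TOKEN_IDS = set()    # kept in process memory, refreshed out of band


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode()


def issue_token(token_id: str, user: str, ttl_s: int = 3600) -> str:
    """Build a 'payload.signature' token signed with HMAC-SHA256."""
    payload = json.dumps(
        {"id": token_id, "user": user, "exp": time.time() + ttl_s}
    ).encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_token(token: str):
    """Return the claims if the token is valid, else None. No network or DB call."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    if claims["exp"] < time.time() or claims["id"] in REVOKED_TOKEN_IDS:
        return None
    return claims
```

Usage would be along the lines of `token = issue_token("t1", "alice")` followed by `verify_token(token)` on each request. The whole check is a hash computation plus a set lookup, so it stays on the CPU; how the revocation set is kept up to date across servers is the extra complexity mentioned above and is not shown here.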
BFFs would not really need to add anything beyond ~1ms or so, but you do hit the lowest common denominator - in that you have to wait for the slowest thing to complete, even if everything is happening in parallel.
BFFs definitely help by simplifying client-side code, but at the cost of increased overall latency and potentially of the resilience which could be achieved by decoupling unrelated components.
As such, I wouldn't expect the Atlassian products to use BFF patterns - for them it's better to throw 1k requests down a single HTTP 2/3 connection and render each part of the page when it's available. I have heard their FEs are very complex, which I think would probably support that assessment.
Gateways can add a lot of functionality. Even Graphql can be used as a gateway.
It's not all "dumb forwarding" and I would be very surprised if you found any sub-ms benchmarks.
Amazon has a one million dollar award if you get the page to load under 10 ms. So that's what you are expecting by default on a saas in your previous comment.
That just says that Ocelot consumes quite a bit of your latency budget. Maybe the features it brings are worth it to you, but it's def not anywhere close to the limit of what's achievable.
e.g. Envoy (which replaced Ocelot in Microsoft's .net microservice reference architecture) has a significantly lower latency cost (1)
Your reference to an Amazon reward is interesting as it's quite easy to get pages to load under 10ms in the right conditions. Perhaps you can provide a link to more information?
At the end of the day, all that middleware type stuff is part of your backend - it is not inherent overhead.
If you want to really focus on performance, you can choose not to use anything like that off the shelf and do it all in a fraction of a millisecond. It actually isn't difficult - you just need to not get stuck into a dependency on something heavy.
For my company's backend, our entire middleware stack incl auth checks is around the 1-2ms level including hitting a DB server to check for token revocation. That's all there is between the end user and our application code, plus network latency. We didn't do anything particularly clever or special. But we didn't use any frameworks or heavy magic products - just Go's net/http, the chi router and Lambda@Edge.