Perf fixes #8399

Merged
amoss merged 12 commits into netdata:master from amoss:perf-fixes on Mar 14, 2020

@amoss amoss commented Mar 13, 2020

Summary

Initial work on fixing the performance problems in LWS / MQTT. Reduces the visible latency of the initial on-connect payloads from 120s to 10-20s. May increase CPU usage. Fixes the encoding errors in the JSON messages on the link, allowing the alarms to work.

Component Name

ACLK

Test Plan
  • Run against the cloud local dev env and ensure that performance is no worse than it was.
  • Run the paho inspection container inside aclk/testing/ and verify that messages are arriving at VerneMQ.
Additional Information

Although this is not a complete fix for the performance problems, it does improve performance and allows other PRs to be merged on top.

@amoss amoss requested review from underhood and stelfrag Mar 13, 2020
@amoss amoss requested review from cosmix and ktsaou as code owners Mar 13, 2020
@squash-labs squash-labs bot commented Mar 13, 2020

Manage this branch in Squash

Test this branch here: https://amossperf-fixes-bionf.squash.io
@amoss amoss removed request for cosmix and ktsaou Mar 13, 2020
@amoss amoss requested review from Ferroin, knatsakis, ncmans and prologic as code owners Mar 13, 2020
@amoss amoss removed request for Ferroin, knatsakis, ncmans and prologic Mar 13, 2020
amoss added 12 commits Mar 12, 2020
Currently disabled as my local setup has blown up and I need to gently repair the local dev env
with a blow torch.
The main thread dumps a 2MB payload for writing to the broker every time the write queue is empty.
Currently we are thrashing the LWS event loop to see what it does, and interleaving with the
mosquitto event loop. It takes between 40-60s for the payload to arrive at the inspection point.
The write-queue in the agent stays full for this period.
Message disassembly before transmission. It looks as if the MQTT broker reassembles the messages from the
stream of independent websockets messages. This would make sense.
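
The disassembly described above can be pictured as slicing one large payload into fixed-size fragments that the receiver concatenates back together. The following is a minimal sketch, not the agent's code: `split_payload`, `reassemble`, and the 4096-byte fragment size are hypothetical names and values chosen for illustration.

```python
from typing import Iterator, List


def split_payload(payload: bytes, fragment_size: int = 4096) -> Iterator[bytes]:
    """Yield consecutive slices of payload, each at most fragment_size bytes."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    for offset in range(0, len(payload), fragment_size):
        yield payload[offset:offset + fragment_size]


def reassemble(fragments: List[bytes]) -> bytes:
    """Concatenate fragments in arrival order, as a receiver might."""
    return b"".join(fragments)


# Hypothetical payload; each fragment would be sent as its own message.
payload = b"x" * 10_000
fragments = list(split_payload(payload))
```

Smaller fragments let the sender hand work to the event loop in pieces instead of blocking on one multi-megabyte write, at the cost of more per-message overhead.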

There is a massive delay in the lws library between asking for the writable callback and actually receiving
it. This is to do with the size of the SND_BUFFER option on the socket. This fix is a bit CPU heavy as it
cancels the poll() to force a reattempt.

This needs to be properly integrated into our poll-loop.
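
The idea of interrupting a blocked poll so that the loop re-checks for pending work can be sketched outside libwebsockets with a wake-up socket registered in the poll set. This illustrates the general technique only and is not the change made in this PR; `wait_until_woken` and its parameters are hypothetical.

```python
import selectors
import socket
import threading
import time


def wait_until_woken(timeout: float, wake_after: float) -> float:
    """Block in select() for up to `timeout` seconds, returning early when
    another thread writes to the wake-up socket. Returns the seconds waited."""
    selector = selectors.DefaultSelector()
    wake_reader, wake_writer = socket.socketpair()
    selector.register(wake_reader, selectors.EVENT_READ)

    # Stand-in for "a writer has queued data and wants service now".
    timer = threading.Timer(wake_after, wake_writer.send, args=(b"\0",))
    timer.start()

    started = time.monotonic()
    events = selector.select(timeout)
    waited = time.monotonic() - started

    if events:
        wake_reader.recv(1)  # drain the wake-up byte

    timer.join()
    selector.close()
    wake_reader.close()
    wake_writer.close()
    return waited


waited = wait_until_woken(timeout=5.0, wake_after=0.05)
```

Because the wake-up descriptor sits in the same poll set as the real sockets, the loop sleeps until there is something to do rather than being cancelled repeatedly, which is one way the CPU cost mentioned above could be reduced.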
@amoss amoss force-pushed the amoss:perf-fixes branch from 887081e to 3b7496f Mar 13, 2020
@amoss amoss merged commit 87a0559 into netdata:master Mar 14, 2020
33 of 36 checks passed
  • @github-actions: Build (alpine:3.11)
  • @github-actions: Checksums
  • @github-actions: eslint
  • @github-actions: Build (alpine:3.10)
  • @github-actions: Build (alpine:3.9)
  • @github-actions: Build (archlinux:latest)
  • @github-actions: Build (centos:7)
  • @github-actions: Build (centos:6)
  • @github-actions: Build (debian:bullseye)
  • @github-actions: Build (debian:buster)
  • @github-actions: Build (debian:stretch)
  • @github-actions: Build (fedora:31)
  • @github-actions: Build (fedora:30)
  • @github-actions: Build (fedora:29)
  • @github-actions: Build (ubuntu:20.04)
  • @github-actions: Build (ubuntu:19.10)
  • @github-actions: Build (ubuntu:19.04)
  • @github-actions: Build (ubuntu:18.04)
  • @github-actions: Build (ubuntu:16.04)
  • @github-actions: Dashboard
  • @github-actions: shellcheck
  • @github-actions: Dist
  • @github-actions: yamllint
  • @netlify: Header rules - netdata
  • @netlify: Mixed content - netdata
  • @netlify: Redirect rules - netdata
  • @lgtm-com: LGTM analysis: JavaScript. No code changes detected
  • @netlify: Pages changed - netdata. 4 new files uploaded
  • Codacy/PR Quality Review. Up to standards. A positive pull request.
  • @lgtm-com: LGTM analysis: C/C++. No new or fixed alerts
  • @lgtm-com: LGTM analysis: Python. No new or fixed alerts
  • @squash-labs: Squash environment: netdata. Successful in 3.66 minutes - Received a success response
  • @travis-ci: Travis CI - Pull Request. Build Passed
  • @wip: WIP. Ready for review
  • license/cla. Contributor License Agreement is signed.
  • @netlify: netlify/netdata/deploy-preview. Deploy preview ready!
@amoss amoss deleted the amoss:perf-fixes branch Mar 14, 2020
@amoss amoss linked an issue that may be closed by this pull request Mar 14, 2020
Saruspete added a commit to Saruspete/netdata that referenced this pull request May 21, 2020
Add an inspection point for VerneMQ in the local dev env. Remove the bottleneck in sending websocket messages, at the expense of increased CPU-load. Fixed the message encoding. Added support for stress testing - it is still enabled in the main loop so will fire stress-testing payloads when the ACLK is established.

Next patch will integrate the socket polling properly to reduce the CPU overhead and remove the stress testing payloads.