The Wayback Machine - https://web.archive.org/web/20250513193310/https://github.com/minio/minio/pull/14397

Implement netperf for "mc support perf net" #14397


Merged
2 commits merged into minio:master on Mar 8, 2022

Conversation

krishnasrinivas
Contributor

Description

Support for a mesh-style throughput test

Motivation and Context

How to test this PR?

mc support perf net ALIAS

Related mc PR: minio/mc#3998

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Optimization (provides speedup with no functional changes)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • Fixes a regression (If yes, please add commit-id or PR # here)
  • Documentation updated
  • Unit tests added/updated

@krishnasrinivas krishnasrinivas force-pushed the mperf2 branch 3 times, most recently from 6850f68 to 42de653 on February 24, 2022 07:28
Contributor

@klauspost klauspost left a comment


Besides the points mentioned, it should detect whether a speedtest is already running and not start a new one.

@krishnasrinivas krishnasrinivas force-pushed the mperf2 branch 2 times, most recently from 14ef922 to 910d68e on February 24, 2022 20:42
@klauspost
Contributor

klauspost commented Feb 25, 2022

Here is an example of how stats can be collected safely.

When the request comes in, the server decides on a random ID.

The first request that comes with the ID to any server starts recording bytes.

I added some timekeeping for the first/last connection; I'm not sure how much it is needed.

This is not order dependent, and it ensures you don't have multiple tests running at the same time.

var globalNetperfRX netperfStats

// Used for collecting stats for netperf
type netperfStats struct {
	received  uint64
	id        uint64
	first     int64
	last      int64
	connected int32
	mu        sync.Mutex
}

// recorder will return a counter linked to the id.
// If the ID is unknown and no other counters are running,
// a new one will be initialized.
// The returned pointer must be atomically incremented.
// If nil is returned, another recording is already in progress.
// Whenever a non-nil value is returned, release() must be called.
func (n *netperfStats) recorder(id uint64) *uint64 {
	n.mu.Lock()
	defer n.mu.Unlock()
	if id == n.id {
		atomic.AddInt32(&n.connected, 1)
		return &n.received
	}
	if atomic.LoadInt32(&n.connected) > 0 {
		return nil
	}
	// Replace with new, no-one is connected.
	n.id = id
	n.received = 0
	n.connected++
	n.first = time.Now().UnixNano()
	n.last = 0
	return &n.received
}

// getRecorded returns statistics recorded on the specific ID.
func (n *netperfStats) getRecorded(id uint64) (uint64, time.Duration, error) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.id != id {
		return 0, 0, errors.New("id not found")
	}
	if n.connected > 0 {
		return 0, 0, errors.New("clients are still connected")
	}
	return n.received, time.Duration(n.last-n.first) * time.Nanosecond, nil
}

// release a connection from the stats.
func (n *netperfStats) release() {
	atomic.AddInt32(&n.connected, -1)
	atomic.StoreInt64(&n.last, time.Now().UnixNano())
}

// DevNull - everything goes to io.Discard
func (s *peerRESTServer) DevNull(w http.ResponseWriter, r *http.Request) {
	if !s.IsValid(w, r) {
		s.writeErrorResponse(w, errors.New("invalid request"))
		return
	}
	id, err := strconv.ParseUint(r.Form.Get("id"), 10, 64)
	if err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}
	rx := globalNetperfRX.recorder(id)
	if rx == nil {
		w.WriteHeader(http.StatusConflict)
		return
	}
	defer globalNetperfRX.release()
	for {
		n, err := io.CopyN(io.Discard, r.Body, 128*humanize.KiByte)
		atomic.AddUint64(rx, uint64(n))
		if err != nil {
			break
		}
	}
}

@krishnasrinivas
Contributor Author

Here is an example of how stats can be collected safely.
When the request comes in, the server decides on a random ID.
...

@klauspost this implementation/code is not worth it. It covers the start of the test but not the end: after the 10 second timeout we have to wait until peers stop sending data if we want to be accurate toward the end of the test. That extends the RX duration beyond 10 seconds, so TX would measure 10 seconds while RX measures more. Adjusting the calculation for these timing variations adds further complexity.

The "net perf" test's purpose is to get a quick rough estimate of the network capability to rule out glaring network issues. The code implemented in this PR is good enough as proven by the tests.

@klauspost
Contributor

@krishnasrinivas The timing is purely optional, but it will at least record when the clients had a hard time starting at the same time.

Starting in sync is very hard. Warp gives an absolute start time 3 seconds in the future, but that requires the clocks to be in sync, which certainly isn't perfect, though we check that they are within 1 second of absolute time.

The main change is to make the current system fault tolerant: not risk multiple tests running at once, and ensure that stats aren't dropped in high-latency cases. We could also add a max concurrent count, and maybe average only over that. Let me give that a shot.

@klauspost
Contributor

klauspost commented Feb 28, 2022

@krishnasrinivas This solves the timing problem by only recording between last connected and first disconnected.

var globalNetperfRX netperfStats

// Used for collecting stats for netperf
type netperfStats struct {
	id               uint64 // id of current test
	rxCounter        uint64 // temporary counter.
	received         uint64 // received between lastConnected and firstDisconnect
	lastConnected    int64  // time when last client connected.
	firstDisconnect  int64  // time when first client disconnected.
	connected        int32  // n connected.
	someDisconnected int32  // 0 when none have disconnected yet, 1 when any has disconnected.
	mu               sync.Mutex
}

// recorder will return a counter linked to the id.
// If the ID is unknown and no other counters are running,
// a new one will be initialized.
// The returned pointer must be atomically incremented.
// If nil is returned, another recording is already in progress.
// Whenever a non-nil value is returned, release() must be called.
func (n *netperfStats) recorder(id uint64) *uint64 {
	n.mu.Lock()
	defer n.mu.Unlock()
	if id == n.id {
		atomic.AddInt32(&n.connected, 1)
		// Reset timing
		atomic.StoreUint64(&n.rxCounter, 0)
		atomic.StoreInt64(&n.lastConnected, time.Now().UnixNano())

		return &n.rxCounter
	}
	if atomic.LoadInt32(&n.connected) > 0 {
		return nil
	}
	// Replace with new, no-one is connected.
	n.id = id
	n.rxCounter = 0
	n.connected++
	n.lastConnected = time.Now().UnixNano()
	n.firstDisconnect = 0
	n.someDisconnected = 0
	return &n.rxCounter
}

// getRecorded returns statistics recorded on the specific ID.
func (n *netperfStats) getRecorded(id uint64) (uint64, time.Duration, error) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.id != id {
		return 0, 0, errors.New("id not found")
	}
	if n.connected > 0 {
		return 0, 0, errors.New("clients are still connected")
	}
	return n.rxCounter, time.Duration(n.firstDisconnect-n.lastConnected) * time.Nanosecond, nil
}

// release a connection from the stats.
func (n *netperfStats) release() {
	atomic.AddInt32(&n.connected, -1)
	// Store when first disconnects.
	if atomic.CompareAndSwapInt32(&n.someDisconnected, 0, 1) {
		r := atomic.LoadUint64(&n.rxCounter)
		atomic.StoreUint64(&n.received, r)
		atomic.StoreInt64(&n.firstDisconnect, time.Now().UnixNano())
	}
}

// DevNull - everything goes to io.Discard
func (s *peerRESTServer) DevNull(w http.ResponseWriter, r *http.Request) {
	if !s.IsValid(w, r) {
		s.writeErrorResponse(w, errors.New("invalid request"))
		return
	}
	id, err := strconv.ParseUint(r.Form.Get("id"), 10, 64)
	if err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}
	rx := globalNetperfRX.recorder(id)
	if rx == nil {
		w.WriteHeader(http.StatusConflict)
		return
	}
	defer globalNetperfRX.release()
	for {
		n, err := io.CopyN(io.Discard, r.Body, 128*humanize.KiByte)
		atomic.AddUint64(rx, uint64(n))
		if err != nil {
			break
		}
	}
}

Note that you can have firstDisconnect BEFORE lastConnected. In that case the test is useless.

@krishnasrinivas
Contributor Author

krishnasrinivas commented Mar 1, 2022

@klauspost my point is that the implementation in the PR is the best approach because of its code simplicity. I have added locking as well, so that only one instance of netperf runs at a time. My lab tests show that it reports correct throughput numbers. If there are any shortcomings in the implementation in the future, we can fix them at that point. netperf's purpose is not to give highly accurate numbers, but to give a quick rough estimate of the servers' network throughput capabilities.

@klauspost
Contributor

@krishnasrinivas "Works on my machine" is not good enough. We need to be reliable in many setups. What happens with 50 machines? 500? Implementing it is trivial: you only really need to send a random ID, and you get actual peak performance as well as an indication of data quality.

@krishnasrinivas krishnasrinivas force-pushed the mperf2 branch 2 times, most recently from a383848 to 85a052b on March 8, 2022 00:49
Member

@harshavardhana harshavardhana left a comment


rest of it looks good to me.

@krishnasrinivas krishnasrinivas force-pushed the mperf2 branch 2 times, most recently from e3bbf14 to c1f19f5 on March 8, 2022 01:00
@krishnasrinivas
Contributor Author

@klauspost can you check?

@harshavardhana harshavardhana requested a review from klauspost March 8, 2022 01:37
Contributor

@klauspost klauspost left a comment


👍🏼

Co-authored-by: Klaus Post <klauspost@gmail.com>
@minio-trusted
Contributor

Mint Automation

Test Result
mint-large-bucket.sh ✔️
mint-fs.sh ✔️
mint-gateway-s3.sh ✔️
mint-erasure.sh ✔️
mint-dist-erasure.sh ✔️
mint-gateway-nas.sh ✔️
mint-compress-encrypt-dist-erasure.sh ✔️
mint-pools.sh ✔️
Deleting image on docker hub
Deleting image locally

@harshavardhana harshavardhana merged commit 4d0715d into minio:master Mar 8, 2022