Mesh rc 0.4.3 #1575

Closed · wants to merge 92 commits

Conversation

@nodiesBlade (Contributor) commented Jul 26, 2023

Description

We should invalidate sessions based on certain new errors, both before and after a relay is handled, in order to serve as few free relays as possible.

Here are all the invalid-session errors we should account for:

```go
func (s Session) Validate(node sdk.Address, app appexported.ApplicationI, sessionNodeCount int) sdk.Error {
	// validate chain
	if len(s.SessionHeader.Chain) == 0 {
		return NewEmptyNonNativeChainError(ModuleName)
	}
	// validate sessionBlockHeight
	if s.SessionHeader.SessionBlockHeight < 1 {
		return NewInvalidBlockHeightError(ModuleName)
	}
	// validate the app public key
	if err := PubKeyVerification(s.SessionHeader.ApplicationPubKey); err != nil {
		return err
	}
	// validate app corresponds to appPubKey
	if app.GetPublicKey().RawString() != s.SessionHeader.ApplicationPubKey {
		return NewInvalidAppPubKeyError(ModuleName)
	}
	// validate app chains
	chains := app.GetChains()
	found := false
	for _, c := range chains {
		if c == s.SessionHeader.Chain {
			found = true
			break
		}
	}
	if !found {
		return NewUnsupportedBlockchainAppError(ModuleName)
	}
	// validate sessionNodes
	err := s.SessionNodes.Validate(sessionNodeCount)
	if err != nil {
		return err
	}
	// validate node is of the session
	if !s.SessionNodes.Contains(node) {
		return NewInvalidSessionError(ModuleName)
	}
	return nil
}
```

This was inspired by seeing the Portal send us relays with an invalid app: we were serving them for free (though not retrying to send them to the full node).
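A minimal sketch of the idea, not the PR's actual implementation: cache the sessions the mesh node has dispatched and evict one as soon as validation fails with an error that can never heal, so no further free relays are served for it. All names here (`sessionKey`, `sessionStore`, the placeholder code constants) are assumptions; the real error codes live in pocket-core.

```go
package mesh

import "sync"

// Placeholder codes standing in for the sdk.Error codes returned by
// Session.Validate above; the actual values are defined in pocket-core.
const (
	codeInvalidAppPubKey uint32 = iota + 1
	codeUnsupportedBlockchainApp
	codeInvalidSession
)

// sessionKey identifies a dispatched session.
type sessionKey struct {
	appPubKey string
	chain     string
	height    int64
}

// sessionStore is a concurrency-safe cache of known-good sessions.
type sessionStore struct {
	mu       sync.RWMutex
	sessions map[sessionKey]struct{}
}

// shouldInvalidate reports whether a validation failure is permanent for this
// session (wrong app key, unsupported chain, node not in the session) rather
// than transient, and therefore worth evicting the cache entry for.
func shouldInvalidate(code uint32) bool {
	switch code {
	case codeInvalidAppPubKey, codeUnsupportedBlockchainApp, codeInvalidSession:
		return true
	default:
		return false
	}
}

// Invalidate evicts a session so subsequent relays for it are rejected
// immediately instead of being served for free.
func (s *sessionStore) Invalidate(k sessionKey) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.sessions, k)
}
```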

Fixes a high memory consumption issue that is also part of pokt-network#1457.
Under a high request load (1,000 rps or more), RAM usage would balloon to roughly 40 GB.
After the pokt-network#1457 fix with the worker pool, the node stays under 14 GB of RAM in my local tests.
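A sketch of the bounding mechanism, assuming the github.com/alitto/pond pool that this PR later bumps to 1.8.3; the pool sizes and the drop-when-full policy here are illustrative, not the PR's actual values.

```go
package main

import (
	"fmt"
	"time"

	"github.com/alitto/pond"
)

func main() {
	// At most 100 concurrent workers and 1000 queued tasks: instead of one
	// goroutine per incoming relay (unbounded memory growth), excess work is
	// queued or rejected, which is what keeps RAM flat under load.
	pool := pond.New(100, 1000, pond.MinWorkers(10))
	defer pool.StopAndWait()

	for i := 0; i < 5000; i++ {
		relay := i
		// TrySubmit returns false when the queue is full rather than blocking.
		if !pool.TrySubmit(func() {
			time.Sleep(10 * time.Millisecond) // stand-in for relay handling
			_ = relay
		}) {
			fmt.Println("queue full, rejecting relay", relay)
		}
	}
}
```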
* Fixed the RPC timeout being interpreted as seconds instead of milliseconds
* Updated mesh.md to document the new cache configurations
* Updated mesh.md to list /v1/private/mesh/session as a required whitelisted endpoint/path
* Fixed /v1/private/mesh/updatechains to properly update chains both in memory and on disk
* Added hot reload for servicer private key files (add & remove); see the sketch after this list
  * on add: turn on the checks and start allowing relays for it
  * on remove: stop receiving new relays and consume all the pending relays in the queue
* Version bump
* Enhanced the log message about missing sessions
* Version Bump
…rivate key is removed after it has been supported by the mesh node.
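A rough sketch of the reload loop mentioned above, under assumed names: poll the key file on an interval, diff against the current set, and hand added/removed addresses to a callback that starts checks for new servicers and drains the queues of removed ones. This is illustrative, not the PR's code; `loadKeys` and `watchServicerKeys` are hypothetical helpers.

```go
package mesh

import (
	"os"
	"strings"
	"time"
)

// loadKeys is an assumed helper: one servicer key/address per line.
func loadKeys(path string) (map[string]bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	keys := make(map[string]bool)
	for _, line := range strings.Split(string(data), "\n") {
		if line = strings.TrimSpace(line); line != "" {
			keys[line] = true
		}
	}
	return keys, nil
}

// watchServicerKeys re-reads the key file on every tick and reports the diff.
func watchServicerKeys(path string, interval time.Duration, apply func(added, removed []string)) {
	current := map[string]bool{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		next, err := loadKeys(path)
		if err != nil {
			continue // keep the last good set on read errors
		}
		var added, removed []string
		for k := range next {
			if !current[k] {
				added = append(added, k) // turn on checks, start allowing it
			}
		}
		for k := range current {
			if !next[k] {
				removed = append(removed, k) // stop receiving, drain pending relays
			}
		}
		if len(added) > 0 || len(removed) > 0 {
			apply(added, removed)
		}
		current = next
	}
}
```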

* Version Bump
…ral solution)

* Fixed an error that panicked the process when loading a servicer_url without an http/https scheme; now the error is properly reported (see the sketch after this list).
* Added a manual cron job to compact the relays database every hour.
* Removed a log2.Fatal that was crashing the process.
* relay_cache_background_sync_interval was not used
* relay_cache_background_compaction_interval was not used
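A minimal sketch of the servicer_url scheme check, using only the standard library; the helper name `validateServicerURL` is an assumption, not the PR's actual function. Parsing the URL up front lets the node return a descriptive error instead of panicking on a later dereference.

```go
package mesh

import (
	"fmt"
	"net/url"
)

// validateServicerURL rejects a servicer_url without an http/https scheme
// before it is ever used, so startup fails with a clear message.
func validateServicerURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("servicer_url %q is not a valid URL: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("servicer_url %q must use an http or https scheme", raw)
	}
	if u.Host == "" {
		return fmt.Errorf("servicer_url %q is missing a host", raw)
	}
	return nil
}
```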

Added:
* hot_reload_interval: set to 0 to turn off hot reload of chains/servicers; otherwise it is the interval in milliseconds at which the files are re-checked

Updated:
* Servicer health checks now run every 60s (was 30s) - future: will be configurable through config.json
* Old sessions are now evaluated for removal every 30m (was 30s) - future: will be configurable through config.json
* Updated the config.json example in the docs.
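A sketch of how these knobs could surface in config.json. Only hot_reload_interval is actually named in this changelog; the other two fields are assumptions standing in for the "future: will be configurable" items above.

```go
package mesh

// MeshIntervals is a hypothetical config struct; field names other than
// hot_reload_interval are illustrative, not part of the current config.
type MeshIntervals struct {
	// Milliseconds between re-checks of the chains/servicers files;
	// 0 turns hot reload off entirely.
	HotReloadInterval int64 `json:"hot_reload_interval"`
	// Assumed future fields (the intervals are currently fixed at 60s and 30m).
	ServicerHealthCheckIntervalSeconds int64 `json:"servicer_health_check_interval_seconds,omitempty"`
	SessionCleanupIntervalMinutes      int64 `json:"session_cleanup_interval_minutes,omitempty"`
}
```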

Removed:
* The manual relays-db compaction job; we received reports that it corrupted the relays database when run at the same time as the background compaction configured by relay_cache_background_compaction_interval
… from storage in any case after they succeed or fail.

Fixed a log that was printing the node public key instead of the app public key.
Added a different key format.
Refactored connectivity checks.
Refactored the internal node/servicer structure of mesh to reduce the number of worker/cron instances.
Refactored chains/keys reload.
Added dynamic resizing of FullNode workers on servicers change.
Updated servicers reload to only modify the maps when something is new or removed.
…e and better readability of the code without so many casts.

Refactored fullNode.Servicer to be a map instead of a slice.
Further enhanced the logs and bootstrap-time information.
Added metrics config support.
Refactored the code to split it into files.
Bumped pond version to 1.8.3 (patch).
Cleaned up the code.
Updated config so the RPC timeout for chains, client, and pocket node calls can each be set to a different value; see the sketch after this list.
Ensured that the HTTP response body is read even on errored requests so connections can be reused.
Enhanced chains reload logs.
Enhanced startup logs.
… so many edge cases and possible infinite goroutine spams.
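A sketch of both timeout-and-reuse items above, with assumed names and timeout values: a separate http.Client per call type, and a helper that always drains and closes the response body. In Go, an unread body prevents the transport from returning the keep-alive connection to the pool.

```go
package mesh

import (
	"io"
	"net/http"
	"time"
)

// Per-call-type clients; the timeout values are illustrative assumptions.
var (
	chainsClient = &http.Client{Timeout: 10 * time.Second} // backend chain calls
	appClient    = &http.Client{Timeout: 5 * time.Second}  // client-facing calls
	nodeClient   = &http.Client{Timeout: 30 * time.Second} // pocket node calls
)

// do executes a request and guarantees the body is fully read and closed even
// for non-2xx responses, so the underlying connection can be reused.
func do(c *http.Client, req *http.Request) ([]byte, int, error) {
	resp, err := c.Do(req)
	if err != nil {
		return nil, 0, err // transport error: there is no body to drain
	}
	defer func() {
		io.Copy(io.Discard, resp.Body) // drain leftovers for connection reuse
		resp.Body.Close()
	}()
	body, err := io.ReadAll(resp.Body)
	return body, resp.StatusCode, err
}
```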

Added an optional name property to nodes; if not set, the hostname of the node URL is used.
Added minWorker, maxWorker, and maxCapacity to the Prometheus metrics collectors (see the sketch after this list).
Refactored the minWorker, maxWorker, and maxCapacity options in config.
Bumped the defaults to more real-world values.
Updated docs.
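A sketch of exposing the pool bounds to Prometheus, assuming pond v1.8.x accessors (MinWorkers/MaxWorkers/MaxCapacity/RunningWorkers); the metric names are illustrative, not the PR's actual ones.

```go
package mesh

import (
	"github.com/alitto/pond"
	"github.com/prometheus/client_golang/prometheus"
)

// registerPoolMetrics publishes the worker-pool configuration and live worker
// count as gauges that are sampled on every scrape.
func registerPoolMetrics(pool *pond.WorkerPool) {
	prometheus.MustRegister(
		prometheus.NewGaugeFunc(prometheus.GaugeOpts{Name: "mesh_pool_min_workers"},
			func() float64 { return float64(pool.MinWorkers()) }),
		prometheus.NewGaugeFunc(prometheus.GaugeOpts{Name: "mesh_pool_max_workers"},
			func() float64 { return float64(pool.MaxWorkers()) }),
		prometheus.NewGaugeFunc(prometheus.GaugeOpts{Name: "mesh_pool_max_capacity"},
			func() float64 { return float64(pool.MaxCapacity()) }),
		prometheus.NewGaugeFunc(prometheus.GaugeOpts{Name: "mesh_pool_running_workers"},
			func() float64 { return float64(pool.RunningWorkers()) }),
	)
}
```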
jorgecuesta and others added 27 commits July 11, 2023 19:11
…k session height 100589.

Removed newlines (\n) from the errors produced by the pocketcore code. These made it difficult to use tooling like Loki, which collects the text before a newline as a single entry.
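A one-liner sketch of that fix: collapse multi-line error text into a single line before logging. The helper name `flattenError` is an assumption.

```go
package mesh

import "strings"

// flattenError rewrites a multi-line error as one line so line-oriented
// collectors like Loki ingest the whole error as a single entry.
func flattenError(err error) string {
	return strings.ReplaceAll(err.Error(), "\n", " ")
}
```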
… will be done by GetSession a few lines below.
…orage. Fixed a typo. Bump version to RC-0.4.2
@reviewpad bot added the labels "large (Pull request is large)" and "waiting-for-review" on Jul 26, 2023