Commit graph

20 commits

Author SHA1 Message Date
lawwong1
d2fd7b9ba9
merge retry mechanism change from gorealis v1 to gorealis v2 (#21) 2023-01-26 13:36:40 -08:00
Renán I. Del Valle
02710e5434
Moving repository to aurora-scheduler organization. (#2)
gorealis v2 will now live in the aurora-scheduler organization
2020-02-19 11:40:40 -08:00
Renan DelValle
55cf9bcb70 Adding more fine grained controls to retry mechanism. Retry mechanism may now be configured to not retry if an error is hit or to specifically stop retrying if a timeout error is encountered. 2019-09-25 17:20:30 -07:00
Renan DelValle
04471c6918 Adding trace logging. 2019-09-25 17:20:30 -07:00
Renan DelValle
461b23400c
V2.0 thrift repository migration and cleanup (#98)
* Remove unnecessary files from the thrift repository that come along with the go library.

* Updating thrift generated code to be 0.12.0 final generated code.

* Remove git.apache.org dependency in vendor folder.

* Migrating from git.apache.org/thrift.git to github.com/apache/thrift

* Upgrading dep (although it will not work now that imports are using mod format, it allows for users to easily fix this with a replacement of the import path).

* Upgrading mod dependencies for Thrift to point to github.com location of the repository.

* Bug fix for Thermos Payload generation relating to the GPU being set.
2019-02-19 16:40:41 -08:00
Renan DelValle
51597ecb32
Changing paths to refer to gorealis v2 in order for dependencies to be correct. 2018-12-27 10:09:22 -08:00
Renan DelValle
e4e8a1c0b3
Adding a check for 401. This reduces the retries on the end to end test and fails fast when a wrong/unathorized username and password are provided to interact with Aurora. 2018-12-18 17:14:48 -08:00
Renan DelValle
76300782ba
Renaming RealisClient to Client to avoid stuttering. Moving monitors under Client. Making configuration object private. Deleted legacy code to generate configuration object. 2018-12-08 08:57:15 -08:00
Renan DelValle
a23bd1b2cc
Shedding interface because there is no good reason to have it. 2018-11-22 12:22:22 -08:00
Renan DelValle
a09a18ea3b
Stop retrying if we find a permanent url error. (#85)
* Detecting if the transport error was not temporary in which case we stop retrying. Changed bug where get results was being called before we checked for an error.

* Adding exception for EOF error. All EOF errors will be retried.

* Addressing race conditions that may happen when client is closed or connection is re-established.

* Adding documentation about how this particular implemantion of the realis client uses retries in scenarios where a temporary error is found.
2018-11-01 17:00:03 -07:00
Renan DelValle
037c636d6d
Retry switch fallthrough fix and create multiple tests (#77)
* Bugfix: switch statements were missing fallthrough statement thus making them retry non-retriable errors. Using a list to catch cases now.

* Adding tests for CreateService, createService when the executor doesn't exist, and createJob when the executor doesn't exist. Renamed Pulse test to reflect that it's using CreateService instead of CreateJob.

* Repsonse propagate back up to caller for context for CreateJob, CreateService, and StartJobUpdate.

* Deleting PR template as Travis CI takes care of running tests and formatting tests now.
2018-10-04 10:47:08 -07:00
Renan DelValle
4f5766b443
Misc. bug fixes and addition of debug logging (#61)
* Fixing possible race condition when passing backoff around as a pointer.

* Adding a debug logger that is turned off by default. If debug is turned on, but a logger has not been assigned, a default logger that will print to STDOUT will be created.

* Making Mutex a pointer so that there's no chance it can accidentally be copied.

* Removing a leftover helper function from before we changed how we configured the client.

* Minor changes to demonstrate how a logger can be used in conjunction to debug mode in the sample client.
2018-04-13 11:03:29 -07:00
Renan DelValle
acc54c1015
Adding logging when there is a client error. 2018-03-05 11:20:39 -08:00
Renan DelValle
3d62df1684
* Errors have been refactored.
* ZK retries have been cleaned up. We will now retry after every error
EXCEPT when we have a badly formed path.
* ZK library has been reworked with optional arguments pattern to not be
so intertwined with the cluster.json file.
* Timeout error has been re-implemented as RetryError. RetryError
behaves like a Timeout error but is used exclusively to add more context
privately. This allows us to have unit tests that check our retry
mechanism is actually retrying.
* Additional logging has been added to retry mechanisms as well as to
the Zookeeper library we use.
2018-03-03 14:08:04 -08:00
Renan DelValle
a43dc81ea8
Simplifying retry mechanism for Thrift Calls (#56)
* Deleting permament error as it doesn't make sense. Just return a plain old error and that will be considered permanent.

* Removing double closure at as it's unmaintainable and can be error prone. Separated back offs into a generic one and a thrift call specific one.

* ZK leader finder now returns a temporary error instead of constantly no leader found and quitting. It could be that the leader info is being propagated so it's worth trying another time.

* Adding more logging to the retry.

* Wrapping lock and unlock in an anonymous function so that we can use defer on unlock such that it is called in the case of a panic.
2018-02-15 15:16:39 -08:00
Renan DelValle
64948c3712
Backoff mechanism fix (#54)
* Fixing logic that can lead to nil error being returned and retry stopping early.

* Fixing possible code path that may lead to an incorrect nil error.
2018-02-06 12:44:27 -08:00
Renan DelValle
a941bcb679
Thread safety, misc fixes, and refactoring (#51)
* Changing incorrect license in some source files.

* Changing CreateService to mimic CreateJob by setting the batch size to the instance count.

* Changing Getcerts to GetCerts to match the style of the rest of the codebase.

* Overhauled error handling. Backoff now recognizes temporary errors and continues to retry if it finds one.

* Changed thrift function call wrapper to be more explicitly named and to perform more safety checks.

* Moved Jitter function from realis to retry.

* API code is now more uniform and follows a certain template.

* Lock added whenever a thrift call is made or when a modification is done to the connection. Note that calling ReestablishConn externally may result in some race conditions. We will move to make this function private in the near future.

* Added test for Realis session thread safety. Tested ScheduleStatus monitor. Tested monitor timing out.

* Returning nil whenever there is an error return so that there are no ambiguities.

* Using defer with unlock so that the lock is still released if a panic is invoked.
2018-01-21 19:30:01 -08:00
Renan DelValle
b2ffb73183
Introducing temporary errors. Refactored reestablish connection code … (#50)
* Introducing temporary errors. 

* Refactored reestablish connection code to use NewClient.

* Added reestablish connection test to end to end tests.
2018-01-16 14:35:01 -08:00
Sivaram Mothiki
72b746e431 use exponential back off func from realis lib (#39)
* use exponential back off func from realis lib

* remove exponential backoffs from monitors

* dont compare for retry errors
2017-11-04 15:06:26 -07:00
Renan DelValle
0d3126c468 New API to set hosts to DRAINING. Cleaned up some of the client code, and fixed a few error printing bugs. 2017-09-22 12:55:03 -07:00