Skip to content

Reliability of client connectivity #597

@2mike2

Description

@2mike2

Hi,
I have been testing our solution based on Orleans run-time 1.0.8 on production. We are talking about azure deployment of 2 medium silos as worker roles, a default cluster membership configuration, web roles acting as grain clients in a separate cloud service deployment living in the same v-net with silos and thus able to communicate. First of all tests show very good results, great work guys! During the tests one of the silos, the process got restarted, "committed suicide", being identified as dead by membership protocol. According to the logs from the the kamikaze silo, the whole process took 90 seconds. During that time the second silo was alive and continued to process messages. However, during that time, our web role (the grain clients) experienced timeouts while issuing grain requests. I am aware of time aspects of how membership protocol detects and handles failures.
My question is :

  1. Will 3 silos instead of 2, provide a more robust "suspect dead voting" mechanism in such a case, which can result in a quicker "dead" voting and minimize the affect of the clients...
  2. Do you have any other suggestion how to treat clients in such case?

Thanks,

I am attaching a snippet from the silos logs during the talked period:

silo 0 (10.0.0.4:11111):

[2015-07-09 20:41:17.315 GMT 27 WARNING 101022 Runtime.Messaging.SiloMessageSender.PingSender/1 10.0.0.4:11111] Exception sending message to S10.0.0.6:11111:174157512. Message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022416: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg) Exc level 0: System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg)
[2015-07-09 20:41:17.330 GMT 25 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1305 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022416: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Retry count exceeded. Msg is: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022416: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping()
[2015-07-09 20:41:37.256 GMT 22 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:13.5172130.
[2015-07-09 20:41:37.412 GMT 23 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0583008.
[2015-07-09 20:41:37.818 GMT 23 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0607429.
[2015-07-09 20:41:37.834 GMT 23 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0607432.
[2015-07-09 20:41:38.131 GMT 21 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0607899.
[2015-07-09 20:41:41.026 GMT 23 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1306 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022722: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Exception getting a sending socket to endpoint S10.0.0.6:11111:174157512
[2015-07-09 20:41:41.026 GMT 23 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1307 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1023002: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:41.026 GMT) connection failure trying to reach target silo S10.0.0.6:11111:174157512. Going to drop Request msg 1023002 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:41.057 GMT 23 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1308 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1023331: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:41.026 GMT) connection failure trying to reach target silo S10.0.0.6:11111:174157512. Going to drop Request msg 1023331 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:41.057 GMT 25 WARNING 100628 MembershipOracle 10.0.0.4:11111] I should be Dead according to membership table (in TryToSuspectOrKill): entry = SiloAddress=S10.0.0.4:11111:174157138 Status=Dead.
[2015-07-09 20:41:41.073 GMT 25 ERROR 100627 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! I have been told I am dead, so this silo will commit suicide! I should be Dead according to membership table (in TryToSuspectOrKill): entry = SiloAddress=S10.0.0.4:11111:174157138 Status=Dead.
[2015-07-09 20:41:41.073 GMT 25 ERROR 100627 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! INTERNAL FAILURE! About to crash! Fail message is: I have been told I am dead, so this silo will commit suicide! I should be Dead according to membership table (in TryToSuspectOrKill): entry = SiloAddress=S10.0.0.4:11111:174157138 Status=Dead.
[2015-07-09 20:41:42.562 GMT 25 ERROR 100001 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! INTERNAL FAILURE! Application mini-dump written to file E:\approot\Orleans-MiniDump-2015-07-09-20-41-41-088Z.dmp
[2015-07-09 20:41:42.656 GMT 25 ERROR 100002 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! INTERNAL FAILURE! Process crashing!
.
.

[2015-07-09 20:42:39.491 GMT 18 INFO 100406 OrleansSiloHost ] Setting Deployment Id to XXXXXXXXXXX and data connection string to DefaultEndpointsProtocol=https;AccountName=XXXXXXXXXsilos;AccountKey=<--SNIP-->
[2015-07-09 20:42:39.506 GMT 18 INFO 100403 OrleansSiloHost ] Initializing Silo YYYYYYYYYYYY on host=RD000D3A607B7E CPU count=2 running .NET version='4.0.30319.34014' Is .NET 4.5=True OS version='Microsoft Windows NT 6.2.9200.0'
[2015-07-09 20:42:39.506 GMT 18 INFO 100404 OrleansSiloHost ] Silo running with GC settings: ServerGC=False GCLatencyMode=Interactive
[2015-07-09 20:42:39.506 GMT 18 WARNING 100405 OrleansSiloHost ] Note: Silo not running with ServerGC turned on - recommend checking app config : -- and --
[2015-07-09 20:42:46.387 GMT 18 INFO 100000 MembershipFactory 10.0.0.4:11111] Creating membership oracle for type=AzureTable
[2015-07-09 20:42:46.418 GMT 16 INFO 100801 OrleansSiloInstanceManager 10.0.0.4:11111] Attached to Azure storage table OrleansSiloInstances
[2015-07-09 20:42:46.465 GMT 12 INFO 100836 OrleansSiloInstanceManager 10.0.0.4:11111] Initialized schema for Azure storage table OrleansSiloInstances

silo 1 (10.0.0.6:11111):

[2015-07-09 20:41:11.509 GMT 26 WARNING 101022 Runtime.Messaging.SiloMessageSender.PingSender/1 10.0.0.6:11111] Exception sending message to S10.0.0.4:11111:174157138. Message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #925887: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg) Exc level 0: System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg)
[2015-07-09 20:41:11.509 GMT 27 WARNING 101022 Runtime.Messaging.SiloMessageSender.SystemSender/2 10.0.0.6:11111] Exception sending message to S10.0.0.4:11111:174157138. Message: Request S10.0.0.6:11111:174157512Catalog@S0000000e->S10.0.0.4:11111:174157138DirectoryService@S0000000a #926152: Orleans.Runtime.Orleans.Runtime.IRemoteGrainDirectory:LookUp(). System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg) Exc level 0: System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg)
[2015-07-09 20:41:11.525 GMT 22 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1293 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #925887: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Retry count exceeded. Msg is: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #925887: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping()
[2015-07-09 20:41:34.788 GMT 26 WARNING 101021 Runtime.Messaging.SiloMessageSender.PingSender/1 10.0.0.6:11111] Exception getting a sending socket to endpoint S10.0.0.4:11111:174157138 Exc level 0: System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.0.0.4:11111 at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) at System.Net.Sockets.Socket.Connect(EndPoint remoteEP) at Orleans.Runtime.SocketManager.SendingSocketCreator(IPEndPoint target) at Orleans.Runtime.LRU`2.Get(TKey key) at Orleans.Runtime.Messaging.SiloMessageSender.GetSendingSocket(Message msg, Socket& socket, SiloAddress& targetSilo, String& error)
[2015-07-09 20:41:34.788 GMT 24 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1294 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #926185: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Exception getting a sending socket to endpoint S10.0.0.4:11111:174157138
[2015-07-09 20:41:34.788 GMT 24 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1295 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #926518: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:34.788 GMT) connection failure trying to reach target silo S10.0.0.4:11111:174157138. Going to drop Request msg 926518 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:34.820 GMT 24 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1296 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #926822: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:34.788 GMT) connection failure trying to reach target silo S10.0.0.4:11111:174157138. Going to drop Request msg 926822 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:34.835 GMT 22 INFO 100611 MembershipOracle 10.0.0.6:11111] -Going to mark silo S10.0.0.4:11111:174157138 as DEAD in the table #1. I am the last voter: #freshVotes=0, myVoteIndex = -1, NumVotesForDeathDeclaration=2 , #activeSilos=2, suspect list=[]
[2015-07-09 20:41:34.866 GMT 22 INFO 100611 MembershipOracle 10.0.0.6:11111] -Going to mark silo S10.0.0.4:11111:174157138 as DEAD in the table #1. I am the last voter: #freshVotes=0, myVoteIndex = -1, NumVotesForDeathDeclaration=2 , #activeSilos=2, suspect list=[]
[2015-07-09 20:41:34.898 GMT 24 INFO 103004 VirtualBucketsRingProvider 10.0.0.6:11111] Removed Server S10.0.0.4:11111:174157138/x02A79268. Current view: [S10.0.0.6:11111:174157512 -> <MultiRange: Size=x100000000, %Ring=100.000%>]
[2015-07-09 20:41:34.898 GMT 24 INFO 103005 VirtualBucketsRingProvider 10.0.0.6:11111] -NotifyLocalRangeSubscribers about old <MultiRange: Size=x8C57D355, %Ring=54.822%> new <MultiRange: Size=x100000000, %Ring=100.000%> increased? True
[2015-07-09 20:41:34.913 GMT 21 INFO 100524 Catalog 10.0.0.6:11111] Catalog is deactivating 814 activations due to a failure of silo S10.0.0.4:11111:174157138/x02A79268, since it is a primary directory partiton to these grain ids.
[2015-07-09 20:41:34.929 GMT 20 INFO 602920 ReminderService 10.0.0.6:11111] Log code 102920 occurred 25 additional times in the previous 00:06:12.8133786
[2015-07-09 20:41:34.929 GMT 20 INFO 102934 ReminderService 10.0.0.6:11111] My range changed from <MultiRange: Size=x8C57D355, %Ring=54.822%> to <MultiRange: Size=x100000000, %Ring=100.000%> increased = True
[2015-07-09 20:41:35.007 GMT 24 INFO 100612 MembershipOracle 10.0.0.6:11111] Will watch (actively ping) 0 silos: []
[2015-07-09 20:41:35.007 GMT 24 WARNING 100824 AzureSiloMembershipTable 10.0.0.6:11111] Update failed due to contention on the table. Will retry. Entry SiloAddress=S10.0.0.4:11111:174157138 Status=Dead, eTag W/"datetime'2015-07-09T20%3A39%3A10.7941473Z'", table version = <21, W/"datetime'2015-07-09T17%3A05%3A21.6691694Z'">
[2015-07-09 20:41:35.007 GMT 24 INFO 100629 MembershipOracle 10.0.0.6:11111] -Failed to update S10.0.0.4:11111:174157138 status to Dead in the Membership table, due to write conflicts. Will retry.
[2015-07-09 20:42:53.937 GMT 24 INFO 103003 VirtualBucketsRingProvider 10.0.0.6:11111] Added Server S10.0.0.4:11111:174170562/x9EF3654D. Current view: [S10.0.0.6:11111:174157512 -> <MultiRange: Size=x694D75D8, %Ring=41.134%>, S10.0.0.4:11111:174170562 -> <MultiRange: Size=x96B28A28, %Ring=58.866%>]
[2015-07-09 20:42:53.937 GMT 24 INFO 103005 VirtualBucketsRingProvider 10.0.0.6:11111] -NotifyLocalRangeSubscribers about old <MultiRange: Size=x100000000, %Ring=100.000%> new <MultiRange: Size=x694D75D8, %Ring=41.134%> increased? True
[2015-07-09 20:42:53.937 GMT 24 INFO 100612 MembershipOracle 10.0.0.6:11111] Will watch (actively ping) 1 silos: [S10.0.0.4:11111:174170562]
[2015-07-09 20:42:53.937 GMT 24 INFO 100645 MembershipOracle 10.0.0.6:11111] -ReadAll (called from gossip, after local view changed, with removed duplicate deads) Membership table: 3 silos, 2 are Active, 1 are Dead, Version=<23, W/"datetime'2015-07-09T20%3A42%3A53.4760565Z'">. All silos: [SiloAddress=S10.0.0.4:11111:174157138 Status=Dead, SiloAddress=S10.0.0.4:11111:174170562 Status=Active, SiloAddress=S10.0.0.6:11111:174157512 Status=Active]

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions