Mesos master crashes with ZooKeeper cluster

I am deploying a 3-node ZooKeeper cluster to provide high availability for my Mesos masters. I downloaded the zookeeper-3.4.6.tar.gz archive, unpacked it into /opt, renamed it to /opt/zookeeper, went into the directory, edited conf/zoo.cfg (pasted below), created the myid file in dataDir (which is set to /var/lib/zookeeper in zoo.cfg), and started ZooKeeper with ./bin/zkServer.sh start, and everything went well. I started all 3 nodes one by one and they all look good. I can connect to the servers with ./bin/zkCli.sh with no problem.
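For reference, the steps above correspond roughly to these commands on each node (version and paths as in the question; the zoo.cfg contents are shown further down):

```shell
# Unpack ZooKeeper 3.4.6 into /opt and rename the directory.
tar -xzf zookeeper-3.4.6.tar.gz -C /opt
mv /opt/zookeeper-3.4.6 /opt/zookeeper
cd /opt/zookeeper

# Create dataDir and the myid file; the number MUST match this
# node's server.N line in conf/zoo.cfg (1 on node-01, 2 on node-02,
# 3 on node-03).
mkdir -p /var/lib/zookeeper
echo 1 > /var/lib/zookeeper/myid

# Start the server and check connectivity with the CLI client.
./bin/zkServer.sh start
./bin/zkCli.sh -server 127.0.0.1:2181
```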

But when I start Mesos (3 masters and 3 slaves; each node runs one master and one slave), the masters soon crash one after another, and on the web page http://mesos_master:5050 the Slaves tab shows no slaves. When I run only a single ZooKeeper node, everything is fine. So I think this is a ZooKeeper cluster problem.

I have 3 PV hosts on my Ubuntu server; they all run Ubuntu 14.04 LTS: node-01, node-02, node-03. All three nodes have this in /etc/hosts:

 172.16.2.70 node-01
 172.16.2.81 node-02
 172.16.2.80 node-03

I installed ZooKeeper and Mesos on all three nodes. The ZooKeeper configuration file looks like this (on all three nodes):

 tickTime=2000
 dataDir=/var/lib/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.1=node-01:2888:3888
 server.2=node-02:2888:3888
 server.3=node-03:2888:3888

They all start normally and work well. Then I start the mesos-master service with the command line ./bin/mesos-master.sh --zk=zk://172.16.2.70:2181,172.16.2.81:2181,172.16.2.80:2181/mesos --work_dir=/var/lib/mesos --quorum=2, and after a few seconds it gives me errors like this:

 F0817 15:09:19.995256  2250 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
 *** Check failure stack trace: ***
     @     0x7fa2b8be71a2  google::LogMessage::Fail()
     @     0x7fa2b8be70ee  google::LogMessage::SendToLog()
     @     0x7fa2b8be6af0  google::LogMessage::Flush()
     @     0x7fa2b8be9a04  google::LogMessageFatal::~LogMessageFatal()
     @     0x7fa2b81a899a  mesos::internal::master::fail()
     @     0x7fa2b8262f8f  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
     @     0x7fa2b823fba7  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
     @     0x7fa2b820f9f3  _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
     @     0x7fa2b826305c  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
     @           0x4a44e7  std::function<>::operator()()
     @           0x49f3a7  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
     @           0x499480  process::Future<>::fail()
     @     0x7fa2b806b4b4  process::Promise<>::fail()
     @     0x7fa2b826011b  process::internal::thenf<>()
     @     0x7fa2b82a0757  _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
     @     0x7fa2b82962d9  std::_Bind<>::operator()<>()
     @     0x7fa2b827ee89  std::_Function_handler<>::_M_invoke()
 I0817 15:09:20.098639  2248 http.cpp:283] HTTP GET for /master/state.json from 172.16.2.84:54542 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
     @     0x7fa2b8296507  std::function<>::operator()()
     @     0x7fa2b827efaf  _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
     @     0x7fa2b82a07fe  _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
     @     0x7fa2b8296507  std::function<>::operator()()
     @     0x7fa2b82e4419  process::internal::run<>()
     @     0x7fa2b82da22a  process::Future<>::fail()
     @     0x7fa2b83136b5  std::_Mem_fn<>::operator()<>()
     @     0x7fa2b830efdf  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
     @     0x7fa2b8307d7f  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
     @     0x7fa2b82fe431  _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
     @     0x7fa2b830f065  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
     @           0x4a44e7  std::function<>::operator()()
     @           0x49f3a7  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
     @     0x7fa2b82da202  process::Future<>::fail()
     @     0x7fa2b82d2d82  process::Promise<>::fail()
 Aborted

Sometimes there is a warning like this first, and then it crashes with the same output as above:

 0817 15:09:49.745750 2104 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying 

I want to know whether ZooKeeper can be deployed and work well in my case, and how I can find where the problem is. Any answers and suggestions are welcome. Thanks.

+5
4 answers

In fact, in my case it was because I had not opened port 5050 in the firewall, so the three servers could not communicate with each other. After updating the firewall rules, it started working as expected.
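As a sketch of such rules, assuming plain iptables on Ubuntu 14.04 and the 172.16.2.0/24 subnet from the question (adjust to whatever firewall tooling you actually use, e.g. ufw):

```shell
# Allow the Mesos master port between the cluster nodes.
iptables -A INPUT -p tcp -s 172.16.2.0/24 --dport 5050 -j ACCEPT
# The ZooKeeper client, peer, and election ports are needed too.
iptables -A INPUT -p tcp -s 172.16.2.0/24 \
  -m multiport --dports 2181,2888,3888 -j ACCEPT
```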

+1

I ran into the same problem. I tried different methods and different options, and finally the --ip option worked for me. I had originally used the --hostname option:

 mesos-master --ip=192.168.0.13 --quorum=2 --zk=zk://m1:2181,m2:2181,m3:2181/mesos --work_dir=/opt/mm1 --log_dir=/opt/mm1/logs 
+1

You need to check that all Mesos/ZooKeeper hosts can communicate correctly. For that you need:

  • ZooKeeper ports open: TCP 2181, 2888, 3888
  • Mesos port open: TCP 5050
  • ping working (ICMP echo reply and echo request, types 0 and 8)

If you use FQDNs instead of IPs in your configuration, make sure that DNS resolution also works correctly.
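A quick way to verify both the ports and the name resolution from each node is a small shell loop; hostnames and ports below are taken from the configuration in the question, so adjust them to your setup:

```shell
#!/bin/bash
# Probe each ZooKeeper/Mesos port on every node using bash's /dev/tcp.
# A failed probe means the port is closed or filtered, or the host is
# unreachable/unresolvable from this machine.
for host in node-01 node-02 node-03; do
  for port in 2181 2888 3888 5050; do
    if timeout 1 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
    fi
  done
done
```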

0

Give each Mesos master its own work_dir; do not point all masters at one shared work_dir, because each master keeps its own local state there for the ZooKeeper-coordinated recovery.
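A sketch of what that looks like, reusing the command line from the question (the per-master directory names are illustrative):

```shell
# On node-01: a work_dir local to this master, not a shared mount.
./bin/mesos-master.sh \
  --zk=zk://172.16.2.70:2181,172.16.2.81:2181,172.16.2.80:2181/mesos \
  --quorum=2 --work_dir=/var/lib/mesos/master-1

# On node-02 use --work_dir=/var/lib/mesos/master-2, and on node-03
# use --work_dir=/var/lib/mesos/master-3.
```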

-1
