Sunday October 23, 2011

[Time] NameMessage
[11:53] jsia im coding using C++
[11:53] jsia can anyone help on how to share the zmq context within 2 classes
[13:28] hwinkel hi
[13:45] mikko jsia: whats the problem?
[13:49] jsia I used the C library for zeromq
[13:50] jsia Im having problems with th einproc socket
[13:50] mikko i tend to use boost::shared_ptr for sharing the context
[13:50] jsia im using ZMQ_ROUTER
[13:50] mikko what is the problem?
[13:50] jsia the problem is during the first send I can receive the message and parse it correctly
[13:50] jsia the recipient needs to reply to the sender it sends the message but the other party can't receive the message
[13:51] mikko ROUTER on both ends?
[13:51] jsia yup
[13:51] jsia I have 2 classes I named it the client and the other one the proxy
[13:51] mikko is the peer id the first part in the reply message?
[13:51] jsia I have the code running in python
[13:51] jsia I am just recoding it into C
[13:51] jsia yup
[13:51] jsia I'm passing the id
[13:52] jsia I send two parts the first is the identity
[13:52] jsia then the next is the real payload
[13:52] mikko if you got router to router connection how do you know the id?
[13:52] mikko do you use explicit identities?
[13:53] jsia yup
[13:53] jsia I pass the class object
[13:53] jsia I have a method in the first class where it can return the id
[13:53] jsia and alsoi the context
[13:53] jsia by the way im using the same context
[13:53] jsia isn't it that inproc communications are like that?
[13:53] jsia that was what I did in python
[13:54] mikko are these one to one communications?
[13:54] mikko or one to many?
[13:54] jsia currenyl one to one
[13:55] mikko can i see the code?
[13:55] jsia its a bit huge I'll try to get the most important snippets
[13:56] mikko jsia: are you setting the identity on both sockets?
[13:57] jsia yup
[13:57] jsia oh
[13:57] jsia wait
[13:57] jsia only atht the proxy
[13:57] mikko i'm not sure about ROUTER <-> ROUTER
[13:57] jsia only one has an identity
[13:57] jsia int linger = 0; this->INTERNAL_SOCKET = zmq_socket(context, ZMQ_ROUTER); identity = "mode-"; zmq_setsockopt(this->INTERNAL_SOCKET, ZMQ_IDENTITY, identity.c_str(), identity.size()); zmq_setsockopt(this->INTERNAL_SOCKET, ZMQ_LINGER, &linger, sizeof(linger)); zmq_bind(this->INTERNAL_SOCKET, "inproc://ipc.proc");
[13:57] mikko you might be better off with DEALER <-> ROUTER
[13:57] jsia that's how I bind the proxy
[13:58] jsia do you think its better?
[13:58] jsia Im trying to buiuld an async req reply
[13:58] mikko it might be depending on the scenario
[13:58] jsia woith reliability
[13:58] jsia so I made a protocol
[13:58] mikko DEALER automatically load-balances to connected peers
[13:58] jsia that the client talks to a proxy registers itself wwaits for acknowledgement
[13:59] mikko and you can use ROUTER to reply to specific DEALER
[13:59] jsia then the proxy talks with another external proxy for the real consumer apps
[13:59] mikko what i've used for brokers in the past
[14:00] mikko producer DEALER < - > ROUTER broker DEALER < - > ROUTER consumer
[14:00] mikko so producer fair queues between multiple brokers
[14:00] mikko which fair queue to multiple consumers
[14:00] mikko which ACK back to the specific broker
[14:00] jsia ic
[14:01] jsia that can also be a nice design
[14:02] jsia does that provide async request reply?
[14:02] mikko i guess you can use that for async reqrep
[14:02] mikko i used it for persistent device
[14:03] jsia ic
[14:03] mikko but i dont see why it wouldnt work for async request reply
[14:04] mikko it's a bit heavier design though
[14:04] jsia its funny I have this wworking in python
[14:04] jsia when I recode it basing on the python code it doesn't work
[14:05] mikko are you making sure that you bind before connect?
[14:05] mikko the inproc endpoint
[14:05] jsia yup
[14:05] jsia I call the proxy initialization first
[14:05] jsia before the client
[14:05] mikko you dont seem to be checking for errors in that code (or its stripped down)
[14:05] mikko is this threaded or single process?
[14:05] jsia ill try to make a striped down version
[14:05] jsia just pass simple messages
[14:05] jsia coz currently Im passing protobuf messages
[14:06] mikko but is the program in multiple threads?
[14:06] jsia currently the proxy runs on one thread
[14:06] jsia and the client runs on another
[14:06] mikko ok
[14:06] jsia those are the only threads that im using
[14:06] mikko and you are not receiving any messages from client?
[14:07] jsia the proxy is receiving the message from the client
[14:07] jsia but when the proxy replies
[14:07] jsia the client is not receiving it
[14:07] mikko are you copying the first message part from client to outgoing message?
[14:07] jsia yup
[14:07] mikko can you that part where the client reply happens
[14:07] jsia ok
[14:08] mikko why are you using C bindings inside C++ ?
[14:08] mikko just out of curiosity
[14:10] jsia U read about C bindings first then I was too lazy to implement the C++ hahaha
[14:10] jsia i read the C++ code, it's also using the C library
[14:10] jsia hahaha
[14:10] mikko yes, it's pretty slim .hpp file
[14:10] jsia I tried it a while ago
[14:11] jsia I got problems with the context
[14:11] mikko what sort of problems?
[14:11] mikko i usually create context in the main () and pass it down from there
[14:12] mikko and tend to create the sockets there as well and inject to different objects
[14:12] mikko
[14:15] jsia
[14:15] jsia I tried that a while ago however its wierd
[14:16] jsia i was initializing the classes on main after i created the context
[14:16] jsia inside the class initialization
[14:16] jsia I bind the socket
[14:16] jsia then after it finishes the initalization the socket gets disconnected
[14:17] mikko does it go out of scope?
[14:18] mikko zmq_getsockopt(this->INTERNAL_SOCKET, ZMQ_IDENTITY, (void *) sidentity, &identity_size);
[14:18] mikko you are getting the identity of the internal socket here
[14:18] mikko but you are sending a message to the client
[14:19] mikko not sure i fully understand this code
[14:19] jsia I send two parts
[14:19] jsia the first is the identity
[14:19] jsia however when I try to execute s_recv
[14:19] jsia I don't get the first message
[14:19] jsia when I get the next message I get the payload
[14:19] jsia that's why I executed two s_recv
[14:20] mikko you need to check for RCVMORE
[14:20] mikko c bindings will return only one part at a time
[14:20] jsia but when I execute two recv will I get both messages?
[14:21] mikko but you dont really know where the boundary is
[14:21] mikko that makes the assumption that client always sends two messages
[14:21] jsia yeah
[14:21] mikko it's better to loop with ZMQ_RCVMORE
[14:21] jsia ok
[14:21] jsia currently i only send two messages
[14:21] mikko
[14:21] jsia ill fix that one
[14:21] mikko something similar to that
[14:21] mikko where do you send the message back to client?
[14:21] mikko on line 41?
[14:22] jsia yup
[14:22] mikko and identity is what you init on line 23?
[14:22] jsia yup
[14:22] mikko that confuses me
[14:23] mikko you are getting the identity of the INTERNAL_SOCKET
[14:23] mikko and use that when sending a message to EXTERNAL_SOKET
[14:23] mikko +C
[14:23] mikko you most likely want to use the identity of the EXTERNAL_SOCKET rather than the internal one
[14:23] mikko otherwise it will be null router
[14:23] mikko routed*
[14:24] mikko or did i miss something?
[14:25] jsia the external socket is a tcp socket
[14:25] jsia that's for my external communication
[14:26] mikko i am very confused
[14:26] jsia Sending regiseter node- as identity Sending localhost-10834-0client_service as content finish sending .... The message id is localhost-10834-0 output message id to be passeed to the get_response localhost-10834-0 Fetching Internal Identity node- Receiving next message part received regitration request Proxy will reply with register_req_ack Sending reply back to the internal socket Sending node-
[14:26] jsia oops
[14:26] jsia
[14:26] jsia here's a sample output when I run the code
[14:28] mikko you are using the identity of the internal socket when you send
[14:28] mikko are you trying to send message to itself?
[14:29] jsia yup
[14:29] jsia im sending it as the first part of the message
[14:30] mikko but this confuses me
[14:30] mikko are you not trying to send a message to a socket that is connected to INTERNAL_SOCKET ?
[14:30] mikko rather than to the INTERNAL_SOCKET itself
[14:31] jsia yup
[14:31] mikko so why do you route the message to the INTERNAL_SOCKET ?
[14:31] jsia Im trying to get the identity that sent me a mesage
[14:31] mikko the first part determines where the message goes to
[14:31] jsia and answer back to it
[14:31] mikko zmq_getsockopt(this->INTERNAL_SOCKET, ZMQ_IDENTITY, (void *) sidentity, &identity_size);
[14:31] mikko this will give you the identity of the current socket
[14:31] mikko not the remote peer
[14:32] jsia I see
[14:33] jsia hmmm I think that's the difference with my python code
[14:33] jsia coz with python I can receive the first part of the message that the client is sending
[14:33] jsia I think I have to implement the recvmore
[14:34] mikko jsia:
[14:34] mikko you need something like that
[14:35] mikko and the first part is the remote identity
[14:36] jsia ok ill give it a try :)
[14:36] jsia thnx for the help :D
[14:36] jsia Ill keep you posted what happened
[14:37] mikko cool
[14:37] mikko ill hit the gym, bbl
[16:40] mbj Is there a zmq buildin that enables workers to "pull request when ready". So the upstream will not push data to dead / inexistend workers?
[17:25] mikko mbj: yes
[17:25] mikko mbj: how much have you read the guide?
[17:30] mbj mikko: I read the guide until a point I asked myself why is mongrel2 using PUSH / PULL for handlers. When a handler was handling a message the peer would send messages to this address even when this handler was removed and will never come back.
[17:31] mbj mikko: Maybe I missunderstood this part.
[17:31] mikko mbj: messages wont be sent to that peer when it goes away
[17:31] mikko the messages are round-robined between connected peers
[17:31] mikko there is a small window before the upstream notices that the downstream has gone away
[17:31] mbj Okay but there could be some messages not jet processed in the handlers mailbox
[17:32] mikko but by design push/pull is fire and forget
[17:32] mikko there are other socket types if ACK-type behaviour is needed
[17:32] mikko also, LIBZMQ-160 might be somewhat related
[17:32] mbj reading
[17:33] mikko but in case ACKs are needed ROUTER/DEALER pattern can be used
[17:33] mbj So I have to refine my initial query
[17:36] mikko some messages can be pushed to the downstream (depends on HWM values) and in case of a crash they might get lost
[17:36] mikko in such a scenario timeout based retry is the best bet
[17:37] mbj mikko: Will do some more doc research, thx for your time and for pointing me on the "connected" part
[17:39] mikko i think the guide hopefully explains this
[17:40] mbj mikko: actualy it does. My problem was the "connected". Since I did not realize zeromq will not reconnect from peer to downstream (downstream had initiated the connection) this question came up.
[17:41] mikko mbj: in zeromq the upstream can also connect to bound downstream
[17:41] mikko :)
[17:41] mikko to confuse things more
[17:42] mikko but if there is no underlying tcp connection to peer messages wont be sent to it
[17:42] mbj mikko: I knew
[17:47] mbj mikko: Im implementing worker driven consuming of jobs. I'll make sure no message is in the workers mailbox when he crashes using REQ/REP at the jobsserver. Used PULL before but ran into the assumption the jobserver will not realize a worker is gone and try to reconnect while throwing messages in the workers socket :)
[17:54] mbj Can I limit a workers mailbox size using HQM on workers PULL socket?
[17:54] mbj s/mailbox size/mailbox capacity/
[17:54] mbj s/HQM/HWM/ (sorry)
[18:49] mikko mbj: you might want to use more explicit ack mechanism
[18:50] mikko like the worker signals back when the work is done
[18:58] mbj mikko: I'll do so, saw the discussions on HWM on stackoverflow
[19:53] mbj Is it possible to receive the current outgoing/incoming mailbox queue length of a socket? (And is mailbox the corret term?)
[20:15] mikko mbj: nope
[20:27] hoppy I have a long-running app that uses PUB with multiple SUB clients, and REP with multiple REQ clients that runs flawlessly for hours and then stops communicating. This is Centos 5. Any suggestions on a debugging methodology, or other ideas?
[20:40] mbj mikko: Something planned in this direction? Would be nice to have a getsockopt option.
[20:41] mikko mbj: there has been some talk
[20:42] mikko mbj: but it's impossible to provide accurate count
[20:42] mikko hoppy: what language?
[20:47] mbj mikko: Because there is one queue for all sockets? Or the queue data structure does not allow to detemine the length in an easy way?
[20:48] mikko no, not that
[20:49] mikko because some messages might be in network buffers on each side
[20:49] mbj mikko: Ahh and the size of this buffer does not tell you if there are 1000 small or one big message
[20:49] mikko indeed
[20:50] mikko tcp buffers for example are just bytes
[20:50] mbj Forgot about them
[20:50] mbj zmqs abstraction is good :)
[20:51] mbj So the tcp buffers are normally larger than zmq mailbox queue...
[20:53] hoppy mikko: C++ as the main app, there are C++ and Python clients both talking to it.
[20:56] mikko hoppy: do both req/rep and pub/sub stop?
[20:56] mikko are they in same thread?
[20:56] hoppy it's single threaded. The req/rep continue fine, but the PUBs quit being heard by the subscribers.
[20:57] mikko hoppy: 2.1?
[20:57] hoppy I have a similar case in another app (again PUB/SUB) with a similar failure rate - but that one is TCP across machines.
[20:58] mikko in both cases is the producer C or C++ and consumer python?
[20:58] mikko oh nm
[20:58] mikko that sounds odd indeed
[20:59] hoppy I've been playing with versions for some time hoping to quash it, right now using 2.1.9 that I built my own rpm for.
[20:59] mikko hoppy: it would be beneficial if you can attach gdb
[20:59] mikko and see where there server is
[20:59] hoppy in the other app python to python. In this one C++ with both clients
[20:59] mikko thread apply all bt
[21:00] mikko where the*
[21:01] mikko hoppy: is it possible that the main app is blocking somewhere outside zeromq?
[21:01] mikko or have you isolated this to be zeromq related?
[21:01] hoppy I don't believe so.
[21:01] mikko backtrace would be helpful
[21:01] hoppy I need to double check that it is stuck now, but this is what's up ATM:
[21:02] hoppy gdb) thread apply all bt Thread 4 (Thread 0x428a1940 (LWP 24268)): #0 0x0000003178cd48a8 in epoll_wait () from /lib64/ #1 0x0000003e45c0f2f8 in zmq_msg_data () from /usr/lib64/ #2 0x0000003e45c22457 in zmq_msg_data () from /usr/lib64/ #3 0x000000317980673d in start_thread () from /lib64/ #4 0x0000003178cd44bd in clone () from /lib64/ Thread 3 (Thread 0x432a2940 (LWP 24270)):
[21:02] hoppy useless pasting that.
[21:02] hoppy send me a quick email and I'll reply with the stacktrace
[21:02] mikko works well for paste
[21:02] hoppy (needs a better irc client)
[21:03] mikko just paste to and provide url to your paste
[21:03] mikko that is probably the easiest way
[21:03] hoppy git://
[21:04] mikko thanks
[21:04] hoppy It didn't like being debugged. I had to restart after that.
[21:05] mikko most likely zmq_poll returning with interrupted
[21:05] mikko if you are using it
[21:05] hoppy agree
[21:05] mikko might be helpful to compile with symbols and without optimisations
[21:05] mikko the current backtrace seems to be optimised and doesn't have symbols
[21:06] hoppy k.
[21:06] hoppy I think I will try to make a smaller testcase and see if I can reproduce.
[21:06] mikko that would be useful
[21:06] hoppy this one has 15-20G/day going through it, and it is ugly to knock it over.
[21:06] mikko if you can isolate a test case open an issue in jira
[21:07] mikko the gdb trace only helps if you take it when the app starts blocking
[21:07] hoppy will try.
[21:07] mikko i assume the issue is on server
[21:07] hoppy agree
[21:07] mikko any possiblity of a flaky network or so?
[21:07] mikko have you tried reconnecting a consumer after this happens?
[21:07] hoppy should all be tcp inside the same box.
[21:08] mikko ok, that excludes network pretty much
[21:08] mikko reproduce case would be good
[21:08] mikko and if that doesn't work a backtrace when things grind to halt might help as well
[21:08] mbj does zmq_recv with ZMQ_NOBLOCK guarantee the whole message is readable not only a part? (IMHO it does, but it is better to ask once rather than building a system around an incorrect assumption)
[21:09] hoppy this thing is basically doing req/rep against requestors who want to poke something into a database, poking it in, and then notifying everybody on a pub/sub that it went it.and every time s
[21:09] mikko mbj: yes, if you can receive a part you can receive rest
[21:09] mbj mikko: thx
[21:09] hoppy I can always still see things going in the database, but can't hear the pub/sub anymore
[21:10] mikko hoppy: so req rep keeps going but pub sub stops?
[21:10] mikko ok
[21:10] hoppy correct
[21:10] cremes hoppy: and netstat shows the ports still open/connected ?
[21:10] hoppy haven't checked that.
[21:11] mikko that would be helpful as well
[21:11] hoppy I'm basically trying to get an isolation procedure.
[21:11] hoppy life would be simpler if I could get it to fail in a crucible, we'll see how lucky I am with that.
[21:12] hoppy concerned because it is a very long time between issues.
[21:12] cremes hoppy: i typically run PUB/SUB for 5 days at a time and push about 10G a day through the PUB
[21:12] cremes without any hangs or weirdness
[21:13] cremes and all socket events are handled via zmq_poll & friends
[21:13] cremes so the same code paths are likely being exercised
[21:13] mikko cremes: this is on linux as well?
[21:14] cremes osx + linux
[21:14] cremes hoppy: any chance zmq_close() is getting called on the pub socket?
[21:14] cremes by default the socket will linger indefinitely
[21:14] hoppy I doubt it.
[21:15] cremes you might want to set LINGER to 0 and see if the thread goes away which could indicate it got closed
[21:15] cremes ok
[21:15] cremes another idea...
[21:15] cremes since you are doing zmq_poll, do you have any "churn" on the poll item list that you hand to that function?
[21:15] cremes that is, do you add and/or delete a lot of sockets from that list?
[21:15] cremes if so, it's possible you are leaving the PUB socket off the list and it never gets polled again
[21:16] cremes (i did that once :)
[21:16] mikko you shouldn't need to poll pub
[21:16] mikko as the messages are dropped if there are no peers
[21:18] cremes er, sorry... the SUB socket that is supposed to be receiving data from the PUB
[21:19] mikko i understood that in this case there are many consumers
[21:19] mikko which all stop receiving the pubs