IRC Log

Wednesday November 2, 2011

[Time] Name Message
[10:06] g4bor hi, are there any known issues with using (py)zeromq in an application that uses process-forking?
[10:14] mikko g4bor: possibly
[10:15] g4bor mikko: ok, i see that my question was a little too generic :-) ... could you name some of those issues?
[10:16] g4bor mikko: basically i'm trying to find out why a problem is happening (http://lists.zeromq.org/pipermail/zeromq-dev/2011-October/013781.html)
[10:19] g4bor mikko: and i'm using it in a forking python web-app (based on gunicorn (http://gunicorn.org/)), so i thought, maybe there are some known problems in such a situation...
[10:28] sustrik the kqueue problem?
[10:28] sustrik g4bor: it has nothing to do with forking
[10:29] g4bor sustrik_: yes, that one. except it morphed into the 'mailbox' problem when moving from 2.1.7 to 2.1.10
[10:29] g4bor sustrik_: nothing to do with forking.. i see...
[10:29] sustrik ok
[10:29] sustrik well, mailbox may have to do with forking, dunno
[10:29] sustrik we need a reproducible minimal test case to fix it
[10:30] g4bor sustrik_: i completely agree about the test-case... i just haven't found the time to do it yet :-)
[10:31] sustrik sure, np
[10:31] g4bor and IF the problem happens, it is ALWAYS the same situation in the application... in short, i wonder whether maybe i'm sending "incorrect" data into zeromq somehow
[10:34] sustrik g4bor: no, it looks like an internal 0mq problem
[10:34] sustrik of course, unless you are using the socket from multiple threads in parallel
[10:46] g4bor sustrik_: hmm... what if i open a PUSH socket, then i fork, and use the socket from both processes? is that fine or is that insane? :-) (i am not doing it, but maybe i have a bug and it is happening)
[10:50] mikko g4bor: i dont think thats ok
[10:53] g4bor mikko: neither do i... i just hope i get an error message like 'gabor, you are doing this wrong.' or something like that .... anyway, will investigate this issue
[10:53] sustrik it's definitely not OK
[10:54] g4bor is there an easy way to ask a PUSH socket for some id-number? i mean, if i want to check whether 2 sockets in 2 processes are the same zeromq socket, is there an easy way to do that?
[10:54] g4bor i'm using the python-bindings...
[10:57] g4bor hmm... maybe comparing the pointer that's returned from zmq_socket will be enough.. will see
[11:02] sustrik 0mq sockets are user-space objects
[11:02] sustrik they can't be shared between processes
[11:10] CIA-79 jzmq: Jason Chown master * r2303993 src/org/zeromq/ZMQ.java : Clarified comments in timeout parameter ...
[11:10] CIA-79 jzmq: Gonzalo Diethelm master * rf9b0964 src/org/zeromq/ZMQ.java : Merge pull request #84 from jchown/master ...
[11:22] g4bor sustrik_: maybe i'm onto something. i did some test-scripts, and IF i do the context-creation and socket-creation in one process, and then call connect in a forked process, i get the mailbox assertion... will investigate this more...
[11:24] sustrik g4bor: are you using the forked context/socket?
[11:24] sustrik instead of opening a new one in the forked process?
[11:25] mato sustrik_: what's this? someone trying to use contexts across fork()? that will never work...
[11:26] sustrik dunno
[11:26] sustrik that's why i'm asking
[11:30] g4bor sustrik_: yes, i am not trying to do it.. i have an app that does use zeromq, and uses process-forking.. i THOUGHT that i'm initializing zeromq AFTER the fork, but now i wonder whether i have a bug and am doing it BEFORE the fork... still investigating ...
[11:31] mato g4bor: which version of zeromq are you using? i'm not sure if 2.x has the SOCK_CLOEXEC stuff in it and without that forking will break badly
[12:27] g4bor mato: 2.1.10... it's totally fine if zeromq breaks with forking. the plan is that zeromq is not used 'cross-fork'. i just have to find the bug in my code that causes this behavior (assuming it is the case. i have no proof yet, only suspicion)
[12:35] mato g4bor: ok, looking at the git log, the fix for CLOEXEC (issue 273) went in after the release of 2.1.10
[12:35] mato g4bor: without that it really won't work for you at all if you so much as create a single 0mq socket and then fork()
[12:36] mato g4bor: so i suggest, try with the zeromq2-1 git master (latest 2.1.x) first
[12:36] mato whoops, hang on
[12:36] mato LIBZMQ-218 was the original issue
[12:36] mato -273 is something else
[12:37] mato ok, and the LIBZMQ-218 fix is in 2.1.10
[12:42] g4bor cool, that might explain why i am getting a different error message in 2.1.7 and in 2.1.10 :-)
[12:42] g4bor but that's fine. in my case, after the fork, the "parent" is not supposed to ever touch zeromq, only the "child"... the problem is that the "parent" is doing it. when i fix this issue, it (hopefully) will solve the problem.
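A minimal sketch of the pattern this thread converges on — each process creates its own context and sockets strictly after fork(), and the parent never touches 0mq (pyzmq shown, since g4bor is on the python bindings; the endpoint and payload are illustrative):

    import os
    import zmq

    def child_work():
        # Everything 0mq-related is created *after* the fork, in the child only.
        ctx = zmq.Context()
        push = ctx.socket(zmq.PUSH)
        push.connect("tcp://127.0.0.1:5555")  # illustrative endpoint
        push.send(b"hello from child")
        push.close()
        ctx.term()

    pid = os.fork()
    if pid == 0:
        child_work()
        os._exit(0)
    else:
        # Parent: no contexts or sockets created before the fork, none used here.
        os.waitpid(pid, 0)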
[18:57] cremes sustrik: :) for your ML reply
[19:06] sustrik you mean the primary-school-like one?
[19:06] sustrik :)
[19:13] technoweenie hey is there a way to do a safe close on a socket? i want to stop the socket from receiving any more messages, but i want to process any that have come in on the socket but not yet been returned from a lib's #recv call
[19:16] sustrik technoweenie: no, there's not, but there has been some discussion about that kind of functionality
[19:17] technoweenie i have a little project that just takes crap off a zmq socket into some local queue, but there's still a minute chance of dropping a message if i have to restart something
[19:18] sustrik yes, that's the reasoning for adding that kind of thing
[19:18] technoweenie cool i'm so on board with that :)
[19:19] sustrik it's not as simple as it seems though
[19:19] sustrik it requires some kind of handshake between the peers prior to terminating
[19:19] sustrik but what if the handshake cannot be completed?
[19:20] sustrik you can time out, but then you can lose a message anyway
[19:20] sustrik etc.
[19:22] technoweenie ah
[19:44] technoweenie sustrik_: is there an issue or something for this? or is everything just on the ML
[19:46] sustrik technoweenie: it's only on the ML, but it intersects with the LIBZMQ-160 issue
[19:47] sustrik https://zeromq.jira.com/browse/LIBZMQ-160
[19:47] technoweenie thanks
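There is no built-in "safe close", but a common approximation of what technoweenie wants is to stop accepting new work and drain whatever 0mq has already queued locally, polling with a timeout until the socket goes quiet. A pyzmq sketch — not a guaranteed no-loss protocol, since messages still in flight on the wire can be dropped, which is exactly the handshake problem sustrik describes; handle() is a hypothetical application callback:

    import zmq

    def drain_and_close(sock, quiet_ms=500):
        """Read already-delivered messages until the socket is quiet, then close."""
        poller = zmq.Poller()
        poller.register(sock, zmq.POLLIN)
        while True:
            events = dict(poller.poll(quiet_ms))
            if sock not in events:
                break  # nothing arrived for quiet_ms; assume the local queue is drained
            msg = sock.recv()
            handle(msg)  # hypothetical application handler
        sock.close()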
[20:01] tarcieri cremes: you around?
[20:02] cremes tarcieri: yessir
[20:03] tarcieri cremes: is there a way I can get a ZMQ::Poller to actually monitor a file descriptor?
[20:03] tarcieri I was trying to do this: https://github.com/tarcieri/dcell/blob/master/lib/celluloid/zmq/reactor.rb#L15
[20:03] tarcieri it didn't work
[20:04] cremes tarcieri: ah yes... i put some basic framework in there for watching file descriptors but it isn't tested alas
[20:04] cremes let me look at it real quickly...
[20:06] cremes tarcieri: what is that call to @poller.register() returning? are you getting false or an integer?
[20:06] tarcieri -1 iirc
[20:06] cremes hmmm, i don't think that's possible
[20:06] tarcieri I might be wrong, let me double check
[20:11] cremes tarcieri: looks like registering will work, but a later call to retrieve readables/writables won't pick up that poll_item
[20:11] cremes the code i wrote assumes it is always a socket and never a file descriptor :(
[20:13] tarcieri hmm, it's returning 1 on YARV
[20:13] cremes right, if registration works you'll get a fixnum back
[20:13] tarcieri and on JRuby
[20:13] tarcieri wtf o_O
[20:13] tarcieri okay so it works
[20:13] tarcieri I swear it wasn't working before :D
[20:13] cremes it's the later call to #poll which in turn updates two arrays (readables/writables) that doesn't work
[20:14] cremes or rather, it only works for sockets and not for file descriptors
[20:14] tarcieri well I swear I couldn't even get as far as registering it before
[20:14] cremes let me think about how to solve that so both are treated equally
[20:14] tarcieri cool
[20:14] tarcieri I'd be fine if you just dropped an integer into readables/writables
[20:15] cremes ok
[20:17] tarcieri maybe it broke when I actually tried to poll
[20:17] tarcieri I'm trying to remember
[20:21] tarcieri well, anyway if you got that working it'd be nice. until then I'm going with a slightly more ghetto implementation that only blocks for 100ms at a time
[20:24] cremes tarcieri: i just pushed what i hope is a fix; please pull from master and build & install a new gem
[20:24] tarcieri ok
[20:25] cremes if the library blows up when you run your code, gist me the error :)
[20:25] tarcieri lol
[20:25] cremes hey, you get what you pay for!
[20:30] tarcieri I get -1 out of the call to poll
[20:32] tarcieri any way to get a more descriptive error message?
[20:32] tarcieri like grab errno or whatever
[20:33] cremes tarcieri: you can call ZMQ::Util.errno and ZMQ::Util.error_string to get the errno and the english text for it
[20:33] tarcieri aah, cool
[20:35] tarcieri cremes: Socket operation on non-socket
[20:35] tarcieri so, the file descriptor I'm registering isn't a socket
[20:35] tarcieri it's a pipe
[20:35] tarcieri problem?
[20:36] cremes tarcieri: that's a question for the 0mq core guys
[20:36] cremes i think it should "just work"
[20:36] tarcieri I could switch to a socketpair
[20:37] cremes sure
[20:37] tarcieri if that would actually help
[20:38] cremes a socketpair is a socket, no? i think the fd should work too though.
[20:38] cremes tarcieri: you are probably venturing into an area where not many devs have exercised the code in libzmq
[20:39] tarcieri o_O heh
[20:39] cremes typically folks go the *other way* where they get a 0mq socket file descriptor and register it with poll/select/kqueue
[20:39] tarcieri nor jruby
[20:39] tarcieri > sockets = Socket.pair(Socket::AF_UNIX, Socket::SOCK_STREAM, 0)
[20:39] tarcieri NoMethodError: undefined method `pair' for Socket:Class
[20:39] tarcieri cremes: perhaps I'll just make an inproc pair socket and be done with it
[20:40] tarcieri like 0MQ socket
[20:40] cremes i don't know what you're trying to do, but sure :)
[20:40] tarcieri I'm trying to adapt some of the code for multiplexing I/O inside of actors to 0MQ
[20:41] tarcieri the original purpose was to wake up an actor making a blocking syscall
[20:41] tarcieri I think the POLS here is to just go 100% 0MQ
[20:41] grantr tarcieri, so you're thinking of using a pair socket?
[20:41] tarcieri yes
[20:42] grantr seems like the way to go
[20:42] tarcieri and poll that alongside a PULL socket
[20:42] tarcieri which is how the mailbox actually receives messages from the network
[20:42] cremes tarcieri: i like it
[20:43] tarcieri so, problem
[20:43] tarcieri aren't sockets thread-specific?
[20:43] grantr so does each node have a pull socket, or each actor? is there one actor that dispatches all messages to node-local actors?
[20:43] grantr tarcieri, they are not thread-safe
[20:43] cremes tarcieri: yes, one socket per thread, pls
[20:43] tarcieri grantr: each node has a single pull socket
[20:43] cremes contexts are thread safe but sockets ain't
[20:43] tarcieri the part I'm trying to replace is how other in-process actors would wake up the mailbox
[20:44] tarcieri the mailbox can get messages one of two places: from the network, or in process
[20:44] cremes tarcieri: you can have a single socket bind to multiple transports
[20:44] tarcieri I guess what I haven't mentioned: the node mailbox *is* an actor
[20:44] cremes e.g. socket.bind("inproc://mailbox")
[20:44] cremes and socket.bind("tcp://127.0.0.1:5555")
[20:44] cremes connect works the same way too
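A sketch of the layout cremes is describing — one PULL socket bound to both an inproc and a tcp endpoint, so in-process and network senders feed the same queue (pyzmq for illustration, since the Ruby API mirrors it; endpoints are illustrative):

    import zmq

    ctx = zmq.Context()

    # One PULL socket, bound to two transports at once.
    mailbox = ctx.socket(zmq.PULL)
    mailbox.bind("inproc://mailbox")      # for in-process senders (bind before inproc connects)
    mailbox.bind("tcp://127.0.0.1:5555")  # for remote nodes

    # Local senders connect over inproc (and must share the same context);
    # remote senders connect over tcp; both deliver to the same socket.
    local = ctx.socket(zmq.PUSH)
    local.connect("inproc://mailbox")
    local.send(b"wake up")

    print(mailbox.recv())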
[20:45] tarcieri cremes: I suppose I could create a thread-specific PAIR socket and tear it down each time you send a message to one of the 0MQ actors
[20:45] tarcieri seems bad?
[20:45] cremes yes, you don't want to close/open sockets continuously
[20:45] tarcieri oi
[20:45] cremes why would you need to tear it down?
[20:45] tarcieri le sigh, how to explain this
[20:46] cremes btw, that multiple bind/connect trick works with all sockets *except* pair
[20:46] cremes :)
[20:46] tarcieri other objects in the system need to talk to the mailbox object
[20:46] tarcieri that mailbox is going to be blocking in the ZMQ::Poller, waiting for incoming messages from the network
[20:46] tarcieri but if something inproc sends it a control message, it needs to wake up too
[20:47] tarcieri and those messages can come from N threads
[20:47] tarcieri making N sockets to do that seems bad to me
[20:47] cremes ok, i get it
[20:47] cremes so the mailbox has a single PULL socket
[20:47] tarcieri yeah
[20:47] cremes and the N threads have PUSH sockets, yes?
[20:48] tarcieri so, that was the goal
[20:48] cremes this is fine...
[20:48] tarcieri but where you have N nodes on the network with PUSH sockets
[20:48] tarcieri but, ungh
[20:48] cremes the mailbox PULL socket should *bind* twice, once to inproc and a second time to a well-known port
[20:48] cremes having N PUSH sockets is *fine*
[20:48] tarcieri this totally doesn't need N push sockets for the inproc case
[20:48] cremes i have a distributed app that typically creates 20_000 sockets all running at the same time
[20:49] tarcieri it works today with one pipe
[20:49] tarcieri and the pipe is only there to unblock the syscall
[20:49] tarcieri it's just sending a single event, "wake up and check your incoming messages"
[20:49] cremes N push sockets will use up about (50 bytes + 2 file descriptors) * N
[20:50] tarcieri 2 file descriptors * N is really bad
[20:50] tarcieri 2 file descriptors alone is really bad
[20:50] cremes heh
[20:50] tarcieri it currently uses 2 file descriptors
[20:50] cremes maybe you could post this to the ML and we could get some input from other 0mq experts
[20:50] tarcieri I just need a thread safe way to unblock the poller
[20:51] cremes who can send this message to unblock the poller? any of the N threads?
[20:51] tarcieri yes
[20:51] cremes ok... an idea
[20:51] cremes you *can* use a 0mq socket from multiple threads if you surround it with a mutex
[20:51] cremes maybe for this case that would be ok
[20:52] cremes @mutex.synchronize { push.send(wakeup) }
[20:52] tarcieri oh, cool
[20:52] tarcieri seems good
[20:52] cremes that's a choice, N sockets is a choice, and using something other than 0mq is a choice :)
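The mutex idea in pyzmq terms — one PUSH socket shared by all threads, every send serialized through a lock so no two threads ever touch the socket concurrently. A sketch; as the pair.cpp assertion later in the log shows, the lock must cover *every* use of the socket, and the inproc endpoint assumes the PULL side bound it first:

    import threading
    import zmq

    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.connect("inproc://mailbox")  # assumes the mailbox PULL socket bound this first

    send_lock = threading.Lock()

    def wakeup():
        # The lock provides both mutual exclusion and the full memory
        # barrier the socket needs when hopping between threads.
        with send_lock:
            push.send(b"wakeup")

    for _ in range(4):
        threading.Thread(target=wakeup).start()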
[20:52] tarcieri I'm kind of fighting the grain of 0MQ by the fact that Celluloid already has an in-process messaging system
[20:53] tarcieri and 0MQ would really like to be the in-process messaging system
[20:53] tarcieri I'm just trying to marry Celluloid's existing system to 0MQ in a reasonably sane manner
[20:53] cremes right
[20:54] cremes well, do the simplest thing possible to get it working and go from there... that's usually my plan
[20:54] tarcieri yeah, already did that
[20:54] tarcieri and did the wonky poll for 100ms crap
[20:54] cremes ideally we could get that file descriptor thing working
[20:54] tarcieri poll zmq, poll the socket
[20:54] cremes tarcieri: you could get the zmq socket fds and poll those from the system poll you know
[20:54] tarcieri orly
[20:55] tarcieri well Celluloid can already do that with Celluloid::IO
[20:55] tarcieri that'd be perfect
[20:55] cremes look at the man page for zmq_getsockopt() and ZMQ_FD
[20:55] tarcieri err
[20:55] tarcieri I'd need to make an IO object I can hand to select :/
[20:55] cremes ah yes, ruby land
[20:55] cremes there be dragons as far as i'm concerned :)
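Going "the other way", as cremes suggests: zmq_getsockopt(ZMQ_FD) yields a plain file descriptor you can hand to select/poll/kqueue. The catch is that the descriptor is edge-triggered, so you must check ZMQ_EVENTS and drain everything pending before waiting again. A pyzmq sketch with an illustrative endpoint:

    import select
    import zmq

    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.bind("tcp://127.0.0.1:5555")

    fd = pull.getsockopt(zmq.FD)  # OS-level descriptor for the socket's signal mailbox

    while True:
        # Edge-triggered: drain everything that is ready *before* blocking,
        # otherwise the fd may never fire again for already-queued messages.
        while pull.getsockopt(zmq.EVENTS) & zmq.POLLIN:
            print(pull.recv())
        select.select([fd], [], [])  # wait here alongside any other fds you care about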
[20:56] tarcieri I think I'll just try a pair socket + a mutex
[20:56] tarcieri seems good
[20:56] cremes ok
[20:56] cremes well, you can avoid creating the second PAIR socket if you just share a PUSH via the mutex
[20:56] cremes have the mailbox PULL socket bind to inproc and ipc (or tcp)
[20:56] cremes and then share a PUSH socket amongst the threads
[20:57] tarcieri ok
[20:57] cremes then you don't need PAIR at all (which may disappear from future iterations of the library)
[20:57] cremes when i get back from my trip, i'll take a peek at the work necessary to wrap a 0mq FD as a ruby io object
[20:58] tarcieri cool
[21:00] bb Is there any way I can see the number of queued items? Let's say I have a Push/Pull architecture and my Pull worker died, so messages are building up..
[21:00] mikko bb: nope
[21:01] mikko bb: you can use high watermark to protect yourself from this scenario
[21:01] bb Ah thanks.
[21:01] mikko and log when high watermark has been reached
[21:01] mikko but currently there is no functionality for seeing queue size
[21:01] mikko this is under debate
[21:02] bb ok I'll look into that
[21:06] cremes tarcieri: i just took a quick peek at wrapping a 0mq socket up as a Ruby IO object
[21:06] cremes tarcieri: looks pretty easy
[21:07] cremes though first looks are often deceptive :)
[21:08] tarcieri heh, cool
[21:12] bb Something I didn't quite get - So if I don't set ZMQ_HWM (by default it's 0...), messages will keep queuing up until I run out of memory?
[21:17] cremes bb: read the man page for zmq_socket; it describes the behavior of each socket when HWM is set
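On the 2.x series being discussed, the knob is a single ZMQ_HWM set before connect/bind: once a PUSH socket hits it, further sends block (or fail with EAGAIN in non-blocking mode) instead of queueing without bound. A pyzmq sketch with an illustrative endpoint and limit — note that libzmq 3.x and later split this option into ZMQ_SNDHWM and ZMQ_RCVHWM:

    import zmq

    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.setsockopt(zmq.HWM, 1000)  # 2.x: cap at 1000 queued messages (0 = unlimited)
    push.connect("tcp://127.0.0.1:5555")

    try:
        push.send(b"work", zmq.NOBLOCK)
    except zmq.ZMQError as e:
        if e.errno == zmq.EAGAIN:
            # High watermark reached and no peer draining: log it, as mikko suggests.
            print("HWM reached; message not queued")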
[21:50] tarcieri cremes: lol, I don't think pairs work anyway
[21:51] tarcieri Assertion failed: !inpipe && !outpipe (pair.cpp:49)
[21:52] mikko tarcieri: is that easily reproducible?
[21:52] tarcieri yes
[21:52] tarcieri 100% of the time even
[21:53] mikko are you using the socket from multiple threads?
[21:53] tarcieri yes, albeit with a mutex
[21:53] tarcieri cremes said it was ok
[21:53] tarcieri apparently not?
[21:54] mikko well, it looks like you are either accessing the socket concurrently or there hasn't been a full memory barrier
[21:55] mikko if you are for example polling in one thread and reading in another you need to make sure that both these use mutual exclusion
[21:55] tarcieri I made two pair sockets
[21:55] tarcieri a "sender" which is shared among threads
[21:55] tarcieri that's wrapped in a mutex
[21:55] tarcieri and the receiver
[21:55] tarcieri that's exclusive
[21:56] mikko why pair?
[21:56] mikko is there something preventing you from using socket per thread?
[21:56] mikko usually simplifies the code a lot
[21:56] tarcieri heh, scroll up
[21:57] tarcieri I can try rewriting it so the in process and network messages are handled by a single PULL socket
[21:57] tarcieri that'd be quite a significant change from what I have now though
[21:57] cremes tarcieri: why is it a rewrite to use a second shared PUSH socket for in process?
[21:58] cremes i don't see why you need to use pair...?
[21:58] mikko tarcieri: you can bind the socket multiple times
[21:58] tarcieri that's not the issue
[21:58] tarcieri I'm trying to replace the pipe which was previously used to unblock the event loop
[21:58] mikko tarcieri: i would bet that the assertion happens due to concurrent access
[21:59] mikko Assertion failed: !inpipe && !outpipe (pair.cpp:49) this one
[21:59] tarcieri yeah
[21:59] mikko unless we horribly broke pair sockets recently
[21:59] tarcieri wish I knew which socket was implicated
[21:59] mikko which i think is less likely
[21:59] mikko tarcieri: this is the pain you usually want to get away from by using inproc sockets and not sharing between threads
[22:00] cremes tarcieri: why share a pair amongst threads instead of push?
[22:00] tarcieri cremes: it's... a significant change from how the code works now
[22:00] tarcieri it can be done
[22:00] tarcieri it would involve ripping out what I have working now and starting over
[22:01] cremes i don't follow... you are able to use a PAIR without the rewrite, yes?
[22:01] cremes why can't you swap push for pair?
[22:01] tarcieri I was basing a lot of this code on how Celluloid handles multiplexing IO with the actor mailbox
[22:01] tarcieri I... kinda can I guess?
[22:01] cremes i don't see why not
[22:01] tarcieri it'd be pretty wonky
[22:02] cremes can you point me to the code?
[22:02] tarcieri it's sitting in my working copy
[22:02] tarcieri there's a lot of... concerns that are cleanly separated now
[22:02] cremes ok
[22:03] tarcieri to mash it together would involve coupling a bunch of components that are cleanly separated at the moment
[22:03] cremes try this... replace the shared pair socket with a push, and replace the single pair socket with a pull
[22:03] cremes and see if it works
[22:03] cremes don't rewrite anything else... just change those two socket types
[22:03] tarcieri Celluloid lets you have duck types roaming around that talk the same method protocol
[22:04] tarcieri ok
[22:04] cremes you only want to use this to wake up the mailbox... you aren't actually passing data so it shouldn't matter
[22:04] tarcieri it will almost certainly still be broken if concurrent access is the issue
[22:04] cremes true
[22:04] cremes but a ruby mutex should be executing a full memory barrier which is all that's necessary
[22:05] cremes shit... i gotta run for about an hour
[22:05] cremes post your issues here and i'll get back to them in a bit
[22:05] tarcieri okay, now it looks like this:
[22:05] tarcieri SENDING SPIKE...
[22:05] tarcieri Assertion failed: inpipe_ && !outpipe_ (pull.cpp:42)
[22:05] tarcieri puts "SENDING SPIKE..."
[22:05] tarcieri @sender_lock.synchronize { @sender.send_string PAYLOAD }
[22:06] cremes tarcieri: what version of 0mq? 2.1.10 or master, i hope
[22:06] tarcieri uhh, old
[22:07] tarcieri 2.1.10
[22:07] cremes oh, that isn't old
[22:07] cremes i have to go; i'll try to repro a little later
[22:07] cremes but i do have a spec in ffi-rzmq that uses push sockets from multiple threads to a single pull
[22:07] tarcieri orly
[22:07] cremes it could probably be easily changed to use the same push from multiple threads but protected via a mutex
[22:08] tarcieri wonder what I'm doing wrong then
[22:08] cremes look at the pushpull_spec
[22:08] tarcieri ^^^ should be fine, right?
[22:08] tarcieri yeah sure
[22:08] cremes that one uses a dedicated socket *per* thread
[22:08] cremes but it *could* be changed to use one socket with a mutex to see if we can repro your issue
[22:09] cremes later
[22:09] tarcieri cool
[22:09] tarcieri later
[23:05] cremes tarcieri: i adapted that spec to use one push socket shared across 4 threads with a mutex; it worked just fine
[23:05] cremes you have something else going on
[23:06] cremes tarcieri: i pushed the change to the repository; it's in pushpull_spec.rb