Thursday September 2, 2010

[Time] Name Message
[00:03] ModusPwnens Hi, I'm having an issue with the first byte of a message being corrupted when it is sent. Has anyone experienced this?
[00:05] bgranger mcxx: You need to grab the stable 2.0.7 releases of pyzmq and zeromq.
[00:06] bgranger Trunk of pyzmq works with stable 2.0.8 of zeromq.
[00:06] bgranger We have not started following the post 2.0.8 master of zeromq
[00:06] ModusPwnens clarification, it's actually the first two bytes that get corrupted
[02:40] lestrrat I assume this is the right place to talk about zeromq?
[02:42] lestrrat I just wanted to check whether zeromq's recv() is supposed to bail out properly (or not) when it receives a signal while it's waiting for a message?
[02:42] lestrrat my perl binding seems to be stuck, and I'm trying to figure out if it's my problem or not.
[03:46] bgranger lestrrat: I think I know what is going on.
[03:46] bgranger Are you still around?
[03:47] lestrrat yes!
[03:47] bgranger Hey, we have the same issue in the Python bindings.
[03:47] bgranger If perl is like Python, then it sets its own signal handlers.
[03:47] lestrrat ah, so glad I'm not alone on this :)
[03:47] lestrrat right
[03:47] bgranger BUT, while a blocking recv is waiting, perl is completely out of control.
[03:48] AndrewBC I wrapped mine in a try except in python, excepting a KeyboardInterrupt
[03:48] AndrewBC worked fine
[03:48] bgranger So what happens is that Perl's signal handler will receive the signal, but it won't be handled until the recv returns. To test if this is what is really going on, try to create a simple example, then send the signal and send an additional msg to the socket so that recv returns.
[03:48] AndrewBC so you might have to manually catch ^C if that's possible in perl
[03:49] bgranger AndrewBC, it won't work if there is not a msg recv'd
[03:49] bgranger Or I should say, the KeyboardInt won't be raised until recv returns.
[03:49] lestrrat bgranger: ah right
[03:49] AndrewBC oh, hm
[03:49] bgranger You can't really interrupt it with SIGINT
[03:50] lestrrat I can go and confirm, but if that's the case, it's really zeromq's responsibility to return on a signal, no? (with errno = EINTR or whatever)
[03:50] bgranger But the same is true when Python calls *any* external C/C++ code.
[03:50] lestrrat right
[03:50] bgranger No, because Python's signal handler is still in place, it just doesn't get to run.
[03:51] lestrrat yeah, but we can propagate the signal as long as we get back control
[03:51] bgranger But you can't get back control
[03:52] lestrrat hmm?
[03:52] bgranger The signal handler is the one at the C level, not the one that you can control using the signal module
[03:52] lestrrat yeah, but I'm writing the Perl<->C binding, so I have control :)
[03:53] bgranger To change the behavior you actually have to use signal.h and setjmp/longjmp logic.
[03:53] bgranger But it can be done
[03:54] bgranger It is just subtle logic and hard to get right in a cross platform manner.
[03:54] bgranger I don't think any of the language bindings are doing this yet, but eventually all should
[03:54] lestrrat I dunno the python internals, but I don't think we need a jump logic for perl
[03:54] bgranger Meanwhile, we are just not using blocking recv when it needs to be interrupted
[03:55] bgranger If perl sets its own signal handlers at the C level, you likely will.
[03:55] lestrrat perl's signal handling doesn't run on "real time". it's deferred until the next perl op. so as long as I'm in C land, I can tell the Perl interpreter that there was a signal, and properly emulate it.
[03:56] bgranger That is how Python works as well and is why the default signal handler doesn't actually interrupt "in real time".
[03:56] lestrrat then I don't see why you need a jump...? what am I missing?
[03:57] bgranger I don't remember the details honestly. I think the issue is that you have to be very careful about restoring Python's regular signal handler and putting the interpreter back in the right place in the stack.
[03:57] lestrrat hmm.
[03:58] bgranger That is, if you want control to return to Perl/Python in a way that doesn't simply kill the process.
[03:58] lestrrat actually, I don't really care either way -- I don't have to have zmq_recv to get back to the Perl handler -- I just need it to get back to me once a signal is sent w/o killing the process
[03:58] lestrrat otherwise servers hang.
[03:59] bgranger yep
[03:59] lestrrat glad to know I'm not the only one. I can safely write an RFE :)
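The workaround bgranger mentions at 03:54 (not using a blocking recv where it must be interruptible) can be sketched roughly as below. This is only an illustration with a plain stdlib socket pair rather than a 0MQ socket; `recv_with_poll`, `should_stop` and `poll_interval` are names made up for the example, not part of pyzmq or zeromq.

```python
import select
import socket


def recv_with_poll(sock, should_stop, poll_interval=0.1, bufsize=4096):
    """Wait for data in short slices instead of one unbounded blocking recv.

    Returns the received bytes, or None if should_stop() became true first.
    """
    while not should_stop():
        # select() returns after at most poll_interval seconds, so control
        # comes back to the interpreter regularly and any pending signal
        # handler (e.g. for Ctrl+C) gets a chance to run.
        readable, _, _ = select.select([sock], [], [], poll_interval)
        if readable:
            return sock.recv(bufsize)
    return None


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.sendall(b"hello")
    print(recv_with_poll(a, should_stop=lambda: False))
    a.close()
    b.close()
```

The trade-off is a little latency and some wake-ups while idle, in exchange for never sitting inside a call the interpreter cannot leave.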
[04:43] sustrik bgranger, lestrrat: i am still not sure about EINTR behaviour
[04:43] sustrik would returning EINTR from blocking call help in any way?
[04:47] bgranger In what situations?
[04:49] bgranger Are you thinking of installing a signal handler that catches SIGINT and translates that into EINTR?
[04:49] bgranger That would help us, but that signal handler would have to play nice with that of Python
[04:52] sustrik bgranger: no, i meant the peculiar functionality of 0mq
[04:52] sustrik that catches OS's EINTR
[04:52] sustrik and instead of returning it to the user
[04:52] sustrik just restarts the blocking operation
[04:53] bgranger I am not familiar with what zmq is doing right now
[04:53] sustrik but afaiu that's not the problem you are dealing with
[04:53] bgranger When does the OS signal EINTR?
[04:53] bgranger And how does zeromq currently handle it?
[04:54] sustrik if a thread gets a signal while in blocking operation
[04:54] sustrik 0mq currently ignores it, i.e. restarts the blocking op
[04:56] bgranger I guess what I don't know is how all of this interplays with Python's signal handling.
[04:56] sustrik exactly
[04:56] bgranger Python only receives signals in the main thread
[04:56] bgranger Does zeromq install any signal handlers?
[04:56] sustrik that's the one that's blocked in 0mq?
[04:57] sustrik no, it does not
[04:57] bgranger OK
[04:57] bgranger but when a socket call returns EINTR, it just continues
[04:57] sustrik right
[04:57] sustrik actually, you can check what's happening yourself
[04:57] bgranger OK, then another ?
[04:57] bgranger OK
[04:57] bgranger How...
[04:57] sustrik when waiting for a message
[04:57] sustrik 0mq is stuck in a read call
[04:58] bgranger Correct
[04:58] sustrik let me find the exact code...
[04:58] bgranger OK great, that would be helpful.
[04:59] bgranger Are there other occasions where EINTR is returned?
[05:00] sustrik polling
[05:00] sustrik are you on linux?
[05:00] lestrrat sorry was away
[05:00] bgranger linux and Mac
[05:00] sustrik hi
[05:00] bgranger Talking about signal handling more...
[05:00] sustrik trunk or 2.0.8?
[05:00] bgranger pyzmq doesn't work with trunk yet, so 2.0.8 or 2.0.7
[05:00] lestrrat I'm using 2.0.8
[05:01] sustrik check this:
[05:01] bgranger but if we can do signal -> EINTR -> Python bindings -> raise KeyboardInterrupt that would be great
[05:02] sustrik yes
[05:02] sustrik you see the loop, right?
[05:02] lestrrat right
[05:02] bgranger Yep
[05:02] sustrik just put a printf there or something
[05:02] sustrik and check whether it loops in case of SIGINT
[05:02] bgranger OK I will do that
[05:02] bgranger I will check on Linux and OS X
[05:02] sustrik if so we can return EINTR instead of looping
[05:03] sustrik great
[05:03] sustrik thanks
[05:03] bgranger Cool
[05:03] bgranger Thanks!
[05:03] bgranger Won't get to it tonight though
[05:03] bgranger Getting late here...
[05:03] sustrik sure
[05:03] sustrik lestrrat: you may try the same, if in hurry
[05:03] lestrrat yeah, I'm on $day_job too, so I'm not going to be able to check it right away. Will see if I can sneak that in in the next couple of hours
[05:04] bgranger If I want to try the change, I can just have the function return EINTR?
[05:04] lestrrat yep, will do when pointy haired boss isn't looking
[05:04] sustrik :)
[05:04] lestrrat AFK for now
[05:04] bgranger Will the return value of that function be returned to the caller
[05:04] sustrik bgranger: probably
[05:04] sustrik let me check
[05:04] bgranger Well, it looks like it must return a bool though
[05:05] bgranger May have to change more logic.
[05:05] sustrik ah, ok
[05:05] sustrik it's not propagated
[05:05] sustrik but i can fix that
[05:05] sustrik the question is whether it would help
[05:05] bgranger OK, let's test first
[05:05] bgranger right
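The two behaviours being compared here (restarting the blocking operation on EINTR versus handing EINTR back to the caller) can be sketched as follows. This is a simulation, not 0MQ code: `make_fake_read` is a hypothetical stand-in for a blocking read that fails with EINTR a given number of times, and both `recv_*` functions are illustrative names.

```python
import errno


def recv_restarting(blocking_read):
    """Behaviour as described in the chat: swallow EINTR and restart."""
    while True:
        try:
            return blocking_read()
        except InterruptedError:
            continue  # the caller never learns that a signal arrived


def recv_interruptible(blocking_read):
    """Proposed behaviour: let EINTR propagate to the caller."""
    return blocking_read()


def make_fake_read(interruptions, payload=b"msg"):
    """Hypothetical blocking read: raises EINTR `interruptions` times,
    then returns `payload`."""
    state = {"left": interruptions}

    def fake_read():
        if state["left"] > 0:
            state["left"] -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return payload

    return fake_read


if __name__ == "__main__":
    print(recv_restarting(make_fake_read(3)))
    try:
        recv_interruptible(make_fake_read(1))
    except InterruptedError as exc:
        print("caller saw errno", exc.errno == errno.EINTR)
```

With the restarting variant, a binding only regains control once a message actually arrives, which matches the hang described earlier in the log.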
[05:09] sustrik lestrrat: are you in japan? i've noticed some of your twitter notes
[05:09] sustrik had no idea what they were about though :)
[05:10] guido_g hi all
[05:10] guido_g good morning sustrik
[05:10] sustrik morning
[05:11] guido_g did you read that the poll segfault in master is still there?
[05:11] sustrik yep, seen you saying it yesterday
[05:11] sustrik can i reproduce it?
[05:11] guido_g i hope so
[05:11] sustrik which test program?
[05:12] guido_g code is at
[05:12] guido_g zmqcpp and sender
[05:12] guido_g zmqcpp segfaults in poll
[05:13] sustrik what order to start them in?
[05:13] guido_g result run is
[05:13] guido_g i started zmqcpp first and then sender
[05:13] guido_g sender is a one shot thing
[05:13] sustrik ok
[05:14] guido_g going to write a ømq test program for that
[05:14] sustrik ack
[05:14] guido_g seems to be a nasty corner that needs some regression
[05:22] sustrik guido_g:
[05:22] sustrik In file included from zmqcpp.cpp:9:
[05:22] sustrik worker.hpp:14:21: error: pqxx/pqxx: No such file or directory
[05:22] sustrik ...
[05:23] guido_g oh shit
[05:23] guido_g sorry
[05:23] sustrik I'll rather wait for your test program
[05:23] guido_g ack
[05:23] guido_g never mind then
[05:23] guido_g "i'll be back!" :)
[05:24] sustrik ;)
[05:24] guido_g so, day job is waiting, cu
[05:31] lestrrat sustrik: yes I'm in Japan :)
[05:32] sustrik lot of japanese tweets about 0mq lately
[05:32] sustrik it's fun
[05:33] lestrrat let's see if I can work on that signal thing...
[05:34] sustrik ok
[05:40] lestrrat alright, so I'm an autoconf noob. how do I make this go away?:
[05:41] sustrik you need newer version of autoconf
[05:41] sustrik are you on OSX?
[05:42] sustrik this one would deserve going into FAQ btw
[05:43] lestrrat yeah, I'm on OSX
[05:43] sustrik it's a known issue
[05:43] lestrrat grrr
[05:43] sustrik osx has old version of autoconf installed by default
[05:44] lestrrat grr, and it's not there in homebrew
[05:47] lestrrat man, this is going to take some time :)
[05:47] sustrik lestrrat: sorry, it's pkg-config
[05:47] sustrik
[05:47] lestrrat oh cool
[05:48] lestrrat hmm, I seem to have the latest pkg-config
[05:49] sustrik are you building with --with-pgm btw?
[05:50] lestrrat nay, just the default with a custom --prefix
[05:50] sustrik hm
[05:50] sustrik well, you'll have to ask on the mailing list
[05:50] sustrik as i say, i'm an autotools idiot
[05:53] lestrrat trying out autoconf anyways
[06:09] CIA-20 zeromq2: Martin Sustrik maint * rd5b6f68 / AUTHORS : Mikael Kjaer added to AUTHORS -
[06:09] CIA-20 zeromq2: Bernd Melchers maint * r8ec0743 / (AUTHORS src/signaler.cpp): Fix for signaler_t on HP-UX and AIX platforms -
[06:09] CIA-20 zeromq2: Jon Dyte maint * r14853c2 / (src/prefix_tree.cpp src/prefix_tree.hpp):
[06:09] CIA-20 zeromq2: Prior to this patch prefix_tree asserts.
[06:09] CIA-20 zeromq2: This is because as it adds the 255th element at a node it attempts to calculate
[06:09] CIA-20 zeromq2: the count member var which is an unsigned char via count = (255 -0) + 1; and
[06:09] CIA-20 zeromq2: pass the result to realloc. Unfortunately the result is zero and realloc returns
[06:09] CIA-20 zeromq2: null; the prefix_tree asserts. I have fixed it by making the count an unsigned
[06:09] CIA-20 zeromq2: short. -
[06:09] CIA-20 zeromq2: Martin Sustrik master * rd5b6f68 / AUTHORS : Mikael Kjaer added to AUTHORS -
[06:09] CIA-20 zeromq2: Bernd Melchers master * r8ec0743 / (AUTHORS src/signaler.cpp): Fix for signaler_t on HP-UX and AIX platforms -
[06:09] CIA-20 zeromq2: Jon Dyte master * r14853c2 / (src/prefix_tree.cpp src/prefix_tree.hpp):
[06:09] CIA-20 zeromq2: Prior to this patch prefix_tree asserts.
[06:09] CIA-20 zeromq2: This is because as it adds the 255th element at a node it attempts to calculate
[06:09] CIA-20 zeromq2: the count member var which is an unsigned char via count = (255 -0) + 1; and
[06:09] CIA-20 zeromq2: pass the result to realloc. Unfortunately the result is zero and realloc returns
[06:09] CIA-20 zeromq2: null; the prefix_tree asserts. I have fixed it by making the count an unsigned
[06:09] CIA-20 zeromq2: short. -
[06:10] CIA-20 zeromq2: Martin Sustrik master * r0a1f7e3 / (AUTHORS src/signaler.cpp src/trie.cpp src/trie.hpp): (log message trimmed)
[06:10] CIA-20 zeromq2: Merge branch 'maint'
[06:10] CIA-20 zeromq2: * maint:
[06:10] CIA-20 zeromq2: Prior to this patch prefix_tree asserts.
[06:10] CIA-20 zeromq2: Fix for signaler_t on HP-UX and AIX platforms
[06:10] CIA-20 zeromq2: Mikael Kjaer added to AUTHORS
[06:10] CIA-20 zeromq2: Conflicts:
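The arithmetic behind the prefix_tree commit message above can be checked quickly. The snippet below only reproduces the integer wrap-around with `ctypes`; the variable names are illustrative and are not taken from the zeromq source.

```python
import ctypes

min_char, new_char = 0, 255  # illustrative values from the commit message

# count held in an unsigned char (before the patch): 256 wraps to 0,
# so the size passed on to realloc would be zero.
count_u8 = ctypes.c_ubyte((new_char - min_char) + 1).value

# count held in an unsigned short (after the patch): 256 fits.
count_u16 = ctypes.c_ushort((new_char - min_char) + 1).value

print(count_u8, count_u16)  # prints: 0 256
```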
[06:35] lestrrat hmm, zeromq HEAD against my perl binding doesn't work -- (seeing some tests block on odd places)
[06:36] lestrrat could be something on my part, of course.
[06:36] lestrrat investigating...
[06:52] lestrrat hmm, switching back to 2.0.8 makes the problem go away. I see all my destructors running, so I'm willing to bet there's some sort of background thread or something that's kept alive in zeromq
[06:54] guido_g re
[06:58] sustrik lestrrat: HEAD is a dev version, testing with 2.0.8 is ok
[06:59] lestrrat k
[06:59] sustrik what about the SIGINT thing?
[06:59] lestrrat trying now
[07:02] lestrrat so yeah, looks like it's stuck in that loop
[07:02] lestrrat tweaking to see if I can cleanly exit there...
[07:11] sustrik lestrrat: i'm going to be away now for a while, but leave me a message here about whether returning EINTR from 0mq would help you
[07:11] lestrrat yep, thanks
[07:23] lestrrat sustrik:
[07:23] lestrrat this works for me!
[08:45] sustrik lestrrat: great
[08:45] sustrik i'll fix it in a bit different way (actually returning EINTR)
[08:46] sustrik but it's good news to know it solves the Ctrl+C problem
[08:46] sustrik we'll see whether it helps with python as well
[09:52] mato sustrik: are you there?
[10:00] sssss test
[10:06] sustrik mato: re
[10:14] mato sustrik: could you switch on all the test servers + switch please?
[10:15] mato sustrik: actually, just one of the 8-core boxes is all i need right now
[10:15] mato sustrik: and the two white ones
[10:15] mato sustrik: so pick the quieter of janos or csaba :) i don't care...
[10:16] sustrik oh my
[10:16] sustrik ok
[10:16] mato :)
[10:16] mato loud music also works :)
[10:18] sustrik started
[10:18] mato damn
[10:18] sustrik now i realise there's just one of those big boxes here
[10:18] sustrik where's the other one?
[10:18] pieterh hey, the lights just dimmed here...!
[10:18] sustrik still at malosek's place?
[10:18] mato sustrik: just one? i dunno? does malo still have the other?
[10:19] mato sustrik: i can't connect to your IP at all :(
[10:19] mato hmm
[10:19] sustrik i see the world outside
[10:19] mato i have a suspicion your chello connection is funny
[10:20] sustrik ?
[10:20] mato well, first of all, your ip is on some weird subnet
[10:20] mato which i've never seen
[10:20] mato 2nd, i can't ping or connect to it :(
[10:21] sustrik :(
[10:21] sustrik you can take the boxes
[10:44] guido_g openvpn is really nice for this kind of things
[10:45] sustrik it works now (not sure why)
[11:26] mato wow
[11:26] mato sustrik: quick question...
[11:26] mato sustrik: zmq_free_fn() is called in the context of the *application* or *i/o* thread?
[11:27] sustrik i/o
[11:27] mato riiiiiight
[11:27] mato that explains a lot
[11:27] sustrik ?
[11:28] mato i'm having issues with the python GIL
[11:28] mato at least i think that's what happening
[11:28] sustrik a-ha, i see
[11:28] mato it explains all the mysterious hanging close() with 2.0.x
[11:28] sustrik python zero-copy functionality
[11:28] mato ja
[11:28] mato anyway, i'm going to lunch, we can discuss when i get back...
[11:29] sustrik ok
[11:47] sustrik thanks
[11:47] sustrik have you thought of a pub to meet the other guys in?
[11:47] sustrik it's only 3 people so far
[11:47] sustrik so no need for complex arrangements
[11:48] sustrik say it'll be 5-7 ultimately
[11:48] sustrik oops :)
[11:48] sustrik that was intended for another window
[11:49] sustrik the london meetup stuff
[12:16] mikko sustrik: theodore bullfrog is a decent place
[12:17] mikko,-0.129493&sspn=0.007317,0.022724&ie=UTF8&hq=theodore+bullfrog&hnear=&ll=51.508843,-0.123961&spn=0.003659,0.011362&z=17&iwloc=A
[12:18] sustrik looks central enough
[12:20] sustrik is that restaurant or a pub?
[12:24] sustrik mikko: hullo!
[12:24] mikko pub
[12:24] mikko but they serve food as well
[12:25] sustrik ok, why not then
[12:26] mikko
[12:26] sustrik nice
[12:36] mato sustrik: ok, so this GIL thing
[12:36] mato sustrik: it's an interesting problem
[12:37] sustrik yes?
[12:37] mato what i see is that in my python application zmq_close() is getting called, due to Python gc kicking in
[12:37] mato however, *at the same time*, zmq_msg_close() on a message shared with Python is called in the i/o thread
[12:38] mato now, zmq_close() calls process_commands() which is waiting for a signal from the i/o thread
[12:39] mato but that signal never arrives, because the free_fn() called by the zmq_msg_close() in the i/o thread is trying to acquire the Python GIL
[12:39] mato -> *deadlock*
[12:39] mato you follow me?
[12:40] mato anyway, i can work around it now by not closing yet more sockets :)
[12:40] mato but it's an interesting case
[12:40] mato i will have to make a test case and write it up on the mailing list so that other people can participate
[12:40] mato it's not clear what the solution is
[12:40] sustrik i have to draw it down :)
[12:41] sustrik mato: yes, please
[12:41] sustrik so that it's not lost
[12:41] mato definitely
[12:41] mato it needs to be solved, and i suspect the same problem can occur in 2.1
[12:41] mato however, it's possible the solution needs to be done on the pyzmq side
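The deadlock mato describes can be modelled in a few lines. This is a toy model only: a `threading.Lock` stands in for the Python GIL, a `threading.Event` stands in for the signal zmq_close() waits for, and timeouts are added so the script reports the deadlock instead of hanging. The function name `simulate` and its parameters are made up for the example.

```python
import threading


def simulate(app_holds_gil, timeout=0.2):
    """Return True if the simulated zmq_close() completes, False if it
    would have waited forever."""
    gil = threading.Lock()         # stand-in for the Python GIL
    io_signal = threading.Event()  # stand-in for the i/o thread's reply

    def io_thread():
        # free_fn() has to take the GIL before it can release the buffer.
        if gil.acquire(timeout=timeout):
            gil.release()
            io_signal.set()        # only now can the i/o thread reply

    if app_holds_gil:
        gil.acquire()              # zmq_close() entered from Python gc
    worker = threading.Thread(target=io_thread)
    worker.start()
    # zmq_close() waits for the i/o thread while still holding the GIL.
    completed = io_signal.wait(timeout=2 * timeout)
    worker.join()
    if app_holds_gil:
        gil.release()
    return completed


if __name__ == "__main__":
    print("GIL held during close:", simulate(True))
    print("GIL released during close:", simulate(False))
```

In this model the second case suggests one possible direction (releasing the lock around the blocking close), but whether that maps onto pyzmq is exactly the open question in the discussion.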
[12:43] lestrrat sustrik: wrt EINTR thing, what version should I be expecting it to be implemented? just want to know, cause I'd like to wait my first release of mongrel2 handler until that's fixed.
[12:43] lestrrat don't want to tell my users to send a SIGKILL to stop their handlers :)
[12:43] lestrrat (not in a hurry, just want to get an idea)
[12:44] sustrik lestrrat: the problem is that it isn't backward compatible
[12:44] sustrik :|
[12:44] lestrrat ah
[12:44] mato EINTR *again* ?
[12:44] lestrrat yeah, true
[12:44] sustrik yes
[12:44] sustrik it turns out that Ctrl+C problem in bindings
[12:44] sustrik can be solved by returning EINTR
[12:44] mato sustrik: have you verified this?
[12:45] sustrik lestrrat did with perl binding
[12:45] mato hmm, i think i know why it "solves" the problem
[12:45] mato or i can guess...
[12:45] sustrik i'm still waiting for brian to verify it with python
[12:45] lestrrat I'm willing to accept I'm a bit off :)
[12:45] sustrik mato: go on
[12:46] mato i reckon what happens is returning EINTR means "the language runtime" wakes up
[12:46] sustrik right
[12:46] mato depending on said language runtime's signal handling (out of our control, obviously), the runtime waking up *may* mean that it processes pending signals
[12:46] mato hence the "solution"
[12:47] sustrik well, at least the runtime has a chance to process the signal
[12:47] mato hmm
[12:47] sustrik right now it's completely stuck
[12:47] mato yeah
[12:47] mato of course the runtime might be doing something entirely different
[12:47] mato so it's not a guaranteed solution
[12:47] sustrik not our fault
[12:47] mato just one that might help some people
[12:48] mato hmm, except...
[12:48] sustrik obviously we cannot give guarantees for broken runtimes
[12:48] mato yes, but what is your proposed implementation?
[12:48] sustrik forward EINTR to the user
[12:48] mato yes, except...
[12:49] mato the relationship between EINTR from a syscall and SIGINT is afaik nondeterministic
[12:49] mato in other words, you might get an EINTR back from a blocking call
[12:49] mato or not
[12:49] mato you see where this is heading?
[12:49] lestrrat should I explain how perl's signal handling works? if you guys know what's up, I'm not entirely an expert either so I'll let you guys debate the actual implementation ;)
[12:49] mato lestrrat: sure, go for it?
[12:49] mato .
[12:50] lestrrat well, the important part wrt to all of this is that, when we go down to the C world, Perl's signal handling is effectively silenced.
[12:51] lestrrat (C world, as in binding)
[12:51] lestrrat when the binding method is called, things are completely left up to the C/C++ routine, so you can do whatever with the signals
[12:51] lestrrat and when zmq's recv() is being called, perl has absolutely no say there
[12:52] lestrrat so from perl's perspective, all it needs is for recv() to return, and let it know from a return value or errno that there was a signal sent somewhere along the line
[12:53] lestrrat then (if need be) the Perl binding can tell the perl interpreter that there was a signal, and let perl's sig handlers run
[12:53] lestrrat at least that's how I understand it to work.
[12:53] sustrik right
[12:54] mato ok
[12:54] mato i've not looked into the perl signal handling
[12:54] sustrik what i think mato is saying is that OS does not have to interrupt a blocking call (at C level) when SIGINT happens
[12:54] mato yes
[12:54] lestrrat right
[12:54] sustrik how come that Ctrl+C works then
[12:54] sustrik ?
[12:54] sustrik for C programs?
[12:55] sustrik if main thread was stuck in a blocking call
[12:55] mato ah, the thing is multiple threads are involved
[12:55] mato this is what makes it tricky
[12:55] sustrik but how does it work in C world?
[12:55] mato ok, in the simple case
[12:55] mato (1 thread)
[12:55] mato the OS will invoke the default signal handler
[12:55] mato which happens to be "exit the program"
[12:56] mato the problem is, if you're handling signals
[12:56] sustrik ah, so no SIGINT
[12:56] sustrik just hard exit
[12:56] mato yes, you get SIGINT
[12:56] mato it's hard to explain...
[12:56] sustrik try
[12:56] lestrrat :)
[12:56] sustrik ascii graphics!
[12:57] mikko sustrik: it's possible that i might be in Amsterdam whole of next week
[12:57] mikko they are still trying to figure out schedules
[12:57] sustrik aha
[12:57] sustrik ok, we'll see
[12:57] mato ok, so, case 0: 1 thread, no signal handler: result: on ^C default "exit the program" handler is called by the OS
[12:58] mato case 1: 1 thread, *and* a signal handler in that thread: result: on ^C, thread's SIGINT handler is called and when that returns the blocking call it interrupted returns EINTR
[12:58] mato it is now up to the program to resolve that EINTR
[12:58] mato if it did not exit in the signal handler already, of course
[12:59] sustrik ack
[12:59] mato now, here's what i imagine most language runtimes do in a signal handler
[12:59] mato and in fact most apps with old-style (no separate thread with sigwait()) signal handlers
[12:59] mato they set some flag
[12:59] mato that's all :)
[13:00] mato hence, when the call returns EINTR, the app/Perl/whatever checks its "was i interrupted" flag
[13:00] mato and if it was, deals with that in its normal control flow
[13:00] lestrrat right
[13:00] sustrik ok
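The "handler only sets a flag" pattern mato outlines can be sketched like this. It is a minimal illustration using Python's `signal` module; `InterruptFlag` and `run_until_interrupted` are names invented for the example, and `signal.raise_signal` (Python 3.8+) is used to simulate Ctrl+C.

```python
import signal
import time


class InterruptFlag:
    """Records that a signal arrived; the handler does nothing else."""

    def __init__(self):
        self.is_set = False

    def handler(self, signum, frame):
        self.is_set = True


def run_until_interrupted(flag, step, max_steps=1000):
    """Call step() repeatedly; return how many steps ran before the flag
    was noticed (or max_steps if it never was)."""
    for done in range(max_steps):
        if flag.is_set:
            return done  # deal with the interrupt in normal control flow
        step()
    return max_steps


if __name__ == "__main__":
    flag = InterruptFlag()
    previous = signal.signal(signal.SIGINT, flag.handler)
    try:
        signal.raise_signal(signal.SIGINT)  # simulate Ctrl+C
        print(run_until_interrupted(flag, lambda: time.sleep(0.01)))
    finally:
        signal.signal(signal.SIGINT, previous)
```

The flag is only ever checked between steps, so a step that blocks indefinitely (the situation discussed in this log) delays the reaction for as long as it blocks.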
[13:00] mato now, enter 0mq
[13:00] mato the i/o threads ignore all signals
[13:01] sustrik right
[13:01] mato they may or may not get an EINTR if the *process* gets a signal
[13:01] mato that is entirely OS-dependent
[13:01] mato assuming for a moment that they did, the naive implementation would just return that EINTR, which would bubble up to the app thread logic and something useful might happen
[13:01] mato however, that is a bad assumption
[13:02] mato so, what to do?
[13:02] sustrik wait a sec
[13:02] sustrik are you saying that I/O threads may get the signal?
[13:02] mato I/O threads will never "get" the signal
[13:02] sustrik ok
[13:03] mato but depending on your OS, they *may* get an EINTR back from any random syscall
[13:03] sustrik ack
[13:03] mato well, not any random, but anything that does something complex
[13:03] mato i.e. not getpid() but send() is a candidate
[13:03] sustrik ok, but that's beside the point
[13:03] mato it's not
[13:03] sustrik because we are interested in app threads
[13:03] sustrik not i/o threads
[13:03] sustrik whether i/o thread gets sigint or not, we don't care
[13:04] mato yes we do, because if you were to be 100% compatible you'd have to "emulate" the behaviour i described in case 1
[13:04] mato s/compatible/nice to broken apps/interpreters/
[13:04] sustrik wait
[13:04] sustrik i/o thread just loops
[13:04] lestrrat isn't the recv() being called on the app thread?
[13:05] sustrik so if it gets EINTR it can just ignore it
[13:05] mato hmm, yes
[13:05] sustrik lestrrat: that's my point
[13:05] mato ok, right, so you want the app thread side of the API to pass EINTR
[13:05] sustrik ack
[13:05] lestrrat right
[13:06] mato hmm
[13:06] mato ok, one way to do it w/o breaking compatibility
[13:06] mato with existing code
[13:06] mato is a context option
[13:06] mato "ZMQ_INTERRUPTIBLE"
[13:07] sustrik ok, so it would work?
[13:07] mato meaning "API calls will return EINTR if interrupted by a signal"
[13:07] mato i think so
[13:07] sustrik woohoo
[13:07] mato note that there is no change on the i/o thread side
[13:07] lestrrat coolness
[13:07] sustrik or a compile time option
[13:07] mato no, compile time is bad
[13:08] mato may be shared
[13:08] lestrrat I'd vote for ZMQ_INTERRUPTIBLE
[13:08] sustrik there are no context-wide options
[13:08] mato well, they'll have to be added then :)
[13:08] sustrik but this is a hack
[13:08] mato ?
[13:08] sustrik the right solution is to return EINTR
[13:09] sustrik we need the option only to stay backward compatible with the original lousy solution
[13:09] mato hmm hmm
[13:09] sustrik i don't like changing API because of a hack
[13:10] mato sustrik: just a minute, i'm still thinking
[13:10] lestrrat I was wondering from the beginning, but does it have to "return" EINTR? is it not enough to keep errno = EINTR ?
[13:10] sustrik errno = EINTR
[13:10] mato lestrrat: yes, we mean return (-1) with errno = EINTR
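On the binding side, the "return (-1) with errno = EINTR" convention could be turned into a language-level error roughly as follows. This is a sketch under assumptions: `check_rc` is a hypothetical helper, and the `(rc, err)` pair stands in for whatever a real binding reads back from the C call. It relies on Python's `OSError` constructor mapping `errno.EINTR` to `InterruptedError`.

```python
import errno
import os


def check_rc(rc, err):
    """Translate a C-style result into a Python return value or exception.

    rc == -1 signals failure and `err` carries the errno value; OSError
    picks the matching subclass, so EINTR surfaces as InterruptedError.
    """
    if rc == -1:
        raise OSError(err, os.strerror(err))
    return rc


if __name__ == "__main__":
    print(check_rc(0, 0))
    try:
        check_rc(-1, errno.EINTR)
    except InterruptedError:
        print("interrupted; the binding can now let signal handlers run")
```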
[13:10] lestrrat ah
[13:10] sustrik but still it breaks the API specification
[13:10] sustrik EINTR is not described as valid error from recv()
[13:11] mato recv() or zmq_recv()?
[13:11] sustrik zmq_recv(), sorry
[13:11] mato so we'll have to change that, i think
[13:11] sustrik people will hate us
[13:11] mato then add ZMQ_INTERRUPTIBLE
[13:11] sustrik :)
[13:12] lestrrat hey, I'm willing to accept a zmq_recv2() ;P
[13:12] mato the thing is, the model of never getting EINTR is actually right if you do your handling properly
[13:12] mato kind of
[13:12] sustrik shrug
[13:12] lestrrat does that mean the "correct" way is for me to install a sighandler in my binding?
[13:13] mato lestrrat: that probably won't help because of the way the interpreter is architected
[13:13] mato lestrrat: but yes, that would be the correct way
[13:13] mato in an ideal world :)
[13:13] sustrik in broader sense i would say: let's keep with POSIX API
[13:13] sustrik trying to outsmart it just causes problems
[13:13] lestrrat oh yeah, recv is currently a loop, ain't it...
[13:13] mato sustrik: well, what this requires then is...
[13:14] mato sustrik: a nice thorough explanation by email
[13:14] mato sustrik: combined with "sorry, we messed up"
[13:14] sustrik we did
[13:14] mato sustrik: and it'll just go into 2.1
[13:14] mato eventually, not immediately
[13:14] mato after actual verification with at least say Perl, Python, Ruby that it does *solve* the problem
[13:15] sustrik actually, it was brian granger who asked for backward compatibility guarantees :)
[13:15] lestrrat would it be acceptable to add a different, aptly named function?
[13:15] lestrrat zmq_recv_intr() or whatever.
[13:15] mato lestrrat: no, because it involves ALL API calls that can block
[13:15] lestrrat hmm
[13:16] mato it is actually a mistake on our part, but i didn't realise until today why it was a mistake
[13:16] mato well, we could be anal and say "make your signal handling work like this", but that's impossible in the real world
[13:17] sustrik ok, let me write an email describing the problem
[13:17] mato do you understand it well enough?
[13:17] mato this email must not be compressed :-)
[13:17] sustrik and asking whether breaking backward compatibility is acceptable in this case
[13:17] sustrik no
[13:17] sustrik you can write it
[13:17] sustrik but it involves saying "i am an idiot, sorry"
[13:17] mato i'd like to verify that it actually helps
[13:18] mato yes, true, you were the one that defined the behaviour
[13:18] mato ok, look, i'll write the text describing the problem
[13:18] mato you can send it out, adding "I'm an idiot"
[13:18] mato ok? :)
[13:18] sustrik :)
[13:18] sustrik ok, let's first check whether it helps
[13:18] mato but before you publicly denounce yourself, it would be nice to check first
[13:18] mato precisely
[13:18] mato sustrik: i would suggest being very pedantic about this
[13:19] sustrik if it turns out that it does not i don't have to call myself an idiot
[13:19] mato i.e. make three test cases (Perl, Python, Ruby)
[13:19] sustrik ok, i can create a topic branch
[13:19] mato ensure they hang currently
[13:19] sustrik fix it there
[13:19] mato i.e. ^C doesn't work
[13:19] mato then make your change
[13:19] mato and ensure that all works as expected
[13:19] mato sustrik: yes, topic branch, involves doc changes and so on
[13:19] mato good idea
[13:20] sustrik ok, let me do it
[13:20] lestrrat let me know when I can test it :)
[13:20] mato sustrik: i will find the magic command for you to email the topic branch patch set around
[13:20] sustrik lestrrat: i'll ping you
[13:20] sustrik i'll ping brian as well
[13:20] mato sustrik: so that you can give it to e.g. lestrrat
[13:20] mato ja
[13:20] sustrik he's willing to test it, we've discussed it in the morning
[13:20] mato just email the patches around privately if you don't want to call yourself a potential idiot just yet :)
[13:20] sustrik not sure about ruby
[13:20] mato (in public) :)
[13:21] sustrik cremes: are you here?
[13:21] lestrrat +1 for branch (just easier to pull ;)
[13:22] sustrik lestrrat: what's your email
[13:22] sustrik how should i ping you?
[13:22] lestrrat
[13:22] sustrik thx
[13:22] lestrrat lestrrat @ github, lestrrat @ twitter
[13:22] sustrik ok
[13:22] mato lestrrat: shhh... i'm slowly teaching sustrik git
[13:22] lestrrat lol
[13:22] mato start with local branches
[13:22] mato :)
[13:24] sustrik ok
[13:29] mato ok, i have to concentrate on something else for a bit
[13:29] mato ping me if you need me, bbl
[13:31] sustrik cya
[13:53] cremes sustrik: just got here; what do you need?
[13:54] sustrik are you seeing problem with Ctrl+C in Ruby?
[13:54] sustrik i mean, application not responding to SIGINT?
[13:55] sustrik when stuck in 0mq blocking call?
[13:55] cremes yes
[13:55] sustrik aha, good
[13:55] sustrik we think we've found a solution
[13:55] cremes i don't think the ruby signal handler runs when external C code is executing
[13:55] cremes really? that's good
[13:55] sustrik would you be willing to test it once i have a fix?
[13:56] sustrik the idea is that the blocking calls would return EINTR in case of Ctrl+C
[13:56] sustrik then the binding can take care of what happens next
[13:57] cremes sure
[13:57] sustrik great, i'll ping you once i have it
[13:57] cremes ok
[14:04] mrm2m Hi, I'm very confused at the moment: That works:
[14:05] mrm2m That doesn't work:
[14:05] mrm2m ignore those random and threading they are not used.
[14:06] mrm2m the "thread started" shows up, but nothing is sent to the corresponding server.
[14:09] sustrik mrm2m: the application exits after sending the message?
[14:10] mrm2m right
[14:10] sustrik in 2.0.8 all the unsent data are discarded when socket is closed
[14:10] mrm2m Uh - ok
[14:11] sustrik the semantics is changed in 2.1
[14:11] sustrik it's: block context termination until all data are sent
[14:12] guido_g so you can't exit if something is broken?
[14:12] sustrik it's annoying, i know
[14:13] sustrik what i want to do is to add SO_LINGER socket option
[14:13] lestrrat would be nice if it was configurable
[14:13] guido_g ack
[14:13] sustrik with same semantics as with POSIX sockets
[14:13] sustrik that should do imo
[15:25] gavinstark I am following the PUB/SUB example in the 'guide' but I'm wondering how I might setup a way for there to be multiple publishers that each subscriber can receive from. Or do I need independent bind/connects for that?
[15:27] mikko gavinstark: the latter
[15:34] cremes gavinstark: you also need a forwarder device to aggregate all of the publishers' messages
[15:35] cremes (not strictly true, but i think it's the cleanest way to set things up)
[15:42] gavinstark Do I have to pre-list all the publishers in the forwarder config? It seems like I'd have to have multiple <in> with unique values for each publisher that might appear? Or am I missing something?
[15:55] sustrik gavinstark: bind your forwarder devices
[15:55] sustrik connect the publishers and subscribers
[15:57] gavinstark sustrik, what if I do not know the qty of publishers before hand? Won't each publisher have to bind uniquely? (tcp://....:5555, tcp://....:5556, etc.?)
[15:59] sustrik publishers should _connect_ to the forwarder
[16:06] gavinstark sustrik, ah, ok. I just tried that, having the publisher zmq_connect instead of bind, still not quite working. Here is what I did:
[16:06] gavinstark forwarder config:
[16:07] gavinstark Publisher:
[16:07] gavinstark subscriber:
[16:08] gavinstark ah, I think I see, I was supposed to "bind" on the "in" entry?
[16:12] sustrik gavinstark: forwarder should bind both in and out
[16:12] zedas lestrrat: re: mongrel2 handlers blocking, i fixed it by making the zeromq IO threads be 1. I've been running it like that for weeks without any problems, so you should be fine.
[16:13] gavinstark sustrik: Thanks, working perfectly now.
[16:13] sustrik zedas: this is a different issue
[16:13] sustrik annoying interactions between language runtime, OS signals and 0MQ async architecture
[16:15] zedas sustrik: ah. you got a link for me about it?
[16:15] mato that's a very polite way of putting it :-)
[16:15] sustrik mato: are you able to summarise it for zed?
[16:16] mato yeah
[16:16] mato signal handling is fucked :-)
[16:16] mato end of summary
[16:16] sustrik up to the point
[16:17] sustrik basically, it has to do with handling Ctrl+C in interpreted languages
[16:17] mato to elaborate on my summary, the issue is with the 0mq API not returning EINTR when API calls are interrupted by a signal in the application thread
[16:17] sustrik i'm working on a fix now
[16:17] mato zedas: problem is most language runtimes do delayed signal handling
[16:17] mato zedas: i.e. handler() just sets some flag
[16:17] mato zedas: it doesn't actually *do* anything
[16:17] mato zedas: flag gets picked up when the runtime wakes up
[16:18] mato zedas: but since 0mq calls never return EINTR, runtime never wakes up
[16:26] zedas mato: ah yes, that'd explain it.
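The delayed handling mato describes can be observed from Python itself. What follows is a minimal sketch, assuming a Unix system and Python 3.5 or later, where an interrupted system call is retried unless the signal handler raises. Here `select.select()` stands in for a blocking call that does report EINTR, and SIGALRM from a timer stands in for Ctrl+C; `Interrupted` and `wait_with_alarm` are names made up for this illustration, not part of 0MQ or any binding.

```python
import select
import signal
import time


class Interrupted(Exception):
    """Raised by the Python-level handler to abandon the blocking wait."""


def _on_alarm(signum, frame):
    # The interpreter runs this only after the blocking C call hands control
    # back, which happens here because select() returns EINTR.
    raise Interrupted()


def wait_with_alarm(timeout_s, alarm_after_s):
    """Block for up to timeout_s seconds; a timer sends SIGALRM after alarm_after_s.

    Returns (interrupted, elapsed_seconds). Must run in the main thread.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    start = time.monotonic()
    signal.setitimer(signal.ITIMER_REAL, alarm_after_s)
    try:
        select.select([], [], [], timeout_s)  # sleeps in the kernel
        interrupted = False
    except Interrupted:
        interrupted = True
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel a pending timer
        signal.signal(signal.SIGALRM, previous)   # restore the old handler
    return interrupted, time.monotonic() - start


if __name__ == "__main__":
    print(wait_with_alarm(2.0, 0.1))
```

If the blocking call swallowed EINTR and resumed waiting on its own, as mato says the 0MQ calls of the time did, the handler would not get to run until the call finished for some other reason.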
[16:35] ModusPwnens hi, i'm back again. Hopefully with a question that's not as dumb this time. Anyways, does anyone know what would cause the first two bytes of a message to get corrupted?
[16:39] bgranger zedas: can you say more about what you have been doing to solve this. I was going to look into the EINTR stuff today.
[16:40] mato bgranger: I think we have a solution for the EINTR stuff
[16:40] sustrik ModusPwnens: do you have a test program?
[16:40] bgranger mato: ?
[16:40] mato bgranger: oh, hang on, maybe you mean zed's issue
[16:40] bgranger Is that not the same thing?
[16:41] mato as opposed to the general issue which manifests itself as "^C doesn't work in $RANDOM_LANGUAGE"
[16:41] bgranger I missed some of the discussion so I am trying to piece it together
[16:41] mato ah
[16:41] bgranger Are there 2 different issues?
[16:41] mato yeah
[16:41] mato zedas has an issue with multiple i/o threads
[16:41] mato which may or may not have something (different) to do with EINTR
[16:42] mato not clear, haven't had time to look at that
[16:42] bgranger And signals?
[16:42] mato signals
[16:42] mato The problem is as I described just above.
[16:42] bgranger What is the idea about EINTR?
[16:42] mato When Martin Sustrik made the original API he decided not to return EINTR from blocking calls
[16:42] bgranger I did talk to sustrik late last night (for me) and he pointed me to some code ...
[16:42] bgranger Right he showed me that code
[16:43] mato Right, except that basically breaks standard signal handling
[16:43] bgranger Today I am going to put in a print statement in that logic and try to see if it prints with a SIGINT in the Python bindings.
[16:43] bgranger It is still not clear if this will work with the Python bindings, but we will see.
[16:43] mato To recap, imagine the simplest case with a C program blocking on some syscall, while at the same time handling say SIGINT.
[16:43] bgranger I think it may
[16:43] ModusPwnens sustrik: Yeah, it's happening in my code. I'm not entirely sure why. It seems like it happens when it is sent, as the sending side can properly decode the message. However, the receiving side cannot because the first two bytes get corrupted for some reason. I can paste my code to a pastebin if you like.
[16:44] mato bgranger: Now, if "handling" SIGINT in this program's case means it just prints "Interrupted!" and exits, then fine.
[16:44] mato bgranger: That will work even with the current situation in 0MQ.
[16:44] bgranger Not in the python bindings...
[16:45] mato bgranger: But, if it instead means that the program just sets some random flag, and then expects to process that flag "later", it won't work.
[16:45] bgranger Right
[16:45] mato bgranger: Which is precisely the Python/Perl (at least) case
[16:45] bgranger Which is what Python does...yep
[16:45] sustrik ModusPwnens: try it
[16:45] mato bgranger: So, the only real solution is that 0MQ *API* calls return EINTR if they get EINTR back from a blocking system call.
[16:46] bgranger Will that happen regardless of what signal handlers have been installed?
[16:46] mato Yup
[16:46] ModusPwnens sustrik: do you want the entire code or just the functions in question?
[16:46] mato bgranger: It's what the OS does.
[16:46] bgranger That would definitely solve our problems then! We would be very happy about that.
[16:46] sustrik well, i would like a simple example
[16:46] sustrik showing the problem
[16:47] mato bgranger: If you do e.g. poll () in C, handle a signal in the same thread, then that poll () will return EINTR
[16:47] mato bgranger: by "handle a signal" I mean the "set a flag case"
[16:47] sustrik bgranger: i'll fix it and let you know
[16:47] bgranger mato: right OK
[16:47] sustrik you can test it with python, others will test with perl and ruby
[16:47] mato bgranger: By the way, while you're here, have you had any more Python/GIL issues?
[16:47] bgranger sustrik: can you fix it in the 2.0.8 branch. We are not using trunk yet
[16:48] sustrik backwards compatibility :|
[16:48] bgranger mato: No we have solved those. It was super subtle to get non-copy send/recv working with the GIL though.
[16:48] bgranger mato: But it is working well.
[16:48] mato bgranger: I have a really interesting case which looks like one of those...
[16:48] mato bgranger: Will send to the mailing list, tomorrow.
[16:49] ModusPwnens Hmm, well, i don't really have an example; it's just happening in the code I have written, so i could give you that if you wanted to see it. It started happening after i used google protobufs to encode into a byte array instead of a string.
[16:49] bgranger mato: Ok, I will watch for it. The challenge was getting the ref counts of zmq messages synch'd with those of Python.
[16:49] mato bgranger: But the short story is, I see gc trying to close() a socket, while at the same time zmq_free_fn() in a different thread is trying to acquire the GIL in order to decrease the message refcount
[16:49] mato bgranger: and the result is deadlock
[16:50] bgranger Is this in trunk where sockets can move threads?
[16:50] mato bgranger: nope, 2.0.8
[16:50] bgranger What language?
[16:50] mato Python...
[16:50] bgranger What version of pyzmq?
[16:50] mato latest-ish, let me check
[16:51] bgranger Since Saturday?
[16:51] mato ah, no
[16:51] mato 2.0.7 pyzmq actually
[16:51] mato with 2.0.8 zmq
[16:51] bgranger I did a bunch of work then. I released a 2.0.7 stable release and master is now 2.0.8 compatible.
[16:51] bgranger But, I believe that what you are saying is possible. The 1 problem we have is that the zmq_free_fn does have to acquire the GIL. If that can't happen, you have trouble.
[16:52] jonrafkind has anyone used zmq in a real-time game? mostly I just need low latency
[16:52] mato bgranger: that's precisely what I'm seeing
[16:52] bgranger There is nothing we can do to get around this.
[16:52] bgranger Can you just hold onto the socket ref to prevent gc?
[16:52] mato Yes, but I have transient sockets in this application
[16:52] mato So tons of fds get leaked
[16:52] bgranger Hmmm, that might be tough
[16:53] bgranger What do you mean by that?
[16:53] mato Well, it's precisely my workaround (holding onto the socket refs)
[16:53] bgranger Ahh, OK.
[16:53] mato But that means the underlying fds hang around, so eventually you'll run out.
[16:53] bgranger Is the socket that is being gc'd the one that is sending the msg though?
[16:53] bgranger right, you don't want that
[16:54] mato That's hard to tell at the moment, but at least you've confirmed that this can happen.
[16:54] mato So the backtraces I see make sense.
[16:54] mato I'll write it up tomorrow.
[16:54] sustrik ModusPwnens: if you want me to look at it, strip it down to the simplest possible example that reproduces the behaviour
[16:54] bgranger I can do the following. When a socket sends a msg, it can add itself to a list of sockets that the Message holds on to. That way the message can prevent the gc, but when the msg goes away, the socket will as well.
[16:55] sustrik ModusPwnens: aren't you overwriting the buffer you've sent to 0MQ?
[16:55] bgranger but that might keep the socket around longer than you want if you have a message that lives a long time.
[16:55] mato bgranger: That might be a solution, yes.
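The lifetime rule bgranger proposes can be sketched in plain Python. This is only an illustration under stated assumptions: `Socket` and `Message` below are hypothetical stand-ins, not the pyzmq classes, and the `keepalive` list is an invented name for "the list of sockets that the Message holds on to". The sketch relies on CPython reference counting and has no reference cycle, since the message points at the socket but not the other way round.

```python
import gc
import weakref


class Message:
    """Hypothetical message wrapper that keeps its sending sockets alive."""

    def __init__(self, payload):
        self.payload = payload
        self.keepalive = []  # sockets that must outlive this message


class Socket:
    """Hypothetical socket wrapper; no real I/O happens here."""

    def __init__(self, name):
        self.name = name

    def send(self, message):
        # Register with the message so the socket cannot be collected
        # while the message is still referenced somewhere.
        message.keepalive.append(self)


def demo():
    sock = Socket("transient")
    probe = weakref.ref(sock)  # lets us observe collection without owning sock
    msg = Message(b"data")
    sock.send(msg)

    del sock                   # the application drops its own reference
    gc.collect()
    alive_while_message_exists = probe() is not None

    del msg                    # the last message reference goes away
    gc.collect()
    alive_after_message_gone = probe() is not None
    return alive_while_message_exists, alive_after_message_gone


if __name__ == "__main__":
    print(demo())
```

The trade-off bgranger raises is visible in the design: a message that is kept for a long time pins its socket for just as long.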
[16:55] mato bgranger: Anyhow, gc is not supposed to be instant, no?
[16:56] bgranger Depends
[16:56] bgranger if there are cycles or not.
[16:56] mato bgranger: So if the socket hangs around for a bit... does it matter too much? As long as it goes away eventually.
[16:56] sustrik jonrafkind: i think there are couple of game devs here, rbraley for example
[16:56] bgranger Depends on what "a bit" means
[16:56] mato True.
[16:56] ModusPwnens Sustrik: Well it's in a loop and I close the message at the end of the loop, initializing it again at the beginning
[16:57] mato bgranger: Let me sleep on it, and write up, this is useful to have on the list
[16:57] mato bgranger: since other people dealing with GC-based languages may run into similar problems.
[16:57] bgranger mato: Great
[16:57] bgranger Absolutely
[16:57] ModusPwnens and the data that I am initializing it with is a char * which i free at the end of each iteration too
[16:57] sustrik show me the sending code
[16:58] mato bgranger: Thanks for your help
[16:58] bgranger later
[16:59] ModusPwnens
[17:00] ModusPwnens there's a lot of debugging stuff in there so i'm sorry that it is messy
[17:00] sustrik ModusPwnens: when using zmq_msg_init_data you are passing ownership of the buffer to 0MQ
[17:00] sustrik so you have to give it a free function
[17:00] sustrik and don't touch the buffer afterwards
[17:01] sustrik if you don't need zero-copy
[17:01] sustrik just init the message using zmq_msg_init_size
[17:01] sustrik and copy the data into it
[17:01] ModusPwnens So just don't use init_data at all?
[17:02] sustrik do you need zero-copy?
[17:03] ModusPwnens I'm not really sure what that is, so I don't think so.
[17:03] sustrik then don't use it :)
[17:03] ModusPwnens Ok! So just use size to initialize it and then memcpy into it?
[17:03] sustrik exactly
[17:03] ModusPwnens okie doke. Thanks! I will try that!
[17:03] ModusPwnens Sorry for the constant questions :S
[17:04] sustrik np
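The difference between handing a buffer over and copying it can be shown with a pure-Python analogy. It makes no 0MQ calls: a `memoryview` over a `bytearray` plays the role of a message built with zmq_msg_init_data (the queue and the sender share the same bytes), and `bytes(buf)` plays the role of zmq_msg_init_size followed by a memcpy. The function names, the list used as a queue and the payload bytes are all invented for this sketch.

```python
def zero_copy_enqueue(queue, buf):
    """Queue a view of buf; the queued message shares memory with the caller."""
    queue.append(memoryview(buf))


def copying_enqueue(queue, buf):
    """Queue a private copy of buf; the caller may reuse buf right away."""
    queue.append(bytes(buf))


def demo():
    payload = bytearray(b"\x08\x96\x01hello")
    shared_queue, copied_queue = [], []

    zero_copy_enqueue(shared_queue, payload)
    copying_enqueue(copied_queue, payload)

    # The sender reuses its buffer before the queued message has gone out,
    # which is the kind of reuse sustrik asks about.
    payload[0:2] = b"\xff\xff"

    return bytes(shared_queue[0]), bytes(copied_queue[0])


if __name__ == "__main__":
    shared, copied = demo()
    print(shared)  # first two bytes overwritten
    print(copied)  # original bytes intact
```

In the shared case the first two bytes of the queued message change after the fact, while the copied message is unaffected.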
[17:06] jonrafkind oh i just realized, imatrix is the same company that made SFL. I use that in my project :p
[20:50] ModusPwnens Hi guys, I have encountered a problem with the official benchmarking utility.
[20:51] ModusPwnens It appears to crash if you enter in a very large number of messages, and I was wondering if this was supposed to happen and if so, why?
[21:16] cremes ModusPwnens: which benchmarking utility and what number did you pass to it?
[21:17] ModusPwnens The site has changed so I don't know where the utility is offhand anymore
[21:18] cremes what's the name of it?
[21:18] ModusPwnens actually, it's in my zeromq folder
[21:18] ModusPwnens in the bin
[21:18] ModusPwnens remote_thr
[21:18] cremes oh, the local_thr/remote_thr pair?
[21:18] ModusPwnens ya
[21:19] cremes so what arguments did you pass it? (i recommend you pastie the output from your shell along with any displayed error)
[21:19] ModusPwnens i passed 50 and 25000000
[21:20] ModusPwnens C:\Users\David Dawson\Desktop\zeromq-2.0.7\zeromq-2.0.7\bin>remote_thr.exe tcp:/
[21:20] ModusPwnens Assertion failed: end_chunk->next (c:\users\david dawson\desktop\zeromq-2.0.7\ze
[21:20] ModusPwnens romq-2.0.7\src\yqueue.hpp:108)
[21:20] ModusPwnens This application has requested the Runtime to terminate it in an unusual way.
[21:20] ModusPwnens Please contact the application's support team for more information.
[21:20] cremes did it fail only with the 25 million number or with 50 too?
[21:20] ModusPwnens bah, sorry, the command prompt is strange
[21:20] ModusPwnens no, it's just the 25 million
[21:20] ModusPwnens if i lower it it works fine
[21:20] ModusPwnens but I was just curious if it is supposed to fail that way
[21:21] cremes no, you may have found a bug
[21:21] cremes but before reporting it, install 2.0.8 and try again
[21:21] ModusPwnens ok
[21:21] cremes no sense in reporting a bug against an old release
[21:21] ModusPwnens true enough
[21:22] ModusPwnens ok i have to recompile the source, hold on
[21:22] cremes just a note on irc etiquette...
[21:22] cremes give as much information as possible...
[21:23] cremes don't paste more than 2 lines directly into the channel; use a pastie service like or for longer stuff
[21:23] cremes tell us the name of the programs involved and the version of the library
[21:23] ModusPwnens Ok. I will note that for the future.
[21:23] ModusPwnens thanks!
[21:23] cremes if you see a crash, it is never *supposed* to happen so asking if it is seems a bit silly
[21:24] cremes np
[21:24] ModusPwnens Well, i guess I meant to say if it was already known
[21:24] cremes then search the issues on github; all known bugs are reported and tracked there
[21:25] ModusPwnens Hmm, ok. I didn't know about that..
[21:25] cremes and now you know! ;)
[21:25] ModusPwnens that would be under the issues section?
[21:26] cremes correct
[21:26] ModusPwnens Okay. Sorry about that..
[21:27] cremes we all had to learn it at some point, so don't worry about it
[22:31] ModusPwnens ok, so it still crashes
[22:31] ModusPwnens i will paste the output in a sec
[22:33] ModusPwnens
[22:33] ModusPwnens I am using windows 7 on both computers
[22:33] ModusPwnens The computer that generated the error is 64-bit as well
[22:34] ModusPwnens the computer running local is only 32-bit