[Time] Name | Message |
[00:03] ModusPwnens
|
Hi, I'm having an issue with the first byte of a message being corrupted when it is sent. Has anyone experienced this?
|
[00:05] bgranger
|
mcxx: You need to grab the stable 2.0.7 releases of pyzmq and zeromq.
|
[00:06] bgranger
|
Trunk of pyzmq works with stable 2.0.8 of zeromq.
|
[00:06] bgranger
|
We have not started following the post 2.0.8 master of zeromq
|
[00:06] ModusPwnens
|
clarification, it's actually the first two bytes that get corrupted
|
[02:40] lestrrat
|
I assume this is the right place to talk about zeromq?
|
[02:42] lestrrat
|
I just wanted to make sure whether zeromq's recv() is supposed to bail out properly (or not) when it receives a signal while it's waiting for a message.
|
[02:42] lestrrat
|
my perl binding seems to be stuck, and I'm trying to figure out if it's my problem or not.
|
[03:46] bgranger
|
lestrrat: I think I know what is going on.
|
[03:46] bgranger
|
Are you still around?
|
[03:47] lestrrat
|
yes!
|
[03:47] bgranger
|
Hey, we have the same issue in the Python bindings.
|
[03:47] bgranger
|
If perl is like Python, then it sets its own signal handlers.
|
[03:47] lestrrat
|
ah, so glad I'm not alone on this :)
|
[03:47] lestrrat
|
right
|
[03:47] bgranger
|
BUT, while a blocking recv is waiting, perl is completely out of control.
|
[03:48] AndrewBC
|
I wrapped mine in a try except in python, excepting a KeyboardInterrupt
|
[03:48] AndrewBC
|
worked fine
|
[03:48] bgranger
|
So what happens is that Perl's signal handler will receive the signal, but it won't be handled until the recv returns. To test whether this is really what is going on, try to create a simple example, then send the signal and send an additional msg to the socket so that recv returns.
|
[03:48] AndrewBC
|
so you might have to manually catch ^C if that's possible in perl
|
[03:49] bgranger
|
AndrewBC, it won't work if there is not a msg recv'd
|
[03:49] bgranger
|
Or I should say, the KeyboardInt won't be raised until recv returns.
|
[03:49] lestrrat
|
bgranger: ah right
|
[03:49] AndrewBC
|
oh, hm
|
[03:49] bgranger
|
You can't really interrupt it with SIGINT
|
[03:50] lestrrat
|
I can go and confirm, but if that's the case, it's really zeromq's responsibility to return on a signal, no? (with errno = EINTR or whatever)
|
[03:50] bgranger
|
But the same is true when Python calls *any* external C/C++ code.
|
[03:50] lestrrat
|
right
|
[03:50] bgranger
|
No, because Python's signal handler is still in place, it just doesn't get to run.
|
[03:51] lestrrat
|
yeah, but we can propagate the signal as long as we get back control
|
[03:51] bgranger
|
But you can't get back control
|
[03:52] lestrrat
|
hmm?
|
[03:52] bgranger
|
The signal handler is the one at the C level, not the one that you can control using the signal module
|
[03:52] lestrrat
|
yeah, but I'm writing the Perl<->C binding, so I have control :)
|
[03:53] bgranger
|
To change the behavior you actually have to use signal.h and setjmp/longjmp logic.
|
[03:53] bgranger
|
But it can be done
|
[03:54] bgranger
|
It is just subtle logic and hard to get right in a cross platform manner.
|
[03:54] bgranger
|
I don't think any of the language bindings are doing this yet, but eventually all should
|
[03:54] lestrrat
|
I dunno the python internals, but I don't think we need a jump logic for perl
|
[03:54] bgranger
|
Meanwhile, we are just not using blocking recv when it needs to be interrupted
|
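The workaround bgranger mentions, avoiding an indefinitely blocking recv, can be sketched as a short polling loop. This is only an illustration built on plain standard-library sockets and `select`, not on pyzmq or 0mq; `recv_with_timeout`, `should_stop`, and `poll_interval` are hypothetical names chosen for the example. Because each wait is short, control returns to the interpreter regularly, so a pending signal handler gets a chance to run.

```python
import select
import socket


def recv_with_timeout(sock, should_stop, poll_interval=0.1, bufsize=4096):
    """Wait for data in short slices instead of one indefinite blocking recv.

    Returns the received bytes, or None once should_stop() reports True.
    """
    while not should_stop():
        # Wait at most poll_interval seconds for the socket to become readable.
        readable, _, _ = select.select([sock], [], [], poll_interval)
        if readable:
            return sock.recv(bufsize)
    return None
```

The price of this approach is latency of up to one `poll_interval` before an interruption is noticed.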
[03:55] bgranger
|
If perl sets its own signal handlers at the C level, you likely will.
|
[03:55] lestrrat
|
perl's signal handling doesn't run on "real time". it's deferred until the next perl op. so as long as I'm in C land, I can tell the Perl interpreter that there was a signal, and properly emulate it.
|
[03:56] bgranger
|
That is how Python works as well and it is why the default signal handler doesn't actually interrupt "in real time".
|
[03:56] lestrrat
|
then I don't see why you need a jump...? what am I missing?
|
[03:57] bgranger
|
I don't remember the details honestly. I think the issue is that you have to be very careful about restoring Python's regular signal handler and putting the interpreter back in the right place in the stack.
|
[03:57] lestrrat
|
hmm.
|
[03:58] bgranger
|
That is, if you want control to return to Perl/Python in a way that doesn't simply kill the process.
|
[03:58] lestrrat
|
actually, I don't really care either way -- I don't have to have zmq_recv to get back to the Perl handler -- I just need it to get back to me once a signal is sent w/o killing the process
|
[03:58] lestrrat
|
otherwise servers hang.
|
[03:59] bgranger
|
yep
|
[03:59] lestrrat
|
glad to know I'm not the only one. I can safely write an RFE :)
|
[04:43] sustrik
|
bgranger, lestrrat: i am still not sure about EINTR behaviour
|
[04:43] sustrik
|
would returning EINTR from blocking call help in any way?
|
[04:47] bgranger
|
In what situations?
|
[04:49] bgranger
|
Are you thinking of installing a signal handler that catches SIGINT and translates that into EINTR?
|
[04:49] bgranger
|
That would help us, but that signal handler would have to play nice with that of Python
|
[04:52] sustrik
|
bgranger: no, i meant the peculiar functionality of 0mq
|
[04:52] sustrik
|
that catches OS's EINTR
|
[04:52] sustrik
|
and instead of returning it to the user
|
[04:52] sustrik
|
just restarts the blocking operation
|
[04:53] bgranger
|
I am not familiar with what zmq is doing right now
|
[04:53] sustrik
|
but afaiu that's not the problem you are dealing with
|
[04:53] bgranger
|
When does the OS signal EINTR?
|
[04:53] bgranger
|
And how does zeromq currently handle it?
|
[04:54] sustrik
|
if a thread gets a signal while in a blocking operation
|
[04:54] sustrik
|
0mq currently ignores it, i.e. restarts the blocking op
|
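The behaviour sustrik describes, and the alternative of handing the interruption back to the caller, can be sketched as two wrapper policies. This is a simulation only: `blocking_call` and `make_flaky_call` are hypothetical stand-ins that raise `InterruptedError` (Python's exception for `EINTR`), and none of it is libzmq code.

```python
import errno


def recv_restarting(blocking_call):
    """Swallow EINTR and restart the blocking operation."""
    while True:
        try:
            return blocking_call()
        except InterruptedError:
            continue  # interrupted by a signal: quietly try again


def recv_interruptible(blocking_call):
    """Let EINTR reach the caller so it can decide what to do."""
    return blocking_call()


def make_flaky_call(interruptions, result):
    """Hypothetical blocking call that is interrupted a few times before succeeding."""
    remaining = [interruptions]

    def call():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return result

    return call
```

With the restarting policy the caller never learns that a signal arrived, which matches the hang seen from the bindings when no further message comes in.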
[04:56] bgranger
|
I guess what I don't know is how all of this interplays with Python's signal handling.
|
[04:56] sustrik
|
exactly
|
[04:56] bgranger
|
Python only receives signals in the main thread
|
[04:56] bgranger
|
Does zeromq install any signal handlers?
|
[04:56] sustrik
|
that's the one that's blocked in 0mq?
|
[04:57] sustrik
|
no, it does not
|
[04:57] bgranger
|
OK
|
[04:57] bgranger
|
but when a socket call returns EINTR, it just continues
|
[04:57] sustrik
|
right
|
[04:57] sustrik
|
actually, you can check what's happening yourself
|
[04:57] bgranger
|
OK, then another ?
|
[04:57] bgranger
|
OK
|
[04:57] bgranger
|
How...
|
[04:57] sustrik
|
when waiting for a message
|
[04:57] sustrik
|
0mq is stuck in a read call
|
[04:58] bgranger
|
Correct
|
[04:58] sustrik
|
let me find the exact code...
|
[04:58] bgranger
|
OK great, that would be helpful.
|
[04:59] bgranger
|
Are there other occasions where EINTR is returned?
|
[05:00] sustrik
|
polling
|
[05:00] sustrik
|
are you on linux?
|
[05:00] lestrrat
|
sorry was away
|
[05:00] bgranger
|
linux and Mac
|
[05:00] sustrik
|
hi
|
[05:00] bgranger
|
Talking about signal handling more...
|
[05:00] sustrik
|
trunk or 2.0.8?
|
[05:00] bgranger
|
pyzmq doesn't work with trunk yet, so 2.0.8 or 2.0.7
|
[05:00] lestrrat
|
I'm using 2.0.8
|
[05:01] sustrik
|
check this: http://github.com/zeromq/zeromq2/blob/v2.0.8/src/signaler.cpp#L269
|
[05:01] bgranger
|
but if we can do signal -> EINTR -> Python bindings -> raise KeyboardInterrupt that would be great
|
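The chain bgranger has in mind (signal, then EINTR, then the binding raising KeyboardInterrupt) might look roughly like this on the binding side. `check_rc` is a hypothetical helper, not part of pyzmq; it assumes the underlying C call reports failure as a return value of -1 together with an errno value.

```python
import errno
import os


def check_rc(rc, err):
    """Translate a C-style (return code, errno) pair into Python control flow."""
    if rc != -1:
        return rc
    if err == errno.EINTR:
        # Assumption for this sketch: an interrupted call is surfaced as Ctrl+C.
        raise KeyboardInterrupt
    raise OSError(err, os.strerror(err))
```

This only works if the library actually hands EINTR back to the binding instead of restarting the call internally.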
[05:02] sustrik
|
yes
|
[05:02] sustrik
|
you see the loop, right?
|
[05:02] lestrrat
|
right
|
[05:02] bgranger
|
Yep
|
[05:02] sustrik
|
just put a printf there or something
|
[05:02] sustrik
|
and check whether it loops in case of SIGINT
|
[05:02] bgranger
|
OK I will do that
|
[05:02] bgranger
|
I will check on Linux and OS X
|
[05:02] sustrik
|
if so we can return EINTR instead of looping
|
[05:03] sustrik
|
great
|
[05:03] sustrik
|
thanks
|
[05:03] bgranger
|
Cool
|
[05:03] bgranger
|
Thanks!
|
[05:03] bgranger
|
Won't get to it tonight though
|
[05:03] bgranger
|
Getting late here...
|
[05:03] sustrik
|
sure
|
[05:03] sustrik
|
lestrrat: you may try the same, if in hurry
|
[05:03] lestrrat
|
yeah, I'm on $day_job too, so I'm not going to be able to check it right away. Will see if I can sneak that in in the next couple of hours
|
[05:04] bgranger
|
If I want to try the change, I can just have the function return EINTR?
|
[05:04] lestrrat
|
yep, will do when pointy haired boss isn't looking
|
[05:04] sustrik
|
:)
|
[05:04] lestrrat
|
AFK for now
|
[05:04] bgranger
|
Will the return value of that function be returned to the caller
|
[05:04] sustrik
|
bgranger: probably
|
[05:04] sustrik
|
let me check
|
[05:04] bgranger
|
Well, it looks like it must return a bool though
|
[05:05] bgranger
|
May have to change more logic.
|
[05:05] sustrik
|
ah, ok
|
[05:05] sustrik
|
it's not propagated
|
[05:05] sustrik
|
but i can fix that
|
[05:05] sustrik
|
the question is whether it would help
|
[05:05] bgranger
|
OK, let's test first
|
[05:05] bgranger
|
right
|
[05:09] sustrik
|
lestrrat: are you in japan? i've noticed some of your twitter notes
|
[05:09] sustrik
|
had no idea what they were about though :)
|
[05:10] guido_g
|
hi all
|
[05:10] guido_g
|
good morning sustrik
|
[05:10] sustrik
|
morning
|
[05:11] guido_g
|
did you read that the poll segfault in master is still there?
|
[05:11] sustrik
|
yep, seen you saying it yesterday
|
[05:11] sustrik
|
can i reproduce it?
|
[05:11] guido_g
|
i hope so
|
[05:11] sustrik
|
which test program?
|
[05:12] guido_g
|
code is at http://github.com/guidog/cpp/tree/master/zmqcpp/
|
[05:12] guido_g
|
zmqcpp and sender
|
[05:12] guido_g
|
zmqcpp segfaults in poll
|
[05:13] sustrik
|
what order to start them in?
|
[05:13] guido_g
|
result run is http://gist.github.com/560863
|
[05:13] guido_g
|
i started zmqcpp first and then sender
|
[05:13] guido_g
|
sender is a one shot thing
|
[05:13] sustrik
|
ok
|
[05:14] guido_g
|
going to write a ømq test program for that
|
[05:14] sustrik
|
ack
|
[05:14] guido_g
|
seems to be a nasty corner that needs some regression
|
[05:22] sustrik
|
guido_g:
|
[05:22] sustrik
|
In file included from zmqcpp.cpp:9:
|
[05:22] sustrik
|
worker.hpp:14:21: error: pqxx/pqxx: No such file or directory
|
[05:22] sustrik
|
...
|
[05:23] guido_g
|
oh shit
|
[05:23] guido_g
|
sorry
|
[05:23] sustrik
|
I'll rather wait for your test program
|
[05:23] guido_g
|
ack
|
[05:23] guido_g
|
never mind then
|
[05:23] guido_g
|
"i'll be back!" :)
|
[05:24] sustrik
|
;)
|
[05:24] guido_g
|
so, day job is waiting, cu
|
[05:31] lestrrat
|
sustrik: yes I'm in Japan :)
|
[05:32] sustrik
|
lot of japanese tweets about 0mq lately
|
[05:32] sustrik
|
it's fun
|
[05:33] lestrrat
|
let's see if I can work on that signal thing...
|
[05:34] sustrik
|
ok
|
[05:40] lestrrat
|
alright, so I'm a autoconf noob. how do I make this go away?:
|
[05:41] sustrik
|
you need newer version of autoconf
|
[05:41] sustrik
|
are you on OSX?
|
[05:42] sustrik
|
this one would deserve going into FAQ btw
|
[05:43] lestrrat
|
yeah, I'm on OSX
|
[05:43] sustrik
|
it's a known issue
|
[05:43] lestrrat
|
grrr
|
[05:43] sustrik
|
osx has old version of autoconf installed by default
|
[05:44] lestrrat
|
grr, and it's not there in homebrew
|
[05:47] lestrrat
|
man, this is going to take some time :)
|
[05:47] sustrik
|
lestrrat: sorry, it's pkg-config
|
[05:47] sustrik
|
http://www.zeromq.org/docs:procedures#toc2
|
[05:47] lestrrat
|
oh cool
|
[05:48] lestrrat
|
hmm, I seem to have the latest pkg-config
|
[05:49] sustrik
|
are you building with --with-pgm btw?
|
[05:50] lestrrat
|
nay, just the default with a custom --prefix
|
[05:50] sustrik
|
hm
|
[05:50] sustrik
|
well, you'll have to ask on the mailing list
|
[05:50] sustrik
|
as i say, i'm an autotools idiot
|
[05:53] lestrrat
|
trying out autoconf anyways
|
[06:09] CIA-20
|
zeromq2: Martin Sustrik maint * rd5b6f68 / AUTHORS : Mikael Kjaer added to AUTHORS - http://bit.ly/aI6jV2
|
[06:09] CIA-20
|
zeromq2: Bernd Melchers maint * r8ec0743 / (AUTHORS src/signaler.cpp): Fix for signaler_t on HP-UX and AIX platforms - http://bit.ly/9XM0R0
|
[06:09] CIA-20
|
zeromq2: Jon Dyte maint * r14853c2 / (src/prefix_tree.cpp src/prefix_tree.hpp):
|
[06:09] CIA-20
|
zeromq2: Prior to this patch prefix_tree asserts.
|
[06:09] CIA-20
|
zeromq2: This is because as it adds the 255th element at a node it attempts to calculate
|
[06:09] CIA-20
|
zeromq2: the count member var which is an unsigned char via count = (255 -0) + 1; and
|
[06:09] CIA-20
|
zeromq2: pass the result to realloc. Unfortunately the result is zero and realloc returns
|
[06:09] CIA-20
|
zeromq2: null; the prefix_tree asserts. I have fixed it by making the count an unsigned
|
[06:09] CIA-20
|
zeromq2: short. - http://bit.ly/9eKDIf
|
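The overflow described in this commit message is easy to reproduce. The sketch below uses `ctypes` integer types as stand-ins for the C `unsigned char` and `unsigned short` counters; `node_count` is a hypothetical helper name, not a function from the zeromq sources.

```python
import ctypes


def node_count(min_char, max_char, counter_type):
    """Compute (max - min) + 1 and store it in a fixed-width counter."""
    return counter_type((max_char - min_char) + 1).value
```

Calling `node_count(0, 255, ctypes.c_ubyte)` wraps around to 0, the value that ended up being passed to realloc, while `node_count(0, 255, ctypes.c_ushort)` holds the intended 256.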
[06:09] CIA-20
|
zeromq2: Martin Sustrik master * rd5b6f68 / AUTHORS : Mikael Kjaer added to AUTHORS - http://bit.ly/aI6jV2
|
[06:09] CIA-20
|
zeromq2: Bernd Melchers master * r8ec0743 / (AUTHORS src/signaler.cpp): Fix for signaler_t on HP-UX and AIX platforms - http://bit.ly/9XM0R0
|
[06:09] CIA-20
|
zeromq2: Jon Dyte master * r14853c2 / (src/prefix_tree.cpp src/prefix_tree.hpp):
|
[06:09] CIA-20
|
zeromq2: Prior to this patch prefix_tree asserts.
|
[06:09] CIA-20
|
zeromq2: This is because as it adds the 255th element at a node it attempts to calculate
|
[06:09] CIA-20
|
zeromq2: the count member var which is an unsigned char via count = (255 -0) + 1; and
|
[06:09] CIA-20
|
zeromq2: pass the result to realloc. Unfortunately the result is zero and realloc returns
|
[06:09] CIA-20
|
zeromq2: null; the prefix_tree asserts. I have fixed it by making the count an unsigned
|
[06:09] CIA-20
|
zeromq2: short. - http://bit.ly/9eKDIf
|
[06:10] CIA-20
|
zeromq2: Martin Sustrik master * r0a1f7e3 / (AUTHORS src/signaler.cpp src/trie.cpp src/trie.hpp): (log message trimmed)
|
[06:10] CIA-20
|
zeromq2: Merge branch 'maint'
|
[06:10] CIA-20
|
zeromq2: * maint:
|
[06:10] CIA-20
|
zeromq2: Prior to this patch prefix_tree asserts.
|
[06:10] CIA-20
|
zeromq2: Fix for signaler_t on HP-UX and AIX platforms
|
[06:10] CIA-20
|
zeromq2: Mikael Kjaer added to AUTHORS
|
[06:10] CIA-20
|
zeromq2: Conflicts:
|
[06:35] lestrrat
|
hmm, zeromq HEAD against my perl binding doesn't work -- (seeing some tests block on odd places)
|
[06:36] lestrrat
|
could be something on my part, of course.
|
[06:36] lestrrat
|
investigating...
|
[06:52] lestrrat
|
hmm, switching back to 2.0.8 makes the problem go away. I see all my destructors running, so I'm willing to bet there's some sort of background thread or something that's kept alive in zeromq
|
[06:54] guido_g
|
re
|
[06:58] sustrik
|
lestrrat: HEAD is a dev version, testing with 2.0.8 is ok
|
[06:59] lestrrat
|
k
|
[06:59] sustrik
|
what about the SIGINT thing?
|
[06:59] lestrrat
|
trying now
|
[07:02] lestrrat
|
so yeah, looks like it's stuck in that loop
|
[07:02] lestrrat
|
tweaking to see if I can cleanly exit there...
|
[07:11] sustrik
|
lestrrat: i'm going to be away now for a while, but leave me a message here about whether returning EINTR from 0mq would help you
|
[07:11] lestrrat
|
yep, thanks
|
[07:23] lestrrat
|
sustrik: http://gist.github.com/562000
|
[07:23] lestrrat
|
this works for me!
|
[08:45] sustrik
|
lestrrat: great
|
[08:45] sustrik
|
i'll fix it in a bit different way (actually returning EINTR)
|
[08:46] sustrik
|
but it's good news to know it solves the Ctrl+C problem
|
[08:46] sustrik
|
we'll see whether it helps with python as well
|
[09:52] mato
|
sustrik: are you there?
|
[10:00] sssss
|
test
|
[10:06] sustrik
|
mato: re
|
[10:14] mato
|
sustrik: could you switch on all the test servers + switch please?
|
[10:15] mato
|
sustrik: actually, just one of the 8-core boxes is all i need right now
|
[10:15] mato
|
sustrik: and the two white ones
|
[10:15] mato
|
sustrik: so pick the quieter of janos or csaba :) i don't care...
|
[10:16] sustrik
|
oh my
|
[10:16] sustrik
|
ok
|
[10:16] mato
|
:)
|
[10:16] mato
|
loud music also works :)
|
[10:18] sustrik
|
started
|
[10:18] mato
|
damn
|
[10:18] sustrik
|
now i realise there's just one of those big boxes here
|
[10:18] sustrik
|
where's the other one?
|
[10:18] pieterh
|
hey, the lights just dimmed here...!
|
[10:18] sustrik
|
still at malosek's place?
|
[10:18] mato
|
sustrik: just one? i dunno? does malo still have the other?
|
[10:19] mato
|
sustrik: i can't connect to your IP at all :(
|
[10:19] mato
|
hmm
|
[10:19] sustrik
|
i see the world outside
|
[10:19] mato
|
i have a suspicion your chello connection is funny
|
[10:20] sustrik
|
?
|
[10:20] mato
|
well, first of all, your ip is on some weird subnet
|
[10:20] mato
|
which i've never seen
|
[10:20] mato
|
2nd, i can't ping or connect to it :(
|
[10:21] sustrik
|
:(
|
[10:21] sustrik
|
you can take the boxes
|
[10:44] guido_g
|
openvpn is really nice for this kind of things
|
[10:45] sustrik
|
it works now (not sure why)
|
[11:26] mato
|
wow
|
[11:26] mato
|
sustrik: quick question...
|
[11:26] mato
|
sustrik: zmq_free_fn() is called in the context of the *application* or *i/o* thread?
|
[11:27] sustrik
|
i/o
|
[11:27] mato
|
riiiiiight
|
[11:27] mato
|
that explains a lot
|
[11:27] sustrik
|
?
|
[11:28] mato
|
i'm having issues with the python GIL
|
[11:28] mato
|
at least i think that's what happening
|
[11:28] sustrik
|
a-ha, i see
|
[11:28] mato
|
it explains all the mysterious hanging close() with 2.0.x
|
[11:28] sustrik
|
python zero-copy functionality
|
[11:28] mato
|
ja
|
[11:28] mato
|
anyway, i'm going to lunch, we can discuss when i get back...
|
[11:29] sustrik
|
ok
|
[11:47] sustrik
|
thanks
|
[11:47] sustrik
|
have you thought of a pub to meet the other guys in?
|
[11:47] sustrik
|
it's only 3 people so far
|
[11:47] sustrik
|
so no need for complex arrangements
|
[11:48] sustrik
|
say it'll be 5-7 ultimately
|
[11:48] sustrik
|
oops :)
|
[11:48] sustrik
|
that was intended for another window
|
[11:49] sustrik
|
the london meetup stuff
|
[12:16] mikko
|
sustrik: theodore bullfrog is a decent place
|
[12:17] mikko
|
http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=theodore+bullfrog&sll=51.516293,-0.129493&sspn=0.007317,0.022724&ie=UTF8&hq=theodore+bullfrog&hnear=&ll=51.508843,-0.123961&spn=0.003659,0.011362&z=17&iwloc=A
|
[12:18] sustrik
|
looks central enough
|
[12:20] sustrik
|
is that restaurant or a pub?
|
[12:24] sustrik
|
mikko: hullo!
|
[12:24] mikko
|
pub
|
[12:24] mikko
|
but they serve food as well
|
[12:25] sustrik
|
ok, why not then
|
[12:26] mikko
|
http://www.theodore-bullfrog.co.uk/
|
[12:26] sustrik
|
nice
|
[12:36] mato
|
sustrik: ok, so this GIL thing
|
[12:36] mato
|
sustrik: it's an interesting problem
|
[12:37] sustrik
|
yes?
|
[12:37] mato
|
what i see is that in my python application zmq_close() is getting called, due to Python gc kicking in
|
[12:37] mato
|
however, *at the same time*, zmq_msg_close() on a message shared with Python is called in the i/o thread
|
[12:38] mato
|
now, zmq_close() calls process_commands() which is waiting for a signal from the i/o thread
|
[12:39] mato
|
but that signal never arrives, because the free_fn() called by the zmq_msg_close() in the i/o thread is trying to acquire the Python GIL
|
[12:39] mato
|
-> *deadlock*
|
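The cycle mato describes can be reproduced in miniature with two threads and one lock. This is a simulation under stated assumptions, not pyzmq or libzmq code: `interpreter_lock` stands in for the Python GIL, `io_signal` for the signal that zmq_close() waits on, and the timeouts exist only so that the sketch terminates instead of hanging. The `app_keeps_lock_while_waiting=False` branch is there purely to show that the cycle disappears when the lock is not held during the wait; the discussion itself leaves the actual solution open.

```python
import threading


def simulate_close(app_keeps_lock_while_waiting, timeout=0.5):
    """Return True if the 'I/O thread' manages to signal the 'application thread'."""
    interpreter_lock = threading.Lock()  # stand-in for the Python GIL
    io_signal = threading.Event()        # stand-in for the signal zmq_close() waits for
    app_has_lock = threading.Event()

    def io_thread():
        app_has_lock.wait()
        # Stand-in for free_fn(): it needs the interpreter lock before it can finish.
        if interpreter_lock.acquire(timeout=timeout):
            interpreter_lock.release()
            io_signal.set()

    worker = threading.Thread(target=io_thread)
    worker.start()

    interpreter_lock.acquire()           # application thread is running Python code
    app_has_lock.set()
    if app_keeps_lock_while_waiting:
        signalled = io_signal.wait(timeout)  # waits while still holding the lock
        interpreter_lock.release()
    else:
        interpreter_lock.release()
        signalled = io_signal.wait(timeout)
    worker.join()
    return signalled
```

With `app_keeps_lock_while_waiting=True` the wait can only end by timing out, which is the simulated equivalent of the hanging close().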
[12:39] mato
|
you follow me?
|
[12:40] mato
|
anyway, i can work around it now by not closing yet more sockets :)
|
[12:40] mato
|
but it's an interesting case
|
[12:40] mato
|
i will have to make a test case and write it up on the mailing list so that other people can participate
|
[12:40] mato
|
it's not clear what the solution is
|
[12:40] sustrik
|
i have to draw it down :)
|
[12:41] sustrik
|
mato: yes, please
|
[12:41] sustrik
|
so that it's not lost
|
[12:41] mato
|
definitely
|
[12:41] mato
|
it needs to be solved, and i suspect the same problem can occur in 2.1
|
[12:41] mato
|
however, it's possible the solution needs to be done on the pyzmq side
|
[12:43] lestrrat
|
sustrik: wrt EINTR thing, what version should I be expecting it to be implemented? just want to know, cause I'd like to wait my first release of mongrel2 handler until that's fixed.
|
[12:43] lestrrat
|
don't want to tell my users to send a SIGKILL to stop their handlers :)
|
[12:43] lestrrat
|
(not in a hurry, just want to get an idea)
|
[12:44] sustrik
|
lestrrat: the problem is that it isn't backward compatible
|
[12:44] sustrik
|
:|
|
[12:44] lestrrat
|
ah
|
[12:44] mato
|
EINTR *again* ?
|
[12:44] lestrrat
|
yeah, true
|
[12:44] sustrik
|
yes
|
[12:44] sustrik
|
it turns out that Ctrl+C problem in bindings
|
[12:44] sustrik
|
can be solved by returning EINTR
|
[12:44] mato
|
sustrik: have you verified this?
|
[12:45] sustrik
|
lestrrat did with perl binding
|
[12:45] mato
|
hmm, i think i know why it "solves" the problem
|
[12:45] mato
|
or i can guess...
|
[12:45] sustrik
|
i'm still waiting for brian to verify it with python
|
[12:45] lestrrat
|
I'm willing to accept I'm a bit off :)
|
[12:45] sustrik
|
mato: go on
|
[12:46] mato
|
i reckon what happens is returning EINTR means "the language runtime" wakes up
|
[12:46] sustrik
|
right
|
[12:46] mato
|
depending on said language runtime's signal handling (out of our control, obviously), the runtime waking up *may* mean that it processes pending signals
|
[12:46] mato
|
hence the "solution"
|
[12:47] sustrik
|
well, at least the runtime has a chance to process the signal
|
[12:47] mato
|
hmm
|
[12:47] sustrik
|
right now it's completely stuck
|
[12:47] mato
|
yeah
|
[12:47] mato
|
of course the runtime might be doing something entirely different
|
[12:47] mato
|
so it's not a guaranteed solution
|
[12:47] sustrik
|
not our fault
|
[12:47] mato
|
just one that might help some people
|
[12:48] mato
|
hmm, except...
|
[12:48] sustrik
|
obviously we cannot give guarantees for broken runtimes
|
[12:48] mato
|
yes, but what is your proposed implementation?
|
[12:48] sustrik
|
forward EINTR to the user
|
[12:48] mato
|
yes, except...
|
[12:49] mato
|
the relationship between EINTR from a syscall and SIGINT is afaik nondeterministic
|
[12:49] mato
|
in other words, you might get an EINTR back from a blocking call
|
[12:49] mato
|
or not
|
[12:49] mato
|
you see where this is heading?
|
[12:49] lestrrat
|
should I explain how perl's signal handling works? if you guys know what's up, I'm not entirely an expert either so I'll let you guys debate the actual implementation ;)
|
[12:49] mato
|
lestrrat: sure, go for it?
|
[12:49] mato
|
.
|
[12:50] lestrrat
|
well, the important part wrt to all of this is that, when we go down to the C world, Perl's signal handling is effectively silenced.
|
[12:51] lestrrat
|
(C world, as in binding)
|
[12:51] lestrrat
|
when the binding method is called, things are completely left up to the C/C++ routine, so you can do whatever with the signals
|
[12:51] lestrrat
|
and when zmq's recv() is being called, perl has absolutely no say there
|
[12:52] lestrrat
|
so from perl's perspective, all it needs is for recv() to return, and let it know from a return value or errno that there was a signal sent somewhere along the line
|
[12:53] lestrrat
|
then (if need be) the Perl binding can tell the perl interpreter that there was a signal, and let perl's sig handlers run
|
[12:53] lestrrat
|
at least that's how I understand it to work.
|
[12:53] sustrik
|
right
|
[12:54] mato
|
ok
|
[12:54] mato
|
i've not looked into the perl signal handling
|
[12:54] sustrik
|
what i think mato is saying is that OS does not have to interrupt a blocking call (at C level) when SIGINT happens
|
[12:54] mato
|
yes
|
[12:54] lestrrat
|
right
|
[12:54] sustrik
|
how come that Ctrl+C works then
|
[12:54] sustrik
|
?
|
[12:54] sustrik
|
for C programs?
|
[12:55] sustrik
|
if main thread was stuck in a blocking call
|
[12:55] mato
|
ah, the thing is multiple threads are involved
|
[12:55] mato
|
this is what makes it tricky
|
[12:55] sustrik
|
but how does it work in C world?
|
[12:55] mato
|
ok, in the simple case
|
[12:55] mato
|
(1 thread)
|
[12:55] mato
|
the OS will invoke the default signal handler
|
[12:55] mato
|
which happens to be "exit the program"
|
[12:56] mato
|
the problem is, if you're handling signals
|
[12:56] sustrik
|
ah, so no SIGINT
|
[12:56] sustrik
|
just hard exit
|
[12:56] mato
|
yes, you get SIGINT
|
[12:56] mato
|
it's hard to explain...
|
[12:56] sustrik
|
try
|
[12:56] lestrrat
|
:)
|
[12:56] sustrik
|
ascii graphics!
|
[12:57] mikko
|
sustrik: it's possible that i might be in Amsterdam whole of next week
|
[12:57] mikko
|
they are still trying to figure out schedules
|
[12:57] sustrik
|
aha
|
[12:57] sustrik
|
ok, we'll see
|
[12:57] mato
|
ok, so, case 0: 1 thread, no signal handler: result: on ^C default "exit the program" handler is called by the OS
|
[12:58] mato
|
case 1: 1 thread, *and* a signal handler in that thread: result: on ^C, the thread's SIGINT handler is called and when that returns the blocking call it interrupted returns EINTR
|
[12:58] mato
|
it is now up to the program to resolve that EINTR
|
[12:58] mato
|
if it did not exit in the signal handler already, of course
|
[12:59] sustrik
|
ack
|
[12:59] mato
|
now, here's what i imagine most language runtimes do in a signal handler
|
[12:59] mato
|
and in fact most apps with old-style (no separate thread with sigwait()) signal handlers
|
[12:59] mato
|
they set some flag
|
[12:59] mato
|
that's all :)
|
[13:00] mato
|
hence, when the call returns EINTR, the app/Perl/whatever checks its "was i interrupted" flag
|
[13:00] mato
|
and if it was, deals with that in its normal control flow
|
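The "set a flag in the handler, act on it later" pattern mato describes can be sketched as follows. `InterruptFlag` and `serve` are hypothetical names for this illustration, and `handle_one` stands in for whatever work the application does per iteration; nothing here is taken from an actual language runtime.

```python
import signal


class InterruptFlag:
    """Records that a signal arrived; the handler itself does nothing else."""

    def __init__(self):
        self.is_set = False

    def handler(self, signum, frame):
        self.is_set = True


def serve(flag, handle_one, max_iterations=1000):
    """Run handle_one() repeatedly, checking the flag in normal control flow."""
    handled = 0
    for _ in range(max_iterations):
        if flag.is_set:
            break  # deal with the interruption here, outside the handler
        handle_one()
        handled += 1
    return handled
```

A caller would install the handler with `signal.signal(signal.SIGINT, flag.handler)` before entering `serve`. The pattern only helps if `handle_one()` returns from time to time; a call that blocks forever never gets back to the flag check.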
[13:00] lestrrat
|
right
|
[13:00] sustrik
|
ok
|
[13:00] mato
|
now, enter 0mq
|
[13:00] mato
|
the i/o threads ignore all signals
|
[13:01] sustrik
|
right
|
[13:01] mato
|
they may or may not get an EINTR if the *process* gets a signal
|
[13:01] mato
|
that is entirely OS-dependent
|
[13:01] mato
|
assuming for a moment that they did, the naive implementation would just return that EINTR, which would bubble up to the app thread logic and something useful might happen
|
[13:01] mato
|
however, that is a bad assumption
|
[13:02] mato
|
so, what to do?
|
[13:02] sustrik
|
wait a sec
|
[13:02] sustrik
|
are you saying that I/O threads may get the signal?
|
[13:02] mato
|
I/O threads will never "get" the signal
|
[13:02] sustrik
|
ok
|
[13:03] mato
|
but depending on your OS, they *may* get an EINTR back from any random syscall
|
[13:03] sustrik
|
ack
|
[13:03] mato
|
well, not any random, but anything that does something complex
|
[13:03] mato
|
i.e. not getpid() but send() is a candidate
|
[13:03] sustrik
|
ok, but that's beside the point
|
[13:03] mato
|
it's not
|
[13:03] sustrik
|
because we are interested in app threads
|
[13:03] sustrik
|
not i/o threads
|
[13:03] sustrik
|
whether i/o thread gets sigint or not, we don't care
|
[13:04] mato
|
yes we do, because if you were to be 100% compatible you'd have to "emulate" the behaviour i described in case 1
|
[13:04] mato
|
s/compatible/nice to broken apps/interpreters/
|
[13:04] sustrik
|
wait
|
[13:04] sustrik
|
i/o thread just loops
|
[13:04] lestrrat
|
isn't the recv() being called on the app thread?
|
[13:05] sustrik
|
so if it gets EINTR it can just ignore it
|
[13:05] mato
|
hmm, yes
|
[13:05] sustrik
|
lestrrat: that's my point
|
[13:05] mato
|
ok, right, so you want the app thread side of the API to pass EINTR
|
[13:05] sustrik
|
ack
|
[13:05] lestrrat
|
right
|
[13:06] mato
|
hmm
|
[13:06] mato
|
ok, one way to do it w/o breaking compatibility
|
[13:06] mato
|
with existing code
|
[13:06] mato
|
is a context option
|
[13:06] mato
|
"ZMQ_INTERRUPTIBLE"
|
[13:07] sustrik
|
ok, so it would work?
|
[13:07] mato
|
meaning "API calls will return EINTR if interrupted by a signal"
|
[13:07] mato
|
i think so
|
[13:07] sustrik
|
woohoo
|
[13:07] mato
|
note that there is no change on the i/o thread side
|
[13:07] lestrrat
|
coolness
|
[13:07] sustrik
|
or a compile time option
|
[13:07] mato
|
no, compile time is bad
|
[13:08] mato
|
libzmq.so may be shared
|
[13:08] lestrrat
|
I'd vote for ZMQ_INTERRUPTIBLE
|
[13:08] sustrik
|
there are no context-wide options
|
[13:08] mato
|
well, they'll have to be added then :)
|
[13:08] sustrik
|
but this is a hack
|
[13:08] mato
|
?
|
[13:08] sustrik
|
the right solution is to return EINTR
|
[13:09] sustrik
|
we need the option only to stay backward compatible with the original lousy solution
|
[13:09] mato
|
hmm hmm
|
[13:09] sustrik
|
i don't like changing API because of a hack
|
[13:10] mato
|
sustrik: just a minute, i'm still thinking
|
[13:10] lestrrat
|
I was wondering from the beginning, but does it have to "return" EINTR? is it not enough to keep errno = EINTR ?
|
[13:10] sustrik
|
errno = EINTR
|
[13:10] mato
|
lestrrat: yes, we mean return (-1) with errno = EINTR
|
[13:10] lestrrat
|
ah
|
[13:10] sustrik
|
but still it breaks the API specification
|
[13:10] sustrik
|
EINTR is not described as valid error from recv()
|
[13:11] mato
|
recv() or zmq_recv()?
|
[13:11] sustrik
|
zmq_recv(), sorry
|
[13:11] mato
|
so we'll have to change that, i think
|
[13:11] sustrik
|
people will hate us
|
[13:11] mato
|
then add ZMQ_INTERRUPTIBLE
|
[13:11] sustrik
|
:)
|
[13:12] lestrrat
|
hey, I'm willing to accept a zmq_recv2() ;P
|
[13:12] mato
|
the thing is, the model of never getting EINTR is actually right if you do your handling properly
|
[13:12] mato
|
kind of
|
[13:12] sustrik
|
shrug
|
[13:12] lestrrat
|
does that mean the "correct" way is for me to install a sighandler in my binding?
|
[13:13] mato
|
lestrrat: that probably won't help because of the way the interpreter is architected
|
[13:13] mato
|
lestrrat: but yes, that would be the correct way
|
[13:13] mato
|
in an ideal world :)
|
[13:13] sustrik
|
in broader sense i would say: let's keep with POSIX API
|
[13:13] sustrik
|
trying to outsmart it just causes problems
|
[13:13] lestrrat
|
oh yeah, recv is currently a loop, ain't it...
|
[13:13] mato
|
sustrik: well, what this requires then is...
|
[13:14] mato
|
sustrik: a nice thorough explanation by email
|
[13:14] mato
|
sustrik: combined with "sorry, we messed up"
|
[13:14] sustrik
|
we did
|
[13:14] mato
|
sustrik: and it'll just go into 2.1
|
[13:14] mato
|
eventually, not immediately
|
[13:14] mato
|
after actual verification with at least say Perl, Python, Ruby that it does *solve* the problem
|
[13:15] sustrik
|
actually, it was brian granger who asked for backward compatibility guarantees :)
|
[13:15] lestrrat
|
would it be acceptable to add a different, aptly named function?
|
[13:15] lestrrat
|
zmq_recv_intr() or whatever.
|
[13:15] mato
|
lestrrat: no, because it involves ALL API calls that can block
|
[13:15] lestrrat
|
hmm
|
[13:16] mato
|
it is actually a mistake on our part, but i didn't realise until today why it was a mistake
|
[13:16] mato
|
well, we could be anal and say "make your signal handling work like this", but that's impossible in the real world
|
[13:17] sustrik
|
ok, let me write an email describing the problem
|
[13:17] mato
|
do you understand it well enough?
|
[13:17] mato
|
this email must not be compressed :-)
|
[13:17] sustrik
|
and asking whether breaking backward compatibility is acceptable in this case
|
[13:17] sustrik
|
no
|
[13:17] sustrik
|
you can write it
|
[13:17] sustrik
|
but it involves saying "i am an idiot, sorry"
|
[13:17] mato
|
i'd like to verify that it actually helps
|
[13:18] mato
|
yes, true, you were the one that defined the behaviour
|
[13:18] mato
|
ok, look, i'll write the text describing the problem
|
[13:18] mato
|
you can send it out, adding "I'm an idiot"
|
[13:18] mato
|
ok? :)
|
[13:18] sustrik
|
:)
|
[13:18] sustrik
|
ok, let's first check whether it helps
|
[13:18] mato
|
but before you publicly denounce yourself, it would be nice to check first
|
[13:18] mato
|
precisely
|
[13:18] mato
|
sustrik: i would suggest being very pedantic about this
|
[13:19] sustrik
|
if it turns out that it does not, i don't have to call myself an idiot
|
[13:19] mato
|
i.e. make three test cases (Perl, Python, Ruby)
|
[13:19] sustrik
|
ok, i can create a topic branch
|
[13:19] mato
|
ensure they hang currently
|
[13:19] sustrik
|
fix it there
|
[13:19] mato
|
i.e. ^C doesn't work
|
[13:19] mato
|
then make your change
|
[13:19] mato
|
and ensure that all works as expected
|
[13:19] mato
|
sustrik: yes, topic branch, involves doc changes and so on
|
[13:19] mato
|
good idea
|
[13:20] sustrik
|
ok, let me do it
|
[13:20] lestrrat
|
let me know when I can test it :)
|
[13:20] mato
|
sustrik: i will find the magic command for you to email the topic branch patch set around
|
[13:20] sustrik
|
lestrrat: i'll ping you
|
[13:20] sustrik
|
i'll ping brian as well
|
[13:20] mato
|
sustrik: so that you can give it to e.g. lestrrat
|
[13:20] mato
|
ja
|
[13:20] sustrik
|
he's willing to test it, we've discussed it in the morning
|
[13:20] mato
|
just email the patches around privately if you don't want to call yourself a potential idiot just yet :)
|
[13:20] sustrik
|
not sure about ruby
|
[13:20] mato
|
(in public) :)
|
[13:21] sustrik
|
cremes: are you here?
|
[13:21] lestrrat
|
+1 for branch (just easier to pull ;)
|
[13:22] sustrik
|
lestrrat: what's your email?
|
[13:22] sustrik
|
how should i ping you?
|
[13:22] lestrrat
|
lestrrat@gmail.com
|
[13:22] sustrik
|
thx
|
[13:22] lestrrat
|
lestrrat @ github, lestrrat @ twitter
|
[13:22] sustrik
|
ok
|
[13:22] mato
|
lestrrat: shhh... i'm slowly teaching sustrik git
|
[13:22] lestrrat
|
lol
|
[13:22] mato
|
start with local branches
|
[13:22] mato
|
:)
|
[13:24] sustrik
|
ok
|
[13:29] mato
|
ok, i have to concentrate on something else for a bit
|
[13:29] mato
|
ping me if you need me, bbl
|
[13:31] sustrik
|
cya
|
[13:53] cremes
|
sustrik: just got here; what do you need?
|
[13:54] sustrik
|
are you seeing problem with Ctrl+C in Ruby?
|
[13:54] sustrik
|
i mean, application not responding to SIGINT?
|
[13:55] sustrik
|
when stuck in 0mq blocking call?
|
[13:55] cremes
|
yes
|
[13:55] sustrik
|
aha, good
|
[13:55] sustrik
|
we think we've found a solution
|
[13:55] cremes
|
i don't think the ruby signal handler runs when external C code is executing
|
[13:55] cremes
|
really? that's good
|
[13:55] sustrik
|
would you be willing to test it once i have a fix?
|
[13:56] sustrik
|
the idea is that the blocking calls would return EINTR in case of Ctrl+C
|
[13:56] sustrik
|
then the binding can take care of what happens next
|
[13:57] cremes
|
sure
|
[13:57] sustrik
|
great, i'll ping you once i have it
|
[13:57] cremes
|
ok
|
[14:04] mrm2m
|
Hi, I'm very confused at the moment: That works: http://paste.pocoo.org/show/257618/
|
[14:05] mrm2m
|
That doesn't work: http://paste.pocoo.org/show/257617/
|
[14:05] mrm2m
|
ignore those random and threading they are not used.
|
[14:06] mrm2m
|
the "thread started" shows up, but nothing is sent to the corresponding server.
|
[14:09] sustrik
|
mrm2m: the application exits after sending the message?
|
[14:10] mrm2m
|
right
|
[14:10] sustrik
|
in 2.0.8 all the unsent data are discarded when socket is closed
|
[14:10] mrm2m
|
Uh - ok
|
[14:11] sustrik
|
the semantics is changed in 2.1
|
[14:11] sustrik
|
it's: block context termination until all data are sent
|
[14:12] guido_g
|
so you can't exit if something is broken?
|
[14:12] sustrik
|
it's annoying, i know
|
[14:13] sustrik
|
what i want to do is to add SO_LINGER socket option
|
[14:13] lestrrat
|
would be nice if it was configurable
|
[14:13] guido_g
|
ack
|
[14:13] sustrik
|
with same semantics as with POSIX sockets
|
[14:13] sustrik
|
that should do imo
|
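For reference, the POSIX SO_LINGER semantics mentioned here can be sketched on an ordinary TCP socket. This is a plain BSD-socket illustration, not a 0MQ API (at this point the 0MQ option is only a proposal). The helper name `set_linger` is made up for the example, and the two-int `struct linger` layout is an assumption that holds on Linux and most Unix systems.

```python
import socket
import struct


def set_linger(sock, seconds):
    """Enable SO_LINGER so close() may block up to `seconds` flushing unsent data.

    Assumes struct linger is two C ints (l_onoff, l_linger), as on Linux.
    Returns the (l_onoff, l_linger) pair read back from the socket.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, seconds))
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8)
    return struct.unpack("ii", raw)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        onoff, linger = set_linger(s, 5)
        print("linger enabled:", onoff != 0, "timeout:", linger)
    finally:
        s.close()
```

With linger enabled, closing a connected socket waits for queued data to be sent until the timeout expires, which bounds how long shutdown can block.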
[15:25] gavinstark
|
I am following the PUB/SUB example in the 'guide' but I'm wondering how I might setup a way for there to be multiple publishers that each subscriber can receive from. Or do I need independent bind/connects for that?
|
[15:27] mikko
|
gavinstark: the latter
|
[15:34] cremes
|
gavinstark: you also need a forwarder device to aggregate all of the publishers' messages
|
[15:35] cremes
|
(not strictly true, but i think it's the cleanest way to set things up)
|
[15:42] gavinstark
|
Do I have to pre-list all the publishers in the forwarder config? It seems like I'd have to have multiple <in> with unique values for each publisher that might appear? Or am I missing something?
|
[15:55] sustrik
|
gavinstark: bind your forwarder devices
|
[15:55] sustrik
|
connect the publishers and subscribers
|
[15:57] gavinstark
|
sustrik, what if I do not know the qty of publishers before hand? Won't each publisher have to bind uniquely? (tcp://....:5555, tcp://....:5556, etc.?)
|
[15:59] sustrik
|
publishers should _connect_ to the forwared
|
[15:59] sustrik
|
forwarder*
|
[16:06] gavinstark
|
sustrik, ah, ok. I just tried that, having the publisher zmq_connect instead of bind, still not quite working. Here is what I did:
|
[16:06] gavinstark
|
forwarder config: http://pastie.org/1134026
|
[16:07] gavinstark
|
Publisher: http://pastie.org/1134029
|
[16:07] gavinstark
|
subscriber: http://pastie.org/1134028
|
[16:08] gavinstark
|
ah, I think I see, I was supposed to "bind" on the "in" entry?
|
[16:12] sustrik
|
gavinstark: forwarder should bind both in and out
|
[16:12] zedas
|
lestrrat: re: mongrel2 handlers blocking, i fixed it by making the zeromq IO threads be 1. I've been running it like that for weeks without any problems, so you should be fine.
|
[16:13] gavinstark
|
sustrik: Thanks, working perfectly now.
|
[16:13] sustrik
|
zedas: this is a different issue
|
[16:13] sustrik
|
annoying interactions between language runtime, OS signals and 0MQ async architecture
|
[16:15] zedas
|
sustrik: ah. you got a link for me about it?
|
[16:15] mato
|
that's a very polite way of putting it :-)
|
[16:15] sustrik
|
mato: are you able to summarise it for zed?
|
[16:16] mato
|
yeah
|
[16:16] mato
|
signal handling is fucked :-)
|
[16:16] mato
|
end of summary
|
[16:16] sustrik
|
up to the point
|
[16:17] sustrik
|
basically, it has to do with handling Ctrl+C in interpreted languages
|
[16:17] mato
|
to elaborate on my summary, the issue is with the 0mq API not returning EINTR when API calls are interrupted by a signal in the application thread
|
[16:17] sustrik
|
i'm working on a fix now
|
[16:17] mato
|
zedas: problem is most language runtimes do delayed signal handling
|
[16:17] mato
|
zedas: i.e. handler() just sets some flag
|
[16:17] mato
|
zedas: it doesn't actually *do* anything
|
[16:17] mato
|
zedas: flag gets picked up when the runtime wakes up
|
[16:18] mato
|
zedas: but since 0mq calls never return EINTR, runtime never wakes up
|
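The delayed-handling problem described above can be modelled in a few lines. This is a toy simulation only: `ToyRuntime`, `blocking_call` and `run` are hypothetical names, and no real signals or 0MQ calls are involved. It shows that the runtime's pending flag is acted on only if the blocking call hands EINTR back to the caller.

```python
import errno


class ToyRuntime:
    """Models a language runtime that defers signal handling."""

    def __init__(self):
        self.pending = False   # set by the low-level handler
        self.handled = False   # set when the runtime finally acts on the signal

    def low_level_handler(self):
        # Like the C-level handler in many runtimes: record the signal and return.
        self.pending = True

    def check_signals(self):
        # Runs only once control is back inside the runtime.
        if self.pending:
            self.pending = False
            self.handled = True


def blocking_call(syscall_results, propagate_eintr):
    """Consume simulated syscall results until one is handed to the caller."""
    for result in syscall_results:
        if result == errno.EINTR and not propagate_eintr:
            continue  # library retries internally, caller never regains control
        return result
    return None  # still blocked: nothing ever reached the caller


def run(propagate_eintr):
    runtime = ToyRuntime()
    runtime.low_level_handler()                 # Ctrl+C arrives while blocked
    result = blocking_call([errno.EINTR], propagate_eintr)
    if result is not None:
        runtime.check_signals()                 # the runtime wakes up
    return runtime.handled


if __name__ == "__main__":
    print("EINTR swallowed  ->", run(False))
    print("EINTR propagated ->", run(True))
```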
[16:26] zedas
|
mato: ah yes, that'd explain it.
|
[16:35] ModusPwnens
|
hi, i'm back again. Hopefully with a question that's not as dumb this time. Anyways, does anyone know what would cause the first two bytes of a message to get corrupted?
|
[16:39] bgranger
|
zedas: can you say more about what you have been doing to solve this. I was going to look into the EINTR stuff today.
|
[16:40] mato
|
bgranger: I think we have a solution for the EINTR stuff
|
[16:40] sustrik
|
ModusPwnens: do you have a test program?
|
[16:40] bgranger
|
mato: ?
|
[16:40] mato
|
bgranger: oh, hang on, maybe you mean zed's issue
|
[16:40] bgranger
|
Is that not the same thing?
|
[16:41] mato
|
as opposed to the general issue which manifests itself as "^C doesn't work in $RANDOM_LANGUAGE"
|
[16:41] bgranger
|
I missed some of the discussion so I am trying to piece it together
|
[16:41] mato
|
ah
|
[16:41] bgranger
|
Are there 2 different issues?
|
[16:41] mato
|
yeah
|
[16:41] mato
|
zedas has an issue with multiple i/o threads
|
[16:41] mato
|
which may or may not have something (different) to do with EINTR
|
[16:42] mato
|
not clear, haven't had time to look at that
|
[16:42] bgranger
|
And signals?
|
[16:42] mato
|
signals
|
[16:42] mato
|
The problem is as I described just above.
|
[16:42] bgranger
|
What is the idea about EINTR?
|
[16:42] mato
|
When Martin Sustrik made the original API he decided not to return EINTR from blocking calls
|
[16:42] bgranger
|
I did talk to sustrik late last night (for me) and he pointed me to some code ...
|
[16:42] bgranger
|
Right he showed me that code
|
[16:43] mato
|
Right, except that basically breaks standard signal handling
|
[16:43] bgranger
|
Today I am going to put in a print statement in that logic and try to see if it prints with a SIGINT in the Python bindings.
|
[16:43] bgranger
|
It is still not clear if this will work with the Python bindings, but we will see.
|
[16:43] mato
|
To recap, imagine the simplest case with a C program blocking on some syscall, while at the same time handling say SIGINT.
|
[16:43] bgranger
|
I think it may
|
[16:43] ModusPwnens
|
sustrik: Yeah, it's happening in my code. I'm not entirely sure why. It seems like it happens when it is sent, as the sending side can properly decode the message. However, the receiving side cannot because the first two bytes get corrupted for some reason. I can paste my code to a pastebin if you like.
|
[16:44] mato
|
bgranger: Now, if "handling" SIGINT in this program's case means it just prints "Interrupted!" and exits, then fine.
|
[16:44] mato
|
bgranger: That will work even with the current situation in 0MQ.
|
[16:44] bgranger
|
Not in the python bindings...
|
[16:45] mato
|
bgranger: But, if it instead means that the program just sets some random flag, and then expects to process that flag "later", it won't work.
|
[16:45] bgranger
|
Right
|
[16:45] mato
|
bgranger: Which is precisely the Python/Perl (at least) case
|
[16:45] bgranger
|
Which is what Python does...yep
|
[16:45] sustrik
|
ModusPwnens: try it
|
[16:45] mato
|
bgranger: So, the only real solution is that 0MQ *API* calls return EINTR if they get EINTR back from a blocking system call.
|
[16:46] bgranger
|
Will that happen regardless of what signal handlers have been installed?
|
[16:46] mato
|
Yup
|
[16:46] ModusPwnens
|
sustrik: do you want the entire code or just the functions in question?
|
[16:46] mato
|
bgranger: It's what the OS does.
|
[16:46] bgranger
|
That would definitely solve our problems then! We would be very happy about that.
|
[16:46] sustrik
|
well, i would like a simple example
|
[16:46] sustrik
|
showing the problem
|
[16:47] mato
|
bgranger: If you do e.g. poll () in C, handle a signal in the same thread, then that poll () will return EINTR
|
[16:47] mato
|
bgranger: by "handle a signal" I mean the "set a flag case"
|
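The point about poll () returning EINTR can be observed from Python on a Unix system. A minimal sketch, assuming Python 3.5 or later running in the main thread: when the blocking system call is interrupted, control returns to the interpreter, the Python-level handler runs, and the call is retried unless the handler raises (PEP 475). Here the handler raises, so the wait ends early. The names `Interrupted` and `interrupted_wait` are illustrative.

```python
import select
import signal
import time


class Interrupted(Exception):
    """Raised from the signal handler to abandon the blocking wait."""


def interrupted_wait(alarm_after=0.1, block_for=5.0):
    """Block in select() and let SIGALRM cut the wait short.

    Returns the seconds actually spent blocked. Unix only, main thread only.
    """
    def handler(signum, frame):
        raise Interrupted()

    previous = signal.signal(signal.SIGALRM, handler)
    start = time.monotonic()
    try:
        signal.setitimer(signal.ITIMER_REAL, alarm_after)
        # No file descriptors: this is just an interruptible sleep.
        select.select([], [], [], block_for)
    except Interrupted:
        pass
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)   # restore the old handler
    return time.monotonic() - start


if __name__ == "__main__":
    print("blocked for %.2f s" % interrupted_wait())
```

If the underlying call never returned on a signal, as with the 0MQ blocking calls discussed here, the handler would not get a chance to run until the full wait had elapsed.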
[16:47] sustrik
|
bgranger: i'll fix it and let you know
|
[16:47] bgranger
|
mato: right OK
|
[16:47] sustrik
|
you can test it with python, others will test with perl and ruby
|
[16:47] mato
|
bgranger: By the way, while you're here, have you had any more Python/GIL issues?
|
[16:47] bgranger
|
sustrik: can you fix it in the 2.0.8 branch. We are not using trunk yet
|
[16:48] sustrik
|
backwards compatibility :|
|
[16:48] bgranger
|
mato: No, we have solved those. It was super subtle to get non-copy send/recv working with the GIL though.
|
[16:48] bgranger
|
mato: But it is working well.
|
[16:48] mato
|
bgranger: I have a really interesting case which looks like one of those...
|
[16:48] mato
|
bgranger: Will send to the mailing list, tomorrow.
|
[16:49] ModusPwnens
|
Hmm, well, i don't really have an example... it's just happening in the code I have written, so i could give you that if you wanted to see it. It started happening after i used google protobufs to encode into a byte array instead of a string.
|
[16:49] bgranger
|
mato: Ok, I will watch for it. The challenge was getting the ref counts of zmq messages synch'd with those of Python.
|
[16:49] mato
|
bgranger: But the short story is, I see gc trying to close() a socket, while at the same time zmq_free_fn() in a different thread is trying to acquire the GIL in order to decrease the message refcount
|
[16:49] mato
|
bgranger: and the result is deadlock
|
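The shape of this deadlock can be sketched with plain threads. This is a toy model, not pyzmq or libzmq code: an ordinary `threading.Lock` stands in for the GIL, and it is an assumption of the model that close() waits on the I/O thread without releasing that lock. Timeouts replace the real indefinite hang so the example terminates.

```python
import threading


def simulate_close_vs_free(wait=0.5):
    """Return (close_timed_out, callback_blocked) for the toy scenario."""
    gil = threading.Lock()        # stand-in for the interpreter lock
    freed = threading.Event()     # set once the free callback has run
    outcome = {"callback_blocked": False}

    def io_thread():
        # Free-function-like callback: needs the lock to drop a refcount.
        if gil.acquire(timeout=wait / 2):
            gil.release()
            freed.set()
        else:
            outcome["callback_blocked"] = True

    with gil:                     # gc runs close() while holding the lock
        worker = threading.Thread(target=io_thread)
        worker.start()
        # close() waits for the I/O thread but never releases the lock.
        close_timed_out = not freed.wait(wait)

    worker.join()
    return close_timed_out, outcome["callback_blocked"]


if __name__ == "__main__":
    print(simulate_close_vs_free())
```

Each side waits for something only the other can provide, which is why both flags come back set.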
[16:50] bgranger
|
Is this in trunk where sockets can move threads?
|
[16:50] mato
|
bgranger: nope, 2.0.8
|
[16:50] bgranger
|
What language?
|
[16:50] mato
|
Python...
|
[16:50] bgranger
|
What version of pyzmq?
|
[16:50] mato
|
latest-ish, let me check
|
[16:51] bgranger
|
Since Saturday?
|
[16:51] mato
|
ah, no
|
[16:51] mato
|
2.0.7 pyzmq actually
|
[16:51] mato
|
with 2.0.8 zmq
|
[16:51] bgranger
|
I did a bunch of work then. I released a 2.0.7 stable release and master is now 2.0.8 compatible.
|
[16:51] bgranger
|
But, I believe that what you are saying is possible. The 1 problem we have is that the zmq_free_fn does have to acquire the GIL. If that can't happen, you have trouble.
|
[16:52] jonrafkind
|
has anyone used zmq in a real-time game? mostly I just need low latency
|
[16:52] mato
|
bgranger: that's precisely what I'm seeing
|
[16:52] bgranger
|
There is nothing we can do to get around this.
|
[16:52] bgranger
|
Can you just hold onto the socket ref to prevent gc?
|
[16:52] mato
|
Yes, but I have transient sockets in this application
|
[16:52] mato
|
So tons of fds get leaked
|
[16:52] bgranger
|
Hmmm, that might be tough
|
[16:53] bgranger
|
What do you mean by that?
|
[16:53] mato
|
Well, it's precisely my workaround (holding onto the socket refs)
|
[16:53] bgranger
|
Ahh, OK.
|
[16:53] mato
|
But that means the underlying fds hang around, so eventually you'll run out.
|
[16:53] bgranger
|
Is the socket that is being gc'd the one that is sending the msg though?
|
[16:53] bgranger
|
right, you don't want that
|
[16:54] mato
|
That's hard to tell at the moment, but at least you've confirmed that this can happen.
|
[16:54] mato
|
So the backtraces I see make sense.
|
[16:54] mato
|
I'll write it up tomorrow.
|
[16:54] sustrik
|
ModusPwnens: if you want me to look at it, strip it down to the simplest possible example that reproduces the behaviour
|
[16:54] bgranger
|
I can do the following. When a socket sends a msg, it can add itself to a list of sockets that the Message holds on to. That way the message can prevent the gc, but when the msg goes away, the socket will as well.
|
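The proposed workaround can be sketched with two stand-in classes. `Socket` and `Message` here are hypothetical and are not pyzmq's classes; the sketch only shows the lifetime effect of a message holding a strong reference to its sender, assuming CPython's reference counting.

```python
import gc
import weakref


class Message:
    def __init__(self, data):
        self.data = data
        self.senders = []   # strong references to sockets that sent this message


class Socket:
    def send(self, message):
        # Proposed workaround: the message keeps its sender alive.
        message.senders.append(self)


def demo():
    sock = Socket()
    msg = Message(b"payload")
    sock.send(msg)

    probe = weakref.ref(sock)
    del sock                      # application drops its last reference
    gc.collect()
    alive_while_message_exists = probe() is not None

    del msg                       # message goes away, and the socket with it
    gc.collect()
    alive_after_message = probe() is not None
    return alive_while_message_exists, alive_after_message


if __name__ == "__main__":
    print(demo())
```

The trade-off is that a long-lived message pins every socket that sent it.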
[16:55] sustrik
|
ModusPwnens: aren't you overwriting the buffer you've sent to 0MQ?
|
[16:55] bgranger
|
but that might keep the socket around longer than you want if you have a message that lives a long time.
|
[16:55] mato
|
bgranger: That might be a solution, yes.
|
[16:55] mato
|
bgranger: Anyhow, gc is not supposed to be instant, no?
|
[16:56] bgranger
|
Depends
|
[16:56] bgranger
|
if there are cycles or not.
|
[16:56] mato
|
bgranger: So if the socket hangs around for a bit... does it matter too much? As long as it goes away eventually.
|
[16:56] sustrik
|
jonrafkind: i think there are a couple of game devs here, rbraley for example
|
[16:56] bgranger
|
Depends on what "a bit" means
|
[16:56] mato
|
True.
|
[16:56] ModusPwnens
|
Sustrik: Well it's in a loop and I close the message at the end of the loop, initializing it again at the beginning
|
[16:57] mato
|
bgranger: Let me sleep on it, and write up, this is useful to have on the list
|
[16:57] mato
|
bgranger: since other people dealing with GC-based languages may run into similar problems.
|
[16:57] bgranger
|
mato: Great
|
[16:57] bgranger
|
Absolutely
|
[16:57] ModusPwnens
|
and the data that I am initializing it with is a char * which i free at the end of each iteration too
|
[16:57] sustrik
|
show me the sending code
|
[16:58] mato
|
bgranger: Thanks for your help
|
[16:58] bgranger
|
later
|
[16:59] ModusPwnens
|
http://pastebin.com/uVgxmb6K
|
[17:00] ModusPwnens
|
there's a lot of debugging stuff in there so i'm sorry that it is messy
|
[17:00] sustrik
|
ModusPwnens: when using zmq_msg_init_data you are passing ownership of the buffer to 0MQ
|
[17:00] sustrik
|
so you have to give it a free function
|
[17:00] sustrik
|
and don't touch the buffer afterwards
|
[17:01] sustrik
|
if you don't need zero-copy
|
[17:01] sustrik
|
just init the message using zmq_msg_init_size
|
[17:01] sustrik
|
and copy the data into i
|
[17:01] sustrik
|
it
|
[17:01] ModusPwnens
|
So just don't use init_data at all?
|
[17:02] sustrik
|
do you need zero-copy?
|
[17:03] ModusPwnens
|
I'm not really sure what that is, so I don't think so.
|
[17:03] sustrik
|
then don't use it :)
|
[17:03] ModusPwnens
|
Ok! So just use size to initialize it and then memcpy into it?
|
[17:03] sustrik
|
exactly
|
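The difference between handing over a buffer and copying into the message can be illustrated with a Python analogy. This is not the 0MQ API: `send_by_reference` and `send_by_copy` are made-up helpers, and the example overwrites the buffer where the C code in question frees it, but the effect on the queued data is the same kind of corruption. In C, the copying variant corresponds to zmq_msg_init_size followed by memcpy into the message data.

```python
def send_by_reference(buffer):
    """Zero-copy style: the 'message' only references the caller's buffer."""
    return memoryview(buffer)


def send_by_copy(buffer):
    """init_size + memcpy style: the 'message' owns a private copy."""
    return bytes(buffer)


def demo():
    payload = bytearray(b"\x08\x96\x01hello")
    queued_ref = send_by_reference(payload)
    queued_copy = send_by_copy(payload)

    # The caller reuses its buffer before the queued message has been sent.
    payload[0:2] = b"\x00\x00"

    return bytes(queued_ref), queued_copy


if __name__ == "__main__":
    by_ref, by_copy = demo()
    print("by reference:", by_ref)
    print("by copy:     ", by_copy)
```

Only the referenced message sees the overwritten leading bytes; the copied one is unaffected by what the caller does afterwards.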
[17:03] ModusPwnens
|
okie doke. Thanks! I will try that!
|
[17:03] ModusPwnens
|
Sorry for the constant questions :S
|
[17:04] sustrik
|
np
|
[17:06] jonrafkind
|
oh i just realized, iMatix is the same company that made SFL. I use that in my project :p
|
[20:50] ModusPwnens
|
Hi guys, I have encountered a problem with the official benchmarking utility.
|
[20:51] ModusPwnens
|
It appears to crash if you enter in a very large number of messages, and I was wondering if this was supposed to happen and if so, why?
|
[21:16] cremes
|
ModusPwnens: which benchmarking utility and what number did you pass to it?
|
[21:17] ModusPwnens
|
The site has changed so I don't know where the utility is offhand anymore
|
[21:18] cremes
|
what's the name of it?
|
[21:18] ModusPwnens
|
actually, it's in my zeromq folder
|
[21:18] ModusPwnens
|
in the bin
|
[21:18] ModusPwnens
|
remote_thr
|
[21:18] cremes
|
oh, the local_thr/remote_thr pair?
|
[21:18] ModusPwnens
|
ya
|
[21:19] cremes
|
so what arguments did you pass it? (i recommend you pastie the output from your shell along with any displayed error)
|
[21:19] ModusPwnens
|
i passed 50 and 25000000
|
[21:20] ModusPwnens
|
C:\Users\David Dawson\Desktop\zeromq-2.0.7\zeromq-2.0.7\bin>remote_thr.exe tcp:/
|
[21:20] ModusPwnens
|
Assertion failed: end_chunk->next (c:\users\david dawson\desktop\zeromq-2.0.7\ze
|
[21:20] ModusPwnens
|
romq-2.0.7\src\yqueue.hpp:108)
|
[21:20] ModusPwnens
|
This application has requested the Runtime to terminate it in an unusual way.
|
[21:20] ModusPwnens
|
Please contact the application's support team for more information.
|
[21:20] cremes
|
did it fail only with the 25 million number or with 50 too?
|
[21:20] ModusPwnens
|
bah, sorry, the command prompt is strange
|
[21:20] ModusPwnens
|
no, it's just the 25 million
|
[21:20] ModusPwnens
|
if i lower it it works fine
|
[21:20] ModusPwnens
|
but I was just curious if it is supposed to fail that way
|
[21:21] cremes
|
no, you may have found a bug
|
[21:21] cremes
|
but before reporting it, install 2.0.8 and try again
|
[21:21] ModusPwnens
|
ok
|
[21:21] cremes
|
no sense in reporting a bug against an old release
|
[21:21] ModusPwnens
|
true enough
|
[21:22] ModusPwnens
|
ok i have to recompile the source, hold on
|
[21:22] cremes
|
just a note on irc etiquette...
|
[21:22] cremes
|
give as much information as possible...
|
[21:23] cremes
|
don't paste more than 2 lines directly into the channel; use a pastie service like pastie.org or gist.github.com for longer stuff
|
[21:23] cremes
|
tell us the name of the programs involved and the version of the library
|
[21:23] ModusPwnens
|
Ok. I will note that for the future.
|
[21:23] ModusPwnens
|
thanks!
|
[21:23] cremes
|
if you see a crash, it is never *supposed* to happen so asking if it is seems a bit silly
|
[21:24] cremes
|
np
|
[21:24] ModusPwnens
|
Well, i guess I meant to say if it was already known
|
[21:24] cremes
|
then search the issues on github; all known bugs are reported and tracked there
|
[21:25] ModusPwnens
|
Hmm, ok. I didn't know about that..
|
[21:25] cremes
|
and now you know! ;)
|
[21:25] ModusPwnens
|
that would be under the issues section?
|
[21:26] cremes
|
correct
|
[21:26] ModusPwnens
|
Okay. Sorry about that..
|
[21:27] cremes
|
we all had to learn it at some point, so don't worry about it
|
[22:31] ModusPwnens
|
ok, so it still crashes
|
[22:31] ModusPwnens
|
i will paste the output in a sec
|
[22:33] ModusPwnens
|
http://pastebin.com/GV4pxUth
|
[22:33] ModusPwnens
|
I am using windows 7 on both computers
|
[22:33] ModusPwnens
|
The computer that generated the error is 64-bit as well
|
[22:34] ModusPwnens
|
the computer running local is only 32-bit
|