[Time] Name | Message |
[00:14] gui81
|
anyone there
|
[00:15] gui81
|
is it possible to do this: http://pastebin.com/8KHzgu3a
|
[00:16] gui81
|
I am trying to publish to tcp://*:5555 and subscribe to tcp://localhost:5555, but it doesn't seem to work
|
[00:17] gui81
|
I can get the "Hello World from pub_thread" and "Hello World from sub_thread" messages to print out, but it doesn't seem to publish from pub_main or subscribe from sub_main
|
[00:27] lt_schmidt_jr
|
Regarding named sockets: if I have a publisher with named subscribers connected to it, and a subscriber goes away and reconnects, is there any way to know whether it has missed messages?
|
[00:27] gui81
|
nevermind, after reading through the Problem Solver, I realized that I wasn't properly setting the ZMQ_SUBSCRIBE option with setsockopt
|
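For reference, the detail gui81 tripped over: a ZMQ_SUB socket drops every message until at least one subscription filter is set. A minimal sketch against the 2.x C API of the era (an empty filter subscribes to everything; the endpoint matches the one in the question):

```c
#include <assert.h>
#include <zmq.h>

int main (void)
{
    void *ctx = zmq_init (1);
    void *sub = zmq_socket (ctx, ZMQ_SUB);
    assert (zmq_connect (sub, "tcp://localhost:5555") == 0);

    /* Without this, a SUB socket silently discards everything. */
    assert (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0) == 0);

    zmq_msg_t msg;
    zmq_msg_init (&msg);
    zmq_recv (sub, &msg, 0);     /* 2.x signature: (socket, msg, flags) */
    zmq_msg_close (&msg);

    zmq_close (sub);
    zmq_term (ctx);
    return 0;
}
```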
[00:34] lt_schmidt_jr
|
sustrik: this may be a question for you
|
[00:34] lt_schmidt_jr
|
sustrik: how do i know (when using a named socket) if I have missed messages
|
[00:37] lt_schmidt_jr
|
named socket = DURABLE socket
|
[00:39] cremes
|
lt_schmidt_jr: a PUB socket is like a radio broadcast
|
[00:39] cremes
|
if you aren't connected when some messages go out, you miss them and never know it
|
[00:40] lt_schmidt_jr
|
even if I connect with a durable socket
|
[00:40] lt_schmidt_jr
|
?
|
[00:40] lt_schmidt_jr
|
then disconnect
|
[00:40] lt_schmidt_jr
|
and come back?
|
[00:40] cremes
|
if you disconnect, your queue is flushed
|
[00:40] lt_schmidt_jr
|
so durable sockets don't work with pub/sub
|
[00:40] cremes
|
you'll have to detect this via a sequence number gap (or similar) in the broadcast messages
|
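A minimal sketch of the sequence-gap detection cremes suggests, assuming the publisher prepends a 64-bit counter to each message body (the framing is an application convention, not something 0MQ provides):

```c
#include <stdint.h>
#include <string.h>
#include <zmq.h>

static uint64_t last_seq = 0;

/* Subscriber side: compare each message's counter with the last seen. */
static void check_gap (zmq_msg_t *msg)
{
    uint64_t seq;
    memcpy (&seq, zmq_msg_data (msg), sizeof seq);
    if (last_seq != 0 && seq != last_seq + 1) {
        /* seq - last_seq - 1 messages were broadcast while we were
           away; trigger whatever recovery is appropriate. */
    }
    last_seq = seq;
}
```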
[00:41] cremes
|
perhaps i don't understand what you mean by durable socket
|
[00:41] cremes
|
never heard of one
|
[00:41] lt_schmidt_jr
|
its in the guide
|
[00:42] cremes
|
i'll look in a minute...
|
[00:42] cremes
|
if you can provide a link, that would be best
|
[00:44] lt_schmidt_jr
|
http://zguide.zeromq.org/chapter:all#toc37
|
[00:56] lt_schmidt_jr
|
I guess I should just try it out
|
[01:01] cremes
|
lt_schmidt_jr: ok, i hadn't heard this term before
|
[01:01] cremes
|
but note that it *only* keeps the send buffer around
|
[01:01] cremes
|
anything in transit or in the receiver's OS buffers or queue will be lost
|
[01:01] cremes
|
so you probably still need to be able to identify gaps and recover from them
|
[01:02] lt_schmidt_jr
|
how does that interact with INPROC
|
[01:02] lt_schmidt_jr
|
loss is still possible?
|
[01:07] cremes
|
lt_schmidt_jr: i don't know
|
[01:07] cremes
|
that's a good question for the mailing list
|
[01:07] lt_schmidt_jr
|
thanks
|
[01:08] lt_schmidt_jr
|
I have subscriber sockets on the server on behalf of the websockets that may get disconnected
|
[01:08] lt_schmidt_jr
|
I actually have them run for a bit to see if the client websocket reconnects
|
[01:10] lt_schmidt_jr
|
but I wonder if I can terminate them instead
|
[01:10] lt_schmidt_jr
|
if they're DURABLE
|
[01:21] stodge
|
Is there a pyzmq example of using PGM?
|
[01:36] jugg
|
yrashk, fyi, 'ezmq' is already used for an Eiffel 0mq binding.
|
[01:37] yrashk
|
jugg: oh well
|
[01:37] yrashk
|
who uses eiffel anyway?
|
[01:38] jugg
|
how about enifzmq ?
|
[01:38] yrashk
|
horrible :D
|
[01:38] jugg
|
:)
|
[01:38] yrashk
|
btw apparently ezmq is faster than erlzmq
|
[01:39] yrashk
|
up to 2x
|
[01:39] jugg
|
what changed? I read some of the backlog... your original tests showed similar performance... how'd you up it?
|
[01:41] yrashk
|
jugg: well I don't quite know yet
|
[01:41] yrashk
|
it warmed up :D
|
[01:41] yrashk
|
I rewrote it from scratch as a NIF, as you can see
|
[01:42] jugg
|
yes, I'm stuck on R13B04 unfortunately.
|
[01:43] yrashk
|
and it can reach about 60-80K on my mac pro as opposed to 30K with erlzmq
|
[01:43] yrashk
|
I would assume this is because there is not so much communication overhead now
|
[01:44] jugg
|
that is ezmq/ezmq or erlang/ezmq ?
|
[01:44] jugg
|
eh
|
[01:44] jugg
|
zmq/ezmq
|
[01:45] jugg
|
I mean I saw something about testing C zmq to erlang zmq.
|
[01:46] jugg
|
those 60-80k are ezmq to ezmq?
|
[01:47] yrashk
|
yes
|
[02:02] jugg
|
yrashk, do you plan on making these bindings usable for receiving messages without having to call recv(noblock) on an interval? If so, how?
|
[02:03] yrashk
|
jugg: you mean nb recv?
|
[02:05] jugg
|
I mean you can't call zmq_poll, and there is no out-of-band recv support. So you either have to call recv on an interval, or call it and block the VM.
|
[02:05] yrashk
|
yes right now what you can see there is brecv()
|
[02:05] yrashk
|
a blocking recv
|
[02:05] yrashk
|
and here I have a non blocking recv() in my WIP
|
[02:06] yrashk
|
right now I am solving this with having a thread per socket
|
[02:06] yrashk
|
it works but only for some time, I have a bug somewhere
|
[02:07] yrashk
|
that causes a crash
|
[02:07] yrashk
|
that's why I am not committing it yet
|
[02:18] jugg
|
Why is 'send' not called 'bsend'?
|
[02:18] jugg
|
yrashk
|
[02:20] yrashk
|
jugg: because it's not inherently a waiting operation
|
[02:20] yrashk
|
you can use noblock flag for it
|
[02:21] jugg
|
and you can't in brecv?
|
[02:22] yrashk
|
you can, but the nature of recv is different
|
[02:22] yrashk
|
it is inherently a waiting operation
|
[02:30] yrashk
|
nb recv() is about 10 times slower than blocking
|
[02:45] jugg
|
yrashk, is it possible to call an escript function using apply()? I tried using ?MODULE for the module name, but that didn't work... I'm trying to get those perf tests to run under R13B04.
|
[02:49] yrashk
|
jugg -mode(compile) will help with the ?MODULE thing afair
|
[02:51] yrashk
|
*if I am not mistaken*
|
[02:52] jugg
|
that worked, thanks.
|
[02:54] yrashk
|
yw
|
[03:58] jugg
|
yrashk, https://gist.github.com/833227 is what I'll commit on csrl/erlzmq. Do you oppose being labeled the author?
|
[04:36] jugg
|
yrashk, if the ezmq:recv concept was also used for ezmq:send and for an ezmq:poll implementation, those bindings would become quite usable. Nice (and fast) work.
|
[05:22] yrashk
|
jugg: I don't oppose
|
[05:23] yrashk
|
jugg: you think so? I might try doing the recv trick with send, too
|
[05:23] yrashk
|
but recv is quite slow
|
[05:23] yrashk
|
brecv is much faster
|
[05:27] jugg
|
how many cpu/cores do you have?
|
[05:32] yrashk
|
8 (16 virtual)
|
[05:33] yrashk
|
dual quadcore
|
[05:35] jugg
|
it'd be interesting to know what the scheduler is doing then... playing with thread affinity could be interesting. a 10x slowdown seems odd.
|
[05:37] yrashk
|
ya I am not sure what's happening
|
[05:37] yrashk
|
but recv is about 10 times slower than brecv
|
[05:37] yrashk
|
there's certainly *some* overhead
|
[05:37] yrashk
|
but I doubt there is that much of an overhead
|
[05:38] yrashk
|
although who knows
|
[05:38] yrashk
|
recv is pretty tricky
|
[05:39] yrashk
|
push, pull, recv, send
|
[05:39] jugg
|
wild guess, but receiver_thread's enif_send() is the likely candidate since it is out of band; with brecv, the erlang scheduler is already blocking for the result.
|
[05:39] yrashk
|
it is the candidate
|
[05:39] yrashk
|
but I have no idea how else you can communicate the result back
|
[05:39] yrashk
|
;)
|
[05:40] yrashk
|
also I suspect that ezmq is leaking memory
|
[05:40] yrashk
|
but that should be fixable since its pretty small
|
[05:40] yrashk
|
shouldn't be that hard to trace the leak
|
[05:41] yrashk
|
I have no proof of leaking yet tho
|
[05:41] jugg
|
maybe figure out a batch recv implementation... it'll loop until zmq_recv(noblock) returns eagain (or a max number is hit), then return the batch result. Certainly useful for when SNDMORE exists.
|
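jugg's batch-recv idea, sketched against the 2.x C API (the `handle` callback and the cap are illustrative; in 2.x zmq_recv returns -1 with errno set to EAGAIN when nothing is queued):

```c
#include <errno.h>
#include <zmq.h>

/* Drain up to 'max' queued messages without blocking; returns the count
   received, or -1 on a real error. */
static int recv_batch (void *socket, int max, void (*handle) (zmq_msg_t *))
{
    int n = 0;
    while (n < max) {
        zmq_msg_t msg;
        zmq_msg_init (&msg);
        if (zmq_recv (socket, &msg, ZMQ_NOBLOCK) != 0) {
            zmq_msg_close (&msg);
            return errno == EAGAIN ? n : -1;   /* drained, or failed */
        }
        handle (&msg);
        zmq_msg_close (&msg);
        n++;
    }
    return n;
}
```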
[05:42] yrashk
|
yeah I haven't even begun to deal with sndmore
|
[05:42] yrashk
|
was out of scope
|
[05:42] yrashk
|
if only you could tell erlang to dedicate a scheduler to just one process
|
[05:43] yrashk
|
that is the one that talks to the NIF
|
[05:43] yrashk
|
this way blocking would be irrelevant
|
[05:43] yrashk
|
:D
|
[05:44] yrashk
|
but I don't see any way to do so
|
[05:46] yrashk
|
sigh
|
[05:49] jugg
|
I think you'll need to refine your get/setsockopt implementation... zmq is rather exact about value lengths. Also, not all socket options are valid for 'get'.
|
[05:49] yrashk
|
I am ready to merge pull reqs
|
[05:49] jugg
|
:)
|
[05:49] yrashk
|
I'd rather focus on performance issues right now
|
[05:49] yrashk
|
that was the original intention
|
[05:49] yrashk
|
even though it is faster than erlzmq
|
[05:49] yrashk
|
there's still a lot of work
|
[05:50] yrashk
|
so if you can actually help me with the sockopt thing, I will greatly appreciate that
|
[05:50] yrashk
|
although you're on R13 :-(
|
[05:51] jugg
|
yah, I can't even compile it... The best I can do is refer you to the csrl/erlzmq implementation.
|
[05:51] yrashk
|
re sockopt and stuff: after all ezmq is very new, I spent like 3 hours with it or so
|
[05:51] yrashk
|
well you can also install R14 somewhere... hehehe :)
|
[05:51] jugg
|
:)
|
[05:53] yrashk
|
I am really trying to understand how to get recv almost as fast as brecv
|
[05:56] yrashk
|
I have a rough idea of msend
|
[05:56] yrashk
|
but it's too vague
|
[05:57] jugg
|
ezmq_nif_recv should directly call zmq_recv(noblock|flags) and only if it returns EAGAIN do you hand it off to the receive_thread.
|
[05:58] yrashk
|
or even asend.. hey this is an idea!
|
[05:58] jugg
|
well, hand it off if (flags & noblock) == 0.
|
[05:58] yrashk
|
jugg: hey, not a bad idea
|
[05:58] jugg
|
it is what I do in erlzmq...
|
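The fast path jugg describes, sketched as C; `make_msg_binary`, `hand_off_to_receiver_thread` and `make_error` are hypothetical helpers standing in for ezmq internals, not actual ezmq source:

```c
#include <errno.h>
#include <zmq.h>
#include "erl_nif.h"

/* Hypothetical helpers standing in for ezmq internals. */
extern ERL_NIF_TERM make_msg_binary (ErlNifEnv *env, zmq_msg_t *msg);
extern ERL_NIF_TERM hand_off_to_receiver_thread (ErlNifEnv *env,
                                                 void *socket, int flags);
extern ERL_NIF_TERM make_error (ErlNifEnv *env, int err);

static ERL_NIF_TERM do_recv (ErlNifEnv *env, void *socket, int flags)
{
    zmq_msg_t msg;
    zmq_msg_init (&msg);

    /* Cheap path first: if a message is already queued, return it
       without involving the receiver thread at all. */
    if (zmq_recv (socket, &msg, flags | ZMQ_NOBLOCK) == 0)
        return make_msg_binary (env, &msg);

    int err = errno;
    zmq_msg_close (&msg);

    if (err == EAGAIN && (flags & ZMQ_NOBLOCK) == 0)
        /* Nothing queued and the caller wants to block: only now pay
           the cost of the hand-off; the receiver thread enif_send()s
           the result back to the caller when it arrives. */
        return hand_off_to_receiver_thread (env, socket, flags);

    return make_error (env, err);
}
```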
[05:59] yrashk
|
just tested brecv again on this mac pro just in case
|
[05:59] yrashk
|
66K
|
[05:59] yrashk
|
twice as good as erlzmq
|
[05:59] yrashk
|
still there, good
|
[06:00] yrashk
|
now, recv
|
[06:01] yrashk
|
27K
|
[06:01] yrashk
|
hm not 10 times, roughly 2 times
|
[06:01] yrashk
|
well I hadn't tried recv on the mac pro before
|
[06:01] yrashk
|
only on macbook air
|
[06:02] yrashk
|
which is much slower and has fewer cores
|
[06:12] yrashk
|
jugg: hooray
|
[06:12] jugg
|
the suspense...
|
[06:13] yrashk
|
now recv is more or less on par with brecv
|
[06:13] yrashk
|
deviation is minimal
|
[06:13] jugg
|
good deal
|
[06:13] yrashk
|
thanks a lot!
|
[06:15] yrashk
|
committed
|
[06:16] yrashk
|
jugg: https://github.com/yrashk/ezmq/commit/77060d1b74d37c91ac7b403c90ea84095d30e5b9
|
[06:20] yrashk
|
now... how can we make it even faster?
|
[06:20] jugg
|
nice. so, your old recv performed similarly to erlzmq?
|
[06:20] yrashk
|
I haven't read erlzmq, honestly
|
[06:20] yrashk
|
my recv was just unconditionally handing commands off to the thread
|
[06:20] yrashk
|
now it checks first
|
[06:21] jugg
|
you said it was doing 27K msg/s? erlzmq was about the same, yes?
|
[06:21] yrashk
|
no, current codebase is within 60-80K msg/s
|
[06:21] yrashk
|
that was for brecv prior to that commit
|
[06:21] jugg
|
yah, sorry, I mean the old version.
|
[06:21] yrashk
|
and recv was roughly 30K
|
[06:21] yrashk
|
and now recv is on par with brecv
|
[06:21] jugg
|
ok
|
[06:22] yrashk
|
because if there's a lot of messages it does pretty much what brecv does
|
[06:22] yrashk
|
thanks to your suggestion
|
[06:22] jugg
|
that makes me think that the message passing out of the driver (nif/port) is the difference... a port driver always has to do message passing.
|
[06:22] yrashk
|
yes
|
[06:22] yrashk
|
and this is why I wrote NIF in the first place
|
[06:22] yrashk
|
I blamed port comm for delays
|
[06:23] yrashk
|
and apparently it can account for about a 2x diff in timing
|
[06:23] jugg
|
well, I guess that is a bit of the reason NIFs exist anyway...
|
[06:23] yrashk
|
ya
|
[06:24] yrashk
|
and they are going to get even better
|
[06:24] yrashk
|
in regard to async stuff
|
[06:24] yrashk
|
but since it's only in the future I am dealing with what we have now
|
[06:24] jugg
|
sure
|
[06:24] yrashk
|
and I am not yet happy with 60-80K msg/sec
|
[06:24] yrashk
|
and also sending is superslow
|
[06:24] yrashk
|
looks like batching doesn't kick in
|
[06:25] yrashk
|
it takes roughly 13 seconds to send those msgs
|
[06:25] yrashk
|
insane
|
[06:25] yrashk
|
the C version takes less than half a second
|
[06:26] yrashk
|
if I send messages with the C version, ezmq reads at about 120K msg/sec
|
[06:26] yrashk
|
so yet another 2x gain could be achieved if sending were somehow optimized
|
[06:31] yrashk
|
hmm send seems to crash sometimes on my laptop
|
[06:31] yrashk
|
not on the pro
|
[06:31] yrashk
|
I guess ezmq needs a little bit more maturing
|
[06:52] jugg
|
yrashk, any progress? looks like I got disconnected for a while.
|
[06:53] cremes
|
pieterh, sustrik: that mailbox fix has eliminated a ton of weird shit; so glad we solved it!
|
[06:53] sustrik
|
cremes: i hope it hasn't introduced other weird shit
|
[06:54] cremes
|
sustrik: not at all; it's probably the most important fix i have seen yet
|
[06:54] sustrik
|
:)
|
[06:54] cremes
|
minor in code changes, huge in effect
|
[06:54] sustrik
|
btw, do you have an osx box?
|
[06:54] cremes
|
double :)
|
[06:54] cremes
|
yes, i do
|
[06:55] sustrik
|
i would like to fix the buffer resizing algorithm so that it works on osx
|
[06:55] cremes
|
sustrik: i can give you an account on it
|
[06:55] sustrik
|
if i do the patch, can you possibly test it?
|
[06:55] cremes
|
sure, either way
|
[06:55] sustrik
|
that would be great
|
[06:55] sustrik
|
ssh?
|
[06:55] cremes
|
of course
|
[06:55] cremes
|
i'm going to bed now (1am) so i'll take care of it tomorrow
|
[06:56] sustrik
|
great. let me know by email then
|
[06:56] cremes
|
will do
|
[06:56] sustrik
|
thanks
|
[06:56] cremes
|
good night
|
[06:56] sustrik
|
good night
|
[06:56] yrashk
|
jugg: no
|
[06:57] yrashk
|
jugg: just thinking how to improve performance further
|
[06:57] yrashk
|
jugg: also I think I need to use rwlocks to guard sockets
|
[06:58] jugg
|
or you could only allow the owner pid to use the socket?
|
[06:59] jugg
|
not sure if that calling pid is available in NIFs?
|
[06:59] yrashk
|
or that
|
[06:59] yrashk
|
it is
|
[06:59] yrashk
|
but I think rwlock is a better method
|
[06:59] yrashk
|
probably
|
[06:59] yrashk
|
also
|
[06:59] yrashk
|
I am not sure but afair there is no guarantee that the call will always be made from the same scheduler
|
[07:00] yrashk
|
hence it will be another thread
|
[07:00] yrashk
|
and hence it will crash 0mq
|
[07:00] yrashk
|
that's why I am thinking rwlocks
|
[07:00] jugg
|
I don't... a socket should only be used by a single process. Else you have trouble with multi-part messages and socket state (eg REQ toggling of send/recv).
|
[07:00] yrashk
|
and I think these occasional crashes might be attributed to this
|
[07:01] yrashk
|
true
|
[07:01] jugg
|
it is fine for sockets to be used in different threads (I assume you are using zmq 2.1.x)
|
[07:01] yrashk
|
but not simultaneously
|
[07:01] yrashk
|
I guess
|
[07:01] jugg
|
correct
|
[07:01] yrashk
|
hence rwlocks
|
[07:01] jugg
|
or a single erlang process...
|
[07:01] yrashk
|
again it doesn't guarantee the same scheduler (thread)
|
[07:02] jugg
|
doesn't need to
|
[07:02] jugg
|
a single process can't call into multiple nifs at once.
|
[07:02] yrashk
|
true
|
[07:02] yrashk
|
then we need to record the pid on socket creation
|
[07:02] yrashk
|
easy to do
|
[07:02] jugg
|
yes
|
[07:03] yrashk
|
and return badarg or something more appropriate if the call is not from the same process
|
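A sketch of that ownership check with the stock erl_nif API (the `ezmq_socket_t` struct is illustrative; `enif_self()` gives the calling process, and comparing the pids as terms via `enif_make_pid()` works on R14):

```c
#include "erl_nif.h"

typedef struct {
    void      *socket;   /* the 0MQ socket */
    ErlNifPid  owner;    /* pid recorded when the socket was created */
} ezmq_socket_t;

/* In the socket-creation NIF: remember the creator. */
static void record_owner (ErlNifEnv *env, ezmq_socket_t *s)
{
    enif_self (env, &s->owner);
}

/* At the top of every send/recv NIF: callers other than the owner get
   badarg, so the 0MQ socket is never touched from two schedulers at
   once and no locking is needed. */
static int owner_ok (ErlNifEnv *env, ezmq_socket_t *s)
{
    ErlNifPid self;
    enif_self (env, &self);
    return enif_is_identical (enif_make_pid (env, &self),
                              enif_make_pid (env, &s->owner));
}
```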
[07:04] yrashk
|
this still doesn't explain the rare segfaults
|
[07:05] yrashk
|
basically *sometimes* 0mq fails on reading receiver_thread's pull socket
|
[07:05] yrashk
|
which is something I can't yet explain
|
[07:06] yrashk
|
A ØMQ context is thread safe and may be shared among as many application threads as necessary, without any additional locking required on the part of the caller. Each ØMQ socket belonging to a particular context may only be used by the thread that created it using zmq_socket().
|
[07:06] yrashk
|
so not anymore?
|
[07:07] jugg
|
not on 2.1.x
|
[07:07] yrashk
|
too bad api doc is outdated
|
[07:07] yrashk
|
:-(
|
[07:07] jugg
|
in source, or on the web?
|
[07:07] yrashk
|
web
|
[07:07] jugg
|
the web is for 2.0.x still.
|
[07:07] yrashk
|
any reason why?
|
[07:07] jugg
|
it is the 'stable' version.
|
[07:08] yrashk
|
ah
|
[07:08] yrashk
|
so will be updated for 2.2?
|
[07:08] jugg
|
I'm not sure how they're handling that...
|
[07:09] jugg
|
I do wish they'd provide namespacing at api.zeromq.org tho for the different versions...
|
[07:09] jugg
|
but I believe that's been discussed and rejected.
|
[07:09] yrashk
|
that would be awesome
|
[09:03] mikko
|
pieterh_: there?
|
[09:03] pieterh
|
mikko: yup
|
[09:03] mikko
|
pieterh: cJSON
|
[09:04] pieterh
|
it's giving build errors?
|
[09:04] mikko
|
is there a reason why the .c file is included in zfl_config_json?
|
[09:04] pieterh
|
ah, that's just to simplify things
|
[09:04] mikko
|
well, it's not included in make dist atm
|
[09:04] mikko
|
fixing that
|
[09:04] mikko
|
make dist, take the tar.gz and try to build
|
[09:04] pieterh
|
it's used only by that one zfl class
|
[09:04] mikko
|
yeah
|
[09:04] pieterh
|
true, we never tested a tarball yet
|
[09:17] mikko
|
pieterh: larger config refactoring on zfl, want it on a separate branch first?
|
[09:17] mikko
|
i went through most of it last night and all aspects of the build work for me
|
[09:18] pieterh
|
I don't think so mikko, let's do everything on master
|
[09:18] mikko
|
ok, i can always revert if it breaks things in a bad way
|
[09:18] pieterh
|
the sooner it breaks the more time we have to fix it :-)
|
[09:18] pieterh
|
oh, we never revert :-)
|
[09:19] pieterh
|
i'm serious, the only process I know is to publish & improve
|
[09:19] mikko
|
3 files changed, 67 insertions(+), 149 deletions(-)
|
[09:19] mikko
|
heh
|
[09:20] pieterh
|
commit it, I'll quickly test on this box I'm on
|
[09:20] pieterh
|
I have 6 minutes then need to leave :-)
|
[09:20] pieterh
|
I mean, push the commit...
|
[09:20] mikko
|
pushed
|
[09:21] mikko
|
let me know if something breaks. i'll test solaris today as well
|
[09:22] pieterh
|
Tests passed OK
|
[09:22] pieterh
|
PASS: zfl_selftest
|
[09:22] pieterh
|
=============
|
[09:22] pieterh
|
1 test passed
|
[09:22] pieterh
|
=============
|
[09:22] pieterh
|
that's on Ubuntu
|
[09:22] pieterh
|
nice stuff!
|
[09:22] mikko
|
good stuff
|
[09:22] mikko
|
now i can add the gcov script and make the daily builds do this properly
|
[09:36] sustrik
|
mikko: morning
|
[09:36] sustrik
|
win7/msvc build seems to be failing
|
[09:36] sustrik
|
when I build on my XP/msvc there's no problem
|
[09:37] sustrik
|
is the source up to date there?
|
[09:37] sustrik
|
maybe it has something to do with pgm?
|
[09:37] sustrik
|
hm
|
[09:39] sustrik
|
aha, that's probably it
|
[09:44] sustrik
|
here it is: types.h:44: # define bool BOOL
|
[09:57] mikko
|
yes
|
[09:57] mikko
|
the pgm folder keeps disappearing for some reason
|
[09:57] mikko
|
i think jenkins cleans the workspace at some point
|
[09:58] sustrik
|
it's a different problem
|
[09:58] sustrik
|
i've already sent an email about it to openpgm mailing list
|
[10:00] mikko
|
ah good
|
[10:00] mikko
|
so maybe the file permissions are working after all
|
[10:00] mikko
|
if you've got time at some point can you test this branch https://github.com/mkoppanen/zeromq2/tree/openpgm-autoconf
|
[10:00] mikko
|
./configure --with-pgm
|
[10:01] mikko
|
you should see during configure that it invokes openpgm configure
|
[10:01] mikko
|
and everything works like magic
|
[10:02] sustrik
|
i have ubuntu here
|
[10:02] sustrik
|
would testing on that help you?
|
[10:05] mikko
|
yeah, if you can
|
[10:05] mikko
|
i've been only running it on my local vm
|
[10:05] mikko
|
where everything seems to work ok
|
[10:06] sustrik
|
ok, wait a sec
|
[10:06] mikko
|
sun studio complains about same thing as msvc
|
[10:07] sustrik
|
yes, same problem
|
[10:13] sustrik
|
mikko: tested
|
[10:13] sustrik
|
builds ok
|
[10:14] mikko
|
good
|
[10:14] mikko
|
so it's ready(ish)
|
[10:14] sustrik
|
nice
|
[10:14] mikko
|
todo list emptying faster than i hoped
|
[10:33] yrashk
|
I am confused now
|
[10:33] yrashk
|
in PUB/SUB, who should connect and who should bind?
|
[10:33] mikko
|
yrashk: doesnt matter
|
[10:34] yrashk
|
looks like it works either way
|
[10:34] mikko
|
yes
|
[10:34] yrashk
|
mikko: that's what I thought, thanks
|
[10:34] yrashk
|
I just received a patch "fixing" this in my tests
|
[10:35] yrashk
|
and it got me puzzled because I never even thought about it
|
[11:53] mikko
|
heyo
|
[11:53] pieterh
|
heyo, mikko
|
[11:54] mikko
|
q: would it be useful to have a sockopt to prevent durable subscribers on the server side?
|
[11:54] pieterh
|
IMO yes
|
[11:54] mikko
|
currently the server side is pretty vulnerable to DoS
|
[11:54] mikko
|
connect tons of clients with identities, disconnect them and it should run out of memory
|
[11:54] pieterh
|
Even more so, have a sockopt that *allows* this
|
[11:54] mikko
|
and another thing is removing subscriptions
|
[11:54] pieterh
|
Or else limit nbr of durable peers
|
[11:55] mikko
|
or controlling their lifetime
|
[11:55] pieterh
|
indeed
|
[11:55] mikko
|
as in "if peer has missed N messages consider it dead"
|
[11:55] pieterh
|
well, that's the whole point of durable sockets
|
[11:55] mikko
|
or "if peer hasn't been back in 2 hours consider it dead"
|
[11:55] pieterh
|
peer can go away for a long time
|
[11:55] mikko
|
i mean controllable time for durability
|
[11:56] pieterh
|
from a paranoid POV, I'd like
|
[11:56] pieterh
|
- default HWM for durable sockets
|
[11:56] mikko
|
that way you don't have to worry about restarting the server if you remove a durable subscriber
|
[11:56] pieterh
|
- default limit on number of those sockets
|
[11:56] pieterh
|
- default limit on total memory used by durable socket queues
|
[11:57] pieterh
|
good topic for discussion on list IMO
|
[11:57] mikko
|
- ability to remove durable subscription explicitly
|
[11:57] mikko
|
like remove subscription of "company ABC" while keeping others
|
[11:57] pieterh
|
- timeout on durable sockets
|
[11:57] mikko
|
yes
|
[11:58] mikko
|
one of the important features i could see for 2.1.0 is not failing on an invalid connection uri
|
[11:59] pieterh
|
we should list the outstanding 'bugs' in 2.1.0
|
[12:00] mikko
|
is atlassian stack overkill for us?
|
[12:00] mikko
|
they give licenses to open source as far as i know
|
[12:00] mikko
|
jira/confluence/etc
|
[12:00] pieterh
|
oh, I'd rather not
|
[12:00] pieterh
|
we used to use Jira for all issue tracking
|
[12:00] pieterh
|
it is a great, fantastic product
|
[12:01] pieterh
|
you just have to pay someone to reboot the *@%@$E# server once a week
|
[12:01] pieterh
|
i would 10x rather use github's simple but maintenance free issue tracking
|
[12:02] pieterh
|
anyhow, I was thinking of a wiki page, like the 2.0 roadmap
|
[12:02] pieterh
|
*3.0
|
[12:02] mikko
|
what i would like is somehow automatically assign issues to roadmap milestones
|
[12:02] mikko
|
i wonder if zfl tests should be broken into separate files
|
[12:03] mikko
|
currently it shows "1 test succeeded"
|
[12:03] pieterh
|
regarding issue tracking, discuss on list, it's too contentious
|
[12:04] pieterh
|
remember our process is not driven by issues but by patches
|
[12:04] pieterh
|
for zfl tests, multiple executables would work for me, sure...
|
[12:04] pieterh
|
it's more work to maintain though
|
[12:05] pieterh
|
when we have four files for each class I start to get tempted by code generation
|
[12:05] pieterh
|
and that gets ugly, you don't want to see that :-)
|
[12:06] yrashk
|
hey pieterh_
|
[12:06] pieterh
|
hi yrashk
|
[12:07] yrashk
|
mikko: I use open source license for bamboo, nice stuff
|
[12:11] mikko
|
i find jenkins a lot better than bamboo
|
[12:11] mikko
|
especially for distributed builds
|
[12:17] yrashk
|
hmm maybe I should add a select() based active mode for ezmq.. not sure if this will help with the performance issue, though
|
[12:22] yrashk
|
or rather a poll() one
|
[12:44] sustrik
|
mikko, pieterh_: re durable subscribers: +1
|
[12:44] sustrik
|
i would even remove the identity option altogether
|
[13:02] tormaroe
|
Just found 0MQ, and I'm really excited. Built from source on Windows without problems. Now I want to install the Ruby gem, but I'm having problems.
|
[13:02] tormaroe
|
Trying gem install zmq -- --with-zmq-dir=c:\zeromq
|
[13:02] tormaroe
|
but still getting ERROR: Failed to build gem native extension. extconf.rb:25: Couldn't find zmq library. try setting --with-zmq-dir=<path> to tell me where it is. (RuntimeError)
|
[13:02] tormaroe
|
please help :)
|
[13:03] tormaroe
|
I just copied the build output to my c:\zeromq. No other "installation" required, right?
|
[13:05] sustrik
|
no
|
[13:06] sustrik
|
you need just the library and the header file
|
[13:07] tormaroe
|
I have no clue about c++, so what's the header file?
|
[13:07] tormaroe
|
got dll, exp, ilk, lib and pdb
|
[13:09] yrashk
|
it looks like I forgot a lot about poll()-related matters
|
[13:10] yrashk
|
my recv blocks after receiving a ZMQ_POLLIN revent
|
[13:10] yrashk
|
eh
|
[13:12] yrashk
|
don't quite understand how this could happen, but likely my 5am bug :D
|
[13:16] CIA-21
|
zeromq2: Martin Sustrik master * r17e2ca7 / (5 files):
|
[13:16] CIA-21
|
zeromq2: Logging of duplicit identities added
|
[13:16] CIA-21
|
zeromq2: Signed-off-by: Martin Sustrik <sustrik@250bpm.com> - http://bit.ly/dXqok7
|
[13:16] sustrik
|
tormaroe: zmq.h
|
[13:17] sustrik
|
yrashk: that should not happen
|
[13:21] tormaroe
|
Added zmq.h from the include folder and re-ran the gem command, but getting the same error :(
|
[13:21] sustrik
|
try asking on the mailing list, i am not a ruby expert
|
[13:21] tormaroe
|
ok, thanks anyway
|
[13:24] yrashk
|
sustrik: yeah I figured that out already and fixed the bug
|
[13:34] yrashk
|
doing poll & recv at the same time is probably an insanely bad idea, after all :D
|
[15:05] yrashk
|
sustrik: pieterh: is it ok that send() with ZMQ_NOBLOCK takes roughly 10 usec?
|
[15:06] yrashk
|
I am profiling my C code
|
[15:06] sustrik
|
raw zmq_send()?
|
[15:06] yrashk
|
ya
|
[15:06] sustrik
|
that's too much
|
[15:06] sustrik
|
should be below 1 us
|
[15:07] yrashk
|
https://gist.github.com/af07bbd989ec1f7e659c
|
[15:08] yrashk
|
appended some typical tv.tv_usec - tv1.tv_usec printouts to that gist
|
[15:08] yrashk
|
it looks more like 7 us
|
[15:08] yrashk
|
but still
|
[15:08] sustrik
|
have you tried to measure zmq_send() itself?
|
[15:08] sustrik
|
there's some other work being done there
|
[15:08] yrashk
|
well that if condition is false
|
[15:09] sustrik
|
ah
|
[15:09] sustrik
|
ok
|
[15:09] yrashk
|
but I can try
|
[15:09] sustrik
|
what OS are you on?
|
[15:09] yrashk
|
osx
|
[15:09] sustrik
|
how long does gettimeofday take on osx?
|
[15:09] yrashk
|
still 5+ us at least
|
[15:09] yrashk
|
with no if
|
[15:09] yrashk
|
no idea
|
[15:10] sustrik
|
it tends to be slow
|
[15:10] sustrik
|
it's kind of better with linux nowadays
|
[15:10] sustrik
|
they've got it down to something like 1us
|
[15:10] sustrik
|
but before it was much worse
|
[15:10] sustrik
|
no idea about osx though
|
[15:11] sustrik
|
easiest way to measure is to make 1M zmq_send()s
|
[15:11] sustrik
|
measure the whole thing
|
[15:11] sustrik
|
and divide it by 1000000
|
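sustrik's recipe as a sketch: time the whole batch once so gettimeofday()'s own cost is amortized away (the PUSH socket, endpoint and 32-byte messages are arbitrary choices here; 2.x API):

```c
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <zmq.h>

int main (void)
{
    const int count = 1000000;
    void *ctx = zmq_init (1);
    void *push = zmq_socket (ctx, ZMQ_PUSH);
    /* Assumes a PULL peer is bound at this endpoint; otherwise the
       PUSH socket blocks waiting for one. */
    zmq_connect (push, "tcp://127.0.0.1:5555");

    struct timeval start, end;
    gettimeofday (&start, NULL);
    for (int i = 0; i < count; i++) {
        zmq_msg_t msg;
        zmq_msg_init_size (&msg, 32);
        memset (zmq_msg_data (&msg), 0, 32);
        zmq_send (push, &msg, 0);      /* 2.x: (socket, msg, flags) */
        zmq_msg_close (&msg);
    }
    gettimeofday (&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    printf ("%.2f us per zmq_send\n", usec / count);

    zmq_close (push);
    zmq_term (ctx);
    return 0;
}
```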
[15:12] yrashk
|
apparently _lat tests are good on ezmq
|
[15:12] yrashk
|
well that's what I do
|
[15:13] yrashk
|
(07:12) <evaxsoftware> _lat results:
|
[15:13] yrashk
|
(07:12) <evaxsoftware> local_lat remote_lat mean latency real/usr/sys
|
[15:13] yrashk
|
(07:12) <evaxsoftware> C C 41 8.2/0.8/3.7
|
[15:13] yrashk
|
(07:12) <evaxsoftware> ezmq ezmq 47 9.5/2.6/2.3
|
[15:13] yrashk
|
(07:13) <evaxsoftware> pyzmq pyzmq 48 12.7/1.8/3.7
|
[15:13] yrashk
|
.win 26
|
[15:13] yrashk
|
oops
|
[15:13] yrashk
|
(07:14) <evaxsoftware> it's efficient (completes nearly as fast as the C version)
|
[15:13] yrashk
|
apparently better than pyzmq
|
[15:15] DarkGod
|
hello, I'm sorry but I am afraid I have a few, probably stupid, questions. I read the manual and wanted to try the auto-reconnect stuff. I took the simple REQ/REP C examples and compiled them; if I run the client and then the server it's all dandy, yet if I kill the server while the client is running and then restart the server, it does not seem to reconnect
|
[15:15] DarkGod
|
it should no? or did I misunderstand ?
|
[15:16] sustrik
|
yrashk: nice
|
[15:17] yrashk
|
so I am not sure why thr is so bad
|
[15:17] sustrik
|
throughput is a silly metric
|
[15:17] sustrik
|
it exhibits large oscillations due to minor causes
|
[15:18] sustrik
|
shrug
|
[15:19] sustrik
|
DarkGod: it probably reconnects, but the current request is lost
|
[15:19] sustrik
|
if you need reliability you should implement it on top of 0mq
|
[15:19] sustrik
|
iirc, the guide explains that
|
[15:20] DarkGod
|
ah, so I should put a timeout on the send on the REQ side so that it can fail
|
[15:20] sustrik
|
yes
|
[15:20] sustrik
|
then you can resend the request if needed
|
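A sketch of that timeout-and-resend shape (what the guide calls the Lazy Pirate pattern): poll for the reply with a deadline, and on timeout throw the REQ socket away, since a REQ socket that has sent and not yet received refuses another send. `timeout_us` and the caller's resend loop are assumed:

```c
#include <zmq.h>

/* Wait up to 'timeout_us' for the reply to an outstanding request.
   Returns 1 if a reply arrived, 0 on timeout.  On timeout the caller
   must close and recreate the REQ socket before resending the request. */
static int wait_for_reply (void *req, long timeout_us)
{
    zmq_pollitem_t items [1] = {{ req, 0, ZMQ_POLLIN, 0 }};
    int rc = zmq_poll (items, 1, timeout_us);  /* 2.x: timeout in usec */
    if (rc <= 0)
        return 0;       /* timed out (errors treated the same here) */

    zmq_msg_t reply;
    zmq_msg_init (&reply);
    zmq_recv (req, &reply, 0);
    zmq_msg_close (&reply);
    return 1;
}
```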
[15:20] DarkGod
|
I see
|
[15:21] DarkGod
|
thanks :)
|
[15:21] sustrik
|
np
|
[15:22] DarkGod
|
next silly question: the pub/sub sockets look neat, but in some cases I want to address a specific client. I could use filtering, but an ngrep of the ethernet interface shows me that all data is pushed to clients and that they then do the filtering; as I might have large data going through, I really do not want it transmitted to all
|
[15:23] DarkGod
|
clients when it is actually only meant for one
|
[15:23] DarkGod
|
I must use pairs in this case? (setting up a pair per client ?)
|
[15:27] sustrik
|
there's work in progress wrt filtering on the PUB side
|
[15:28] sustrik
|
(sub-forward branch)
|
[15:28] sustrik
|
you can help to finish that work
|
[15:29] DarkGod
|
I'm afraid I dont know zmq code quite well enough, but yeah that's what I'd need I imagine
|
[15:31] DarkGod
|
how advanced is this branch/what needs to be done ?
|
[15:32] sustrik
|
check the mailing list archives
|
[15:32] sustrik
|
there's lots of related discussion there
|
[15:32] DarkGod
|
ok :)
|
[15:32] sustrik
|
so far the subscriptions are propagated up the distribution tree
|
[15:32] sustrik
|
what's missing is actual filtering
|
[15:56] yrashk
|
https://gist.github.com/a3fd747d0bd7b3d61aba
|
[15:56] yrashk
|
sustrik: ^^ another bizarre crash :) now I don't even see any out-of-bounds pointers :-D
|
[15:57] yrashk
|
does this trace tell you anything worth checking out?
|
[16:12] sustrik
|
yrashk: what's the error?
|
[16:13] sustrik
|
SEGFAULT?
|
[16:13] yrashk
|
yes
|
[16:13] sustrik
|
segfault in glibc
|
[16:13] sustrik
|
nasty
|
[16:13] yrashk
|
I can show the whole socket printout
|
[16:13] yrashk
|
if you want
|
[16:13] sustrik
|
no need
|
[16:13] sustrik
|
is it reproducible?
|
[16:14] yrashk
|
not every time
|
[16:14] yrashk
|
and not on every computer
|
[16:14] yrashk
|
but it is
|
[16:14] sustrik
|
hm
|
[16:14] yrashk
|
just in case https://gist.github.com/dfc2e7eea5c89673ea74
|
[16:14] yrashk
|
:)
|
[16:14] sustrik
|
erlang, hm
|
[16:14] sustrik
|
:|
|
[16:14] yrashk
|
crash happens both on osx and linux
|
[16:15] yrashk
|
I don't think it has *anything* to do with erlang per se
|
[16:15] yrashk
|
this code is in a separate thread that Erlang has little to no idea about
|
[16:19] sustrik
|
yrashk: any chance to reproduce the problem in C?
|
[16:20] yrashk
|
maybe
|
[16:20] yrashk
|
but I have no idea how
|
[16:22] yrashk
|
(yet)
|
[16:22] sustrik
|
what's the OS btw?
|
[16:25] yrashk
|
both osx and linux
|
[16:25] yrashk
|
very similar crash
|
[16:38] yrashk
|
are there any circumstances under which the context gets terminated implicitly, sustrik?
|
[16:38] sustrik
|
yrashk: no
|
[16:38] sustrik
|
you have to terminate by calling zmq_term()
|
[16:38] sustrik
|
it looks like a bug in 0mq anyway
|
[16:39] sustrik
|
so the goal now is to make a reproducible test case
|
[16:39] sustrik
|
and to fix it
|
[16:39] yrashk
|
do you think it is a 0mq bug?
|
[16:39] yrashk
|
huh
|
[16:39] yrashk
|
I was getting segfaults in recvfrom() before
|
[16:40] yrashk
|
last time it was due to an accidental overwrite of the context
|
[16:40] sustrik
|
it's kind of strange
|
[16:40] sustrik
|
yes, it looks like memory overwrite
|
[16:40] sustrik
|
either by 0mq or ezmq or erlang itself
|
[16:41] sustrik
|
that's why a C test case would help
|
[16:41] sustrik
|
that would make it clear that the problem is in 0mq
|
[16:42] yrashk
|
yeah I know
|
[16:42] yrashk
|
it's barely reproducible here on my laptop
|
[16:42] sustrik
|
what's the use case?
|
[16:42] yrashk
|
much more frequent on another guy's linux box
|
[16:42] yrashk
|
a second thread recv'ing on a pull socket, waiting for a command to recv on another socket
|
[16:44] yrashk
|
sustrik: does the context change over time or is it immutable?
|
[16:46] sustrik
|
you mean the internals of the context?
|
[16:46] yrashk
|
ya
|
[16:46] sustrik
|
yes, there's a list of inproc endpoints for example
|
[16:46] sustrik
|
a table of open sockets
|
[16:46] sustrik
|
and similar
|
[16:46] yrashk
|
that might be the case
|
[16:46] yrashk
|
might any zmq operation change the context?
|
[16:47] sustrik
|
creating a socket
|
[16:47] sustrik
|
closing a socket
|
[16:47] sustrik
|
binding to inproc endpoint
|
[16:48] yrashk
|
that's it?
|
[16:48] yrashk
|
we do all three
|
[16:48] yrashk
|
;)
|
[16:48] yrashk
|
and definitely bind to inproc
|
[16:49] yrashk
|
in that context
|
[16:49] yrashk
|
in fact it is only used for inproc
|
[16:50] sustrik
|
yes, but what's the deal?
|
[16:50] sustrik
|
why should changing the context be a problem?
|
[16:50] yrashk
|
"As soon as you write towards a shared state either through static variables or enif_priv_data you need to supply your own explicit synchronization. "
|
[16:51] yrashk
|
this is from NIF documentation
|
[16:51] yrashk
|
that context that we use there is a static variable
|
[16:52] sustrik
|
but it's only a pointer to context, right?
|
[16:52] sustrik
|
which never changes
|
[16:52] yrashk
|
yup
|
[16:52] sustrik
|
the context itself is threadsafe
|
[16:52] sustrik
|
so it should be ok imo
|
[16:53] yrashk
|
we're just trying to find any possible explanation for the crash :)
|
[16:55] yrashk
|
ok, bed time
|
[16:55] yrashk
|
I am exhausted and need to get up in 3 hours :]
|
[16:57] sustrik
|
good god
|
[16:57] sustrik
|
see you later then
|
[16:58] CIA-21
|
zeromq2: Martin Sustrik master * r12486fe / (src/pgm_socket.hpp src/zmq.cpp):
|
[16:58] CIA-21
|
zeromq2: Fix MSVC and SunStudio builds with OpenPGM
|
[16:58] CIA-21
|
zeromq2: Signed-off-by: Martin Sustrik <sustrik@250bpm.com> - http://bit.ly/hxuwyD
|
[17:36] lt_schmidt_jr
|
sustrik: for inproc sub, are durable sockets truly durable?
|
[17:37] sustrik
|
what's durable?
|
[17:38] lt_schmidt_jr
|
http://zguide.zeromq.org/chapter:all#toc37
|
[17:38] lt_schmidt_jr
|
sockets with explicit identity
|
[17:48] sustrik
|
and what's "truly durable"? :)
|
[17:50] lt_schmidt_jr
|
sustrik: I mean that if the connection is closed I would not lose messages, given that there are no network buffers
|
[17:51] lt_schmidt_jr
|
sustrik: I wonder if with inproc the client sub buffer gets filled
|
[17:52] sustrik
|
when you close the SUB side
|
[17:52] sustrik
|
the buffer on the PUB side is filled
|
[17:53] sustrik
|
(SUB buffer doesn't exist at the moment)
|
[17:53] lt_schmidt_jr
|
ah, so this may actually work for me
|
[17:54] sustrik
|
i don't know how well inproc works with identities
|
[17:54] sustrik
|
you have to test it yourself, i would say
|
[17:54] lt_schmidt_jr
|
ah
|
[17:54] sustrik
|
the problem with inproc is that there are still pieces missing
|
[17:54] sustrik
|
like reconnect, for example
|
[17:55] sustrik
|
not sure about identities
|
[17:55] lt_schmidt_jr
|
did not realize that
|
[17:56] lt_schmidt_jr
|
the reason I am asking is that I have an inproc forwarder connected to other servers symmetrically
|
[17:57] lt_schmidt_jr
|
and I have sub sockets that I keep on behalf of web clients, and if the web clients disconnect, I keep the sub socket for a bit in case they reconnect
|
[17:58] lt_schmidt_jr
|
I am thinking to use identity to have zmq do that for me
|
[17:58] lt_schmidt_jr
|
but sounds like that may not be the best approach currently
|
[18:55] bhuga
|
i'm getting an assertion failure while using zeromq from the ruby FFI: Assertion failed: inpipes [current_in].active (xrep.cpp:229). Is that indicative of something I'm doing wrong, or a bug?
|
[20:05] cremes
|
bhuga_: can you pastie your code that's causing that? it may be a bug in 0mq
|
[20:29] bhuga
|
it is kind of convoluted :/, pretty far from a minimal test case. it is also not deterministic, happening for sure, but not always after the same number of messages have passed. i have an strace running up to it, if that helps?
|
[21:02] cremes
|
bhuga_: is your code similar to the setup described in this ticket?
|
[21:02] cremes
|
https://github.com/zeromq/zeromq2/issuesearch?state=open&q=xrep#issue/164
|
[21:02] cremes
|
that raises the same assertion as your code
|
[21:03] cremes
|
i am wondering if the configuration of sockets & threads is similar in your case
|
[21:03] bhuga
|
mine is simpler
|
[21:03] bhuga
|
but still difficult to hand over for reproduction
|
[21:04] cremes
|
how many sockets are you using?
|
[21:04] cremes
|
i.e. how many xrep, how many xreq?
|
[21:04] bhuga
|
the ruby end is sitting on an xrep socket, one thread (ruby, after all) and the other end is common lisp via ffi on a req-rep
|
[21:04] cremes
|
ok
|
[21:04] cremes
|
are they both using the same build of libzmq?
|
[21:04] bhuga
|
yes, 2.1.0
|
[21:05] cremes
|
2.1.0 from the tarball or 2.1.0 from github master?
|
[21:05] bhuga
|
yowza, good question, i didnt write that down when i put it into puppet last month. probably the tarball.
|
[21:05] cremes
|
ok
|
[21:05] cremes
|
i would recommend trying master and seeing if the problem persists
|
[21:06] bhuga
|
okay, i can do that. will be back in 15 or 20
|
[21:06] cremes
|
there have been many bug fixes since the 2.1.0 tarball was cut 2 or 3 months ago
|
[21:06] cremes
|
k
|
[21:07] bhuga
|
(about to try it, but my buddy on the lisp end, debugging something else, just realized that putting 1 second between calls fixes it)
|
[21:10] cremes
|
bhuga_: yuck, that's a sucky fix :)
|
[21:10] bhuga
|
yeah, it kinda defeats the point of a sweet high-performance message queue :)
|
[21:10] cremes
|
definitely
|
[21:11] cremes
|
i have dozens of xrep/xreq sockets doing very high volume communications (from ruby) and i haven't hit that particular assertion
|
[21:11] cremes
|
(though i was tearing my hair out about another one that just got fixed a few days back)
|
[21:12] bhuga
|
this one has us pulling our hair out, i have to admit
|
[21:12] cremes
|
heh
|
[21:12] bhuga
|
its pretty much brought us to a standstill
|
[21:12] cremes
|
let me know how it goes after the update
|
[21:12] bhuga
|
figuring out how to make autoconf work on head still :)
|
[21:13] cremes
|
what os are you on?
|
[21:13] bhuga
|
ubuntu 10.10
|
[21:13] cremes
|
ko
|
[21:13] cremes
|
er, ok
|
[21:15] bhuga
|
neither autoreconf --install nor autoconf is doing the trick :/
|
[21:16] bhuga
|
(then i found autogen.sh, not mentioned in the INSTALL :) )
|
[21:25] bhuga
|
(initial results with the new version are quite promising. thanks!)
|
[21:33] cremes
|
bhuga_: good to hear
|
[21:33] bhuga
|
we are still getting it but its more deterministic now
|
[21:34] bhuga
|
makes me think we're using it wrong now as opposed to weird bugs
|
[21:34] cremes
|
interesting
|
[21:34] cremes
|
a reproducible case would be a great addition to that ticket i referenced earlier
|
[21:35] bhuga
|
well you know how testing these things goes
|
[21:36] bhuga
|
by definition they're made to work with different languages, daemons, etc
|
[21:36] bhuga
|
if we can make a minimal case we will
|
[21:36] cremes
|
cool
|
[21:36] bhuga
|
(though i suspect if i could i could just fix it)
|
[21:36] cremes
|
:)
|
[21:44] bhuga
|
more testing reveals some more non-deterministic failures :( but i guess if nobody's heard of it before we'll have to figure it out
|
[21:45] cremes
|
bhuga_: do you have any QUEUE devices in between your clients & servers?
|
[21:45] cremes
|
if i understand your setup a bit, perhaps i can suggest a place to start looking
|
[21:46] bhuga
|
its really simple
|
[21:46] bhuga
|
the two are on the same machine, talking over an IPC (it might actually be ITC, i'd need to check) socket
|
[21:46] bhuga
|
REQ-REP from lisp talking to XREP on ruby
|
[21:47] bhuga
|
no other devices, no shenanigans
|
[21:47] cremes
|
ok
|
[21:47] bhuga
|
the general feel of the upgrade is that it happens less, but it still happens
|
[21:47] cremes
|
what do you mean by REQ-REP from lisp? is a req socket talking to *both* the rep lisp and xrep ruby sockets?
|
[21:47] bhuga
|
it's doing a request-reply pattern, sending a request, blocking waiting for a reply
|
[21:48] bhuga
|
the ruby bit is data-driven, receiving that message and sending a response
|
[21:48] cremes
|
ok, so you have a single req socket in the lisp program talking to a single xrep socket in ruby
|
[21:49] cremes
|
and only the ruby side crashes?
|
[21:49] bhuga
|
only the ruby side crashes, correct
|
[21:51] cremes
|
does it crash on recv or on send? and are these recv/sends blocking or non-blocking?
|
[21:52] bhuga
|
i dont actually know. it's hard to debug the assert. i have an strace of it
|
[21:52] bhuga
|
and ruby's not doing any syscalls on its own, i can say
|
[21:52] cremes
|
ok
|
[21:52] bhuga
|
(apologies for spam:)
|
[21:52] bhuga
|
send(5, "\220#.\r\n\0\0\0\377\377\377\377\1\0\0\0\fQ\200\r\0302\236\f", 24, 0) = 24
|
[21:52] bhuga
|
recv(15, "\230P\200\r\2\0\0\0h\201E\r\24\0\0\0t\201E\r\364\357\f\267", 24, MSG_DONTWAIT) = 24
|
[21:52] bhuga
|
recv(15, "\230P\200\r\2\0\0\0\30#\6\r\330\264\251\10t\201E\r\364\357\f\267", 24, MSG_DONTWAIT) = 24
|
[21:52] bhuga
|
recv(15, "\230P\200\r\t\0\0\0h\201E\r\364\277\20\267\30#\6\r\3\0\0\0", 24, MSG_DONTWAIT) = 24
|
[21:52] bhuga
|
send(5, "h\201E\r\n\0\0\0\377\377\377\377\0\0\0\0\fQ\200\r\0302\236\f", 24, 0) = 24
|
[21:52] bhuga
|
recv(15, "\230P\200\r\4\0\0\0p\323o\r\210\34-\r\21\0\0\0\10O\266\v", 24, MSG_DONTWAIT) = 24
|
[21:52] bhuga
|
recv(15, "\250\335S\r\7\0\0\0\300\223M\267\364\177M\267\300\223M\267\30#\6\r", 24, MSG_DONTWAIT) = 24
|
[21:52] bhuga
|
send(5, "X\1_\r\10\0\0\0\300\223M\267\364\177M\267\300\223M\267\230P\200\r", 24, 0) = 24
|
[21:52] bhuga
|
recv(15, "\240\345<\r\5\0\0\0\300\223M\267\364\177M\267\300\223M\267\30#\6\r", 24, MSG_DONTWAIT) = 24
|
[21:52] bhuga
|
recv(15, "\230P\200\r\v\0\0\0\324\201E\r\0\0\0\0\n\0\0\0\364\277\20\267", 24, MSG_DONTWAIT) = 24
|
[21:52] bhuga
|
recv(15, 0xbfe7b514, 24, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
|
[21:52] cremes
|
can you puts a debug statement around the ruby send/recv calls to see which one is active during the crash?
|
[21:52] bhuga
|
send(5, "p<\350\f\7\0\0\0\300\223M\267\364\177M\267\300\223M\267\364\277\20\267", 24, 0) = 24
|
[21:52] bhuga
|
recv(15, "\240\345<\r\10\0\0\0h\201E\r\0\0\0\0\n\0\0\0\364\277\20\267", 24, MSG_DONTWAIT) = 24
|
[21:53] cremes
|
bhuga_: use pastie.org or gist.github.com for anything over 2 lines, pls; that's really hard to read in the channel
|
[21:53] bhuga
|
yeah http://pastie.org/1580392
|
[21:53] bhuga
|
i got throttled anyway
|
[21:53] cremes
|
can you puts a debug statement around the ruby send/recv calls to see which one is active during the crash?
|
[21:54] bhuga
|
based on the assertion, it seems to be receiving
|
[21:54] bhuga
|
it's asserting that it shouldn't be doing what it's about to do unless the message has more parts
|
[21:54] cremes
|
ok
|
[21:55] cremes
|
how are you forming the messages on the lisp end for transmission? are they simple strings?
|
[21:55] cremes
|
encoded as json? protobufs, etc?
|
[21:55] bhuga
|
that i know less about :/
|
[21:55] cremes
|
hell, s-expressions?
|
[21:55] cremes
|
ok
|
[21:55] bhuga
|
im finding out
|
[21:56] bhuga
|
utf-8 encoded, null-terminated strings
|
[21:56] cremes
|
ok
|
[21:56] bhuga
|
the content of them does not appear to matter
|
[21:57] bhuga
|
in that a given message (which we send from the lisp end, and reproduce) can be sent any number of times
|
[21:57] bhuga
|
until the random crash
|
[21:58] bhuga
|
(after playing with it, it seems a 50-100ms delay is sufficient to prevent the issue from appearing in simple tests)
|
[21:58] cremes
|
can you try a rep socket on the ruby end instead of xrep? it doesn't sound like you need the flexibility of an xrep anyway
|
[21:58] bhuga
|
hrm. i can try it, i think.
|
[21:58] bhuga
|
it will take me a minute
|
[21:58] cremes
|
ok
|
[21:59] cremes
|
with a rep socket you don't have to worry about creating the null message delimiter and such
|
[21:59] bhuga
|
(though the xrep is kind of what you need on ruby--we're currently only doing one task at a time but that's not the goal, and blocking is the devil)
|
[21:59] cremes
|
yeah, but how does xrep help in that case? it's blocking too
|
[22:00] cremes
|
you can use both rep and xrep with ZMQ::NOBLOCK; it isn't limited to just xrep
|
[22:00] bhuga
|
i guess i need to keep reading (i have taken this bug over from someone else)
|
[22:00] cremes
|
can you pastie the ruby method that does the recv and the one that does the send?
|
[22:01] cremes
|
i can probably help with the socket switch out... i wrote the ffi-rzmq gem so i'm pretty familiar with this stuff
|
[22:02] bhuga
|
okay, i can do that
|
[22:03] bhuga
|
http://pastie.org/1580429
|
[22:04] bhuga
|
there is a lot of silly metaprogramming going on
|
[22:04] bhuga
|
but the log_req in that method never happens
|
[22:04] bhuga
|
i can toss in a debug to make sure that it's getting to that socket.recv_string, which i suspect is the line
|
[22:05] cremes
|
right
|
[22:05] cremes
|
do you set any options on this socket?
|
[22:05] cremes
|
e.g. socket.setsockopt(option, value)
|
[22:05] bhuga
|
the metaprogramming thing again. i'll try and output the current options in my forthcoming debug addition
|
[22:06] cremes
|
ok
|
[22:06] cremes
|
btw, nothing you are doing there requires an xrep socket
|
[22:06] bhuga
|
i believe you :)
|
[22:07] cremes
|
but you do make things more complicated for yourself when you have to send the reply
|
[22:07] bhuga
|
and i will change it if need be
|
[22:07] bhuga
|
im afraid to change it just now since if we are doing the null-terminator thing that would perhaps no longer be correct?
|
[22:07] cremes
|
right
|
[22:07] cremes
|
you need to save the "return envelope" for the reply
|
[22:08] cremes
|
and it is separated from the body of your messages by an empty/null message
|
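What that looks like on the XREP side, sketched against the 2.x C API: read parts until ZMQ_RCVMORE goes to zero, treating everything up to and including the first empty part as the return envelope to replay (with ZMQ_SNDMORE) in front of the reply body:

```c
#include <stdint.h>
#include <zmq.h>

/* Receive one logical request on an XREP socket.  Parts arrive as
   [identity]...[empty][body]; everything up to and including the empty
   delimiter is the return envelope and must be sent back, with
   ZMQ_SNDMORE, in front of the reply body. */
static void recv_request (void *xrep)
{
    int64_t more;
    size_t more_size = sizeof more;
    do {
        zmq_msg_t part;
        zmq_msg_init (&part);
        zmq_recv (xrep, &part, 0);
        zmq_getsockopt (xrep, ZMQ_RCVMORE, &more, &more_size);
        /* ... stash 'part': it is envelope until the first zero-length
           part, request body after it ... */
        zmq_msg_close (&part);
    } while (more);
}
```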
[22:08] bhuga
|
is there a nice way to get all of the socket options?
|
[22:08] cremes
|
search the code for setsockopt
|
[22:09] cremes
|
if you don't find it, then you haven't set any beyond the defaults
|
[22:09] bhuga
|
sockets[shard_id].setsockopt(ZMQ::LINGER, -1)
|
[22:09] bhuga
|
which i think fixed a bug we had on exiting
|
[22:10] cremes
|
yeah, it prevents the socket from closing until all packets are flushed
|
[22:10] bhuga
|
sounds about right, i think it was giving all kinds of errors when we tried to exit
|
[22:10] cremes
|
just for kicks, can you comment that out and run your test? it crashes before it's done anyway, right?
|
[22:10] bhuga
|
yeah, i can do that
|
[22:13] bhuga
|
i dont think the linger is on the server
|
[22:13] bhuga
|
i think this is from a (being replaced by lisp) ruby client
|
[22:13] bhuga
|
could a server expecting a lingering client cause this behavior?
|
[22:14] cremes
|
no
|
[22:14] cremes
|
so, i think it is worthwhile for you to figure out how to swap the xrep for a rep socket
|
[22:15] cremes
|
xrep is a lower-level 0mq socket so it is trickier to work with
|
[22:15] cremes
|
from what i have seen, you don't need it; a rep socket and an xrep socket both 'block' the same way when sending/recving
|
[22:16] cremes
|
you only use xrep when you want to break the strict send/recv/send/recv REQ-REP pattern
|
[22:17] bhuga
|
well, it will need to, of sorts. eventually the 'client' in this case will send out one request and get results of initially-unknown length back from n different workers
|
[22:17] bhuga
|
but not *yet*, necessarily :)
|
[22:18] cremes
|
ok
|
[22:18] bhuga
|
but i'll investigate that further and plug in some debug to make sure i know exactly what line of ruby is causing it
|
[22:18] cremes
|
but remember the YAGNI principle :)
|
[22:18] bhuga
|
then update the issue
|
[22:18] cremes
|
cool
|
[22:18] bhuga
|
if i pin it down
|
[22:18] bhuga
|
thanks for your time anyway
|
[22:18] cremes
|
sure
|
[22:19] bhuga
|
best gem author ever
|