Friday, February 18, 2011

[Time] Name Message
[00:14] gui81 anyone there
[00:15] gui81 is it possible to do this:
[00:16] gui81 I am trying to publish to tcp://*:5555 and subscribe to tcp://localhost:5555, but it doesn't seem to work
[00:17] gui81 I can get the Hello World from pub_thread and Hello World from sub_thread messages to print out, but it doesn't seem to publish from pub_main or subscribe from sub_main
[00:27] lt_schmidt_jr Regarding the named sockets - if I have a publisher with named subscribers that are connected to it. When a subscriber goes away and reconnects, is there any way to know if the subscriber has missed messages?
[00:27] gui81 nevermind, after reading through the Problem Solver, I realized that I hadn't properly set the ZMQ_SUBSCRIBE option with setsockopt
[00:34] lt_schmidt_jr sustrik: this may be a question for you
[00:34] lt_schmidt_jr sustrik: how do i know (when using a named socket) if I have missed messages
[00:37] lt_schmidt_jr named socket = DURABLE socket
[00:39] cremes lt_schmidt_jr: a PUB socket is like a radio broadcast
[00:39] cremes if you aren't connected when some messages go out, you miss them and never know it
[00:40] lt_schmidt_jr even if I connect with a durable socket
[00:40] lt_schmidt_jr ?
[00:40] lt_schmidt_jr then disconnect
[00:40] lt_schmidt_jr and come back?
[00:40] cremes if you disconnect, your queue is flushed
[00:40] lt_schmidt_jr so durable sockets don't work with pub/sub
[00:40] cremes you'll have to detect this via a sequence number gap (or similar) in the broadcast messages
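cremes' suggestion above, detecting missed broadcasts through a gap in per-message sequence numbers, can be sketched as pure Python logic (no ØMQ dependency; `detect_gaps` and the sample sequence are illustrative, not part of any 0mq API):

```python
# Sketch of gap detection for a lossy broadcast stream.
# Each published message carries a monotonically increasing sequence
# number; the subscriber compares each one against the last seen.

def detect_gaps(seqs):
    """Return (expected, got) pairs wherever messages were missed."""
    gaps = []
    last = None
    for seq in seqs:
        if last is not None and seq != last + 1:
            gaps.append((last + 1, seq))
        last = seq
    return gaps

# A subscriber that reconnected mid-stream might observe:
print(detect_gaps([1, 2, 3, 7, 8]))  # one gap: messages 4-6 were missed
```

On seeing a gap, the subscriber would trigger whatever recovery the application supports (re-request, resync, or just log the loss).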
[00:41] cremes perhaps i don't understand what you mean by durable socket
[00:41] cremes never heard of one
[00:41] lt_schmidt_jr its in the guide
[00:42] cremes i'll look in a minute...
[00:42] cremes if you can provide a link, that would be best
[00:44] lt_schmidt_jr
[00:56] lt_schmidt_jr I guess I should just try it out
[01:01] cremes lt_schmidt_jr: ok, i hadn't heard this term before
[01:01] cremes but note that it *only* keeps the send buffer around
[01:01] cremes anything in transit or in the receiver's OS buffers or queue will be lost
[01:01] cremes so you probably still need to be able to identify gaps and recover from them
[01:02] lt_schmidt_jr how does that interact with INPROC
[01:02] lt_schmidt_jr loss is still possible?
[01:07] cremes lt_schmidt_jr: i don't know
[01:07] cremes that's a good question for the mailing list
[01:07] lt_schmidt_jr thanks
[01:08] lt_schmidt_jr I have subscriber sockets on the server on behalf of the websockets that may get disconnected
[01:08] lt_schmidt_jr I actually have them run for a bit to see if the client websocket reconnects
[01:10] lt_schmidt_jr but I wonder if I can terminate them instead
[01:10] lt_schmidt_jr if they are DURABLE
[01:21] stodge Is there a pyzmq example of using PGM?
[01:36] jugg yrashk, fyi, 'ezmq' is already used for an Eiffel 0mq binding.
[01:37] yrashk jugg: oh well
[01:37] yrashk who uses eiffel anyway?
[01:38] jugg how about enifzmq ?
[01:38] yrashk horrible :D
[01:38] jugg :)
[01:38] yrashk btw apparently ezmq is faster than erlzmq
[01:39] yrashk up to 2x
[01:39] jugg what changed? I read some of the backlog... your original tests showed similar performance... how'd you up it?
[01:41] yrashk jugg: well I don't quite know yet
[01:41] yrashk it warmed up :D
[01:41] yrashk I rewrote it from scratch as a NIF as you can see
[01:42] jugg yes, I'm stuck on R13B04 unfortunately.
[01:43] yrashk and it can reach about 60-80K on my mac pro as opposed to 30K with erlzmq
[01:43] yrashk I would assume this is because there is not so much communication overhead now
[01:44] jugg that is ezmq/ezmq or erlang/ezmq ?
[01:44] jugg eh
[01:44] jugg zmq/ezmq
[01:45] jugg I mean I saw something about testing C zmq to erlang zmq.
[01:46] jugg those 60-80k is ezmq to ezmq?
[01:47] yrashk yes
[02:02] jugg yrashk, do you plan on making these bindings usable for receiving messages without having to call recv(noblock) on an interval? If so, how?
[02:03] yrashk jugg: you mean nb recv?
[02:05] jugg I mean you can't call zmq_poll, and there is no out of band recv support. So, you either have to call recv on an interval, or call it and block the vm.
[02:05] yrashk yes right now what you can see there is brecv()
[02:05] yrashk a blocking recv
[02:05] yrashk and here I have a non blocking recv() in my WIP
[02:06] yrashk right now I am solving this with having a thread per socket
[02:06] yrashk it works but only for some time, I have a bug somewhere
[02:07] yrashk that causes a crash
[02:07] yrashk that's why I am not committing it yet
[02:18] jugg Why is 'send' not called 'bsend'?
[02:18] jugg yrashk
[02:20] yrashk jugg: because it's not inherently a waiting operation
[02:20] yrashk you can use noblock flag for it
[02:21] jugg and you can't in brecv?
[02:22] yrashk you can, but the nature of recv is different
[02:22] yrashk it is inherently a waiting operation
[02:30] yrashk nb recv() is about 10 times slower than blocking
[02:45] jugg yrashk, is it possible to call an escript function using apply()? I tried using ?MODULE for the module name, but that didn't work... I'm trying to get those perf tests to run under R13B04.
[02:49] yrashk jugg -mode(compile) will help with the ?MODULE thing afair
[02:51] yrashk *if I am not mistaken*
[02:52] jugg that worked, thanks.
[02:54] yrashk yw
[03:58] jugg yrashk, is what I'll commit on csrl/erlzmq. Do you oppose being labeled the author?
[04:36] jugg yrashk, if the ezmq:recv concept was also used for ezmq:send and for a ezmq:poll implementation, those bindings would become quite usable. Nice (and fast) work.
[05:22] yrashk jugg: I don't oppose
[05:23] yrashk jugg: you think so? I might try doing recv trick with send, too
[05:23] yrashk but recv is quite slow
[05:23] yrashk brecv is much faster
[05:27] jugg how many cpu/cores do you have?
[05:32] yrashk 8 (16 virtual)
[05:33] yrashk dual quadcore
[05:35] jugg it'd be interesting to know what the scheduler is doing then... playing with thread affinity could be interesting. 10x slow down seems odd.
[05:37] yrashk ya I am not sure what's happening
[05:37] yrashk but recv is about 10 times slower than brecv
[05:37] yrashk there's certainly *some* overhead
[05:37] yrashk but I doubt there is that much of an overhead
[05:38] yrashk although who knows
[05:38] yrashk recv is pretty tricky
[05:39] yrashk push, pull, recv, send
[05:39] jugg wild guess, but receiver_thread's enif_send() is the likely candidate since it is out of band, but brecv, the erlang scheduler is already blocking for the result.
[05:39] yrashk it is the candidate
[05:39] yrashk but I have no idea how else can you communicate result back
[05:39] yrashk ;)
[05:40] yrashk also I suspect that ezmq is leaking memory
[05:40] yrashk but that should be fixable since it's pretty small
[05:40] yrashk shouldn't be that hard to trace the leak
[05:41] yrashk I have no proof of leaking yet tho
[05:41] jugg maybe figure out a batch recv implementation... it'll loop until zmq_recv(noblock) returns eagain (or a max number is hit), then return the batch result. Certainly useful for when SNDMORE exists.
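jugg's batch-recv idea, looping on the non-blocking receive until it would block (EAGAIN) or a batch cap is hit, can be sketched with a plain Python queue standing in for the 0mq socket (`queue.Empty` plays the role of EAGAIN; all names are illustrative):

```python
import queue

def batch_recv(q, max_batch=64):
    """Drain up to max_batch messages without blocking.

    queue.Empty here stands in for zmq_recv(ZMQ_NOBLOCK) returning
    EAGAIN: we stop as soon as the 'socket' has nothing ready, or
    when the batch cap is hit, and return everything collected.
    """
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(100):
    q.put(i)
print(len(batch_recv(q)))  # capped at max_batch: 64
print(len(batch_recv(q)))  # the remaining 36
```

Returning batches also cuts the per-message handoff cost between the NIF and the Erlang side, which is the overhead being discussed here.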
[05:42] yrashk yeah I haven't even begun to deal with sndmore
[05:42] yrashk was out of scope
[05:42] yrashk if only you could tell erlang to designate a scheduler to only one process
[05:43] yrashk that is the one that talks to the NIF
[05:43] yrashk this way blocking would be irrelevant
[05:43] yrashk :D
[05:44] yrashk but I don't see any way to do so
[05:46] yrashk sigh
[05:49] jugg I think you'll need to refine your get/setsockopt implementation... zmq is rather exact about value lengths. Also, not all socket options are valid for 'get'.
[05:49] yrashk I am ready to merge pull reqs
[05:49] jugg :)
[05:49] yrashk I'd rather focus on performance issues right now
[05:49] yrashk that was the original intention
[05:49] yrashk even though it is faster than erlzmq
[05:49] yrashk there's still a lot of work
[05:50] yrashk so if you can actually help me with sockopt thing, I will greatly appreciate that
[05:50] yrashk although you're on R13 :-(
[05:51] jugg yah, I can't even compile it... The best I can do is refer you to csrl/erlzmq implementation.
[05:51] yrashk re sockopt and stuff: after all ezmq is very new, I spent like 3 hours with it or so
[05:51] yrashk well you can also install R14 somewhere... hehehe :)
[05:51] jugg :)
[05:53] yrashk I am really trying to understand how to get recv almost as fast as brecv
[05:56] yrashk I have a rough idea of msend
[05:56] yrashk but it's too vague
[05:57] jugg ezmq_nif_recv should directly call zmq_recv(noblock|flags) and only if it returns EAGAIN do you hand it off to the receive_thread.
[05:58] yrashk or even asend.. hey this is an idea!
[05:58] jugg well, hand it off if (flags & noblock) == 0.
[05:58] yrashk jugg: hey, not a bad idea
[05:58] jugg it is what I do in erlzmq...
[05:59] yrashk just tested brecv again on this mac pro just in case
[05:59] yrashk 66K
[05:59] yrashk twice as good as erlzmq
[05:59] yrashk still there, good
[06:00] yrashk now, recv
[06:01] yrashk 27K
[06:01] yrashk hm not 10 times, roughly 2 times
[06:01] yrashk well I didn't recv on mac pro before yet
[06:01] yrashk only on macbook air
[06:02] yrashk which is much slower and fewer cores
[06:12] yrashk jugg: hooray
[06:12] jugg the suspense...
[06:13] yrashk now recv is more or less on par with brecv
[06:13] yrashk deviation is minimal
[06:13] jugg good deal
[06:13] yrashk thanks a lot!
[06:15] yrashk committed
[06:16] yrashk jugg:
[06:20] yrashk now... how can we make it even faster?
[06:20] jugg nice. so, your old recv performed similarly to erlzmq?
[06:20] yrashk I haven't read erlzmq, honestly
[06:20] yrashk my recv was just unconditionally handing commands off to the thread
[06:20] yrashk now it checks first
[06:21] jugg you said it was doing 27K msg/s? erlzmq was about the same, yes?
[06:21] yrashk no, current codebase is within 60-80K msg/s
[06:21] yrashk that was for brecv prior to that commit
[06:21] jugg yah, sorry, I mean the old version.
[06:21] yrashk and recv was roughly 30K
[06:21] yrashk and now recv is on par with brecv
[06:21] jugg ok
[06:22] yrashk because if there's a lot of messages it does pretty much what brecv does
[06:22] yrashk thanks to your suggestion
[06:22] jugg that makes me think that the message passing out of the driver (nif/port) is the difference... a port driver always has to do message passing.
[06:22] yrashk yes
[06:22] yrashk and this is why I wrote NIF in the first place
[06:22] yrashk I blamed port comm for delays
[06:23] yrashk and apparently it can account for about a 2x diff in timing
[06:23] jugg well, I guess that is a bit of the reason NIFs exist anyway...
[06:23] yrashk ya
[06:24] yrashk and they are going to get even better
[06:24] yrashk in regards to async stuff
[06:24] yrashk but since it's only in the future I am dealing with what we have now
[06:24] jugg sure
[06:24] yrashk and I am not yet happy with 60-80K msg/sec
[06:24] yrashk and also sending is superslow
[06:24] yrashk looks like batching doesn't kick in
[06:25] yrashk it takes roughly 13 seconds to send those msgs
[06:25] yrashk insane
[06:25] yrashk C version takes less than a half of a second
[06:26] yrashk if I send messages with C version, ezmq reads at about 120K msg/sec
[06:26] yrashk so yet another 2x gain can be achieved if sending was somehow optimized
[06:31] yrashk hmm send seems to crash sometimes on my laptop
[06:31] yrashk not on the pro
[06:31] yrashk I guess ezmq needs a little bit more of maturing
[06:52] jugg yrashk, any progress, looks like I got disconnected for a while.
[06:53] cremes pieterh, sustrik: that mailbox fix has eliminated a ton of weird shit; so glad we solved it!
[06:53] sustrik cremes: i hope it hasn't introduced other weird shit
[06:54] cremes sustrik: not at all; it's probably the most important fix i have seen yet
[06:54] sustrik :)
[06:54] cremes minor in code changes, huge in effect
[06:54] sustrik btw, do you have an osx box?
[06:54] cremes double :)
[06:54] cremes yes, i do
[06:55] sustrik i would like to fix the buffer resizing algorithm so that it works on osx
[06:55] cremes sustrik: i can give you an account on it
[06:55] sustrik if i do the patch, can you possibly test it?
[06:55] cremes sure, either way
[06:55] sustrik that would be great
[06:55] sustrik ssh?
[06:55] cremes of course
[06:55] cremes i'm going to bed now (1am) so i'll take care of it tomorrow
[06:56] sustrik great. let me know by email then
[06:56] cremes will do
[06:56] sustrik thanks
[06:56] cremes good night
[06:56] sustrik good night
[06:56] yrashk jugg: no
[06:57] yrashk jugg: just thinking how to improve performance further
[06:57] yrashk jugg: also I think I need to use rwlocks to guard sockets
[06:58] jugg or you could only allow the owner pid to use the socket?
[06:59] jugg not sure if that calling pid is available in NIFs?
[06:59] yrashk or that
[06:59] yrashk it is
[06:59] yrashk but I think rwlock is a better method
[06:59] yrashk probably
[06:59] yrashk also
[06:59] yrashk I am not sure but afair there is no guarantee that the call will always be made within the same scheduler
[07:00] yrashk hence it will be another thread
[07:00] yrashk and hence it will crash 0mq
[07:00] yrashk that's why I am thinking rwlocks
[07:00] jugg I don't... a socket should only be used by a single process. Else you have troubles with multi-part messages, and socket states (eg REQ toggling of send/recv).
[07:00] yrashk and I think these occasional crashes might be attributed to this
[07:01] yrashk true
[07:01] jugg it is fine for sockets to be used in different threads (I assume you are using zmq 2.1.x)
[07:01] yrashk but not simultaneously
[07:01] yrashk I guess
[07:01] jugg correct
[07:01] yrashk hence rwlocks
[07:01] jugg or a single erlang process...
[07:01] yrashk again it doesn't guarantee the same scheduler (thread)
[07:02] jugg doesn't need to
[07:02] jugg a single process can't call into multiple nifs at once.
[07:02] yrashk true
[07:02] yrashk then we need to record pid on socket creation
[07:02] yrashk easy to do
[07:02] jugg yes
[07:03] yrashk and return badarg or something more appropriate if it is not from the same process
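The ownership check being discussed, record the creating process at socket creation and reject calls from anyone else, can be sketched in Python with thread identity standing in for the Erlang pid (the `OwnedSocket` class and its methods are illustrative, not any real binding's API):

```python
import threading

class OwnedSocket:
    """A socket wrapper usable only by the thread that created it.

    Mirrors the idea above: record the owner at creation time and
    fail fast (like returning badarg) on access from anyone else,
    so a non-thread-safe socket can never be used concurrently.
    """

    def __init__(self):
        self._owner = threading.get_ident()
        self._inbox = []

    def _check_owner(self):
        if threading.get_ident() != self._owner:
            raise RuntimeError("badarg: socket used outside its owner")

    def send(self, msg):
        self._check_owner()
        self._inbox.append(msg)

s = OwnedSocket()
s.send("ok")  # same thread: allowed

def misuse():
    try:
        s.send("nope")
    except RuntimeError as e:
        print(e)  # other thread: rejected

t = threading.Thread(target=misuse)
t.start()
t.join()
```

Because only the owner can ever touch the socket, no rwlock is needed; serialization falls out of the single-owner rule, which is the trade-off jugg is arguing for.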
[07:04] yrashk this all doesn't explain rare segfaults
[07:05] yrashk basically *sometimes* 0mq fails on reading receiver_thread's pull socket
[07:05] yrashk which is something I can't yet explain
[07:06] yrashk A ØMQ context is thread safe and may be shared among as many application threads as necessary, without any additional locking required on the part of the caller. Each ØMQ socket belonging to a particular context may only be used by the thread that created it using zmq_socket().
[07:06] yrashk so not anymore?
[07:07] jugg not on 2.1.x
[07:07] yrashk too bad api doc is outdated
[07:07] yrashk :-(
[07:07] jugg in source, or on the web?
[07:07] yrashk web
[07:07] jugg the web is for 2.0.x still.
[07:07] yrashk any reason why?
[07:07] jugg it is the 'stable' version.
[07:08] yrashk ah
[07:08] yrashk so will be updated for 2.2?
[07:08] jugg I'm not sure how they're handling that...
[07:09] jugg I do wish they'd provide namespacing for the different versions tho...
[07:09] jugg but I believe that's been discussed and rejected.
[07:09] yrashk that would be awesome
[09:03] mikko pieterh_: there?
[09:03] pieterh mikko: yup
[09:03] mikko pieterh: cJSON
[09:04] pieterh it's giving build errors?
[09:04] mikko is there a reason why the .c file is included in zfl_config_json?
[09:04] pieterh ah, that's just to simplify things
[09:04] mikko well, it's not included in make dist atm
[09:04] mikko fixing that
[09:04] mikko make dist, take the tar.gz and try to build
[09:04] pieterh it's used only by that one zfl class
[09:04] mikko yeah
[09:04] pieterh true, we never tested a tarball yet
[09:17] mikko pieterh: larger config refactoring on zfl, want it on separate branch first?
[09:17] mikko i went through most of it last night and all aspects of build work for me
[09:18] pieterh I don't think so mikko, let's do everything on master
[09:18] mikko ok, i can always revert if it breaks things in a bad way
[09:18] pieterh the sooner it breaks the more time we have to fix it :-)
[09:18] pieterh oh, we never revert :-)
[09:19] pieterh i'm serious, the only process I know is to publish & improve
[09:19] mikko 3 files changed, 67 insertions(+), 149 deletions(-)
[09:19] mikko heh
[09:20] pieterh commit it, I'll quickly test on this box I'm on
[09:20] pieterh I have 6 minutes then need to leave :-)
[09:20] pieterh I mean, push the commit...
[09:20] mikko pushed
[09:21] mikko let me know if something breaks. i'll test solaris today as well
[09:22] pieterh Tests passed OK
[09:22] pieterh PASS: zfl_selftest
[09:22] pieterh =============
[09:22] pieterh 1 test passed
[09:22] pieterh =============
[09:22] pieterh that's on Ubuntu
[09:22] pieterh nice stuff!
[09:22] mikko good stuff
[09:22] mikko now i can add the gcov script and make the daily builds do this properly
[09:36] sustrik mikko: morning
[09:36] sustrik win7/msvc build seems to be failing
[09:36] sustrik when I build on my XP/msvc there's no problem
[09:37] sustrik is the source up to date there?
[09:37] sustrik maybe it has something to do with pgm?
[09:37] sustrik hm
[09:39] sustrik aha, that's probably it
[09:44] sustrik here it is: types.h:44: # define bool BOOL
[09:57] mikko yes
[09:57] mikko the pgm folder keeps disappearing for some reason
[09:57] mikko i think jenkins cleans the workspace at some point
[09:58] sustrik it's a different problem
[09:58] sustrik i've already sent an email about it to openpgm mailing list
[10:00] mikko ah good
[10:00] mikko so maybe the file permissions are working after all
[10:00] mikko if you got time at some point can you test this branch
[10:00] mikko ./configure --with-pgm
[10:01] mikko you should see during configure that it invokes openpgm configure
[10:01] mikko and everything works like magic
[10:02] sustrik i have ubuntu here
[10:02] sustrik would testing on that help you?
[10:05] mikko yeah, if you can
[10:05] mikko i've been only running it on my local vm
[10:05] mikko where everything seems to work ok
[10:06] sustrik ok, wait a sec
[10:06] mikko sun studio complains about the same thing as msvc
[10:07] sustrik yes, same problem
[10:13] sustrik mikko: tested
[10:13] sustrik builds ok
[10:14] mikko good
[10:14] mikko so it's ready(ish)
[10:14] sustrik nice
[10:14] mikko todo list emptying faster than i hoped
[10:33] yrashk I am confused now
[10:33] yrashk in PUB/SUBs who should connect and who should bind?
[10:33] mikko yrashk: doesnt matter
[10:34] yrashk looks like it works either way
[10:34] mikko yes
[10:34] yrashk mikko: that's what I thought, thanks
[10:34] yrashk I just received a patch "fixing" this in my tests
[10:35] yrashk and it got me puzzled because I never even thought about it
[11:53] mikko heyo
[11:53] pieterh heyo, mikko
[11:54] mikko q: would it be useful to have sockopt to prevent durable subscribers on server side?
[11:54] pieterh IMO yes
[11:54] mikko currently the server side is pretty vulnerable to DoS
[11:54] mikko connect tons of clients with identities, disconnect them and it should run out of memory
[11:54] pieterh Even more so, have sockopt that *allows* this
[11:54] mikko and another thing is removing subscriptions
[11:54] pieterh Or else limit nbr of durable peers
[11:55] mikko or controlling their lifetime
[11:55] pieterh indeed
[11:55] mikko as in "if peer has missed N messages consider it dead"
[11:55] pieterh well, that's the whole point of durable sockets
[11:55] mikko or "if peer hasn't been back in 2 hours consider it dead"
[11:55] pieterh peer can go away for a long time
[11:55] mikko i mean controllable time for durability
[11:56] pieterh from a paranoid POV, I'd like
[11:56] pieterh - default HWM for durable sockets
[11:56] mikko that way you don't have to worry about restarting server if you remove a durable subscriber
[11:56] pieterh - default limit on number of those sockets
[11:56] pieterh - default limit on total memory used by durable socket queues
[11:57] pieterh good topic for discussion on list IMO
[11:57] mikko - ability to remove durable subscription explicitly
[11:57] mikko like remove subscription of "company ABC" while keeping others
[11:57] pieterh - timeout on durable sockets
[11:57] mikko yes
[11:58] mikko one of the important features for 2.1.0 i could see is not failing on invalid connection uri
[11:59] pieterh we should list the outstanding 'bugs' in 2.1.0
[12:00] mikko is atlassian stack overkill for us?
[12:00] mikko they give licenses to open source as far as i know
[12:00] mikko jira/confluence/etc
[12:00] pieterh oh, I'd rather not
[12:00] pieterh we used to use Jira for all issue tracking
[12:00] pieterh it is a great, fantastic product
[12:01] pieterh you just have to pay someone to reboot the *@%@$E# server once a week
[12:01] pieterh i would 10x rather use github's simple but maintenance free issue tracking
[12:02] pieterh anyhow, I was thinking of a wiki page, like the 2.0 roadmap
[12:02] pieterh *3.0
[12:02] mikko what i would like is somehow automatically assign issues to roadmap milestones
[12:02] mikko i wonder if zfl tests should be broken into separate files
[12:03] mikko currently it shows "1 test succeeded"
[12:03] pieterh regarding issue tracking, discuss on list, it's too contentious
[12:04] pieterh remember our process is not driven by issues but by patches
[12:04] pieterh for zfl tests, multiple executables would work for me, sure...
[12:04] pieterh it's more work to maintain though
[12:05] pieterh when we have four files for each class I start to get tempted by code generation
[12:05] pieterh and that gets ugly, you don't want to see that :-)
[12:06] yrashk hey pieterh_
[12:06] pieterh hi yrashk
[12:07] yrashk mikko: I use open source license for bamboo, nice stuff
[12:11] mikko i find jenkins a lot better than bamboo
[12:11] mikko especially for distributed builds
[12:17] yrashk hmm may be I should add select() based active mode for ezmq.. not sure if this will help with the performance issue, though
[12:22] yrashk or, rather poll() one
[12:44] sustrik mikko, pieterh_: re durable subscribers: +1
[12:44] sustrik i would even remove the identity option altogether
[13:02] tormaroe Just found 0MQ, and I'm really excited. Built from source on Windows without problems. Now I want to install the Ruby gem, but I'm having problems.
[13:02] tormaroe Trying gem install zmq -- --with-zmq-dir=c:\zeromq
[13:02] tormaroe but still getting ERROR: Failed to build gem native extension. extconf.rb:25: Couldn't find zmq library. try setting --with-zmq-dir=<path> to tell me where it is. (RuntimeError)
[13:02] tormaroe please help :)
[13:03] tormaroe I just copied the build output to my c:\zeromq. No other "installation" required, right?
[13:05] sustrik no
[13:06] sustrik you need just the library and the header file
[13:07] tormaroe I have no clue about c++, so what's the header file?
[13:07] tormaroe got dll, exp, ilk, lib and pdb
[13:09] yrashk it looks like I forgot a lot about poll()-related matters
[13:10] yrashk my recv blocks after receiving a ZMQ_POLLIN revent
[13:10] yrashk eh
[13:12] yrashk don't quite understand how this could happen, but likely my 5am bug :D
[13:16] CIA-21 zeromq2: 03Martin Sustrik 07master * r17e2ca7 10/ (5 files):
[13:16] CIA-21 zeromq2: Logging of duplicit identities added
[13:16] CIA-21 zeromq2: Signed-off-by: Martin Sustrik <> -
[13:16] sustrik tormaroe: zmq.h
[13:17] sustrik yrashk: that should not happen
[13:21] tormaroe Added zmq.h from the include folder and re-ran gem command, but getting same error :(
[13:21] sustrik try asking on the mailing list, i am not a ruby expert
[13:21] tormaroe ok, thanks anyway
[13:24] yrashk sustrik: yeah I figured that out already and fixed the bug
[13:34] yrashk doing poll & recv at the same time is probably an insanely bad idea, after all :D
[15:05] yrashk sustrik: pieterh: is it ok that send() with ZMQ_NOBLOCK takes roughly 10 usec?
[15:06] yrashk I am profiling my C code
[15:06] sustrik raw zmq_send()?
[15:06] yrashk ya
[15:06] sustrik that's too much
[15:06] sustrik should be below 1 us
[15:07] yrashk
[15:08] yrashk appended that gist with some typical tv.tv_usec-tv1.tv_usec printout
[15:08] yrashk it looks more like 7 us
[15:08] yrashk but still
[15:08] sustrik have you tried to measure zmq_send() itself?
[15:08] sustrik there's some other work being done there
[15:08] yrashk well that if condition is false
[15:09] sustrik ah
[15:09] sustrik ok
[15:09] yrashk but I can try
[15:09] sustrik what OS are you on?
[15:09] yrashk osx
[15:09] sustrik how long does gettimeofday take on osx?
[15:09] yrashk still 5+ us at least
[15:09] yrashk with no if
[15:09] yrashk no idea
[15:10] sustrik it tends to be slow
[15:10] sustrik it's kind of better with linux nowadays
[15:10] sustrik they've got it to something like 1us
[15:10] sustrik but before it was much worse
[15:10] sustrik no idea about osx though
[15:11] sustrik easiest way to measure is to make 1M zmq_send()s
[15:11] sustrik measure the whole thing
[15:11] sustrik and divide it by 1000000
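sustrik's recipe, time a large batch of calls as one block and divide, amortizes the clock-read overhead he warns about above. A pure-Python sketch (the no-op stand-in for zmq_send() and all names are illustrative):

```python
import time

def noop_send(msg):
    """Stand-in for zmq_send(); swap in the real call to measure it."""
    return len(msg)

def mean_latency_us(fn, msg=b"x", n=1_000_000):
    """Time n calls as one block and divide by n.

    Reading the clock once per batch instead of once per call keeps
    a slow gettimeofday() from dominating the measurement.
    """
    start = time.perf_counter()
    for _ in range(n):
        fn(msg)
    elapsed = time.perf_counter() - start
    return elapsed * 1e6 / n  # mean microseconds per call

print(f"{mean_latency_us(noop_send, n=100_000):.3f} us/call")
```

The same structure applies in C: wrap the 1M-iteration loop of zmq_send() calls in a single pair of clock reads and divide.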
[15:12] yrashk apparently _lat tests are good on ezmq
[15:12] yrashk well that's what I do
[15:13] yrashk (07:12) <evaxsoftware> _lat results:
[15:13] yrashk (07:12) <evaxsoftware> local_lat remote_lat mean latency real/usr/sys
[15:13] yrashk (07:12) <evaxsoftware> C C 41 8.2/0.8/3.7
[15:13] yrashk (07:12) <evaxsoftware> ezmq ezmq 47 9.5/2.6/2.3
[15:13] yrashk (07:13) <evaxsoftware> pyzmq pyzmq 48 12.7/1.8/3.7
[15:13] yrashk .win 26
[15:13] yrashk oops
[15:13] yrashk (07:14) <evaxsoftware> it's efficient (completes nearly as fast as the C version)
[15:13] yrashk apparently better than pyzmq
[15:15] DarkGod hello, I'm sorry but I am afraid I have a few, probably stupid questions. I read the manual and wanted to try the auto reconnect stuff. I took the simple REQ/REP C examples and compiled them; if I run the client and then the server it's all dandy, yet if I kill the server while the client is running and then restart the server it does not seem to reconnect
[15:15] DarkGod it should no? or did I misunderstand ?
[15:16] sustrik yrashk: nice
[15:17] yrashk so I am not sure why thr is so bad
[15:17] sustrik throughput is a silly metric
[15:17] sustrik it exhibits large oscillation due to minor causes
[15:18] sustrik shrug
[15:19] sustrik DarkGod: it probably reconnects, but current request is lost
[15:19] sustrik if you need reliability you should implement it on top of 0mq
[15:19] sustrik iirc, the guide explains that
[15:20] DarkGod ah, so I should put a timeout on the send on the REQ side so that it can fail
[15:20] sustrik yes
[15:20] sustrik then you can resend the request if needed
[15:20] DarkGod I see
[15:21] DarkGod thanks :)
[15:21] sustrik np
[15:22] DarkGod next silly question: the pub/sub sockets look neat, but in some cases I want to address a specific client. I could use filtering, but an ngrep of the ethernet interface shows me that all data is pushed to clients and that they then do the filtering. As I might have large data going through, I really do not want it transmitted to all clients when it is actually only meant for one
[15:23] DarkGod I must use pairs in this case? (setting up a pair per client ?)
[15:27] sustrik there's a work in progress wrt filtering on the PUB side
[15:28] sustrik (sub-forward branch)
[15:28] sustrik you can help to finish that work
[15:29] DarkGod I'm afraid I dont know zmq code quite well enough, but yeah that's what I'd need I imagine
[15:31] DarkGod how advanced is this branch/what needs to be done ?
[15:32] sustrik check the mailing list archives
[15:32] sustrik there's a lot of related discussion there
[15:32] DarkGod ok :)
[15:32] sustrik so far the subscriptions are propagated up the distribution tree
[15:32] sustrik what's missing is actual filtering
[15:56] yrashk
[15:56] yrashk sustrik: ^^ another bizarre crash :) now I don't even see any out-of-bounds pointers :-D
[15:57] yrashk does this trace tell you anything worth checking out?
[16:12] sustrik yrashk: what's the error?
[16:13] sustrik SEGFAULT?
[16:13] yrashk yes
[16:13] sustrik segfault in glibc
[16:13] sustrik nasty
[16:13] yrashk I can show the whole socket printout
[16:13] yrashk if you want
[16:13] sustrik no need
[16:13] sustrik is it reproducible?
[16:14] yrashk not every time
[16:14] yrashk and not on every computer
[16:14] yrashk but it is
[16:14] sustrik hm
[16:14] yrashk just in case
[16:14] yrashk :)
[16:14] sustrik erlang, hm
[16:14] sustrik :|
[16:14] yrashk crash happens both on osx and linux
[16:15] yrashk I don't think it has *anything* to do with erlang per se
[16:15] yrashk this code is in a separate thread that Erlang has little to no idea about
[16:19] sustrik yrashk: any chance to reproduce the problem in C?
[16:20] yrashk may be
[16:20] yrashk but I have no idea how
[16:22] yrashk (yet)
[16:22] sustrik what's the OS btw?
[16:25] yrashk both osx and linux
[16:25] yrashk very similar crash
[16:38] yrashk are there any circumstances under which context gets terminated implicitly, sustrik?
[16:38] sustrik yrashk: no
[16:38] sustrik you have to terminate by calling zmq_term()
[16:38] sustrik it looks like a bug in 0mq anyway
[16:39] sustrik so the goal now is to make a reproducible test case
[16:39] sustrik and to fix it
[16:39] yrashk do you think it is a 0mq bug?
[16:39] yrashk huh
[16:39] yrashk I was getting segfaults in recvfrom() before
[16:40] yrashk last time it was due to an accidental rewrite over context
[16:40] sustrik it's kind of strange
[16:40] sustrik yes, it looks like memory overwrite
[16:40] sustrik either by 0mq or ezmq or erlang itself
[16:41] sustrik that's why C use case would help
[16:41] sustrik that would make it clear that the problem is in 0mq
[16:42] yrashk yeah I know
[16:42] yrashk it's barely reproducible here on my laptop
[16:42] sustrik what's the use case?
[16:42] yrashk much more frequent on another guy's linux box
[16:42] yrashk a second thread recv'ing on a pull socket, waiting for a command to recv on another socket
[16:44] yrashk sustrik: does context change over time or is it immutable?
[16:46] sustrik you mean the internals of the context?
[16:46] yrashk ya
[16:46] sustrik yes, there's a list of inproc endpoints for example
[16:46] sustrik a table of open sockets
[16:46] sustrik and similar
[16:46] yrashk that might be the case
[16:46] yrashk might any zmq operation change the context?
[16:47] sustrik creating a socket
[16:47] sustrik closing a socket
[16:47] sustrik binding to inproc endpoint
[16:48] yrashk that's it?
[16:48] yrashk we do all three
[16:48] yrashk ;)
[16:48] yrashk and definitely bind to inproc
[16:49] yrashk in that context
[16:49] yrashk in fact it is only used for inproc
[16:50] sustrik yes, but what's the deal?
[16:50] sustrik why should changing the context be a problem?
[16:50] yrashk "As soon as you write towards a shared state either through static variables or enif_priv_data you need to supply your own explicit synchronization. "
[16:51] yrashk this is from NIF documentation
[16:51] yrashk that context that we use there is a static variable
[16:52] sustrik but it's only a pointer to context, right?
[16:52] sustrik which never changes
[16:52] yrashk yup
[16:52] sustrik the context itself is threadsafe
[16:52] sustrik so it should be ok imo
[16:53] yrashk we're just trying to find any possible explanation for the crash :)
[16:55] yrashk ok, bed time
[16:55] yrashk I am exhausted and need to get up in 3 hours :]
[16:57] sustrik good god
[16:57] sustrik see you later then
[16:58] CIA-21 zeromq2: 03Martin Sustrik 07master * r12486fe 10/ (src/pgm_socket.hpp src/zmq.cpp):
[16:58] CIA-21 zeromq2: Fix MSVC and SunStudio builds with OpenPGM
[16:58] CIA-21 zeromq2: Signed-off-by: Martin Sustrik <> -
[17:36] lt_schmidt_jr sustrik: for inproc sub, are durable sockets truly durable?
[17:37] sustrik what's durable?
[17:38] lt_schmidt_jr
[17:38] lt_schmidt_jr sockets with explicit identity
[17:48] sustrik and what's "truly durable"? :)
[17:50] lt_schmidt_jr sustrik: I mean if the connection is closed I would not lose messages given that there are no network buffers
[17:51] lt_schmidt_jr sustrik: I wonder if with inproc the client sub buffer gets filled
[17:52] sustrik when you close the SUB side
[17:52] sustrik the buffer on PUB side is filled
[17:53] sustrik (SUB buffer doesn't exist at the moment)
[17:53] lt_schmidt_jr ah, so this may actually work for me
[17:54] sustrik i don't know how well inproc works with identities
[17:54] sustrik you have to test it yourself, i would say
[17:54] lt_schmidt_jr ah
[17:54] sustrik the problem with inproc is that there are still pieces missing
[17:54] sustrik like reconnect, for example
[17:55] sustrik not sure about identities
[17:55] lt_schmidt_jr did not realize that
[17:56] lt_schmidt_jr the reason I am asking - is that I have an inproc forwarder connected to other servers symmetrically
[17:57] lt_schmidt_jr and I have sub sockets that I keep on behalf of web clients, and if the web clients disconnect, I keep the sub socket for a bit in case they reconnect
[17:58] lt_schmidt_jr I am thinking to use identity to have zmq do that for me
[17:58] lt_schmidt_jr but sounds like that may not be the best approach currently
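[Editor's note: as suggested earlier in the channel, the usual way for a subscriber to detect missed messages is a sequence number embedded in each published message, and checking for gaps on the receiving side. A minimal, language-agnostic sketch in Python; the function name and message layout are illustrative, not part of any 0MQ API:]

```python
def find_gaps(received_seqs):
    """Given the sequence numbers a subscriber actually received
    (in arrival order), return (first, last) ranges of missing ones."""
    gaps = []
    expected = None
    for seq in received_seqs:
        if expected is not None and seq > expected:
            # everything from `expected` up to seq-1 was never delivered
            gaps.append((expected, seq - 1))
        expected = seq + 1
    return gaps

# A subscriber that was disconnected while messages 3-5 went out:
print(find_gaps([0, 1, 2, 6, 7]))  # -> [(3, 5)]
```

This works regardless of transport or socket identity, which is why it is the common fallback when PUB/SUB delivery guarantees are unclear.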
[18:55] bhuga i'm getting an assertion failure while using zeromq from the ruby FFI: Assertion failed: inpipes [current_in].active (xrep.cpp:229). Is that indicative of something I'm doing wrong, or a bug?
[20:05] cremes bhuga_: can you pastie your code that's causing that? it may be a bug in 0mq
[20:29] bhuga it is kind of convoluted :/, pretty far from a minimal test case. it is also not deterministic, happening for sure, but not always after the same number of messages have passed. i have an strace running up to it, if that helps?
[21:02] cremes bhuga_: is your code similar to the setup described in this ticket?
[21:02] cremes
[21:02] cremes that raises the same assertion as your code
[21:03] cremes i am wondering if the configuration of sockets & threads is similar in your case
[21:03] bhuga mine is simpler
[21:03] bhuga but still difficult to hand over for reproduction
[21:04] cremes how many sockets are you using?
[21:04] cremes i.e. how many xrep, how many xreq?
[21:04] bhuga the ruby end is sitting on an xrep socket, one thread (ruby, after all) and the other end is common lisp via ffi on a req-rep
[21:04] cremes ok
[21:04] cremes are they both using the same build of libzmq?
[21:04] bhuga yes, 2.1.0
[21:05] cremes 2.1.0 from the tarball or 2.1.0 from github master?
[21:05] bhuga yowza, good question, i didn't write that down when i put it into puppet last month. probably the tarball.
[21:05] cremes ok
[21:05] cremes i would recommend trying master and seeing if the problem persists
[21:06] bhuga okay, i can do that. will be back in 15 or 20
[21:06] cremes there have been many bug fixes since the 2.1.0 tarball was cut 2 or 3 months ago
[21:06] cremes k
[21:07] bhuga (about to try it, but my buddy on the lisp end, debugging something else, just realized that putting 1 second between calls fixes it)
[21:10] cremes bhuga_: yuck, that's a sucky fix :)
[21:10] bhuga yeah, it kinda defeats the point of a sweet high-performance message queue :)
[21:10] cremes definitely
[21:11] cremes i have dozens of xrep/xreq sockets doing very high volume communications (from ruby) and i haven't hit that particular assertion
[21:11] cremes (though i was tearing my hair out about another one that just got fixed a few days back)
[21:12] bhuga this one has our hair pulled, i have to admit
[21:12] cremes heh
[21:12] bhuga its pretty much brought us to a standstill
[21:12] cremes let me know how it goes after the update
[21:12] bhuga figuring out how to make autoconf work on head still :)
[21:13] cremes what os are you on?
[21:13] bhuga ubuntu 10.10
[21:13] cremes ko
[21:13] cremes er, ok
[21:15] bhuga autoreconf --install and autoconf are neither doing the trick :/
[21:16] bhuga (then i found, not mentioned in the INSTALL :) )
[21:25] bhuga (initial results with the new version are quite promising. thanks!)
[21:33] cremes bhuga_: good to hear
[21:33] bhuga we are still getting it but its more deterministic now
[21:34] bhuga makes me think we're using it wrong now as opposed to weird bugs
[21:34] cremes interesting
[21:34] cremes a reproducible case would be a great addition to that ticket i referenced earlier
[21:35] bhuga well you know how testing these things is
[21:36] bhuga by definition they're made to work with different languages, daemons, etc
[21:36] bhuga if we can make a minimal case we will
[21:36] cremes cool
[21:36] bhuga (though i suspect if i could i could just fix it)
[21:36] cremes :)
[21:44] bhuga more testing reveals some more non-deterministic failures :( but i guess if nobody's heard of it before we'll have to figure it out
[21:45] cremes bhuga_: do you have any QUEUE devices in between your clients & servers?
[21:45] cremes if i understand your setup a bit, perhaps i can suggest a place to start looking
[21:46] bhuga its really simple
[21:46] bhuga the two are on the same machine, talking over an IPC (it might actually be ITC, i'd need to check) socket
[21:46] bhuga REQ-REP from lisp talking to XREP on ruby
[21:47] bhuga no other devices, no shenanigans
[21:47] cremes ok
[21:47] bhuga the general feel of the upgrade is that it happens less, but it still happens
[21:47] cremes what do you mean by REQ-REP from lisp? is a req socket talking to *both* the rep lisp and xrep ruby sockets?
[21:47] bhuga it's doing a request-reply pattern, sending a request, blocking waiting for a reply
[21:48] bhuga the ruby bit is data-driven, receiving that message and sending a response
[21:48] cremes ok, so you have single req socket in the lisp program talking to a single xrep socket in ruby
[21:49] cremes and only the ruby side crashes?
[21:49] bhuga only the ruby side crashes, correct
[21:51] cremes does it crash on recv or on send? and are these recv/sends blocking or non-blocking?
[21:52] bhuga i dont actually know. it's hard to debug the assert. i have an strace of it
[21:52] bhuga and ruby's not doing any syscalls on its own, i can say
[21:52] cremes ok
[21:52] bhuga (apologies for spam:)
[21:52] bhuga send(5, "\220#.\r\n\0\0\0\377\377\377\377\1\0\0\0\fQ\200\r\0302\236\f", 24, 0) = 24
[21:52] bhuga recv(15, "\230P\200\r\2\0\0\0h\201E\r\24\0\0\0t\201E\r\364\357\f\267", 24, MSG_DONTWAIT) = 24
[21:52] bhuga recv(15, "\230P\200\r\2\0\0\0\30#\6\r\330\264\251\10t\201E\r\364\357\f\267", 24, MSG_DONTWAIT) = 24
[21:52] bhuga recv(15, "\230P\200\r\t\0\0\0h\201E\r\364\277\20\267\30#\6\r\3\0\0\0", 24, MSG_DONTWAIT) = 24
[21:52] bhuga send(5, "h\201E\r\n\0\0\0\377\377\377\377\0\0\0\0\fQ\200\r\0302\236\f", 24, 0) = 24
[21:52] bhuga recv(15, "\230P\200\r\4\0\0\0p\323o\r\210\34-\r\21\0\0\0\10O\266\v", 24, MSG_DONTWAIT) = 24
[21:52] bhuga recv(15, "\250\335S\r\7\0\0\0\300\223M\267\364\177M\267\300\223M\267\30#\6\r", 24, MSG_DONTWAIT) = 24
[21:52] bhuga send(5, "X\1_\r\10\0\0\0\300\223M\267\364\177M\267\300\223M\267\230P\200\r", 24, 0) = 24
[21:52] bhuga recv(15, "\240\345<\r\5\0\0\0\300\223M\267\364\177M\267\300\223M\267\30#\6\r", 24, MSG_DONTWAIT) = 24
[21:52] bhuga recv(15, "\230P\200\r\v\0\0\0\324\201E\r\0\0\0\0\n\0\0\0\364\277\20\267", 24, MSG_DONTWAIT) = 24
[21:52] bhuga recv(15, 0xbfe7b514, 24, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
[21:52] cremes can you puts a debug statement around the ruby send/recv calls to see which one is active during the crash?
[21:52] bhuga send(5, "p<\350\f\7\0\0\0\300\223M\267\364\177M\267\300\223M\267\364\277\20\267", 24, 0) = 24
[21:52] bhuga recv(15, "\240\345<\r\10\0\0\0h\201E\r\0\0\0\0\n\0\0\0\364\277\20\267", 24, MSG_DONTWAIT) = 24
[21:53] cremes bhuga_: use or for anything over 2 lines, pls; that's really hard to read in the channel
[21:53] bhuga yeah
[21:53] bhuga i got throttled anyway
[21:53] cremes can you puts a debug statement around the ruby send/recv calls to see which one is active during the crash?
[21:54] bhuga based on the assertion, it seems to be receiving
[21:54] bhuga it's asserting that it shouldn't be doing what it's about to do unless the message has more parts
[21:54] cremes ok
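[Editor's note: the assertion concerns multipart framing. An XREP socket delivers each logical message as several parts, and the reader must keep receiving while the "more parts" flag (the ZMQ_RCVMORE socket option in libzmq) is set. A pure-Python sketch of that drain loop, with a stub class standing in for a real 0MQ socket; the method names mirror the libzmq option but this is not the ffi-rzmq API:]

```python
RCVMORE = 13  # ZMQ_RCVMORE's option number in libzmq 2.x; illustrative here

class StubSocket:
    """Stands in for a 0MQ socket: serves queued frames one at a time."""
    def __init__(self, frames):
        self.frames = list(frames)

    def recv(self):
        return self.frames.pop(0)

    def getsockopt(self, opt):
        # "more" flag: true while parts of the current message remain
        return 1 if self.frames else 0

def recv_all_parts(sock):
    """Drain every part of one multipart message before touching the socket again."""
    parts = [sock.recv()]
    while sock.getsockopt(RCVMORE):
        parts.append(sock.recv())
    return parts

sock = StubSocket([b"identity", b"", b"body"])
print(recv_all_parts(sock))  # -> [b'identity', b'', b'body']
```

Reading only some parts and then calling recv again as if a new message had started is exactly the kind of state confusion an assertion like `inpipes [current_in].active` can surface.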
[21:55] cremes how are you forming the messages on the lisp end for transmission? are they simple strings?
[21:55] cremes encoded as json? protobufs, etc?
[21:55] bhuga that i know less about :/
[21:55] cremes hell, s-expressions?
[21:55] cremes ok
[21:55] bhuga im finding out
[21:56] bhuga utf-8 encoded, null-terminated strings
[21:56] cremes ok
[21:56] bhuga the content of them does not appear to matter
[21:57] bhuga in that a given message (which we send from the lisp end, and reproduce) can be sent any number of times
[21:57] bhuga until the random crash
[21:58] bhuga (after playing with it, it seems a 50-100ms delay is sufficient to prevent the issue from appearing in simple tests)
[21:58] cremes can you try a req socket on the ruby end instead of xrep? it doesn't sound like you need the flexibility of an xrep anyway
[21:58] bhuga hrm. i can try it, i think.
[21:58] bhuga it will take me a minute
[21:58] cremes ok
[21:59] cremes with a rep socket you don't have to worry about creating the null message delimiter and such
[21:59] bhuga (though the xrep is kind of what you need on ruby--we're currently only doing one task at a time but that's not the goal, and blocking is the devil)
[21:59] cremes yeah, but how does xrep help in that case? it's blocking too
[22:00] cremes you can use both rep and xrep with ZMQ::NOBLOCK; it isn't limited to just xrep
[22:00] bhuga i guess i need to keep reading (i have taken this bug over from someone else)
[22:00] cremes can you pastie the ruby method that does the recv and the one that does the send?
[22:01] cremes i can probably help with the socket switch out... i wrote the ffi-rzmq gem so i'm pretty familiar with this stuff
[22:02] bhuga okay, i can do that
[22:03] bhuga
[22:04] bhuga there is a lot of silly metaprogramming going on
[22:04] bhuga but the log_req in that method never happens
[22:04] bhuga i can toss in a debug to make sure that it's getting to that socket.recv_string, which i suspect is the line
[22:05] cremes right
[22:05] cremes do you set any options on this socket?
[22:05] cremes e.g. socket.setsockopt(option, value)
[22:05] bhuga the metaprogramming thing again. i'll try and output the current options in my forthcoming debug addition
[22:06] cremes ok
[22:06] cremes btw, nothing you are doing there requires an xrep socket
[22:06] bhuga i believe you :)
[22:07] cremes but you do make things more complicated for yourself when you have to send the reply
[22:07] bhuga and i will change it if need be
[22:07] bhuga im afraid to change it just now since if we are doing the null-terminator thing that would perhaps no longer be correct?
[22:07] cremes right
[22:07] cremes you need to save the "return envelope" for the reply
[22:08] cremes and it is separated from the body of your messages by an empty/null message
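[Editor's note: the envelope handling described above can be sketched without any 0MQ calls: split the received parts at the first empty delimiter frame, save the address frames, and prepend them (plus the delimiter) to the reply so it routes back to the requester. Pure Python, purely illustrative:]

```python
def split_envelope(parts):
    """Split an XREP-received multipart message into (envelope, body)
    at the first empty delimiter frame."""
    i = parts.index(b"")  # empty frame separates envelope from body
    return parts[:i], parts[i + 1:]

def build_reply(envelope, body_parts):
    """Prepend the saved envelope and delimiter so the reply routes
    back to the original requester."""
    return envelope + [b""] + body_parts

envelope, body = split_envelope([b"client-id", b"", b"request"])
print(build_reply(envelope, [b"response"]))  # -> [b'client-id', b'', b'response']
```

A plain REP socket does this split-and-restore internally, which is why it is the simpler choice when the strict request/reply pattern is enough.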
[22:08] bhuga is there a nice way to get all of the socket options?
[22:08] cremes search the code for setsockopt
[22:09] cremes if you don't find it, then you haven't set any beyond the defaults
[22:09] bhuga sockets[shard_id].setsockopt(ZMQ::LINGER, -1)
[22:09] bhuga which i think fixed a bug we had on exiting
[22:10] cremes yeah, it prevents the socket from closing until all packets are flushed
[22:10] bhuga sounds about right, i think it was giving all kinds of errors when we tried to exit
[22:10] cremes just for kicks, can you comment that out and run your test? it crashes before it's done anyway, right?
[22:10] bhuga yeah, i can do that
[22:13] bhuga i dont think the linger is on the server
[22:13] bhuga i think this is from a (being replaced by lisp) ruby client
[22:13] bhuga could a server expecting a lingering client cause this behavior?
[22:14] cremes no
[22:14] cremes so, i think it is worthwhile for you to figure out how to swap the xrep for a rep socket
[22:15] cremes xrep is a lower-level 0mq socket so it is trickier to work with
[22:15] cremes from what i have seen, you don't need it; a rep socket and an xrep socket both 'block' the same way when sending/recving
[22:16] cremes you only use xrep when you want to break the strict send/recv/send/recv REQ-REP pattern
[22:17] bhuga well, it will need to, of sorts. eventually the 'client' in this case will send out one request and get results of initially-unknown length back from n different workers
[22:17] bhuga but not *yet*, necessarily :)
[22:18] cremes ok
[22:18] bhuga but i'll investigate that further and plug in some debug to make sure i know exactly what line of ruby is causing it
[22:18] cremes but remember the YAGNI principle :)
[22:18] bhuga then update the issue
[22:18] cremes cool
[22:18] bhuga if i pin it down
[22:18] bhuga thanks for your time anyway
[22:18] cremes sure
[22:19] bhuga best gem author ever