Wednesday February 16, 2011

[Time] Name Message
[00:00] cremes kdj: you are welcome; remember to pay it forward at some point ;)
[00:01] kdj Hopefully that won't involve inadvertently leading someone astray. ;)
[04:50] zedas sustrik: hey so i still see this poll 100%CPU bug even with the latest 2.1.0 and *cannot* figure out how to fix it.
[04:51] zedas sustrik: it looks like i'll have to dig into the zeromq code and pull out the error handling that zmq_poll does.
[07:07] sustrik zedas: any chance to reproduce the problem here?
[07:24] zedas sustrik: it happens at random on my servers, so next time i can gdb to it and debug for you.
[07:24] sustrik thanks
[07:25] sustrik find out what's looping there
[07:31] zedas well i'm pretty sure it's zmq_poll not handling an EAGAIN on zeromq socket objects.
[07:31] zedas but i'll confirm it and work up a fix. looking at the code the fix may be a flag that says to not stuff errors.
[07:42] sustrik let me have a look...
[07:42] sustrik zedas: is that linux?
[07:44] sustrik hm, the only operation on zeromq socket objects within zmq_poll is zmq_getsockopt()
[07:45] sustrik are you getting EAGAIN from zmq_getsockopt()? That should not happen as far as i am aware.
[08:41] enleth Hello
[08:44] sustrik hi
[08:44] enleth I've got a problem building OMQ - it's about the luuid dependency. OMQ requires the OSSP UUID library, which, due to conflicts with the (unmaintained and dropped a long time ago) e2fsprogs libuuid, was renamed to libossp-uuid in my Linux distribution and, FWIW, this was generally a very popular solution.
[08:45] enleth But OMQ looks for libuuid and the configure script does not accept an alternate name
[08:45] sustrik enleth: easy, patch the build system and submit the patch to the mailing list
[08:47] enleth Oh, and I just noticed that the proper libuuid provides an uuid-config program for the configure script to use
[08:48] enleth uuid-config --libs outputs -lossp-uuid, which should be used
[08:48] enleth I guess this is what the build system should do instead of using a hardcoded name
[08:49] sustrik great, post your suggestion to the mailing list
[08:49] enleth The problem is, my skills with autotools are crap
[08:49] sustrik so that build system maintainers can have a look at it
[08:49] enleth OK, will do
[08:49] sustrik thanks
[09:16] enleth No, wait. It does use the old e2fsprogs-derived libuuid, my bad.
[09:25] enleth OK, there is no problem, the distro repository manager screwed up and I got a bad upgrade installed
[09:34] mikko pieterh: are you here sir?
[09:34] pieterh mikko: just arrived
[09:36] mikko is there a specific reason why test functions are compiled into zfl ?
[09:36] mikko are those symbols needed outside selftest?
[09:38] pieterh if you can find a way of compiling a single C source file into two objects, I'm happy
[09:39] pieterh but the test code must, for me, be in the same source as the actual class
[09:40] mikko pieterh: ok
[09:41] pieterh mikko: if people are unhappy about extra code in their executables we could make these conditionally compiled
[09:42] mikko pieterh: currently i was prototyping something like:
[09:42] mikko separate tests/ directory
[09:42] mikko but i think it should be possible to create separate objects from same code as well
[09:43] pieterh aaaghhhh.....
[09:43] pieterh it's the reason the man pages are a real pain to maintain
[09:43] pieterh separate directories look very clean organizationally
[09:43] pieterh but they ensure pieces don't get updated
[09:44] pieterh also the test cases are essential documentation, like the rest of the source file
[09:44] pieterh running the selftest in its own directory is a good idea, some tests need to mess with files
[09:44] pieterh but I really, really don't want to find ourselves in the zmq situation of having lots of code that lacks test cases
[09:46] mikko hmmm, this gives me an additional idea
[09:47] mikko in zfl's case code coverage reports would make sense
[09:47] pieterh yes, as an additional insurance
[09:47] pieterh that's meta testing, i.e. testing the test cases
[09:48] pieterh it's a neat idea
[09:48] mikko i'll put this on my todo
[09:49] pieterh there's still space? I'm impressed...
[09:49] pieterh :-)
[09:49] ianbarber speaking of: mikko, did you move the pear server?
[09:50] mikko ianbarber: in the works
[09:51] mikko hmm
[09:51] mikko i guess the easiest would be to put it where rest of the stuff is
[09:51] mikko you can point the dns to
[09:51] ianbarber i'll point both php. and pear. at it
[09:56] mikko looking at the apache rewrite rules this makes me want to use nginx
[09:56] kristsk nginx is about the same, imho
[09:57] mikko kristsk: dynamic virtualhosting seems a lot more fluent in nginx
[10:03] kristsk might be because of nginx's config syntax, it does not feel so archaic
[10:05] kristsk with regard to vhosts, lighttpd is thought to be more powerful
[10:48] Guthur sustrik: do you think having wsapoll on supported win platforms would be good to have?
[11:06] ianbarber pieterh: about?
[11:06] pieterh ianbarber: about 12ish
[11:06] pieterh :-) how can I help you?
[11:06] ianbarber :)
[11:08] ianbarber i discovered the wonderful land of martinique has a fun domain extension, so the PHP extension is now available on and (pear is the PHP package system). Was wondering - do you want to have listen on and as well, i can point in that direction (even if it's just doing a rewrite to
[11:08] ianbarber we can redirect from hosting as well, just seems like if someone does go to just, they should end up at the site. It's on mikko's geo redundant hosting at the mo :)
[11:09] pieterh oh... I like it
[11:10] ianbarber i can point them at if thats sensible - don't know if there are any weird wikidot issues or similar
[11:10] pieterh if you point to, then I'll add it to the custom domains on the website
[11:10] ianbarber cool
[11:10] ianbarber will do
[11:10] pieterh wow, we have a sneaky short domain name, so 2011...
[11:11] pieterh afair you can't point itself to a DNS name, you need to use the IP address there
[11:11] mikko you can
[11:11] mikko CNAME
[11:11] ianbarber should be able to cname it
[11:11] ianbarber yeah
[11:12] pieterh maybe I'm confusing with wildcards, I usually point * etc. to wikidot
[11:13] pieterh cname the heck out of it, ianbarber, I'll add the custom domain entries in an hour or so
[11:14] ianbarber cool :) I've pointed and, so we'll see then :)
[11:15] pieterh would it be worth doing something sneaky like...
[11:15] pieterh -> redirects to ?
[11:16] pieterh I can make that work
[11:16] pieterh ianbarber: DNS seems to have propagated already, that was fast
[11:17] pieterh presumably not cached anywhere
[11:17] ianbarber yeah, www wasn't set up before
[11:17] ianbarber redirect to community sounds like an idea, if that's doable on wikidot
[11:18] pieterh np, give me 5 minutes...
[11:20] enleth mikko: hey, just wanted to say thanks for the PHP bindings for ZMQ, TC and TT - good job!
[11:21] enleth It was pretty amusing when I opened the github page for ZMQ bindings a moment ago, saw your username and thought "well, I know this guy - what else might I be using that he did?"
[11:21] pieterh ianbarber: ok, done, give it a whirl... :-)
[11:22] mikko enleth: my pleasure
[11:25] ianbarber pieterh: i seem to be getting a password page. that's odd
[11:25] pieterh ianbarber: ah, my bad, it's still a private site, will fix immediately
[11:25] ianbarber ah, cool
[11:26] pieterh ianbarber: try again now?
[11:26] ianbarber yep, that's looking good
[11:26] ianbarber very nice!
[11:26] pieterh it's very cool
[14:30] ianbarber pieterh: was thinking, I've noticed that there are a lot of questions on the mailing lists that are solved in broadly the same way, even from people who have read the guide (myself included). I was wondering whether there is any value in some sort of 0MQ pattern library.
[14:30] ianbarber sort of like but with messaging patterns at all kinds of scales
[14:31] ianbarber i like how the generic pattern is described and an example given in each one of those (
[14:32] ianbarber but still pretty simple, 1 page
[14:51] mikko cremes: you can run make check
[14:52] mikko (dont wanna confuse the thread as it has moved on from there)
[14:54] cremes mikko: here are the results:
[14:54] cremes failure...
[14:57] mikko No space left on device
[14:58] cremes how did i not see that?.... bleary eyed after 30 hours of debugging...
[14:58] mikko also, the tests wont output anything but they should assert on failure
[14:58] mikko return code for success is 0
[14:59] cremes oh wait, that out of space condition happened overnight as i was testing something
[14:59] cremes hold on a sec
[15:00] cremes mikko: reload the gist; it now shows all as passing
[15:01] cremes my problem with running the tests was i didn't know the right make target
[15:01] mikko make check is autotools default test target
[15:01] cremes i tried 'make test' and 'make all' but the former didn't exist and the latter didn't seem to run them
[15:01] cremes didn't know that
[15:02] mikko make test seems to be widely used as well
[15:02] cremes looks like all is well; chalk this up to user error
[15:02] cremes yeah, maybe adding it as an additional target would be a nice convenience
[15:02] mikko i'll add that on todo
[15:09] pieterh ianbarber, was eating lunch... back now
[15:10] pieterh imo there would be value in a pattern library but I'll use Sustrik's Law here
[15:10] pieterh find the person to collect and maintain the patterns, and the problem is solved :-)
[15:12] mikko
[15:12] mikko zfl code coverage
[15:13] ianbarber pieterh_: fair point, i do appreciate sustrik's law :)
[15:13] mikko hmm source code missing
[15:14] pieterh ianbarber, you can also apply Pieter's Response to Calls to Action
[15:15] pieterh "Excellent idea, Ian, I'm curious to see how you do it"
[15:15] pieterh Known in ruder groups as nypa :-)
[15:16] pieterh Actually, I do have a more positive idea
[15:17] pieterh When you see a question solved in a way you think is reusable, point me to it, and I'll cover it in the Guide at some stage
[15:17] pieterh there are a lot of chapters waiting to be written
[15:19] ianbarber yeah, i think that's good. the guide really is the basis for shared understanding about it
[15:19] mikko ah
[15:19] mikko finally it works
[15:19] mikko
[15:19] ianbarber i'm happy to do some patterns (at some point!) just wanted to check whether it fitted in with the direction you're taking the guide
[15:25] pieterh mikko: sweet!
[15:25] pieterh ianbarber, I guess the Guide aims to be the bible, eventually
[15:26] pieterh modest aims
[15:27] pieterh we can (and by 'we' I really mean 'you') start by collecting text on a wiki page
[15:27] pieterh that is trivial, shareable, reusable
[15:27] pieterh join the (great name) wiki if you're not already on it, start a docs:patterns page...
[15:27] ianbarber yeah. i think the tricky thing with the guide is balancing it for new users, and for experienced ones
[15:28] ianbarber yep, i'm on it, will do
[15:28] pieterh no problem, really... start with simple stuff, get more advanced as you go along
[15:28] pieterh patterns would be like a cookbook, stand alone section, with some good indexing
[15:28] ianbarber yeah
[15:28] ianbarber that's pretty much the idea, just to have a concise example of different interaction models really
[15:29] pieterh even copy/paste of solutions from the email list is a good start
[15:29] pieterh don't worry about producing prose, that's my speciality
[15:42] mikko hi Steve-o
[15:42] Steve-o hi mikko
[15:43] Steve-o working on new house this week, a foreclosure so many minor issues :/
[15:44] Steve-o back in HK next week and back to work
[15:44] mikko is your house in the states?
[15:44] Steve-o upstate NY
[15:45] mikko are you moving there?
[15:45] Steve-o near Martha Stewart is about the only notable point
[15:46] Steve-o eventually moving there, house prices very cheap so good time to buy
[15:46] Steve-o I have another year for my greencard it looks
[15:48] Steve-o so what is the status on autoconf in zeromq, any more changes required?
[15:49] mikko i think we should get 2.1.0 out before refactoring the openpgm part
[15:49] mikko it seems to be working well with openpgm trunk
[15:50] mikko some open issues to solve but in general good
[15:50] mikko one of them is how to link openpgm if zeromq invokes openpgm build?
[15:50] mikko install and use the shared lib?
[15:51] mikko use the object files directly?
[15:51] mikko etc
[15:51] Steve-o good question, distros would like shared libs,
[15:51] mikko linking libpgm.a into works on linux (assuming libpgm.a is position independent code) but not portable
[15:52] mikko yes, my only fear is the following scenario:
[15:52] Steve-o which is why I don't have a dll on Windows
[15:52] mikko user has libpgm installed, now installs zeromq with openpgm support, zeromq invokes openpgm build and overwrites the existing installation
[15:54] Steve-o well a common solution I have seen to that is to install the dependent library in a sub-directory of the product build instead of the OS preferred location
[15:55] mikko but distros dont like rpath
[15:55] Steve-o For convenience prefer static libraries but allow distributions to use shared libraries.
[15:55] Steve-o so out of the tarball build libpgm.a but allow configure options for
[15:56] mikko but how to use the libpgm.a ?
[15:56] mikko .a inside .so is not really portable
[15:56] Steve-o really? where isn't it valid?
[15:57] mikko i can check, i did a lot of googling on this
[16:01] mikko hp-ux seems to be one
[16:01] mikko is that even supported by openpgm?
[16:01] Steve-o not yet
[16:02] mikko Libtool convenience library
[16:02] mikko sounds like a solution
[16:02] mikko
[16:02] mikko groups together a set of object files
[16:02] Steve-o that's what zeromq is using now
[16:03] mikko but on different side of the fence
[16:04] Steve-o let me read up on HPUX, v10 was fine as I remember they broke various things with 11
[16:04] mikko Steve-o: how does bundling a convenience lib on the openpgm side sound?
[16:04] mikko and then zeromq links that
[16:04] mikko i could at least investigate this as it seems like a portable option
[16:05] Steve-o ok, if you can provide the code, I'm not sure how this is supposed to work with two different projects
[16:06] mikko the ultimate goal i guess is to have both as shared libraries provided by distros
[16:06] mikko but in the meanwhile convenience lib sounds ok
[16:06] mikko i'll put this on my ever growing todo list
[16:07] mikko at least i got ZFL code coverage working today
[16:08] Steve-o using gcov?
[16:08] mikko yes
[16:09] mikko
[16:12] Steve-o nice, it's tedious getting those percentages higher though
[16:13] mikko true. you would almost need to preload a malloc implementation that fails randomly
[16:13] mikko to test all asserts
[16:14] mikko and even then it would be very random
[16:15] mikko might add same thing for zeromq later as well
[16:17] cremes pieterh: ping... where is "zhelpers.h"? i can't compile your mailbugz.c test without it
[16:18] pieterh cremes: sorry!
[16:18] pieterh adding it now
[16:18] sustrik cremes, just replace it with zmq.h
[16:18] pieterh sustrik: nope, that and other stuff
[16:18] sustrik there's nothing used from zhelpers.h in the code
[16:18] sustrik i've just compiled it
[16:18] sustrik aha
[16:18] sustrik replace the line with:
[16:18] sustrik #include <zmq.h>
[16:19] sustrik #include <stdio.h>
[16:19] sustrik #include <string.h>
[16:19] sustrik that works
[16:19] pieterh yes, that works
[16:21] Steve-o mikko: ok so I already have the libtool convenience library, libtool is giving me the shared and static libraries for free
[16:22] mikko Steve-o: i know, but if you link against the .la from zeromq it gives a warning "Warning: won't be deployed"
[16:22] mikko not sure if that can be ignored
[16:22] mikko maybe it can
[16:22] Steve-o is that because of a noinst_ line?
[16:23] mikko i got a local branch here
[16:23] pieterh sustrik, in the pubsub pattern it is IMO a design flaw that zmq_connect is asynchronous
[16:23] mikko Steve-o:
[16:23] mikko these are some of the changes related to zeromq
[16:24] pieterh that is, on a sub socket
[16:25] mikko Steve-o: i tested that with ./configure --without-documentation --with-pgm=/tmp/to/pgm-trunk
[16:28] Steve-o mikko: I can't find anything on that error message in google
[16:29] sustrik pieterh_: why so?
[16:32] zedas sustrik: yep that's linux. why?
[16:33] sustrik there are 2 implementations of zmq_poll
[16:33] sustrik i was just checking which one to have a look at
[16:34] sustrik anyway, what's the problem you were referring to?
[16:35] sustrik ah, the EAGAINs in strace
[16:35] Steve-o mikko: maybe I need to explicitly add a noinst_LTLIBRARIES instead of lib_LTLIBRARIES
[16:35] sustrik i've missed the link, sorry
[16:36] cremes pieterh_: i don't compile a lot of C programs; what's the gcc line to get the example to compile & link?
[16:37] cremes nm, got it
[16:38] mikko Steve-o: gimme a sec
[16:38] mikko getting the exact error message out
[16:40] pieterh cremes: sorry, my irc client's not alerting me for some reason
[16:41] cremes no worries; i compiled the program and ran it successfully
[16:41] cremes no failures
[16:41] cremes so my hypothesis must be wrong as to the cause of the mailbox assertion
[16:41] pieterh at least it's not that simple
[16:42] cremes right
[16:42] pieterh assuming I got the case right
[16:42] pieterh 5M writes, 5M reads...
[16:42] cremes you got it right as i explained it
[16:42] pieterh sustrik: sorry also, I'm not getting beeps...
[16:42] pieterh pubsub fails, for every new user, in the same way
[16:43] pieterh subscriber connects, then misses X milliseconds of messages
[16:43] sustrik ack
[16:43] pieterh i'm not sure doing a synchronous connect would make any difference
[16:43] sustrik it probably won't
[16:43] cremes pieterh_: is it possible to run this under gdb and have it drop into the debugger instead of asserting?
[16:43] pieterh but there is definitely a problem when every user hits the same issue
[16:44] cremes if so, perhaps i could dump the contents of the mailbox?
[16:44] pieterh cremes, afaik usual tactic is to get a core dump and then debug from there
[16:44] pieterh i'm no gdb expert
[16:44] cremes ok, how can i force it to core?
[16:44] sustrik cremes: p
[16:44] pieterh divide by zero?
[16:45] sustrik when you want to dump the content of variable x, type "p x"
[16:45] pieterh assertion failure will produce a core I think
[16:45] pieterh you need to enable core dumps for your process
[16:45] pieterh ulimit -c unlimited
[16:45] cremes yeah, right now i'm set for a core size of 0; i can change that
[16:46] cremes are you sure the assertion causes a core?
[16:46] sustrik cremes: just start the executable under gdb
[16:46] sustrik it will stop and get you gdb prompt when assertion is hit
[16:46] pieterh yeah, and make sure it's compiled and linked for debugging
[16:47] cremes i did run it under gdb several times; the assertion would cause the ruby runtime to throw an exception and exit cleanly
[16:47] cremes so gdb never caught the issue
[16:47] cremes outside of gdb, it would assert
[16:47] cremes very frustrating
[16:47] sustrik :|
[16:48] pieterh my brute force approach would be to add code to 0MQ that dumps the mailbox just before it asserts, under the same conditions
[16:48] pieterh don't waste time trying to get debuggers working unless you already know how
[16:49] cremes i like that suggestion; any suggestion on how to dump the mailbox?
[16:49] sustrik cremes: i would do a bit different thing
[16:49] cremes i.e. are there important components to capture or should i just dump it as a string?
[16:49] cremes sustrik: talk to me
[16:49] sustrik just print some text when mailbox_t::send() is invoked
[16:50] sustrik in your scenario the number of invocations should be pretty modest
[16:50] sustrik if it starts printing a lot of text, there's definitely some problem there
[16:50] cremes sustrik: just any text like "mailbox.send!"
[16:50] sustrik yes
[16:50] cremes ok
[16:51] cremes so you don't care about the contents of the mailbox
[16:51] sustrik not really
[16:51] cremes ok, i'll try that now
[16:51] sustrik if we find out that a lot of commands are being written
[16:52] sustrik we'll have a look at what kind of commands they are
[17:10] pieterh mikko: I'm improving some of the coverage but it's always going to miss on assertions, apparently
[17:15] mikko pieterh_: yes
[17:15] mikko i dont think it calculates those
[17:15] pieterh hey, my beep works now! :-)
[17:15] mikko and 100% is not really a realistic or even desirable aim
[17:15] mikko Steve-o: i think i solved it
[17:15] pieterh ok, I'll improve some of the coverage but like Steve-o says, it gets messy
[17:16] mikko Steve-o: almost. now it compiles twice it seems
[17:17] ianbarber just to be doubly sure
[17:18] ianbarber compare the two, and if they're different fail on a non-deterministic build process
[17:25] cremes sustrik: yes, there are a *lot* of commands sent
[17:25] sustrik ok
[17:25] cremes what's the next step? dump the commands when the mailbox buffer is increased?
[17:26] sustrik can you print out cmd->type?
[17:26] sustrik that will show what kind of commands are being passed
[17:26] cremes sure; on every invocation or just when the buffer size is increased?
[17:26] sustrik on every invocation
[17:26] cremes ok
[17:28] cremes sustrik: i see it's defined as an enum so i can use printf("%d", cmd->type), yes?
[17:29] sustrik printf("%d", (int) cmd->type)
[17:29] sustrik just in case
[17:29] cremes k
[17:31] cremes sustrik: mailbox.cpp:158:34: error: base operand of '->' has non-pointer type 'const zmq::command_t'
[17:32] cremes ??
[17:32] sustrik it should be cmd_.type
[17:32] sustrik sorry
[17:34] cremes clean compile; running now
[17:37] cremes sustrik: here's a sampling of what i see; the cmd is wrapped in TY(cmd) so i can pick it out of the log easily
[17:37] cremes
[17:39] sustrik do you call connect or bind in that app?
[17:40] cremes i call both early on during setup, then i don't need to call it again
[17:41] sustrik ah, both are in the same process
[17:41] sustrik i see
[17:41] sustrik what transport do you use?
[17:41] sustrik tcp? inproc? ipc?
[17:41] cremes tcp
[17:42] sustrik cremes: can you printf something in connect_session_t::detached() function?
[17:42] cremes yes
[17:42] sustrik (that way we'll see if there is a lot of reconnecting happening)
[17:47] cremes sustrik: [cremes@box1 servers]$ grep ^REC t.out | wc -l
[17:47] cremes 921674
[17:47] cremes so yes, lots of reconnects
[17:51] cremes this is a threaded app writing to the same logfile so sequence is a bit suspect
[17:52] cremes however, it appears each REC is always followed by command type 1 or 3 (plug or attach) which kind of makes sense
[17:56] sustrik yep
[17:56] sustrik the question is: why does it reconnect at all?
[17:57] sustrik moreover, the default reconnect interval is 0.1 sec
[17:57] cremes agreed; all transport strings are of the form 'tcp://<port>'
[17:57] sustrik so to get 921674 would require a couple of days
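As a rough check of sustrik's estimate, the sketch below works out the minimum time needed to accumulate the 921674 reconnects counted by the grep, assuming one attempt per 0.1 s default interval. The function name and the six-socket figure are illustrative assumptions and are not taken from the 0MQ source.

```python
def seconds_for_reconnects(count, interval_s=0.1, sockets=1):
    """Lower bound on the wall-clock time needed for `count` reconnect
    attempts when each of `sockets` sockets retries once per interval."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    return count * interval_s / sockets


RECONNECTS = 921674  # the 'grep ^REC t.out | wc -l' count from the log

single = seconds_for_reconnects(RECONNECTS)
shared = seconds_for_reconnects(RECONNECTS, sockets=6)  # assumed socket count

if __name__ == "__main__":
    print(f"one socket:  {single / 3600:.1f} hours")
    print(f"six sockets: {shared / 3600:.1f} hours")
```

Under these assumptions a single socket needs about a day, and even six sockets retrying in parallel need several hours, so a count this large cannot come from reconnects paced at the default interval.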
[17:58] sustrik you mean: "both" rather than "all", right?
[17:59] cremes there is a PUB producer, a FORWARDER device, and multiple SUB consumers in this process
[17:59] cremes they all connect up in the beginning and should never close/reconnect for the life of the program
[17:59] cremes so each one has its own transport connection string; that's what i meant by 'all'
[18:00] sustrik i see
[18:00] sustrik how many SUBs?
[18:01] cremes let's see...
[18:02] sustrik approximately...
[18:02] sustrik tens, hundreds, thousands?
[18:02] cremes 5 in the clients and 1 in the FORWARDER, so about 6 (i might be forgetting one or two)
[18:02] sustrik ok
[18:03] sustrik do you close the FORWARDER before closing the SUBs?
[18:04] cremes they should all terminate at roughly the same time when i interrupt/kill the program
[18:04] sustrik ok
[18:05] cremes otherwise, the FORWARDER never exits
[18:05] sustrik does FORWARDER connect to SUBs or other way round?
[18:05] cremes FORWARDER binds while all clients connect
[18:05] sustrik what about PUB?
[18:05] cremes actually, the IN/OUT sockets on the FORWARDER always bind
[18:06] cremes the publisher connects too as a result
[18:06] sustrik ok
[18:06] sustrik hm, i see no reason then for reconnections to happen
[18:06] sustrik are you 100% sure that the connection strings match?
[18:07] cremes match in what way?
[18:07] cremes they are all tcp?
[18:07] sustrik are they the same on bind and connect side?
[18:07] cremes if they weren't, the data wouldn't flow through my app, yes?
[18:08] sustrik ah, the data flow through
[18:08] sustrik i see
[18:08] sustrik to all 5 subs?
[18:09] cremes yes, the main PUB broadcasts and the 5 subs each sub to everything
[18:09] sustrik and all of them actually get the data
[18:09] cremes if they weren't getting the data, the app would lock (and produce something similar to EFSM in my code)
[18:09] sustrik ok, good
[18:09] cremes it's kind of like an election algo
[18:10] sustrik to be frank, i have no idea what's going on there
[18:10] sustrik if the reconnections happen
[18:10] sustrik one would expect that at least some messages would be lost
[18:10] cremes any idea how i can do 900k reconnects in a few minutes?
[18:10] sustrik no idea
[18:11] cremes <sigh>
[18:11] sustrik have you changed the default RECONNECT_IVL?
[18:11] cremes btw, i ran pieter's mailbugz code with these debug prints in them and it barely puts out anything at all
[18:11] sustrik exactly
[18:11] cremes nope, no changes to RECONNECT_IVL
[18:12] cremes all sockets are allocated in their default state; the one exception is calling setsockopt on the SUBs to set their subscription string
[18:12] cremes and i always set my own IDENTITY
[18:12] cremes someone on the ML suggested a potential IDENTITY collision; could that be related?
[18:13] sustrik maybe
[18:13] sustrik do you have identity collisions there?
[18:13] sustrik like all 5 subs having the same identity?
[18:13] cremes i shouldn't; the identity is always <random id>.<sock type>.<server type> where random id is 0 to 999_999_999
[18:14] cremes it's *possible* there is a collision but *improbable*
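To put a number on "possible but improbable", here is a minimal birthday-problem sketch. It assumes the random ids are independent and uniform over 0 to 999_999_999 and uses the roughly six sockets cremes counted earlier; the function name is hypothetical.

```python
def collision_probability(n_ids, space_size):
    """Probability that at least two of n_ids independent uniform draws
    from space_size possible values are equal (birthday problem)."""
    if n_ids > space_size:
        return 1.0  # pigeonhole: a repeat is guaranteed
    p_all_distinct = 1.0
    for k in range(n_ids):
        # The (k+1)-th draw must avoid the k values already taken.
        p_all_distinct *= 1.0 - k / space_size
    return 1.0 - p_all_distinct


ID_SPACE = 1_000_000_000  # random id ranges over 0..999_999_999
p = collision_probability(6, ID_SPACE)  # about 6 sockets, per the log

if __name__ == "__main__":
    print(f"collision probability for 6 ids: {p:.2e}")
```

For small counts this is close to n(n-1)/(2N), which is 15 in a billion for six ids. The estimate only holds if the generator really produces independent values, and it says nothing about identities that are duplicated by other means.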
[18:14] sustrik try printing them out
[18:15] cremes i'm auditing that right now; give me 5m
[18:22] pieterh cremes, are you sure you're initializing your random number generator?
[18:22] pieterh if not, every client will produce an identical 'random' sequence
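pieterh's point can be sketched as follows. In C, calling rand() without srand() behaves as if the seed were 1, so every process starts from the same state. The example mimics that with two Python generators given the same fixed seed. The identity layout follows cremes's description, while the "SUB" and "worker" labels and the helper name are made up for illustration.

```python
import random


def make_identity(rng, sock_type, server_type):
    """Build an identity of the form <random id>.<sock type>.<server type>."""
    return f"{rng.randrange(1_000_000_000)}.{sock_type}.{server_type}"


# Two "clients" whose generators start from the same state: a stand-in for
# processes that never seed their generator (the seed value is arbitrary).
client_a = random.Random(1)
client_b = random.Random(1)
same_a = make_identity(client_a, "SUB", "worker")
same_b = make_identity(client_b, "SUB", "worker")

# random.Random() with no argument seeds itself from OS entropy or the
# clock, so independently created generators start from different states.
fresh_a = make_identity(random.Random(), "SUB", "worker")
fresh_b = make_identity(random.Random(), "SUB", "worker")

if __name__ == "__main__":
    print("same seed:   ", same_a, same_b)
    print("default seed:", fresh_a, fresh_b)
```

With identical seeds the two identities always match, which is exactly the collision that an unseeded generator would produce across clients.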
[18:23] pieterh cremes: if you're getting reconnects, presumably you're also getting disconnects
[18:23] pieterh and if you can find those, you can find what is causing them
[18:24] pieterh sustrik: how many places does 0MQ forcefully disconnect a subscriber socket without assertion
[18:24] pieterh do we have the sys: transport working?
[18:26] sustrik pieterh_: every time the other side does something unexpected
[18:26] sustrik such as sending malformed frame
[18:26] pieterh yeah, but are there lots of places in the code?
[18:26] sustrik not much, 3-4 i think
[18:26] pieterh right... so a few well-placed prints and we'll know what's happening
[18:26] sustrik sys: works
[18:27] sustrik and should be used exactly for this kind of thing
[18:27] pieterh precisely
[18:27] sustrik the only problem is that it needs some kind of throttling
[18:27] sustrik so as not to get the log overloaded
[18:27] pieterh presumably all we care about are the first 10 messages
[18:27] sustrik i.e. if the same problem happens over and over again
[18:27] sustrik in 10us intervals
[18:28] sustrik only the first one should be reported
[18:28] pieterh add a numeric code and ignore duplicates, standard solution
[18:28] sustrik you need some kind of state machine
[18:29] sustrik if connecting fails, log it and switch to "no log" state
[18:29] cremes alas, it looks to me like they are all unique:
[18:29] sustrik any subsequent connect failures are not logged
[18:29] cremes interestingly, out of all 4 components, only the one that crashes shows the hundreds of thousands of reconnects
[18:29] sustrik when connecting succeeds, switch back to "log" state
[18:29] sustrik thus making next disconnect being logged
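The two-state scheme sustrik describes could look roughly like this. It is a sketch only: the class, its method names and the message text are hypothetical, and nothing here reflects how 0MQ's sys: transport is actually wired up.

```python
class ConnectLogThrottle:
    """Two-state filter: report the first connect failure, then stay quiet
    until a connection succeeds again."""

    def __init__(self, sink):
        self._sink = sink          # callable that receives log strings
        self._log_enabled = True   # the "log" state; False is "no log"

    def connect_failed(self, reason):
        if self._log_enabled:
            self._sink(f"connect failed: {reason}")
            self._log_enabled = False  # suppress the repeats that follow

    def connected(self):
        # A successful connect re-arms logging for the next failure.
        self._log_enabled = True


messages = []
throttle = ConnectLogThrottle(messages.append)
for _ in range(3):
    throttle.connect_failed("peer refused")   # only the first is logged
throttle.connected()
throttle.connect_failed("peer went away")     # logged again after success
```

Here three consecutive failures yield one log entry, and the failure after a successful connect yields a second one, so a tight reconnect loop cannot flood the log.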
[18:30] pieterh you don't need anything that complex IMO
[18:30] pieterh if you get more than 1000 alerts on sys: you can give up
[18:30] pieterh (in a minute, hour, day)
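pieterh's simpler alternative, a numeric code per alert with duplicates ignored and a hard cap on the total, might be sketched like this. The class name, the codes and the cap handling are assumptions for illustration, and the per-minute, per-hour or per-day window he mentions is left out.

```python
class CappedAlertLog:
    """Forward each numeric alert code once and stop entirely after a cap."""

    def __init__(self, sink, max_alerts=1000):
        self._sink = sink              # callable that receives log strings
        self._max_alerts = max_alerts  # give up after this many alerts
        self._seen_codes = set()
        self._count = 0

    def alert(self, code, text):
        """Return True if the alert was forwarded, False if it was dropped."""
        if self._count >= self._max_alerts or code in self._seen_codes:
            return False
        self._seen_codes.add(code)
        self._count += 1
        self._sink(f"{code}: {text}")
        return True


alerts = []
log = CappedAlertLog(alerts.append, max_alerts=2)
first = log.alert(101, "duplicate identity")   # forwarded
repeat = log.alert(101, "duplicate identity")  # dropped: code already seen
second = log.alert(102, "malformed frame")     # forwarded, cap now reached
third = log.alert(103, "connect failed")       # dropped: over the cap
```

Compared with the state machine, this never re-arms, which keeps it simple at the cost of hiding a problem that recurs after the first report.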
[18:30] pieterh cremes, you may want to add prints in the places 0MQ *disconnects* subscribers
[18:31] sustrik cremes: no more ideas, i need a minimal test case
[18:31] sustrik to reproduce it here
[18:31] cremes ok, i'll keep poking at it
[18:32] pieterh sustrik, can you tell cremes where those 3-4 places are?
[18:32] sustrik hm, i don't know precisely
[18:32] sustrik dhammika supplied those patches
[18:33] pieterh it used to be easy 'egrep assert *.cpp'
[18:33] sustrik maybe check the commit log
[18:33] sustrik ?
[18:33] sustrik it's not asserting, it's closing the connections
[18:36] cremes this conversation gave me an idea... i think i am narrowing it down... give me 10m
[18:36] pieterh sustrik, I meant, it *used* to assert and I remember several times chasing down framing errors by sticking printfs into those places
[18:38] sustrik these asserts have been removed via your "0MQ competition" :)
[18:46] cremes sustrik, pieterh_: found it!
[18:46] pieterh :-)
[18:46] cremes i had a duplicate identity on an unrelated XREQ socket!
[18:46] pieterh yay!
[18:47] cremes to reproduce, it's probably just these steps...
[18:47] pieterh sustrik, does zmq already send anything to sys:?
[18:47] cremes 1. create a QUEUE device that binds to some port
[18:47] sustrik pieterh_: no
[18:47] cremes 2. create two XREQ (REQ too?) sockets, set their identity the same and connect them to the QUEUE
[18:47] cremes 3. check for reconnects
[18:48] cremes 4. Maybe need to send some data through first...?
[18:48] pieterh cremes: I'll make a test case later on
[18:48] cremes ok, thanks pieter! your c skills far exceed my own
[18:48] pieterh what do you mean by 'check for reconnects'?
[18:48] cremes thank you both so much for working through this with me; this conversation solved it
[18:49] pieterh i'd like to get a test case that results in a crash
[18:49] cremes i added a debug statement to connect_session.cpp:detach to print whenever it detached and attempted a reconnect
[18:50] cremes let me try to write one in ruby
[18:50] pieterh this still does not explain why the mailbox exploded...
[18:50] cremes then i can tell you exactly what needs to be done in c
[18:50] pieterh yes, make a ruby test case, that's perfect
[18:50] pieterh exploding mailbox gets double score
[18:50] pieterh sustrik: we should start to send stuff to sys: where we used to assert
[18:51] pieterh if you can document how to use sys: from inside zmq I can try that
[18:51] pieterh ideally, a 1-liner that sends a string... :-)
[18:52] pieterh then we can apply that to cremes test case and check that we'd have caught this error
[18:52] sustrik log ();
[18:52] sustrik it's there
[18:53] pieterh ah, it requires all the work of creating a message first
[18:53] pieterh that's tedious
[18:54] pieterh do we have a standardized format for sys://log messages?
[18:54] pieterh sorry to complain but if this was packaged somewhat, it'd be easier for people to use it internally
[18:55] sustrik no format
[18:55] sustrik just use string atm
[18:55] sustrik we can polish the format later on
[18:55] pieterh every single object has a log method?
[18:56] pieterh inherited from object_t?
[18:56] sustrik yes
[18:56] pieterh so the log method there could be somewhat expanded to take a string and create/destroy the msg itself
[18:57] pieterh afaics we don't use this anywhere yet
[18:57] sustrik sure
[18:57] pieterh and then we need a documented parsable format for messages
[18:57] pieterh minimal
[18:57] pieterh easy to improve later
[18:57] sustrik ack
[18:57] pieterh ok, I'll try my hand at this, apologies in advance...
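(Editor's sketch) The exchange above asks for two things: a log call that takes a plain string, and a minimal, parsable format that can be improved later. The log does not say what format was eventually chosen, so the following is only an illustration in Python rather than 0MQ's C++ core. The separator, the field order, and the names `format_log` and `parse_log` are all assumptions, and the sample text is made up.

```python
import time

FIELD_SEP = "|"  # hypothetical separator, not taken from 0MQ


def format_log(component, text, level="WARN", now=None):
    """Build one log line from a plain string: time|level|component|text."""
    ts = time.time() if now is None else now
    # The free-form text goes last so that it may itself contain the separator.
    return FIELD_SEP.join([f"{ts:.3f}", level, component, text])


def parse_log(line):
    """Split a line produced by format_log back into its four fields."""
    ts, level, component, text = line.split(FIELD_SEP, 3)
    return {"time": float(ts), "level": level, "component": component, "text": text}


if __name__ == "__main__":
    line = format_log("connect_session", "detached, attempting reconnect")
    print(line)
    print(parse_log(line))
```

Keeping the text as the final field means only the first three separators are significant, so the format stays parsable without any escaping rules. The component name must not contain the separator.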
[19:11] cremes yes! i have a reproducible crasher in ruby!
[19:12] cremes pieterh_: do you want the ruby code or an explanation for translation to c?
[19:12] pieterh cremes, I think we need to log two issues here
[19:13] cremes ok, i can create the issues, but i only see one
[19:13] pieterh (a) lack of any warning to the app developer
[19:13] pieterh (b) mailbox crash
[19:13] pieterh (b) is the critical one, and the ruby example will be valuable there
[19:13] cremes ok, so (a) is for tracking a new feature request to add the sys: stuff, yes?
[19:13] pieterh yes
[19:13] cremes ok, i'll write them up
[19:14] pieterh well, we don't track new feature requests, so perhaps skip (a)
[19:14] cremes i'll add it to the wiki 3.0/roadmap page
[19:14] pieterh i'm working on it now... :-)
[19:15] cremes ok!
[19:27] cremes pieterh_: preview this issue and let me know if you need more details to reproduce in c:
[19:27] pieterh cremes, thanks!
[19:27] cremes pieterh_: i've spent the last 96 hours banging on this! i'm happy to see it solved!
[19:28] pieterh that's why i'm doing the sys://log stuff, it's insane to lose so much time to a missing warning
[19:28] cremes honestly, i'm taking the rest of the day off.... i feel deflated
[19:29] pieterh sustrik, what's the correct way to work with a msg in the zmq core?
[19:30] pieterh ::zmq_msg_t or is there a message class I'm missing?
[19:39] enleth Hello
[19:39] enleth mikko: is the API documentation at supposed to be inaccessible?
[19:46] ianbarber enleth: check
[19:46] ianbarber references probably need updating
[19:50] mikko enleth: yes
[19:55] enleth ianbarber: thanks, that's it.
[19:56] enleth mikko: can I suggest a 302 redirect to the new address?
[19:56] pieterh cremes: still there?
[19:56] enleth The old one is all over the latest git tree
[20:01] mikko done
[20:01] cremes pieterh_: for a bit more; what's up?
[20:01] pieterh just wondered if you need to actually use the REQ/REP sockets to create the crash
[20:01] pieterh or just bind them and BOOM
[20:02] pieterh s/bind/connect
[20:02] cremes let me see... give me 1m
[20:03] cremes pieterh_: nope, crashes without using them; good catch... it's even *more* reduced now
[20:03] pieterh excellent...
[20:03] pieterh thanks a lot
[20:03] cremes i'm no longer thinking clearly otherwise i would have tried that :)
[20:04] pieterh it's been a long day :-)
[20:04] cremes pieterh_: looks like you *do* need the REQ socket too
[20:05] cremes a pair of REP's with the same ID is insufficient
[20:05] cremes it's been a long *week*
[20:05] pieterh ack, you need a pair of sockets with one disconnecting the other
[20:05] pieterh presumably, I'll test that, it applies to all relevant socket types
[20:05] pieterh it's been a long *year*!
[20:05] cremes perhaps...
[20:05] pieterh hang on...
[20:06] pieterh :-)
[20:06] cremes heh
[20:10] pieterh cremes: bingo, I reproduced it!
[20:10] cremes awesome!
[20:11] cremes once started it only takes a few seconds to exhaust that buffer even when it's 5MB!
[20:11] pieterh just connect two req sockets with same ID, wait 1 second...
[20:11] pieterh I'm going to try with other socket types now
[20:16] pieterh cremes: it affects all socket types
[20:16] pieterh any combination of bind/connect, even pub connecting to sub
[20:17] cremes wow
[20:17] cremes this *might* explain a lot of people's problems; there are several issues open about this assertion
[20:18] pieterh ironically 0MQ used to assert before :-)
[20:18] cremes oh, the irony... :(
[20:19] cremes well, i'm just glad it's no longer a mystery
[20:19] pieterh anyhow, this makes it much easier to solve properly
[20:19] cremes other than this, i haven't hit an assertion in a long time
[20:22] pieterh indeed, we had a competition to kill them :-)
[22:30] jol pieterh: nice talk at fosdem, I just watched it.
[22:40] Steve-o thx mikko
[23:53] dan hello
[23:53] mikko hi
[23:53] dan i've got a question about zmq
[23:54] mikko go ahead
[23:54] dan is there any reason I should not be able to implement a pubsub connection with one side in python and the other in cpp over ipc?
[23:55] mikko no reason
[23:55] mikko should be perfectly ok
[23:55] dan hm
[23:56] mikko you are not seeing any messages?
[23:56] dan i see them when I use tcp, but not when i use ipc
[23:56] mikko can i see the code?
[23:56] dan whats the best way to share it?
[23:56] dan copy paste in here?
[23:56] mikko
[23:57] dan sure - let me copy the code
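(Editor's sketch) dan's code does not appear in this part of the log, so the cause of his problem is not known here. As a baseline for comparison, this is a minimal sketch of the Python side of a PUB/SUB pair over `ipc://`, assuming pyzmq is installed and the platform supports Unix domain sockets. The endpoint path, the `demo` topic prefix, and the function name are hypothetical. It uses an absolute path because an `ipc://` endpoint is a filesystem path, and two processes started in different directories would resolve a relative path differently.

```python
import os
import tempfile

import zmq  # pyzmq, assumed to be installed


def ipc_pubsub_roundtrip(payload=b"demo hello", timeout_ms=2000):
    """Publish over ipc:// and return what a subscriber in this process receives."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical endpoint: an absolute path that both sides agree on.
        endpoint = "ipc://" + os.path.join(tmp, "pubsub.sock")

        ctx = zmq.Context()
        pub = ctx.socket(zmq.PUB)
        sub = ctx.socket(zmq.SUB)
        try:
            pub.bind(endpoint)
            sub.connect(endpoint)
            # A SUB socket receives nothing until it subscribes to a prefix.
            sub.setsockopt(zmq.SUBSCRIBE, b"demo")

            poller = zmq.Poller()
            poller.register(sub, zmq.POLLIN)
            waited = 0
            while waited < timeout_ms:
                # Resend each round: messages published before the
                # subscription is in place are dropped.
                pub.send(payload)
                if poller.poll(50):
                    return sub.recv()
                waited += 50
            return None
        finally:
            pub.close(linger=0)
            sub.close(linger=0)
            ctx.term()


if __name__ == "__main__":
    print(ipc_pubsub_roundtrip())
```

A C++ peer would use the same `ipc://` endpoint string. If this works over tcp but not over ipc, the endpoint path and the permissions on the socket file are reasonable first things to compare between the two processes.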