Tuesday May 3, 2011

[Time] Name Message
[00:06] neopallium iFire: lua-zmq now has support for zmq_stopwatch_*() functions and the perf/*.lua benchmarks have been updated to use that instead of socket.gettime().
[05:21] Steve newbie question here - I just downloaded the windows version of zeromq, and opened up the project in visual studio 2008
[05:22] Steve two projects - inproc_lat, and inproc_thr show as unavailable. were they removed? or what do i do to get them in place?
[05:22] Steve (so far all I've done is downloaded the zip file from the "Grab the Software" page)
[05:53] iFire bah
[07:59] kryptom hi there, i'm using zeromq 2.1.6 and trying out the ruby samples (hwclient and hwserver). In the zguide it says you can start the client (REQ), then start the server, or restart the server (REP) with a running client. But when I try it, the samples "hang". Is this normal behaviour?
[08:06] pieterh kryptom: it should work
[08:25] kryptom sorry, got distracted
[08:26] kryptom i would agree it should work ...
[08:28] kryptom so, I start the client, start the server, then quit the server and start the server again - then the client hangs
[08:32] sustrik the request may be lost
[08:33] sustrik if it was processed by the server when it was shut down
[08:35] kryptom checking the examples again ...
[09:19] steve_k I'm having a problem. I'm trying to test HelloWorld cross/platform cross/language. I have the HelloWorldServer built in Java on Fedora 14, and two clients on Windows XP - One Java and One Python.
[09:20] steve_k The Java client talks to the Java server just fine, but if I run the Python Client, the Java Server crashes with: Assertion failed: (msg_->flags | ZMQ_MSG_MASK) == 0xff (zmq.cpp:223)
[09:20] Guthur steve_k: are you setting idenitities
[09:21] Guthur identities*
[09:32] sustrik steve_k: looks like you (or java binding) are using invalid zmq_msg_t
[09:33] sustrik can you produce a minimal test case and file a bug for the jzmq project
[09:33] sustrik ?
[09:39] steve_k sorry - got disconnected.
[09:39] steve_k where should I submit the bug?
[09:40] steve_k the jzmq git repository?
[09:48] djc 2.0 and 2.1 should be over-the-network compatible, right?
[09:48] guido_g yes
[09:52] djc k, thanks
[09:52] djc are there any good docs on the compatibility of major, minor and micro releases?
[09:52] djc in terms of API, ABI and network framing
[10:00] pieterh djc:
[10:01] steve_k2 hmph... I went about building a test case and I ran the same script from linux instead of windows and it worked fine.
[10:01] pieterh but it's not complete, e.g. I'd like a policy on backwards compatibility of wire level protocols
[10:02] steve_k2 I'm going to try it from an external machine to see if that makes a difference. Otherwise, it must be a problem with my windows install of the python module.
[10:04] steve_k2 still - it's a bit scary that a client can crash the server so easily.
[10:06] guido_g trade-off between speed and angst
[10:11] Soleo Hi, does anyone know something about the Queue Device? Should I add a lock for each thread queued?
[10:11] sustrik steve_k2:
[10:12] sustrik Soleo: no
[10:14] Soleo what if I am writing the same file in each thread?
[10:21] Guthur that issue is really orthogonal to the queue device
[10:24] Guthur possibly you want something like this: <fan out> Process stuff <fan in> write file
[10:38] djc pieterh: can you add something there about network protocol/framing compat?
[10:39] pieterh djc: what would you like me to add?
[10:39] djc ah, nm, read your statement about protocol
[10:39] djc if there is no policy, would be hard to describe it
[10:40] djc if the network protocol doesn't change that often, perhaps it would suffice to have an exhaustive list of actual versions that changed it
[10:40] pieterh you mean for the future?
[10:41] pieterh the protocol will definitely change for 3.x
[10:41] djc yeah, or even for the current versions
[10:41] pieterh there's only one current stable version, 2.1.x
[10:41] djc i.e. 2.0 and 2.1, which I bet a lot of people with production stuff will be upgrading soon
[10:41] pieterh all 2.x versions use the same protocol
[10:41] djc and I've also noticed a 2.2 repo, not sure what that's about? all the talk seems to be about 3.x
[10:42] djc okay, well, that would be useful info for that wiki page
[10:42] pieterh well, 2.2 is for ongoing evolution of the 2.1 packaging
[10:42] djc what do you mean by "packaging" exactly?
[10:42] pieterh well, 2.2 is for two things
[10:43] pieterh a. new functionality that can't go into 2.1 because that's for bug fixes only
[10:43] pieterh b. new functionality packaged with libzmq, e.g. czmq
[10:43] djc so what's the difference between 2.2 and 3.x?
[10:43] pieterh 3.x breaks the API and wire protocol totally...
[10:44] pieterh whereas 2.2 is ... hey, this is explained by the wiki page :)
[10:44] pieterh 2.2 is meant to be a *compatible* evolution of 2.1
[10:44] pieterh whereas 3.0 is incompatible in several major ways
[10:45] pieterh I'm not 100% sure we'll need the 2.2 branch but it's there and I maintain it
[10:45] djc okay, sounds good
[10:45] djc yeah, I guess the wiki page is actually quite comprehensive
[10:46] pieterh there needs to be a page describing present/future versions, maybe I'll add that as a section to this page...
[10:53] pieterh djc: OK, I've summarized the current release situation on that wiki page
[10:54] pieterh will you review that and let me know if it's ok?
[10:55] djc it seems mostly fine
[10:56] djc it might be useful to bring up ZMTP sooner
[10:56] djc since it seems to provide useful vocabulary
[10:56] pieterh Indeed... hang on...
[10:58] pieterh done
[11:00] djc better, thanks
[11:04] pieterh thanks to you... this is good stuff
[11:47] pieterh
[12:32] sustrik pieterh: fix to blast:
[12:33] pieterh ja?
[12:33] sustrik line 56 should be: int rc = connect (handle, (struct sockaddr*) &sin, sizeof (sin));
[12:33] pieterh ok, done and pushed
[12:47] sustrik pieterh: the problem doesn't seem to happen with 3.0
[12:47] pieterh sustrik: interesting, does it run for 1M iterations?
[12:48] sustrik dunno
[12:48] sustrik the server just keeps printing
[12:48] sustrik "Received Hello"
[12:48] pieterh lol
[12:48] pieterh well, that's a good sign
[12:48] pieterh the blast program sends 1GB of random data
[12:49] sustrik heh
[12:49] sustrik the blast seems to fail:
[12:49] sustrik sustrik@istvan:~/libzmq/src$ ./blast
[12:49] sustrik Test failed at count 47
[12:54] djc I just upgraded a C++ server thingy from 2.0 to 2.1
[12:54] djc and now it keeps asserting
[12:54] sustrik pieterh: ok, the blast failure is expected
[12:54] djc from zmq.cpp:223
[12:54] sustrik it's "connection refused by peer"
[12:55] sustrik djc: yes, that's what we are talking about now
[12:55] sustrik djc: any chance to get a minimal test case?
[12:56] djc doesn't seem to be a whole lot I need to do
[12:57] sustrik what do you do?
[12:57] djc hmm, there might be threads involved
[12:58] sustrik ok, forget about it them
[12:58] sustrik then
[12:58] djc well, not sure, this should all happen on the main thread
[12:58] djc but the output is weird
[12:58] sustrik the blast client pieter wrote just now is allegedly reporting the same error
[12:58] djc
[12:59] sustrik djc: does that happen when the peer starts sending messages?
[12:59] sustrik or even before that?
[13:00] djc the req/rep doesn't have a peer at this point
[13:00] djc I'm testing with the PUB now
[13:00] sustrik so the PUB fails even before there's a SUB connected?
[13:01] djc oh yeah, there's no SUB or REQ involved here
[13:01] sustrik ok
[13:03] djc sustrik: here's something more elaborate:
[13:03] djc
[13:04] djc where nothing should be coming in on this->side, and nothing's listening on this->pub
[13:04] djc (no one is connected to this->side, even)
[13:04] djc mmm
[13:05] sustrik djc: let it be for now
[13:05] sustrik we have to sort this out with pieterh
[13:05] djc that's wrong, it only fails if something is connected to this->side, but not sending requests
[13:05] sustrik probably a problem with backporting
[13:06] djc it looks like it somehow tries to process a connection event on the REP socket as a message
[13:12] djc ah, more info: it triggers when I try to connect using 2.0.9, but works correctly when trying to connect using 2.1.6
[13:15] sustrik yeah, the relevant code was added to 2.0.6
[13:15] sustrik 2.1.6 i mean
[13:16] djc you mean, the assert?
[13:16] sustrik yes
[13:16] sustrik checking for validity of zmq_msg_t objects
[13:19] djc so is the check too tight or should that function not be getting that message?
[13:21] sustrik several possibilities
[13:34] sustrik djc: still there?
[13:34] djc yup
[13:34] sustrik can you try a patch for me?
[13:34] djc sure
[13:34] sustrik open src/decoder.cpp
[13:34] sustrik line 112
[13:34] sustrik replace it by this:
[13:34] sustrik in_progress.flags = tmpbuf [0] | ~ZMQ_MSG_MASK;
[13:35] sustrik and try whether it helps
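
The validity check quoted in the assertion above, and the one-line decoder change, can be sketched in isolation. This is a minimal illustration only: the constant values (MORE = 0x01, SHARED = 0x80, hence a mask of 0x81) are assumptions made for the example, and the function names are hypothetical, not taken from libzmq.

```c
/* Assumed flag values for illustration; the real ones are defined in zmq.h. */
#define MSG_MORE   0x01
#define MSG_SHARED 0x80
#define MSG_MASK   (MSG_MORE | MSG_SHARED)

/* Same shape as the asserted condition: every bit outside the mask
   must already be set for the message to count as valid. */
int flags_valid(unsigned char flags)
{
    return (flags | MSG_MASK) == 0xff;
}

/* Same shape as the proposed decoder line: take the flags byte read
   from the wire and force all bits outside the mask to 1. */
unsigned char decode_flags(unsigned char wire_byte)
{
    return (unsigned char) (wire_byte | ~MSG_MASK);
}
```

Under these assumptions a raw wire byte such as 0x01 fails the check, while the same byte passed through decode_flags passes it and still carries the MORE bit.
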
[13:36] djc okay, give me a moment
[13:38] pieterh sustrik: re, was out
[13:38] sustrik see the above fix
[13:38] sustrik it should help
[13:38] sustrik doesn't apply to 3.0 though
[13:39] sustrik the code there has already changed
[13:39] djc sustrik: seems to work just fine
[13:39] sustrik ok
[13:39] djc at least I can't reproduce as before
[13:39] sustrik pieterh: can you apply the fix?
[13:39] pieterh sustrik: should I try that patch wrt blast too?
[13:39] pieterh :) ok
[13:39] sustrik blast is ok
[13:39] sustrik ah
[13:39] sustrik i did
[13:40] sustrik it fails elsewhere
[13:41] pieterh sustrik: it's going to fail in lots of places before it's fully robust :)
[13:41] pieterh Assertion failed: msg_->flags & ZMQ_MSG_MORE (rep.cpp:81)
[13:41] pieterh I've created a branch, we can fix these one by one
[13:48] sustrik yup, same in 3.0
[13:48] sustrik not that easy to fix though
[13:53] pieterh do you think we can fix these in 2.1?
[13:54] pieterh it's mostly to be on malformed messages, of course
[13:54] pieterh actually I need to make a test case with all socket types
[14:00] sustrik malformed frames can be solved easily
[14:00] sustrik malformed messages as such is a problem
[14:01] sustrik as the I/O thread currently has no idea of what socket type it is handling
[14:01] sustrik and thus can't check the validity of socket-type-specific protocol
[14:03] pieterh indeed
[14:03] pieterh however it shouldn't assert when it gets confused...
[14:04] pieterh So I've replaced line 81 in rep.cpp: zmq_assert (msg_->flags & ZMQ_MSG_MORE);
[14:04] pieterh with
[14:04] pieterh if (!msg_->flags & ZMQ_MSG_MORE) {
[14:04] pieterh errno = EFAULT;
[14:04] pieterh return -1;
[14:04] pieterh }
[14:04] pieterh and it no longer asserts
[14:04] pieterh at least on a rep socket test
[14:05] pieterh eventually runs out of resources: Assertion failed: new_sndbuf > old_sndbuf (mailbox.cpp:183)
[14:21] sustrik hm, that's hiding on a problem
[14:21] sustrik of*
[14:21] pieterh you mean returning -1?
[14:21] sustrik yes
[14:21] sustrik it won't assert but it'll leave state machine in an inconsistent state
[14:22] pieterh hmm, let me check if that's the case...
[14:24] pieterh it appears the FSM remains in a consistent state
[14:25] sustrik there's a lot of state there
[14:25] sustrik the current pipe
[14:26] sustrik the "more" flag
[14:26] pieterh what is weird is that it no longer asserts but I don't get an error rc either
[14:26] sustrik and so on
[14:26] sustrik strange
[14:26] pieterh could be my test is wrong, let me add some brackets
[14:27] pieterh ah, better
[14:29] pieterh Ok, so I'm getting the error return properly
[14:29] pieterh it does not get into FSM confusion
[14:29] pieterh but I'd need a more sophisticated test case to know if it recovers properly
[14:30] pieterh also what it should do is close the socket when it gets garbage, not continue reading off it
[14:30] sustrik that's the point
[14:35] Guthur <pieterh> could be my test is wrong, let me add some brackets ... Does this work for all testing
[14:36] pieterh Guthur: the code had to be "if (!(msg_->flags & ZMQ_MSG_MORE)) {"
[14:36] Guthur though as a lisper I would always say there are never enough parentheses
[14:37] Guthur ...and that's not even Lisp, hehe
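
The bracket problem discussed above comes down to C operator precedence: unary `!` binds tighter than binary `&`. A minimal sketch, assuming a flag value of 0x01 and using hypothetical function names:

```c
/* Assumed value for illustration; the real constant is defined in zmq.h. */
#define MSG_MORE 0x01

/* How the compiler parses `!flags & MSG_MORE`: the negation applies to
   flags alone, written here with explicit parentheses. */
int more_missing_buggy(unsigned char flags)
{
    return (!flags) & MSG_MORE;
}

/* The intended test: true when the MORE bit is not set. */
int more_missing_fixed(unsigned char flags)
{
    return !(flags & MSG_MORE);
}
```

For any non-zero flags byte the buggy form evaluates to 0, so the error branch is never taken. That would be consistent with the code no longer asserting but also not returning an error.
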
[14:44] djc so, if I have a connected REQ socket, and the REP server restarts
[14:44] djc it looks like the REQ will reconnect and send new messages fine
[14:45] djc but the response from the REP doesn't make it to the client
[14:45] djc does that make sense?
[14:50] sustrik not really
[14:50] sustrik do you have a test case?
[14:51] djc I saw it a few times, doesn't seem to be very reproducible
[14:51] sustrik what can happen is that the server is killed while it processes the request
[14:52] sustrik in such case the request is lost
[14:52] djc no, that's not the case here
[14:54] sustrik the strange thing about it is that from server's PoV, the reconnect is simply a new client
[14:54] sustrik so it should behave in standard way
[14:54] djc yeah
[14:58] pieterh djc: are you using explicit identities?
[14:58] djc pieterh: nope
[14:58] pieterh ok
[14:59] pieterh djc: is it possible you're not checking the return code from zmq_recv or zmq_send?
[14:59] djc I was checking it
[15:00] djc but actually it seemed like the zmq_send() on the server was blocking
[15:01] pieterh djc: I guess the point is without a concrete test case it's practically impossible to investigate
[15:02] djc yeah
[15:22] djc will that assert fix (with the msg decoder) appear in a release soon?
[15:22] djc or would it be valuable to put the patch in our distribution
[15:23] sustrik that's up to pieterh
[15:24] pieterh djc: that fix will go into 2.1.7, yes
[15:24] pieterh sustrik: do we have an issue number for that decoder fix?
[15:24] djc pieterh: do you have an intended ETA for that? sorry to be nagging all the time, I'm trying to use zmq-2.1 in our production setup but if there's issues it's kind of scary
[15:25] sustrik nope
[15:25] sustrik it's bug on 2-1
[15:25] pieterh sustrik: no open issues for it?
[15:25] djc btw, I just tried to start working with ZMQ_FD, which is what I was actually upgrading for, but so far I haven't gotten it to work
[15:25] sustrik it doesn't exist in libzmq
[15:25] sustrik so, i guess, the issue should be in zeromq2-1 repo
[15:26] sustrik djc: what's the problem?
[15:26] djc poll doesn't seem to fire
[15:26] pieterh sustrik: well, the libzmq repo has all the issues ever reported...
[15:27] sustrik djc: beware! it's edge-triggered
[15:27] sustrik pieterh: well, why not use issue 209 then?
[15:27] pieterh I see issue 206
[15:28] djc sustrik: right, I read that in the docs
[15:28] djc does that mean it will only work with epoll, or something?
[15:28] sustrik yes, 206 seems to be the same thing
[15:28] pieterh sustrik: can you verify whether that 1-line fix actually solves 206?
[15:28] pieterh that does not look like futzing, but a real bug in zmq
[15:28] pieterh no test case, so I mean "verify" as in "give an opinion"
[15:29] sustrik nope, there's no test case in 206
[15:29] sustrik ah
[15:29] sustrik i guess it's related
[15:29] pieterh OK, I'll push the fix to 2.1 master and people can test it from github
[15:29] sustrik djc: it means it fires only on state change
[15:30] sustrik have a look at epoll docs for explanation of edge-triggering
[15:31] djc sustrik: well, for me it doesn't even trigger the first time I send something to it
[15:32] sustrik presumably there was no state change
[15:32] joelr folks, what os specifies pathnames like this?
[15:32] sustrik what you should do is:
[15:32] joelr rc = zmq_bind(socket, "ipc:\/\@@//@@tmp/feeds/0"); assert (rc == 0);
[15:32] djc pieterh: I had a real issue with 2.0.9 client (REQ) connecting to 2.1.6 server (REP), which was fixed by sustrik's patch
[15:32] pieterh djc: you can help push the 2.1.7 release by testing the fix for 206, see my last comment on
[15:33] joelr this is for /* Assign the pathname "/tmp/feeds/0" */
[15:33] sustrik 1. check ZMQ_EVENTS
[15:33] pieterh djc: ... hmm, can you confirm that's the same as 206 (we assume it is)
[15:33] sustrik 2. if POLLIN is set read the message
[15:33] sustrik 3. if it's not, poll on ZMQ_FD
[15:33] pieterh joelr: don't copy/paste code from the Guide
[15:33] pieterh it'll come out weird
[15:33] sustrik 4. when ZMQ_FD triggers goto 1
[15:33] pieterh please grab the github sources and use that
[15:34] joelr pieterh: that's what's in the guide, take a look
[15:34] djc sustrik: can I start at 3 if I assure no one is connected to it at start time?
[15:34] pieterh joelr: this is a bug in the formatting
[15:34] joelr pieterh: what's the correct format then?
[15:35] sustrik joelr: a valid path + filename
[15:35] sustrik djc: no
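
The four steps listed above can be sketched with a toy model. Everything named mock_* below is a hypothetical stand-in built on a plain pipe, not the libzmq API: mock_events plays the role of reading ZMQ_EVENTS, and the pipe's read end plays the role of ZMQ_FD. With libzmq both values would presumably come from zmq_getsockopt.

```c
#include <poll.h>
#include <unistd.h>

/* Toy stand-in for a socket (hypothetical, not libzmq). */
struct mock_socket {
    int pending;       /* messages waiting to be read */
    int signaled;      /* 1 while an unread byte sits in the pipe */
    int signal_fd[2];  /* [0] read end (the "ZMQ_FD"), [1] write end */
};

int mock_open(struct mock_socket *s)
{
    s->pending = 0;
    s->signaled = 0;
    return pipe(s->signal_fd);
}

/* Queue one message; signal only on the empty -> non-empty edge. */
int mock_deliver(struct mock_socket *s)
{
    int was_empty = (s->pending == 0);
    s->pending++;
    if (was_empty && !s->signaled) {
        char byte = 1;
        if (write(s->signal_fd[1], &byte, 1) != 1)
            return -1;
        s->signaled = 1;
    }
    return 0;
}

/* Consumes the edge signal, then reports POLLIN while messages are queued. */
int mock_events(struct mock_socket *s)
{
    if (s->signaled) {
        char byte;
        if (read(s->signal_fd[0], &byte, 1) != 1)
            return -1;
        s->signaled = 0;
    }
    return s->pending > 0 ? POLLIN : 0;
}

int mock_recv(struct mock_socket *s)
{
    if (s->pending == 0)
        return -1;
    s->pending--;
    return 0;
}

/* Receive `want` messages; returns the count, or -1 on timeout or error. */
int recv_loop(struct mock_socket *s, int want, int timeout_ms)
{
    int got = 0;
    while (got < want) {
        int events = mock_events(s);          /* 1. check the events */
        if (events < 0)
            return -1;
        if (events & POLLIN) {                /* 2. readable: read a message */
            if (mock_recv(s) == 0)
                got++;
            continue;
        }
        struct pollfd pfd;                    /* 3. otherwise poll on the fd */
        pfd.fd = s->signal_fd[0];
        pfd.events = POLLIN;
        pfd.revents = 0;
        if (poll(&pfd, 1, timeout_ms) <= 0)
            return -1;
    }                                         /* 4. fd fired: back to step 1 */
    return got;
}
```

In this model, once the edge signal has been consumed the descriptor stays quiet even though messages remain queued, which illustrates why polling the descriptor without first checking the events can block.
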
[15:35] pieterh I've patched it manually now
[15:35] pieterh ipc:// followed by path name, so ipc:///tmp/etc
[15:35] joelr pieterh: thanks
[15:36] pieterh joelr: I'll fix the API doc generation tool, this is a bug... sorry about that
[15:36] joelr np
[15:41] djc sustrik: still doesn't work
[15:41] sustrik what happens?
[15:41] djc there's no POLLIN on the ZMQ_EVENTS before the poll()
[15:42] djc and poll() doesn't fire on the ZMQ_FD after
[15:42] sustrik even though there's a message, right?
[15:42] djc yeah
[15:42] sustrik then it's a bug
[15:42] sustrik do you have a test case?
[15:43] djc I can reproduce this, it's not exactly minimal
[15:43] sustrik goodo
[15:43] sustrik open an issue for it then
[15:43] djc let me try something first
[15:44] djc nope, doesn't work, I'll file an issue
[15:57] djc sustrik: okay, I was being stupid, sorry about that
[15:58] djc ZMQ_FD works just fine
[15:58] sustrik great :)
[16:25] djc if you want to reuse a zmq_msg_t, do you have to init AND close every time?
[16:41] pieterh djc: yes
[17:58] CindyLinz hi ^^
[17:59] CindyLinz I can use libev with ZMQ_FD for PULL / SUB / DEALER / ROUTER correctly
[17:59] CindyLinz But it seems not work with REQ?
[17:59] CindyLinz the ZMQ version is 2.1.6
[18:14] sustrik are you using some kind of 0mq-libev glue project or are you doing it by hand?
[18:17] CindyLinz sustrik: I think.. it's by hand... Here is my code
[18:18] sustrik ok
[18:18] sustrik what's the problem you are seeing?
[18:18] CindyLinz the "!!" has never showed up
[18:20] sustrik what happened instead?
[18:20] CindyLinz the code of the other side:
[18:21] CindyLinz The other side will printed the request message from the REP
[18:21] CindyLinz and response some messages
[18:21] CindyLinz but the REP seems not notified
[18:21] pblasucci Hello
[18:21] sustrik sleeping in poll?
[18:22] sustrik hi
[18:22] CindyLinz But if i uncomment the code from 112 to 121, it will get the response
[18:22] pblasucci I'm not sure if this is the right place to mention this, but I've just put an F# binding to 0MQ up on GitHub
[18:22] CindyLinz no~ I can continue to send other requests.
[18:22] CindyLinz But nothing back
[18:23] sustrik pblasucci: great
[18:23] sustrik have you linked it from
[18:23] pblasucci I tried
[18:23] pblasucci but after logging in, it says I don't have permission
[18:24] sustrik CindyLinz: hard to say, try to create a minimal test case
[18:24] pblasucci perhaps I was on the wrong page
[18:24] sustrik have you joined the wiki?
[18:24] CindyLinz sustrik: wait~~
[18:24] sustrik you have to be a member to have edit rights
[18:25] pblasucci Ah, I think I must not be a member
[18:25] pblasucci I thought having a Wiki account was enough
[18:25] pblasucci I'll go join now
[18:31] pblasucci Apologies but I can't figure out how to become a member
[18:31] CindyLinz sustrik: This is the test case, thank you ^^
[18:36] sustrik pblasucci: there used to be a "join wiki" button at
[18:37] sustrik CindyLinz: i meant a minimal C program I can check
[18:38] sustrik presumably without libev as i have no idea of how that works
[18:40] pblasucci sustrik: hmmm... I don't see anything like that
[18:40] sustrik it must have changed
[18:40] sustrik let me invite you
[18:40] sustrik what's your name at wikidot?
[18:40] pblasucci Great! Thanks.
[18:40] pblasucci pblasucci
[18:41] sustrik ok, wait a sec
[18:41] CindyLinz sustrik: hmm.. how can i just remove the libev part, while the problem is not-notified libev?
[18:41] sustrik dunno, the problem is either in 0mq or in libev
[18:41] sustrik you should find out which one is it
[18:42] sustrik if it's 0mq we'll fix it
[18:42] sustrik if it's libev you should speak to libev devs
[18:42] CindyLinz Is it help, if I also give a working dealer version?
[18:43] CindyLinz (help or helpful.. my English is bad :( )
[18:43] sustrik np
[18:43] CindyLinz hmm..
[18:45] sustrik pblasucci: you are invited
[18:45] pblasucci Thanks!
[18:46] pblasucci Yup... looks like I'm all set.
[18:46] sustrik goodo
[18:46] sustrik CindyLinz: you can remove libev from the test case by doing what libev does by hand
[18:46] sustrik i assume its polling on the fd
[18:47] CindyLinz sustrik: i'll try.. can i use select(2) ?
[18:48] sustrik well, you should have a look what libev does
[18:49] sustrik if you do something else the problem may not show up
[18:50] CindyLinz Then i try first if the select(2) version work?
[18:50] sustrik it's up to you
[18:51] sustrik what i need is simplest possible program that shows the problem
[19:24] jond sustrik: the subports idea as originally suggested can still work over multicast....
[19:31] sustrik jond: how so
[19:31] sustrik ?
[19:32] sustrik jond: btw, have you done any perf tests with the alignment patch?
[19:33] jond well in some other software I'm aware of we express multicast addresses like x.x.x.x:5555@1,2,3
[19:34] jond and 1,2,3 are the subports.
[19:35] jond re: alignment, been busy
[19:36] jond what did you suggest as the test case throughput/latency?
[19:36] sustrik ah, i see, presumably filtering on the receiver
[19:36] sustrik latency is more relevant
[19:36] sustrik the point is that i've tried it at home
[19:36] sustrik and got no difference between patched/unpatched versions
[19:36] jond re: multicast, yes the subport goes in a header, and 0 is reserved for control
[19:37] sustrik ack
[19:38] jond did you try both mpales and my patches?? I have vtune @ work which might shed some light, if I knew exactly how to interpret the results
[19:39] jond are you surprised or can explain why it makes no difference; it looks like it should?
[19:40] sustrik i've tried just your patch
[19:40] sustrik it's better, so i've tested just that one
[19:40] sustrik and yes, i am surprised
[19:41] sustrik but maybe the problem is that both threads were scheduled on the same CPU core
[19:41] sustrik so that cacheline pollution became irrelevant
[19:44] jond well I did notice that the mpales patch didn't put an alignment on the struct because it used padding rather than gcc attributes. The Sutter article on false sharing suggests we should be able to see something but I do not know what tools he was using
[19:44] jond I didn't alter the yqueue so maybe that needs doing as well
[19:52] sustrik yeah, yqueue may benefit from that kind of thing as well
[19:52] sustrik we probably need a more controlled test
[19:52] sustrik using ypipe to pass data between two threads bound to different CPU cores (or even CPUs)
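
The padding-versus-alignment distinction mentioned above can be sketched as follows. The struct and field names are hypothetical and not taken from either patch, the 64-byte cache line is an assumption, and C11 `alignas` stands in for the gcc attribute.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size */

/* Padding only: the two fields sit 64 bytes apart, but the struct keeps
   the alignment of `long`, so it may start in the middle of a cache line. */
struct padded_only {
    long producer_pos;
    char pad[CACHE_LINE - sizeof (long)];
    long consumer_pos;
};

/* Explicit alignment: each field starts on its own cache line and the
   struct itself is cache-line aligned. */
struct aligned_fields {
    alignas (CACHE_LINE) long producer_pos;
    alignas (CACHE_LINE) long consumer_pos;
};

static_assert (offsetof (struct padded_only, consumer_pos) == CACHE_LINE,
               "padding separates the fields by one cache line");
static_assert (alignof (struct padded_only) < CACHE_LINE,
               "padding alone does not raise the struct's alignment");
static_assert (alignof (struct aligned_fields) == CACHE_LINE,
               "alignas raises the struct's alignment");
```

A padded struct that happens to start mid-line can still share a cache line with a neighbouring object, which may be one reason a padding-only patch shows different results from an aligned one.
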
[19:56] jond yes. I also noticed someone suggesting replacing socketpair with ypipe+pipe.
[21:22] CIA-75 libzmq: Martin Sustrik master * r5e329ba / src/msg.cpp : Minor patch to keep ICC compiler happy ...
[21:22] CIA-75 libzmq: Martin Sustrik master * rceb5e1a / (src/msg.cpp src/msg.hpp): Deallocation functions in zmq.h and msg_t class are consistent. ...