Wednesday September 8, 2010

[Time] Name Message
[08:12] CIA-20 zeromq2: 03Martin Sustrik 07master * r91ea204 10/ (13 files in 2 dirs): EINTR returned from the blocking functions -
[08:29] lestrrat sustrik: almost.... s/Daisuke Make/Daisuke Maki/ ;) But yes, thanks for the EINTR fix!
[11:58] keffo pieterh, Why I need a custom lb;
[11:58] keffo :)
[12:07] pieterh keffo: if I can figure out wtf I can do with an XREP talking to an XREP, I'll finish this part of the Guide today
[12:07] pieterh turns out least-recently used is indecently easy to do
[12:11] keffo That would ameliorate at least part of the problem :)
[12:11] keffo Except not 'successful send' in my case, but when the answer is returned :)
[12:12] keffo It wouldn't be very difficult for me to specify routing manually though, bypassing the zmq lb altogether
[12:12] keffo but I need to finish heartbeat system first
[12:13] pieterh well, here's my take on custom LB...
[12:13] keffo I'm thinking pub -> sub->xreq back to source, and that would give me both route & availability?
[12:13] pieterh you want to use a XREP socket to do the routing
[12:13] pieterh you can use various algorithms to drive it
[12:13] pieterh depending on the socket type it talks to
[12:14] keffo I'm thinking a priorityqueue of all known/active nodes, then 'nudging' the weights based on speed/performance
[12:14] pieterh e.g. if your workers use req sockets, you get a nice LRU
[12:15] pieterh you don't need a priority queue, just the message from the worker saying "ready"
[12:15] keffo hum, explain how req gives lru?
[12:15] pieterh well, think of HTTP long poll
[12:15] pieterh flow is worker says "ready", router provides it workload
[12:16] pieterh workers say "ready" when they're done...
[12:16] pieterh ready messages come into queue, router takes one off each time
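The flow pieterh describes here (workers announce "ready", the router hands work to whichever worker has been waiting longest) can be sketched with a plain FIFO queue. This is a toy model in standard-library Python, not 0MQ code: the class `LruRouter`, its methods, and the worker and task names are all hypothetical, and no sockets are involved.

```python
from collections import deque


class LruRouter:
    """Toy model of least-recently-used routing (not the 0MQ API)."""

    def __init__(self):
        # Workers that have said "ready", oldest announcement first.
        self.ready = deque()

    def worker_ready(self, worker_id):
        # A worker says "ready" at startup and again after each finished task.
        self.ready.append(worker_id)

    def route(self, task):
        # Take one "ready" message off the queue per task.
        if not self.ready:
            return None  # nobody is available; a real router would hold the task
        return self.ready.popleft(), task


router = LruRouter()
for worker in ("A", "B", "C"):
    router.worker_ready(worker)

assignments = [router.route(task) for task in ("t1", "t2")]
router.worker_ready("A")                 # A finished t1 and reports back
assignments.append(router.route("t3"))   # goes to C, which has waited longest
assignments.append(router.route("t4"))   # goes to A
assignments.append(router.route("t5"))   # no worker left in the queue
```

No priority queue is needed in this model: the order of "ready" announcements is the scheduling order.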
[12:20] keffo uh, must draw on paper..
[12:21] pieterh hey, let me post this section of Ch 3, see if it makes any sense at all
[12:21] pieterh :-)
[12:21] keffo I think I need two prioqueues to be honest, one for picking the best of available set of nodes, and one for handing the most urgent tasks
[12:21] keffo let me know when you finish it!
[12:22] guido_g playing the "naming awareness" card: this isn't REQuest/REPly anymore then, right?
[12:24] pieterh nope, this ain't kansas anymore
[12:24] pieterh this is "off-road"...
[12:27] pieterh basically it's about using identities and constructing envelopes so that XREP will route as we want to other nodes
[12:27] pieterh So it starts here:
[12:27] pieterh I just posted what I have...
[12:33] keffo looks good
[12:35] keffo except the pub/sub ctrl+c sample code needs linebreaks
[12:37] keffo hum, it gets progressively difficult to read after that :)
[12:37] pieterh what sample is that?
[12:38] pieterh :-)
[12:38] keffo something got weird halfway down the article, it's still embedded in the code preview
[12:38] pieterh hang on, refresh the page
[12:38] pieterh there was a fault in the text, i already fixed it
[12:39] keffo ignore, reloaded :)
[12:39] keffo lots of good food for thought there
[12:39] guido_g ack
[12:39] guido_g fig 22, the dashes are misplaced
[12:40] pieterh guido_g: this is ditaa being too clever
[12:40] pieterh it thinks '-' in text is me trying to draw a line
[12:40] guido_g ouch
[12:40] pieterh ill try a unicode n-dash or something
[12:40] guido_g so no dash between single and hop and a slash for req/rep :)
[12:41] pieterh it's always "request-reply"...
[12:42] guido_g ahh... then...
[12:42] pieterh np, easy to fix this in the parser
[12:45] pieterh bah, ditaa does not like Unicode at all, just dies
[12:45] pieterh well, too bad
[12:55] guido_g no escape mechanism?
[12:57] pieterh no mechanisms of any kind afaics
[12:57] guido_g oh
[12:57] pieterh which is ok, I'll email the author
[13:46] CIA-20 zeromq2: 03Martin Sustrik 07master * r47e87b7 10/ include/zmq.h : EMTHREAD error code returned to zmq.h to ensure backward compatibility -
[14:07] sd88g93 greetings !
[14:08] sd88g93 when are the zmq_msg_move() zmq_msg_copy() functions meant to be used ? i mean, what situations call for them ?
[14:27] cremes sd88g93: whenever you *send* data, the library takes ownership of your zmq_msg_t and calls close() on it
[14:27] cremes on the recv side, you are responsible for its lifecycle
[14:28] cremes so you might want to make a copy (which i think does reference counting internally) so that you could send it without it being deallocated
[14:28] cremes the copy is cheap (ref counting)
[14:28] cremes i don't really know of a use case for move
[14:29] sd88g93 cremes: what about VSM's, where the message is embedded right in the msg struct ?
[14:29] cremes i don't know
[14:29] sd88g93 ohok
[14:29] cremes i'd have to read the code to figure out what happens
[14:30] sd88g93 well, what about if you are receiving messages of a multi part message and relaying them through inproc to another thread ?
[14:30] sd88g93 is it alright to just receive and send the same msg ?
[14:30] cremes yes
[14:30] cremes recv does not automatically call close on the message, only sending does
[14:30] sd88g93 i'm receiving a multipart message by way of a req/rep socket, and want to relay them to a publisher
[14:31] sd88g93 but i'm relaying them first through inproc to the main thread, and then forwarding them to the one pub skt
[14:31] cremes oh, so two hops?
[14:31] sd88g93 the inproc is failing, and just trying to figure out why
[14:31] sd88g93 yeah, 2 hops
[14:32] cremes well, i'm not sure... i haven't played much with inproc transports
[14:32] cremes as an optimization it might be doing ref counting on the message so that it doesn't have to copy it for real
[14:32] sd88g93 actually, it would be 3 hops, once to the thread, then once to the main thread, and then the forward device forwards them to the final socket for a send
[14:33] cremes only count sockets as hops
[14:33] sd88g93 yeah, so 3 hops
[14:34] cremes why don't you try using copy on the message received that you are sending to the inproc socket and see if that helps (shooting in the dark here)
[14:34] sd88g93 yeah, i was trying move
[14:34] sd88g93 but maybe copy would be better
[14:34] cremes so, XREQ(tcp) -> XREP(inproc) -> PUB(tcp)
[14:34] sd88g93 i dont need to use it after relaying it
[14:34] cremes copy at the XREQ step
[14:35] sd88g93 actually, i was using PUB for the inproc
[14:35] cremes the main thing to recall is that whenever you send through a socket, it calls close on the message you just sent
[14:35] cremes you relinquish control of the message to the library whenever you send anything
[14:36] cremes so it sounds to me like when you send to the inproc socket, the message may be getting closed on the send side
[14:36] cremes this is all kind of confusing :)
[14:36] cremes i recommend you ping mato or sustrik
[14:37] sd88g93 in the worker thread, it is a PUB skt, and then in the main thread i have a sub skt, so i publish from the thread via inproc, and then it gets fwd'd to the main pub skt
[14:37] sd88g93 oh ok
[14:37] cremes are you using a forwarder device or did you roll your own?
[14:38] sd88g93 fwd device
[14:38] cremes are you using release 2.0.9?
[14:38] sd88g93 yeah, i just updated to 2.0.9
[14:38] cremes ok... i think there was a bug in the forwarder for multi-part message handling that was fixed in 2.0.9
[14:39] cremes that could have been your problem, but if you updated then maybe not
[14:39] sd88g93 at first i had my own queue to queue the messages for a publisher thread, that worked, except i'm worried about when i stress test it, if passing messages allocated on the heap, is a problem when i free them
[14:39] sd88g93 yeah, i noticed that in the change log when i updated
[14:39] cremes i'd have to see code to be sure but it doesn't sound like a problem
[14:40] cremes sorry i couldn't be of more help... if you figure it out please post a solution
[14:40] sd88g93 sure, no problem, i realize its still a young project, one of the ramifications
[14:41] sd88g93 over all, ive noticed the documentation for zeromq getting much better and more concise as time goes on
[14:42] sd88g93 i'll try using message copy
[14:44] cremes great; you might also consider posting your question (with more detail) to the ML
[14:52] pieterh sd88g93: in fact I don't know what cases need zmq_msg_move and zmq_msg_copy
[14:53] sd88g93 ok, thanks, pieterh
[14:53] pieterh perhaps if you want to make abstraction layers that want to grab a copy of a message, work with it, and allow the caller to continue and possibly deallocate it
[14:53] pieterh you certainly do not need them for normal work afaics
[14:53] cremes pieterh: if you want to send the same data to multiple sockets, you need to copy it
[14:54] sd88g93 yeah, the move/copy didnt work for my situation
[14:54] cremes otherwise your first send will deallocate it out from under you, right?
[14:55] mato cremes: send never deallocates anything out from under you. zmq_msg_t is reference counted.
[14:55] cremes mato: if refcount is 1 and you send the message, it gets deallocated, right?
[14:55] sd88g93 oh, so a call to send increments the ref counter
[14:55] cremes i would think it decrements the counter
[14:56] pieterh ... this is all invisible... you send as often as you like and don't even know there's a reference counter
[14:56] sd88g93 i mean, increment at first, and then decrement when its done with it
[14:56] cremes recv, refcount + 1, send, refcount - 1
[14:56] cremes when 0, dealloc
[14:56] mato cremes: zmq_send() does not touch the refcount
[14:56] mato cremes: you should always do zmq_msg_close () if you want the message to go away
[14:56] cremes doesn't it call zmq_msg_close?
[14:57] mato not unless something changed; let me check 100%
[14:57] cremes ok
[14:57] cremes i thought i saw that when i reviewed the code a few weeks ago
[14:57] mato pieterh: no way
[14:58] mato what all the perf tests and examples do is send, then immediately close
[14:58] mato unless i'm confused, check with sustrik to get it from the horse's mouth
[14:58] pieterh yes, that's the calling code
[14:58] mato but this is what i've always done in my code
[14:58] pieterh what does the send method do internally to keep the message alive
[14:59] pieterh while the caller does zmq_msg_close?
[14:59] mato ah, sorry
[14:59] mato yeah, send increases the refcount by 1 obviously
[14:59] mato but you get a refcount of 1 already when you create the message
[14:59] pieterh using copy, I assume... ?
[14:59] mato *refcount*
[14:59] mato what copy?
[14:59] sd88g93 oh, so it increases by one, then when its done sending , it decreases the ref count
[15:00] pieterh so directly modifying message property?
[15:00] mato some lock-free magic with atomic_counter AFAIK
[15:00] mato yes
[15:00] mato all the docs say "zmq_msg_t is opaque", never touch it outside of the zmq_msg_* functions
[15:00] mato for precisely this reason
[15:00] sd88g93 what about with VSM's ?
[15:00] mato sd88g93: it's up to 0MQ if it *actually* copies the *content*
[15:01] pieterh mato: this is what zmq_msg_copy does, increment refcount
[15:01] sd88g93 so, if you have a VSM, and you send it, and then close it, if you re-init the struct, you wipe out what send would have ?
[15:02] mato pieterh: i'm not sure what it actually does internally, why do you need to know?
[15:02] sd88g93 as far as i can tell, the VSM is right inside the struct
[15:02] mato pieterh: anyhow, if you need to know, ask sustrik
[15:02] pieterh sure
[15:02] mato sd88g93: ask sustrik on the mailing list, i'm not an expert on what happens internally
[15:02] pieterh the discussion here is "when do we use zmq_msg_copy/move"
[15:02] mato i just know that it works :)
[15:02] pieterh indeed...
[15:03] pieterh i'm just saying, applications do not normally mess with the refcounts, ever
[15:03] sd88g93 mato, i'm just having a problem with relaying a message i get from a REP socket over PUB skt using inproc
[15:03] pieterh that's internal, and i assume that send() calls copy() to grab a reference, and close() when it's done, to release it
[15:03] mato well, unless you count calling zmq_msg_close() as "messing with the refcount"...
[15:03] pieterh well, sure, but that's not the semantic
[15:03] pieterh the semantic is "I'm finished with it", which translates to "decrement refcount"
[15:04] mato yup
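The semantics agreed on in this exchange (a new message starts with a refcount of 1, send takes its own reference, and close means "I'm finished with it") can be illustrated with a small model. This is a sketch of the behaviour as described in the chat only; mato says above that he is not sure what libzmq does internally, so the classes `Content` and `ToySocket` and their methods are hypothetical and do not mirror libzmq's implementation.

```python
class Content:
    """Toy reference-counted message content (not libzmq's zmq_msg_t)."""

    def __init__(self, data):
        self.data = data
        self.refcount = 1      # the creator holds the first reference
        self.freed = False

    def acquire(self):
        self.refcount += 1

    def release(self):
        # "I'm finished with it": drop one reference, free at zero.
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


class ToySocket:
    """Holds queued content until it is 'written to the wire'."""

    def __init__(self):
        self.queue = []

    def send(self, content):
        content.acquire()      # the socket keeps its own reference
        self.queue.append(content)

    def flush(self):
        sent = [c.data for c in self.queue]
        for c in self.queue:
            c.release()        # the socket is done with the content
        self.queue.clear()
        return sent


msg = Content(b"hello")
sock = ToySocket()
sock.send(msg)                       # refcount is now 2
msg.release()                        # caller closes right after sending
alive_after_close = not msg.freed    # the socket's reference keeps it alive
wire = sock.flush()                  # last reference dropped, content freed
```

In this model the send-then-close pattern mato mentions is safe, because the content survives until the socket has released its own reference.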
[15:04] pieterh sd88g93: the problem is elsewhere
[15:04] mato sd88g93: don't forget zmq_recv() destroys existing message content, if any
[15:05] pieterh sd88g93, do you have a minimal example?
[15:05] sd88g93 i have some code, but so far, i have been unable to duplicate it in a minimal example,
[15:05] mato sd88g93: so e.g. if you've got a loop, doing zmq_msg_init() then zmq_recv(), what could be happening is you recv, then send that message, but then recv back into the same message so you could lose messages
[15:05] sd88g93 i posted to the ML one example that i thought i duplicated it , but it didnt
[15:06] pieterh sd88g93, see the request-reply broker example:
[15:06] pieterh it copies messages from one socket to another, nothing bizarre
[15:06] pieterh if you're losing messages, perhaps the sockets aren't connected properly
[15:07] mato sd88g93/pieterh: AFAICT zmq_msg_copy()/zmq_msg_move() are for the case where the app for some reason re-uses message "objects" (i.e. the zmq_msg_t) with a different lifecycle than the message *content*
[15:07] pieterh right
[15:08] mato sd88g93: what language are you using?
[15:08] sd88g93 the c bindings
[15:08] pieterh sd88g93, what is the problem you are actually seeing?
[15:09] sd88g93 here's the example i posted on the mailing list:
[15:09] sd88g93 that's a minimal case, that one works, but it mirrors what i'm doing in the larger project
[15:10] pieterh so publisher is not sending anything out
[15:10] sd88g93 for some reason in the larger project the messages in the main thread arent being received via inproc by the subscr socket
[15:10] sd88g93 yeah, the publisher isnt sending
[15:10] sd88g93 ive isolated it to over the inproc that its failing
[15:10] pieterh try changing the transport to ipc
[15:10] sd88g93 when i change it to one thread , and publish directly to the bound external interface, it sends fine
[15:11] pieterh so it's a problem with thread coordination, connections not happening in time
[15:11] sd88g93 what's the difference between inproc and ipc ?
[15:12] pieterh well, inproc requires that you bind first, connect second
[15:12] pieterh ipc like tcp lets you connect first, then bind
[15:12] sd88g93 the connections seem to be happening, ive logged it and they seem to happen, the bound channels are made before the connect happens
[15:12] pieterh so with inproc if the subscribers connect before the publisher binds, it won't work
[15:12] pieterh IMO connect will even return an error
[15:13] sd88g93 no, it binds first, and then connects
[15:13] sd88g93 that shouldnt be a problem, i think you are correct in that it returns an error, ive received that error when that happens
[15:13] mato sd88g93: this is what i meant with my comment about recv'ing into the same message:
[15:13] pieterh yeah, this is your problem I think
[15:13] mato sd88g93: see the difference between the correct/incorrect case
[15:13] mato sd88g93: just in case that's what you're doing...
[15:14] pieterh you're starting the publishers in child threads
[15:14] pieterh but immediately you connect the subscribers to those ports
[15:14] pieterh won't work
[15:14] pieterh magic solution: do two loops, one to launch pub threads, second to prepare subscribers
[15:15] pieterh do sleep(1) in between two loops
[15:15] pieterh you owe me a beer if it works
[15:16] pieterh here is the text from the user guide:
[15:16] pieterh The {{inproc}} transport has a specific limitation compared to {{ipc}} and {{tcp}}: **you must do bind before connect**. This is something future versions of 0MQ may fix, but for today it has some impact on the way you use inproc sockets. In the example here, we carefully bind to each endpoint before connecting to it. If you connect first, and then bind, the connect will return an error, and if you ignore that error, the recv will block.
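The bind-before-connect rule quoted above can be modelled without 0MQ. The sketch below swaps pieterh's suggested sleep(1) for an explicit `threading.Event`, which is a different technique from the one proposed in the chat: the publisher thread signals once its bind has happened, and only then does the main thread connect. `ToyInproc`, the endpoint names, and the choice of `ConnectionRefusedError` are all assumptions made for illustration, not 0MQ behaviour.

```python
import threading


class ToyInproc:
    """Toy endpoint registry: connect fails unless bind came first."""

    def __init__(self):
        self._bound = set()
        self._lock = threading.Lock()

    def bind(self, endpoint):
        with self._lock:
            self._bound.add(endpoint)

    def connect(self, endpoint):
        with self._lock:
            if endpoint not in self._bound:
                # Stand-in for the error the Guide text says connect returns.
                raise ConnectionRefusedError(endpoint)


registry = ToyInproc()

# Connecting before anyone has bound fails in this model.
try:
    registry.connect("inproc://early")
    early_failed = False
except ConnectionRefusedError:
    early_failed = True

bound = threading.Event()


def publisher():
    registry.bind("inproc://pipe-0")   # hypothetical endpoint name
    bound.set()                        # tell the main thread the bind is done


thread = threading.Thread(target=publisher)
thread.start()
bound.wait(timeout=5)                  # explicit handshake instead of sleep(1)
registry.connect("inproc://pipe-0")    # safe: the bind has already happened
connected = True
thread.join()
```

An explicit signal removes the guesswork of a fixed sleep, at the cost of a little extra plumbing per publisher thread.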
[15:17] sd88g93 mato: i dont think that's the case, because i re init the message at the beginning of the loop
[15:18] sd88g93 pieterh: it seems to be binding first before connect
[15:18] pieterh how do you know?
[15:19] pieterh your example does the bind after the connect
[15:19] sd88g93 because i print it out
[15:19] pieterh means nothing
[15:19] sd88g93 oh
[15:19] pieterh printing from multiple threads can do weird stuff
[15:19] pieterh follow the printf with fflush(stdout) to be sure
[15:19] pieterh sd88g93, next time you have an error, if 0MQ reports an error, that's kind of the first thing you look at
[15:20] pieterh "I'm using inproc and my connect fails with an error... why?"
[15:20] sd88g93 but the connect doesnt fail ,
[15:20] keffo is the log pub in 2.0.8?
[15:21] pieterh keffo__, nope
[15:21] pieterh sd88g93, so what fails?
[15:21] pieterh and could you try the 'magic fix' i suggested, it's a quick test
[15:21] sd88g93 the program doesnt throw any errors, it just doesnt publish any of the messages
[15:22] sd88g93 yes, i'll try that test
[15:22] pieterh " i think you are correct in that it returns an error, ive received that error when that happens"
[15:22] pieterh ?
[15:22] sd88g93 sorry, i'm just reading over what everyone has written, its alot to digest
[15:22] pieterh NP :-)
[15:22] sd88g93 i'll try that fix now, thanks,
[15:22] pieterh did you try it with ipc:?
[15:23] pieterh if that works, and inproc does not, it's most likely the bind/connect order
[15:37] sd88g93 no, still nothing
[15:37] sd88g93 i'll try it with ipc , just to be sure
[15:37] pieterh ack
[15:39] pieterh there is some questionable stuff in your example BTW
[15:39] pieterh like threads modifying shared static variables
[15:40] sd88g93 yeah, i dont do that in the main program
[15:41] sd88g93 the shared static variable is just to get a number to form the pipe number back to the main program, in the actual example i pass it in through the parameter when creating the thread
[15:42] pieterh hang on, do you actually have any subscribers connecting to
[15:42] sd88g93 yes
[15:43] pieterh well, what I would do is just break this up into little pieces
[15:43] sd88g93 that example does seem to work, but the same scenario, doesnt play out in the real project
[15:43] pieterh ah, it works in the example?
[15:43] sd88g93 yeah, just cant duplicate it , that example was just to give you an idea of what i am doing
[15:44] pieterh can't debug an idea... :-/
[15:44] sd88g93 sorry, didnt mean to confuse
[15:44] sd88g93 yeah, trying to isolate where its coming from
[15:44] pieterh so you can write your own forwarding code
[15:44] pieterh it is trivial
[15:45] pieterh then you can debug that quite easily
[15:45] sd88g93 but when i try a loop in the main program to read what's coming from the subscriber socket via inproc, nothing comes from it
[15:45] pieterh your problem maybe is that by starting an in-built device you have a black box you can't inspect
[15:45] pieterh so it's not the publisher not sending, it's the subscriber not receiving?
[15:46] sd88g93 the publisher isnt making it to the outer publisher socket, nothings coming out
[15:46] sd88g93 there's not errors from the zmq_send function,
[15:47] sd88g93 but when i monitor the socket with tshark, there's nothing coming out
[15:47] pieterh sd88g93, sorry, I can't really help you any more until/unless you can get a minimal test case that reproduces the problem
[15:47] sd88g93 yeah, that's the problem
[16:33] cremes sd88g93: i recommend instrumenting the forwarder device; just have it print what it receives and what it sends out again
[16:33] cremes if you are using multi-part messages with the topic in the first part, then perhaps it isn't getting forwarded correctly and
[16:33] cremes your subscribers are dropping the messages because the topic doesn't match
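cremes' hypothesis can be made concrete with a short model. 0MQ SUB sockets filter by prefix against the start of the message, so if a forwarder loses the topic part, the subscription no longer matches and the message is silently dropped. The sketch below is plain Python, not the 0MQ API: `matches`, `forward`, `broken_forward`, and the sample topic are hypothetical, and the "broken" forwarder is an invented bug used only to show the symptom.

```python
def matches(subscriptions, first_part):
    """Prefix match of the first message part against each subscription."""
    return any(first_part.startswith(prefix) for prefix in subscriptions)


def forward(parts):
    # A correct forwarder relays every part, topic included.
    return list(parts)


def broken_forward(parts):
    # Hypothetical bug: the topic part is lost on the way through.
    return list(parts[1:])


subscriptions = [b"weather."]
message = [b"weather.london", b"12C"]

delivered = matches(subscriptions, forward(message)[0])
dropped = not matches(subscriptions, broken_forward(message)[0])
catch_all = matches([b""], broken_forward(message)[0])  # empty prefix matches all
```

Subscribing with an empty prefix while debugging, as in the last line, is one way to tell "nothing is arriving" apart from "messages arrive but are filtered out".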
[16:43] erickt are any of the pyzmq devs around today?
[17:20] ModusPwnens hmm cremes, yesterday you said that if the performance tests were working correctly, remote_thr should be the first one to exist
[17:22] ModusPwnens exit*
[17:22] cremes right; the publisher should be done before the subscriber
[17:22] cremes but that doesn't always happen on my machine so i'm a tad confused
[17:23] cremes unfortunately, i don't have time to look at it right now... maybe someone else can take a peek
[17:23] ModusPwnens Yeah, what I am seeing is the remote continuing to run
[17:23] ModusPwnens until it times out or something
[17:24] cremes i recommend you add some print statements to the tests and run them with low iterations; perhaps something obvious will pop out
[17:29] ModusPwnens ok, will do
[18:53] mikko hi
[18:55] mikko what happened with zmq_close semantics?
[19:21] mato mikko: the infrastructure work is all on 'master'
[19:21] mato mikko: but i've not yet had time to test it properly due to "real" work :-(
[20:45] bennymack hey everybody! So, i'm looking into 0MQ a bit and so I was strace() a simple example in perl and I noticed a lot of these 24byte send/recv calls on the pipes between the main thread and the 0MQ context thread
[20:45] bennymack is there something I can read about all that chatter that's going on?
[20:46] bennymack is it the SPB?
[21:09] sd88g93 bennymack: you might want to read this, if you havent already:
[21:09] sd88g93 although i dont think it gets into the low level stuff that youre talking about
[21:12] sd88g93 cremes: ive managed to replicate the error i get here:
[21:12] mato bennymack: what you see is signalling between the application thread and zmq I/O thread, see zmq::signaler_t and read the source :-)
[21:12] sd88g93 we were talking about it this morning, (or for you , i guess that would be afternoon)
[21:13] sd88g93 oh, mato, you were here this morning, i managed to replicate the error i was having in this program :
[21:14] mato sd88g93: hmm, that's long, do you have a 5-second summary?
[21:14] sd88g93 ok,
[21:14] sd88g93 the top program fields requests from a client skt, and feeds them to different threads
[21:15] sd88g93 it relays the multipart messages to a pub skt
[21:15] sd88g93 (the second program is a client app to send it messages )
[21:15] sd88g93 the problem is in the top program, it receives the messages fine , but doesnt relay the messages to the central publisher socket in the main thread
[21:16] sd88g93 they should be easy to compile, just gcc -otestpub testpub.c -lzmq for the top program
[21:16] mato sd88g93: so did you find your problem?
[21:16] sd88g93 i replicated the same problem in that code
[21:17] sd88g93 its in the loop in the top program, (in the thread)
[21:17] mato sorry, it's kind of late here ...
[21:17] sd88g93 oh ok no problem
[21:17] bennymack ok thanks! I'll take a look
[21:17] sd88g93 I can post to the mailing list
[21:17] mato that might be best
[21:18] sd88g93 yeah, still 5p.m. here
[21:18] jond sd88g93: is this the same problem as the code on the mail list. that needed a sleep after the pthread_create?
[21:18] sd88g93 it might be, not sure which one youre thinking of
[21:18] mato jond: hi Jon... am in the middle of replying to your mail re Git
[21:19] mato jond: is your patmatch branch a private topic branch? i.e. noone pulling from it?
[21:19] jond mato: hi , thanks, also what does 'Your branch is behind 'origin/master' by 86 commits, and can be fast-forwarded.' mean?
[21:20] jond mato: yes, patmatch private topic branch which I intend to run format patch from and all be hunky dory
[21:20] mato jond: if it's your local 'master' branch that means you can pull from it and you're 86 commits behind...
[21:21] mato jond: the "can fast forward" stuff basically means you've not made any commits to your local master branch that would require a merge
[21:21] jond mato: yep local master; It's been behind ever since I did a hard reset that chuck posted, but i'm not clear how to get rid of it
[21:21] mato jond: git pull
[21:22] jond mato: this is after a git pull
[21:23] jond mato: git pull says this now error: Untracked working tree file 'src/connect_session.cpp' would be overwritten by merge.
[21:23] mato jond: oh, right, you have changes you've not committed on local master
[21:23] mato jond: what do you want to do with those?
[21:24] mato jond: git status will show you what git thinks about your working tree state
[21:24] jond mato: there shouldnt be any, this all started when I pulled the 2.1 tree then reset to 2.0.8. I havent created those files or edited?
[21:24] mato jond: what does git status say?
[21:25] mato jond: alternatively, do you have any changes you care about in that repository?
[21:25] jond mato: git status gives me the 86 commits behind and some untracked files ; do i need to remove those?
[21:27] jond mato: thats fixed master
[21:27] mato ?
[21:27] jond i removed the 'overwritten' files
[21:27] mato you probably wanted git checkout -- <path>
[21:28] mato does git status now say "working directory clean" ?
[21:29] jond mato: yes, after I removed some other note files I had left there
[21:29] mato jond: goodo, then you should be able to pull, and follow the instructions in my email
[21:30] mato jond: but if you want to be sure you don't mess up just take a backup copy of your local repo
[21:30] mato jond: (cp -r zeromq2 zeromq2.bak or something)
[21:30] jond mato: will do , many thanks
[21:30] mato ok, i'm off... cyl
[21:31] jond mato: cyl
[21:31] sd88g93 good night , mato
[21:31] mato nite, thx
[21:31] sd88g93 ok, just posted to the ML