Tuesday September 14, 2010

[Time] Name Message
[08:45] pieterh sustrik: ping?
[08:45] sustrik pieterh: pong
[08:45] pieterh is there any zero copy mechanism on reading messages?
[08:45] pieterh i.e. directly into application buffers
[08:46] sustrik no, you get the buffer from 0mq
[08:46] sustrik you can then store it as long as you wish
[08:47] pieterh ok, thx
[09:04] ptrb so you can `zero-copy' the buffer data beyond the lifetime of the zmq_msg_t?
[09:10] sustrik ptrb: no, but you can keep zmq_msg_t around as long as you wish
[09:10] ptrb roger.
[09:18] pieterh ptrb: I've added a problem solver to the guide
[09:18] pieterh
[09:20] ptrb cool
[09:21] ptrb the "Yes" path on "Are you using XREP sockets?" isn't labeled :D
[09:21] pieterh thanks...!
[09:21] ptrb but that is definitely helpful
[09:21] ptrb i'm a little overwhelmed by the sheer number of different "guides" and "manuals" and whatnot on the docs page(s), though
[09:23] pieterh ptrb: yeah, standard process, lots of growth and then consolidation
[09:23] pieterh stick with for starters
[09:24] pieterh really there should only be two books, the guide and the reference manual
[09:24] ptrb sure
[09:25] ptrb there's a solid case for a quick-start (~1h) tutorial as well, I think; maybe that's like ch.1 of the guide
[09:25] pieterh yup
[09:25] pieterh plus lots of tutorials and articles from the rest of the community
[09:41] keffo pieterh, I have a concurrency issue that I cant really figure out!
[09:41] pieterh keffo: shoot
[09:41] pieterh have you checked the new problem solver?
[09:41] keffo lemme check
[09:42] keffo pieterh, Think client, double lru-queue (for outgoing tasks, and for incoming results), then an additional 'regular' queue on each machine, which provides N worker processes..
[09:43] keffo pieterh, Theyre all sep. processes, no threading involved anywhere apart from zmq internals..
[09:43] pieterh ok
[09:43] pieterh what is the symptom you are seeing?
[09:44] keffo One worker, all is fine, I've done 100000 tasks back'n'forth with no problem, so the handling of data is sound.. Now if I run more than one worker process, -sometimes- I get jumbled data, so it's clearly a concurrency issue, but I can't really tell how or where
[09:45] pieterh jumbled data = one message has corrupted data or messages are not in order?
[09:45] keffo I guess both? =)
[09:45] keffo lemme trigger it and check the logs
[09:45] pieterh ok
[09:46] keffo there we go
[09:47] keffo so this scenario has multiple worker processes, but I send out batches or bursts of tasks of different granularity.. With 1 it always works, that is, send one task, wait for that task to complete.. The problem pops up when doing more than one, say 10
[09:48] keffo which is the case now
[09:48] pieterh keffo: you still need to explain what the "problem" actually is
[09:48] pieterh what is the difference between working and non-working in terms of what you see in the logs?
[09:49] keffo well, sometimes routing info is jumbled.. I expect uuid + nil + payload, but sometimes that uuid can get very weird
[09:50] keffo and likewise the payload itself can get messed up
[09:50] pieterh ok, so your message envelopes are wrong
[09:50] keffo They would be wrong with just one message too :)
[09:50] keffo or one worker proc.
[09:50] pieterh most probably an error in your code that creates/copies envelopes
[09:51] pieterh the LRU queue stuff is pretty delicate
[09:53] keffo I dont see the relevance to be honest, I can process 1M tasks properly if I wait for each message(or if I have only one worker processing them).. If there was an inherent problem in the envelope construction/usage, surely that would pop up there as well, but it has -never- happened in that scenario..
[09:54] pieterh keffo: when problem solving you don't care when it works
[09:54] pieterh if you can reproduce a problem, that's the interesting thing
[09:55] pieterh things often work by accident...
[09:55] pieterh so your workers are REQ and you're talking to them with an XREP?
[09:55] keffo yeah
[09:56] pieterh do you store the worker identity anywhere?
[09:56] keffo I can reproduce it, but with irregularities one has come to expect from multithreaded :)
[09:56] keffo yeah, all connections are known
[09:57] pieterh so the obvious thing to look at would be how you reconstruct the envelopes and where the worker addresses come from
[09:57] pieterh and that you're not using the wrong data
[09:58] pieterh you presumably have logic that is vulnerable to multiple workers sending back stuff at the same time
[09:59] pieterh maybe you're not using SNDMORE properly in the worker when sending
[09:59] pieterh so messages arrive mixed together rather than atomically
[09:59] keffo right, I dont see that being possible either, as it properly routes across 3-4 queues?
[10:00] pieterh you have assertions on everything?
[10:00] keffo think so, yeah
[10:00] pieterh so every single frame is checked? e.g. RCVMORE where expected, empty frame where expected, etc.
[10:00] keffo yes
[10:01] pieterh the only thing I can think of right now is that you can send what looks like a valid envelope but it won't survive concurrency
[10:02] keffo What would cause multiple incoming multipart-messages to break symmetry with more than one sender, in a singlethreaded process?
[10:02] keffo that's what I dont get :)
[10:02] pieterh well, if they are not actually multipart messages
[10:02] pieterh lets say you send the envelope (address, null, data)
[10:03] pieterh but you by mistake use normal sends on each part
[10:03] keffo yup
[10:03] pieterh it'll be 3 messages, not 1 3-part message
[10:03] pieterh if you read that, it might look valid and work for routing
[10:03] pieterh but as soon as you have concurrency it'll stop working
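Editor's sketch of the hypothesis above. This is a toy model in plain Python, not 0MQ itself: the function names (`fair_queue`, `as_multipart`, `as_separate_sends`, `frames`) are made up for illustration, and the round-robin delivery is only an assumption about how a reader might see traffic from several senders. It shows why an envelope sent as three separate messages could look fine with one worker and fall apart with two.

```python
from itertools import zip_longest

def fair_queue(streams):
    """Round-robin across senders, taking one whole *message* per turn."""
    out = []
    for batch in zip_longest(*streams):
        out.extend(msg for msg in batch if msg is not None)
    return out

def as_multipart(address, data):
    # One atomic 3-frame message: the frames always travel together.
    return [(address, b"", data)]

def as_separate_sends(address, data):
    # Three independent 1-frame messages (the "more" flag was forgotten).
    return [(address,), (b"",), (data,)]

def frames(messages):
    # Flatten messages into the frame sequence the reader would see.
    return [frame for msg in messages for frame in msg]

workers = [(b"A", b"result-A"), (b"B", b"result-B")]

# Two workers, proper multipart: envelopes stay intact.
good = frames(fair_queue([as_multipart(a, d) for a, d in workers]))

# Two workers, separate sends: frames from A and B interleave.
bad = frames(fair_queue([as_separate_sends(a, d) for a, d in workers]))

# One worker only: both styles give the same frame sequence.
solo_good = frames(fair_queue([as_multipart(b"A", b"result-A")]))
solo_bad = frames(fair_queue([as_separate_sends(b"A", b"result-A")]))

print(good)
print(bad)
print(solo_good == solo_bad)
```

In this model the single-worker case cannot tell the two sending styles apart, which matches the point that a flawed envelope can still "work by accident" until a second sender appears.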
[10:03] pieterh like i say, i've never tried that, but will make a test now, after coffee
[10:04] pieterh i'd expect 0MQ to reject such messages...
[10:04] keffo hmm.. I'll go over the code again! :)
[10:04] pieterh but from REQ to XREP it can't IMO
[10:06] keffo cant reject?
[10:07] sustrik keffo: can you check whether the garbled identity you see is something that may actually be your application message or whether it's just real junk?
[10:09] keffo It's not real junk.. Sometimes for example I get route uuid that appears to be a mix of multiple uuid's
[10:09] keffo clearly something breaks symmetry somehow
[10:09] keffo I'll look into the route logic again though
[10:11] sustrik "mix of multiple uuid's" = junk IMO
[10:12] sustrik ok, a sanity check:
[10:12] sustrik do you use each socket exclusively from a single thread?
[10:12] keffo yeah
[10:12] sustrik what language are you using?
[10:13] keffo c++ and lua
[10:13] keffo but homebrew bindings
[10:13] sustrik any chance to test a pure c++ setup?
[10:13] keffo hehe. noo :)
[10:14] sustrik hm, lua binding is a project i know nothing about
[10:15] sustrik the bug can be there, as no such problem was yet reported for any other language
[10:15] sustrik ok, any chance of using anything other than lua?
[10:19] keffo Would take forever to rewrite
[10:19] sustrik ok
[10:19] sustrik what does the setup look like?
[10:19] sustrik which part is c++ and which is lua?
[10:20] keffo all zmq stuff is c, rest is lua.. Ie, connect/bind/send/recv(and their more-counterparts)
[10:20] sustrik you are mixing C code and lua code to use the same 0MQ socket?
[10:21] keffo Um no, I created bindings for sockets, so it's the same thing
[10:21] sustrik hm, are you saying that 0MQ is written in C while your apps are in lua?
[10:22] keffo I've extended an application with lua yeah..
[10:22] keffo lua or not is besides the point really
[10:22] keffo All zmq related code is in C++
[10:23] sustrik aha, so you are _not_ using the lua binding
[10:23] sustrik right?
[10:23] keffo I am, but not the ones from the site :)
[10:23] sustrik your own
[10:23] keffo yup
[10:24] keffo just an abstraction layer that's easier to work with on the lua side
[10:24] sustrik ack
[10:24] sustrik what's the lua's threading model?
[10:24] keffo inherently single threaded, but it has coroutines
[10:25] sustrik ok, so there's only one OS thread for each lua process, right?
[10:25] keffo yes
[10:26] sustrik no such thing as a garbage collection thread or similar?
[10:26] keffo client, loadbalancer, computenode and workers are all single processes (a computenode is simply a c++ queue dev. 'owning' the worker processes)
[10:26] keffo No, not in another thread no.. Lua does incremental gc but in the same thread
[10:27] sustrik ok
[10:27] sustrik then it looks like the problem has to be in 0MQ itself
[10:27] sustrik it should be reproducible in C or C++
[10:28] sustrik can you write a simple C/C++ test that simulates what you are doing in your app?
[10:28] sustrik messaging-wise i mean, not the business logic
[10:28] keffo hmm.. that would take a while I think..
[10:29] keffo I do think I'm to blame, not zmq, I just need a better way to debug it probably :)
[10:30] keffo It behaves as threaded concurrency issue, but being entirely single process based..
[10:30] sustrik there's an I/O thread inside 0MQ so it can be a race
[10:31] sustrik however, i can't help unless i am able to reproduce it
[10:31] keffo Well I'd be half way safe if I could reproduce it too, but it's quite rare :)
[10:31] sustrik of course, it's still possible that you are overwriting memory in your app or something like that
[10:31] sustrik have you tried to valgrind your binding?
[10:33] keffo no valgrind in win as far as I know.. Boundschecker perhaps, but I doubt that's the problem too
[10:33] keffo Again, in a single-worker scenario, everything is flawless regardless of amount and load etc
[10:34] keffo (and none of the processes ever leak anything, stays solid around 2mb mem)
[10:34] sustrik then it looks like a problem in the central dispatcher node
[10:34] keffo yeah
[10:34] keffo that's the most complicated bit too
[10:35] keffo double lru queue device basically
[10:36] sustrik well, unless you are able to provide something to reproduce the problem, you are on your own here
[10:36] keffo I know :)
[10:37] keffo more logging I guess. It would be nice to have all participants publish all the data they process, and collect that in a sub app, and then be able to "play back" everything
[10:37] ptrb that could easily become a project as complex as zmq itself
[10:37] ptrb entire companies are built on the concept of capturing/replaying network traffic
[10:37] pieterh keffo: if you can send me your code I'll see if I can spot anything
[10:38] pieterh i need the parts that read and write worker envelopes, to start with
[10:38] keffo sure hold on
[10:38] keffo wheres the pastebin?
[10:38] pieterh works well
[10:39] keffo But again, if the envelope code is flawed, it would be equally flawed for one worker?
[10:40] pieterh keffo: i explained how it could be flawed but still work for one worker
[10:41] keffo
[10:42] pieterh keffo: i need the actual detail of sendmore, I think
[10:42] keffo That is in C, hold on
[10:44] keffo
[10:44] keffo send() is obviously quite similar except for the flag :)
[10:45] pieterh keffo: it looks fine
[10:46] pieterh you're sure that pairs() gives you stuff in the right order?
[10:46] pieterh so you have only one choice left, sadly
[10:46] keffo rewrite, and hope it goes away? =)
[10:47] keffo naa, there's some stupid silly thing somewhere
[10:47] pieterh well, everything goes through these functions
[10:47] pieterh so put in printf statements or whatever
[10:47] pieterh and see when it actually goes wrong
[10:47] keffo Yeah, need to find a pattern..
[10:47] pieterh pieter's rules #5 and #23 of debugging
[10:48] pieterh (a) it's never actually a race condition
[10:48] pieterh (b) you can find any bug using printfs
[10:48] keffo except the ones in printf!
[10:48] pieterh that's why C offers puts as well!
[10:49] keffo One thing I could do is rewrite the loadbalancer stuff in pure C
[10:50] pieterh keffo: there's no point rewriting unless you have found the cause of the problem
[10:50] ptrb why are your rule #5 and #23 labelled (a) and (b)??
[10:50] keffo indeed
[10:51] pieterh ptrb: lol
[10:51] keffo pieterh, That code should be rewritten though, for performance, but I agree :)
[11:37] BooTheHamster What does the error "operation cannot be accomplished in current state" mean in a zmq_send() call?
[11:38] ptrb hey i got that yesterday! :D
[11:39] ptrb basically means you're violating some constraint of the socket
[11:39] ptrb is it ZMQ_REQ or _REP?
[11:39] BooTheHamster ZMQ_REQ
[11:39] ptrb you must call zmq_send, then zmq_recv, before you can call zmq_send again
[11:39] BooTheHamster It's a client app but running without a server
[11:39] ptrb are you doing that?
[11:39] BooTheHamster yes
[11:39] BooTheHamster first I call zmq_send
[11:40] BooTheHamster and then zmq_recv
[11:40] ptrb on the same socket
[11:40] BooTheHamster m_MsgSock->recv(&msg_rcv, ZMQ_NOBLOCK)
[11:40] BooTheHamster yes
[11:40] ptrb noblock means you may not actually have recv'd anything
[11:40] ptrb you have to actually recv something before you can send again (I believe--maybe someone can confirm)
[11:41] BooTheHamster but i want non blocking recv
[11:41] ptrb that's fine, you just can't send again until your non-blocking recv actually gets something
[11:41] ptrb if you want to do multiple sends without getting data back, use XREQ
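Editor's sketch of the constraint ptrb describes. This is a toy state machine in plain Python, not the 0MQ API: the class `ToyReqSocket` and its `inbox` list are hypothetical, and it only models the strict send/recv alternation of a request socket. A non-blocking receive that finds nothing leaves the socket still waiting for a reply, so the next send is refused.

```python
class ToyReqSocket:
    """Toy model of the strict send -> recv -> send alternation."""

    STATE_ERROR = "operation cannot be accomplished in current state"

    def __init__(self):
        self.awaiting_reply = False
        self.inbox = []  # replies that have "arrived" (filled in by hand here)

    def send(self, msg):
        if self.awaiting_reply:
            # A reply to the previous request has not been received yet.
            raise RuntimeError(self.STATE_ERROR)
        self.awaiting_reply = True

    def recv_noblock(self):
        if not self.awaiting_reply:
            # Nothing was sent, so there is nothing to receive.
            raise RuntimeError(self.STATE_ERROR)
        if not self.inbox:
            # Nothing available yet: the socket is STILL awaiting a reply.
            return None
        self.awaiting_reply = False
        return self.inbox.pop(0)


sock = ToyReqSocket()
sock.send(b"request 1")

first_try = sock.recv_noblock()      # no server running: nothing comes back

try:
    sock.send(b"request 2")          # refused, no reply was actually received
    error = None
except RuntimeError as exc:
    error = str(exc)

sock.inbox.append(b"reply 1")        # pretend the server finally answered
reply = sock.recv_noblock()
sock.send(b"request 2")              # accepted now

print(first_try, error, reply)
```

The design point is that "I called recv" and "I received something" are different events; only the second one moves the socket back to a state where sending is allowed.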
[11:43] BooTheHamster Thank you ptrb, XREQ helps.
[11:44] BooTheHamster A little one question ... on server side i must use XREP or REP socket?
[11:45] ptrb just depends on what behavior you can tolerate there... both will work, but REP requires you to recv, send, recv, send actual data
[11:46] ptrb `man zmq_socket' for more details
[11:50] BooTheHamster thanks
[11:52] ptrb i helped!! :D
[11:52] ptrb maybe
[11:52] guido_g and for some more ideas what to use when:
[12:25] cremes ptrb: you're an expert now! ;)
[13:02] pieterh cremes: I moved your xrep-xrep recipe page into the tutorials section
[13:47] Tasser why use rbzmq <=> ffi-rzmq?
[13:54] cremes Tasser: they have slightly different APIs; the ffi-rzmq more closely tracks the C api whereas rbzmq uses strings for everything
[13:55] cremes also, ffi-rzmq is more compatible with ruby runtimes (jruby, rubinius, mri)
[13:55] Tasser strings sounds like overuse of core classes
[13:55] cremes well, most people will probably convert their data to string format for send/recv under ruby because it's convenient
[13:55] Tasser that's what JSON is for :-)
[13:56] cremes but since ffi-rzmq exposes a zmq_msg_t-like structure you can pass around C structs without doing a lot of extra copying
[13:56] Tasser but I suspect you can't hardwire zmq against json
[13:56] cremes that *might* be more efficient
[13:56] cremes what do you mean by hardwire? as far as i know, 0mq (and zmq) don't care what you pass on the wire
[13:57] Tasser well, for filtering in subscribe
[13:57] cremes ah, subscribe filters at the byte-level
[13:57] cremes no need to encode the subscription topic unless you want to decode it on each subscriber too
[13:59] Tasser hmm
[14:00] cremes hmm sometimes means "i don't know what you are talking about" :)
[14:58] Tasser cremes, well, so you could filter based on JSON, but that would make it bloat
[14:59] cremes Tasser: my point is that 0mq matches topics by comparing bytes
[14:59] cremes so you want it to fail as soon as possible
[15:00] cremes using a json-encoded topic means that more bytes would need to be compared to find mismatches
[15:00] cremes so subscription filtering would be slower
[15:00] cremes for example...
[15:00] cremes sock.setsockopt ZMQ::SUBSCRIBE, 'topic.string'
[15:00] cremes versus
[15:01] cremes sock.setsockopt(ZMQ::SUBSCRIBE, JSON.generate({'topic' => 'topic.string'}))
[15:01] cremes honestly, i'm not even sure if that second example would ever work
[15:01] cremes since the json encoder could put things in different order while still producing valid json
[15:02] cremes just use a string for the topic and put your encoded data into a second message part
[15:02] Tasser yeah, probably the sane idea of using it.
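Editor's sketch of cremes's point about byte-level matching. This is plain Python with no 0MQ involved: the helper `matches` is a hypothetical stand-in for a prefix-based subscription filter, and the topic names and the extra "kind" key are invented for the example. It shows that two JSON encodings of the same data can differ byte for byte, so a JSON-encoded subscription may never match.

```python
import json

def matches(subscription: bytes, message: bytes) -> bool:
    # Byte-wise prefix comparison, standing in for the subscription filter.
    return message.startswith(subscription)

# A plain string topic: short, and any message starting with it matches.
plain_sub = b"topic.string"
plain_hit = matches(plain_sub, b"topic.string|payload goes here")

# The same logical object encoded with its keys in two different orders.
json_sub = json.dumps({"topic": "topic.string", "kind": "tick"}).encode()
reordered = json.dumps({"kind": "tick", "topic": "topic.string"}).encode()

same_data = json.loads(json_sub) == json.loads(reordered)
json_hit = matches(json_sub, reordered)

print(plain_hit, same_data, json_hit)
```

Keeping the topic as a short plain string in the first message part and the encoded data in a second part, as suggested above, sidesteps the ordering problem entirely.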
[17:01] kleppari hello
[17:01] kleppari - should I need to register to view this?
[17:02] ModusPwnens Im not sure, but if you do register, let me know if you are successful. I was trying to the other day and it wouldn't work =/
[17:03] pieterh kleppari: isn't it showing? rats... :-( thanks for pointing it out
[17:03] guido_g f*ck, this is wikidot crap
[17:03] guido_g sorry
[17:03] pieterh guido_g: default permissions on cloned site are private, I forgot
[17:03] guido_g and you need to allow cross-domain cookies
[17:03] pieterh happily it's a 10-second fix
[17:04] guido_g wikidot sucks in this regard
[17:04] pieterh guido_g: you need to not disallow the dratted things
[17:04] pieterh it's for good security reasons, unfortunately
[17:04] kleppari registered with wikidots, worked after that. I think it's a major fail to require it though :)
[17:04] pieterh kleppari: not required, it was my fault
[17:04] kleppari pieterh: np, thanks for the guide :)
[17:04] guido_g pieterh: cross-domain cookies are for security?
[17:04] pieterh it should work for anonymous users now
[17:05] pieterh guido_g: wikidot splits off insecure stuff (file uploads) to a separate domain
[17:05] guido_g nack
[17:05] guido_g pieterh: it doesn't work
[17:05] pieterh feel free to debate with the wikidot devs, they are pretty sure of their skills in this respect
[17:05] pieterh what doesn't work?
[17:05] kleppari pieterh: works now, thanks.
[17:05] pieterh ?
[17:06] kleppari i.e., anonymous users on
[17:06] kleppari oh, no, sorry
[17:06] guido_g w/ the damn cookies
[17:06] pieterh guido_g: anything different than
[17:07] pieterh there's nothing special about this site afaics
[17:07] Samy sustrik, ping?
[17:07] guido_g i allowed the cookies for and the wikidot server
[17:07] guido_g but i'm not going to do that for every new site
[17:07] pieterh ah, well, allow them for * and * :-)
[17:08] pieterh why do you disallow them?
[17:08] guido_g i do not allow cookies generally
[17:08] pieterh well...
[17:08] kleppari pieterh:
[17:08] pieterh kleppari: thanks, I'll try it...
[17:09] pieterh unfortunately interactive websites kind of need stuff like that
[17:09] guido_g viewing a page is not interactive
[17:10] kleppari sorry, screwed up file ownership. the screenshot of unauthenticated should be ok now.
[17:10] pieterh kleppari, hang on, 30 seconds... I'll fix it, thanks for the screenshot
[17:11] pieterh can you try to reload the site please?
[17:11] pieterh ctrl-R
[17:11] kleppari like a glove
[17:11] kleppari thanks :)
[17:11] pieterh i didn't change anything, it was already working... :-)
[17:11] pieterh since 19:03 CET
[17:12] pieterh ok, kleppari I appreciate the heads up, kind of stupid to have a Guide no-one can read... :-)
[17:12] kleppari screenshot snapped at 17:06 GMT. Weird.
[17:12] kleppari you're welcome
[17:12] guido_g <- missing
[17:12] pieterh guido_g, yeah, because Sustrik pointed out that parts of it were bogus
[17:13] pieterh i'm rewriting it
[17:13] pieterh as soon as i finish my pasta and wine
[17:13] guido_g and why this intermediate page when i click in "The Guide" from
[17:13] pieterh you'd rather leap to chapter 1 right away?
[17:13] guido_g yes
[17:13] pieterh i mean when you get a book, you see the cover first...
[17:14] pieterh but ok, can be done...
[17:14] guido_g the link is named "The Guide"
[17:14] pieterh Yeah, not The Guide Chapter 1 :-)
[17:14] guido_g not "promotional intermediate page w/o content"
[17:14] pieterh hey, don't mock the promotional pages...
[17:15] guido_g only if you reduce the size of the ØMQ logo drastically
[17:15] pieterh delayed gratification...
[17:15] pieterh OMG don't touch the logo, it's sacred...
[17:15] guido_g btw, too much red makes aggressive :)
[17:15] pieterh ok, the link from the sidebar should now go straight to the first chapter
[17:16] guido_g nope
[17:16] guido_g ok, i'll wait
[17:17] pieterh guido_g, when you say "nope" you mean as in "oops, I didn't reload it and HTTP doesn't yet have an autoupdate function"...?
[17:17] guido_g pieterh: no, it means "i pressed reload like a maniac but the thing didn't work"
[17:18] pieterh works for me...
[17:18] pieterh go back to
[17:18] pieterh are there more links? I fixed the one in the sidebar
[17:18] pieterh what one are you clicking on?
[17:18] guido_g i am on and open the link in another tab
[17:19] guido_g after reloading the page
[17:19] pieterh it's correct... you are reloading
[17:19] guido_g yes
[17:20] guido_g hmmm... cdn thing maybe
[17:20] pieterh did you sacrifice a chicken to the gnomes of the Internet?
[17:20] guido_g ahhh... i constantly miss this one (eating the chicken myself)
[17:21] pieterh might be some caching somewhere your side of things
[17:21] pieterh it's correct, anyhow, just tested again
[17:21] guido_g ahhh... cdn got the update
[17:21] kleppari you might also be clicking different links?
[17:21] guido_g no
[17:22] kleppari the sidebar link points to chapter one, the other guide links point to the 'promotional' page
[17:22] guido_g so remember, it might take approx. 5 minutes until changes become visible
[17:22] kleppari two of them, one under most popular, other in recent site changes
[17:22] pieterh ah, kleppari, right
[17:23] kleppari but.. After a little while I like the promotional page
[17:23] pieterh well, indeed, a book deserves a cover
[17:23] pieterh bright red
[17:24] pieterh anyhow, this is always a work in progress, i like to see how people respond over time before making more changes
[17:24] kleppari I didn't get it at first though, started at -> hit read the manual -> clicked 'ØMQ - The guide' -> got to
[17:24] guido_g pieterh: you're on a laptop?
[17:25] kleppari didn't catch the differences between and right away
[17:25] pieterh guido_g, on a laptop with a 2G phone modem, eating pasta in a hotel
[17:25] guido_g it's more about the display size
[17:25] kleppari sounds like a crap hotel :P
[17:25] pieterh kleppari, it's a pretty amazing hotel actually, an old Soviet relic
[17:26] pieterh guido_g, yes, laptop display
[17:26] guido_g pieterh: because on my 13" display the logo takes up approx. 1/3 of the vertical space
[17:26] guido_g which is way too much, imho
[17:27] guido_g worse if i zoom in for better readability
[17:27] pieterh guido_g, yes, on mine too but it doesn't matter on the cover page, and hardly matters on the content pages since you need to scroll anyhow
[17:27] kleppari looks good on 1280x1024
[17:27] pieterh we tried using a smaller logo some time ago, people felt it wasn't nice
[17:28] kleppari pieterh: and no rj45 jack? URL? :)
[17:28] guido_g ok, i give up
[17:28] pieterh kleppari, URL? no rj45, no, wifi but it's too slow
[17:28] kleppari url of the hotel website? or is it too much of a relic?
[17:28] pieterh guido_g, do start a discussion on look and feel, but this isn't the place IMO
[17:29] pieterh guido_g, the original look was designed by a proper artist, and anything that replaces it would have to be done carefully
[17:29] pieterh kleppari, it's the Hotel Kyjev in Bratislava
[17:30] guido_g pieterh: i didn't complain about the logo, only its size, but nevermind i'll find a solution
[17:30] pieterh guido_g, I do agree the logo is HUGE but my previous attempts to downsize it failed
[17:31] guido_g ok
[17:32] pieterh guido_g, what might work is a lighter theme for sites other than the Start Here site
[17:32] pieterh now that we have more sites, we can fine-tune them
[17:38] guido_g except for the "promo" pages the focus should be on readability