Wednesday August 25, 2010

[Time] Name Message
[07:35] mato pieterh: you at eea?
[07:40] pieterh mato: re
[07:40] pieterh mato: i'm not there, eta around 10.30
[07:40] pieterh that ok?
[07:42] mato fine with me
[07:42] mato sustrik: 10:30 then...
[07:42] sustrik ok
[07:44] mato pieterh: your -EFAULT changes to zmq_stopwatch_stop do not build on Solaris, I'll take them out for that function since it's not really part of the core API
[07:51] CIA-20 zeromq2: Martin Lucina master * rb66dd7a / src/zmq.cpp :
[07:51] CIA-20 zeromq2: zmq_stopwatch_stop: Don't return EFAULT
[07:51] CIA-20 zeromq2: Function returning unsigned long int cannot return (-1) -
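The point of the commit message above can be shown with a small sketch. The function names and the value 42 are made up for illustration; this is not the zmq_stopwatch_stop code. It only demonstrates that -1 converted to unsigned long becomes ULONG_MAX, so a caller cannot see a negative error code.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for a function with an unsigned return type;
   this is not the zmq_stopwatch_stop implementation. */
unsigned long elapsed_or_error(int fail)
{
    if (fail)
        return (unsigned long) -1;  /* wraps around to ULONG_MAX */
    return 42UL;                    /* made-up "elapsed" value */
}

/* A caller can only detect the "error" by comparing against ULONG_MAX,
   which is also a representable result. */
int is_error(unsigned long value)
{
    return value == ULONG_MAX;
}

/* Sanity check of the conversion rule the sketch relies on. */
void check_wraparound(void)
{
    assert((unsigned long) -1 == ULONG_MAX);
}
```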
[07:55] sustrik mato: a question
[07:55] sustrik EINTR thing
[07:56] sustrik is it a good idea we hide it?
[07:56] sustrik what if the app sends signals
[07:56] sustrik and processes them using sigwait
[07:57] sustrik we'll end up with infinite loop, no?
[07:59] mato sustrik: no, if the app is doing "best practice" signal handling with a thread using a sigwait loop then our not returning EINTR has no effect on that...
[07:59] mato sustrik: unless something is very broken, which is of course possible with signal handling in general
[08:00] mato sustrik: anyhow, the point is, zmq only ever changes the signal mask in its own I/O threads
[08:00] mato sustrik: and never in the application threads
[08:01] sustrik exactly
[08:01] mato sustrik: if you're asking, should 0mq calls be interruptible (and return EINTR) then that is of course a different question, but we'd agreed that the answer is no
[08:01] sustrik so basically no signal should ever occur in the thread that's using 0mq socket
[08:01] sustrik right?
[08:02] mato no
[08:02] mato no signals should ever be delivered to the 0mq I/O threads
[08:02] sustrik sure
[08:02] mato what happens in the app thread(s) is up to the app
[08:03] sustrik right, but a signal coming to app thread can cause a deadlock -- a busy loop actually
[08:03] mato how so?
[08:03] sustrik the code looks like this:
[08:03] sustrik while (errno == EINTR)
[08:03] sustrik do_blocking_op ();
[08:04] sustrik so once signal arrives, it loops
[08:04] mato what is "the code"? 0mq code? app code?
[08:04] sustrik 0mq code
[08:04] mato which is fine, the blocking operation will restart...
[08:05] sustrik sure, but the signal is still there
[08:05] sustrik so it'll restart in infinite loop
[08:05] mato huh?
[08:05] mato you only get EINTR if the call is actually *interrupted*
[08:05] mato it's not a level trigger
[08:06] mato so you will get at most one EINTR per signal
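A minimal sketch of the restart loop being discussed, assuming a blocking operation that returns -1 and sets errno on failure. The names retry_on_eintr and fake_op are hypothetical and not taken from the 0MQ source. The fake operation reports EINTR a fixed number of times and then succeeds, which mirrors mato's point that each interruption yields at most one EINTR, so the loop terminates once the interruptions stop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical signature for a blocking call: returns -1 and sets errno
   on failure, anything else on success. */
typedef int (*blocking_op_fn)(void *arg);

/* Restart the operation for as long as it fails with EINTR. */
int retry_on_eintr(blocking_op_fn op, void *arg)
{
    int rc;
    assert(op != NULL);
    do {
        rc = op(arg);
    } while (rc == -1 && errno == EINTR);
    return rc;
}

/* Fake operation for illustration: pretends to be interrupted
   *remaining times, then succeeds. */
int fake_op(void *arg)
{
    int *remaining = (int *) arg;
    if (*remaining > 0) {
        (*remaining)--;
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

Calling retry_on_eintr (fake_op, &n) with n set to 3 runs the operation four times and returns 0.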
[08:06] sustrik even with sigwait-style signals?
[08:06] sustrik there's a queue there in the background afaiu
[08:06] mato if you're using sigwait style signals then it is your responsibility to do it right
[08:06] sustrik is it still edge-triggered
[08:06] sustrik ?
[08:06] mato which means doing the signal handling in a thread of its own
[08:06] mato which does nothing except run the sigwait loop
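The pattern mato describes could be sketched roughly as follows, assuming POSIX threads. The choice of SIGUSR1 and SIGTERM, and all function names, are illustrative assumptions. The signals are blocked before any other thread is created so every thread inherits the mask, and one thread does nothing except run the sigwait loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

static sigset_t handled;       /* signals owned by the signal thread */
static atomic_int usr1_count;  /* number of SIGUSR1 seen so far */

/* Block the handled signals in the calling thread. Call this before
   creating any other threads so that they inherit the mask. */
int block_handled_signals(void)
{
    sigemptyset(&handled);
    sigaddset(&handled, SIGUSR1);
    sigaddset(&handled, SIGTERM);
    return pthread_sigmask(SIG_BLOCK, &handled, NULL);
}

/* Body of the dedicated thread: only runs the sigwait loop.
   SIGTERM ends the loop, SIGUSR1 is counted. */
void *signal_thread(void *arg)
{
    (void) arg;
    for (;;) {
        int sig = 0;
        int rc = sigwait(&handled, &sig);
        assert(rc == 0);
        (void) rc;
        if (sig == SIGTERM)
            break;
        if (sig == SIGUSR1)
            atomic_fetch_add(&usr1_count, 1);
    }
    return NULL;
}

/* Read the counter from any thread. */
int usr1_seen(void)
{
    return atomic_load(&usr1_count);
}
```

Because the signals stay blocked everywhere else, no other thread in this sketch has its blocking calls interrupted by them.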
[08:07] sustrik "never combine 0MQ socket with signals"
[08:07] mato no
[08:07] mato if you're using naive signal handling with signal() it might work fine
[08:07] sustrik hm
[08:08] mato unless people are actually *expecting* calls to return with EINTR
[08:08] sustrik i've been thinking about it because zed actually experienced the infinite looping
[08:08] sustrik zedas: hullo!
[08:08] mato zedas: yo
[08:09] mato he's probably asleep, zed is somewhere in the states, no?
[08:09] zedas what's up?
[08:09] mato ah
[08:09] zedas i never sleep
[08:09] mato zedas: what's this about signals and infinite loops?
[08:09] zedas well actually i was about to :-)
[08:09] mato :-)
[08:09] zedas mato: it's not signals.
[08:10] zedas so what's happening is the following (that i haven't tracked down)
[08:10] zedas 1. Mongrel2 starts up with 2 or 4 IO threads.
[08:10] zedas 2. A request comes in for a handler, so M2 sends it out on 0MQ to that handler, which is in python, or lua so far, but seems to be on any of them.
[08:10] zedas 3. Handler then gets *2* messages, but we only sent one.
[08:11] zedas i have confirmed this 5 times.
[08:11] zedas 4. Handler responds to both messages, but Mongrel2 then gets hit with an "infinite" loop in zmq_poll.
[08:11] zedas 5. What's happening is one of the IO threads is setting EINTR on each loop through, so then zmq_poll causes M2 to go 100% CPU.
[08:12] zedas 6. Finally, if we restart the *handler* (not mongrel2) then the EINTR goes away immediately, and everything goes back to normal.
[08:12] zedas that's what i know so far.
[08:12] zedas so, i tested it with IO threads set to 1.
[08:12] zedas and it totally has no bugs. ran a thrashing test for a whole day to make sure.
[08:12] zedas and it's a nasty test too, and not a single lockup or 100% the whole time.
[08:13] zedas but, set threads > 1 and bam, 100% right away.
[08:13] mato in step 5, the EINTR is coming from the poll() call inside zmq_poll() ?
[08:13] zedas so there ya go, that's all I know. the EINTR isn't caused by signals at all.
[08:13] zedas no, i think it's extraneous
[08:13] mato yeah, but what is returning EINTR?
[08:13] zedas i swear I blocked every damn signal possible and that EINTR still kept going
[08:13] zedas well EINTR isn't "returned"
[08:14] zedas i think one of these threads is setting it as an error indicator
[08:14] zedas and it's not getting caught by 0mq, so it bleeds out to poll.
[08:14] zedas anyway after i got mongrel2 1.0 out i was gonna work up a test case for it and try to narrow it down.
[08:14] sustrik memory overwrite maybe
[08:14] zedas but if you guys have insight into where in the code this could happen.
[08:15] sustrik as far as i see there are 2 distinct problems
[08:15] mato sustrik: or somewhere in the i/o code we are not squashing an EINTR and should be?
[08:15] zedas yeah my thinking is the IO thread gets the double response and freaks out. that's also why the handler becomes unresponsive.
[08:15] sustrik yes
[08:15] sustrik the first problem may actually overwrite memory or so
[08:15] sustrik so the second problem may be just a consequence of the first one
[08:16] mato it's still strange that EINTR comes into play if no signals are involved
[08:16] zedas hmm. yep and the best indicator it's a thread issue and not poll/interrupt is i kill the handler, and poof problem solved.
[08:16] zedas if it was signals then i'd still be getting them even after killing the handler.
[08:16] sustrik hm
[08:16] zedas also signals coming in that fast would really nuke the process.
[08:17] zedas i mean these are tight loop poll calls going fast as hell, no way i'm getting signals that quick.
[08:17] mato zedas: just to make sure, is there any way you can trace the process when it happens with something that would show signals being delivered?
[08:17] mato zedas: e.g. running it under GDB with all signals set to "nostop pass" and logging the output?
[08:17] zedas i have a few stack traces...let me find...
[08:17] zedas mato: oh yeah, no signals
[08:18] zedas i swear up and down i cannot catch any signals
[08:18] zedas gdb gets nothing. manually capturing every one with sigaction. nothing.
[08:18] sustrik overwriting glibc memory can result in strange poll behaviour
[08:18] zedas either it's in another thread, which is really weird, or it's being set manually, or overwritten ram.
[08:18] mato we never return EINTR anywhere manually
[08:18] zedas
[08:19] zedas that's our bug on it
[08:19] mato grepping src/* for EINTR will tell you that...
[08:19] zedas there's a really good stack trace in there and our debug dumps, and info i've found so far
[08:19] sustrik zedas: iirc you've said that it happens only when you use multiple i/o threads; does that apply still?
[08:20] zedas yep.
[08:20] zedas if i set io threads to 1, no problems
[08:20] zedas > 1 and lock up almost immediately
[08:20] zedas and it eventually settles down, which to me says race condition in the threading
[08:21] sustrik mato: we should run some tests with multiple i/o threads
[08:21] mato sustrik: yeah
[08:21] sustrik everyone's using magic number of 1 so it's very much untested
[08:21] zedas
[08:21] zedas that's the python 0mq handler that has the most frequent lockups
[08:22] zedas and it's really not doing much.
[08:22] sustrik zedas: ok, thanks for the info
[08:22] zedas receives on a DOWNSTREAM socket, sends on a PUB socket.
[08:22] sustrik i'll do some tests with multiple i/o threads
[08:22] mato zedas: yeah, thanks. /me thinks the EINTR is a red herring and this looks like a race between i/o threads
[08:22] sustrik hopefully i'll be able to reproduce
[08:22] mato sustrik: you have that big 8-core box
[08:23] sustrik two of them :)
[08:23] mato sustrik: yeah, exactly
[08:23] mato load the hell out of it and see what breaks
[08:23] zedas mato: ohhhh you know, i saw this more when i switched to an 8-core box....
[08:25] mato zedas: since EINTR came up, do you have any opinion on us squashing EINTR in zmq_* calls?
[08:25] mato zedas: that's the way it's done now since we considered the EINTR thing with UNIX calls a bug and felt no need to emulate it
[08:26] mato so effectively all zmq_* calls ignore EINTR and (should) never return it
[08:27] zedas mato: actually yeah, i think you shouldn't do that. to me it should work just like poll, except handle 0mq or regular sockets together.
[08:27] zedas especially in my server since i also have to handle those anyway in other parts, so zmq_poll doing it causes problems potentially.
[08:28] mato zedas: so you prefer the standard behaviour even if it is kind of broken?
[08:28] mato most code i've seen ends up calling system calls in a loop ignoring EINTR
[08:29] zedas yep, that's what i'd prefer, but you'd probably break people's code who expect you to do this
[08:29] zedas so, as long as you fix it so it doesn't peg the cpu 100% i'm alright with it.
[08:30] mato ok
[08:35] mato pieterh: will join you in 15mins or so
[08:36] travlr are you guys at a conference?
[08:37] sustrik travlr: no, but we happen to be in the same city
[08:37] travlr ah. cool. enjoy your meet.
[08:37] sustrik pieterh, mato: ok, i'm leaving as well
[09:40] CIA-20 zeromq2: Martin Lucina master * rc06a3cc / (builds/msvc/platform.hpp Update version number to 2.0.8 -
[09:44] CIA-20 zeromq2: Pieter Hintjens master * rd788c1f / NEWS : Updated NEWS for stable 2.0.8 release -
[10:10] CIA-20 zeromq2: Pieter Hintjens master * r98bea86 / NEWS : Updated NEWS for stable 2.0.8 release -
[10:10] CIA-20 zeromq2: Pieter Hintjens master * r6d275a8 / NEWS : Updated NEWS for stable 2.0.8 release -
[10:51] CIA-20 zeromq2: Martin Lucina master * rc9076c5 / doc/zmq_socket.txt :
[10:51] CIA-20 zeromq2: Basic documentation for XREQ/XREP socket types
[10:51] CIA-20 zeromq2: Add some basic documentation for XREQ/XREP socket types, including
[10:51] CIA-20 zeromq2: a brief description of the most common use case (REQ -> XREP) and (XREQ ->
[10:51] CIA-20 zeromq2: REP). -
[10:54] travlr sustrik: you guys should commit to your local repository and make one commit to the public repo at the end of your session
[10:55] alfborge Any work on the fortran bindings?
[10:55] mato travlr: it'd still be split into multiple commits even if i do one push
[10:55] mato travlr: since the changes are separate things
[10:56] travlr yeah but you can do that locally and only do one commit to the public repo when you are done for the day
[10:57] travlr or the session
[10:58] pieterh travlr: sorry about that...
[10:58] travlr :)
[10:58] pieterh i generally use --amend to reduce commits but it did not work here, got some weird merge conflict...
[10:59] travlr merge conflicts are why i stopped using rebase too.
[10:59] pieterh anyhow, the good news is we're making 2.0.8 stable
[10:59] travlr i saw that w00t w00t.. lol
[11:08] travlr btw, to get technical.. i meant one push to the public repo not a commit.
[11:22] CIA-20 zeromq2: Martin Lucina master * r1e089f7 / ChangeLog : Update ChangeLog for v2.0.8 -
[12:23] CIA-20 jzmq: Gonzalo Diethelm master * r914cbd0 / src/org/zeromq/ : Added method Poller.getSocket(int index). -
[13:28] keffo oh, nice bot feature :)
[13:32] keffo meh, by the hammer of thor, I've hit my 20gb 3g transfer cap :/
[13:34] pieterh keffo: you transferred 20GB via 3G????
[13:34] pieterh my monthly limit is 500MB and I haven't even hit that
[13:37] keffo hehe
[13:37] keffo then at least you're familiar with how agonizing it is to do any sort of internet related work over 3g :)
[13:39] keffo this is my only connection, and 3-4 machines using it, if you include the phone itself
[13:39] pieterh wow
[13:41] keffo (hence the predisposition I have about both reliability & custom loadbalancing :))
[13:42] CIA-20 zeromq2: Martin Sustrik master * rb608c19 / (3 files in 2 dirs): MSVC build fixed (+19 more commits...) -
[13:42] keffo msvc build fixed?
[13:48] mato keffo: that is part of a bunch of changes that have landed on master
[13:48] mato keffo: which will become 2.1.x, to be announced shortly
[13:55] keffo how shortly? I'm just fuzzing around upgrading to 2.0.8! :)
[13:55] mato by announced i mean we'll announce what's cooking
[13:55] keffo hmm.. the #define _CRT_SECURE_NO_WARNINGS in windows.hpp probably should be guarded..
[13:55] mato no schedule for an actual 2.1.x release yet, many changes that need to be tested
[13:56] keffo what are the major ones?
[13:56] mato wait for the ml announcement :)
[13:58] keffo oki
[15:18] keffo well, that was a good 20 minutes of trying to get good old dad to stop using IE
[15:52] ModusPwnens Hello
[16:05] keffo hellu
[16:07] ModusPwnens hi keffo, i have a problem with zeromq
[16:11] keffo ok
[16:15] ModusPwnens So i am trying to set up a basic server/client setup
[16:15] ModusPwnens and I got it to work locally
[16:15] ModusPwnens but when I tried to do it across the network, the server would not receive any of the clients messages
[16:15] ModusPwnens but I used a network analyzer and confirmed that the computer itself was indeed receiving the messages
[16:17] ModusPwnens I was following the example for the Hello World program, except I sort of rewrote it because I am also using Google Protocol Buffers
[16:17] keffo that shouldnt matter, it's just data
[16:17] keffo you probably changed something else while doodling
[16:17] keffo or screwed up the size or something
[16:18] ModusPwnens That's what I thought, but I retested it locally and it works just fine
[16:18] ModusPwnens its just when I try to do it over the network that it doesn't work
[16:18] ModusPwnens I was wondering if there was something we have to change or add if we are doing it over a network
[16:18] keffo did the sample work?
[16:18] ModusPwnens Yeah
[16:18] keffo over network?
[16:19] ModusPwnens Hmm, I didn't try that. I only tried it locally to verify that it worked.
[16:19] ModusPwnens Hold on, let me try that
[16:19] keffo as long as you're not using inproc or ipc, it should work provided firewalls are in order etc
[16:19] ModusPwnens I'm just using tcp
[16:20] ModusPwnens Also, is there some reason using localhost for the endpoint of the bind function in the server application would cause the program to crash?
[16:21] keffo programs should never crash :)
[16:21] ModusPwnens bahaha, that's very true.
[16:21] keffo try with tcp://*:1234 etc
[16:21] ModusPwnens Yeah that's what i had originally.
[16:21] ModusPwnens But I was just trying different things to get it to work
[16:21] keffo using localhost in a server sounds pretty weird :)
[16:22] keffo eth0 etc should work though, but not in windows I think
[16:22] keffo but with one nic that doesn't matter, just use *
[16:23] ModusPwnens Yeah that's what I was doing.
[16:23] ModusPwnens For the client, we would use the ip address of the server, yes?
[16:23] keffo yeah
[16:23] ModusPwnens I thought so..
[16:23] keffo bind vs connect
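To make the bind/connect distinction concrete, here is a rough sketch using plain POSIX TCP sockets rather than the 0MQ API. All function names are hypothetical. Passing NULL to listen_on stands in for the `*` wildcard (all interfaces), while the client always connects to a concrete address. It illustrates the general idea only and says nothing about why the wildcard misbehaved on a particular Windows machine.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill an IPv4 address; ip == NULL means the wildcard (all interfaces). */
static int make_addr(struct sockaddr_in *addr, const char *ip, uint16_t port)
{
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);
    if (ip == NULL) {
        addr->sin_addr.s_addr = htonl(INADDR_ANY);
        return 0;
    }
    return inet_pton(AF_INET, ip, &addr->sin_addr) == 1 ? 0 : -1;
}

/* Server side: bind to a local address and listen. Returns fd or -1. */
int listen_on(const char *ip, uint16_t port)
{
    struct sockaddr_in addr;
    if (make_addr(&addr, ip, port) != 0)
        return -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;
    if (bind(fd, (struct sockaddr *) &addr, sizeof addr) != 0 ||
        listen(fd, 1) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Client side: connect to the address of the machine running the server. */
int connect_to(const char *ip, uint16_t port)
{
    struct sockaddr_in addr;
    assert(ip != NULL);  /* the client needs a concrete address */
    if (make_addr(&addr, ip, port) != 0)
        return -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;
    if (connect(fd, (struct sockaddr *) &addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Port a listening socket actually got (useful when binding to port 0). */
int bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    if (getsockname(fd, (struct sockaddr *) &addr, &len) != 0)
        return -1;
    return ntohs(addr.sin_port);
}
```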
[16:23] ModusPwnens hmm...well let me try the example and see if it works
[16:23] ModusPwnens if it doesnt, maybe there is some more complicated issue
[16:25] ModusPwnens what..? Do they change the examples on the website a lot?
[16:25] ModusPwnens There are nowhere near as many examples as there were the other day
[16:30] ModusPwnens ok, so I just tried using the hello world example over the network
[16:31] ModusPwnens and the only change I made was to the endpoint in the client. I changed it to the ip address of the other computer I am using the server .exe on
[16:32] ModusPwnens and it doesn't work, but it does work locally
[16:34] keffo it's pretty active yeah
[16:34] guido_g packet filter?
[16:34] ModusPwnens a packet filter?
[16:34] guido_g something that filters out packets on the network
[16:34] ModusPwnens I am sorry, but I am relatively new to networking and the concepts involved..
[16:35] guido_g often wrongly called a firewall
[16:35] ModusPwnens Hmm. I saw the data I was sending when I used wireshark to analyze the traffic
[16:35] ModusPwnens would the packet filter occur before or after that?
[16:35] guido_g on which machine?
[16:35] ModusPwnens both
[16:36] guido_g then it should be ok
[16:36] guido_g but i'm not sure... it's windows, so be prepared for the unexpected
[16:36] ModusPwnens *sigh* i am well aware of that lol
[16:36] ModusPwnens as far as getting the hello world example to work over networking though, all you should have to do is change the endpoint for the connect function, right?
[16:37] ModusPwnens and if that doesn't work, there is some other underlying cause as to why?
[16:37] guido_g best is you prepare a text with what you did (exactly!) and paste that to a paste bin
[16:37] ModusPwnens What I did as far as modifying the hello world code?
[16:38] guido_g for example
[16:38] guido_g if you modified the code, put it on the paste bin too
[16:39] ModusPwnens I'm not sure I understand. What purpose would this serve?
[16:39] guido_g that we do see the same code as you
[16:40] guido_g now we're guessing
[16:40] ModusPwnens oh. Well I'm not sure that's necessary. All i did was copy and paste the code off the website
[16:41] guido_g ok, your choice
[16:44] ModusPwnens I imagine that the hello world code must have been verified to work over a network
[16:45] ModusPwnens i highly doubt that it wouldn't work. I'm just a little unsure of what could cause the problem on my end
[16:46] guido_g because you refuse to show "your end" no one knows
[16:47] ModusPwnens fine fine, I will copy my code over
[16:47] guido_g btw, this is the accepted practice when asking for help
[16:49] ModusPwnens oh ok..Sorry, i didn't know. One of my coworkers told me about these irc chatrooms so i have never really been on one before
[16:50] ModusPwnens Ok so, you don't want me to just copy the code right over right? You said something about a paste bin..
[16:50] guido_g
[16:51] ModusPwnens
[16:51] ModusPwnens the only other thing that could be different is that I have additional header files
[16:52] ModusPwnens but i dont think that would cause a problem
[16:52] guido_g you typed it, right?
[16:52] guido_g oops, wrong window
[16:55] guido_g ok, given that the packets are visible on both machines, try to replace the '*' in line 69 with the ip of the machine
[16:56] ModusPwnens of the client machine?
[16:56] guido_g line 70 actually
[16:56] guido_g no, of course not
[16:56] guido_g ip of the machine the server is running on
[16:57] guido_g how many interfaces does this machine have?
[16:57] ModusPwnens Hmm. I didn't think that was necessary since it was being run on that computer
[16:57] ModusPwnens but i will try that
[16:58] guido_g ok, if it doesn't work, gather all the information incl. the changes to the programs and post a problem report to the mailing list
[16:59] ModusPwnens Okie doke. Thanks for your help!
[16:59] guido_g important information is operating system, compiler, ømq version etc.
[17:00] ModusPwnens as far as the compiler goes, is it alright to just say I'm using Visual Studio 2008? I'm not really sure what compiler it uses.
[17:00] guido_g should be enough
[17:00] ModusPwnens Ok. Thanks!
[17:00] guido_g but i'm not a windows guy, so i can't tell for sure
[17:01] ModusPwnens hey!! it works!
[17:01] guido_g ok
[17:01] ModusPwnens So
[17:02] ModusPwnens why exactly did i need to put the ip address of the computer itself there?
[17:02] guido_g what version of ØMQ do you use?
[17:02] guido_g no idea
[17:02] guido_g might be a windows issue
[17:02] ModusPwnens 2.0.7
[17:03] guido_g ok, that is sort of current
[17:03] ModusPwnens how often are new versions released?
[17:03] guido_g i'd say write this to the ml. might be a bug
[17:03] guido_g when there is something new
[17:04] guido_g no fixed times
[17:05] ModusPwnens Oh ok. I will do as you suggested.
[17:06] guido_g ok, have fun
[18:47] cremes i have a question about communicating between XREQ and XREP pairs
[18:48] cremes looking at the docs, it mentions adding a null message part between the identity and the body when communicating from REQ to XREP or XREQ to REP
[18:48] cremes is that also necessary when going directly from XREQ to XREP?
[18:49] cremes let me rephrase that last part...
[18:50] cremes when returning a response from a XREP to a XREQ socket, do i need to build the message chain as "identity", "delimiter", "body parts"?
[23:16] mato cremes: re. your question, no you don't need to add the delimiter when connecting XREQ to XREP
[23:16] mato cremes: however, if you do so you maintain the ability to connect a REQ/REP should you need to
[23:17] mato cremes: the docs for XREQ/XREP are minimalist, feel free to expand/suggest clearer wording as a patch, thx
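The message chain cremes describes ("identity", "delimiter", "body parts") could be laid out as in the sketch below. The struct and function are hypothetical in-memory helpers, not libzmq types or calls, and a single body part is assumed for brevity. The with_delimiter flag reflects mato's answer: the empty delimiter part is optional between XREQ and XREP, but keeping it preserves compatibility with REQ/REP peers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory view of one message part; not a libzmq type. */
struct part {
    const void *data;
    size_t size;
};

/* Lay out a reply to be sent through an XREP socket:
   identity, optional empty delimiter, then the body.
   `out` must have room for 3 parts; returns the number of parts used. */
size_t build_reply(struct part *out,
                   const void *identity, size_t identity_size,
                   const void *body, size_t body_size,
                   int with_delimiter)
{
    size_t n = 0;
    assert(out != NULL);

    out[n].data = identity;      /* routes the reply back to the requester */
    out[n].size = identity_size;
    n++;

    if (with_delimiter) {
        out[n].data = "";        /* zero-length part marks end of envelope */
        out[n].size = 0;
        n++;
    }

    out[n].data = body;
    out[n].size = body_size;
    n++;

    return n;
}
```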