ZeroMQ IRC Log

Wednesday August 25, 2010

[Time] Name	Message
[07:35] mato	pieterh: you at eea?
[07:40] pieterh	mato: re
[07:40] pieterh	mato: i'm not there, eta around 10.30
[07:40] pieterh	that ok?
[07:42] mato	fine with me
[07:42] mato	sustrik: 10:30 then...
[07:42] sustrik	ok
[07:44] mato	pieterh: your -EFAULT changes to zmq_stopwatch_stop do not build on Solaris, I'll take them out for that function since it's not really part of the core API
[07:51] CIA-20	zeromq2: Martin Lucina master * rb66dd7a / src/zmq.cpp :
[07:51] CIA-20	zeromq2: zmq_stopwatch_stop: Don't return EFAULT
[07:51] CIA-20	zeromq2: Function returning unsigned long int cannot return (-1) - http://bit.ly/du23tY
[07:55] sustrik	mato: a question
[07:55] sustrik	EINTR thing
[07:56] sustrik	is it a good idea we hide it?
[07:56] sustrik	what if the app sends signals
[07:56] sustrik	and processes them using sigwait
[07:57] sustrik	we'll end up with infinite loop, no?
[07:59] mato	sustrik: no, if the app is doing "best practice" signal handling with a thread using a sigwait loop then our not returning EINTR has no effect on that...
[07:59] mato	sustrik: unless something is very broken, which is of course possible with signal handling in general
[08:00] mato	sustrik: anyhow, the point is, zmq only ever changes the signal mask in its own I/O threads
[08:00] mato	sustrik: and never in the application threads
[08:01] sustrik	exactly
[08:01] mato	sustrik: if you're asking, should 0mq calls be interruptible (and return EINTR) then that is of course a different question, but we'd agreed that the answer is no
[08:01] sustrik	so basically no signal should ever occur in the thread that's using 0mq socket
[08:01] sustrik	right?
[08:02] mato	no
[08:02] mato	no signals should ever be delivered to the 0mq I/O threads
[08:02] sustrik	sure
[08:02] mato	what happens in the app thread(s) is up to the app
[08:03] sustrik	right, but a signal coming to app thread can cause a deadlock -- a busy loop actually
[08:03] mato	how so?
[08:03] sustrik	the code looks like this:
[08:03] sustrik	while (errno == EINTR)
[08:03] sustrik	do_blocking_op ();
[08:04] sustrik	so once signal arrives, it loops
[08:04] mato	what is "the code"? 0mq code? app code?
[08:04] sustrik	0mq code
[08:04] mato	which is fine, the blocking operation will restart...
[08:05] sustrik	sure, but the signal is still there
[08:05] sustrik	so it'll restart in infinite loop
[08:05] mato	huh?
[08:05] mato	you only get EINTR if the call is actually interrupted
[08:05] mato	it's not a level trigger
[08:06] mato	so you will get at most one EINTR per signal
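A minimal C sketch of the retry pattern under discussion, not the actual 0MQ source. The names `fake_blocking_op` and `retry_on_eintr` are hypothetical, and the stub merely stands in for a real blocking call such as poll(). It is meant to illustrate mato's point: EINTR is reported once per interrupted call, so the loop ends as soon as one call completes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a blocking system call: it fails with EINTR
   once for each simulated interruption, then succeeds. */
static int pending_interrupts = 0;
static int calls_made = 0;

static int fake_blocking_op (void)
{
    calls_made++;
    if (pending_interrupts > 0) {
        pending_interrupts--;
        errno = EINTR;
        return -1;
    }
    return 0;
}

/* Restart the operation only when the call itself failed with EINTR.
   The return value is checked first because errno is only meaningful
   after a failure. */
static int retry_on_eintr (int (*op) (void))
{
    int rc;
    assert (op != NULL);
    do {
        rc = op ();
    } while (rc == -1 && errno == EINTR);
    return rc;
}
```

With two simulated interruptions the operation is attempted three times and then returns normally. A loop that tested errno without checking the return value could keep spinning on a stale EINTR, which is one way to get the busy loop sustrik is worried about.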
[08:06] sustrik	even with sigwait-style signals?
[08:06] sustrik	there's a queue there in the background afaiu
[08:06] mato	if you're using sigwait style signals then it is your responsibility to do it right
[08:06] sustrik	is it still edge-triggered
[08:06] sustrik	?
[08:06] mato	which means doing the signal handling in a thread of its own
[08:06] mato	which does nothing except run the sigwait loop
[08:07] sustrik	"never combine 0MQ socket with signals"
[08:07] mato	no
[08:07] mato	if you're using naive signal handling with signal() it might work fine
[08:07] sustrik	hm
[08:08] mato	unless people are actually expecting calls to return with EINTR
[08:08] sustrik	i been thinking about it because zed actually experienced the infinite looping
[08:08] sustrik	zedas: hullo!
[08:08] mato	zedas: yo
[08:09] mato	he's probably asleep, zed is somewhere in the states, no?
[08:09] zedas	what's up?
[08:09] mato	ah
[08:09] zedas	i never sleep
[08:09] mato	zedas: what's this about signals and infinite loops?
[08:09] zedas	well actually i was about to :-)
[08:09] mato	:-)
[08:09] zedas	mato: it's not signals.
[08:10] zedas	so what's happening is the following (that i haven't tracked down)
[08:10] zedas	1. Mongrel2 starts up with 2 or 4 IO threads.
[08:10] zedas	2. A request comes in for a handler, so M2 sends it out on 0MQ to that handler, which is in python, or lua so far, but seems to be on any of them.
[08:10] zedas	3. Handler then gets 2 messages, but we only sent one.
[08:11] zedas	i have confirmed this 5 times.
[08:11] zedas	4. Handler responds to both messages, but Mongrel2 then gets hit with an "infinite" loop in zmq_poll.
[08:11] zedas	5. What's happening is one of the IO threads is setting EINTR on each loop through, so then zmq_poll causes M2 to go 100% CPU.
[08:12] zedas	6. Finally, if we restart the handler (not mongrel2) then the EINTR goes away immediately, and everything goes back to normal.
[08:12] zedas	that's what i know so far.
[08:12] zedas	so, i tested it with IO threads set to 1.
[08:12] zedas	and it totally has no bugs. ran a thrashing test for a whole day to make sure.
[08:12] zedas	and it's a nasty test too, and not a single lockup or 100% the whole time.
[08:13] zedas	but, set threads > 1 and bam, 100% right away.
[08:13] mato	in step 5, the EINTR is coming from the poll() call inside zmq_poll() ?
[08:13] zedas	so there ya go, that's all I know. the EINTR isn't caused by signals at all.
[08:13] zedas	no, i think it's extraneous
[08:13] mato	yeah, but what is returning EINTR?
[08:13] zedas	i swear I blocked every damn signal possible and that EINTR still kept going
[08:13] zedas	well EINTR isn't "returned"
[08:14] zedas	i think one of these threads is setting it as an error indicator
[08:14] zedas	and it's not getting caught by 0mq, so it bleeds out to poll.
[08:14] zedas	anyway after i got mongrel2 1.0 out i was gonna work up a test case for it and try to narrow it down.
[08:14] sustrik	memory overwrite maybe
[08:14] zedas	but if you guys have insight into where in the code this could happen.
[08:15] sustrik	as far as i see there are 2 distinct problems
[08:15] mato	sustrik: or somewhere in the i/o code we are not squashing an EINTR and should be?
[08:15] zedas	yeah my thinking is the IO thread gets the double response and freaks out. that's also why the handler becomes unresponsive.
[08:15] sustrik	yes
[08:15] sustrik	the first problem may actually overwrite memory or so
[08:15] sustrik	so the second problem may be just a consequence of the first one
[08:16] mato	it's still strange that EINTR comes into play if no signals are involved
[08:16] zedas	hmm. yep and the best indicator it's a thread issue and not poll/interrupt is i kill the handler, and poof problem solved.
[08:16] zedas	if it was signals then i'd still be getting them even after killing the handler.
[08:16] sustrik	hm
[08:16] zedas	also signals coming in that fast would really nuke the process.
[08:17] zedas	i mean these are tight loop poll calls going fast as hell, no way i'm getting signals that quick.
[08:17] mato	zedas: just to make sure, is there any way you can trace the process when it happens with something that would show signals being delivered?
[08:17] mato	zedas: e.g. running it under GDB with all signals set to "nostop pass" and logging the output?
[08:17] zedas	i have a few stack traces...let me find...
[08:17] zedas	mato: oh yeah, no signals
[08:18] zedas	i swear up and down i cannot catch any signals
[08:18] zedas	gdb gets nothing. manually capturing every one with sigaction. nothing.
[08:18] sustrik	overwriting glibc memory can result in strange poll behaviour
[08:18] zedas	either it's in another thread, which is really weird, or it's being set manually, or overwritten ram.
[08:18] mato	we never return EINTR anywhere manually
[08:18] zedas	http://mongrel2.org/tktview?name=f1691a47d1
[08:19] zedas	that's our bug on it
[08:19] mato	grepping src/* for EINTR will tell you that...
[08:19] zedas	there's a really good stack trace in there and our debug dumps, and info i've found so far
[08:19] sustrik	zedas: iirc you've said that it happens only when you use multiple i/o threads; does that apply still?
[08:20] zedas	yep.
[08:20] zedas	if i set io threads to 1, no problems
[08:20] zedas	> 1 and lock up almost immediately
[08:20] zedas	and it eventually settles down, which to me says race condition in the threading
[08:21] sustrik	mato: we should run some tests with multiple i/o threads
[08:21] mato	sustrik: yeah
[08:21] sustrik	everyone's using magic number of 1 so it's very much untested
[08:21] zedas	http://mongrel2.org/artifact/3fcd396312653274f96cee730287bd62d8018fb0
[08:21] zedas	that's the python 0mq handler that has the most frequent lockups
[08:22] zedas	and it's really not doing much.
[08:22] sustrik	zedas: ok, thanks for the info
[08:22] zedas	receives on a DOWNSTREAM socket, sends on a PUB socket.
[08:22] sustrik	i'll do some tests with multiple i/o threads
[08:22] mato	zedas: yeah, thanks. /me thinks the EINTR is a red herring and this looks like a race between i/o threads
[08:22] sustrik	hopefully i'll be able to reproduce
[08:22] mato	sustrik: you have that big 8-core box
[08:23] sustrik	two of them :)
[08:23] mato	sustrik: yeah, exactly
[08:23] mato	load the hell out of it and see what breaks
[08:23] zedas	mato: ohhhh you know, i saw this more when i switched to an 8-core box....
[08:25] mato	zedas: since EINTR came up, do you have any opinion on us squashing EINTR in zmq_* calls?
[08:25] mato	zedas: that's the way it's done now since we considered the EINTR thing with UNIX calls a bug and felt no need to emulate it
[08:26] mato	so effectively all zmq_* calls ignore EINTR and (should) never return it
[08:27] zedas	mato: actually yeah, i think you shouldn't do that. to me it should work just like poll, except handle 0mq or regular sockets together.
[08:27] zedas	especially in my server since i also have to handle those anyway in other parts, so zmq_poll doing it causes problems potentially.
[08:28] mato	zedas: so you prefer the standard behaviour even if it is kind of broken?
[08:28] mato	most code i've seen ends up calling system calls in a loop ignoring EINTR
[08:29] zedas	yep, that's what i'd prefer, but you'd probably break people's code who expect you to do this
[08:29] zedas	so, as long as you fix it so it doesn't peg the cpu 100% i'm alright with it.
[08:30] mato	ok
[08:35] mato	pieterh: will join you in 15mins or so
[08:36] travlr	are you guys at a conference?
[08:37] sustrik	travlr: no, but we happen to be in the same city
[08:37] travlr	ah. cool. enjoy your meet.
[08:37] sustrik	pieterh, mato: ok, i'm leaving as well
[09:40] CIA-20	zeromq2: Martin Lucina master * rc06a3cc / (builds/msvc/platform.hpp configure.in): Update version number to 2.0.8 - http://bit.ly/bS3kGK
[09:44] CIA-20	zeromq2: Pieter Hintjens master * rd788c1f / NEWS : Updated NEWS for stable 2.0.8 release - http://bit.ly/bgBrD1
[10:10] CIA-20	zeromq2: Pieter Hintjens master * r98bea86 / NEWS : Updated NEWS for stable 2.0.8 release - http://bit.ly/9NJIVG
[10:10] CIA-20	zeromq2: Pieter Hintjens master * r6d275a8 / NEWS : Updated NEWS for stable 2.0.8 release - http://bit.ly/bLMHJw
[10:51] CIA-20	zeromq2: Martin Lucina master * rc9076c5 / doc/zmq_socket.txt :
[10:51] CIA-20	zeromq2: Basic documentation for XREQ/XREP socket types
[10:51] CIA-20	zeromq2: Add some basic documentation for XREQ/XREP socket types, including
[10:51] CIA-20	zeromq2: a brief description of the most common use case (REQ -> XREP) and (XREQ ->
[10:51] CIA-20	zeromq2: REP). - http://bit.ly/ctquxP
[10:54] travlr	sustrik: you guys should commit to your local repository and make one commit to the public repo at the end of your session
[10:55] alfborge	Any work on the fortran bindings?
[10:55] mato	travlr: it'd still be split into multiple commits even if i do one push
[10:55] mato	travlr: since the changes are separate things
[10:56] travlr	yeah but you can do that locally and only do one commit to the public repo when you are done for the day
[10:57] travlr	or the session
[10:58] pieterh	travlr: sorry about that...
[10:58] travlr	:)
[10:58] pieterh	i generally use --amend to reduce commits but it did not work here, got some weird merge conflict...
[10:59] travlr	merge conflicts are why i stopped using rebase too.
[10:59] pieterh	anyhow, the good news is we're making 2.0.8 stable
[10:59] travlr	i saw that w00t w00t.. lol
[11:08] travlr	btw, to get technical.. i meant one push to the public repo not a commit.
[11:22] CIA-20	zeromq2: Martin Lucina master * r1e089f7 / ChangeLog : Update ChangeLog for v2.0.8 - http://bit.ly/aYIYNs
[12:23] CIA-20	jzmq: Gonzalo Diethelm master * r914cbd0 / src/org/zeromq/ZMQ.java : Added method Poller.getSocket(int index). - http://bit.ly/aSnrqC
[13:28] keffo	oh, nice bot feature :)
[13:32] keffo	meh, by the hammer of thor, I've hit my 20gb 3g transfer cap :/
[13:34] pieterh	keffo: you transferred 20GB via 3G????
[13:34] pieterh	my monthly limit is 500MB and I haven't even hit that
[13:37] keffo	hehe
[13:37] keffo	then at least you're familiar with how agonizing it is to do any sort of internet related work over 3g :)
[13:39] keffo	this is my only connection, and 3-4 machines using it, if you include the phone itself
[13:39] pieterh	wow
[13:41] keffo	(hence the predisposition I have about both reliability & custom loadbalancing :))
[13:42] CIA-20	zeromq2: Martin Sustrik master * rb608c19 / (3 files in 2 dirs): MSVC build fixed (+19 more commits...) - http://bit.ly/cp3KJL
[13:42] keffo	msvc build fixed?
[13:48] mato	keffo: that is part of a bunch of changes that have landed on master
[13:48] mato	keffo: which will become 2.1.x, to be announced shortly
[13:55] keffo	how shortly? I'm just fuzzing around upgrading to 2.0.8! :)
[13:55] mato	by announced i mean we'll announce what's cooking
[13:55] keffo	hmm.. the #define _CRT_SECURE_NO_WARNINGS in windows.hpp probably should be guarded..
[13:55] mato	no schedule for an actual 2.1.x release yet, many changes that need to be tested
[13:56] keffo	what are the major ones?
[13:56] mato	wait for the ml announcement :)
[13:58] keffo	oki
[15:18] keffo	well, that was a good 20 minutes of trying to make good old dad to stop using IE
[15:52] ModusPwnens	Hello
[16:05] keffo	hellu
[16:07] ModusPwnens	hi keffo, i have a problem with zeromq
[16:11] keffo	ok
[16:15] ModusPwnens	So i am trying to set up a basic server/client setup
[16:15] ModusPwnens	and I got it to work locally
[16:15] ModusPwnens	but when I tried to do it across the network, the server would not receive any of the clients messages
[16:15] ModusPwnens	but I used a network analyzer and confirmed that the computer itself was indeed receiving the messages
[16:17] ModusPwnens	I was following the example for the Hello World program, except I sort of rewrote it because I am also using Google Protocol Buffers
[16:17] keffo	that shouldnt matter, it's just data
[16:17] keffo	you probably changed something else while doodling
[16:17] keffo	or screwed up the size or something
[16:18] ModusPwnens	That's what I thought, but I retested it locally and it works just fine
[16:18] ModusPwnens	its just when I try to do it over the network that it doesn't work
[16:18] ModusPwnens	I was wondering if there was something we have to change or add if we are doing it over a network
[16:18] keffo	did the sample work?
[16:18] ModusPwnens	Yeah
[16:18] keffo	over network?
[16:19] ModusPwnens	Hmm, I didn't try that. I only tried it locally to verify that it worked.
[16:19] ModusPwnens	Hold on, let me try that
[16:19] keffo	as long as you're not using inproc or ipc, it should work provided firewalls are in order etc
[16:19] ModusPwnens	I'm just using tcp
[16:20] ModusPwnens	Also, is there some reason using localhost for the endpoint of the bind function in the server application would cause the program to crash?
[16:21] keffo	programs should never crash :)
[16:21] ModusPwnens	bahaha, that's very true.
[16:21] keffo	try with tcp://*:1234 etc
[16:21] ModusPwnens	Yeah that's what i had originally.
[16:21] ModusPwnens	But I was just trying different things to get it to work
[16:21] keffo	using localhost in a server sounds pretty weird :)
[16:22] keffo	eth0 etc should work though, but not in windows I think
[16:22] keffo	but with one nic that doesn't matter, just use *
[16:23] ModusPwnens	Yeah that's what I was doing.
[16:23] ModusPwnens	For the client, we would use the ip address of the server, yes?
[16:23] keffo	yeah
[16:23] ModusPwnens	I thought so..
[16:23] keffo	bind vs connect
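To illustrate "bind vs connect" at the plain TCP level, here is a small C sketch using POSIX sockets rather than the 0MQ API. It assumes that a wildcard endpoint such as tcp://*:1234 corresponds roughly to binding on INADDR_ANY; how 0MQ resolves "*" internally is not shown in this log. The helper names are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Server side: listen on an IPv4 address given in host byte order.
   INADDR_ANY means "all interfaces"; port 0 lets the OS pick a free port.
   Returns the listening descriptor or -1. */
static int listen_on (uint32_t addr, uint16_t port)
{
    struct sockaddr_in sa;
    int fd = socket (AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;
    memset (&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl (addr);
    sa.sin_port = htons (port);
    if (bind (fd, (struct sockaddr *) &sa, sizeof sa) == -1
        || listen (fd, 1) == -1) {
        close (fd);
        return -1;
    }
    return fd;
}

/* Report which port the listener actually got, or -1 on error. */
static int bound_port (int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof sa;
    if (getsockname (fd, (struct sockaddr *) &sa, &len) == -1)
        return -1;
    return ntohs (sa.sin_port);
}

/* Client side: connect to the server's concrete IP address and port.
   Returns the connected descriptor or -1. */
static int connect_to (const char *ip, uint16_t port)
{
    struct sockaddr_in sa;
    int fd;
    assert (ip != NULL);
    memset (&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons (port);
    if (inet_pton (AF_INET, ip, &sa.sin_addr) != 1)
        return -1;
    fd = socket (AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;
    if (connect (fd, (struct sockaddr *) &sa, sizeof sa) == -1) {
        close (fd);
        return -1;
    }
    return fd;
}
```

The server never names a peer; it only chooses which local interface to listen on. The client must name the server's address, which is why the connect endpoint changes when moving from one machine to two.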
[16:23] ModusPwnens	hmm...well let me try the example and see if it works
[16:23] ModusPwnens	if it doesnt, maybe there is some more complicated issue
[16:25] ModusPwnens	what..? Do they change the examples on the website a lot?
[16:25] ModusPwnens	There are nowhere near as many examples as there were the other day
[16:30] ModusPwnens	ok, so I just tried using the hello world example over the network
[16:31] ModusPwnens	and the only change I made was to the endpoint in the client. I changed it to the ip address of the other computer I am using the server .exe on
[16:32] ModusPwnens	and it doesn't work, but it does work locally
[16:34] keffo	it's pretty active yeah
[16:34] guido_g	packet filter?
[16:34] ModusPwnens	a packet filter?
[16:34] guido_g	something that filters out packets on the network
[16:34] ModusPwnens	I am sorry, but I am relatively new to networking and the concepts involved..
[16:35] guido_g	often wrongly called a firewall
[16:35] ModusPwnens	Hmm. I saw the data I was sending when I used wireshark to analyze the traffic
[16:35] ModusPwnens	would the packet filter occur before or after that?
[16:35] guido_g	on which machine?
[16:35] ModusPwnens	both
[16:36] guido_g	then it should be ok
[16:36] guido_g	but i'm not sure... it's windows, so be prepared for the unexpected
[16:36] ModusPwnens	sigh i am well aware of that lol
[16:36] ModusPwnens	as far as getting the hello world example to work over networking though, all you should have to do is change the endpoint for the connect function, right?
[16:37] ModusPwnens	and if that doesn't work, there is some other underlying cause as to why?
[16:37] guido_g	best is you prepare a text with what you did (exactly!) and paste that to a paste bin
[16:37] ModusPwnens	What I did as far as modifying the hello world code?
[16:38] guido_g	for example
[16:38] guido_g	if you modified the code, put it on the paste bin too
[16:39] ModusPwnens	I'm not sure I understand. What purpose would this serve?
[16:39] guido_g	that we do see the same code as you
[16:40] guido_g	now we're guessing
[16:40] ModusPwnens	oh. Well I'm not sure that's necessary. All i did was copy and paste the code off the website
[16:41] guido_g	ok, your choice
[16:44] ModusPwnens	I imagine that the hello world code must have been verified to work over a network
[16:45] ModusPwnens	i highly doubt that it wouldn't work. I'm just a little unsure of what could cause the problem on my end
[16:46] guido_g	because you refuse to show "your end" no one knows
[16:47] ModusPwnens	fine fine, I will copy my code over
[16:47] guido_g	btw, this is the accepted practice when asking for help
[16:49] ModusPwnens	oh ok..Sorry, i didn't know. One of my coworkers told me about these irc chatrooms so i have never really been on one before
[16:50] ModusPwnens	Ok so, you don't want me to just copy the code right over right? You said something about a paste bin..
[16:50] guido_g	http://paste.pocoo.org/
[16:51] ModusPwnens	http://paste.pocoo.org/show/254360/
[16:51] ModusPwnens	the only other thing that could be different is that I have additional header files
[16:52] ModusPwnens	but i dont think that would cause a problem
[16:52] guido_g	you typed it, right?
[16:52] guido_g	ops, wrong window
[16:55] guido_g	ok, given that the packets are visible on both machines, try to replace the '*' in line 69 with the ip of the machine
[16:56] ModusPwnens	of the client machine?
[16:56] guido_g	line 70 actually
[16:56] guido_g	no, of course not
[16:56] guido_g	ip of the machine the server is running on
[16:57] guido_g	how many interfaces does this machine have?
[16:57] ModusPwnens	Hmm. I didn't think that was necessary since it was being run on that computer
[16:57] ModusPwnens	but i will try that
[16:58] guido_g	ok, if it doesn't work, gather all the information incl. the changes to the programs and post a problem report to the mailing list
[16:59] ModusPwnens	Okie doke. Thanks for your help!
[16:59] guido_g	important information is operating system, compiler, ømq version etc.
[17:00] ModusPwnens	as far as the compiler goes, is it alright to just say I'm using Visual Studio 2008? I'm not really sure what compiler it uses.
[17:00] guido_g	should be enough
[17:00] ModusPwnens	Ok. Thanks!
[17:00] guido_g	but i'm not a windows guy, so i can't tell for sure
[17:01] ModusPwnens	hey!! it works!
[17:01] guido_g	ok
[17:01] ModusPwnens	So
[17:02] ModusPwnens	why exactly did i need to put the ip address of the computer itself there?
[17:02] guido_g	what version of ØMQ do you use?
[17:02] guido_g	no idea
[17:02] guido_g	might be a windows issue
[17:02] ModusPwnens	2.07
[17:03] guido_g	ok, that is sort of current
[17:03] ModusPwnens	how often are new versions released?
[17:03] guido_g	i'd say write this to the ml. might be a bug
[17:03] guido_g	when there is something new
[17:04] guido_g	no fixed times
[17:05] ModusPwnens	Oh ok. I will do as you suggested.
[17:06] guido_g	ok, have fun
[18:47] cremes	i have a question about communicating between XREQ and XREP pairs
[18:48] cremes	looking at the docs, it mentions adding a null message part between the identity and the body when communicating from REQ to XREP or XREQ to REP
[18:48] cremes	is that also necessary when going directly from XREQ to XREP?
[18:49] cremes	let me rephrase that last part...
[18:50] cremes	when returning a response from a XREP to a XREQ socket, do i need to build the message chain as "identity", "delimiter", "body parts"?
[23:16] mato	cremes: re. your question, no you don't need to add the delimiter when connecting XREQ
[23:16] mato	cremes: ...to XREP
[23:16] mato	cremes: however, if you do so you maintain the ability to connect a REQ/REP should you need to
[23:17] mato	cremes: the docs for XREQ/XREP are minimalist, feel free to expand/suggest clearer wording as a patch, thx
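The message layout cremes asks about can be sketched in C without the 0MQ API. `frame_t` and `build_reply` are hypothetical names, and treating the identity and body as C strings is a simplification (real identities may be arbitrary bytes). The sketch only shows the order of the parts: identity first, an optional empty delimiter, then the body.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One message part: a pointer and a length (hypothetical type). */
typedef struct {
    const char *data;
    size_t size;
} frame_t;

/* Fill `out` with the parts of a reply sent from an XREP socket and
   return how many parts were written (2 or 3). With with_delimiter set,
   an empty part separates the identity from the body, which keeps the
   layout compatible with REQ/REP peers. */
static size_t build_reply (const char *identity, const char *body,
                           int with_delimiter, frame_t out[3])
{
    size_t n = 0;
    assert (identity != NULL && body != NULL && out != NULL);
    out[n].data = identity;
    out[n].size = strlen (identity);
    n++;
    if (with_delimiter) {
        out[n].data = "";
        out[n].size = 0;
        n++;
    }
    out[n].data = body;
    out[n].size = strlen (body);
    n++;
    return n;
}
```

Per mato's answer, the two-part form is enough between XREQ and XREP, while the three-part form leaves room to swap in a REQ or REP socket later. In a real program each part would be sent as one part of a multipart message.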