IRC Log

Friday September 30, 2011

[Time] Name Message
[07:33] mikko sustrik: are you there?
[07:47] sustrik mikko: hi
[07:47] mikko sustrik: want to try to reproduce this lost ack problem?
[07:48] sustrik nope. have you uploaded the steps to reproduce?
[07:48] mikko i can give them here
[07:48] mikko i added a small piece of test code into the pzq repo
[07:49] mikko that allows reproducing it
[07:49] sustrik please, create a ticket otherwise it'll get lost
[07:49] mikko ok, even though i'm not 100% sure it's a zeromq issue
[07:49] mikko looks so though
[07:49] sustrik sure
[07:49] mikko i don't understand it yet
[07:49] mikko as if i sleep at the end of the consumer code all ACKs get sent to the server
[07:50] mikko but without sleep they disappear somewhere
[07:50] sustrik yes, i recall that
[07:50] sustrik some logging should be added to debug the thing
[07:51] sustrik what kind of connection is that btw?
[07:51] sustrik router/router?
[07:54] mikko LIBZMQ-264
[07:55] mikko there are a few
[07:55] mikko DEALERS and ROUTERS
[07:55] mikko see LIBZMQ-264
[07:55] mikko sustrik: i added some debug code on the device
[07:55] mikko the device is the border that receives the ACK first from client
[07:55] sustrik i mean the connection losing the acks
[07:55] mikko and it seems that it doesn't get even there
[07:56] mikko yes, i added some debug code on server side
[07:56] mikko and it looks like the messages don't reach the server
[07:56] mikko they do reach the server if i sleep (1)
[07:56] mikko and tons of them come at the end in one big lump
[07:56] sustrik ok
[07:57] mikko https://github.com/mkoppanen/pzq/blob/master/tests/consumer.cpp
[07:57] mikko the consumer code is fairly simple
[07:58] sustrik what about the sender?
[07:58] mikko that is the sender as well
[07:58] mikko it consumes message and ACKs it
[07:58] mikko using ROUTER -> DEALER
[07:58] sustrik ok
[08:00] mikko in the consumer.cpp if i add sleep (1) just before return 0; it all goes well
[08:02] mikko i'm interested in seeing if it's just me
[08:06] sustrik i'll give it a try later on
[09:08] jd10 https://github.com/kro/zeromq-scala-bindings
[09:10] sustrik jd10: nice
[09:10] sustrik you should link it from zero.mq site
[09:10] sustrik so that people can find it
[09:10] jd10 aight, i'll add it to the bindings page
[09:11] jd10 the scala bindings use jna so there's no need for jzmq if you're using scala
[09:11] sustrik ack
[09:15] jd10 for sake of consistency, now renamed to https://github.com/kro/zeromq-scala-binding
[12:35] guido_g cool
[12:37] guido_g even cooler, w/o the java bindings
[13:28] sustrik mikko: there?
[13:32] mikko sustrik: y
[13:32] sustrik i'm trying to install pzq
[13:32] sustrik having problems with boost
[13:33] mikko whats the problem?
[13:33] sustrik ok, solved
[13:33] sustrik i forgot to install some of the boost packages
[13:33] mikko the cmake scripts might not be quite cross platform yet
[13:33] sustrik sorry
[13:35] sustrik mikko: still, what is it about the build directory?
[13:35] sustrik i can't run cmake there
[13:35] sustrik as there's no cmake build file there
[13:36] sustrik $ git clone git://github.com/mkoppanen/pzq.git
[13:36] sustrik $ cd pzq
[13:36] sustrik $ mkdir build
[13:36] sustrik $ cd build
[13:36] sustrik $ cmake .. -DZEROMQ_ROOT=/path/to -DKYOTOCABINET_ROOT=/path/to
[13:36] sustrik $ make
[13:38] sustrik it seems to build ok in the main pzq directory though
[13:38] skm when a router connects to another router for the first time - is that when identity information is passed and stored?
[13:38] sustrik skm: yes
[13:39] skm ok cool - that has done my head in for AGES
[13:39] skm my server router that does the 'bind' has the name 'server1' and if it disconnects it can't rebind on the same ip/port with a different identity
[13:40] skm ('server1' is an example, i was actually using 'processid@ip:port')
[13:40] skm when my server died and restarted, all clients could no longer talk to it because they obviously were told about the initial processid only
[13:40] mikko sustrik: it's to keep build artifacts in one dir
[13:41] mikko sustrik: you can reset to original state just by removing build directory
[13:41] sustrik mikko: yes, i mean the steps above don't work
[13:41] sustrik it's ok when i build in main dir
[13:41] mikko ok, will look into that
[13:41] mikko might be cmake version difference or something
[13:41] sustrik no big problem
[13:42] mikko did you get it running?
[13:42] sustrik just that cmake looks for CMakeLists in the current dir
[13:42] sustrik so it obviously doesn't work in build dir
[13:42] mikko cmake .. should work
[13:42] sustrik which is empty
[13:42] mikko as that looks for one dir up
[13:42] sustrik ah
[13:42] mikko cmake [options] <path-to-source>
[13:42] mikko cmake [options] <path-to-existing-build>
[13:42] sustrik anyway, no big problem
[13:43] sustrik skm: yes, it works that way
[13:43] mikko yeah, doesn't really matter as you can easily clean with git as well if needed
[13:43] sustrik ack
[13:43] mikko sustrik: but it built ok otherwise?
[13:43] sustrik looks like
[13:43] zirpu i know 0mq doesn't have any security layer built in, but is there a way for the client/server to see the connecting server/client IP in order to implement an ACL?
[13:43] skm sustrik is there any way around it or should i just use ip:port as the ID?
[13:43] mikko sustrik: if you run ./pzq it should say "0 messages loaded from store"
[13:43] zirpu probably not, but just thought i'd ask here before the mailing list.
[13:43] sustrik it does
[13:44] sustrik mikko: it does
[13:44] mikko so if you run ./producer it should pump in 10k messages
[13:44] mikko and after that running ./consumer should consume 1k
[13:44] mikko after consuming you can press ctrl c on the pzq
[13:44] mikko and restart it
[13:44] mikko it should say 9000 messages loaded
[13:44] mikko but in my case it's something like 9021 or so
[13:45] sustrik skm: not sure what it should do. if you restart the server with a different identity it's basically a different app, so it doesn't make much sense to reconnect clients to it
[13:45] sustrik zirpu: probably not
[13:46] skm the clients reconnect automatically to it because they are aware of the ip/port combo
[13:46] zirpu sustrik: cool. thanks.
[13:46] skm you are right it is a different app - but the clients don't know that
[13:46] skm because it's on the same ip/port
[13:46] skm and they have once before connected to it
[13:46] skm can you turn auto reconnect off?
[13:46] sustrik skm: if you want to use ip/port to identify the server just don't set the identity on the server
[13:47] sustrik then it's identified solely by ip/port
[13:47] skm sustrik: i actually want to identify it as a new/different app (but using the same ip/port) but the clients can't talk to it if it has a different identity
[13:48] skm and if it has the same identity some old msgs can be received on it
[13:48] sustrik mikko: 9004
[13:48] sustrik skm: exactly
[13:48] skm i've had to make one of the message parts a filter for the server process id, so when the server starts up again with its ip/port identity it only looks at msgs received with its process id in them, not a previous one
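
A rough sketch of the workaround skm describes, assuming a ROUTER server and a frame layout of [client identity][server pid][payload]; the function names, frame order and loop structure here are illustrative, not skm's actual code:

    // Illustrative only: drop messages that were addressed to a previous
    // incarnation of the server by matching a pid frame against getpid ().
    #include <zmq.hpp>
    #include <sstream>
    #include <string>
    #include <unistd.h>

    static bool for_this_incarnation (zmq::message_t &pid_frame)
    {
        std::ostringstream me;
        me << getpid ();
        std::string got (static_cast<char *> (pid_frame.data ()), pid_frame.size ());
        return got == me.str ();
    }

    void serve (zmq::socket_t &router)
    {
        while (true) {
            zmq::message_t identity, pid, payload;
            router.recv (&identity);   // client identity frame added by ROUTER
            router.recv (&pid);        // the "server pid" filter part
            router.recv (&payload);
            if (!for_this_incarnation (pid))
                continue;              // stale message meant for an old process
            // ... handle payload, reply via the identity frame ...
        }
    }
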
[13:50] mikko sustrik: thats ok
[13:50] mikko sustrik: do it a few times
[13:50] mikko sustrik: some messages might expire
[13:51] sustrik mikko: after consuming 2000 more messages: 7035
[13:52] mikko yeah
[13:52] mikko thats what i see as well
[13:52] sustrik ok
[13:52] mikko now, tests/consumer.cpp
[13:52] mikko add sleep (1); before return 0;
[13:52] mikko and rebuild
[13:52] mikko you should see fairly consistent consuming
[13:52] sustrik ok
[13:53] mikko might be 2 missing because they have expired (two get pushed before hwm is reached)
[13:54] mikko and from what i have debugged those ACKs never reach the pzq daemon
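
A minimal sketch of the tail end of such a consumer, to make the workaround concrete; this is not the real tests/consumer.cpp, and the socket type, endpoint and frame layout are placeholders:

    // Sketch: receive a message, ACK it, and sleep before the sockets go
    // out of scope. Without the sleep the last lump of queued ACKs is lost.
    #include <zmq.hpp>
    #include <cstring>
    #include <unistd.h>

    int main ()
    {
        zmq::context_t ctx (1);
        zmq::socket_t s (ctx, ZMQ_DEALER);        // placeholder socket type
        s.connect ("tcp://127.0.0.1:11131");      // placeholder endpoint

        for (int i = 0; i < 1000; i++) {
            zmq::message_t id;
            s.recv (&id);                         // message id frame
            // ... receive any remaining parts and process them ...

            zmq::message_t ack (id.size ());
            memcpy (ack.data (), id.data (), id.size ());
            s.send (ack);                         // ACK back to the daemon
        }

        sleep (1);   // the workaround under discussion; the default LINGER
                     // should make termination flush these ACKs anyway
        return 0;    // ~socket_t and ~context_t (close + term) run here
    }
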
[13:54] sustrik what does "expired" mean?
[13:56] sustrik skm: yes, the identity semantics don't work well in corner cases, that's why identities were removed in development trunk
[13:57] mikko sustrik: the pzq sends the message to consumer
[13:57] mikko and waits for ACK for N amount of time
[13:57] mikko and if the ACK doesn't come it considers the consumer dead
[13:57] mikko and schedules the message for redelivery
[13:58] sustrik ah, so "expired" = "scheduled for resend"
[13:58] mikko yes
[13:58] sustrik got it
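
Conceptually the expiry described above amounts to something like the following; this is purely illustrative and not pzq's implementation (names and data structures are invented):

    // Track when each message was handed to a consumer; anything that has
    // not been ACKed within ack_timeout is scheduled for redelivery.
    #include <map>
    #include <string>
    #include <stdint.h>

    struct in_flight_t {
        std::map<std::string, uint64_t> sent_at;   // message id -> send time (µs)

        void acked (const std::string &id) { sent_at.erase (id); }

        void expire (uint64_t now_us, uint64_t ack_timeout_us) {
            std::map<std::string, uint64_t>::iterator it = sent_at.begin ();
            while (it != sent_at.end ()) {
                if (now_us - it->second > ack_timeout_us) {
                    // no ACK in time: consider the consumer dead and
                    // put the message back on the delivery queue
                    sent_at.erase (it++);
                } else
                    ++it;
            }
        }
    };
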
[13:59] sustrik mikko: how do i clean the DB btw?
[13:59] mikko rm /tmp/sink.kch
[13:59] skm sustrik how will/does that work? how does moving messages between xrep rep req xreq work with no identities?
[13:59] mikko thats the default path
[13:59] mikko there is no programmatic way for now
[13:59] sustrik skm: there are auto-generated identities
[13:59] sustrik always unique
[14:00] sustrik mikko: ok
[14:00] mikko sustrik: did you test with sleep (1) ?
[14:00] sustrik nope
[14:00] mikko you can increase --ack-timeout if you don't want expiries
[14:01] mikko it's microseconds
[14:01] sustrik i believe you it works :)
[14:01] mikko the behaviour i see is that tons of those ACKs come in a lump at the end
[14:01] mikko and without sleep that lump is lost
[14:01] sustrik ok
[14:01] mikko and as there is no linger on the sender side i would expect it to block at the end
[14:26] mikko sustrik: did you look any further?
[17:14] mikko sustrik: there?
[19:35] cremes i need some help confirming a bug in the latest master for zeromq-3_0
[19:36] cremes using the local_thr and remote_thr throughput tests, I am showing that the receiver
[19:36] cremes (local_thr) hangs because it doesn't receive all of the messages
[19:36] cremes it looks like it misses around 50 messages at the tail end
[19:38] minrk1 parameters? I just ran the test with current 3.0, and it finished
[19:41] cremes minrk1: try tcp://127.0.0.1:5555 1024 100000
[19:42] minrk1 success
[19:43] minrk1 (success as in no hang, not success as in reproduced your bug)
[19:43] cremes hmmm...
[19:43] cremes what os are you on?
[19:43] cremes i'm on osx, that's where i see this happening
[19:43] minrk1 OSX - building now on Linux
[19:44] cremes the same code completes when i load up 2.1.x
[19:44] cremes i updated to latest master so i should have all fixes
[19:45] minrk1 weird
[19:45] cremes very
[19:49] cremes no SNDHWM or RCVHWM is set...
[19:49] minrk1 the sender exits fine?
[19:51] cremes yes
[19:52] cremes every zmq_* returns 0 (or the number of bytes sent)
[19:52] cremes no errors
[19:52] cremes and i print from inside the loop to confirm it spits out all messages
[19:53] minrk1 I don't suppose there might be another sender adding a few messages on the same port
[19:53] cremes no
[19:54] minrk1 so you have a count of the missed messages?
[19:54] cremes yes
[19:54] cremes when i use 1000, the receiver gets from 950 to 965 of them
[19:54] cremes the publisher prints all 1000
[19:57] cremes i'll look at it more this weekend
[19:57] minrk1 I feel like I've seen similar before, but it's been a long time
[19:58] cremes just saw something interesting...
[19:58] cremes i set local_thr to receive 10 messages
[19:58] cremes and i set remote_thr to send 10
[19:58] cremes local_thr did not see any at all
[19:58] cremes so i bumped remote_thr to 50
[19:58] cremes it sent all of them but local_thr only saw 3 messages
[19:59] cremes sounds like something is getting buffered someplace
[20:00] whitej greetings.. qq to the general populace
[20:00] minrk1 cremes: what if you put sleeps after connect and before close, does that change anything?
[20:01] whitej zmq::context_t's and zmq::socket_t's created on the stack should clean up and unwind themselves
[20:01] whitej correct?
[20:01] whitej (using C++ interface)
[20:01] cremes minrk1: that worked
[20:01] minrk1 okay, check which one matters
[20:02] cremes in remote_thr, if i comment out the sleep after zmq_connect(), messages are dropped
[20:02] cremes so apparently it starts transmitting before the connection is fully established
[20:02] minrk1 it would seem that way
[20:02] cremes (good suggestion btw!)
[20:03] cremes odd that you don't see this too though
[20:03] cremes you're using 3.02, yes?
[20:04] minrk1 could be your computer is enough faster (or slower) than mine that the window is only open for you
[20:04] cremes crazy... i wonder how i can write this up when i don't have a repro that works on other computers
[20:05] minrk1 current git master (24bc1e510e191ad27fddae37a8714efab2911b47)
[20:05] cremes for all i know, it has to do with a race condition with the xsub sending its subscription up to the xpub
[20:05] cremes yes, mine matches that hash
[20:06] cremes that actually makes sense...
[20:06] cremes it starts transmitting before any filter is set; a null filter causes it to drop those messages
[20:06] minrk1 that sounds exactly right
[20:06] cremes then the filter arrives, gets set, and the messages are forwarded
[20:07] cremes ugh, what a headache! :)
[20:07] minrk1 ahh!
[20:07] minrk1 there's a known issue that things do not behave correctly when XSUB binds, and XPUB connects
[20:08] cremes ah yes, that's it then
[20:09] minrk1 https://zeromq.jira.com/browse/LIBZMQ-248
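
For reference, a sketch of the window cremes hit: the sender connects and transmits immediately, and everything sent before the subscription has propagated back upstream is silently filtered out. This is not the actual perf/remote_thr code; endpoint, message size and count are placeholders:

    #include <zmq.hpp>
    #include <unistd.h>

    int main ()
    {
        zmq::context_t ctx (1);
        zmq::socket_t pub (ctx, ZMQ_PUB);
        pub.connect ("tcp://127.0.0.1:5555");   // the SUB side binds and
                                                // subscribes to ""

        sleep (1);   // workaround: let the connection and subscription
                     // handshake finish before sending

        for (int i = 0; i < 1000; i++) {
            zmq::message_t msg (1024);
            pub.send (msg);   // without the sleep the first few dozen of
                              // these are dropped (no subscription yet)
        }
        return 0;
    }
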
[20:10] whitej anyone know if there is a specific way to close a socket_t with the C++ API
[20:10] whitej playing around with this stuff and the server cannot rebind to the tcp port
[20:11] whitej looking at zmq.hpp... it appears that the socket_t and context_t clean up and terminate correctly in the destructors
[20:11] minrk1 looks that way
[20:11] minrk1 you can call socket_t.close() directly, I think
[20:12] whitej seeing an "Address already in use" exception
[20:12] mikko cremes: i'm seeing the same issue
[20:12] mikko cremes: 20:54 < cremes> when i use 1000, the receiver gets from 950 to 965 of them
[20:12] mikko cremes: LIBZMQ-264
[20:13] mikko but with ROUTER/DEALER
[20:21] cremes mikko: interesting... i'll write up a test using ROUTER/DEALER to see if i get the same
[20:23] mikko cremes: i wasn't able to reproduce the issue with simple tests
[20:23] mikko cremes: but with pzq i get it consistently
[20:23] mikko in this case the messages get batched
[20:23] mikko and a large batch seems to be sent at the end
[20:23] mikko unless i sleep(1) at the end of the script the batch is lost
[20:23] minrk1 mikko: were you sending unprompted messages from ROUTER sockets immediately after connection?
[20:24] mikko minrk1: i seem to be losing the messages at tail end
[20:24] minrk1 ah, so not the front, like cremes?
[20:25] mikko i'm quite sure at the end
[20:25] mikko but i can check
[20:25] mikko sleep (1) at the end would suggest that it's in the end
[20:25] minrk1 it certainly would
[20:25] mikko but they could be messages batched at the start
[20:25] mikko unlikely
[20:25] minrk1 Because if you send from ROUTER to DEALER, knowing the identity of the DEALER, but not having received a message from it, it can be that the handshake is incomplete, and the ROUTER discards unroutable messages
[20:27] mikko so, assuming that i sleep after first message the rest should go ok?
[20:27] mikko i seem to be losing 1% - 5% of messages
[20:27] minrk1 If it's the problem I'm describing, it should be a relatively constant number
[20:28] minrk1 that is indeed what cremes saw - ~950-960/1000, but only 3/50
[20:28] mikko Loaded 102223 messages from store
[20:28] mikko consumed 1000
[20:28] mikko and
[20:28] mikko Loaded 102180 messages from store
[20:29] mikko so fairly large chunk lost
[20:29] minrk1 ~45? - the same number cremes is seeing with 50 messages and 1000
[20:30] mikko if i sleep at the end of the consumer
[20:30] mikko all goes well
[20:30] mikko i was debugging this on the server side
[20:30] mikko and i see a huge lump of messages coming at the end
[20:30] mikko so it looks like batching
[20:30] mikko but linger should prevent that batch from getting lost
[20:30] minrk1 not on the consumer
[20:31] mikko sorry, in my case consumer is confusing
[20:31] minrk1 on the receiver, that is
[20:31] mikko server sends message, consumer receives, consumer sends ack, server receives ack
[20:31] mikko and the ack messages get lost
[20:32] minrk1 consumer is DEALER, server ROUTER?
[20:32] minrk1 or no...
[20:32] mikko server is dealer
[20:32] mikko consumer is router
[20:32] minrk1 ok, that makes more sense
[20:33] minrk1 and this cycle happens many times in the life of one ROUTER socket?
[20:33] mikko yes
[20:33] mikko it can happen
[20:34] mikko but i'm testing with: start consumer, consume 1000, exit
[20:34] mikko exit is not hard exit
[20:34] mikko it should allow zmq_term to happen
[20:34] minrk1 and all 1000 are on a single socket
[20:34] mikko yes
[20:34] minrk1 and sleep before consumer.close() at the very end solves it?
[20:35] mikko i don't call close explicitly but effectively yes
[20:35] mikko they are on stack so they should go out of scope
[20:35] minrk1 wait, you don't call close?
[20:36] mikko not explicitly
[20:36] mikko ~socket_t () calls close
[20:36] minrk1 call close before term
[20:36] minrk1 oh, cpp bindings
[20:36] mikko yes
[20:36] mikko it happens with php bindings as well
[20:36] mikko which use the C api
[20:37] minrk1 but you do call term?
[20:37] mikko ~context_t () terminates
[20:37] mikko https://github.com/zeromq/zeromq2-1/blob/master/include/zmq.hpp#L187
[20:37] whitej the context was created on the stack before the socket... so it should be getting cleaned up after the socket
[20:38] minrk1 that makes sense
[20:38] minrk1 you might try calling close then term explicitly, just to be sure
[20:38] whitej and now I'm stuck with...
[20:38] whitej $ lsof -i tcp:5555 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME zmq_serve 19187 whitej 15u IPv4 24031 0t0 TCP *:5555 (LISTEN)
[20:39] mikko minrk1: there is no explicit term
[20:39] mikko but let me try to close socket
[20:39] mikko in any case this should already happen
[20:39] mikko because ~socket_t () calls close
[20:39] mikko and ~context_t calls term
[20:39] mikko if there are sockets open during term it would block
[20:39] mikko but let's see
[20:41] mikko explicit socket.close ();
[20:41] mikko Loaded 100255 messages from store
[20:41] mikko and Loaded 99266 messages from store
[20:41] mikko 11 still lost
[20:41] mikko which is approx what i am seeing without the close as well
[20:41] mikko it varies
[20:42] minrk1 ok
[20:42] mikko with sleep (1); consistent 1k
[20:42] minrk1 then it just sounds like ROUTER sockets don't respect LINGER somehow
[20:45] mikko minrk1: that's what i am thinking as well
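
What "respecting LINGER" would mean here, as a hedged sketch (the option value shown is simply the documented default): with an infinite linger, closing the socket and terminating the context is supposed to block until queued outbound messages are delivered, which is exactly what the lost ACKs suggest is not happening:

    #include <zmq.hpp>

    void shutdown_sender (zmq::socket_t &socket)
    {
        int linger = -1;   // -1 = wait forever; this is already the default
        socket.setsockopt (ZMQ_LINGER, &linger, sizeof (linger));
        socket.close ();   // close () returns immediately; the later
                           // zmq_term () / ~context_t () is what should
                           // block until the pending ACKs are delivered
    }
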
[20:46] minrk1 Do you have an isolated test case, ideally with few messages that still causes the problem?
[20:47] whitej my guess is that I need to add a signal handler for ctrl c
[20:47] whitej but yes.. have a test case
[20:47] mikko minrk1: i can't isolate this into a small case
[20:48] mikko whitej: you need to close all sockets before terminating context
[20:53] whitej mikko: ah, ok... makes sense
[21:03] whitej odd... sig handler is in there
[21:04] whitej if I put the explicit close... it works
[21:04] whitej if I leave that out.. it borks
[21:04] whitej ~socket_t should be doing that close before the context terminates
[21:05] whitej so I am expecting that I do not need the explicit close
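
A sketch of the ordering being discussed with the C++ wrapper; the socket type and endpoint are placeholders. One possible explanation for the "Address already in use" symptom is the one mikko gives above: a socket left open makes zmq_term block, the old process never exits, and it keeps holding the port (which matches the lsof output showing zmq_serve still listening):

    #include <zmq.hpp>

    int main ()
    {
        zmq::context_t ctx (1);                  // destroyed last; ~context_t calls term
        {
            zmq::socket_t server (ctx, ZMQ_REP); // placeholder socket type
            server.bind ("tcp://*:5555");
            // ... serve requests ...
            server.close ();                     // explicit close; otherwise
                                                 // ~socket_t does it at end of scope
        }                                        // every socket closed before term
        return 0;                                // ~context_t calls zmq_term
    }
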
[23:05] staticfloat Hey guys, I'm trying to do ROUTER <-> ROUTER messaging using custom routing. I've set an identity on one of the ROUTER sockets, and then connected to it from another socket, but when I send to that socket I just connected to, I never receive the message.
[23:06] staticfloat So to state that a little more clearly, I send from an anonymous ROUTER to "A", and I can see through wireshark that my packet to "A" does indeed contain the data I want it to,
[23:06] staticfloat but "A" never returns from recv()
[23:07] minrk 1. make sure you set identity before you call bind/connect
[23:08] staticfloat I set an identity before I bound, and this allows me to see the packet on the network. Before, when I didn't do that, I didn't see any packets being sent
[23:08] minrk ok, so the message is sent
[23:08] staticfloat To be clear, I'm setting an identity on only one ROUTER right now, and then connecting to that ROUTER using a different ROUTER.
[23:08] staticfloat And I'm using 3.0
[23:11] minrk what is the full message you are sending?
[23:12] staticfloat Right now I'm sending two message parts: first the identity, "tcp://127.0.0.1:5040", and then the identity again as a second message part.
[23:12] staticfloat I have tried different values for both identity, and message contents, nothing seems to make a difference
[23:15] staticfloat I would like to debug the zmq sources maybe get a better idea of what's going on inside, however xcode doesn't seem to want to load the debugging information from libzmq.a
[23:15] staticfloat Does anyone have experience with this kind of thing?
[23:16] staticfloat Failing that, any tips on why my ROUTER on the receiving side seems to be dropping packets would be nice. :)
[23:17] minrk I haven't used xcode much, but I do use ROUTER-ROUTER connections regularly
[23:18] minrk you said two message parts, but you described 3 - identity, url, identity
[23:19] staticfloat My socket's identity is "tcp://127.0.0.1:5040", and I send that twice
[23:19] staticfloat once for the outbound ROUTER socket to route with,
[23:19] minrk oh, sorry
[23:19] staticfloat once to be received by the other ROUTER
[23:19] staticfloat as data
[23:20] minrk with SNDMORE correctly?
[23:21] minrk are you sending immediately after connecting?
[23:21] staticfloat Yes, I am sending immediately after connecting
[23:21] staticfloat Should I sleep(1) or something?
[23:21] minrk try sticking a sleep there, just to test
[23:22] minrk ROUTERs need to handshake before they notice an endpoint exists
[23:40] staticfloat Alright
[23:40] staticfloat sleeping makes it work now
[23:41] staticfloat Thank you
[23:41] staticfloat :)
[23:41] minrk1 sure
[23:41] minrk1 XREP sockets (now known as ROUTER) were designed with replying in mind
[23:42] minrk1 so if they try to send a message to a peer that doesn't exist, they just discard the message, since the requester is gone
[23:43] minrk1 sending a message from a ROUTER before the handshake completes is the same as sending a reply to a peer that has shut down - it is discarded.
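
Pulling the thread together, a sketch of the working version of staticfloat's setup (3.0-era API; the identity and endpoint come from the conversation, everything else is illustrative): set the identity before bind, connect the second ROUTER, give the handshake a moment, then send the routing frame and payload:

    #include <zmq.hpp>
    #include <cstring>
    #include <unistd.h>

    int main ()
    {
        zmq::context_t ctx (1);

        // Receiving ROUTER: identity must be set *before* bind.
        zmq::socket_t a (ctx, ZMQ_ROUTER);
        a.setsockopt (ZMQ_IDENTITY, "tcp://127.0.0.1:5040", 20);
        a.bind ("tcp://127.0.0.1:5040");

        // Anonymous ROUTER connecting to "A".
        zmq::socket_t b (ctx, ZMQ_ROUTER);
        b.connect ("tcp://127.0.0.1:5040");

        sleep (1);   // let the identity handshake finish; otherwise the
                     // ROUTER silently discards the "unroutable" message

        const char *id = "tcp://127.0.0.1:5040";
        zmq::message_t addr (strlen (id));
        memcpy (addr.data (), id, strlen (id));
        b.send (addr, ZMQ_SNDMORE);              // routing frame: peer identity

        zmq::message_t body (strlen (id));
        memcpy (body.data (), id, strlen (id));
        b.send (body);                           // payload (the identity again)

        // The receiving ROUTER yields the sender's auto-generated identity
        // frame first, then the payload.
        zmq::message_t part;
        a.recv (&part);    // sender identity
        a.recv (&part);    // payload
        return 0;
    }
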