IRC Log

Friday September 30, 2011

[Time] Name Message
[07:33] mikko sustrik: are you there?
[07:47] sustrik mikko: hi
[07:47] mikko sustrik: want to try to reproduce this lost ack problem?
[07:48] sustrik nope. have you uploaded the steps to reproduce?
[07:48] mikko i can give them here
[07:48] mikko i added a small piece of test code into the pzq repo
[07:49] mikko that allows reproducing it
[07:49] sustrik please, create a ticket otherwise it'll get lost
[07:49] mikko ok, even though i'm not 100% sure it's a zeromq issue
[07:49] mikko looks so though
[07:49] sustrik sure
[07:49] mikko i don't understand it yet
[07:49] mikko as if i sleep at the end of the consumer code all ACKs get sent to the server
[07:50] mikko but without sleep they disappear somewhere
[07:50] sustrik yes, i recall that
[07:50] sustrik some logging should be added to debug the thing
[07:51] sustrik what kind of connection is that btw?
[07:51] sustrik router/router?
[07:54] mikko LIBZMQ-264
[07:55] mikko there are a few
[07:55] mikko DEALERS and ROUTERS
[07:55] mikko see LIBZMQ-264
[07:55] mikko sustrik: i added some debug code on the device
[07:55] mikko the device is the border that receives the ACK first from client
[07:55] sustrik i mean the connection losing the acks
[07:55] mikko and it seems that it doesn't get even there
[07:56] mikko yes, i added some debug code on server side
[07:56] mikko and it looks like the messages don't reach the server
[07:56] mikko they do reach the server if i sleep (1)
[07:56] mikko and tons of them come at the end in one big lump
[07:56] sustrik ok
[07:57] mikko https://github.com/mkoppanen/pzq/blob/master/tests/consumer.cpp
[07:57] mikko the consumer code is fairly simple
[07:58] sustrik what about the sender?
[07:58] mikko that is the sender as well
[07:58] mikko it consumes message and ACKs it
[07:58] mikko using ROUTER -> DEALER
[07:58] sustrik ok
[08:00] mikko in the consumer.cpp if i add sleep (1) just before return 0; it all goes well
[08:02] mikko i'm interested in seeing if it's just me
[08:06] sustrik i'll give it a try later on
[09:08] jd10 https://github.com/kro/zeromq-scala-bindings
[09:10] sustrik jd10: nice
[09:10] sustrik you should link it from zero.mq site
[09:10] sustrik so that people can find it
[09:10] jd10 aight, i'll add it to the bindings page
[09:11] jd10 the scala bindings use jna so there's no need for jzmq if you're using scala
[09:11] sustrik ack
[09:15] jd10 for sake of consistency, now renamed to https://github.com/kro/zeromq-scala-binding
[12:35] guido_g cool
[12:37] guido_g even cooler, w/o the java bindings
[13:28] sustrik mikko: there?
[13:32] mikko sustrik: y
[13:32] sustrik i'm trying to install pzq
[13:32] sustrik having problems with boost
[13:33] mikko whats the problem?
[13:33] sustrik ok, solved
[13:33] sustrik i forgot to install some of the boost packages
[13:33] mikko the cmake scripts might not be quite cross platform yet
[13:33] sustrik sorry
[13:35] sustrik mikko: still, what is it about the build directory?
[13:35] sustrik i can't run cmake there
[13:35] sustrik as there's no cmake build file there
[13:36] sustrik $ git clone git://github.com/mkoppanen/pzq.git
[13:36] sustrik $ cd pzq
[13:36] sustrik $ mkdir build
[13:36] sustrik $ cd build
[13:36] sustrik $ cmake .. -DZEROMQ_ROOT=/path/to -DKYOTOCABINET_ROOT=/path/to
[13:36] sustrik $ make
[13:38] sustrik it seems to build ok in the main pzq directory though
[13:38] skm when a router connects to another router for the first time - is that when identity information is passed and stored?
[13:38] sustrik skm: yes
[13:39] skm ok cool - that has done my head in for AGES
[13:39] skm my server router that does the 'bind' has the name 'server1' and if it disconnects it can't rebind on the same ip/port with a different identity
[13:40] skm ('server1' is an example, i was actually using 'processid@ip:port')
[13:40] skm when my server died and restarted, all clients could no longer talk to it because they obviously were told about the initial processid only
[13:40] mikko sustrik: it's to keep build artifacts in one dir
[13:41] mikko sustrik: you can reset to original state just by removing build directory
[13:41] sustrik mikko: yes, i mean the steps above don't work
[13:41] sustrik it's ok when i build in main dir
[13:41] mikko ok, will look into that
[13:41] mikko might be cmake version difference or something
[13:41] sustrik no big problem
[13:42] mikko did you get it running?
[13:42] sustrik just that cmake looks for CMakeLists in the current dir
[13:42] sustrik so it obviously doesn't work in build dir
[13:42] mikko cmake .. should work
[13:42] sustrik which is empty
[13:42] mikko as that looks for one dir up
[13:42] sustrik ah
[13:42] mikko cmake [options] <path-to-source>
[13:42] mikko cmake [options] <path-to-existing-build>
[13:42] sustrik anyway, no big problem
[13:43] sustrik skm: yes, it works that way
[13:43] mikko yeah, doesn't really matter as you can easily clean with git as well if needed
[13:43] sustrik ack
[13:43] mikko sustrik: but it built ok otherwise?
[13:43] sustrik looks like
[13:43] zirpu i know 0mq doesn't have any security layer built in, but is there a way for the client/server to see the connecting server/client IP in order to implement an ACL?
[13:43] skm sustrik is there any way around it or should i just use ip:port as the ID?
[13:43] mikko sustrik: if you run ./pzq it should say "0 messages loaded from store"
[13:43] zirpu probably not, but just thought i'd ask here before the mailing list.
[13:43] sustrik it does
[13:44] sustrik mikko: it does
[13:44] mikko so if you run ./producer it should pump in 10k messages
[13:44] mikko and after that running ./consumer should consume 1k
[13:44] mikko after consuming you can press ctrl c on the pzq
[13:44] mikko and restart it
[13:44] mikko it should say 9000 messages loaded
[13:44] mikko but in my case it's something like 9021 or so
[13:45] sustrik skm: not sure what it should do. if you restart the server with a different identity it's basically a different app, so it doesn't make much sense to reconnect clients to it
[13:45] sustrik zirpu: probably not
[13:46] skm the clients reconnect automatically to it because they are aware of the ip/port combo
[13:46] zirpu sustrik: cool. thanks.
[13:46] skm you are right it is a different app - but the clients don't know that
[13:46] skm because it's on the same ip/port
[13:46] skm and they have once before connected to it
[13:46] skm can you turn auto reconnect off?
[13:46] sustrik skm: if you want to use ip/port to identify the server just don't set the identity on the server
[13:47] sustrik then it's identified solely by ip/port
[13:47] skm sustrik: i actually want to identify it as a new/different app (but using the same ip/port) but the clients can't talk to it if it has a different identity
[13:48] skm and if it has the same identity some old msgs can be received on it
[13:48] sustrik mikko: 9004
[13:48] sustrik skm: exactly
[13:48] skm i've had to make one of the message parts a filter for the server process id, so when the server starts up again with its ip/port identity it only looks at msgs received with its process id in them, not a previous one
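
A rough sketch of the workaround skm describes, assuming a ROUTER server and a frame layout of [client identity][server pid][payload]; the function names, frame order and loop structure here are illustrative, not skm's actual code:

    // Illustrative only: drop messages that were addressed to a previous
    // incarnation of the server by matching a pid frame against getpid ().
    #include <zmq.hpp>
    #include <sstream>
    #include <string>
    #include <unistd.h>

    static bool for_this_incarnation (zmq::message_t &pid_frame)
    {
        std::ostringstream me;
        me << getpid ();
        std::string got (static_cast<char *> (pid_frame.data ()), pid_frame.size ());
        return got == me.str ();
    }

    void serve (zmq::socket_t &router)
    {
        while (true) {
            zmq::message_t identity, pid, payload;
            router.recv (&identity);   // client identity frame added by ROUTER
            router.recv (&pid);        // the "server pid" filter part
            router.recv (&payload);
            if (!for_this_incarnation (pid))
                continue;              // stale message meant for an old process
            // ... handle payload, reply via the identity frame ...
        }
    }
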
[13:50] mikko sustrik: thats ok
[13:50] mikko sustrik: do it a few times
[13:50] mikko sustrik: some messages might expire
[13:51] sustrik mikko: after consuming 2000 more messages: 7035
[13:52] mikko yeah
[13:52] mikko thats what i see as well
[13:52] sustrik ok
[13:52] mikko now, tests/consumer.cpp
[13:52] mikko add sleep (1); before return 0;
[13:52] mikko and rebuild
[13:52] mikko you should see fairly consistent consuming
[13:52] sustrik ok
[13:53] mikko might be 2 missing because they have expired (two get pushed before hwm is reached)
[13:54] mikko and from what i have debugged those ACKs never reach the pzq daemon
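
A minimal sketch of the tail end of such a consumer, to make the workaround concrete; this is not the real tests/consumer.cpp, and the socket type, endpoint and frame layout are placeholders:

    // Sketch: receive a message, ACK it, and sleep before the sockets go
    // out of scope. Without the sleep the last lump of queued ACKs is lost.
    #include <zmq.hpp>
    #include <cstring>
    #include <unistd.h>

    int main ()
    {
        zmq::context_t ctx (1);
        zmq::socket_t s (ctx, ZMQ_DEALER);        // placeholder socket type
        s.connect ("tcp://127.0.0.1:11131");      // placeholder endpoint

        for (int i = 0; i < 1000; i++) {
            zmq::message_t id;
            s.recv (&id);                         // message id frame
            // ... receive any remaining parts and process them ...

            zmq::message_t ack (id.size ());
            memcpy (ack.data (), id.data (), id.size ());
            s.send (ack);                         // ACK back to the daemon
        }

        sleep (1);   // the workaround under discussion; the default LINGER
                     // should make termination flush these ACKs anyway
        return 0;    // ~socket_t and ~context_t (close + term) run here
    }
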
[13:54] sustrik what does "expired" mean?
[13:56] sustrik skm: yes, the identity semantics don't work well in corner cases, that's why identities were removed in development trunk
[13:57] mikko sustrik: the pzq sends the message to consumer
[13:57] mikko and waits for ACK for N amount of time
[13:57] mikko and if the ACK doesn't come it considers the consumer dead
[13:57] mikko and schedules the message for redelivery
[13:58] sustrik ah, so "expired" = "scheduled for resend"
[13:58] mikko yes
[13:58] sustrik got it
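
Conceptually the expiry described above amounts to something like the following; this is purely illustrative and not pzq's implementation (names and data structures are invented):

    // Track when each message was handed to a consumer; anything that has
    // not been ACKed within ack_timeout is scheduled for redelivery.
    #include <map>
    #include <string>
    #include <stdint.h>

    struct in_flight_t {
        std::map<std::string, uint64_t> sent_at;   // message id -> send time (µs)

        void acked (const std::string &id) { sent_at.erase (id); }

        void expire (uint64_t now_us, uint64_t ack_timeout_us) {
            std::map<std::string, uint64_t>::iterator it = sent_at.begin ();
            while (it != sent_at.end ()) {
                if (now_us - it->second > ack_timeout_us) {
                    // no ACK in time: consider the consumer dead and
                    // put the message back on the delivery queue
                    sent_at.erase (it++);
                } else
                    ++it;
            }
        }
    };
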
[13:59] sustrik mikko: how do i clean the DB btw?
[13:59] mikko rm /tmp/sink.kch
[13:59] skm sustrik how will/does that work? how does moving messages between xrep rep req xreq work with no identities?
[13:59] mikko thats the default path
[13:59] mikko there is no programmatic way for now
[13:59] sustrik skm: there are auto-generated identities
[13:59] sustrik always unique
[14:00] sustrik mikko: ok
[14:00] mikko sustrik: did you test with sleep (1) ?
[14:00] sustrik nope
[14:00] mikko you can increase --ack-timeout if you don't want expiries
[14:01] mikko it's microseconds
[14:01] sustrik i believe you it works :)
[14:01] mikko the behaviour i see is that tons of those ACKs come in a lump at the end
[14:01] mikko and without sleep that lump is lost
[14:01] sustrik ok
[14:01] mikko and as there is no linger on the sender side i would expect it to block at the end
[14:26] mikko sustrik: did you look any further?
[17:14] mikko sustrik: there?
[19:35] cremes i need some help confirming a bug in the latest master for zeromq-3_0
[19:36] cremes using the local_thr and remote_thr throughput tests, I am showing that the receiver
[19:36] cremes (local_thr) hangs because it doesn't receive all of the messages
[19:36] cremes it looks like it misses around 50 messages at the tail end
[19:38] minrk1 parameters? I just ran the test with current 3.0, and it finished
[19:41] cremes minrk1: try tcp://127.0.0.1:5555 1024 100000
[19:42] minrk1 success
[19:43] minrk1 (success as in no hang, not success as in reproduced your bug)
[19:43] cremes hmmm...
[19:43] cremes what os are you on?
[19:43] cremes i'm on osx, that's where i see this happening
[19:43] minrk1 OSX - building now on Linux
[19:44] cremes the same code completes when i load up 2.1.x
[19:44] cremes i updated to latest master so i should have all fixes
[19:45] minrk1 weird
[19:45] cremes very
[19:49] cremes no SNDHWM or RCVHWM is set...
[19:49] minrk1 the sender exits fine?
[19:51] cremes yes
[19:52] cremes every zmq_* returns 0 (or the number of bytes sent)
[19:52] cremes no errors
[19:52] cremes and i print from inside the loop to confirm it spits out all messages
[19:53] minrk1 I don't suppose there might be another sender adding a few messages on the same port
[19:53] cremes no
[19:54] minrk1 so you have a count of the missed messages?
[19:54] cremes yes
[19:54] cremes when i use 1000, the receiver gets from 950 to 965 of them
[19:54] cremes the publisher prints all 1000
[19:57] cremes i'll look at it more this weekend
[19:57] minrk1 I feel like I've seen similar before, but it's been a long time
[19:58] cremes just saw something interesting...
[19:58] cremes i set local_thr to receive 10 messages
[19:58] cremes and i set remote_thr to send 10
[19:58] cremes local_thr did not see any at all
[19:58] cremes so i bumped remote_thr to 50
[19:58] cremes it sent all of them but local_thr only saw 3 messages
[19:59] cremes sounds like something is getting buffered someplace
[20:00] whitej greetings.. qq to the general populace
[20:00] minrk1 cremes: what if you put sleeps after connect and before close, does that change anything?
[20:01] whitej zmq::context_t's and zmq::socket_t's created on the stack should clean up and unwind themselves
[20:01] whitej correct?
[20:01] whitej (using C++ interface)
[20:01] cremes minrk1: that worked
[20:01] minrk1 okay, check which one matters
[20:02] cremes in remote_thr, if i comment out the sleep after zmq_connect(), messages are dropped
[20:02] cremes so apparently it starts transmitting before the connection is fully established
[20:02] minrk1 it would seem that way
[20:02] cremes (good suggestion btw!)
[20:03] cremes odd that you don't see this too though
[20:03] cremes you're using 3.02, yes?
[20:04] minrk1 could be your computer is enough faster (or slower) than mine that the window is only open for you
[20:04] cremes crazy... i wonder how i can write this up when i don't have a repro that works on other computers
[20:05] minrk1 current git master (24bc1e510e191ad27fddae37a8714efab2911b47)
[20:05] cremes for all i know, it has to do with a race condition with the xsub sending its subscription up to the xpub
[20:05] cremes yes, mine matches that hash
[20:06] cremes that actually makes sense...
[20:06] cremes it starts transmitting before any filter is set; a null filter causes it to drop those messages
[20:06] minrk1 that sounds exactly right
[20:06] cremes then the filter arrives, gets set, and the messages are forwarded
[20:07] cremes ugh, what a headache! :)
[20:07] minrk1 ahh!
[20:07] minrk1 there's a known issue that things do not behave correctly when XSUB binds, and XPUB connects
[20:08] cremes ah yes, that's it then
[20:09] minrk1 https://zeromq.jira.com/browse/LIBZMQ-248
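
For reference, a sketch of the window cremes hit: the sender connects and transmits immediately, and everything sent before the subscription has propagated back upstream is silently filtered out. This is not the actual perf/remote_thr code; endpoint, message size and count are placeholders:

    #include <zmq.hpp>
    #include <unistd.h>

    int main ()
    {
        zmq::context_t ctx (1);
        zmq::socket_t pub (ctx, ZMQ_PUB);
        pub.connect ("tcp://127.0.0.1:5555");   // the SUB side binds and
                                                // subscribes to ""

        sleep (1);   // workaround: let the connection and subscription
                     // handshake finish before sending

        for (int i = 0; i < 1000; i++) {
            zmq::message_t msg (1024);
            pub.send (msg);   // without the sleep the first few dozen of
                              // these are dropped (no subscription yet)
        }
        return 0;
    }
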
[20:10] whitej anyone know if there is a specific way to close a socket_t with the C++ API
[20:10] whitej playing around with this stuff and the server cannot rebind to the tcp port
[20:11] whitej looking at zmq.hpp... it appears that the socket_t and context_t clean up and terminate correctly in the destructors
[20:11] minrk1 looks that way
[20:11] minrk1 you can call socket_t.close() directly, I think
[20:12] whitej seeing an "Address already in use" exception
[20:12] mikko cremes: i'm seeing the same issue
[20:12] mikko cremes: 20:54 < cremes> when i use 1000, the receiver gets from 950 to 965 of them
[20:12] mikko cremes: LIBZMQ-264
[20:13] mikko but with ROUTER/DEALER
[20:21] cremes mikko: interesting... i'll write up a test using ROUTER/DEALER to see if i get the same
[20:23] mikko cremes: i wasn't able to reproduce the issue with simple tests
[20:23] mikko cremes: but with pzq i get it consistently
[20:23] mikko in this case the messages get batched
[20:23] mikko and a large batch seems to be sent at the end
[20:23] mikko unless i sleep(1) at the end of the script the batch is lost
[20:23] minrk1 mikko: were you sending unprompted messages from ROUTER sockets immediately after connection?
[20:24] mikko minrk1: i seem to be losing the messages at tail end
[20:24] minrk1 ah, so not the front, like cremes?
[20:25] mikko i'm quite sure at the end
[20:25] mikko but i can check
[20:25] mikko sleep (1) at the end would suggest that it's in the end
[20:25] minrk1 it certainly would
[20:25] mikko but they could be messages batched at the start
[20:25] mikko unlikely
[20:25] minrk1 Because if you send from ROUTER to DEALER, knowing the identity of the DEALER, but not having received a message from it, it can be that the handshake is incomplete, and the ROUTER discards unroutable messages
[20:27] mikko so, assuming that i sleep after first message the rest should go ok?
[20:27] mikko i seem to be losing 1% - 5% of messages
[20:27] minrk1 If it's the problem I'm describing, it should be a relatively constant number
[20:28] minrk1 that is indeed what cremes saw - ~950-960/1000, but only 3/50
[20:28] mikko Loaded 102223 messages from store
[20:28] mikko consumed 1000
[20:28] mikko and
[20:28] mikko Loaded 102180 messages from store
[20:29] mikko so fairly large chunk lost
[20:29] minrk1 ~45? - the same number cremes is seeing with 50 messages and 1000
[20:30] mikko if i sleep at the end of the consumer
[20:30] mikko all goes well
[20:30] mikko i was debugging this on the server side
[20:30] mikko and i see a huge lump of messages coming at the end
[20:30] mikko so it looks like batching
[20:30] mikko but linger should prevent that batch from getting lost
[20:30] minrk1 not on the consumer
[20:31] mikko sorry, in my case consumer is confusing
[20:31] minrk1 on the receiver, that is
[20:31] mikko server sends message, consumer receives, consumer sends ack, server receives ack
[20:31] mikko and the ack messages get lost
[20:32] minrk1 consumer is DEALER, server ROUTER?
[20:32] minrk1 or no...
[20:32] mikko server is dealer
[20:32] mikko consumer is router
[20:32] minrk1 ok, that makes more sense
[20:33] minrk1 and this cycle happens many times in the life of one ROUTER socket?
[20:33] mikko yes
[20:33] mikko it can happen
[20:34] mikko but i'm testing with: start consumer, consume 1000, exit
[20:34] mikko exit is not hard exit
[20:34] mikko it should allow zmq_term to happen
[20:34] minrk1 and all 1000 are on a single socket
[20:34] mikko yes
[20:34] minrk1 and sleep before consumer.close() at the very end solves it?
[20:35] mikko i don't call close explicitly but effectively yes
[20:35] mikko they are on stack so they should go out of scope
[20:35] minrk1 wait, you don't call close?
[20:36] mikko not explicitly
[20:36] mikko ~socket_t () calls close
[20:36] minrk1 call close before term
[20:36] minrk1 oh, cpp bindings
[20:36] mikko yes
[20:36] mikko it happens with php bindings as well
[20:36] mikko which use the C api
[20:37] minrk1 but you do call term?
[20:37] mikko ~context_t () terminates
[20:37] mikko https://github.com/zeromq/zeromq2-1/blob/master/include/zmq.hpp#L187
[20:37] whitej the context was created on the stack before the socket... so it should be getting cleaned up after the socket
[20:38] minrk1 that makes sense
[20:38] minrk1 you might try calling close then term explicitly, just to be sure
[20:38] whitej and now I'm stuck with...
[20:38] whitej $ lsof -i tcp:5555 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME zmq_serve 19187 whitej 15u IPv4 24031 0t0 TCP *:5555 (LISTEN)
[20:39] mikko minrk1: there is no explicit term
[20:39] mikko but let me try to close socket
[20:39] mikko in any case this should already happen
[20:39] mikko because ~socket_t () calls close
[20:39] mikko and ~context_t calls term
[20:39] mikko if there are sockets open during term it would block
[20:39] mikko but let's see
[20:41] mikko explicit socket.close ();
[20:41] mikko Loaded 100255 messages from store
[20:41] mikko and Loaded 99266 messages from store
[20:41] mikko 11 still lost
[20:41] mikko which is approx what i am seeing without the close as well
[20:41] mikko it varies
[20:42] minrk1 ok
[20:42] mikko with sleep (1); consistent 1k
[20:42] minrk1 then it just sounds like ROUTER sockets don't respect LINGER somehow
[20:45] mikko minrk1: that's what i am thinking as well
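
What "respecting LINGER" would mean here, as a hedged sketch (the option value shown is simply the documented default): with an infinite linger, closing the socket and terminating the context is supposed to block until queued outbound messages are delivered, which is exactly what the lost ACKs suggest is not happening:

    #include <zmq.hpp>

    void shutdown_sender (zmq::socket_t &socket)
    {
        int linger = -1;   // -1 = wait forever; this is already the default
        socket.setsockopt (ZMQ_LINGER, &linger, sizeof (linger));
        socket.close ();   // close () returns immediately; the later
                           // zmq_term () / ~context_t () is what should
                           // block until the pending ACKs are delivered
    }
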
[20:46] minrk1 Do you have an isolated test case, ideally with few messages that still causes the problem?
[20:47] whitej my guess is that I need to add a signal handler for ctrl c
[20:47] whitej but yes.. have a test case
[20:47] mikko minrk1: i can't isolate this into a small case
[20:48] mikko whitej: you need to close all sockets before terminating context
[20:53] whitej mikko: ah, ok... makes sense
[21:03] whitej odd... sig handler is in there
[21:04] whitej if I put the explicit close... it works
[21:04] whitej if I leave that out.. it borks
[21:04] whitej ~socket_t should be doing that close before the context terminates
[21:05] whitej so I am expecting that I do not need the explicit close
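
A sketch of the ordering being discussed with the C++ wrapper; the socket type and endpoint are placeholders. One possible explanation for the "Address already in use" symptom is the one mikko gives above: a socket left open makes zmq_term block, the old process never exits, and it keeps holding the port (which matches the lsof output showing zmq_serve still listening):

    #include <zmq.hpp>

    int main ()
    {
        zmq::context_t ctx (1);                  // destroyed last; ~context_t calls term
        {
            zmq::socket_t server (ctx, ZMQ_REP); // placeholder socket type
            server.bind ("tcp://*:5555");
            // ... serve requests ...
            server.close ();                     // explicit close; otherwise
                                                 // ~socket_t does it at end of scope
        }                                        // every socket closed before term
        return 0;                                // ~context_t calls zmq_term
    }
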
[23:05] staticfloat Hey guys, I'm trying to do ROUTER <-> ROUTER messaging using custom routing. I've set an identity on one of the ROUTER sockets, and then connected to it from another socket, but when I send to that socket I just connected to, I never receive the message.
[23:06] staticfloat So to state that a little more clearly, I send from an anonymous ROUTER to "A", and I can see through wireshark that my packet to "A" does indeed contain the data I want it to,
[23:06] staticfloat but "A" never returns from recv()
[23:07] minrk 1. make sure you set identity before you call bind/connect
[23:08] staticfloat I set an identity before I bound, and this allows me to see the packet on the network. Before, when I didn't do that, I didn't see any packets being sent
[23:08] minrk ok, so the message is sent
[23:08] staticfloat To be clear, I'm setting an identity on only one ROUTER right now, and then connecting to that ROUTER using a different ROUTER.
[23:08] staticfloat And I'm using 3.0
[23:11] minrk what is the full message you are sending?
[23:12] staticfloat Right now I'm sending two message parts: first the identity, "tcp://127.0.0.1:5040", and then the identity again as a second message part.
[23:12] staticfloat I have tried different values for both identity, and message contents, nothing seems to make a difference
[23:15] staticfloat I would like to debug the zmq sources maybe get a better idea of what's going on inside, however xcode doesn't seem to want to load the debugging information from libzmq.a
[23:15] staticfloat Does anyone have experience with this kind of thing?
[23:16] staticfloat Failing that, any tips on why my ROUTER on the receiving side seems to be dropping packets would be nice. :)
[23:17] minrk I haven't used xcode much, but I do use ROUTER-ROUTER connections regularly
[23:18] minrk you said two message parts, but you described 3 - identity, url, identity
[23:19] staticfloat My socket's identity is "tcp://127.0.0.1:5040", and I send that twice
[23:19] staticfloat once for the outbound ROUTER socket to route with,
[23:19] minrk oh, sorry
[23:19] staticfloat once to be received by the other ROUTER
[23:19] staticfloat as data
[23:20] minrk with SNDMORE correctly?
[23:21] minrk are you sending immediately after connecting?
[23:21] staticfloat Yes, I am sending immediately after connecting
[23:21] staticfloat Should I sleep(1) or something?
[23:21] minrk try sticking a sleep there, just to test
[23:22] minrk ROUTERs need to handshake before they notice an endpoint exists
[23:40] staticfloat Alright
[23:40] staticfloat sleeping makes it work now
[23:41] staticfloat Thank you
[23:41] staticfloat :)
[23:41] minrk1 sure
[23:41] minrk1 XREP sockets (now known as ROUTER) were designed with replying in mind
[23:42] minrk1 so if they try to send a message to a peer that doesn't exist, they just discard the message, since the requester is gone
[23:43] minrk1 sending a message from a ROUTER before the handshake completes is the same as sending a reply to a peer that has shut down - it is discarded.
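
Pulling the thread together, a sketch of the working version of staticfloat's setup (3.0-era API; the identity and endpoint come from the conversation, everything else is illustrative): set the identity before bind, connect the second ROUTER, give the handshake a moment, then send the routing frame and payload:

    #include <zmq.hpp>
    #include <cstring>
    #include <unistd.h>

    int main ()
    {
        zmq::context_t ctx (1);

        // Receiving ROUTER: identity must be set *before* bind.
        zmq::socket_t a (ctx, ZMQ_ROUTER);
        a.setsockopt (ZMQ_IDENTITY, "tcp://127.0.0.1:5040", 20);
        a.bind ("tcp://127.0.0.1:5040");

        // Anonymous ROUTER connecting to "A".
        zmq::socket_t b (ctx, ZMQ_ROUTER);
        b.connect ("tcp://127.0.0.1:5040");

        sleep (1);   // let the identity handshake finish; otherwise the
                     // ROUTER silently discards the "unroutable" message

        const char *id = "tcp://127.0.0.1:5040";
        zmq::message_t addr (strlen (id));
        memcpy (addr.data (), id, strlen (id));
        b.send (addr, ZMQ_SNDMORE);              // routing frame: peer identity

        zmq::message_t body (strlen (id));
        memcpy (body.data (), id, strlen (id));
        b.send (body);                           // payload (the identity again)

        // The receiving ROUTER yields the sender's auto-generated identity
        // frame first, then the payload.
        zmq::message_t part;
        a.recv (&part);    // sender identity
        a.recv (&part);    // payload
        return 0;
    }
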