Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print a message if getaddrinfo is slow on startup #4600

Closed
danielmewes opened this issue Jul 30, 2015 · 13 comments
Closed

Print a message if getaddrinfo is slow on startup #4600

danielmewes opened this issue Jul 30, 2015 · 13 comments
Labels
Milestone

Comments

@danielmewes
Copy link
Member

On server startup, we call get_local_ips which in turn calls getaddrinfo for a few host names.

Under some circumstances, this step can take a while. It is done before we print any messages to the log or console, so this can be confusing. @deontologician ran into a case where the server was taking 15-30 seconds to come up, and we didn't know what was going on at first.

It would be nice if we could detect that the process is taking a while, and print a message about what's going on in that case.

This probably doesn't come up much in normal configurations.
In our case it happened because the OS X machine just had its hostname changed. After a reboot things were fine again. Hence pretty low priority...

@danielmewes danielmewes added this to the backlog milestone Jul 30, 2015
@rmg
Copy link

rmg commented Aug 29, 2015

This also comes up when offline, as I discovered during a power outage. My laptop still runs, but WiFi is down, so my rethink docker container went from 2s startup to 24s startup. Fun to debug with access to the Internet only through a 3g tablet. But I did manage to come to the same conclusion as this issue through code inspection on github before landing here!

@pdonorio
Copy link

pdonorio commented Sep 6, 2016

I can confirm the problem offline, around ~30s (kinda frozen) if wifi is off or unreachable. Would it be possible to add an option on the command line to skip the operation?

@danielmewes
Copy link
Member Author

@pdonorio That sounds reasonable I think. I'm not 100% sure what we currently do the address lookup for. It might be important somehow, but most likely we can do without it.

@pdonorio
Copy link

pdonorio commented Sep 6, 2016

Ah great! Thanks!
Make sure that is really nothing important, but if you want my guess
it probably checks rethinkdb installed binary version against your website to tell the user when he should upgrade.

@danielmewes
Copy link
Member Author

@pdonorio I don't think the version check is the issue. The version check runs in the background and will not block a RethinkDB server from starting up (you can start with --no-version-check to verify this). The getaddrinfo that's blocking is one that's related to figuring out the interfaces to bind on I think.

@pdonorio
Copy link

pdonorio commented Sep 6, 2016

I can confirm that with wifi disabled and rethinkdb --no-update-check the issue is still there.

Now that I recall, I had a similar app hanging with socket.gethostbyaddr in python, of course when offline. Since there were no options to use a shorter timeout it ended up with skipping the call when you know you are disconnected.

@nenuko
Copy link

nenuko commented Nov 28, 2016

Hi! I've removed Google Public DNS of my Network preferences (OSX) and it worked

@irae
Copy link

irae commented Dec 9, 2016

Thanks @nenuko! Based on your finding, although I was not using Google DNS, it got me trying stuff. Finally, this is what I did to solve it:

sudo hostname [new name]
sudo scutil --set ComputerName [new name]
sudo scutil --set LocalHostName [new name]
sudo scutil --set HostName [new name]
dscacheutil -flushcache

Reboot, and the issue is gone.

@sorccu
Copy link

sorccu commented Feb 6, 2017

This has been happening with a docker setup we have, albeit not on all hosts. They're all Linux machines.

One one host, we're seeing a ~16s delay. Here's what I've found:

  1. Ran rethinkdb with strace rethinkdb --cache-size 192 --no-update-check
  2. From the output I was able to see that it was doing a DNS lookup for the docker container's hostname (f7b84f15e3d3d) 4 times. The hostname is configured in /etc/hosts, so it's not quite clear why it's attempting to look it up anyway.
  3. Running dig f7b84f15e3d3d @192.168.1.1 (where 192.168.1.1 is the nameserver in /etc/resolv.conf) showed that it was taking about 4100ms to respond to each request, obviously finding nothing. Since according to strace it did this 4 times, it corresponds quite well with the 16 second total delay. Other non-existing host names have the same delay, too.

Indeed it seems that the issue, at least in this case, is caused by a badly configured DNS server. However, it also feels like there are a few things that could be done to prevent this from being as much of an issue. First, it's not clear why the DNS lookup is done 4 times. Second, it's not clear why the dns lookup is being done at all - after all it was present in /etc/hosts (and strace confirms that before each dns lookup, /etc/hosts was indeed read as well). Finally, it feels like specifying the server name manually with -n should help with this issue but it doesn't seem to change anything.

@AtnNn AtnNn modified the milestones: subsequent, backlog Feb 6, 2017
@sorccu
Copy link

sorccu commented Feb 13, 2017

FWIW, the following does not reduce the ~16s delay in any way:

rethinkdb --cache-size 128 --no-update-check --no-default-bind --bind 127.0.0.1 -n foo

Seems like it might be doing DNS queries for the default binds (which include the hostname) even if a custom bind is provided.

@AtnNn AtnNn modified the milestones: 2.3.x, subsequent Mar 12, 2017
@AtnNn AtnNn added the tp:bug label Mar 12, 2017
@srh srh modified the milestones: 2.3.x, 2.4-polish Dec 11, 2017
jtc3161 added a commit to jtc3161/rethinkdb that referenced this issue Jan 5, 2018
srh added a commit that referenced this issue Jan 8, 2018
Removes use of DNS to find local ip addresses to bind to, except on
Windows.  Also removes mistaken "Ignore loopback addresses" logic,
which happened to get its conditional flipped the wrong way.

The flipped "Ignore loopback addresses" logic meant that we would
remove all non-loopback addresses that got returned by the DNS lookup.
So removing the DNS lookup entirely won't break things.

The behavior on Windows never used getifaddrs, it just used DNS
lookup.  We keep the old behavior on Windows because limited manpower
prevents us from using the Windows alternative to getifaddrs.

Fixes #4600.

Co-authored-by: Sam Hughes <sam@samuelhughes.com>
srh added a commit that referenced this issue Jan 8, 2018
Removes use of DNS to find local ip addresses to bind to, except on
Windows.  Also removes mistaken "Ignore loopback addresses" logic,
which happened to get its conditional flipped the wrong way.

The flipped "Ignore loopback addresses" logic meant that we would
remove all non-loopback addresses that got returned by the DNS lookup.
So removing the DNS lookup entirely won't break things.

The behavior on Windows never used getifaddrs, it just used DNS
lookup.  We keep the old behavior on Windows because limited manpower
prevents us from using the Windows alternative to getifaddrs.

Fixes #4600.

Co-authored-by: Sam Hughes <sam@samuelhughes.com>
@srh srh closed this as completed in 2a89b02 Jan 8, 2018
@srh
Copy link
Contributor

srh commented Jan 8, 2018

It turns out that using getaddrinfo on the hostname was the wrong thing to do. And the best of it is, we didn't even use the result of getaddrinfo thanks to the loopback filtering logic mixing up a boolean -- instead of filtering out loopback addresses, it filtered out non-loopback addresses.

@srh srh modified the milestones: 2.4-polish, 2.3.x Jan 8, 2018
@sholladay
Copy link

@srh nice detective work. Your fix has me pretty excited. This has affected me for as long as I've been using RethinkDB and it's been a real pain.

What kind of startup time should I expect from a typical dev laptop (e.g. 2.3Ghz 2012 Macbook Pro)?

@srh
Copy link
Contributor

srh commented Jan 22, 2018

On my 2.7Ghz 2012 MBP it's 0.57 seconds with me manually recording it on a stopwatch.

With old binaries you can use the instructions at #6047 (comment) to mitigate this on macOS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants