Notes on upgrading NAV from v3.14 to v4.2.5

Recently it was decided that the «IT operations framework group (GID)» would take over the management and operation of the «Network Administration Visualized (NAV)» tool from the «Computer networks group». The first part of this takeover involved upgrading NAV to the latest version. Because some parts of NAV have undergone fundamental architectural changes, we decided to install and configure the new version from scratch and run it in parallel with the old version until everything was verified and tested to work.

Setup

A new Debian Wheezy VM was installed, the NAV software repositories were enabled and the NAV software was installed [1]. The configuration was replicated as far as possible from the existing NAV installation. Much of the configuration, and all inventory, is stored in a back-end PostgreSQL database. We cloned the existing production database into a new database configured with access from the new NAV instance, and then pointed the new NAV instance at the clone.
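
For reference, the cloning itself was nothing more exotic than a dump-and-restore plus an updated NAV database configuration. A rough sketch (the database name nav_new and the owner role nav are assumptions for the example):

# on the shared PostgreSQL host: clone the production database into a new one
createdb -O nav nav_new
pg_dump nav | psql nav_new

# allow the new NAV host to connect to the clone (pg_hba.conf), then edit
# /etc/nav/db.conf on the new NAV host so that the database host and name
# point at the clone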

One major change between our old and new NAV instances is the method of storing and displaying time series metrics. The new version uses Graphite [2], while the old version used Cricket [3]. At UiO we already had a scalable Graphite setup, so it made sense to merge all NAV metrics into our pre-existing Graphite metric repository. That makes it easy to overlay metrics from NAV on top of other metrics and visualize co-variation between network components and other infrastructure metrics over time.

The new NAV instance was configured (in the file /etc/nav/graphite.conf) to point towards our pre-existing Graphite setup. Since NAV only sends UDP data to the Graphite carbon daemon, we needed to enable the UDP receiver in the carbon daemon on the host NAV was configured to send data to (ENABLE_UDP_LISTENER = True in carbon.conf). Since our other carbon data comes in through the carbon aggregator daemon and is sent to the carbon daemon locally, we also needed to open port 2003/udp in the local firewall of the receiving node, so that the NAV host can send its metric data directly to the carbon daemon. It is also important to remember to adapt the storage schemas for the NAV metrics in Graphite [7].
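
For reference, the pieces involved look roughly like this (host names and addresses are placeholders, and the retention values are only an illustration; the recommended storage schemas are listed in [7]):

# /etc/nav/graphite.conf on the NAV host
# [carbon]
# host = carbon-host.example.org
# port = 2003

# carbon.conf on the receiving carbon host
# ENABLE_UDP_LISTENER = True
# UDP_RECEIVER_PORT = 2003

# open 2003/udp for the NAV host in the local firewall of the carbon host (iptables example)
iptables -A INPUT -p udp --dport 2003 -s <nav-host-ip> -j ACCEPT

# storage-schemas.conf: retention rules for the nav.* namespace (illustrative values only)
# [nav]
# pattern = ^nav\.
# retentions = 5m:6months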

Slow updates

Our installation is quite large, comprising about 1800 devices in the NAV database. The first thing we noticed was that metric data was logged to Graphite erratically. With help from the NAV developers in the freenode IRC channel #nav, we found that the metric collection loops took far too long to complete, and that this was the likely cause of the erratic metric logging from NAV. The average duration of 5minstats job runs can be queried with SQL:

select avg(duration), stddev_pop(duration) from ipdevpoll_job_log where job_name='5minstats';
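
A per-job breakdown gives a quick overview of where ipdevpoll spends its time; something like this should work, assuming the default database name nav and a psql role with access to it:

# per-job averages for all ipdevpoll jobs, slowest first
psql -d nav -c "select job_name, avg(duration), stddev_pop(duration), count(*)
                from ipdevpoll_job_log group by job_name order by 2 desc;"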

We found that the average time for a 5minstats job was about 60 seconds. When collecting from 1800 devices, that adds up. The daemon that collects this data, updates the PostgreSQL database and sends metrics to Graphite is ipdevpolld. It is by far the most work-intensive part of NAV (at least in our installation). By default, ipdevpolld runs as a single process. In the new version of NAV, however, it is possible to split ipdevpolld into separate processes running in parallel. Multi-process mode is enabled with the -m option [4].
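
The option itself is just a flag to the daemon; a minimal sketch (exactly how the flag is wired into the daemon's startup configuration depends on the installation, see [4]):

# start ipdevpoll in multi-process mode instead of as a single process
ipdevpolld -m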

We tried ipdevpolld in multi-process mode with the goal of reducing gaps in the Graphite graphs by distributing more work in parallel. The result was that the database server ran out of resources and the OOM killer killed PostgreSQL processes. This was unfortunate, since both the existing production database and the cloned database for the new instance were running on the same host. Fortunately, our excellent DBAs managed to get things up again quickly. The VM running the PostgreSQL databases for NAV was then given double the memory and CPU resources, after which the DB server coped with the increased load.

Again with help from the NAV developers through IRC, the real cause of the problem was identified: ipdevpolld was pruning entries from the ipdevpoll_job_log table too eagerly [5]. After upgrading NAV to 4.2.5-2, the Graphite metrics became (almost) complete, with few gaps. Although we did not verify it, I think this also affected the status updates in the PostgreSQL database, which in turn are used for alerts and for status reporting in the web GUI.
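
Since the packages come from the NAV Debian repository [1], the fix was simply a package upgrade; roughly (assuming the nav metapackage from that repository):

# pull in the fixed NAV packages
apt-get update
apt-get install nav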

Tuning the carbon daemon

Recently the NAV developers published a blog post about gaps in Graphite data [6]. In order to check whether we had dropped UDP packets, and to visualize the effect of tuning, we started logging the drop counter over time in Graphite:

# log the UDP drop counter for port 2003 (07D3 in hex) to Graphite once a minute;
# the last field of each /proc/net/udp line is the kernel's drop counter, and the
# metric is sent to the local carbon line receiver on port 2023
while true
do
  echo "linux.script.collectd-prod02_uio_no.carbon_udp_drop_count `awk '$2~/07D3/{print $NF}' /proc/net/udp` `date +%s`" | nc 127.0.0.1 2023
  sleep 60
done

The graph showed a drop rate that varied between 250 and 1000 per minute. First we tried doubling the default and maximum buffers, which produced no visible change. Then we simply tried the values suggested in the blog post, which produced a very visible change in the drop-rate graph.
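
The tuning itself is only a matter of kernel-level UDP receive buffer settings on the carbon host; the kind of change involved looks like this (illustrative values only, the values we actually applied came from the blog post [6]):

# increase the default and maximum UDP receive buffers on the carbon host
sysctl -w net.core.rmem_default=8388608
sysctl -w net.core.rmem_max=16777216

# make the change persistent across reboots
cat > /etc/sysctl.d/99-carbon-udp.conf <<EOF
net.core.rmem_default=8388608
net.core.rmem_max=16777216
EOF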

Merging historical Cricket data into Graphite

The documentation describes quite well how to import historical Cricket data into Graphite [8], and there is a tool specially designed for this: migrate_to_whisper.py. However, the documentation is written for the case where you upgrade the same machine from a Cricket-based version to a Graphite-based version, and Graphite runs on the same host as the NAV instance. That is not our case. Here is what we did to merge the Cricket data from the old instance into the new instance.

Copy the Cricket RRD files to the new machine:

rsync  -a nav-prod03:/var/lib/nav/rrd /var/lib/nav/
rsync -a nav-prod03:/var/lib/nav/cricket-data /var/lib/nav/

Run the migration tool, writing to a temporary location on the same host:

/usr/lib/nav/migrate_to_whisper.py /site/cricket-whisper/

Copy the generated whisper files to a temporary location on the node receiving the NAV carbon data:

rsync  -a nav-prod04:/site/cricket-whisper /site/whisper/

Merge the historical whisper data into the whisper files created by the new, running NAV instance:

# the migrated (historical) whisper files were copied to /site/whisper/cricket-whisper/
# by the rsync above; the live files written by the new NAV instance are under /site/whisper/nav/
cd /site/whisper/cricket-whisper/nav/

find -type f -name '*.wsp' > /root/nav_input_list

for i in $(< /root/nav_input_list)
do
  echo "Merging $i into /site/whisper/nav/$i"
  # whisper-merge.py <from> <to>: fold the historical data points into the live file
  whisper-merge.py $i /site/whisper/nav/$i
done >> /var/tmp/whisper_merge_out

Watch progress:

watch wc -l /var/tmp/whisper_merge_out
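
To spot-check that the history actually ended up in the live files, individual whisper files can be inspected; for example (the metric path below is a placeholder):

# timestamps from well before the new NAV instance was installed indicate
# that the historical Cricket data is in place
whisper-fetch.py --from=$(date -d '2 years ago' +%s) /site/whisper/nav/<device>/<metric>.wsp | head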

[1] https://nav.uninett.no/install-instructions/#debian
[2] https://github.com/graphite-project
[3] http://cricket.sourceforge.net/
[4] https://nav.uninett.no/hg/stable/rev/cd7b14d9cefc
[5] https://launchpad.net/bugs/1437318
[6] https://nav.uninett.no/doc/4.2/howto/tuning_graphite.html
[7] https://nav.uninett.no/doc/dev/intro/install.html#integrating-graphite-with-nav
[8] https://nav.uninett.no/doc/4.2/howto/migrate-rrd-to-graphite.html

By Jarle Bjørgeengen
Published May 19, 2015 15:29 - Last modified March 22, 2019 12:44