Fixing GlusterFS unsynced entries

Two of our servers, serv1 and serv2, run Debian 7 x64 and GlusterFS 3.4. Unexpectedly serv2, which is located in another DC, got network issues; when the DC fixed their own router problems, we got an "unsynced entries" Nagios alert for serv1:

root@serv1 ~ # /root/bin/check_glusterfs -v sites -n 2 -w 10 -c 5
WARNING: 3 unsynced entries
root@serv1 ~ #

There were three error entries in glustershd.log on serv1:

root@serv1 /var/log/glusterfs # grep "No such file or directory" glustershd.log | grep Path
[2015-12-12 19:22:53.228807] W [client-rpc-fops.c:471:client3_3_open_cbk] 0-sites-client-1: remote operation failed: No such file or directory. Path: <gfid:790d240d-6d8b-4540-9049-06664408cec7> (00000000-0000-0000-0000-000000000000)
[2015-12-12 19:22:53.241330] W [client-rpc-fops.c:471:client3_3_open_cbk] 0-sites-client-1: remote operation failed: No such file or directory. Path: <gfid:8f6a612a-6fda-45ee-aa84-e9cb847047c2> (00000000-0000-0000-0000-000000000000)
[2015-12-12 19:22:53.256951] W [client-rpc-fops.c:471:client3_3_open_cbk] 0-sites-client-1: remote operation failed: No such file or directory. Path: <gfid:2b6f14c4-e863-480a-9d72-c5027cc10666> (00000000-0000-0000-0000-000000000000)

How to get GlusterFS internal file identifiers (GFIDs) without digging into logs:

root@serv1 /var/log/glusterfs # gluster volume heal sites info
Gathering Heal info on volume sites has been successful

Brick serv1.domain.com:/opt/gls/sites/brick
Number of entries: 3
<gfid:790d240d-6d8b-4540-9049-06664408cec7>
<gfid:8f6a612a-6fda-45ee-aa84-e9cb847047c2>
<gfid:2b6f14c4-e863-480a-9d72-c5027cc10666>

Brick serv2.domain.com:/opt/gls/sites/brick
Number of entries: 0
root@serv1 /var/log/glusterfs #

Apart from these entries, the cluster is working fine and files are synced between nodes without any issues. The number of entries is not increasing.

A GlusterFS heal operation on the sites volume did not fix the issue:

root@serv1 /var/log/glusterfs # gluster volume heal sites
Launching Heal operation on volume sites has been successful
Use heal info commands to check status

root@serv1 /var/log/glusterfs # gluster volume heal sites info
Gathering Heal info on volume sites has been successful

Brick serv1.domain.com:/opt/gls/sites/brick
Number of entries: 3
<gfid:790d240d-6d8b-4540-9049-06664408cec7>
<gfid:8f6a612a-6fda-45ee-aa84-e9cb847047c2>
<gfid:2b6f14c4-e863-480a-9d72-c5027cc10666>

Brick serv2.domain.com:/opt/gls/sites/brick
Number of entries: 0
root@serv1 /var/log/glusterfs #

How can you fix such issues when you have no file paths, only GFIDs?

A GlusterFS internal file identifier (GFID) is a UUID that is unique to each file across the entire cluster. It is analogous to an inode number in a normal filesystem.
At the time of writing I found three methods to obtain a file path from a GFID: https://gluster.readthedocs.org/en/latest/Troubleshooting/gfid-to-path/
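All of these methods build on the fact that, for every regular file, GlusterFS keeps a hard link under the brick's .glusterfs directory, named after the GFID. A minimal sketch of that idea (resolve_gfid is my own helper name; real resolvers such as method 3 also handle directories and symlinks):

```shell
# resolve_gfid BRICK GFID
# GlusterFS keeps a hard link to each regular file at
# .glusterfs/<gfid[0:2]>/<gfid[2:4]>/<gfid>, so any other hard link
# to the same inode is the file's real path on the brick.
resolve_gfid() {
    local brick="$1" gfid="$2"
    local link="$brick/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"
    find "$brick" -path "$brick/.glusterfs" -prune -o -samefile "$link" -print
}
```

For example, `resolve_gfid /opt/gls/sites/brick 790d240d-6d8b-4540-9049-06664408cec7` should print the file's path under the brick.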

I used method 3, the bash GFID resolver proposed at https://gist.github.com/semiosis/4392640:

root@serv1 ~ # ./gfid-resolver.sh /opt/gls/sites/brick 790d240d-6d8b-4540-9049-06664408cec7
790d240d-6d8b-4540-9049-06664408cec7 == File: /opt/gls/sites/brick/somedomain.com/htdocs/images/_thumbs_98x138/7d3a770000649f3b72e15152958211dd.jpg
root@serv1 ~ # ./gfid-resolver.sh /opt/gls/sites/brick 8f6a612a-6fda-45ee-aa84-e9cb847047c2
8f6a612a-6fda-45ee-aa84-e9cb847047c2 == File: /opt/gls/sites/brick/somedomain.com/htdocs/images/_thumbs_98x138/a7dcc888f4f8b73bcd1aa53b2b8ca667.jpg
root@serv1 ~ # ./gfid-resolver.sh /opt/gls/sites/brick 2b6f14c4-e863-480a-9d72-c5027cc10666
2b6f14c4-e863-480a-9d72-c5027cc10666 == File: /opt/gls/sites/brick/somedomain.com/htdocs/images/_thumbs_98x138/fef9da3fa55971c679407c277f7e0330.jpg
root@serv1 ~ #

Now that you have the exact file path and file name, you need to find the GlusterFS hard link to it:

root@serv1:~/img# find /opt/gls/sites/ -samefile /opt/gls/sites/brick/somedomain.com/htdocs/images/_thumbs_98x138/a7dcc888f4f8b73bcd1aa53b2b8ca667.jpg -print
/opt/gls/sites/brick/.glusterfs/8f/6a/8f6a612a-6fda-45ee-aa84-e9cb847047c2
/opt/gls/sites/brick/somedomain.com/htdocs/images/_thumbs_98x138/a7dcc888f4f8b73bcd1aa53b2b8ca667.jpg
root@serv1:~/img#

Then temporarily move the file a7dcc888f4f8b73bcd1aa53b2b8ca667.jpg somewhere else and remove (with rm -f "file") the broken link /opt/gls/sites/brick/.glusterfs/8f/6a/8f6a612a-6fda-45ee-aa84-e9cb847047c2.
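This move-and-unlink step can be expressed as a tiny helper (a sketch; quarantine_entry is my own name, the paths come from the heal info output above):

```shell
# quarantine_entry FILE GFID_LINK DEST_DIR
# Move the data file out of the brick and drop the stale .glusterfs
# hard link, so the self-heal daemon can clear the entry.
quarantine_entry() {
    local file="$1" link="$2" dest="$3"
    mkdir -p "$dest"
    mv "$file" "$dest"/   # keep the data safe outside the brick
    rm -f "$link"         # remove the broken .glusterfs hard link
}
```

For example: `quarantine_entry /opt/gls/sites/brick/somedomain.com/htdocs/images/_thumbs_98x138/a7dcc888f4f8b73bcd1aa53b2b8ca667.jpg /opt/gls/sites/brick/.glusterfs/8f/6a/8f6a612a-6fda-45ee-aa84-e9cb847047c2 ~/img`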

Only then does volume healing really "heal" the volume, and the broken entry disappears:

root@serv1 ~/img # gluster volume heal sites
Launching Heal operation on volume sites has been successful
Use heal info commands to check status

root@serv1 ~/img # gluster volume heal sites info
Gathering Heal info on volume sites has been successful

Brick serv1.domain.com:/opt/gls/sites/brick
Number of entries: 2
<gfid:790d240d-6d8b-4540-9049-06664408cec7>
<gfid:2b6f14c4-e863-480a-9d72-c5027cc10666>

Brick serv2.domain.com:/opt/gls/sites/brick
Number of entries: 0
root@serv1 ~/img #

Then I did the same operations with the other two entries, and after a final heal both bricks are OK:

root@serv1 ~/img # gluster volume heal sites info
Gathering Heal info on volume sites has been successful

Brick serv1.domain.com:/opt/gls/sites/brick
Number of entries: 0

Brick serv2.domain.com:/opt/gls/sites/brick
Number of entries: 0
root@serv1 ~/img #

Now you need to move your files back to their original locations. Make sure you copy them not to the brick path (/opt/gls/sites) but to the real GlusterFS mount point (/home/sites), so replication will sync them to serv2:

root@serv1 ~/bin # df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/mapper/vg0-sites 700G 44G 656G 7% /opt/gls/sites
localhost:/sites 700G 44G 656G 7% /home/sites

Now we are done, sync is OK:

root@serv1 ~/img # /root/bin/check_glusterfs -v sites -n 2 -w 10 -c 5
OK: 2 bricks; free space 655GB
root@serv1 ~/img #

P.S. The check_glusterfs Nagios plugin used for serv1 is https://exchange.nagios.org/directory/Plugins/System-Metrics/File-System/GlusterFS-checks/details


Fixing innodb issues after database rsync

Although someone ( http://dba.stackexchange.com/questions/41667/is-it-okay-to-use-rsync-on-innodb-database-if-the-mysql-server-is-shutdown ) said that rsyncing an InnoDB database is OK, practice says the opposite, especially if your source and destination servers have slightly different sets of databases and you are unable to sync the whole /var/lib/mysql directory.

To sync InnoDB databases (innodb_file_per_table=1 on both hosts) between two hosts during a migration, it is better to use SQL dumps (or the Percona XtraBackup toolkit if your databases are big enough) than a plain old rsync of /var/lib/mysql/dbname with MySQL stopped: some databases may work, but others have a good chance of being corrupted.

After one such rsync of /var/lib/mysql/ I got a bunch of errors like:

mysql> use fattoreb_fbgwp
No connection. Trying to reconnect...
Connection id: 1
Current database: *** NONE ***

mysql> use fattoreb_fbgwp;
mysql> show tables;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 1
Current database: fattoreb_fbgwp
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111)

Restoring from a dump did not work:
$ mysql fattoreb_fbgwp < fattoreb_fbgwp.sql
ERROR 2013 (HY000) at line 22: Lost connection to MySQL server during query

Dropping the database did not work either:
mysql> drop database fattoreb_fbgwp;
ERROR 1010 (HY000): Error dropping database (can't rmdir './fattoreb_fbgwp', errno: 39)
mysql> 

After enabling the MySQL error log, I got errors like these:
151010 22:06:22 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.45-cll' socket: '/var/lib/mysql/mysql.sock' port: 3306 MySQL Community Server (GPL)
InnoDB: Error: tablespace id is 15 in the data dictionary
InnoDB: but in file ./fattoreb_fbgwp/dap_aff_comm.ibd it is 492!
151010 22:07:36 InnoDB: Assertion failure in thread 139677511759616 in file fil0fil.c line 768
...
151011 16:30:01  InnoDB: cannot calculate statistics for table fedetrac_cpvlab/users
InnoDB: because the .ibd file is missing.
...
151011 17:23:36 [ERROR] Cannot find or open table fattoreb_fbgwp/wp_usermeta from
the internal data dictionary of InnoDB though the .frm file for the table exists. 
...
151011 17:25:40  InnoDB: Error: table `fattoreb_fbgwp`.`dap_config` does not exist in the InnoDB internal
InnoDB: data dictionary though MySQL is trying to drop it.
InnoDB: Have you copied the .frm file of the table to the
InnoDB: MySQL database directory from another database?
...
151011 17:47:47  InnoDB: Error: table 'fedetrac_cpvlab/clicks'
InnoDB: in InnoDB data dictionary has tablespace id 68,
InnoDB: but a tablespace with that id does not exist. There is
InnoDB: a tablespace of name ./fedetrac_cpvlab/clicks.ibd and id 155, though. Have
InnoDB: you deleted or moved .ibd files?
...
151011 17:47:47  InnoDB: error: space object of table 'fedetrac_cpvlab/logins',
InnoDB: space id 86 did not exist in memory. Retrying an open.
151011 17:47:47  InnoDB: Error: tablespace id and flags in file './fedetrac_cpvlab/logins.ibd' are 173 and 0, but in the InnoDB
InnoDB: data dictionary they are 86 and 0.
InnoDB: Have you moved InnoDB .ibd files around without using the
InnoDB: commands DISCARD TABLESPACE and IMPORT TABLESPACE?
...

In InnoDB, the metadata within ibdata1 contains a numbered list of InnoDB tables. Since the database/table info in ibdata1 on the destination server differed from that on the source host, the only way to fix these multiple issues across different databases was to recreate everything from scratch.

I dumped all databases that could still be dumped (databases with the errors above simply cannot be dumped, so you need the dumps from the source server), then shut down MySQL, deleted ibdata* and ib_logfile*, started MySQL, recreated the databases, and imported the existing *.sql files.
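The whole rebuild cycle can be sketched as a script. Treat it as an outline under my assumptions (the dump directory and the init-script name are examples, not from the original setup), not a drop-in tool:

```shell
# rebuild_innodb: outline of the dump / wipe-ibdata / reimport cycle.
# DESTRUCTIVE: deletes the shared InnoDB tablespace; only sensible for
# innodb_file_per_table setups where every DB is reimported from a dump.
rebuild_innodb() {
    local dumpdir=/root/dumps
    mkdir -p "$dumpdir"
    # 1. Dump everything that still dumps; note the failures.
    for db in $(mysql -N -e 'SHOW DATABASES' | grep -Ev '^(mysql|information_schema|performance_schema)$'); do
        mysqldump --single-transaction "$db" > "$dumpdir/$db.sql" \
            || echo "FAILED: $db (use the dump from the source server)"
    done
    # 2. Wipe the shared tablespace and the redo logs.
    service mysql stop
    rm -f /var/lib/mysql/ibdata* /var/lib/mysql/ib_logfile*
    service mysql start
    # 3. Recreate the databases and import the dumps.
    local f db
    for f in "$dumpdir"/*.sql; do
        db=$(basename "$f" .sql)
        mysql -e "CREATE DATABASE IF NOT EXISTS \`$db\`"
        mysql "$db" < "$f"
    done
}
```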

If you are unable to drop some databases:

mysql> drop database fattoreb_fbgwp;
ERROR 1010 (HY000): Error dropping database (can't rmdir './fattoreb_fbgwp', errno: 39)
mysql> drop database fedetrac_cpvlab;
ERROR 1010 (HY000): Error dropping database (can't rmdir './fedetrac_cpvlab', errno: 39)
mysql> 

then stop MySQL, delete all .ibd files under /var/lib/mysql/fattoreb_fbgwp and /var/lib/mysql/fedetrac_cpvlab (in fact there were no other files there except *.ibd after the drop attempts), start MySQL, and now you can drop these DBs.

The "InnoDB Infrastructure Cleanup" section at http://stackoverflow.com/questions/3927690/howto-clean-a-mysql-innodb-storage-engine/4056261 helped me a lot to sort this out.


httpd dead but subsys locked

The server appeared in Centreon with these alerts:

NGINX	Critical    HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.001 second response time
APACHE_LOCALHOST	Critical	Connection refused

All sites showed

502 Bad Gateway 
nginx/1.6.2

and Apache was not running.

Restarting Apache did not help:

$ service httpd status
httpd dead but subsys locked
$ service httpd restart
Stopping httpd: [FAILED]
Starting httpd: [ OK ]
$ service httpd status
httpd dead but subsys locked
$ 

Stopping nginx, then restarting Apache, then starting nginx had helped in the past, but not this time.

Apache's error_log said “Configuration Failed”, while at the same time the config syntax check passed with no errors:

$ tail error_log
[Tue Sep 08 03:35:08 2015] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Tue Sep 08 03:35:08 2015] [notice] Digest: generating secret for digest authentication ...
[Tue Sep 08 03:35:08 2015] [notice] Digest: done
Configuration Failed
[Tue Sep 08 03:36:36 2015] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Tue Sep 08 03:36:36 2015] [notice] Digest: generating secret for digest authentication ...
[Tue Sep 08 03:36:36 2015] [notice] Digest: done
Configuration Failed
$

$ apachectl configtest
Syntax OK
$ httpd -t
Syntax OK

Enabling the debug log level did not help at all (no new lines appeared in error_log compared to the warn level):

$ grep LogLevel /etc/httpd/conf/httpd.conf
# LogLevel: Control the number of messages logged to the error_log.
#LogLevel warn
LogLevel debug
$ 

I tried strace to see what was going on:

$ strace -f -o apache.trace /usr/sbin/httpd

In the generated apache.trace text file there was only one occurrence of "Configuration Failed", right after the line "No space left on device":
8540  mmap(NULL, 500008, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0) = 0x7f99d00b9000
8540  semget(IPC_PRIVATE, 1, IPC_CREAT|0600) = -1 ENOSPC (No space left on device)
8540  write(2, "Configuration Failed\n", 21) = 21

But disk space had already been checked and was not the issue:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       127G  107G   14G  89% /
tmpfs           1.9G     0  1.9G   0% /dev/shm
/dev/sda1        97M   58M   35M  63% /boot
/dev/sda5      1007M   18M  939M   2% /tmp

$ df -i
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/sda3      8462336 181993 8280343    3% /
tmpfs           489644      1  489643    1% /dev/shm
/dev/sda1        25688     45   25643    1% /boot
/dev/sda5        65536    108   65428    1% /tmp

Just in case, I checked the SELinux policy: it was disabled. I removed the lock file once again (it is recreated after each httpd restart), but Apache would not start:

$ getenforce
Disabled

$ setenforce 0
setenforce: SELinux is disabled

$ rm -f /var/lock/subsys/httpd
$ service httpd restart
Stopping httpd: [FAILED]
Starting httpd: [ OK ]
$ service httpd status
httpd dead but subsys locked

Checked the PIDFILE setting in both Apache configs (the default location is /var/run/httpd/httpd.pid) and checked permissions: all OK.
As a test, changed it to /var/run/httpd.pid, removed the lock, and restarted Apache once again: no luck, same issue in the log.
Then reverted the PIDFILE setting back.

$ grep PID /etc/sysconfig/httpd
# /var/run/httpd/httpd.pid in which it records its process
# specified in httpd.conf (via the PidFile directive), the new
# location needs to be reported in the PIDFILE.
#PIDFILE=/var/run/httpd/httpd.pid
PIDFILE=/var/run/httpd.pid

$ grep PidFile /etc/httpd/conf/httpd.conf
# PidFile: The file in which the server should record its process
# identification number when it starts.  Note the PIDFILE variable in
#PidFile run/httpd.pid
PidFile /var/run/httpd.pid
$
$ rm -f /var/lock/subsys/httpd
$ service httpd restart

Luckily, I remembered about leaked semaphores and checked them. There were no ‘apache’ semaphores, but there were many other numeric semids:

$ ipcs -s | grep apache

$ ipcs -s

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x53032aba 302186496  root       600        103       
0x48032aba 302219265  root       600        9         
0x57032aba 302252034  root       600        1         
0x00000000 302612483  root       600        1         
0x00000000 302645252  root       600        1         
0x00000000 952729605  4294967295 600        1         
0x00000000 952762374  4294967295 600        1         
0x00000000 303661063  root       600        1         
0x00000000 303693832  root       600        1         
0x00000000 302907401  root       600        1         
....
.... 
0x00000000 303267864  root       600        1         
0x00000000 303300633  root       600        1         
0x00000000 209223710  4294967295 600        1         
0x00000000 209256479  4294967295 600        1         
0x00000000 275742752  4294967295 600        1         
....
.... 
0x00000000 127336570  4294967295 600        1         
0x00000000 127369339  4294967295 600        1         
0x00000000 127402108  4294967295 600        1         
0x00000000 188448893  4294967295 600        1         
0x00000000 188481662  4294967295 600        1         
0x00000000 295174271  4294967295 600        1    

Resetting the existing semaphores actually helped to start Apache, and it resumed normal operations:

$ for sem in `ipcs -s | awk '{print $2}'`; do ipcrm -s $sem; done
ipcrm: already removed id (Semaphore)
ipcrm: already removed id (semid)
$ ipcs -s

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     

$ 
$ service httpd start
Starting httpd:                                            [  OK  ]
$ service httpd status
httpd (pid  8740) is running...
$ 
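The "already removed id" noise in the reset loop above comes from the header lines of `ipcs -s` being fed to ipcrm. A filter that keeps only numeric semid values avoids that (a sketch; removing all semaphores is still only safe while Apache is stopped):

```shell
# semids: read `ipcs -s` output on stdin and print only numeric values
# from the semid column, skipping the header lines that caused the
# "already removed id" messages.
semids() {
    awk '$2 ~ /^[0-9]+$/ {print $2}'
}
# usage (destructive, run only while httpd is down):
#   for sem in $(ipcs -s | semids); do ipcrm -s "$sem"; done
```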

$ tail -f /var/log/httpd/error_log
[Tue Sep 08 04:29:29 2015] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Tue Sep 08 04:29:29 2015] [notice] Digest: generating secret for digest authentication ...
[Tue Sep 08 04:29:29 2015] [notice] Digest: done
[Tue Sep 08 04:29:29 2015] [notice] Apache/2.2.15 (Unix) DAV/2 configured -- resuming normal operations

Apache is back to work and all sites work now 🙂

If even a semaphore reset does not help, or you just want to permanently raise the semaphore limits:

You can see what your limits are like this:
$ cat /proc/sys/kernel/sem
250	32000	32	128

$ ipcs -ls
------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767

You can double those limits by adding this line to /etc/sysctl.conf:
kernel.sem = 500 64000 64 256

That makes sure you'll get the change at the next boot. 
To make the change take immediate effect:
$ sysctl -p


Free disk space quest

Got a <5% free disk space alert on /var from the monitoring system, and started investigating what could be rotated, (re)moved, zipped, etc.

Found that in fact only ~1 GB was used on /var:

[root@vm1501 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 15G 13G 823M 95% /var

[root@vm1501 var]# du -hcs *
31M account
235M cache
249M clamav
8.0K cvs
28K db
32K empty
8.0K ftp
8.0K games
32K hotcopy
69M lib
8.0K local
44K lock
266M log
16K lost+found
4.0K mail
8.0K net-snmp
8.0K nis
8.0K opt
8.0K preserve
8.0K racoon
264K run
154M spool
12M tmp
1.2M www
20K yp
1014M total
[root@vm1501 var]#

One-liners like these:

find /var -type f -size +50M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
du -mxS /var | sort -n | tail

confirmed that only ~1 GB of space was really used on this server.

So where was the rest of the space hiding?

Grepping the open file descriptors showed old deleted logs still held open under /var:

[root@vm1501 ~]# lsof | grep /var | less
nginx      7443       nginx    8w      REG                8,5      825768    1442201 /var/log/httpd/prXXX.log.1 (deleted)
nginx      7443       nginx    9w      REG                8,5    48004684    1442079 /var/log/httpd/vaXXX-access.log.1 (deleted)
nginx      7443       nginx   10w      REG                8,5 11491286832    1442207 /var/log/httpd/scXXX.log.1 (deleted)
nginx      7443       nginx   12w      REG                8,5    13814805    1441946 /var/log/httpd/maXXX.log.1 (deleted)
nginx      7443       nginx   13w      REG                8,5    10619125    1442209 /var/log/httpd/taXXX.com.log.1 (deleted)
nginx      7443       nginx   14w      REG                8,5        8975    1442222 /var/log/httpd/reXXX.com.log.1 (deleted)
nginx      7443       nginx   16w      REG                8,5     6085095    1442181 /var/log/httpd/kiXXX.com.log.1 (deleted)
nginx      7443       nginx   17w      REG                8,5      208104    1442168 /var/log/httpd/moXXX.com.log.1 (deleted)
nginx      7443       nginx   19w      REG                8,5     7731942    1442203 /var/log/httpd/roXXX.log.1 (deleted)
nginx      7443       nginx   20w      REG                8,5  1154637195    1442178 /var/log/httpd/inXXX.com.log.1 (deleted)
nginx      7443       nginx   21w      REG                8,5    94751240    1442198 /var/log/httpd/piXXX.com.log.1 (deleted)
...
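To see how much space such deleted-but-open files still hold, you can sum the SIZE/OFF column of lsof's output for files under /var (a sketch; the function reads lsof-style lines on stdin, and `+L1` restricts lsof to files with a link count below 1, i.e. deleted ones):

```shell
# sum_deleted_bytes: read `lsof`-style lines on stdin and sum the
# SIZE/OFF column ($7) for entries under /var; prints total bytes.
sum_deleted_bytes() {
    awk '/\/var\// {sum += $7} END {print sum + 0}'
}
# usage: lsof -nP +L1 2>/dev/null | sum_deleted_bytes
```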

Restarting nginx solved the issue; now df shows the correct space utilization:

[root@vm1501 ~]# service nginx restart

[root@vm1501 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 15G 1.1G 13G 8% /var

RabbitMQ network partition monitoring

RabbitMQ is very sensitive to the network connections between nodes; even LAN clusters in the same DC (even on the same hardware node) sometimes get stuck with so-called “network partition” errors. As described in the official documentation, RabbitMQ clusters do not tolerate network partitions well and cannot restore replication automatically on their own. Therefore, each time this issue happens, you need to restore the cluster manually or with additional automation (via various scripts).

Recovering from a RabbitMQ network partition is fairly well described here: https://www.rabbitmq.com/partitions.html

In a few words, the whole process is:
0) Choose one node (server with RabbitMQ) that you trust the most.
1) Run “/etc/init.d/rabbitmq-server stop” on all other nodes.
2) After several seconds, run “/etc/init.d/rabbitmq-server start” on the same nodes from step 1.
3) Check the results, e.g. ssh -i id_rsa admin@rabbit1 “sudo rabbitmqctl cluster_status”, or in your RabbitMQ web UI, http://rabbit1:15672/
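The steps above can be sketched as a function (the node names, the ssh user, and the init-script path are assumptions for illustration; run it only during a maintenance window):

```shell
# recover_partition: restart RabbitMQ on every node except the trusted one.
recover_partition() {
    local trusted="rabbit1"            # step 0: the node you trust the most
    local others="rabbit2 rabbit3"
    local n
    for n in $others; do               # step 1: stop all other nodes
        ssh "admin@$n" "sudo /etc/init.d/rabbitmq-server stop"
    done
    sleep 10
    for n in $others; do               # step 2: start them again
        ssh "admin@$n" "sudo /etc/init.d/rabbitmq-server start"
    done
    # step 3: check the results from the trusted node
    ssh "admin@$trusted" "sudo rabbitmqctl cluster_status"
}
```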

If there are no “partitions” entries in the cluster_status command output, everything is OK:

admin@rabbit1:~# sudo rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit3]}]},
 {running_nodes,[rabbit@rabbit1,rabbit@rabbit3,rabbit@rabbit2]},
 {cluster_name,<<"rabbit@rabbit1">>},
 {partitions,[]}]
admin@rabbit1:~#

How to monitor network partitions?
For Nagios I prepared a simple bash plugin. It can be used with any monitoring system that accepts third-party plugins.

admin@rabbit1:~/bin$ cat check_rabbit_partitions
#!/bin/bash
 
# No warning/critical threshold, even one network partition will raise alert flag.
# Usage: ./check_rabbit_partitions
RABBIT_PARTITIONS=$(sudo rabbitmqctl cluster_status | grep partitions | grep rabbit)
 
if [ -n "$RABBIT_PARTITIONS" ]; then
        echo "CRITICAL: Rabbit network partitions exists! $RABBIT_PARTITIONS"
        exit 2
else
        echo "OK: No Rabbit network partitions."
        exit 0
fi

Now, in case of an alert, you will get output like this:

admin@rabbit1:~/bin$ ./check_rabbit_partitions
CRITICAL: Rabbit network partitions exists!  {partitions,[{rabbit@rabbit3,[rabbit@rabbit1,rabbit@rabbit2]}]}]
admin@rabbit1:~/bin$

If things are OK:

admin@rabbit1:~/bin$ ./check_rabbit_partitions
OK: No Rabbit network partitions.
admin@rabbit1:~/bin$

The last step is to define a new service for your rabbit node(s) in Nagios for remote script execution.

P.S. If you need more than just network partition monitoring, you can check these Perl plugins (additional Perl modules required): http://www.thegeekstuff.com/2013/12/nagios-plugins-rabbitmq/


Dealing with fail2ban multi domain jails

On servers with a huge domain count (many hundreds or thousands of domains), fail2ban in its /var/log/fail2ban.log groups all domains by their first letter, appending it right after the jail name. So from the logs alone you cannot see the exact affected domain:

2015-07-15 15:52:30,388 fail2ban.actions: WARNING [wp-auth_1] Ban XXX.66.82.116
2015-07-15 15:59:29,295 fail2ban.actions: WARNING [wp-auth_d] Ban XXX.27.118.100

How do you get the list of affected domains in these circumstances?
1) Get the list of jail/IP pairs from the fail2ban logs:

[admin@vm1025 log]# grep Ban fail2ban.log* | grep 2015-07-15 | awk '{print $5" / "$7}' | sed 's/]//g' | sed 's/\[//g' | sort | uniq
wp-auth_1 / XXX.4.96.100
wp-auth_a / XXX.175.9.62
wp-auth_a / XXX.198.252.66
wp-auth_d / XXX.26.193.70
wp-auth_d / XXX.37.207.133
wp-auth_d / XXX.205.239.23
wp-auth_d / XXX.77.224.29
wp-auth_s / XXX.207.144.200
wp-auth_s / XXX.234.146.13
wp-auth_t / XXX.70.98.32
wp-auth_t / XXX.23.227.234
wp-auth_t / XXX.183.182.53
[admin@vm1025 log]#
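Step 1 can be wrapped in a small function that reads fail2ban.log lines on stdin (the same awk/sed pipeline as above, deduplicated with sort -u; jail_ip_pairs is my own name):

```shell
# jail_ip_pairs: read fail2ban.log "Ban" lines on stdin and print
# unique "jail / IP" pairs, stripping the square brackets.
jail_ip_pairs() {
    grep Ban | awk '{print $5" / "$7}' | sed 's/[][]//g' | sort -u
}
# usage: grep 2015-07-15 /var/log/fail2ban.log* | jail_ip_pairs
```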

2) Check the access logs of that exact jail's domains; samples:

[admin@vm1025 log]# IP=XXX.26.193.70; JAIL=wp-auth_d;
[admin@vm1025 log]# for i in `fail2ban-client get $JAIL logpath | grep -v Current | cut -d"-" -f2`; do grep $IP $i* | awk '{print $1}'| sort | uniq; done
/home/vhosts/dwXXX.com/statistics/logs/access_log.processed:XXX.26.193.70
 
[admin@vm1025 log]# IP=XXX.234.146.13; JAIL=wp-auth_s;
[admin@vm1025 log]# for i in `fail2ban-client get $JAIL logpath | grep -v Current | cut -d"-" -f2`; do grep $IP $i* | awk '{print $1}'| sort | uniq; done
/home/vhosts/soXXX.com/statistics/logs/access_log.processed:XXX.234.146.13

Want to dive deeper into fail2ban interactive console commands? Here you go: http://www.fail2ban.org/wiki/index.php/Commands


Plesk 9.5 DomainKey underscore sign error

Situation: you need to set up some domain.com with Amazon SES, and on the server you have Plesk 9.5.
Amazon SES generated three DKIM records with underscores, which are required for its DKIM settings to work (Amazon SES service – Domains – domain.com).

Each time you enter a domain name with an underscore, like ‘gfdrol6aypcxib7dqmbyzdcnkhx85b73._domainkey.domain.com’, in the Plesk 9.5 DNS settings, you get an error:
Incorrect DNS record values were specified.

How to fix this?

Go to the Plesk database (usually ‘psa’):

mysql -u admin -p`cat /etc/psa/.psa.shadow`
mysql> use psa

Check the current record value:

mysql> select * from dns_recs where val like "%gfdrol6aypcxib7dqmbyzdcnkhx85b73%"\G
*************************** 1. row ***************************
         id: 658
dns_zone_id: 14
       type: CNAME
displayHost: gfdrol6aypcxib7dqmbyzdcnkhx85b73.domainkey.domain.com.
       host: gfdrol6aypcxib7dqmbyzdcnkhx85b73.domainkey.domain.com.
 displayVal: gfdrol6aypcxib7dqmbyzdcnkhx85b73.dkim.amazonses.com.
        val: gfdrol6aypcxib7dqmbyzdcnkhx85b73.dkim.amazonses.com.
        opt: 
 time_stamp: 2015-07-02 18:16:12
1 row in set (0.00 sec)

mysql> 

Now we can update the displayHost and host fields of record id=658 with the correct value containing the underscore:

update dns_recs set displayHost="gfdrol6aypcxib7dqmbyzdcnkhx85b73._domainkey.domain.com.", host="gfdrol6aypcxib7dqmbyzdcnkhx85b73._domainkey.domain.com." where id=658;

Checking yourself:

mysql> select * from dns_recs where val like "%gfdrol6aypcxib7dqmbyzdcnkhx85b73%"\G
*************************** 1. row ***************************
         id: 658
dns_zone_id: 14
       type: CNAME
displayHost: gfdrol6aypcxib7dqmbyzdcnkhx85b73._domainkey.domain.com.
       host: gfdrol6aypcxib7dqmbyzdcnkhx85b73._domainkey.domain.com.
 displayVal: gfdrol6aypcxib7dqmbyzdcnkhx85b73.dkim.amazonses.com.
        val: gfdrol6aypcxib7dqmbyzdcnkhx85b73.dkim.amazonses.com.
        opt: 
 time_stamp: 2015-07-02 18:16:12
1 row in set (0.00 sec)

mysql> 

Now just repeat similar steps for the rest of the DKIM records provided by Amazon.
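Since the remaining UPDATE statements differ only in the record id and the selector, they can be generated (a sketch; gen_domainkey_fix is my own helper name, and the ids and selectors come from your own dns_recs query):

```shell
# gen_domainkey_fix ID SELECTOR DOMAIN
# Emit the UPDATE statement that puts the underscore back into one record.
gen_domainkey_fix() {
    local id="$1" sel="$2" dom="$3"
    printf 'update dns_recs set displayHost="%s._domainkey.%s.", host="%s._domainkey.%s." where id=%s;\n' \
        "$sel" "$dom" "$sel" "$dom" "$id"
}
# usage: gen_domainkey_fix 658 gfdrol6aypcxib7dqmbyzdcnkhx85b73 domain.com | mysql psa
```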


Enable To header in redmine notification emails

By default, notification emails sent out by Redmine lack the To: mail header.

How to add To: mail header?

Go to the ‘Email notifications’ options tab. There you can see that ‘Blind carbon copy recipients (BCC)’ is checked (its default value is Yes), so email notifications are sent as blind carbon copies. Just uncheck this checkbox and you are done. Now you can press ‘Send test e-mail notification’ to verify the To: header.


redmine_checklists plugin collation error

When trying to add non-Latin characters as checklist items with the redmine_checklists plugin ( https://github.com/RCRM/redmine_checklists ) to your Redmine issue, you get an ‘Internal error’ (Error 500) in the browser.

In the Redmine production.log you can see this error description:

Mysql2::Error: Incorrect string value: '\xD0\xB9\xD1\x86\xD1\x83...' for column 'subject' at row 1: INSERT INTO `checklists` (`subject`, `issue_id`, `created_at`, `updated_at`, `position`) VALUES ('тест1', 427, '2015-04-24 11:56:44', '2015-04-24 11:56:44', 3)
Completed 500 Internal Server Error in 215ms (ActiveRecord: 157.5ms)

This issue happens due to the latin1_swedish_ci collation on all three checklist* tables in the redmine database:

mysql> SHOW TABLE STATUS where name like 'checklist%';
+-------------------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+-------------------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-------------------+----------+----------------+---------+
| checklist_template_categories | InnoDB | 10 | Compact | 0 | 0 | 16384 | 0 | 0 | 0 | 1 | 2015-04-24 18:09:26 | NULL | NULL | latin1_swedish_ci | NULL | | |
| checklist_templates | InnoDB | 10 | Compact | 0 | 0 | 16384 | 0 | 0 | 0 | 1 | 2015-04-24 18:09:26 | NULL | NULL | latin1_swedish_ci | NULL | | |
| checklists | InnoDB | 10 | Compact | 6 | 2730 | 16384 | 0 | 0 | 0 | 12 | 2015-04-24 18:09:26 | NULL | NULL | latin1_swedish_ci | NULL | | |
+-------------------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-------------------+----------+----------------+---------+
3 rows in set (0.03 sec)

How to fix this?

First of all, let's back up the affected tables, just in case:

mysqldump redmine checklist_template_categories > redmine__checklist_template_categories.sql
mysqldump redmine checklist_templates > redmine__checklist_templates.sql
mysqldump redmine checklists > redmine__checklists.sql

Now we can fix the wrong collation:

mysql> use redmine
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> alter table checklist_template_categories default character set = utf8 collate = utf8_general_ci;
Query OK, 0 rows affected (0.15 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table checklist_template_categories convert to character set utf8 collate utf8_general_ci;
Query OK, 0 rows affected (2.07 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table checklist_templates default character set = utf8 collate = utf8_general_ci;
Query OK, 0 rows affected (0.13 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table checklist_templates convert to character set utf8 collate utf8_general_ci;
Query OK, 0 rows affected (3.70 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table checklists default character set = utf8 collate = utf8_general_ci;
Query OK, 0 rows affected (0.18 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table checklists convert to character set utf8 collate utf8_general_ci;
Query OK, 6 rows affected (2.64 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> 
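Since the same pair of ALTER statements is applied to each table, they can also be generated instead of typed one by one (a sketch; gen_collation_fix is my own helper, and its output is meant to be piped into `mysql redmine`):

```shell
# gen_collation_fix TABLE...
# Emit the pair of collation-fix statements for each given table.
gen_collation_fix() {
    local t
    for t in "$@"; do
        echo "alter table $t default character set = utf8 collate = utf8_general_ci;"
        echo "alter table $t convert to character set utf8 collate utf8_general_ci;"
    done
}
# usage: gen_collation_fix checklist_template_categories checklist_templates checklists | mysql redmine
```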

If something goes wrong with the collation update, you can easily restore the three tables from the backups you created a few moments ago.

Check yourself:

mysql> SHOW TABLE STATUS where name like 'checklist%';

and try to add non-Latin checklist items to your Redmine tasks; the issue should be fixed.


Installing and configuring Nginx 1.8, PHP-FPM 5.6, and Percona Server 5.6 on RHEL7

This guide covers installing and configuring Nginx 1.8, PHP-FPM 5.6, and Percona Server 5.6 on RHEL7 (CentOS 7).

Nginx 1.8

nano /etc/yum.repos.d/nginx.repo
[nginx]
name=nginx repo
baseurl=http://nginx.org/packages/rhel/7/$basearch/
gpgcheck=0
enabled=1

Setup:

yum install nginx
systemctl enable nginx.service  (nginx is not in autostart by default)


[root@ip-172-31-47-105 nginx]# cat /etc/nginx/nginx.conf
## NGINX MAIN CONFIGURATION FILE ##
user nginx nginx;
worker_processes 2;
error_log  /var/log/nginx/error.log info;
events {
    worker_connections  1024;
    use epoll;
}
http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format main
                '$remote_addr - $remote_user [$time_local] '
                '"$request" $status $bytes_sent '
                '"$http_referer" "$http_user_agent" '
                '"$gzip_ratio"';
    access_log  /var/log/nginx/access.log  main;
    sendfile        on;
    tcp_nopush on;
    tcp_nodelay on;
    ignore_invalid_headers on;
    keepalive_timeout  30;
    server_tokens off;
    connection_pool_size 256;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 2k;
    request_pool_size 4k;
    output_buffers 1 32k;
    postpone_output 1460;
    client_header_timeout 10m;
    client_body_timeout 10m;
    send_timeout 10m;
    gzip on;
    gzip_disable "MSIE [1-6]\.(?!.*SV1)";
    gzip_http_version 1.1;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_buffers 16 8k;
    gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml 
               application/xml+rss text/javascript application/javascript text/x-js;
    include /etc/nginx/vhosts/*.conf;
}
## NGINX MAIN CONFIGURATION FILE ##
[root@ip-172-31-47-105 nginx]# 
 
 
[root@ip-172-31-47-105 vhosts]# cat /etc/nginx/vhosts/wpblog.conf
server {
    listen 80;
    server_name domain.com www.domain.com;
 
    client_max_body_size 5m;
    client_body_timeout 60;
 
    access_log /var/log/nginx/wpblog.log;
    error_log /var/log/nginx/wpblog-error.log error;
 
    root /home/www/wpblog;
    index  index.html index.php;
 
    ### ROOT DIRECTORY ###
    location / {
        try_files $uri $uri/ /index.php?$args;
    }
 
    ### SECURITY ###
    error_page 403 =404;
    location ~ /\. { access_log off; log_not_found off; deny all; }
    location ~ ~$ { access_log off; log_not_found off; deny all; }
    location ~* wp-admin/includes { deny all; }
    location ~* wp-includes/theme-compat/ { deny all; }
    location ~* wp-includes/js/tinymce/langs/.*\.php { deny all; }
    location /wp-includes/ { internal; }
    #location ~* wp-config.php { deny all; }
    location ~* ^/wp-content/uploads/.*.(html|htm|shtml|php)$ {
        types { }
        default_type text/plain;
    }
 
    location ~ ^/(status|ping|opcache\.php)$ {
        access_log off;
        allow 127.0.0.1;
        deny all;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/wpblog.socket;
    }
 
    ### DISABLE LOGGING ###
    location = /robots.txt { access_log off; log_not_found off; }
    location = /favicon.ico { access_log off; log_not_found off; }
 
    ### CACHES ###
    location ~* \.(jpg|jpeg|gif|css|png|ico|html)$ { access_log off; expires max; }
    location ~* \.(woff|svg)$ { access_log off; log_not_found off; expires 30d; }
    location ~* \.js$ { access_log off; log_not_found off; expires 7d; }
 
    ### php block ###
    location ~ \.php$ {
        try_files $uri =404;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_intercept_errors on;
        fastcgi_split_path_info ^(.+\.php)(.*)$;
        fastcgi_hide_header X-Powered-By;
        #fastcgi_pass 127.0.0.1:9001;
        fastcgi_pass unix:/var/run/wpblog.socket;
    }
} 
[root@ip-172-31-47-105 vhosts]# 
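
Once the vhost is live, the upload-directory guard can be sanity-checked with curl (the file name is hypothetical; for an existing file under uploads the response should be served as plain text, not executed):

```shell
curl -sI http://localhost/ | head -n 5
curl -sI http://localhost/wp-content/uploads/test.php   # expect Content-Type: text/plain
```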

php-fpm 5.6
Some possible repos for PHP:

(epel) rpm -Uvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm 
(remi) rpm -Uvh http://rpms.famillecollet.com/enterprise/remi-release-7.rpm
(atomic) wget -q -O - http://www.atomicorp.com/installers/atomic | sh
(webtatic) rpm -Uvh https://mirror.webtatic.com/yum/el7/webtatic-release.rpm

As of May 2015:
epel has php 5.4.16
atomic has php 5.4.40
webtatic has php 5.4.40, 5.5.24, 5.6.8
remi has php 5.4.40
remi-php55 has php 5.5.24
remi-php56 has php 5.6.8

yum --enablerepo=remi,remi-php56 install php-fpm php-opcache php-pecl-apcu php-cli php-pear php-pdo \
    php-mysqlnd php-pgsql php-pecl-mongo php-pecl-sqlite php-pecl-memcache php-pecl-memcached php-gd \
    php-mbstring php-mcrypt php-xml

systemctl enable php-fpm  (autostart is disabled by default)
systemctl status php-fpm
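
After starting php-fpm, confirm the unix socket from the pool config exists and that the /status endpoint answers locally (paths match the configs shown in this post):

```shell
systemctl start php-fpm
ls -l /var/run/wpblog.socket      # should be nginx:nginx, mode 0660
curl -s http://127.0.0.1/status   # the vhost allows this only from 127.0.0.1
```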


[root@ip-172-31-47-105 php-fpm.d]# cat /etc/php-fpm.d/www.conf
[WORDPRESS]
;listen = 127.0.0.1:9000
listen = /var/run/wpblog.socket
listen.mode = 0660
listen.owner = nginx
listen.group = nginx
 
user = nginx
group = nginx
 
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/slowlog.log
listen.allowed_clients = 127.0.0.1
pm = dynamic
pm.max_children = 8
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 8
pm.max_requests = 400
listen.backlog = -1
pm.status_path = /status
ping.path = /ping
request_terminate_timeout = 120s
rlimit_files = 131072
rlimit_core = unlimited
catch_workers_output = yes
php_value[session.save_handler] = files
php_value[session.save_path] = /var/lib/php/session
php_admin_value[error_log] = /var/log/php-fpm/error.log
php_admin_flag[log_errors] = on
php_admin_value[memory_limit] = 256M
security.limit_extensions = .php .php3 .php4 .php5 .html .htm .css
 
[root@ip-172-31-47-105 php-fpm.d]#
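
The pm.* numbers above are this post's example values; a rough way to pick pm.max_children is dividing the RAM you can dedicate to PHP-FPM by the average worker size. A sketch with assumed numbers:

```shell
# Sizing sketch: pm.max_children ~= dedicated RAM / average worker RSS.
avail_mb=1024   # MB reserved for PHP-FPM workers (assumed value)
worker_mb=64    # average per-worker memory in MB; measure with: ps -o rss= -C php-fpm (assumed value)
echo "pm.max_children = $(( avail_mb / worker_mb ))"   # prints: pm.max_children = 16
```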

Percona mysql server 5.6

yum install http://www.percona.com/downloads/percona-release/redhat/0.1-3/percona-release-0.1-3.noarch.rpm
yum install Percona-Server-server-56.x86_64    (this metapackage will pull in all needed dependencies)
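
After installation, enable and start the server and run the usual hardening script (a sketch; on this Percona build the service is named mysql, adjust if your package differs):

```shell
systemctl enable mysql
systemctl start mysql
mysql_secure_installation   # set root password, remove test DB and anonymous users
```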

Issues during setup.

1. SELinux needs to be disabled to avoid errors like this:

tail -f /var/log/nginx/error.log
2015/05/06 14:33:18 [error] 10595#0: *1 "/home/www/default/index.html" is forbidden (13: Permission denied), 
client: 1.2.3.4, server: _, request: "GET / HTTP/1.1", host: "52.7.144.126"

Temporary fix:
[root@ip-172-31-47-105 vhosts]# getenforce
Enforcing
[root@ip-172-31-47-105 vhosts]# setenforce Permissive
[root@ip-172-31-47-105 vhosts]# getenforce
Permissive

Permanent fix:
In /etc/sysconfig/selinux set SELINUX=disabled and reboot the server.
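
Disabling SELinux works, but a narrower alternative is to label the web root so nginx may read it; a sketch using the standard SELinux tools (path /home/www as in the configs above):

```shell
yum install policycoreutils-python    # provides the semanage tool
semanage fcontext -a -t httpd_sys_content_t "/home/www(/.*)?"
restorecon -Rv /home/www              # apply the new label recursively
```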

2. Nginx: connect() to unix socket failed (13: Permission denied) while connecting to upstream

 
tail -f /var/log/nginx/wpblog-error.log
2015/05/06 15:10:39 [crit] 11252#0: *2 connect() to unix:/var/run/wpblog.socket failed (13: Permission denied) 
while connecting to upstream, client: 1.2.3.4, server: wordpress.domain.net, request: "GET / HTTP/1.1", 
upstream: "fastcgi://unix:/var/run/wpblog.socket:", host: "52.7.144.126"

In /etc/php-fpm.d/www.conf set:
listen.mode = 0660
listen.owner = nginx
listen.group = nginx

Then restart php-fpm.

3. PHP error “It is not safe to rely on the system’s timezone settings”

  
tail -f /var/log/nginx/wpblog-error.log
2015/05/07 10:33:36 [error] 2683#0: *201 FastCGI sent in stderr: "PHP message: PHP Warning:  phpinfo(): 
It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting 
or the date_default_timezone_set() function. In case you used any of those methods and you are still getting 
this warning, you most likely misspelled the timezone identifier. We selected the timezone 'UTC' for now, 
but please set date.timezone to select your timezone. in /home/www/wpblog/i.php on line 2" while reading 
response header from upstream, client: 1.2.3.4, server: domain.com, request: "GET /i.php HTTP/1.1", 
upstream: "fastcgi://unix:/var/run/wpblog.socket:", host: "52.5.84.229"

Fix: vim /etc/php.ini
:%s#;date.timezone =#date.timezone = US/Central#
service php-fpm restart
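
The same substitution can be done non-interactively with sed; demonstrated here on a sample file (point sed at /etc/php.ini on the real server):

```shell
# Create a sample line as it appears in a stock php.ini, then apply the fix.
printf ';date.timezone =\n' > /tmp/php.ini.sample
sed -i 's#^;date.timezone =#date.timezone = US/Central#' /tmp/php.ini.sample
cat /tmp/php.ini.sample   # prints: date.timezone = US/Central
```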

4. If phpinfo(); shows you a blank page:

  
[root@ip-172-31-47-105 vhosts]# php -i | grep open_tag
short_open_tag => Off => Off
[root@ip-172-31-47-105 vhosts]# nano /etc/php.ini
short_open_tag = On

service php-fpm restart

5. In custom nginx locations, use these three lines to avoid blank pages:

  
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_pass unix:/var/run/wpblog.socket;

6. Solving the nginx + php-fpm access denied issue:

  
tail -f /var/log/nginx/wpblog-error.log
2015/05/07 10:55:54 [error] 2807#0: *310 FastCGI sent in stderr: "Access to the script '/home/www/wpblog/wp-admin/' 
has been denied (see security.limit_extensions)" while reading response header from upstream, client: 1.2.3.4, 
server: devops.cf, request: "GET /wp-admin/ HTTP/1.1", upstream: "fastcgi://unix:/var/run/wpblog.socket:", host: "domain.com"

Fix:
Edit www.conf in the php-fpm.d directory: uncomment and extend this line (by default only .php is allowed, so .htm and .html are blocked):
security.limit_extensions = .php .php3 .php4 .php5 .html .htm

service php-fpm restart

7. How to use .htaccess directives:
Method 1.
Use php-fpm config of your domain:

  
cat /etc/php-fpm.d/www.conf
php_admin_value[error_log] = /var/log/php-fpm/error.log
php_admin_flag[log_errors] = on
php_admin_value[memory_limit] = 256M

Method 2.
Starting from PHP 5.3, .user.ini files in the domain document root can be used
( http://php.net/manual/en/configuration.file.per-user.php )
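
A minimal .user.ini sketch (hypothetical values; only non-admin PHP directives can be set this way, and php_admin_value entries from the pool config cannot be overridden):

```shell
# Write an example .user.ini into the WordPress docroot used in this post.
cat > /home/www/wpblog/.user.ini <<'EOF'
; per-directory PHP settings, re-read by php-fpm periodically
upload_max_filesize = 8M
post_max_size = 8M
EOF
```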

PHP directives are processed in this order:

 
/etc/php.ini > .user.ini in domain docroot > domain pool config /etc/php-fpm.d/www.conf

By default, changes in .user.ini files are re-read after 5 minutes (user_ini.cache_ttl = 300); if you need them applied sooner: service php-fpm restart
