Thursday, September 25, 2014

linux - cron job occasionally not running



I have a CentOS 6.6 server with the following packages installed:




crontabs-1.10-33.el6.noarch
cronie-1.4.4-12.el6.x86_64
cronie-anacron-1.4.4-12.el6.x86_64
kernel-2.6.32-504.3.3.el6.x86_64


Sometimes one of the backup jobs that is scheduled to run daily simply does not run. According to /var/log/cron.log, the script is not even called.
Interestingly, other jobs scheduled to run at exactly the same time run without any issues.




I can't reproduce the problem and haven't spotted any pattern to it. If I do nothing, the job runs correctly the next day as expected.



crond simply ignores just one of the multiple jobs that are supposed to run at a particular time. This only happens sporadically.



I have read in a few places that people recommend adding an empty line at the end of the crontab file. The job that occasionally fails to run is indeed on the last line of my crontab file.
I could not find any confirmation that this is a real or known bug.



# tail -2 /var/spool/cron/postgres
* * * * * OTHERJOB
0 21 * * * /pg_backup.sh
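
The trailing-newline theory can be ruled in or out directly. The following is a minimal Python sketch, not part of the original setup; the helper name `ends_with_newline` is hypothetical, and the sample bytes simply reuse the last crontab line shown above.

```python
def ends_with_newline(data: bytes) -> bool:
    """Return True if the crontab content is non-empty and ends with a newline."""
    return data.endswith(b"\n")


# Last crontab line from the output above, with and without a final newline.
print(ends_with_newline(b"0 21 * * * /pg_backup.sh\n"))  # True
print(ends_with_newline(b"0 21 * * * /pg_backup.sh"))    # False
```

To check the real file, the bytes could be read with `open("/var/spool/cron/postgres", "rb").read()`, which normally requires root.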



This is all I have in my /var/log/cron.log:



Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19394]: (root) CMD (OTHERJOB)
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19418]: (postgres) CMD (/pg_backup.sh)
Mar 31 21:01:02 SERVERNAME [cron.info] CROND[20062]: (root) CMD (OTHERJOB)

Apr 1 21:00:02 SERVERNAME [cron.info] CROND[31349]: (root) CMD (OTHERJOB)
Apr 1 21:01:01 SERVERNAME [cron.info] CROND[32080]: (root) CMD (OTHERJOB)



See how OTHERJOB always runs, while on Apr 1 pg_backup.sh was not even executed.
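
Because the failure is sporadic, it can help to scan the log for days on which cron ran something but never started the backup. The sketch below is one way to do that in Python, assuming the log line layout shown above (including the `[cron.info]` tag, which depends on the local syslog configuration). The function name `days_without_command` is hypothetical.

```python
import re

# Matches lines such as:
# Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19394]: (root) CMD (OTHERJOB)
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"\[cron\.info\]\s+CROND\[\d+\]:\s+\((?P<user>[^)]+)\)\s+CMD\s+\((?P<cmd>.*)\)\s*$"
)


def days_without_command(lines, command):
    """Return (month, day) pairs on which cron ran jobs but never ran `command`."""
    seen = {}  # (month, day) -> set of commands started that day, in log order
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not a CMD entry
        key = (match.group("month"), int(match.group("day")))
        seen.setdefault(key, set()).add(match.group("cmd"))
    return [day for day, commands in seen.items() if command not in commands]


# The log excerpt from above.
SAMPLE = """\
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19394]: (root) CMD (OTHERJOB)
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19418]: (postgres) CMD (/pg_backup.sh)
Mar 31 21:01:02 SERVERNAME [cron.info] CROND[20062]: (root) CMD (OTHERJOB)

Apr 1 21:00:02 SERVERNAME [cron.info] CROND[31349]: (root) CMD (OTHERJOB)
Apr 1 21:01:01 SERVERNAME [cron.info] CROND[32080]: (root) CMD (OTHERJOB)
""".splitlines()

print(days_without_command(SAMPLE, "/pg_backup.sh"))  # [('Apr', 1)]
```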



I've already tried restarting crond but this keeps happening. This is affecting multiple servers with the same version of OS, kernel and cron RPMs.



There is a newer version of cronie (1.4.12); however, upgrading is not an option, as we're already using the latest version available for CentOS 6.6.



I went through the changelog for all cronie versions after mine (1.4.4) and haven't seen any fix for this particular problem. I also checked all commit messages.


Answer




We use sssd for remote authentication. crond has to check for available users before running jobs, and it does this every 60 seconds.
The default sssd client_idle_timeout is also 60 seconds, so we had a race condition between sssd and crond.



We only got to the bottom of this problem because, in version 1.4.4-14, crond started being a bit more verbose about some errors.



* Thu Feb  5 12:00:00 2015 Tomáš Mráz  - 1.4.4-14
- add log message when getpwnam fails


After updating to that version we started seeing the error below at the same time a job would not run:




[cron.err] crond[8654]: (user) ERROR (getpwnam() failed): Broken pipe
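
To correlate these errors with skipped jobs, the log can be scanned for them. This is a rough Python sketch that assumes the exact message layout shown above; the name `getpwnam_failures` is hypothetical.

```python
import re

# Matches the error line format shown above, e.g.
# [cron.err] crond[8654]: (user) ERROR (getpwnam() failed): Broken pipe
GETPWNAM_RE = re.compile(
    r"crond\[\d+\]:\s+\((?P<user>[^)]+)\)\s+ERROR\s+"
    r"\(getpwnam\(\) failed\)(?::\s*(?P<reason>.*))?\s*$"
)


def getpwnam_failures(lines):
    """Return (user, reason) pairs for each 'getpwnam() failed' log line."""
    failures = []
    for line in lines:
        match = GETPWNAM_RE.search(line)
        if match:
            reason = (match.group("reason") or "").strip()
            failures.append((match.group("user"), reason))
    return failures


SAMPLE = ["[cron.err] crond[8654]: (user) ERROR (getpwnam() failed): Broken pipe"]
print(getpwnam_failures(SAMPLE))  # [('user', 'Broken pipe')]
```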


That brought us to this:
https://bugzilla.redhat.com/show_bug.cgi?id=1209600#c2



and finally to this:
https://access.redhat.com/solutions/1125133





Issue: sssd_be terminated with SIGKILL due to getpwnam() returning EPIPE
(ie. broken pipe) can cause crond to silently skip cron job entries.




The suggested solution at the link above was to add the line below to /etc/sssd/sssd.conf:



client_idle_timeout = 75



The change above has fixed the problem for us and cron no longer skips jobs.
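
When several servers are affected, it may be worth verifying that the setting is actually in place on each of them. The Python sketch below parses sssd.conf text with the standard `configparser` module and reports the effective value. It assumes the option lives in the `[nss]` section (the text above only names the file, not the section) and falls back to the 60-second default mentioned earlier. The function name `client_idle_timeout` and the sample configuration are hypothetical.

```python
import configparser

DEFAULT_CLIENT_IDLE_TIMEOUT = 60  # sssd default, as stated above
CRON_INTERVAL = 60                # crond wakes up once a minute


def client_idle_timeout(conf_text: str, section: str = "nss") -> int:
    """Return client_idle_timeout from the given section, or the default."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    if parser.has_section(section) and parser.has_option(section, "client_idle_timeout"):
        return parser.getint(section, "client_idle_timeout")
    return DEFAULT_CLIENT_IDLE_TIMEOUT


# Hypothetical sssd.conf content; the [nss] placement is an assumption.
SAMPLE = """\
[sssd]
services = nss, pam

[nss]
client_idle_timeout = 75
"""

timeout = client_idle_timeout(SAMPLE)
print(timeout)                  # 75
print(timeout > CRON_INTERVAL)  # True: the idle timeout exceeds crond's 60-second cycle
```

On a real host the text would come from /etc/sssd/sssd.conf, which is normally readable only by root.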

