I have a CentOS 6.6 server with the following packages installed:
crontabs-1.10-33.el6.noarch
cronie-1.4.4-12.el6.x86_64
cronie-anacron-1.4.4-12.el6.x86_64
kernel-2.6.32-504.3.3.el6.x86_64
Sometimes, one of the backup jobs that is scheduled to run daily simply does not run. According to /var/log/cron.log, the script is not even called.
Interestingly, other jobs scheduled to run at exactly the same time run without any issues.
I can't reproduce the problem and haven't spotted any pattern to it. If I do nothing, the job runs correctly the next day as expected.
crond simply ignores just one of the multiple jobs that are supposed to run at a particular time. This only happens sporadically.
I have read in a few other places about people adding an empty line at the end of the crontab file. The job that occasionally fails to run is indeed on the last line of my crontab file.
I could not find any confirmation that this is a real or known bug.
# tail -2 /var/spool/cron/postgres
* * * * * OTHERJOB
0 21 * * * /pg_backup.sh
This is all I have in my /var/log/cron.log:
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19394]: (root) CMD (OTHERJOB)
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19418]: (postgres) CMD (/pg_backup.sh)
Mar 31 21:01:02 SERVERNAME [cron.info] CROND[20062]: (root) CMD (OTHERJOB)
Apr 1 21:00:02 SERVERNAME [cron.info] CROND[31349]: (root) CMD (OTHERJOB)
Apr 1 21:01:01 SERVERNAME [cron.info] CROND[32080]: (root) CMD (OTHERJOB)
Note how OTHERJOB always runs, while on Apr 1 pg_backup.sh was not executed at all.
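Spotting a skipped run by eye gets tedious across several servers. The following is a minimal Python sketch, not part of the original setup, that scans cron log lines in the format shown above and lists the days on which a given command never appears. The names `CMD_LINE`, `SAMPLE_LOG` and `days_without_job` are hypothetical, and the regular expression assumes the exact log layout from the excerpt.

```python
import re
from collections import defaultdict

# Matches lines such as:
# Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19418]: (postgres) CMD (/pg_backup.sh)
# Assumption: every job start is logged in exactly this layout.
CMD_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+.*?"
    r"CROND\[\d+\]:\s+\((?P<user>[^)]+)\)\s+CMD\s+\((?P<cmd>.*)\)\s*$"
)

# The log excerpt from the question, used as sample input.
SAMPLE_LOG = [
    "Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19394]: (root) CMD (OTHERJOB)",
    "Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19418]: (postgres) CMD (/pg_backup.sh)",
    "Mar 31 21:01:02 SERVERNAME [cron.info] CROND[20062]: (root) CMD (OTHERJOB)",
    "Apr 1 21:00:02 SERVERNAME [cron.info] CROND[31349]: (root) CMD (OTHERJOB)",
    "Apr 1 21:01:01 SERVERNAME [cron.info] CROND[32080]: (root) CMD (OTHERJOB)",
]


def days_without_job(lines, command):
    """Return the logged days (in first-seen order) on which `command` never ran."""
    seen = defaultdict(set)  # "Mar 31" -> set of commands logged on that day
    for line in lines:
        match = CMD_LINE.match(line)
        if match:
            day = f"{match['month']} {int(match['day'])}"
            seen[day].add(match["cmd"])
    return [day for day, cmds in seen.items() if command not in cmds]
```

Calling `days_without_job(SAMPLE_LOG, "/pg_backup.sh")` on the excerpt reports only Apr 1. A day on which cron logged nothing at all would not show up, so this only catches the "one job skipped while others ran" case described here.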
I've already tried restarting crond, but this keeps happening. It affects multiple servers with the same versions of the OS, kernel and cron RPMs.
There is a newer version of cronie (1.4.12); however, upgrading is not an option, as we're already using the latest version available for CentOS 6.6.
I went through the changelog for all cronie versions after mine (1.4.4) and haven't seen any fix for this particular problem. I also checked all the commit messages.
Answer
We use sssd for remote authentication. Before running jobs, crond has to check for available users, and it does this every 60 seconds. The sssd default client_idle_timeout is also 60 seconds, so we had a race condition between sssd and crond.
We only got to the bottom of this problem because, as of version 1.4.4-14, crond became a bit more verbose about some errors:
* Thu Feb 5 12:00:00 2015 Tomáš Mráz - 1.4.4-14
- add log message when getpwnam fails
After updating to that version we started seeing the error below at the same time a job would not run:
[cron.err] crond[8654]: (user) ERROR (getpwnam() failed): Broken pipe
That brought us to this:
https://bugzilla.redhat.com/show_bug.cgi?id=1209600#c2
and finally to this:
https://access.redhat.com/solutions/1125133
Issue: sssd_be terminated with SIGKILL due to getpwnam() returning EPIPE (i.e. broken pipe) can cause crond to silently skip cron job entries.
The solution suggested at the link above was to add the line below to /etc/sssd/sssd.conf:
client_idle_timeout = 75
The change above has fixed the problem for us and cron no longer skips jobs.
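Since this affected multiple servers, it may help to verify that the setting is actually in place on each of them. Below is a minimal Python sketch that reads sssd.conf-style text and reports the configured client_idle_timeout, falling back to the 60-second default mentioned above. The function names and `SAMPLE_CONF` are hypothetical, and placing the option under the `[nss]` section is an assumption on my part; check the sssd.conf man page for your version.

```python
import configparser

SSSD_DEFAULT_IDLE_TIMEOUT = 60  # default quoted in the answer above
CROND_CHECK_INTERVAL = 60       # crond re-checks users every 60 seconds

# Hypothetical config fragment; the [nss] placement is an assumption.
SAMPLE_CONF = """
[sssd]
services = nss, pam

[nss]
client_idle_timeout = 75
"""


def client_idle_timeout(conf_text, section="nss"):
    """Return client_idle_timeout for `section`, or the default if unset."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint(
        section, "client_idle_timeout", fallback=SSSD_DEFAULT_IDLE_TIMEOUT
    )


def outlasts_crond_cycle(conf_text, section="nss"):
    """True if the idle timeout is longer than crond's 60-second cycle."""
    return client_idle_timeout(conf_text, section) > CROND_CHECK_INTERVAL
```

On a real server you would pass in the contents of /etc/sssd/sssd.conf, which is normally readable only by root. A result of `False` from `outlasts_crond_cycle` suggests the server is still exposed to the race described above.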