kemonine/lollipopcloud (archived)

Add monit monitoring of sd card hiccups and time leaps

master
KemoNine 3 years ago
parent 4d9d72d438
commit cd55cd7511
GPG Key ID: 3BC2928798AE11AB
hardware/pine64.md (92 lines changed)

@@ -43,3 +43,95 @@ dkms install zfs/0.7.13
systemctl enable zfs-import-cache zfs-import.target zfs-mount zfs-share zfs.target
```
## Monitor For Common Problems
The Pine64 and SOPine can suffer from "clock jumps" (e.g. the system clock jumping forward 95 years) due to kernel bugs. They can also hit major IO stalls when writing heavily to micro-SD cards, to the point that the board becomes basically unresponsive for many minutes (upwards of 10).
The Monit configuration below watches for both events and reboots the board if either one happens. Currently this seems to be the least-worst option for recovery.
### Monit Install / Initial Config
``` bash
apt install monit
nano -w /etc/monit/monitrc

# Settings to add or adjust in /etc/monit/monitrc:
set mail-format { from: user@domain.tld }
set alert admin@domain.tld
set mailserver mail.domain.tld port 587
    username "user@domain.tld" password "apassword"
    using tls
set httpd port 2812 and
    allow admin:apassword
    allow guest:guest readonly
    #with ssl {            # enable SSL/TLS and set path to server certificate
    #    pemfile: /etc/ssl/certs/monit.pem
    #}
```
### Monit Monitor for large clock jumps forward
```/usr/local/bin/check_clock_jump.py```
``` python
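#!/usr/bin/env python3
"""Exit non-zero when the clock has jumped far ahead of the last recorded check."""
import datetime
import sys

FORMAT_STRING = '%Y-%m-%d %H:%M:%S'
MAX_TIME_JUMP = datetime.timedelta(days=90)
CACHE_FILE = '/var/cache/last_time.check'

current_time = datetime.datetime.now()
last_time = current_time  # default when there is no usable cached timestamp

try:
    with open(CACHE_FILE, 'r') as f:
        last_time = datetime.datetime.strptime(f.read().strip(), FORMAT_STRING)
except (FileNotFoundError, ValueError):
    # First run, or an empty/corrupt cache file (e.g. after an interrupted write)
    pass

elapsed = current_time - last_time
if elapsed > MAX_TIME_JUMP:
    # Large forward jump: fail so Monit can act; the cache is left untouched
    sys.exit(1)

with open(CACHE_FILE, 'w') as f:
    f.write(current_time.strftime(FORMAT_STRING))
sys.exit(0)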
```
``` bash
chmod a+x /usr/local/bin/check_clock_jump.py
cat > /etc/monit/conf.d/check_clock_jump.conf <<EOF
check program check_clock_jump with path /usr/local/bin/check_clock_jump.py
    if status != 0
        then exec "/bin/systemctl reboot"
        as uid "root" and gid "root"
EOF
systemctl restart monit
```
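The comparison at the heart of the clock-jump check can be tried without waiting for a real jump. The following is a minimal sketch, not part of the installed script: the helper name `clock_jumped` and the simulated timestamps are assumptions made for illustration, while `FORMAT_STRING` and `MAX_TIME_JUMP` mirror the values used above.

``` python
import datetime

FORMAT_STRING = '%Y-%m-%d %H:%M:%S'
MAX_TIME_JUMP = datetime.timedelta(days=90)


def clock_jumped(last_time, current_time, max_jump=MAX_TIME_JUMP):
    """Return True when current_time is more than max_jump ahead of last_time.

    Hypothetical helper: it isolates the comparison the check script performs.
    """
    return (current_time - last_time) > max_jump


# Simulated timestamps; no real clock or cache file is involved.
last_seen = datetime.datetime.strptime('2019-06-01 12:00:00', FORMAT_STRING)
one_minute_later = last_seen + datetime.timedelta(minutes=1)
ninety_five_years_later = last_seen + datetime.timedelta(days=95 * 365)
one_day_earlier = last_seen - datetime.timedelta(days=1)

print(clock_jumped(last_seen, one_minute_later))         # an ordinary Monit cycle
print(clock_jumped(last_seen, ninety_five_years_later))  # the kernel-bug case
print(clock_jumped(last_seen, one_day_earlier))          # a backwards step
```

Only forward jumps larger than the threshold are flagged; a backwards step yields a negative difference and passes. Because the cached timestamp is refreshed only on a passing check, a board that was genuinely powered off for more than 90 days would presumably also be flagged on its first check, and deleting the cache file clears that state.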
### Monit monitor for ```card_busy_detect status: 0xe00``` kernel errors
``` bash
cat > /etc/monit/conf.d/card_busy_detect.conf <<EOF
# From docs: On startup the read position is set to the end of the file and Monit continues to scan to the end of the file on each cycle.
check file kernel path /var/log/kern.log
    if content = ".*card_busy_detect status: 0xe00.*"
        then exec "/bin/systemctl reboot"
        as uid "root" and gid "root"
EOF
systemctl restart monit
```
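Before relying on the rule, it can help to confirm what the pattern does and does not match. The sketch below applies the same pattern with Python's `re` module, on the assumption that this simple pattern (literal text plus `.*`) behaves the same way under Monit's POSIX regex engine. The sample log lines and the helper name `find_matches` are made up for illustration and are not taken from a real board.

``` python
import re

# Pattern copied from the Monit rule above.
PATTERN = re.compile(r'.*card_busy_detect status: 0xe00.*')

# Hypothetical log lines, invented for this example only.
sample_lines = [
    'kernel: mmc0: card_busy_detect status: 0xe00',
    'kernel: mmc0: new high speed SDHC card',
    'kernel: EXT4-fs (mmcblk0p1): re-mounted',
]


def find_matches(lines, pattern=PATTERN):
    """Return the lines in which the pattern is found."""
    return [line for line in lines if pattern.search(line)]


matches = find_matches(sample_lines)
for line in matches:
    print('would trigger reboot:', line)
```

Feeding real lines from `/var/log/kern.log` into `find_matches` in place of the invented samples shows which entries the pattern would catch.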