Running the SDP Prototype stand-alone¶
Installing the etcd operator¶
The SDP configuration database is implemented on top of etcd, a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines.
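This guide does not require you to talk to etcd directly, but for illustration the key-value model looks like this when exercised with the standard etcdctl client (v3 API assumed; the key name is chosen arbitrarily):
$ etcdctl put /my/key "some value"
OK
$ etcdctl get /my/key
/my/key
some value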
Before deploying the SDP itself, you need to install the etcd-operator Helm chart. This provides a convenient way to create and manage etcd clusters in other charts.
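Under the hood the operator watches for EtcdCluster resources: a chart asks for an etcd cluster by including a manifest roughly like the sketch below. The name, size and version shown here are illustrative only; the sdp-prototype chart contains the real definition.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: EtcdCluster
metadata:
  name: example-etcd
spec:
  size: 3            # number of etcd members to run
  version: "3.3.15"  # etcd version to deploy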
If you have a fresh install of Helm, you need to add the stable repository:
$ helm repo add stable https://kubernetes-charts.storage.googleapis.com/
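If the stable repository was added a while ago, it may also be worth refreshing the local chart index before continuing:
$ helm repo update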
The sdp-prototype charts directory contains a file called etcd-operator.yaml with settings for the chart. This turns off the parts which are not used (the backup and restore operators).
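The file in the repository is the one to use, but for orientation its settings are roughly of the following form (the exact keys depend on the chart version, so treat this as a sketch):
deployments:
  backupOperator: false   # backup operator is not used by the prototype
  restoreOperator: false  # restore operator is not used by the prototype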
First go to the charts directory:
$ cd [sdp-prototype]/charts
Then install the etcd-operator chart with:
$ helm install etcd stable/etcd-operator -f etcd-operator.yaml
If you now execute:
$ kubectl get pod --watch
You should eventually see a pod called etcd-etcd-operator-etcd-operator-[...]
in ‘Running’ state (yes, Helm is exceedingly redundant with its names). If it is
not there yet, wait a bit: if you try to go to the next step before this pod is
running, there is a chance the deployment will fail.
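If you prefer not to watch, you can block until the operator pod reports ready. Note that the label selector below is a guess and may need adjusting to whatever labels your chart version actually applies:
$ kubectl wait --for=condition=Ready pod -l app=etcd-operator-etcd-operator --timeout=300s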
Deploying the SDP¶
At this point you should be able to deploy the SDP. Install the sdp-prototype chart with the release name test:
$ helm install test sdp-prototype
You can again watch the deployment in progress using kubectl:
$ kubectl get pod --watch
Pods associated with Tango might go down a couple of times before they
start correctly; this seems to be normal. You can check the logs of
pods (copy the full name from the kubectl output) to verify that they
are doing okay:
$ kubectl logs test-sdp-prototype-lmc-[...] sdp-subarray-1
1|2020-08-06T15:17:41.369Z|INFO|MainThread|init_device|subarray.py#110|SDPSubarray|Initialising SDP Subarray: mid_sdp/elt/subarray_1
...
1|2020-08-06T15:17:41.377Z|INFO|MainThread|init_device|subarray.py#140|SDPSubarray|SDP Subarray initialised: mid_sdp/elt/subarray_1
$ kubectl logs test-sdp-prototype-processing-controller-[...]
...
1|2020-08-06T15:14:30.068Z|DEBUG|MainThread|main|processing_controller.py#192||Waiting...
$ kubectl logs test-sdp-prototype-helm-deploy-[...]
...
1|2020-08-06T15:14:31.662Z|INFO|MainThread|main|helm_deploy.py#146||Found 0 existing deployments.
If it looks like this, there is a good chance everything has been deployed correctly.
Testing it out¶
Connecting to the configuration database¶
By default the sdp-prototype chart deploys a ‘console’ pod which enables you to interact with the configuration database. You can start a shell in the pod by doing:
$ kubectl exec -it deploy/test-sdp-prototype-console -- /bin/bash
This will allow you to use the sdpcfg command:
# sdpcfg ls -R /
Keys with / prefix:
This correctly shows that the configuration is currently empty.
Starting a workflow¶
Assuming the configuration is prepared as explained in the previous section, we can now add a processing block to the configuration:
# sdpcfg process batch:test_dask:0.2.0
OK, pb_id = pb-sdpcfg-20200425-00000
The processing block is created with the /pb prefix in the configuration:
# sdpcfg ls values -R /pb
Keys with /pb prefix:
/pb/pb-sdpcfg-20200425-00000 = {
"dependencies": [],
"id": "pb-sdpcfg-20200425-00000",
"parameters": {},
"sbi_id": null,
"workflow": {
"id": "test_dask",
"type": "batch",
"version": "0.2.0"
}
}
/pb/pb-sdpcfg-20200425-00000/owner = {
"command": [
"testdask.py",
"pb-sdpcfg-20200425-00000"
],
"hostname": "proc-pb-sdpcfg-20200425-00000-workflow-7pfkl",
"pid": 1
}
/pb/pb-sdpcfg-20200425-00000/state = {
"resources_available": true,
"status": "RUNNING"
}
The processing block is detected by the processing controller, which
deploys the workflow. The workflow in turn deploys the execution engines
(in this case, Dask). The deployments are requested by creating entries
with the /deploy prefix in the configuration, where they are detected by
the Helm deployer, which actually makes the deployments:
# sdpcfg ls values -R /deploy
Keys with /deploy prefix:
/deploy/proc-pb-sdpcfg-20200425-00000-dask = {
"args": {
"chart": "stable/dask",
"values": {
"jupyter.enabled": "false",
"scheduler.serviceType": "ClusterIP",
"worker.replicas": 2
}
},
"id": "proc-pb-sdpcfg-20200425-00000-dask",
"type": "helm"
}
/deploy/proc-pb-sdpcfg-20200425-00000-workflow = {
"args": {
"chart": "workflow",
"values": {
"env.SDP_CONFIG_HOST": "test-sdp-prototype-etcd-client.default.svc.cluster.local",
"env.SDP_HELM_NAMESPACE": "sdp",
"pb_id": "pb-sdpcfg-20200425-00000",
"wf_image": "nexus.engageska-portugal.pt/sdp-prototype/workflow-test-dask:0.2.0"
}
},
"id": "proc-pb-sdpcfg-20200425-00000-workflow",
"type": "helm"
}
The deployments associated with the processing block have been created in the sdp namespace, so to view the created pods we have to ask as follows (on the host):
$ kubectl get pod -n sdp
NAME READY STATUS RESTARTS AGE
proc-pb-sdpcfg-20200425-00000-dask-scheduler-78b4974ddf-w4x8x 1/1 Running 0 4m41s
proc-pb-sdpcfg-20200425-00000-dask-worker-85584b4598-p6qpw 1/1 Running 0 4m41s
proc-pb-sdpcfg-20200425-00000-dask-worker-85584b4598-x2bh5 1/1 Running 0 4m41s
proc-pb-sdpcfg-20200425-00000-workflow-7pfkl 1/1 Running 0 4m46s
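Since the Helm deployer realises these deployments as Helm releases in the sdp namespace, you can also list them there (the release names should correspond to the deployment IDs shown above):
$ helm list -n sdp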
Cleaning up¶
Finally, let us remove the processing block from the configuration (in the SDP console shell):
# sdpcfg delete -R /pb/pb-sdpcfg-20200425-00000
/pb/pb-sdpcfg-20200425-00000
/pb/pb-sdpcfg-20200425-00000/owner
/pb/pb-sdpcfg-20200425-00000/state
OK
If you re-run the commands from the last section you will notice that this correctly causes all changes to the cluster configuration to be undone as well.
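If you want to double-check, re-running the listing in the SDP console shell should now show only the prefix header again:
# sdpcfg ls -R /deploy
Keys with /deploy prefix:
and on the host the pods in the sdp namespace should be terminating or gone:
$ kubectl get pod -n sdp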
Accessing Tango¶
By default the sdp-prototype chart installs the iTango shell pod from the tango-base chart. You can access it as follows:
$ kubectl exec -it itango-tango-base-sdp-prototype -- /venv/bin/itango3
You should be able to query the SDP Tango devices:
In [1]: lsdev
Device Alias Server Class
---------------------------------------- ------------------------- ------------------------- --------------------
mid_sdp/elt/master SdpMaster/1 SdpMaster
mid_sdp/elt/subarray_1 SdpSubarray/1 SdpSubarray
mid_sdp/elt/subarray_2 SdpSubarray/2 SdpSubarray
sys/access_control/1 TangoAccessControl/1 TangoAccessControl
sys/database/2 DataBaseds/2 DataBase
sys/rest/0 TangoRestServer/rest TangoRestServer
sys/tg_test/1 TangoTest/test TangoTest
This allows direct interaction with the devices, such as querying and changing attributes and issuing commands:
In [2]: d = DeviceProxy('mid_sdp/elt/subarray_1')
In [3]: d.state()
Out[3]: tango._tango.DevState.OFF
In [4]: d.On()
In [5]: d.state()
Out[5]: tango._tango.DevState.ON
In [6]: d.obsState
Out[6]: <obsState.EMPTY: 0>
In [7]: config_sbi = '''
...: {
...: "id": "sbi-mvp01-20200425-00000",
...: "max_length": 21600.0,
...: "scan_types": [
...: {
...: "id": "science",
...: "channels": [
...: {"count": 372, "start": 0, "stride": 2, "freq_min": 0.35e9, "freq_max": 0.358e9, "link_map": [[0,0], [200,1]]}
...: ]
...: }
...: ],
...: "processing_blocks": [
...: {
...: "id": "pb-mvp01-20200425-00000",
...: "workflow": {"type": "realtime", "id": "test_realtime", "version": "0.1.0"},
...: "parameters": {}
...: },
...: {
...: "id": "pb-mvp01-20200425-00001",
...: "workflow": {"type": "realtime", "id": "test_realtime", "version": "0.1.0"},
...: "parameters": {}
...: },
...: {
...: "id": "pb-mvp01-20200425-00002",
...: "workflow": {"type": "batch", "id": "test_batch", "version": "0.1.0"},
...: "parameters": {},
...: "dependencies": [
...: {"pb_id": "pb-mvp01-20200425-00000", "type": ["visibilities"]}
...: ]
...: },
...: {
...: "id": "pb-mvp01-20200425-00003",
...: "workflow": {"type": "batch", "id": "test_batch", "version": "0.1.0"},
...: "parameters": {},
...: "dependencies": [
...: {"pb_id": "pb-mvp01-20200425-00002", "type": ["calibration"]}
...: ]
...: }
...: ]
...: }
...: '''
In [8]: d.AssignResources(config_sbi)
In [9]: d.obsState
Out[9]: <obsState.IDLE: 0>
In [10]: d.Configure('{"scan_type": "science"}')
In [11]: d.obsState
Out[11]: <obsState.READY: 2>
In [12]: d.Scan('{"id": 1}')
In [13]: d.obsState
Out[13]: <obsState.SCANNING: 3>
In [14]: d.EndScan()
In [15]: d.obsState
Out[15]: <obsState.READY: 2>
In [16]: d.End()
In [17]: d.obsState
Out[17]: <obsState.IDLE: 0>
In [18]: d.ReleaseResources()
In [19]: d.obsState
Out[19]: <obsState.EMPTY: 0>
In [20]: d.Off()
In [21]: d.state()
Out[21]: tango._tango.DevState.OFF
Removing the SDP¶
To remove the SDP deployment from the cluster, do:
$ helm uninstall test
and to remove the etcd operator, do:
$ helm uninstall etcd
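To check that the pods are actually terminating, you can list them once more:
$ kubectl get pod
$ kubectl get pod -n sdp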
Troubleshooting¶
etcd doesn’t start (DNS problems)¶
Something that often happens on home set-ups is that test-sdp-prototype-etcd does not start, which means that quite a bit of the SDP system will not work. Try executing kubectl logs on the pod to get a log. You might see something like this as the last three lines:
... I | pkg/netutil: resolving sdp-prototype-etcd-9s4hbbmmvw.k8s-sdp-prototype-etcd.default.svc:2380 to 10.1.0.21:2380
... I | pkg/netutil: resolving sdp-prototype-etcd-9s4hbbmmvw.k8s-sdp-prototype-etcd.default.svc:2380 to 92.242.132.24:2380
... C | etcdmain: failed to resolve http://sdp-prototype-etcd-9s4hbbmmvw.sdp-prototype-etcd.default.svc:2380 to match --initial-cluster=sdp-prototype-etcd-9s4hbbmmvw=http://sdp-prototype-etcd-9s4hbbmmvw.sdp-prototype-etcd.default.svc:2380 ("http://10.1.0.21:2380"(resolved from "http://sdp-prototype-etcd-9s4hbbmmvw.sdp-prototype-etcd.default.svc:2380") != "http://92.242.132.24:2380"(resolved from "http://sdp-prototype-etcd-9s4hbbmmvw.sdp-prototype-etcd.default.svc:2380"))
This informs you that etcd tried to resolve its own address and got two different answers. Interestingly, the 92.242.132.24 address is not actually in-cluster, but from the Internet, and it re-appears if we attempt to ping a nonexistent DNS name:
$ ping does.not.exist
Pinging does.not.exist [92.242.132.24] with 32 bytes of data:
Reply from 92.242.132.24: bytes=32 time=25ms TTL=242
What is going on here is that your ISP has installed a DNS server that redirects unknown DNS names to a server showing a ‘helpful’ error message complete with a bunch of advertisements. For some reason this seems to cause a problem with Kubernetes’ internal DNS resolution.
How can this be prevented? Theoretically it should be enough to force the DNS server to one that does not have this problem (like Google’s 8.8.8.8 and 8.8.4.4 DNS servers), but that can be tricky to get working.
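One way to attempt this, assuming your cluster uses CoreDNS (most recent Kubernetes distributions do, though the exact ConfigMap and Deployment names can differ), is to point CoreDNS directly at the public resolvers instead of the node's resolver configuration. Edit the Corefile, replace the line forward . /etc/resolv.conf with forward . 8.8.8.8 8.8.4.4, and restart CoreDNS:
$ kubectl -n kube-system edit configmap coredns
$ kubectl -n kube-system rollout restart deployment coredns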
Alternatively you can simply restart the entire thing until it works. Unfortunately this is not quite as straightforward with etcd-operator, as it sets the restartPolicy to Never, which means that any etcd pod only gets one chance and then remains Failed forever. The quickest way out seems to be to delete the EtcdCluster object, then upgrade the chart in order to re-install it:
$ kubectl delete etcdcluster test-sdp-prototype-etcd
$ helm upgrade test sdp-prototype
This can generally be repeated until, by pure chance, the two DNS resolutions return the same result and etcd starts up.