[pipeline] Re-evaluate usage of artifacts vs caches vs external storage
GitLab provides two facilities to pass files/directories between jobs: caches and artifacts. Alternatively, custom external storage can be used (GitLab docs even mention NFS as an option: https://docs.gitlab.com/ee/ci/caching/#good-caching-practices).
There are two generic use cases for artifacts:
- pass intermediate artifacts between jobs in a single pipeline run (consider `image-latest` passed from the `image` job to the `check` job)
- preserve artifacts between different pipeline runs (consider `image-stable` passed from `yield` to further pipeline runs)
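The first use case could be sketched in `.gitlab-ci.yml` roughly as follows (the job names mirror the `image`/`check` example above; the scripts and the `expire_in` value are illustrative):

```yaml
stages:
  - build
  - test

image:
  stage: build
  script:
    - ./build.sh image-latest   # hypothetical script producing the intermediate artifact
  artifacts:
    paths:
      - image-latest
    expire_in: 1 day            # intermediate artifacts need not live long

check:
  stage: test
  needs: ["image"]              # starts as soon as "image" finishes and fetches its artifacts
  script:
    - ./check.sh image-latest   # hypothetical consumer of the artifact
```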
Both `cache` and `artifacts` have their own pros and cons:
Caches
PROS:
- Do not require expiration tuning or a clean-up policy: only one copy per key is preserved.
- Caches may not be re-downloaded if a sequence of jobs runs on the same instance.
CONS:
- Failures to update or download a cache are not treated as a job error, which leads to silent failures. As a workaround, a checksum/timestamp file passed via artifacts can be used to detect a stale or missing cache.
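The checksum workaround could look roughly like this (job names, scripts, and the cache key are hypothetical): the job that fills the cache records a checksum of the cached content as an artifact, and the consuming job verifies the cache against it, failing loudly on a mismatch.

```yaml
warm-cache:
  stage: build
  script:
    - ./generate-deps.sh deps/                              # hypothetical cache-filling step
    - find deps/ -type f | sort | xargs sha256sum > deps.sha256
  cache:
    key: deps
    paths:
      - deps/
  artifacts:
    paths:
      - deps.sha256

use-cache:
  stage: test
  cache:
    key: deps
    paths:
      - deps/
    policy: pull                 # only download the cache, never upload
  script:
    # fails the job if the cache came back stale, partial, or empty
    - sha256sum --check deps.sha256
    - ./run-tests.sh
```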
- Multiple parallel pipeline runs can interfere with each other's caches and, again, fail silently. As a workaround, pipeline id variables can be used in cache keys, but that pollutes cache storage. Alternatively, pipeline runs could be protected by pipeline-level locks (link)
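The pipeline-id-in-key workaround amounts to (the `deps` prefix and paths are illustrative):

```yaml
build:
  cache:
    # unique per pipeline run: concurrent pipelines cannot interfere,
    # but every run leaves behind its own cache copy, polluting storage
    key: "deps-$CI_PIPELINE_ID"
    paths:
      - deps/
  script:
    - ./build.sh
```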
- Caches can be accidentally deleted via the UI or by other means and are in general not reliable. (On the other hand, purging a cache is sometimes exactly what is needed to force re-generation.)
- Altering caches in a project's pipeline configuration is a leak of abstraction: changing any cache property forces the developer to re-specify all the caching details, including `cache:key` and `cache:paths`, which should arguably remain implementation details. It becomes even more complicated and inconvenient if a job uses a list of caches.
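The leak becomes visible as soon as a project wants to tweak a single property. In this sketch (key, paths, and job names are illustrative), changing only the cache policy still forces the project to copy the template's `key` and `paths` and keep them in sync with future template changes:

```yaml
# --- template.yml (maintained upstream) ---
build:
  cache:
    key: "deps-v2-$CI_COMMIT_REF_SLUG"   # implementation detail of the template
    paths:
      - deps/
      - .cache/pip

# --- .gitlab-ci.yml (project, wants read-only caching) ---
build:
  cache:
    key: "deps-v2-$CI_COMMIT_REF_SLUG"   # duplicated from the template
    paths:
      - deps/                            # duplicated
      - .cache/pip                       # duplicated
    policy: pull                         # the only intended change
```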
Artifacts
PROS:
- Fit more intuitively into the artifact-passing workflow for intermediate artifacts: passing artifacts between jobs can be tuned with `needs`/`dependencies` stanzas without knowing the details. At the same time, by default all artifacts from earlier stages are downloaded by subsequent jobs, so `needs`/`dependencies` should be used to limit traffic usage.
- Intermediate artifacts are backed up and provide a way to investigate issues after the fact.
CONS:
- Artifacts occupy storage space and consume quotas, and expiration-based clean-up does not work smoothly.
- Artifacts must be uploaded/downloaded between jobs even when they run on the same instance.
- Each job has only one `artifacts` entry, so a dependent job cannot request just a part of the artifacts.
- Preserving artifacts between pipeline runs requires the job artifacts API, which is a Premium feature.
- The job artifacts API selects artifacts by job name, and the same name may be used by multiple jobs when parent-child pipelines are involved. There is no way to specify the exact job to use.
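For reference, the API in question downloads artifacts of the latest successful job with a given name on a ref; a fetch from a previous pipeline could be sketched as a job like this (the job name `yield` follows the earlier example, `main` and the output path are placeholders):

```yaml
fetch-stable:
  script:
    # downloads artifacts of the latest successful "yield" job on "main";
    # if parent-child pipelines define several jobs named "yield",
    # there is no parameter to disambiguate which one is picked
    - >
      curl --fail --location
      --header "JOB-TOKEN: $CI_JOB_TOKEN"
      --output image-stable.zip
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/jobs/artifacts/main/download?job=yield"
```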