Weird Errors

RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

Unexpectedly, torch.cuda.is_available() may raise the following error:

RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

The implementation of torch.cuda.is_available() is as follows:

def _nvml_based_avail() -> bool:
    return os.getenv("PYTORCH_NVML_BASED_CUDA_CHECK") == "1"

def is_available() -> bool:
    r"""Return a bool indicating if CUDA is currently available."""
    if not _is_compiled():
        return False
    if _nvml_based_avail():
        # The user has set an env variable to request this availability check that attempts to avoid fork poisoning by
        # using NVML at the cost of a weaker CUDA availability assessment. Note that if NVML discovery/initialization
        # fails, this assessment falls back to the default CUDA Runtime API assessment (`cudaGetDeviceCount`)
        return device_count() > 0
    else:
        # The default availability inspection never throws and returns 0 if the driver is missing or can't
        # be initialized. This uses the CUDA Runtime API `cudaGetDeviceCount` which in turn initializes the CUDA Driver
        # API via `cuInit`
        return torch._C._cuda_getDeviceCount() > 0

By default it does not use torch.cuda.device_count() but calls torch._C._cuda_getDeviceCount() instead, which initializes the CUDA Driver API via cuInit and does not seem to release that CUDA memory afterwards. One way to work around this is to set PYTORCH_NVML_BASED_CUDA_CHECK=1, which switches to the weaker NVML-based check and avoids initializing CUDA in the current process.
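
A minimal sketch of the workaround, assuming the variable is set before anything triggers the first availability check (once the default check has run, the CUDA Driver API is already initialized in the process, so setting it later has no effect):

import os

# Must be set before the first call to torch.cuda.is_available() /
# torch.cuda.device_count(); safest is to set it before importing torch,
# or to export it in the shell that launches the program.
os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"

import torch

# With the env var set, is_available() takes the NVML-based branch
# (device_count() > 0) rather than calling torch._C._cuda_getDeviceCount()
# directly.
print(torch.cuda.is_available())

The same effect can be achieved from the shell with export PYTORCH_NVML_BASED_CUDA_CHECK=1 before launching the program. The trade-off, as the comments in the implementation note, is a weaker availability assessment, and if NVML discovery or initialization fails the check falls back to the default CUDA Runtime API path anyway.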