Troubleshooting

1 - Huawei CSI Service Issues

1.1 - Failed to Start the huawei-csi-node Service with Error Message /var/lib/iscsi is not a directory Reported

Symptom

The huawei-csi-node service cannot be started. When you run the kubectl describe daemonset huawei-csi-node -n huawei-csi command, error message “/var/lib/iscsi is not a directory” is reported.

Root Cause Analysis

The /var/lib/iscsi directory does not exist in the huawei-csi-node container.

Solution or Workaround

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Go to the directory where the Helm project is located. If the previous Helm project cannot be found, copy the helm directory in the component package to any directory on the master node. For details about the component package path, see Table 1.

  3. Go to the templates directory and find the huawei-csi-node.yaml file.

    cd templates
    
  4. Run the following command to edit huawei-csi-node.yaml. Set the path field under volumes > iscsi-dir > hostPath to /var/lib/iscsi (a sketch of the corrected entry follows the command), save the file, and exit.

    vi huawei-csi-node.yaml
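
    A sketch of the corrected volume entry is shown below. It is modeled on the volumes layout used elsewhere in this document; keep any other fields in your file (such as type) unchanged.

    volumes:
      ...
      - hostPath:
          path: /var/lib/iscsi
        name: iscsi-dir
      ...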
    
  5. Run the following command to upgrade the Helm chart. The upgrade command will update the Deployment, DaemonSet, and RBAC resources. In the following command, helm-huawei-csi indicates the custom chart name and huawei-csi indicates the custom namespace.

    helm upgrade helm-huawei-csi ./  -n huawei-csi -f values.yaml
    

    The following is an example of the command output.

    Release "helm-huawei-csi" has been upgraded. Happy Helming!
    NAME: helm-huawei-csi
    LAST DEPLOYED: Thu Jun  9 07:58:15 2022
    NAMESPACE: huawei-csi
    STATUS: deployed
    REVISION: 2
    TEST SUITE: None
    

1.2 - Huawei CSI Services Fail to Be Started and Error Message '/etc/localtime is not a file' Is Displayed

Symptom

During the installation and deployment of CSI, a Pod fails to run and stays in the ContainerCreating state, and the alarm “/etc/localtime is not a file” is generated for the Pod.

Root Cause Analysis

When the container mounts the /etc/localtime file on the host, the type is incorrectly identified. As a result, the container fails to mount the /etc/localtime file on the host and the Pod cannot run.

Procedure

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to check the running status of the Pod of the CSI services.

    kubectl get pod -n huawei-csi
    

    The following is an example of the command output. huawei-csi indicates the namespace where the CSI services are deployed.

    NAME                                     READY   STATUS               RESTARTS   AGE
    huawei-csi-controller-6dfcc4b79f-9vjtq   0/9     ContainerCreating    0          24m
    huawei-csi-controller-6dfcc4b79f-csphc   0/9     ContainerCreating    0          24m
    huawei-csi-node-g6f4k                    0/3     ContainerCreating    0          20m
    huawei-csi-node-tqs87                    0/3     ContainerCreating    0          20m
    
  3. Run the following command to check the Events parameter of the container.

    kubectl describe pod huawei-csi-controller-6dfcc4b79f-9vjtq -n huawei-csi
    

    The following is an example of the command output. In the command, huawei-csi-controller-6dfcc4b79f-9vjtq indicates the name of the Pod in the ContainerCreating state found in 2, and huawei-csi indicates the namespace to which the Pod belongs.

    ...
    Events:
      Type     Reason       Age                From               Message
      ----     ------       ----               ----               -------
      Normal   Scheduled    96s                default-scheduler  Successfully assigned huawei-csi/huawei-csi-controller-6dfcc4b79f-9vjtq to node1
      Warning  FailedMount  33s (x8 over 96s)  kubelet            MountVolume.SetUp failed for volume "host-time" : hostPath type check failed: /etc/localtime is not a file
    
  4. Run the cd /helm/esdk/templates command to go to the CSI installation package path. For the path, see Table 1.

  5. Take the huawei-csi-controller.yaml file as an example. Run the following command to view the file content.

    vi huawei-csi-controller.yaml
    

    Find the host-time configuration item under volumes and delete the type: File line (the result after the deletion is shown below the example). Perform the same operation on the huawei-csi-node.yaml file in the templates directory if it contains the same configuration item.

    ...
    ...
    volumes:
      - hostPath:
          path: /var/log/
          type: Directory
        name: log
      - hostPath:
          path: /etc/localtime
          type: File
        name: host-time
    ...
    ...
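
    After the type: File line is deleted, the host-time entry should look like the following.

    ...
    volumes:
      - hostPath:
          path: /var/log/
          type: Directory
        name: log
      - hostPath:
          path: /etc/localtime
        name: host-time
    ...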
    
  6. Uninstall and reinstall the service by referring to Uninstalling Huawei CSI Using Helm.

  7. Run the following command to check whether the Pods of the Huawei CSI services are in the Running state.

    kubectl get pod -n huawei-csi
    

    The following is an example of the command output.

    NAME                                     READY   STATUS    RESTARTS   AGE
    huawei-csi-controller-6dfcc4b79f-9vjts   9/9     Running   0          24m
    huawei-csi-controller-6dfcc4b79f-csphb   9/9     Running   0          24m
    huawei-csi-node-g6f41                    3/3     Running   0          20m
    huawei-csi-node-tqs85                    3/3     Running   0          20m
    

1.3 - Failed to Start huawei-csi Services with the Status Displayed as InvalidImageName

Symptom

The huawei-csi services (huawei-csi-controller or huawei-csi-node) cannot be started. After the kubectl get pod -A | grep huawei command is executed, the command output shows that the service status is InvalidImageName.

kubectl get pod -A | grep huawei

The following is an example of the command output.

huawei-csi     huawei-csi-controller-fd5f97768-qlldc     6/9     InvalidImageName     0          16s
huawei-csi     huawei-csi-node-25txd                     2/3     InvalidImageName     0          15s

Root Cause Analysis

In the .yaml configuration files of the controller and node, the Huawei CSI image version number is incorrect. For example:

        ...
        - name: huawei-csi-driver
          image: huawei-csi:4.5.0
        ...
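
To check which image is currently configured for the huawei-csi-driver container, you can query the resources directly. The following commands are a quick check; they assume the services are deployed in the huawei-csi namespace, as in the rest of this document.

kubectl get daemonset huawei-csi-node -n huawei-csi -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
kubectl get deployment huawei-csi-controller -n huawei-csi -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'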

Solution or Workaround

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to modify the configuration file of the huawei-csi-node service. Press I or Insert to enter the insert mode and modify related parameters. After the modification is complete, press Esc and enter :wq! to save the modification.

    kubectl edit daemonset huawei-csi-node -o yaml -n=huawei-csi
    

    • In the huawei-csi-driver container of the sample .yaml file, change image to the correct Huawei CSI image, for example, huawei-csi:4.5.0.
    containers:
      ...
      - name: huawei-csi-driver
        image: huawei-csi:4.5.0
    
  3. Run the following command to modify the configuration file of the huawei-csi-controller service. Press I or Insert to enter the insert mode and modify related parameters. After the modification is complete, press Esc and enter :wq! to save the modification.

    kubectl edit deployment huawei-csi-controller -o yaml -n=huawei-csi
    

    • In the huawei-csi-driver container of the sample .yaml file, change image to the correct Huawei CSI image, for example, huawei-csi:4.5.0.
    containers:
      ...
      - name: huawei-csi-driver
        image: huawei-csi:4.5.0
    
  4. Wait until the huawei-csi-node and huawei-csi-controller services are started.

  5. Run the following command to check whether the huawei-csi services are started.

    kubectl get pod -A  | grep huawei
    

    The following is an example of the command output. If the Pod status is Running, the services are started successfully.

    huawei-csi   huawei-csi-controller-58799449cf-zvhmv   9/9     Running       0          2m29s
    huawei-csi   huawei-csi-node-7fxh6                    3/3     Running       0          12m
    

2 - Storage Backend Issues

2.1 - A webhook Fails to Be Called When the oceanctl Tool Is Used to Manage Backends

Symptom

After the webhook configuration is changed, for example, the value of the webhookPort parameter is changed, an error indicating that a webhook failed to be called is reported when the oceanctl tool is used to manage backends.

Root Cause Analysis

After the webhook configuration changes, the validatingwebhookconfiguration resource becomes invalid.

Solution or Workaround

  1. Run the following command to delete the validatingwebhookconfiguration resource.

    kubectl delete validatingwebhookconfiguration storage-backend-controller.xuanwu.huawei.io
    
  2. Run the following commands to restart CSI Controller. Use the --replicas parameter to set the number of CSI Controller copies to be restored. In the following example, the number of copies to be restored is 1. Change it based on site requirements.

    kubectl scale deployment huawei-csi-controller -n huawei-csi --replicas=0 
    kubectl scale deployment huawei-csi-controller -n huawei-csi --replicas=1
    
  3. Run the following command to check whether CSI Controller is successfully started.

    kubectl get pod -n huawei-csi
    

    The following is an example of the command output. If the Pod status is Running, Controller is successfully started.

    NAME                                     READY   STATUS    RESTARTS   AGE
    huawei-csi-controller-58d5b6b978-s2dsq   9/9     Running   0          19s
    huawei-csi-node-dt6nd                    3/3     Running   0          77m
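
    After CSI Controller restarts, it re-registers the webhook. You can optionally confirm that the validatingwebhookconfiguration resource has been re-created by running the following command (this is the same resource that was deleted in 1).

    kubectl get validatingwebhookconfiguration storage-backend-controller.xuanwu.huawei.io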
    

2.2 - A Backend Fails to Be Created Using the oceanctl Tool and Error Message `context deadline exceeded` Is Displayed

Symptom

A user fails to create a storage backend using the oceanctl tool, and “failed to call webhook: xxx :context deadline exceeded; error: exit status 1” is displayed on the console.

Root Cause Analysis

When a storage backend is created, the webhook service provided by CSI is invoked to verify the connectivity with the storage management network and the storage account and password. The possible causes are as follows:

  • Huawei CSI fails to verify the connectivity of the storage management network.
  • The communication between kube-apiserver and CSI webhook is abnormal.

Huawei CSI Fails to Verify the Connectivity of the Storage Management Network

Perform the following steps to check whether Huawei CSI fails to verify the connectivity of the storage management network.

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to obtain CSI service information. huawei-csi indicates the namespace where the CSI services are deployed.

    kubectl get pod -n huawei-csi -owide
    

    The following is an example of the command output.

    NAME                        READY   STATUS    RESTARTS   AGE   IP         NODE       NOMINATED NODE   READINESS GATES
    huawei-csi-controller-xxx   9/9     Running   0          19h   host-ip1   host-1     <none>           <none>
    huawei-csi-node-mnqbz       3/3     Running   0          19h   host-ip1   host-1     <none>           <none>
    
  3. Log in to the node where huawei-csi-controller resides, for example, host-1 in 2.

  4. Go to the /var/log/huawei directory.

    cd /var/log/huawei
    
  5. View the storage-backend-controller log. The following uses the storage connection timeout as an example.

    tail -n 1000 storage-backend-controller
    

    The following is a log example.

    2024-01-01 06:30:44.280661 1 [INFO]: Try to login https://192.168.129.155:8088/deviceManager/rest
    2024-01-01 06:31:44.281626 1 [ERROR]: Send request method: POST, Url: https://192.168.129.155:8088/deviceManager/rest/xx/sessions, error: Post "https://192.168.129.155:8088/deviceManager/rest/xx/sessions": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    2024-01-01 06:31:44.281793 1 [WARNING]: Login https://192.168.129.155:8088/deviceManager/rest error due to connection failure, gonna try another Url
    2024-01-01 06:31:44.291668 1 [INFO]: Finished validateCreate huawei-csi/backend-test.
    2024-01-01 06:31:44.291799 1 [ERROR]: Failed to validate StorageBackendClaim, error: unconnected
    
  6. If the log contains information about login timeout, login failure, or long request duration, check the connectivity between the host machine and the storage or the network status.
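
    A quick reachability check from the host where huawei-csi-controller runs can help narrow this down. The following command is a sketch; the address is taken from the log example above and should be replaced with your storage management IP address and port.

    curl -kv --connect-timeout 10 https://192.168.129.155:8088/deviceManager/rest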

  7. If no request is recorded in the log, the communication between kube-apiserver and CSI webhook is abnormal.

Abnormal Communication Between kube-apiserver and CSI Webhook

Contact the Kubernetes platform administrator to check the network between kube-apiserver and CSI webhook. For example, if kube-apiserver has an HTTPS proxy, the CSI webhook service may fail to be accessed.

In the temporary workaround, the webhook resource will be deleted. This resource is used to check whether the entered account information is correct and whether the connection to the storage can be set up when a storage backend is created. Therefore, deleting this resource affects only the verification during backend creation and does not affect other functions. Pay attention to the following:

  • Ensure that the host machine where the huawei-csi-controller service is located can properly communicate with the storage.
  • Ensure that the entered account and password are correct.
  1. Run the following command to view CSI webhook information.

    kubectl get validatingwebhookconfiguration storage-backend-controller.xuanwu.huawei.io
    

    The following is an example of the command output.

    NAME                                          WEBHOOKS   AGE
    storage-backend-controller.xuanwu.huawei.io   1          4d22h
    
  2. Contact the Kubernetes platform administrator to check whether the communication between kube-apiserver and CSI webhook is abnormal.

  3. Perform the following temporary workaround: Run the following command to delete the webhook.

    kubectl delete validatingwebhookconfiguration storage-backend-controller.xuanwu.huawei.io
    
  4. Create a storage backend. For details, see Managing Storage Backends.

  5. If the communication between kube-apiserver and CSI webhook is restored, you need to reconstruct the webhook. In this case, run the following command to restart CSI Controller and restore the number of CSI Controller copies by specifying --replicas. In the following example, the number is restored to 1. Change it based on actual requirements.

    Change the number of copies to 0 first.

    kubectl scale deployment huawei-csi-controller -n huawei-csi --replicas=0 
    

    Then restore the number of copies to the original number.

    kubectl scale deployment huawei-csi-controller -n huawei-csi --replicas=1
    

2.3 - An Account Is Locked After the Password Is Updated on the Storage Device

Symptom

After a user changes the password on the storage device, the account is locked.

Root Cause Analysis

CSI logs in to the storage device using the account and password configured for the storage backend. After the password is changed on the storage device, the login fails and CSI keeps retrying. Take OceanStor Dorado as an example: the default login policy locks an account after three consecutive password verification failures. Therefore, when CSI retries more than three times, the account is locked.

Solution or Workaround

  1. If the backend account is admin, run the following command to set the number of huawei-csi-controller service copies to 0. If an account other than admin is used, skip this step.

    kubectl scale deployment huawei-csi-controller -n huawei-csi --replicas=0
    
  2. Log in to the storage device as user admin and modify the login policy. Take OceanStor Dorado as an example. On DeviceManager, choose Settings > User and Security > Security Policies > Login Policy, click Modify, and disable Account Lockout.

  3. If the backend account is admin, run the following command to restore the number of CSI Controller copies by specifying --replicas. In the following example, the number of copies is restored to 1. Change it based on site requirements. If an account other than admin is used, skip this step.

    kubectl scale deployment huawei-csi-controller -n huawei-csi --replicas=1
    
  4. Use the oceanctl tool to change the storage backend password. For details about how to change the backend password, see Updating a Storage Backend.

  5. Log in to the storage device as user admin and modify the login policy. Take OceanStor Dorado as an example. On DeviceManager, choose Settings > User and Security > Security Policies > Login Policy, click Modify, and enable Account Lockout.

3 - PVC Issues

3.1 - When a PVC Is Created, the PVC Is in the Pending State

Symptom

A PVC is created. After a period of time, the PVC is still in the Pending state.

Root Cause Analysis

Cause 1: A StorageClass with the specified name is not created in advance. As a result, Kubernetes cannot find the specified StorageClass name when a PVC is created.

Cause 2: The storage pool capability does not match the StorageClass capability. As a result, huawei-csi fails to select a storage pool.

Cause 3: An error code (for example, 50331651) is returned by a RESTful interface of the storage. As a result, huawei-csi fails to create a PVC.

Cause 4: The storage does not return a response within the timeout period set by huawei-csi. As a result, huawei-csi returns a timeout error to Kubernetes.

Cause 5: Other causes.

Solution or Workaround

When a PVC is created, if the PVC is in the Pending state, you need to take different measures according to the following causes.

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to view details about the PVC.

    kubectl describe pvc mypvc
    
  3. Perform the corresponding operation according to the Events information in the detailed PVC information.

    • If the PVC is in the Pending state due to cause 1, perform the following steps.

      Events:
        Type     Reason              Age                  From                         Message
        ----     ------              ----                 ----                         -------
        Warning  ProvisioningFailed  0s (x15 over 3m24s)  persistentvolume-controller  storageclass.storage.k8s.io "mysc" not found
      
      1. Delete the PVC.
      2. Create a StorageClass. For details, see StorageClass Configuration Examples in Typical Dynamic Volume Provisioning Scenarios.
      3. Create a PVC. For details, see PVC Parameters for Dynamic Volume Provisioning.
    • If the PVC is in the Pending state due to cause 2, perform the following steps.

      Events:
        Type     Reason                Age                From                                                                                       Message
        ----     ------                ----               ----                                                                                       -------
        Normal   Provisioning          63s (x3 over 64s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  External provisioner is provisioning volume for claim "default/mypvc"
        Warning  ProvisioningFailed    63s (x3 over 64s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  failed to provision volume with StorageClass "mysc": rpc error: code = Internal desc = failed to select pool, the capability filter failed, error: failed to select pool, the final filter field: replication, parameters map[allocType:thin replication:True size:1099511627776 volumeType:lun]. please check your storage class
      
      1. Delete the PVC.
      2. Delete the StorageClass.
      3. Modify the StorageClass.yaml file based on the Events information.
      4. Create a StorageClass. For details, see StorageClass Configuration Examples in Typical Dynamic Volume Provisioning Scenarios.
      5. Create a PVC. For details, see PVC Parameters for Dynamic Volume Provisioning.
    • If the PVC is in the Pending state due to cause 3, contact Huawei engineers.

      Events:
        Type     Reason                Age                From                                                                                       Message
        ----     ------                ----               ----                                                                                       -------
        Normal   Provisioning          63s (x4 over 68s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  External provisioner is provisioning volume for claim "default/mypvc"
        Warning  ProvisioningFailed    62s (x4 over 68s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  failed to provision volume with StorageClass "mysc": rpc error: code = Internal desc = Create volume map[ALLOCTYPE:1 CAPACITY:20 DESCRIPTION:Created from Kubernetes CSI NAME:pvc-63ebfda5-4cf0-458e-83bd-ecc PARENTID:0] error: 50331651
      
    • If the PVC is in the Pending state due to cause 4, perform the following steps.

      Events:
        Type     Reason                Age                From                                                                                       Message
        ----     ------                ----               ----                                                                                       -------
        Normal   Provisioning          63s (x3 over 52s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  External provisioner is provisioning volume for claim "default/mypvc"
        Warning  ProvisioningFailed    63s (x3 over 52s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  failed to provision volume with StorageClass "mysc": rpc error: code = Internal desc = context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      
      1. Wait for 10 minutes and check the PVC details again by referring to this section.
      2. If it is still in the Pending state, contact Huawei engineers.
    • If the PVC is in the Pending state due to cause 5, contact Huawei engineers.
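
For causes 1 and 2 above, the key point is that the storageClassName in the PVC must refer to a StorageClass that exists and whose parameters the selected backend can satisfy. The following minimal sketch illustrates the relationship only: the provisioner name csi.huawei.com and the parameters volumeType and allocType are taken from examples elsewhere in this document, and all other values are illustrative. For authoritative examples, see StorageClass Configuration Examples in Typical Dynamic Volume Provisioning Scenarios.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysc
provisioner: csi.huawei.com
parameters:
  volumeType: lun
  allocType: thin
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: mysc
  resources:
    requests:
      storage: 10Gi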

3.2 - Before a PVC Is Deleted, the PVC Is in the Pending State

Symptom

A PVC to be deleted is in the Pending state.

Root Cause Analysis

Cause 1: A StorageClass with the specified name is not created in advance. As a result, Kubernetes cannot find the specified StorageClass name when a PVC is created.

Cause 2: The storage pool capability does not match the StorageClass capability. As a result, huawei-csi fails to select a storage pool.

Cause 3: An error code (for example, 50331651) is returned by a RESTful interface of the storage. As a result, huawei-csi fails to create a PVC.

Cause 4: The storage does not return a response within the timeout period set by huawei-csi. As a result, huawei-csi returns a timeout error to Kubernetes.

Cause 5: Other causes.

Solution or Workaround

To delete a PVC in the Pending state, you need to take different measures according to the following causes.

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to view details about the PVC.

    kubectl describe pvc mypvc
    
  3. Perform the corresponding operation according to the Events information in the detailed PVC information.

    • If the PVC is in the Pending state due to cause 1, run the kubectl delete pvc mypvc command to delete the PVC.

      Events:
        Type     Reason              Age                  From                         Message
        ----     ------              ----                 ----                         -------
        Warning  ProvisioningFailed  0s (x15 over 3m24s)  persistentvolume-controller  storageclass.storage.k8s.io "mysc" not found
      
    • If the PVC is in the Pending state due to cause 2, run the kubectl delete pvc mypvc command to delete the PVC.

      Events:
        Type     Reason                Age                From                                                                                       Message
        ----     ------                ----               ----                                                                                       -------
        Normal   Provisioning          63s (x3 over 64s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  External provisioner is provisioning volume for claim "default/mypvc"
        Warning  ProvisioningFailed    63s (x3 over 64s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  failed to provision volume with StorageClass "mysc": rpc error: code = Internal desc = failed to select pool, the capability filter failed, error: failed to select pool, the final filter field: replication, parameters map[allocType:thin replication:True size:1099511627776 volumeType:lun]. please check your storage class
      
    • If the PVC is in the Pending state due to cause 3, run the kubectl delete pvc mypvc command to delete the PVC.

      Events:
        Type     Reason                Age                From                                                                                       Message
        ----     ------                ----               ----                                                                                       -------
        Normal   Provisioning          63s (x4 over 68s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  External provisioner is provisioning volume for claim "default/mypvc"
        Warning  ProvisioningFailed    62s (x4 over 68s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  failed to provision volume with StorageClass "mysc": rpc error: code = Internal desc = Create volume map[ALLOCTYPE:1 CAPACITY:20 DESCRIPTION:Created from Kubernetes CSI NAME:pvc-63ebfda5-4cf0-458e-83bd-ecc PARENTID:0] error: 50331651
      
    • If the PVC is in the Pending state due to cause 4, contact Huawei engineers.

      Events:
        Type     Reason                Age                From                                                                                       Message
        ----     ------                ----               ----                                                                                       -------
        Normal   Provisioning          63s (x3 over 52s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  External provisioner is provisioning volume for claim "default/mypvc"
        Warning  ProvisioningFailed    63s (x3 over 52s)  csi.huawei.com_huawei-csi-controller-b59577886-qqzm8_58533e4a-884c-4c7f-92c3-6e8a7b327515  failed to provision volume with StorageClass "mysc": rpc error: code = Internal desc = context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      
    • If the PVC is in the Pending state due to cause 5, contact Huawei engineers.

3.3 - Failed to Expand the Capacity of a Generic Ephemeral Volume

Symptom

In an environment where the Kubernetes version is earlier than 1.25, the capacity of a generic ephemeral volume of the LUN type fails to be expanded. The system displays a message indicating that the PV capacity has been expanded, but the PVC capacity fails to be updated.

Root Cause Analysis

This problem is caused by a Kubernetes bug, which has been resolved in Kubernetes 1.25.

3.4 - Failed to Expand the PVC Capacity Because the Target Capacity Exceeds the Storage Pool Capacity

Symptom

In a Kubernetes environment earlier than 1.23, PVC capacity expansion fails when the target capacity exceeds the storage pool capacity.

Root Cause Analysis

This is a known issue in the Kubernetes community. For details, see Recovering from Failure when Expanding Volumes.

Solution or Workaround

For details, see Recovering from Failure when Expanding Volumes.

4 - Pod Issues

4.1 - After a Worker Node in the Cluster Breaks Down and Recovers, Pod Failover Is Complete but the Source Host Where the Pod Resides Has Residual Drive Letters

Symptom

A Pod is running on worker node A, and an external block device is mounted to the Pod through CSI. After worker node A is powered off abnormally, the Kubernetes platform detects that the node is faulty and switches the Pod to worker node B. After worker node A recovers, the drive letters on worker node A change from normal to faulty.

Environment Configuration

Kubernetes version: 1.18 or later

Storage type: block storage

Root Cause Analysis

After worker node A recovers, Kubernetes initiates an unmapping operation on the storage, but does not initiate a drive letter removal operation on the host. After Kubernetes completes the unmapping, residual drive letters exist on worker node A.

Solution or Workaround

Currently, you can only manually clear the residual drive letters on the host. Alternatively, restart the host and use the disk scanning mechanism during the restart to clear the residual drive letters. The specific method is as follows:

  1. Check the residual drive letters on the host.

    1. Run the following command to check whether a DM multipathing device with abnormal multipathing status exists.

      multipath -ll
      

      The following is an example of the command output. The path status is failed faulty running, the corresponding DM multipathing device is dm-12, and the associated SCSI disks are sdi and sdj. If multiple paths are configured, multiple SCSI disks exist. Record these SCSI disks.

      mpathb (3618cf24100f8f457014a764c000001f6) dm-12 HUAWEI  ,XSG1            
      size=100G features='0' hwhandler='0' wp=rw
      `-+- policy='service-time 0' prio=-1 status=active
        |- 39:0:0:1        sdi 8:48  failed faulty running
        `- 38:0:0:1        sdj 8:64  failed faulty running
      
      • If yes, go to step 1.2.
      • If no, no further action is required.
    2. Run the following command to check whether the residual DM multipathing device is readable.

      dd if=/dev/dm-12 of=/dev/null count=1 bs=1M iflag=direct
      

      The following is an example of the command output. If the returned result is Input/output error and the read data is 0 bytes (0 B) copied, the device is unreadable. dm-xx indicates the device ID obtained in step 1.1.

      dd: error reading '/dev/dm-12': Input/output error
      0+0 records in
      0+0 records out
      0 bytes (0 B) copied, 0.0236862 s, 0.0 kB/s
      
      • If yes, record the residual dm-xx device and associated disk IDs (for details, see step 1.1) and perform the clearing operation.
      • If the command execution is suspended, go to step 1.3.
      • In other cases, contact technical support engineers.
    3. Log in to the node again in another window.

      1. Run the following command to view the suspended process.

        ps -ef | grep dm-12 | grep -w dd
        

        The following is an example of the command output.

        root     21725  9748  0 10:33 pts/10   00:00:00 dd if=/dev/dm-12 of=/dev/null count=1 bs=10M iflag=direct
        
      2. Run the following command to kill the suspended process. Replace pid with the process ID found in the previous step.

        kill -9 pid
        
      3. Record the residual dm-xx device and associated disk IDs (for details, see step 1.1) and perform the clearing operation.

  2. Clear the residual drive letters on the host.

    1. Run the following command to delete residual multipathing aggregation device information according to the DM multipathing device obtained in step 1.

      multipath -f /dev/dm-12
      

      If an error is reported, contact technical support engineers.

    2. Run the following command to clear the residual SCSI disks according to the drive letters of the residual disks obtained in step 1.

      echo 1 > /sys/block/xxxx/device/delete
      

      When multiple paths are configured, clear the residual disks based on their drive letters. In this example, the residual drive letters are sdi and sdj.

      echo 1 > /sys/block/sdi/device/delete
      echo 1 > /sys/block/sdj/device/delete
      

      If an error is reported, contact technical support engineers.

    3. Check whether the DM multipathing device and SCSI disk information has been cleared.

      Run the following commands in sequence to query the multipathing and disk information. If the residual dm-12 device and SCSI disks sdi and sdj are cleared, the clearing is complete.

      1. View multipathing information.

        multipath -ll
        

        The following is an example of the command output. The residual dm-12 device is cleared.

        mpathb (3618cf24100f8f457014a764c000001f6) dm-3 HUAWEI  ,XSG1            
        size=100G features='0' hwhandler='0' wp=rw
        `-+- policy='service-time 0' prio=-1 status=active
          |- 39:0:0:1        sdd 8:48  active ready running
          `- 38:0:0:1        sde 8:64  active ready running
        mpathn (3618cf24100f8f457315a764c000001f6) dm-5 HUAWEI  ,XSG1            
        size=100G features='0' hwhandler='0' wp=rw
        `-+- policy='service-time 0' prio=-1 status=active
          |- 39:0:0:2        sdc 8:32  active ready running
          `- 38:0:0:2        sdb 8:16  active ready running
        
      2. View device information.

        ls -l /sys/block/
        

        The following is an example of the command output. SCSI disks sdi and sdj are cleared.

        total 0
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 dm-0 -> ../devices/virtual/block/dm-0
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 dm-1 -> ../devices/virtual/block/dm-1
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 dm-2 -> ../devices/virtual/block/dm-2
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 dm-3 -> ../devices/virtual/block/dm-3
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 sdb -> ../devices/platform/host35/session2/target35:0:0/35:0:0:1/block/sdb
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 sdc -> ../devices/platform/host34/target34:65535:5692/34:65535:5692:0/block/sdc
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 sdd -> ../devices/platform/host39/session6/target39:0:0/39:0:0:1/block/sdd
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 sde -> ../devices/platform/host38/session5/target38:0:0/38:0:0:1/block/sde
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 sdh -> ../devices/platform/host39/session6/target39:0:0/39:0:0:3/block/sdh
        lrwxrwxrwx 1 root root 0 Aug 11 19:56 sdi -> ../devices/platform/host38/session5/target38:0:0/38:0:0:3/block/sdi
        
      3. View disk information.

        ls -l /dev/disk/by-id/
        

        The following is an example of the command output. SCSI disks sdi and sdj are cleared.

        total 0
        lrwxrwxrwx 1 root root 10 Aug 11 19:57 dm-name-mpathb -> ../../dm-3
        lrwxrwxrwx 1 root root 10 Aug 11 19:58 dm-name-mpathn -> ../../dm-5
        lrwxrwxrwx 1 root root 10 Aug 11 19:57 dm-uuid-mpath-3618cf24100f8f457014a764c000001f6 -> ../../dm-3
        lrwxrwxrwx 1 root root 10 Aug 11 19:58 dm-uuid-mpath-3618cf24100f8f457315a764c000001f6 -> ../../dm-5
        lrwxrwxrwx 1 root root  9 Aug 11 19:57 scsi-3618cf24100f8f457014a764c000001f6 -> ../../sdd
        lrwxrwxrwx 1 root root  9 Aug 11 19:57 scsi-3618cf24100f8f45712345678000103e8 -> ../../sdi
        lrwxrwxrwx 1 root root  9 Aug  3 15:17 scsi-3648435a10058805278654321ffffffff -> ../../sdb
        lrwxrwxrwx 1 root root  9 Aug  2 14:49 scsi-368886030000020aff44cc0d060c987f1 -> ../../sdc
        lrwxrwxrwx 1 root root  9 Aug 11 19:57 wwn-0x618cf24100f8f457014a764c000001f6 -> ../../sdd
        lrwxrwxrwx 1 root root  9 Aug 11 19:57 wwn-0x618cf24100f8f45712345678000103e8 -> ../../sdi
        lrwxrwxrwx 1 root root  9 Aug  3 15:17 wwn-0x648435a10058805278654321ffffffff -> ../../sdb
        lrwxrwxrwx 1 root root  9 Aug  2 14:49 wwn-0x68886030000020aff44cc0d060c987f1 -> ../../sdc
        

4.2 - When a Pod Is Created, the Pod Is in the ContainerCreating State

Symptom

A Pod is created. After a period of time, the Pod is still in the ContainerCreating state. Check the log information (for details, see Viewing Huawei CSI Logs). The error message “Fibre Channel volume device not found” is displayed.

Root Cause Analysis

This problem occurs because residual disks exist on the host node. As a result, disks fail to be found when a Pod is created next time.

Solution or Workaround

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to query information about the node where the Pod resides.

    kubectl get pod -o wide
    

    The following is an example of the command output.

    NAME        READY   STATUS              RESTARTS   AGE     IP             NODE   NOMINATED NODE   READINESS GATES
    mypod       0/1     ContainerCreating   0          51s     10.244.1.224   node1  <none>           <none>
    
  3. Delete the Pod.

  4. Use a remote access tool, such as PuTTY, to log in to the node1 node in the Kubernetes cluster through the management IP address. node1 indicates the node queried in 2.

  5. Clear the residual drive letters. For details, see Solution or Workaround in 4.1.

4.3 - A Pod Is in the ContainerCreating State for a Long Time When It Is Being Created

Symptom

When a Pod is being created, it stays in the ContainerCreating state for a long time. Check the huawei-csi-node log (for details, see Viewing Huawei CSI Logs): no Pod creation information is recorded in it. After the kubectl get volumeattachment command is executed, the name of the PV used by the Pod is not displayed in the PV column. After a long period of time (more than ten minutes), the Pod is created normally and its status changes to Running.

Root Cause Analysis

The kube-controller-manager component of Kubernetes is abnormal.

Solution or Workaround

Contact container platform engineers to rectify the fault.

4.4 - A Pod Fails to Be Created and the Log Shows That the Execution of the mount Command Times Out

Symptom

When a Pod is being created, the Pod remains in the ContainerCreating state. In this case, check the log information of huawei-csi-node (for details, see Viewing Huawei CSI Logs). The log shows that the execution of the mount command times out.

Root Cause Analysis

Cause 1: The configured service IP address is disconnected. As a result, the mount command execution times out and fails.

Cause 2: For some operating systems, such as Kylin V10 SP1 and SP2, it takes a long time to run the mount command in a container using NFSv3. As a result, the mount command may time out and error message “error: exit status 255” is displayed. The possible cause is that the value of LimitNOFILE of container runtime containerd is too large (over 1 billion).

Cause 3: The mounting may fail due to network problems. The default mounting timeout period of CSI is 30 seconds. If the mounting still fails after 30 seconds, logs show that the execution of the mount command times out.

Solution or Workaround

  1. Run the ping command to check whether the service IP network is connected. If the ping fails, the fault is caused by cause 1. In this case, configure an available service IP address. If the ping succeeds, go to 2.

  2. Go to any container where the mount command can be executed and use NFSv3 to run the mount command. If the command times out, the fault may be caused by cause 2. Run the systemctl status containerd.service command to check the configuration file path, and then run the cat /xxx/containerd.service command to view the configuration file. If the file contains LimitNOFILE=infinity or the value of LimitNOFILE is 1 billion, go to 3. Otherwise, contact Huawei technical support engineers.

  3. For cause 2, perform the following operations:

    • Try using NFSv4.0.
    • Change the value of LimitNOFILE to a proper one by referring to the change solution provided by the community; a sketch of such a change is shown below. This solution will restart the container runtime. Evaluate the impact on services.
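
    The following is a sketch of a systemd drop-in that sets LimitNOFILE to a finite value for containerd. The file path and the value 1048576 are illustrative assumptions rather than values from this document; follow the community guidance referenced above and evaluate the service impact before restarting the container runtime.

      mkdir -p /etc/systemd/system/containerd.service.d
      printf '[Service]\nLimitNOFILE=1048576\n' > /etc/systemd/system/containerd.service.d/99-limit-nofile.conf
      systemctl daemon-reload
      systemctl restart containerd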
  4. Manually mount the file system on the host machine where the mounting fails. If the required time exceeds 30 seconds, check whether the network between the host machine and the storage node is normal. An example of the mount command is as follows.

    • Run the following command to create a test directory.

      mkdir /tmp/test_mount
      
    • Run the mount command to mount the file system and observe the time consumed. The value of ip:nfs_share_path can be obtained from the huawei-csi-node log. For details, see Viewing Huawei CSI Logs.

      time mount ip:nfs_share_path /tmp/test_mount
      
    • After the test is complete, run the following command to unmount the file system.

      umount /tmp/test_mount
      

4.5 - A Pod Fails to Be Created and the Log Shows That the mount Command Fails to Be Executed

Symptom

In NAS scenarios, when a Pod is being created, the Pod remains in the ContainerCreating state. In this case, check the log information of huawei-csi-node (for details, see Viewing Huawei CSI Logs). The log shows that the mount command fails to be executed.

Root Cause Analysis

The possible cause is that the NFS 4.0/4.1/4.2 protocol is not enabled on the storage side. After the NFS v4 protocol fails to be used for mounting, the host does not negotiate to use the NFS v3 protocol for mounting.

Solution or Workaround

Enable the required NFS 4.0/4.1/4.2 protocol on the storage side, and then re-create the Pod so that the mounting is retried.

4.6 - A Pod Fails to Be Created and Message publishInfo doesn't exist Is Displayed in the Events Log

Symptom

When a Pod is being created, the Pod remains in the ContainerCreating state, and the following alarm event is printed for the Pod: rpc error: code = Internal desc = publishInfo doesn’t exist

Root Cause Analysis

As required by CSI, when a workload needs to use a PV, the Container Orchestration system (CO system, communicating with the CSI plug-in using RPC requests) invokes the ControllerPublishVolume interface (provided by huawei-csi-controller) in the CSI protocol provided by the CSI plug-in to map the PV, and then invokes the NodeStageVolume interface (provided by huawei-csi-node) provided by the CSI plug-in to mount the PV. During a complete mounting operation, only the huawei-csi-node service receives the NodeStageVolume request. Before that, the huawei-csi-controller service does not receive the ControllerPublishVolume request. As a result, the huawei-csi-controller service does not map the PV volume and does not send the mapping information to the huawei-csi-node service. Therefore, error message publishInfo doesn’t exist is reported.

Solution

To solve this problem, Kubernetes needs to invoke the ControllerPublishVolume interface.

This can be done by triggering a failover of the workloads that were created by an earlier version in the cluster (see the procedure below). After this operation has been triggered for all such workloads, the problem will not occur again.

Procedure

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to obtain the information about the node where a workload is located.

    kubectl get pod error-pod -n error-pod-in-namespace -owide
    

    The following is an example of the command output.

    NAME      READY   STATUS              RESTARTS   AGE   IP       NODE        NOMINATED NODE   READINESS GATES
    pod-nfs   0/1     ContainerCreating   0          3s    <none>   node-1      <none>           <none>
    
  3. Fail over the workload to another node.

  4. If the failover cannot be completed in the cluster, you can delete the workload and create a new one on the original node.

  5. Check whether the workload is successfully started. If it fails to be started, contact Huawei technical support engineers.

Checking Cluster Workloads

When Kubernetes invokes the CSI plug-in to complete volume mapping, the VolumeAttachment resource is used to save the mapping information, indicating that a specified volume is attached to or detached from a specified node. This problem occurs because publishInfo does not exist. You can view the VolumeAttachment resource information to check whether this problem is also involved in other workloads in the cluster. The procedure is as follows:

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to obtain the VolumeAttachment information and retain resources whose ATTACHER field is csi.huawei.com. csi.huawei.com indicates the Huawei CSI driver name and can be configured in the values.yaml file. The corresponding configuration item is csiDriver.driverName. For details about the configuration item, see Table 4.

    kubectl get volumeattachments.storage.k8s.io 
    

    The following is an example of the command output.

    NAME          ATTACHER         PV       NODE     ATTACHED   AGE
    csi-47abxx   csi.huawei.com   pvc-1xx   node-1   true       12h
    
  3. Run the following command to view the VolumeAttachment resource details. In the following information, csi-47abxx is the resource name obtained in 2.

    kubectl get volumeattachments.storage.k8s.io csi-47abxx -o yaml
    

    The following is an example of the command output.

    kind: VolumeAttachment
    metadata:
      annotations:
        csi.alpha.kubernetes.io/node-id: '{"HostName":"node-1"}'
      finalizers:
      - external-attacher/csi-huawei-com
      name: csi-47abxx
      uid: 0c87fa8a-c3d6-4623-acb8-71d6206d030d
    spec:
      attacher: csi.huawei.com
      nodeName: debian-node
      source:
        persistentVolumeName: pvc-1xx
    status:
      attached: true
      attachmentMetadata:
         publishInfo: '{<PUBLISH-INFO>}'
    
  4. If status.attachmentMetadata.publishInfo exists in the resource obtained in 3, the problem described in this FAQ is not involved in the workloads created using pvc-1xx on the node-1 node. node-1 and pvc-1xx are the query results in 2. If status.attachmentMetadata.publishInfo does not exist, rectify the fault by referring to Solution.

  5. If multiple VolumeAttachment resources exist, repeat 3 to 4.
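
If the cluster contains many VolumeAttachment resources, the following command lists each resource managed by the Huawei CSI driver together with its publishInfo, so entries with an empty publishInfo stand out. This is a sketch that assumes the driver name csi.huawei.com; adjust it if a different driver name is configured.

kubectl get volumeattachments.storage.k8s.io -o jsonpath='{range .items[?(@.spec.attacher=="csi.huawei.com")]}{.metadata.name}{"\t"}{.spec.source.persistentVolumeName}{"\t"}{.spec.nodeName}{"\t"}{.status.attachmentMetadata.publishInfo}{"\n"}{end}'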

4.7 - After a Pod Fails to Be Created or kubelet Is Restarted, Logs Show That the Mount Point Already Exists

Symptom

When a Pod is being created, the Pod is always in the ContainerCreating state. Alternatively, after kubelet is restarted, logs show that the mount point already exists. Check the log information of huawei-csi-node (for details, see Viewing Huawei CSI Logs). The error information is: The mount /var/lib/kubelet/pods/xxx/mount is already exist, but the source path is not /var/lib/kubelet/plugins/kubernetes.io/xxx/globalmount

Root Cause Analysis

The root cause of this problem is that Kubernetes performs repeated mounting operations.

Solution or Workaround

Run the following command to unmount the existing path. In the command, /var/lib/kubelet/pods/xxx/mount indicates the existing mount path displayed in the logs.

umount /var/lib/kubelet/pods/xxx/mount
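
If you want to confirm the current source of the mount point before unmounting it, a check such as the following can help; the path is the one reported in the logs.

findmnt /var/lib/kubelet/pods/xxx/mount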

4.8 - I/O error Is Displayed When a Volume Directory Is Mounted to a Pod

Symptom

When a Pod reads or writes a mounted volume, message “I/O error” is displayed.

Root Cause Analysis

When a protocol such as SCSI is used, if the Pod continuously writes data to the mount directory, the storage device will restart. As a result, the link between the device on the host and the storage device is interrupted, triggering an I/O error. When the storage device is restored, the mount directory is still read-only.

Solution

Remount the volume. That is, reconstruct the Pod to trigger re-mounting.

4.9 - Failed to Create a Pod Because the iscsi tcp Service Is Not Started Properly When the Kubernetes Platform Is Set Up for the First Time

Symptom

When you create a Pod, error Cannot connect ISCSI portal *.*.*.*: libkmod: kmod_module_insert_module: could not find module by name=‘iscsi_tcp’ is reported in the /var/log/huawei-csi-node log.

Root Cause Analysis

The iscsi_tcp service may be stopped after the Kubernetes platform is set up and the iSCSI service is installed. You can run the following command to check whether the service is stopped.

lsmod | grep iscsi | grep iscsi_tcp

The following is an example of the command output.

iscsi_tcp              18333  6 
libiscsi_tcp           25146  1 iscsi_tcp
libiscsi               57233  2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi    99909  3 iscsi_tcp,libiscsi

Solution or Workaround

Run the following commands to manually load the iscsi_tcp module and verify that it is loaded.

modprobe iscsi_tcp
lsmod | grep iscsi | grep iscsi_tcp
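
To load the module automatically after a node reboot, the systemd modules-load mechanism can be used. This is an optional step that is not part of the procedure above, and the file name is illustrative.

echo iscsi_tcp > /etc/modules-load.d/iscsi_tcp.conf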

5 - Common Problems and Solutions for Interconnecting with the Tanzu Kubernetes Cluster

This section describes the common problems and solutions for interconnecting with the Tanzu Kubernetes cluster. Currently, the following problems occur during interconnection with the Tanzu Kubernetes cluster:

  • A Pod cannot be created because the PSP permission is not created.
  • The mount point of the host is different from that of the native Kubernetes. As a result, a volume fails to be mounted.
  • The livenessprobe container port conflicts with the Tanzu vSphere port. As a result, the container restarts repeatedly.

5.1 - A Pod Cannot Be Created Because the PSP Permission Is Not Created

Symptom

When huawei-csi-controller and huawei-csi-node are created, only the Deployment and DaemonSet resources are successfully created, and no Pod is created for the controller and node.

Root Cause Analysis

The service account used for creating resources does not have the “use” permission of the PSP policy.

Solution or Workaround

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the vi psp-use.yaml command to create a file named psp-use.yaml.

    vi psp-use.yaml
    
  3. Configure the psp-use.yaml file.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: huawei-csi-psp-role
    rules:
    - apiGroups: ['policy']
      resources: ['podsecuritypolicies']
      verbs: ['use']
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: huawei-csi-psp-role-cfg
    roleRef:
      kind: ClusterRole
      name: huawei-csi-psp-role
      apiGroup: rbac.authorization.k8s.io
    subjects:
    - kind: Group
      apiGroup: rbac.authorization.k8s.io
      name: system:serviceaccounts:huawei-csi
    - kind: Group
      apiGroup: rbac.authorization.k8s.io
      name: system:serviceaccounts:default
    
  4. Run the following command to create the PSP permission.

    kubectl create -f psp-use.yaml
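
    You can verify that the ClusterRole and ClusterRoleBinding were created. The names below are the ones defined in the psp-use.yaml file above.

    kubectl get clusterrole huawei-csi-psp-role
    kubectl get clusterrolebinding huawei-csi-psp-role-cfg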
    

5.2 - Changing the Mount Point of a Host

Symptom

A Pod fails to be created, and error message “mount point does not exist” is recorded in Huawei CSI logs.

Root Cause Analysis

The pods-dir directory used by huawei-csi-node in a Tanzu Kubernetes cluster is different from that in a native Kubernetes cluster.

Solution or Workaround

  1. Go to the helm/esdk/ directory and run the vi values.yaml command to open the configuration file.

    vi values.yaml
    
  2. Change the value of kubeletConfigDir to the actual installation directory of kubelet.

    # Specify kubelet config dir path.
    # kubernetes and openshift is usually /var/lib/kubelet
    # Tanzu is usually /var/vcap/data/kubelet
    kubeletConfigDir: /var/vcap/data/kubelet
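
    If you are not sure which directory kubelet uses on a node, one way to check is to inspect the kubelet process arguments on that node, as in the following sketch. If the --root-dir flag is not set explicitly, kubelet uses its default directory (/var/lib/kubelet).

    ps -ef | grep '[k]ubelet' | grep -o -- '--root-dir=[^ ]*'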
    

5.3 - Changing the Default Port of the livenessprobe Container

Symptom

The livenessprobe container of the huawei-csi-controller component keeps restarting.

Root Cause Analysis

The default port (9808) of the livenessprobe container of huawei-csi-controller conflicts with the existing vSphere CSI port of Tanzu.

Solution or Workaround

Change the default port of the livenessprobe container to an idle port.

  1. Go to the helm/esdk directory and run the vi values.yaml command to open the configuration file.

    vi values.yaml
    
  2. Change the default value 9808 of controller.livenessProbePort to an idle port, for example, 9809.

    controller:
      livenessProbePort: 9809
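
    Before committing to a port, you can check on the relevant node whether it is already in use; no output from the following command means the port is free. This quick check assumes the conflicting ports are visible on the node, as the port conflict described above suggests.

    ss -tlnp | grep -w 9809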
    
  3. Update Huawei CSI using Helm. For details, see Upgrading Huawei CSI.

5.4 - Failed to Create an Ephemeral Volume

Symptom

A generic ephemeral volume fails to be created, and the error message PodSecurityPolicy: unable to admit pod: [spec.volumes[0]: Invalid value: “ephemeral”: ephemeral volumes are not allowed to be used spec.volumes[0] is displayed.

Root Cause Analysis

The current PSP policy does not contain the permission to use ephemeral volumes.

Solution or Workaround

Add the permission to use ephemeral volumes to the default PSP pks-privileged and pks-restricted. The following is an example of modifying pks-privileged:

  1. Use a remote access tool, such as PuTTY, to log in to any master node in the Kubernetes cluster through the management IP address.

  2. Run the following command to modify the pks-privileged configuration.

    kubectl edit psp pks-privileged
    
  3. Add ephemeral to spec.volumes. The following is an example.

    # Please edit the object below. Lines beginning with a '#' will be ignored,
    # and an empty file will abort the edit. If an error occurs while saving this file will be
    # reopened with the relevant failures.
    #
    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      annotations:
        apparmor.security.beta.kubernetes.io/allowedProfileName: '*'
        seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
      creationTimestamp: "2022-10-11T08:07:00Z"
      name: pks-privileged
      resourceVersion: "1227763"
      uid: 2f39c44a-2ce7-49fd-87ca-2c5dc3bfc0c6
    spec:
      allowPrivilegeEscalation: true
      allowedCapabilities:
      - '*'
      supplementalGroups:
        rule: RunAsAny
      volumes:
      - glusterfs
      - hostPath
      - iscsi
      - nfs
      - persistentVolumeClaim
      - ephemeral
    
  4. Run the following command to check whether the addition is successful.

    kubectl get psp pks-privileged -o yaml