Dfcache import across disks fails with InternalError "persistent cache peer xxxx not found" #4673

@rayne-Li

Bug report:

  • A NotFound error appears every time a file is imported from /data01 into /data02/dragonfly (the Dragonfly rootDir):
# time(dfcache import --content-for-calculating-task-id 7788 /data01/dragonfly/download/qwen-7b-test-DeepSeek-R1-Distill-Qwen-7B-1.tar --console --ttl 15m)
Importing Failed!
*********************************
Bad Code: Internal error
Message: status: NotFound, message: "persistent cache peer {xxxx} not found", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }
real	0m21.191s
user	0m0.004s
sys	0m0.008s

  • Checking the TTL in Redis shows that the peer key's TTL is unusually short, which causes the NotFound error:
127.0.0.1:6379> TTL "scheduler: scheduler-clusters:1:persistent-cache-hosts:{worker-2}:persistent-cache-peers-for-persistent-cache-task"
(integer) 7
127.0.0.1:6379> TTL "scheduler: scheduler-clusters:1:persistent-cache-hosts:{worker-2}:persistent-cache-peers-for-persistent-cache-task"
(integer) 6
127.0.0.1:6379> TTL "scheduler: scheduler-clusters:1:persistent-cache-hosts:1{worker-2}:persistent-cache-peers-for-persistent-cache-task"
(integer) 6
127.0.0.1:6379> TTL "scheduler: scheduler-clusters:1:persistent-cache-hosts:{worker-2}:persistent-cache-peers-for-persistent-cache-task"
(integer) 5
  • The TTL is set in scheduler/resource/persistentcache/peer_manager.go#240:
-- Add peer ID to the task joint-set
redis.call("SADD", task_peers_key, peer_id)
redis.call("EXPIRE", task_peers_key, ttl_seconds)
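  • However, the Lua script arguments are passed in the wrong order: the script reads ARGV[11] as the TTL and ARGV[12] as the concurrent piece count, while the Go caller passes them the other way around: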
local ttl_seconds = tonumber(ARGV[11])
local concurrent_piece_count = ARGV[12]

args := []any{
       peer.ID,                              // ARGV[1]
       peer.Persistent,                      // ARGV[2]
       string(finishedPieces),               // ARGV[3]
       peer.FSM.Current(),                   // ARGV[4]
       string(blockParents),                 // ARGV[5]
       peer.Task.ID,                         // ARGV[6]
       peer.Host.ID,                         // ARGV[7]
       peer.Cost.Nanoseconds(),              // ARGV[8]
       peer.CreatedAt.Format(time.RFC3339),  // ARGV[9]
       peer.UpdatedAt.Format(time.RFC3339),  // ARGV[10]
       peer.ConcurrentPieceCount,            // ARGV[11]  <-- should be ttl
       remainingTTLSeconds,                  // ARGV[12]  <--  should be concurrent_piece_count
    }
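  • As a result, the concurrent piece count (8) is used as the TTL in seconds, so the key expires about 8 seconds after it is written (consistent with the TTL values of 7, 6, and 5 observed above) instead of after the requested 15m, and the later lookup fails with NotFound.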
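The mismatch described in the list above could be resolved on either side: by swapping the two indices in the Lua script, or by swapping the last two entries in the Go argument list. Below is a minimal sketch of the second option. The `persistentCachePeer` struct and the `storeArgs` helper are hypothetical stand-ins invented for illustration (the real code builds the slice inline from the scheduler's peer type), and the field types are assumptions.

```go
package main

import "time"

// persistentCachePeer is a hypothetical, trimmed-down stand-in for the
// scheduler's peer type. State replaces peer.FSM.Current(), and TaskID and
// HostID replace peer.Task.ID and peer.Host.ID. Field types are assumed.
type persistentCachePeer struct {
	ID                   string
	Persistent           bool
	State                string
	TaskID               string
	HostID               string
	Cost                 time.Duration
	CreatedAt            time.Time
	UpdatedAt            time.Time
	ConcurrentPieceCount uint64
}

// storeArgs returns the ARGV list in the order the Lua script reads it:
// ARGV[11] is the TTL in seconds and ARGV[12] is the concurrent piece count.
func storeArgs(p persistentCachePeer, finishedPieces, blockParents []byte, remainingTTLSeconds int64) []any {
	return []any{
		p.ID,                             // ARGV[1]
		p.Persistent,                     // ARGV[2]
		string(finishedPieces),           // ARGV[3]
		p.State,                          // ARGV[4]
		string(blockParents),             // ARGV[5]
		p.TaskID,                         // ARGV[6]
		p.HostID,                         // ARGV[7]
		p.Cost.Nanoseconds(),             // ARGV[8]
		p.CreatedAt.Format(time.RFC3339), // ARGV[9]
		p.UpdatedAt.Format(time.RFC3339), // ARGV[10]
		remainingTTLSeconds,              // ARGV[11] ttl_seconds
		p.ConcurrentPieceCount,           // ARGV[12] concurrent_piece_count
	}
}
```

With `--ttl 15m`, the eleventh entry would then carry roughly 900 seconds instead of the concurrency value 8.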

Expected behavior:

The order of the Lua script arguments should be fixed.
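One way to keep this kind of positional mismatch from returning is a small check that compares the indices the script reads with the order used by the caller. The sketch below is only an illustration, not part of Dragonfly: `luaArgvIndexes` and `checkArgOrder` are hypothetical helpers, and the approach assumes the script binds each argument with a `local name = ARGV[n]` or `local name = tonumber(ARGV[n])` line, as in the two lines quoted in the bug report.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// argvPattern matches Lua lines such as
//   local ttl_seconds = tonumber(ARGV[11])
//   local concurrent_piece_count = ARGV[12]
var argvPattern = regexp.MustCompile(`local\s+(\w+)\s*=\s*(?:tonumber\()?ARGV\[(\d+)\]`)

// luaArgvIndexes maps each Lua local name to the 1-based ARGV index it reads.
func luaArgvIndexes(script string) map[string]int {
	indexes := make(map[string]int)
	for _, m := range argvPattern.FindAllStringSubmatch(script, -1) {
		n, err := strconv.Atoi(m[2])
		if err != nil {
			continue // index too large to parse; skipped defensively
		}
		indexes[m[1]] = n
	}
	return indexes
}

// checkArgOrder returns an error for the first name in goOrder whose position
// differs from the ARGV index the script reads for that name. Names that the
// script does not bind to a local are ignored.
func checkArgOrder(script string, goOrder []string) error {
	indexes := luaArgvIndexes(script)
	for i, name := range goOrder {
		want, ok := indexes[name]
		if !ok {
			continue
		}
		if want != i+1 {
			return fmt.Errorf("%s is passed as ARGV[%d] but the script reads ARGV[%d]", name, i+1, want)
		}
	}
	return nil
}
```

In a unit test, `goOrder` would list the caller's arguments by the names the script uses; for the order reported in this issue, the check would flag `concurrent_piece_count` at position 11.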

How to reproduce it:

dfcache import {a large file across disk} --console

Environment:

  • Dragonfly version: 2.4.3
  • OS: linux
  • Kernel (e.g. uname -a):
  • Others:
