Michael Yang | 8e0641a9bf | handle asymmetric embedding KVs | 6 months ago
Daniel Hiltgen | 359b15a597 | Handle models with divergent layer sizes | 6 months ago
Daniel Hiltgen | 7784ca33ce | Tighten up memory prediction logging | 6 months ago
Daniel Hiltgen | 17df6520c8 | Remove mmap related output calc logic | 6 months ago
Daniel Hiltgen | 6f351bf586 | review comments and coverage | 6 months ago
Daniel Hiltgen | 6fd04ca922 | Improve multi-gpu handling at the limit | 7 months ago
Michael Yang | 6297f85606 | gofmt, goimports | 6 months ago
Michael Yang | e40145a39d | lint | 7 months ago
Patrick Devine | 4cc3be3035 | Move envconfig and consolidate env vars (#4608) | 7 months ago
Michael Yang | 1d359e737e | typo | 7 months ago
Michael Yang | 50b9056e09 | count memory up to NumGPU | 7 months ago
Jeffrey Morgan | bb6fd02298 | Don't clamp ctx size in `PredictServerFit` (#4317) | 7 months ago
Daniel Hiltgen | bee2f4a3b0 | Record GPU usage information | 7 months ago
Michael Yang | 4736391bfb | llm: add minimum based on layer size | 7 months ago
Daniel Hiltgen | f56aa20014 | Centralize server config handling | 7 months ago
Jeffrey Morgan | f0c454ab57 | gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068) | 7 months ago
Michael Yang | f81f308118 | fix gemma, command-r layer weights | 8 months ago
Michael Yang | 7bb7cb8a60 | only count output tensors | 8 months ago
Daniel Hiltgen | 5445aaa94e | Add back memory escape valve | 8 months ago
Daniel Hiltgen | 34b9db5afc | Request and model concurrency | 8 months ago