Tested environment

- nvcc
    - cuda 11.8
- python --version
    - 3.10.13
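
To compare your own machine against the versions listed above, a small script along these lines may help. It is only a sketch: it assumes `nvcc` is on your PATH and that its output contains a "release X.Y" string, which is how recent CUDA toolkits report their version. The helper names are made up for this example.

```python
import re
import shutil
import subprocess
import sys
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None


def cuda_toolkit_version() -> Optional[str]:
    """Return the CUDA toolkit version reported by nvcc, or None if nvcc is not found."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    # Python version of the interpreter running this script
    print("python:", ".".join(map(str, sys.version_info[:3])))
    # CUDA toolkit version, or None when nvcc is not installed
    print("cuda toolkit:", cuda_toolkit_version())
```

If the script prints `None` for the CUDA toolkit, nvcc is most likely not installed or not on PATH.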

Installing Anaconda

https://www.anaconda.com

Installing git via Anaconda

# If the install does not complete on the first attempt, run it a second time (the first run performs some preliminary environment setup)
conda install -c anaconda git

Setting up the Anaconda virtual environment

# Create the virtual environment
conda create -n p3.10.13 python=3.10.13

# Switch to the virtual environment
conda activate p3.10.13

Cloning from GitHub

git clone https://github.com/turboderp/exllamav2

Installing dependencies

cd exllamav2
pip install -r requirements.txt

To generate a dependency file from your own environment, run pip freeze > requirements.txt.

Installing exllamav2

# Install from source
python setup.py install --user

# Install from a wheel (.whl) file
# Note: the correct wheel depends on your CUDA version (11.8), Python version (3.10), and platform (win_amd64)
pip install exllamav2-0.0.10+cu118-cp310-cp310-win_amd64.whl
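
Since the wheel name encodes the CUDA build, Python version, and platform, it can be useful to split the filename into its tags before installing. The sketch below follows the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag). The function names are hypothetical, and treating the local version suffix (`cu118`) as the CUDA build is an assumption based on the filename above.

```python
import sys


def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # 5 parts without a build tag, 6 parts with one
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename layout: " + filename)
    name, full_version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    # A local version label follows '+', e.g. '0.0.10+cu118'
    version, _, local = full_version.partition("+")
    return {
        "name": name,
        "version": version,
        "local": local or None,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def current_cpython_tag() -> str:
    """Return the CPython tag of the running interpreter, e.g. 'cp310'."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


if __name__ == "__main__":
    info = parse_wheel_filename("exllamav2-0.0.10+cu118-cp310-cp310-win_amd64.whl")
    print(info)
    print("python tag matches this interpreter:", info["python_tag"] == current_cpython_tag())
```

If the Python tag does not match the interpreter in the active conda environment, pip will refuse the wheel as unsupported on this platform.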

Reinstalling PyTorch

pip install torch==2.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
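
After reinstalling, it may be worth confirming that the CUDA build of PyTorch is the one actually being imported. The following is a minimal sketch. `torch_cuda_report` is a name invented here, and the function returns a stub result when torch is not importable.

```python
def torch_cuda_report() -> dict:
    """Summarize the installed PyTorch build and whether CUDA is usable."""
    try:
        import torch
    except ImportError:
        # torch is not installed in the active environment
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version torch was built against; None for CPU-only builds
        "cuda_build": torch.version.cuda,
        # True only if a CUDA device and a working driver are present
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(key, "=", value)
```

With the install command above, `cuda_build` would be expected to report 11.8. A value of `None` suggests a CPU-only build was installed instead.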

Downloading the trained model

To download from Hugging Face, proceed as shown below. Files are first stored in, and managed under, the .cache folder. Note that if local_dir is not specified, they are stored only in .cache.

Download one of turboderp's models to verify the setup

# see more
# https://huggingface.co/docs/huggingface_hub/v0.19.3/guides/download
from huggingface_hub import snapshot_download

# https://huggingface.co/docs/huggingface_hub/guides/download
# https://huggingface.co/turboderp/Llama2-70B-exl2/tree/2.5bpw
snapshot_download(repo_id="turboderp/Llama2-70B-exl2", revision="2.5bpw", local_dir="./model/Llama2-70B-exl2")
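
Once the download finishes, a quick look at the target folder can confirm that the model files arrived. The sketch below uses only the standard library. It assumes the model folder contains a `config.json` and one or more `.safetensors` weight files, which is the usual layout for these repositories but is not something this post verifies. `summarize_model_dir` is a hypothetical helper.

```python
from pathlib import Path


def summarize_model_dir(model_dir: str) -> dict:
    """Report whether a model folder has a config file and how large its weights are."""
    root = Path(model_dir)
    if not root.is_dir():
        return {"exists": False, "has_config": False, "weight_files": 0, "weight_bytes": 0}
    # Collect weight shards anywhere under the folder
    weights = sorted(p for p in root.rglob("*.safetensors") if p.is_file())
    return {
        "exists": True,
        "has_config": (root / "config.json").is_file(),
        "weight_files": len(weights),
        "weight_bytes": sum(p.stat().st_size for p in weights),
    }


if __name__ == "__main__":
    summary = summarize_model_dir("./model/Llama2-70B-exl2")
    print(summary)
    print("weights: {:.2f} GB".format(summary["weight_bytes"] / 1e9))
```

A folder that exists but reports zero weight files usually means the download was interrupted or a different local_dir was used.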

Testing

python test_inference.py -m [model path] -p "once upon a time"

Performance

With the 70B model, generation speed comes out to roughly 2.4 tokens/second.

In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a 2048-token context, producing coherent and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently none of them uses GQA which effectively limits the context size to 2048. In either case it's unlikely that the model will fit alongside a desktop environment. For now.
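
The memory figures quoted above can be sanity-checked with simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by 8 to get bytes. The sketch below assumes nominal parameter counts of 70e9 and 13e9 and ignores the KV cache, activations, and framework overhead, so the real footprint will be higher.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; these are assumptions, not measured values
    print("70B @ 2.55 bpw: {:.2f} GB".format(quantized_weight_gb(70e9, 2.55)))
    print("13B @ 2.65 bpw: {:.2f} GB".format(quantized_weight_gb(13e9, 2.65)))
```

Under these assumptions the 70B weights alone take about 22.3 GB, which leaves little headroom on a 24 GB card and is consistent with the remark that the model is unlikely to fit alongside a desktop environment. The 13B weights take about 4.3 GB.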

[Development] LLAMA V2 installation and testing