- Efficient MoE architecture: 28B total parameters and ~3B active per token, with a ViT and dedicated losses for multimodal balance.
- Stronger multimodal reasoning: RL (GSPO, IcePop), robust grounding, and "Thinking with Images" for fine detail and long-tail knowledge.
- Flexible deployment: Baidu's platform and compatible APIs, ERNIEKit, vLLM, and quantization down to 2 bits with moderate VRAM requirements.

The "Thinking" label appeared quietly on Baidu's ERNIE-4.5-VL family and stirred up some controversy. Between comments that the launch was almost secret, a tiny chart comparing it with competitors such as Gemini 2.5 Pro and a hypothetical "high" GPT-5, and the promise of a "thinking with images" mode that was never explained in much detail, many people wonder whether the model is really as good as the marketing suggests. The truth is that earlier ERNIE versions were already quite capable, so it is worth looking carefully under the hood and separating talk from fact.
In short, ERNIE-4.5-VL-28B-A3B-Thinking is a multimodal language model with a Mixture-of-Experts (MoE) architecture that activates only ~3B parameters per token out of 28B total. This gives a very attractive balance between capability and efficiency. The "Thinking" variant adds a mid-training stage focused on multimodal reasoning, strengthens semantic alignment between text and image, and adds reinforcement learning strategies such as GSPO and IcePop to stabilize the MoE on verifiable tasks. It also includes the well-known "thinking with images" feature, which combines zooming with visual search to extract fine details and long-tail knowledge.
What is ERNIE-4.5-VL-28B-A3B-Thinking and why does it matter?
Within the ERNIE 4.5 family, the VL-28B-A3B-Thinking version is positioned as a lightweight but ambitious model for multimodal reasoning. It uses an MoE architecture with 28 billion total parameters and ~3 billion active per token, which lowers inference cost while staying competitive with larger dense models.
The technical notes mention up to 130 experts with 14 active at each step, a design that fits the goal of specializing by input type while keeping resource use and latency in check. The idea is that the router selects the most "suitable experts" when it receives images, text, or a combination of both, which improves representational diversity and computational efficiency.
For the vision side, the backbone is a Vision Transformer (ViT) that cuts the image into patches and treats them as tokens. Projecting them into the same space as text makes for a fluid "conversation" between modalities, supported by training techniques such as an orthogonal router loss (so that experts do not overlap one another) and a token-balanced multimodal loss that keeps one modality from overshadowing the other.
With the "Thinking" label, Baidu claims clear improvements in visual reasoning, chart analysis, causal reasoning, grounding, and visual instruction following. In addition, the ability to call tools and produce output structured as JSON, together with built-in content moderation, makes it a strong building block for multimodal agents.

Architecture, training and capabilities: what it brings
The MoE philosophy lets only a small fraction of the parameters activate for each token, which translates into computational efficiency without giving up overall capacity. Each "expert" can specialize in particular patterns or tasks (for example OCR, diagrams, numerical reasoning), and the router learns to combine them according to context.
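To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. It is a generic illustration of sparse MoE routing, not Baidu's implementation: the function names (`route_token`, `moe_forward`), the toy experts, and the softmax-then-renormalize weighting are all assumptions chosen for clarity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    # Rank experts by router probability, highest first (ties keep index order).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_forward(token, router_logits, experts, top_k):
    """Weighted sum over the selected experts only; the rest are never run."""
    return sum(w * experts[i](token) for i, w in route_token(router_logits, top_k))

# Toy setup (hypothetical): 4 experts that just scale their input, 2 active per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
selection = route_token(logits, top_k=2)
output = moe_forward(10.0, logits, experts, top_k=2)
print(selection, output)
```

Only the selected experts are evaluated, which is why compute per token tracks the active parameter count rather than the total.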
In practice this is reinforced by two training ideas: an orthogonal loss for the router, which encourages diversity among experts, and a token-balanced multimodal loss, which keeps text and image in balance during training. This prevents the model from excelling at text while struggling with vision (or vice versa). In VL-28B-A3B-Thinking, moreover, a mid-training stage dedicated to reasoning over text-image pairs increases representational power and tightens multimodal semantic alignment.
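The text does not give the formula for the token-balanced loss, so the following is only one plausible reading, stated as an assumption: average the per-token losses within each modality first, then average across modalities, so that the modality with more tokens cannot dominate. The function names and the toy numbers are hypothetical.

```python
def token_balanced_loss(token_losses, modalities):
    """Average per-token losses within each modality, then across modalities.

    Assumption: this is an illustrative balancing scheme, not the published one.
    """
    sums, counts = {}, {}
    for loss, modality in zip(token_losses, modalities):
        sums[modality] = sums.get(modality, 0.0) + loss
        counts[modality] = counts.get(modality, 0) + 1
    per_modality = {m: sums[m] / counts[m] for m in sums}
    return sum(per_modality.values()) / len(per_modality)

def plain_mean_loss(token_losses):
    """Ordinary mean over all tokens, for comparison."""
    return sum(token_losses) / len(token_losses)

# Toy batch: 2 text tokens with high loss, 8 image tokens with low loss.
losses = [2.0] * 2 + [1.0] * 8
mods = ["text"] * 2 + ["image"] * 8
balanced = token_balanced_loss(losses, mods)
plain = plain_mean_loss(losses)
print(balanced, plain)
```

In the toy batch the plain mean is pulled toward the image tokens simply because there are more of them, while the balanced version gives both modalities equal say.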
As for benchmarks, independent comparative estimates (for example Galaxy.AI) place ERNIE-4.5-VL-28B-A3B on par with, or even ahead of, alternatives such as Qwen2.5-VL-7B and Qwen2.5-VL-32B in visual perception, document understanding, and multimodal tasks. This is consistent with the small promotional chart (yes, hard to read) that shows it keeping pace with or beating heavyweights such as Gemini 2.5 Pro or a "high" GPT-5. Some suspect benchmark cherry-picking, but with the reinforcement learning improvements (GSPO, IcePop) and a careful model design, it is plausible that the model has gained strength on verifiable tasks.
The "Thinking with Images" feature deserves special mention: it is not magic, but a workflow that combines image zooming with visual search tools to capture fine details (license plates, small signs, icons) and to reach long-tail knowledge when internal knowledge falls short. This capability, together with improved grounding (triggering grounding functions with simple instructions), makes the model a strong candidate for industrial applications and scenes with complex images.
In multilingual contexts, the ERNIE 4.5 series maintains strong performance without sacrificing visual understanding, a key point in global workflows. Furthermore, structured output (JSON) and function calling open the door to use cases where the model not only observes and answers but also acts on tools (for example, detecting objects and returning their bounding boxes with coordinates).
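When a model returns bounding boxes as JSON, downstream code usually validates them before use. The sketch below assumes a hypothetical output schema, a list of objects with `label` and `bbox` keys where `bbox` is `[x1, y1, x2, y2]` in pixels; the actual format the model emits is not specified here and should be checked against its documentation.

```python
import json

def parse_boxes(raw, width, height):
    """Parse a JSON list of detections and keep only well-formed boxes.

    Assumed (hypothetical) schema: [{"label": str, "bbox": [x1, y1, x2, y2]}, ...]
    """
    valid = []
    for item in json.loads(raw):
        box = item.get("bbox")
        if not (isinstance(box, list) and len(box) == 4):
            continue
        if not all(isinstance(v, (int, float)) for v in box):
            continue
        x1, y1, x2, y2 = box
        # Keep boxes that are non-empty and lie inside the image.
        if 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height:
            valid.append({"label": item.get("label", ""), "bbox": [x1, y1, x2, y2]})
    return valid

# Made-up sample output: the second box is inverted (x1 > x2) and gets dropped.
sample = (
    '[{"label": "person in suit", "bbox": [10, 20, 110, 220]},'
    ' {"label": "person in suit", "bbox": [300, 50, 250, 400]}]'
)
boxes = parse_boxes(sample, width=640, height=480)
print(boxes)
```

Rejecting malformed boxes early keeps a single bad detection from breaking cropping or drawing code further down the pipeline.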
Demonstrated use cases
Visual reasoning over crowded charts: the model can cross-reference the dates shown with days of the week, interpret the chart's structure, identify low-traffic periods (for example 12:00–14:00), and produce a justified recommendation of the best times to visit. Here we see multi-step reasoning that combines calendar, visual reading, and logic.
STEM problems from images: faced with a bridge circuit that cannot be solved by simple series-parallel reduction, the model applies Ohm's law and Kirchhoff's laws, sets up the node equations, and arrives at a correct analytical result (for example R = 7/5 Ω). This shows its ability to read technical diagrams and reason symbolically.
Visual grounding with structured output: given "Detect all people wearing suits and return their bounding boxes in JSON", it locates the individuals and returns precise numeric coordinates. The key is combining grounding with instruction following and a programmable output format.
"Thinking with images" for detailed OCR: if the user asks for the text on a blue sign in the background, the zoom tool kicks in, making it possible to read small signs (such as "HOTEL BUZA") more reliably. It is an example of selective focus on fine regions.
Tool use for long-tail knowledge: faced with a round yellow plush toy, the model decides to call an external image search, compares features, and concludes that it is "Dundun," associated with MINISO. This pipeline shows its ability to orchestrate steps with tools.
Video understanding: it extracts subtitles with timestamps and locates specific scenes (for example, the segments around 17s, 37s, and 47s filmed on a bridge). Here it combines text extraction, temporal reasoning, and spatial analysis of the content.
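The tool-use case above depends on application code that turns the model's tool call into an actual function invocation. Here is a minimal dispatcher sketch. The tool name `image_search`, its stub implementation, and the `{"name": ..., "arguments": ...}` call shape are assumptions for illustration, not the model's documented format.

```python
import json

def image_search(query):
    """Hypothetical stand-in for an external image search tool."""
    return {"query": query, "matches": ["placeholder result"]}

# Registry of tools the application is willing to execute.
TOOLS = {"image_search": image_search}

def dispatch(tool_call_json):
    """Run a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        # Refuse anything that is not explicitly registered.
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**call.get("arguments", {}))

result = dispatch('{"name": "image_search", "arguments": {"query": "round yellow plush toy"}}')
unknown = dispatch('{"name": "delete_files", "arguments": {}}')
print(result, unknown)
```

Using an explicit registry means the model can only trigger functions the application has chosen to expose.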
Another notable variant: ERNIE-4.5-21B-A3B-Thinking
Alongside the VL-28B edition, there is a variant focused on text and code reasoning, with 21B total parameters and 3B active parameters per token. Built around the idea of "smarter, not bigger," it shows strong performance in logic, math, programming, and extended chains of reasoning. Published under Apache-2.0 and with an extended context window (in the 128K-131K range), it is well suited to long tasks and multi-document comparisons.
One of its selling points is price: some platforms advertise indicative rates with very low costs per million tokens (for example, $0.07 input and $0.28 output, and even "$0/$0" in some 21B listings). It is still wise to verify actual availability and conditions, because pricing and commercial terms can vary.
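Per-million-token pricing is easy to misread, so a quick calculation helps. The sketch below uses the indicative $0.07 input and $0.28 output figures quoted above; the token counts are made up, and real prices should be taken from the provider.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars given prices quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical workload: 200k input tokens and 50k output tokens.
cost = request_cost(200_000, 50_000, input_price_per_m=0.07, output_price_per_m=0.28)
print(f"${cost:.4f}")
```

At these rates the example workload costs roughly three cents, with input and output contributing about equally despite the 4:1 token ratio.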
Market comparisons and noise
About the famous tiny chart comparing it with Gemini 2.5 Pro and a "high" GPT-5: it is marketing, not independent analysis. Even so, on the available benchmark suites (Qwen2.5-VL-7B/32B, etc.), the model holds its own. As always, it is best to test it on your own target data and metrics, because generalization varies with the domain, prompt quality, available tools, and the data mix (text/image/video).
Quantization and memory requirements
For local deployments, quantization helps. With FP16, the estimate is around ~56 GB of VRAM; with 4-bit, around ~14 GB; and with 2-bit, ~7 GB. Note that these figures depend on the runtime and packaging. For example, some FastDeploy guides mention a minimum of 24 GB per card, and in other environments (for example, a more demanding vLLM setup) 80 GB has been cited for specific configurations. Depending on the stack (PaddlePaddle, PyTorch, kernels, sequence length, batch size, KV cache), the practical figure can shift.
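The headline figures follow from simple arithmetic: total parameters times bits per weight, divided by 8 to get bytes. The sketch below reproduces them under two stated assumptions: 1 GB is taken as 10^9 bytes, and only the weights are counted (no KV cache, activations, or runtime overhead).

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 28e9  # all experts must be resident, not just the ~3B active ones

estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, bits)
    for name, bits in [("FP16", 16), ("4-bit", 4), ("2-bit", 2)]
}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Note that the total parameter count drives memory, since every expert has to be loaded even though only a few run per token. That is why an MoE saves compute much more than it saves VRAM.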
Multilingual support and moderation
Multilingual support without sacrificing vision is another strength. And for user-facing production, built-in moderation adds a safety layer that reduces deployment risk. Structured output and function calling make it possible to integrate the model as an "engine" in pipelines with external tools, not just as a chatbot.
An extreme example of document understanding
The model can handle complex historical writing, such as passages about the "Five Kings of Wō" in Chinese sources, citations from the "Book of Song," the inscription on the Gwanggaeto Stele, or footnotes with years (for example 478) and places (Ji'an, Jilin). This kind of input mixes translations, annotations, and archaeological context (burial mounds, inscribed swords such as the "Daio" one linked to Bu/Yūryaku). A system like ERNIE-4.5-VL-28B-Thinking can break this material down, recognize the proper names (Yomi, Mí, Sei, Ō, Bu), link them to the figures of Japanese emperors, and produce a concise, fact-grounded summary: tribute to the Southern dynasties of China, conflict on the Korean peninsula, a base in Kara/Imna for iron resources, and so on.
Deployment, access and frequently asked questions
There are several ways to try and deploy ERNIE 4.5. Baidu offers web access to get started without installing anything. Integrations with third-party platforms (for example, the Novita API Playground) make it easy to evaluate the model in a development environment and measure costs. For local deployment, the usual recommendation is Linux with PaddlePaddle (ERNIEKit), plus integration with Transformers on PyTorch using trust_remote_code where applicable.

Working with Transformers (PyTorch)
The typical flow consists of loading the model with AutoModelForCausalLM, attaching the image preprocessing from AutoProcessor, and building multimodal messages that mix text and image/video. Generation then runs with a suitable token limit and the output is decoded. The key is that the processor handles both the chat template and the preparation of the visual tensors.
# Illustrative example (paraphrased)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

name = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

# trust_remote_code lets the repository's custom model code run.
model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(name, trust_remote_code=True)
model.add_image_preprocess(processor)

# One user turn that mixes text and an image.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What color is the girl's clothing?"},
        {"type": "image_url", "image_url": {"url": "https://.../example1.jpg"}},
    ],
}]

# Render the chat template and prepare the visual inputs.
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")

# Generate, then decode only the newly generated tokens.
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out_ids[0][len(inputs["input_ids"][0]):]))
Serving with vLLM
vLLM speeds up inference and adds options such as parsers designed specifically for reasoning and tool calls. Remember to pass --trust-remote-code when serving the model if the repository requires it.
# Install a nightly build (illustrative)
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
# Serve the model
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
# With reasoning and tool-call parsers
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --trust-remote-code \
  --reasoning-parser ernie45 \
  --tool-call-parser ernie45 \
  --enable-auto-tool-choice
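Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below builds such a request with the standard library only. The base URL assumes vLLM's default port 8000 on localhost, and the question and image URL are placeholders; `ask` is defined but not called, because it needs a live server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

def build_chat_request(question, image_url, max_tokens=256):
    """Build a POST request for /chat/completions with one text+image turn."""
    payload = {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question, image_url):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(question, image_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Placeholder inputs; nothing is sent over the network here.
request = build_chat_request("What does the blue sign say?", "https://example.com/street.jpg")
print(request.full_url)
```

Separating request construction from sending makes the payload easy to inspect or log before any network call happens.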
FastDeploy and ERNIEKit
FastDeploy makes it possible to expose services quickly, with parameters to control maximum length, number of sequences, quantization (wint8/INT4), parsers, and multimodal processor settings (for example image_max_pixels). The VRAM requirements cited vary, from 24 GB per card up to scenarios that call for 80 GB in some guides. It depends on the combination of model, precision, batch size, and sequence length.
# Illustrative example
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--max-model-len 131072 \
--max-num-seqs 32 \
--port 8180 \
--quantization wint8 \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }'
Fine-tuning (SFT/LoRA) and alignment (DPO)
ERNIEKit, built on PaddlePaddle, provides ready-made configurations for SFT with and without LoRA, and for DPO. It is useful for adapting the model to a specific domain (for example, industrial documents, visual inspection, forms) while preserving its multimodal strength. You can download the model repository and run the training templates included in the toolkit's examples.
Access through APIs and platforms
Besides Baidu's platform, there are integrations compatible with the OpenAI API standard. This eases migration from existing tools (for example, command-line clients or editors such as Cursor) by avoiding the need to redo the integration. Some GPU clouds (such as Novita AI) advertise instances with ample VRAM and hourly pricing, including scaling to multiple GPUs, which is useful if you want to try large configurations without investing in your own hardware.
Licensing and commercial use
The ERNIE 4.5 family is released under Apache 2.0, a permissive license that allows commercial use as long as its terms and notices are respected. This makes it easy to build paid products that incorporate the model and its derivatives, provided you maintain license compliance and the appropriate attribution (for example, citing the technical report).
Pricing and context
Very competitive price references have been shared. For example, for the 300B A47B edition the context cited is 123k, with indicative prices of $0.28/M input and $1.10/M output; for 21B A3B, advertised figures as low as $0/$0 have been seen. It is wise to check availability and the actual conditions on the relevant platform, since pricing depends on the provider, usage tier, region, and SLA.
Performance on real-world tasks
Beyond the paper, what is interesting is where it shines: reading documents that mix text and visual elements (questions, tables, signatures), extracting data with grounding (coordinates), solving STEM problems from photos or whiteboards, summarizing videos with temporal localization of events, and using tools to reach long-tail knowledge. If your application fits this profile, "Thinking" adds useful pieces.
Quick FAQ
- What does "Thinking with Images" mean? - It is a workflow that combines zooming and visual search to capture fine details and consult external knowledge when internal knowledge falls short, improving fine-grained reasoning.
- How much VRAM do I need? - It depends. As a rough guide: FP16 ~56 GB; INT4 ~14 GB; 2-bit ~7 GB. But the runtime and context size can raise the bar, especially with vLLM.
- Does it integrate with tools? - Yes, it supports function calling and JSON output, enabling multimodal agents that chain grounding, OCR, search, and so on into verifiable steps.
- Is there a strong text-only alternative? - ERNIE-4.5-21B-A3B-Thinking excels at logic, math, and coding, with a good cost-performance ratio and a wide context window.
If you are looking for a multimodal model that balances efficiency and capability, ERNIE-4.5-VL-28B-A3B-Thinking is especially interesting. Its pillars are a fine-grained MoE (130 experts with 14 active), a ViT projected into a shared text space, the orthogonal router loss, and the token-balanced loss, reinforced by reasoning-focused mid-training, RL with GSPO/IcePop, and "thinking with images." Its demos show multi-step visual reasoning, precise grounding, STEM from images, tool use, and temporal video understanding. Flexible access (Baidu, compatible APIs, local deployment with Paddle/Transformers), the Apache 2.0 license, and quantization options round out a package that, marketing aside, has the technical foundation to compete seriously.
