Looking at Naver's reasoning AI report card... "Outperforms Alibaba and LG"

HyperCLOVA X THINK vs. global reasoning AI models: a performance comparison
Naver dominates in Korean language capabilities... Math and coding fall slightly short

2025-06-30     Dongwon Kim
Summary of model performance on (1) General Aptitude, (2) Culture and Language, and (3) Instruction-following benchmarks for the Korean domain. The instruction-following benchmark scores are normalized by multiplying their original values by 10. /Naver

The performance report of Naver's recently launched reasoning artificial intelligence (AI) 'HyperCLOVA X THINK' has been made public. This comprehensive scorecard shows how it competed against major global AI models, confirming that it achieved higher scores than Alibaba's Qwen series and LG AI Research's EXAONE Deep, among others.

◇ Korean Language Tests: Straight A's with Clear Dominance Over Competitors

The Korean language scorecard revealed in Naver's technical report on the 30th showed HyperCLOVA X THINK taking first place across the board.

The most striking result came on 'KoBALT-700'. On this high-difficulty Korean language test designed by Seoul National University's Department of Linguistics, HyperCLOVA X THINK scored 48.9 points. Meanwhile, Alibaba's Qwen3 32B managed only 41.4 points, and LG's EXAONE scored 33.0 points. That puts Naver's model in first place, 7.5 points ahead of Qwen3 32B and about 16 points ahead of EXAONE.

Similar results emerged on KMMLU, which evaluates comprehensive Korean language understanding. HyperCLOVA X THINK scored 69.7 points, Qwen3 32B scored 63.5 points, and EXAONE Deep scored 53.6 points, a lead of 6.2 points over Qwen3 32B and more than 16 points over EXAONE Deep. On CSAT, which mimics Korea's College Scholastic Ability Test, it scored 83.2 points, beating EXAONE (69.7 points) by 13.5 points and performing at a similar level to Qwen3.

The gaps widened even further in tests measuring cultural and historical understanding. With 87.8 points in HAERAE (other models scored 74-76 points) and 80.1 points in CLIcK (EXAONE 62.2 points, Qwen3 71.1 points), it demonstrated unparalleled depth in understanding Korean culture.

Interestingly, even Alibaba's math-specialized model QwQ 32B could not surpass HyperCLOVA X THINK in Korean language domains. While QwQ scored 98.0 points on the MATH500 mathematics test, it managed only 32.4 points on KoBALT-700, about two-thirds of the score of Naver's model (48.9 points).
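The margins quoted in this section can be rechecked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only scores stated in the article; the dictionary layout and the `margins` helper are illustrative names, not anything from Naver's report, and "EXAONE Deep" is assumed to be the same LG model the article sometimes shortens to "EXAONE".

```python
# Scores in points, copied from the article text.
# Assumption: "EXAONE" and "EXAONE Deep" in the article refer to the same model.
scores = {
    "KoBALT-700": {
        "HyperCLOVA X THINK": 48.9,
        "Qwen3 32B": 41.4,
        "EXAONE Deep": 33.0,
        "QwQ 32B": 32.4,
    },
    "KMMLU": {
        "HyperCLOVA X THINK": 69.7,
        "Qwen3 32B": 63.5,
        "EXAONE Deep": 53.6,
    },
    "CLIcK": {
        "HyperCLOVA X THINK": 80.1,
        "Qwen3 32B": 71.1,
        "EXAONE Deep": 62.2,
    },
}


def margins(benchmark, reference="HyperCLOVA X THINK"):
    """Return the reference model's lead over every other model, in points."""
    row = scores[benchmark]
    return {
        model: round(row[reference] - score, 1)
        for model, score in row.items()
        if model != reference
    }


def score_ratio(benchmark, model, reference="HyperCLOVA X THINK"):
    """Return a model's score as a fraction of the reference model's score."""
    row = scores[benchmark]
    return row[model] / row[reference]


if __name__ == "__main__":
    for name in scores:
        print(name, margins(name))
    print(f"QwQ 32B relative to HyperCLOVA X THINK on KoBALT-700: "
          f"{score_ratio('KoBALT-700', 'QwQ 32B'):.2f}")
```

Running the script prints the per-benchmark leads and the KoBALT-700 score ratio, which makes it easy to see which competitor a given "margin" refers to.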

Performance comparison of language models on Korea-centric benchmarks. Models are evaluated across comprehensive understanding, cultural sovereignty, and chat-based instruction following tasks, highlighting their capabilities and adaptability within a Korean context. /Naver

◇ Mathematics and Coding: Room for Improvement, but Training Efficiency Compensates

However, the results in mathematics and coding fell short of the leaders. On the mathematics benchmark MATH500, HyperCLOVA X THINK scored 95.2 points, trailing QwQ (98.0 points) and Qwen3 32B (97.2 points) by 2 to 3 points. On the coding test HumanEval, it scored 95.7 points, 1.2 points below Qwen3 32B (96.9 points).

Nevertheless, Naver says the model stands out in training efficiency. According to Naver, HyperCLOVA X THINK reached this level of performance while using significantly fewer graphics processing unit (GPU) hours in training than competing models. Naver attributes this to its independently developed 'Peri-LN' technique and a high-quality data strategy. The Peri-LN work was accepted at ICML 2025, one of the world's most prestigious AI conferences.

Training efficiency (GPU hours / A100 / MFU 50%)
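For readers unfamiliar with the unit in this caption, GPU hours at a fixed MFU (model FLOPs utilization) are usually estimated from the total training compute. The sketch below is an illustration only, not Naver's calculation: it assumes the common rule of thumb of roughly 6 FLOPs per parameter per training token and the A100's published dense BF16 peak of 312 TFLOP/s, and the parameter and token counts in the example are hypothetical placeholders, since the article gives neither.

```python
# Assumption: NVIDIA A100 dense BF16 peak throughput, in FLOP/s.
A100_PEAK_FLOPS = 312e12


def gpu_hours(params, tokens, mfu=0.5, peak_flops=A100_PEAK_FLOPS):
    """Estimate GPU hours for one training run.

    Uses the approximation total_flops ~= 6 * params * tokens, then divides
    by the throughput a single GPU actually sustains (peak * MFU).
    """
    total_flops = 6 * params * tokens
    sustained_flops_per_second = peak_flops * mfu
    seconds = total_flops / sustained_flops_per_second
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical model: 10B parameters trained on 1T tokens.
    # These figures are placeholders and do NOT describe HyperCLOVA X THINK.
    hours = gpu_hours(params=10e9, tokens=1e12, mfu=0.5)
    print(f"Estimated A100 GPU hours: {hours:,.0f}")
```

Fixing the MFU at 50% for every model, as the caption indicates, puts the comparison on an equal hardware-utilization footing, so differences in GPU hours reflect differences in model size and training data volume.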

◇ "World's Best in Korean": Proving the Potential of Sovereign AI

The message from this report card is clear: even AI models created by global big tech companies cannot match domestic technology in Korean language and Korean cultural domains.

Industry experts point out that "while mathematics and coding are universal domains regardless of language, Korean language understanding and cultural context are far more important in most situations where Korean users actually utilize AI."

Naver announced plans to release this model as open source as well. One AI researcher commented, "This result is a representative case demonstrating the importance of sovereign AI," and predicted that "in competition with global reasoning models like ChatGPT o1, Korea can maintain a dominant position at least in the Korean language domain."

Another AI industry official noted, "LG AI Research is scheduled to showcase AI models starting in July," and predicted that "LG could potentially achieve better scores than Naver again." They added, "Competition between Korea's two leading AI companies will drive the advancement of Korean AI."