AI researcher interested in how AI stores & updates information
dong-kyum.kim [at] mpi-sp.org
I am a postdoc at the MPI for Security and Privacy. I received my Ph.D. in Physics from KAIST, where I applied AI to problems in nonequilibrium statistical physics.
I study how large language models store and update knowledge, and what fails when we try to edit or erase it. My goal is to make AI systems safer and more controllable by understanding their internal representations. I work across interpretability, model editing, and machine unlearning.
Currently, I am on the job market and open to opportunities and collaborations. Feel free to browse my publications or my CV. I’m always happy to connect.
ICML
AI Engram: In Search of Memory Traces in Artificial Intelligence
Jea Kwon, Dong-Kyum Kim, Jiwon Kim, Yonghyun Kim, Woong Kook, and Meeyoung Cha
In Proceedings of the 43rd International Conference on Machine Learning 2026
Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces—analogous to biological memory units—remains an open question. This work introduces a geometric framework to identify such "AI engrams" by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closed-form estimator that isolates individual memory traces from globally entangled parameters. Theoretical analysis reveals that this biologically derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments on models ranging from simple MLPs to LLMs demonstrate the causal validity and scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning, offering geometric insight into how deep networks support functional specificity within distributed storage.
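The linear-arithmetic claim is concrete enough to sketch. Below is a minimal illustration, assuming each memory has already been isolated as a flattened weight-space vector; the `engrams` dictionary, the toy sizes, and the `compose` helper are hypothetical stand-ins and do not reproduce the paper's estimator.

```python
# Illustrative sketch only: composing and erasing per-memory parameter deltas
# ("engrams") by linear arithmetic, assuming each memory i has already been
# isolated as a weight-space vector engrams[i]. The estimator that produces
# these vectors in the paper is not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
d = 512                                    # flattened parameter dimension (toy size)
base_weights = rng.normal(size=d)          # pretrained parameters, flattened
engrams = {i: rng.normal(scale=0.01, size=d) for i in range(5)}  # hypothetical traces

def compose(weights, traces, add=(), erase=()):
    """Add or remove a subset of memory traces with pure vector arithmetic."""
    w = weights.copy()
    for i in add:
        w += traces[i]
    for i in erase:
        w -= traces[i]
    return w

# Erase memories 1 and 3, keep the rest -- no iterative optimization involved.
edited = compose(base_weights, engrams, erase=(1, 3))
```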
@inproceedings{kwon2026engram,
  title={AI Engram: In Search of Memory Traces in Artificial Intelligence},
  author={Kwon, Jea and Kim, Dong-Kyum and Kim, Jiwon and Kim, Yonghyun and Kook, Woong and Cha, Meeyoung},
  year={2026},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  note={Spotlight},
  selected={true},
}
ICLR
Bilinear representation mitigates reversal curse and enables consistent model editing
Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, and Meeyoung Cha
In The Fourteenth International Conference on Learning Representations 2026
The reversal curse—a language model’s inability to infer an unseen fact "B is A" from a learned fact "A is B"—is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. Our results demonstrate that training from scratch on synthetic relational knowledge graphs leads to the emergence of a bilinear relational structure within the models’ hidden representations. This structure alleviates the reversal curse and facilitates inference of unseen reverse facts. Crucially, this bilinear geometry is foundational for consistent model editing: updates to a single fact propagate correctly to its reverse and logically dependent relations. In contrast, models lacking this representation suffer from the reversal curse and fail to generalize model edits, leading to logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn support language models in behaving in a logically consistent manner after editing. This suggests that the efficacy of language model editing depends not only on the choice of algorithm but on the underlying representational geometry of the knowledge itself.
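As a toy illustration of why a bilinear relational structure helps, consider a generic RESCAL-style bilinear score: if the reverse relation is represented by the transposed relation matrix, edits to the forward relation propagate to the reverse fact by construction. This is a sketch of the general idea under that assumption, not the representation that emerges inside the language models studied in the paper.

```python
# Minimal sketch of why a bilinear relational representation makes reverse
# facts come "for free". score(h, r, t) = e_h^T W_r e_t is a generic bilinear
# (RESCAL-style) form; the paper studies structure that emerges in a language
# model's hidden states, which this toy example does not reproduce.
import numpy as np

rng = np.random.default_rng(0)
d = 16
entities = {"A": rng.normal(size=d), "B": rng.normal(size=d)}
W_parent_of = rng.normal(size=(d, d))        # relation "is parent of"

def score(head, W_rel, tail):
    return entities[head] @ W_rel @ entities[tail]

# If the reverse relation "is child of" is represented by the transpose,
# the reverse fact is consistent with the forward fact by construction.
W_child_of = W_parent_of.T
print(np.isclose(score("A", W_parent_of, "B"), score("B", W_child_of, "A")))  # True

# Editing the forward relation (here a rank-1 update) changes the reverse
# score consistently, with no separate edit required.
W_parent_of += np.outer(entities["A"], entities["B"])
W_child_of = W_parent_of.T
print(np.isclose(score("A", W_parent_of, "B"), score("B", W_child_of, "A")))  # still True
```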
@inproceedings{kim2026bilinear,
  title={Bilinear representation mitigates reversal curse and enables consistent model editing},
  author={Kim, Dong-Kyum and Kim, Minsung and Kwon, Jea and Yang, Nakyeong and Cha, Meeyoung},
  year={2026},
  url={https://openreview.net/forum?id=pdNaYcApbz},
  booktitle={The Fourteenth International Conference on Learning Representations},
  selected={true},
}
ICLR
Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning
Nakyeong Yang, Dong-Kyum Kim, Jea Kwon, Minsung Kim, Kyomin Jung, and Meeyoung Cha
In The Fourteenth International Conference on Learning Representations 2026
Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
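The sketch below shows one way attribution-guided regularization could be wired into an unlearning loop: per-neuron attributions on the forget set are computed and strongly negative ones are penalized, so knowledge is removed rather than hidden. The loss terms, the activation-times-gradient attribution, and the hyperparameters are assumptions made for illustration; Ssiuu's exact formulation is given in the paper.

```python
# Hedged sketch of attribution-guided regularization during unlearning.
# The forget-loss term, the activation*gradient attribution, and the penalty
# on negative attributions are illustrative assumptions, not Ssiuu's exact
# formulation from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_forget = torch.randn(16, 32)             # inputs carrying the target knowledge
y_forget = torch.randint(0, 8, (16,))

hidden = {}
model[1].register_forward_hook(lambda m, i, o: hidden.update(h=o))

for _ in range(100):
    opt.zero_grad()
    logits = model(x_forget)
    forget_loss = -nn.functional.cross_entropy(logits, y_forget)  # gradient ascent on forget set
    # Per-neuron attribution: activation times gradient of the forget loss w.r.t. it.
    grads = torch.autograd.grad(forget_loss, hidden["h"], retain_graph=True, create_graph=True)[0]
    attribution = (hidden["h"] * grads).mean(dim=0)
    # Penalize strongly negative attributions so target knowledge is erased
    # rather than masked by neurons that merely suppress it.
    spurious_penalty = torch.relu(-attribution).sum()
    (forget_loss + 0.1 * spurious_penalty).backward()
    opt.step()
```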
@inproceedings{kim2026unlearnneuron,
  title={Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning},
  author={Yang, Nakyeong and Kim, Dong-Kyum and Kwon, Jea and Kim, Minsung and Jung, Kyomin and Cha, Meeyoung},
  year={2026},
  url={https://openreview.net/forum?id=z2zFk9jYpw},
  booktitle={The Fourteenth International Conference on Learning Representations},
  selected={true},
}
NeurIPS
Transformer as a hippocampal memory consolidation model based on NMDAR-inspired nonlinearity
Dong-Kyum Kim, Jea Kwon, Meeyoung Cha, and C. Justin Lee
In Thirty-seventh Conference on Neural Information Processing Systems 2023
The hippocampus plays a critical role in learning, memory, and spatial representation, processes that depend on the NMDA receptor (NMDAR). Here we build on recent findings comparing deep learning models to the hippocampus and develop a new nonlinear activation function based on NMDAR dynamics. We find that NMDAR-like nonlinearity is essential for shifting short-term working memory into long-term reference memory in transformers, thus enhancing a process that resembles memory consolidation in the mammalian brain. We design a navigation task assessing these two memory functions and show that manipulating the activation function (i.e., mimicking the Mg^2+-gating of NMDAR) disrupts long-term memory processes. Our experiments suggest that place cell-like functions and reference memory reside in the feed-forward network layer of transformers and that nonlinearity drives these processes. We discuss the role of NMDAR-like nonlinearity in establishing this striking resemblance between transformer architecture and hippocampal spatial representation.
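A rough sketch of a sigmoid-gated activation with a Mg^2+-like gating parameter, in the spirit described above: the exact functional form and parameterization used in the paper may differ, and `alpha` here is an illustrative stand-in for the strength of the Mg^2+ block.

```python
# Illustrative sketch of a sigmoid-gated activation with a "Mg2+-like" gating
# parameter alpha. The paper's exact functional form may differ; alpha is an
# assumption standing in for the strength of the Mg2+ block.
import torch

def nmdar_like(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # A SiLU/GELU-style gate: the input passes through a sigmoid gate whose
    # threshold shifts with the Mg2+-like parameter.
    gate = torch.sigmoid(x - torch.log(torch.tensor(alpha)))
    return x * gate

x = torch.linspace(-5, 5, steps=11)
print(nmdar_like(x, alpha=1.0))   # alpha = 1 recovers the plain SiLU gate
print(nmdar_like(x, alpha=10.0))  # larger alpha -> stronger block of small inputs
```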
@inproceedings{kim2023nmda,
  title={Transformer as a hippocampal memory consolidation model based on NMDAR-inspired nonlinearity},
  author={Kim, Dong-Kyum and Kwon, Jea and Cha, Meeyoung and Lee, C. Justin},
  year={2023},
  url={https://openreview.net/forum?id=vKpVJxplmB},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  selected={true},
}
PRL
Learning Entropy Production via Neural Networks
Dong-Kyum Kim, Youngkyoung Bae, Sangyun Lee, and Hawoong Jeong
In Physical Review Letters 2020
This Letter presents a neural estimator for entropy production (NEEP) that estimates entropy production (EP) from trajectories of relevant variables without detailed information on the system dynamics. For steady state, we rigorously prove that the estimator, which can be built up from different choices of deep neural networks, provides stochastic EP by optimizing the objective function proposed here. We verify the NEEP with the stochastic processes of the bead-spring and discrete flashing ratchet models and also demonstrate that our method is applicable to high-dimensional data and can provide coarse-grained EP for Markov systems with unobservable states.
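A minimal sketch of a NEEP-style training loop, assuming the objective takes an antisymmetric, exponential form J(theta) = E[dS - exp(-dS)] with dS(s_t, s_{t+1}) = h(s_t, s_{t+1}) - h(s_{t+1}, s_t); the architecture, toy data, and hyperparameters are illustrative rather than those of the Letter.

```python
# Minimal sketch of a NEEP-style estimator, assuming the objective
# J = E[dS - exp(-dS)] with antisymmetric output
# dS(s_t, s_{t+1}) = h(s_t, s_{t+1}) - h(s_{t+1}, s_t). Architecture, data,
# and hyperparameters here are illustrative, not those of the Letter.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 2                                          # dimension of the observed state
h = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(h.parameters(), lr=1e-3)

# Toy trajectory pairs (s_t, s_{t+1}); in practice these come from the observed process.
s_t = torch.randn(1024, dim)
s_next = s_t + 0.1 * torch.randn(1024, dim)

def delta_S(a, b):
    """Antisymmetric network output: an estimate of stochastic EP per step."""
    return h(torch.cat([a, b], dim=1)) - h(torch.cat([b, a], dim=1))

for _ in range(200):
    opt.zero_grad()
    dS = delta_S(s_t, s_next)
    loss = -(dS - torch.exp(-dS)).mean()         # maximize J(theta) = E[dS - exp(-dS)]
    loss.backward()
    opt.step()

print(delta_S(s_t, s_next).mean().item())        # average estimated EP per step
```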
@article{kim2020neep,
  title={Learning Entropy Production via Neural Networks},
  author={Kim, Dong-Kyum and Bae, Youngkyoung and Lee, Sangyun and Jeong, Hawoong},
  journal={Phys. Rev. Lett.},
  volume={125},
  issue={14},
  pages={140604},
  numpages={6},
  year={2020},
  month=oct,
  publisher={American Physical Society},
  doi={10.1103/PhysRevLett.125.140604},
  url={https://link.aps.org/doi/10.1103/PhysRevLett.125.140604},
  selected={true},
}