Are Vision Large Language Models Effective Handwriting Graph Extractors?

Co-Supervised by: Dr. Linlin Jia

If you are interested in this topic or have further questions, do not hesitate to contact linlin.jia@unibe.ch.

Background / Context

Extracting structural graph representations from handwritten images is a challenging yet essential problem for visual understanding, pattern analysis, and downstream reasoning tasks. Recent progress in vision large language models (VLLMs) introduces new possibilities by combining visual grounding with structured, language-like reasoning, enabling the generation of explicit node–edge representations and interpretable graph descriptions.

This project aims to explore the potential of VLLMs for visual-to-graph transformation, leveraging their multimodal reasoning capacity, structured output abilities, and interpretability to achieve accurate and explainable graph extraction from handwritten data. In addition, the project will investigate evaluation metrics and interpretability analysis for generated graphs, building a foundation for robust and explainable visual-structural understanding.

Research Question(s) / Goals

In this project, we aim to answer the following research questions:

  • Q1: Can pre-trained vision large language models (VLLMs) effectively extract graph-structured representations (nodes and edges) from handwritten images?
  • Q2: Can SAM-derived stroke and keypoint embeddings improve the fine-tuning performance or interpretability of VLLMs for structure extraction?
  • Q3: How can the extracted graph structures be properly evaluated?

To answer these questions, we define the following goals:

  • G1: Construct or refine datasets of handwritten images annotated with graph structures.
  • G2: Design multi-level and multi-aspect metrics to evaluate the quality of the extracted graphs.
  • G3: Design prompting strategies and/or fine-tuning objectives to guide VLLMs in generating structured graph outputs (see the prompting sketch after this list).
  • G4: Integrate semantic and geometric embeddings constructed via SAM-based stroke and keypoint extraction modules to support VLLM fine-tuning, and design loss functions that fuse this multi-granularity information.
  • G5: Evaluate and compare the performance of the proposed methods against baselines such as transformer-based extraction approaches.
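To make G3 concrete, the minimal sketch below shows one possible prompting strategy: the VLLM is asked to emit the graph as strict JSON with explicit node coordinates and an edge list, so the output can be validated programmatically. The prompt wording, the JSON schema, and the parse_graph helper are illustrative assumptions, not a fixed design.

```python
import json

# Hypothetical prompt (assumption): request the graph as strict JSON so the
# response can be parsed and scored automatically.
PROMPT = (
    "You are given an image of a handwritten character. "
    "Extract its skeleton graph. Respond with JSON only, in the form: "
    '{"nodes": [{"id": 0, "x": 12.5, "y": 40.0}, ...], "edges": [[0, 1], ...]}'
)

def parse_graph(vllm_response: str):
    """Parse the model's JSON answer into node and edge lists.

    Raises ValueError on schema violations, which also lets us measure how
    often a model produces structurally valid output.
    """
    data = json.loads(vllm_response)
    nodes, edges = data["nodes"], data["edges"]
    ids = {n["id"] for n in nodes}
    if any(u not in ids or v not in ids for u, v in edges):
        raise ValueError("edge references an unknown node id")
    return nodes, edges
```

The share of responses that parse without error is itself a useful first-pass statistic when comparing prompting strategies.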

Approach / Methods

  • Prepare and preprocess handwritten datasets, extracting stroke or keypoint embeddings.
  • Survey the state-of-the-art research on VLLMs for structural information extraction.
  • Design and implement graph quality evaluation metrics, such as measures based on graph edit distance (GED) or graph topology (see the evaluation sketch after this list).
  • Collaborate with chemists to evaluate the models, validating the outputs and ensuring scientific relevance.
  • Explore different prompting and fine-tuning strategies on current VLLMs to benchmark their ability to extract graphs from images.
  • Develop novel graph extraction approaches by augmenting VLLMs with topological embeddings constructed via SAM.
  • Benchmark the designed approach against the baselines.
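As a concrete starting point for the metric design above, the following sketch scores a predicted graph against a reference using networkx's graph edit distance; the positional node-matching rule and the tol threshold are assumptions for illustration. Exact GED is NP-hard, so the timeout argument (or an approximate variant) will likely be needed for larger graphs.

```python
import networkx as nx

def build_graph(nodes, edges):
    """Build an undirected graph with (x, y) coordinates as node attributes."""
    g = nx.Graph()
    for n in nodes:
        g.add_node(n["id"], pos=(n["x"], n["y"]))
    g.add_edges_from(edges)
    return g

def ged_score(predicted: nx.Graph, reference: nx.Graph, tol: float = 5.0):
    """Graph edit distance where two nodes match if their positions lie
    within `tol` pixels of each other. Lower is better; 0 is an exact
    match up to the tolerance."""
    def node_match(a, b):
        (xa, ya), (xb, yb) = a["pos"], b["pos"]
        return (xa - xb) ** 2 + (ya - yb) ** 2 <= tol ** 2
    # timeout bounds the exponential search on larger graphs
    return nx.graph_edit_distance(predicted, reference,
                                  node_match=node_match, timeout=30)
```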

Expected Contributions / Outcomes

  • A novel VLLM-based framework for extracting graph-structured representations from handwritten images.
  • A multi-level, multi-aspect matrix of metrics for assessing graph extraction quality.
  • Prompting and fine-tuning strategies for VLLMs to output high-quality structured graph data.
  • A comprehensive benchmark of the proposed VLLM-based methods against state-of-the-art approaches
  • Reproducible code, model checkpoints, and experimental results.
  • A thesis and potentially a research paper submission.

Required Skills / Prerequisites

  • Solid knowledge of statistics and machine learning; an understanding of deep learning, learning on graphs, and computer vision is a plus.
  • Strong programming skills, preferably in Python and PyTorch.
  • Familiarity with VLLM prompting and fine-tuning pipelines, or willingness to learn.
  • Proficiency in English for effective communication and presentations at the research level.

Possible Extensions

  • Explore multi-modal fusion where stroke/keypoint embeddings are aligned with visual tokens inside the attention mechanism (see the fusion sketch after this list).
  • Investigate the potential of using prompt-based conditioning to jointly perform related tasks such as stroke segmentation, node localization, and graph reasoning.
  • Leverage LLMs’ inherent reasoning and explanation capabilities to analyze or verbalize how and why certain graph topologies are generated.
  • Evaluate models’ generalization ability on multi-language handwriting.
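A minimal PyTorch sketch of the first extension above, under the assumption of a simple cross-attention block in which projected stroke/keypoint embeddings serve as keys and values for the visual tokens; all dimensions and the module layout are hypothetical.

```python
import torch
import torch.nn as nn

class StrokeVisualFusion(nn.Module):
    """Visual tokens attend to SAM-derived stroke/keypoint embeddings."""

    def __init__(self, d_visual: int = 1024, d_stroke: int = 256, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_stroke, d_visual)   # align embedding spaces
        self.attn = nn.MultiheadAttention(d_visual, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_visual)

    def forward(self, visual_tokens: torch.Tensor, stroke_emb: torch.Tensor):
        # visual_tokens: (B, N_v, d_visual); stroke_emb: (B, N_s, d_stroke)
        kv = self.proj(stroke_emb)
        fused, _ = self.attn(query=visual_tokens, key=kv, value=kv)
        # residual connection preserves the original visual signal
        return self.norm(visual_tokens + fused)
```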

Further Reading / Starting Literature

  • Roberts, J. S., Lee, T., Wong, C. H., Yasunaga, M., Mai, Y., & Liang, P. (2024). Image2struct: Benchmarking structure extraction for vision-language models. Advances in Neural Information Processing Systems, 37, 115058-115097.
  • Dutta, A. (2025). Zero-Shot Scene Graph Relationship Prediction using VLMs (Doctoral dissertation, Virginia Tech).
  • Hetang, C., Xue, H., Le, C., Yue, T., Wang, W., & He, Y. (2024). Segment anything model for road network graph extraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2556-2566).