new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jan 8

Exploiting the Brain's Network Structure for Automatic Identification of ADHD Subjects

Attention Deficit Hyperactive Disorder (ADHD) is a common behavioral problem affecting children. In this work, we investigate the automatic classification of ADHD subjects using the resting state Functional Magnetic Resonance Imaging (fMRI) sequences of the brain. We show that the brain can be modeled as a functional network, and certain properties of the networks differ in ADHD subjects from control subjects. We compute the pairwise correlation of brain voxels' activity over the time frame of the experimental protocol which helps to model the function of a brain as a network. Different network features are computed for each of the voxels constructing the network. The concatenation of the network features of all the voxels in a brain serves as the feature vector. Feature vectors from a set of subjects are then used to train a PCA-LDA (principal component analysis-linear discriminant analysis) based classifier. We hypothesized that ADHD-related differences lie in some specific regions of the brain and using features only from those regions is sufficient to discriminate ADHD and control subjects. We propose a method to create a brain mask that includes the useful regions only and demonstrate that using the feature from the masked regions improves classification accuracy on the test data set. We train our classifier with 776 subjects and test on 171 subjects provided by The Neuro Bureau for the ADHD-200 challenge. We demonstrate the utility of graph-motif features, specifically the maps that represent the frequency of participation of voxels in network cycles of length 3. The best classification performance (69.59%) is achieved using 3-cycle map features with masking. Our proposed approach holds promise in being able to diagnose and understand the disorder.

  • 3 authors
·
Jun 15, 2023

Peregrine: A Pattern-Aware Graph Mining System

Graph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. General purpose graph mining systems provide a generic runtime to explore subgraph structures of interest with the help of user-defined functions that guide the overall exploration process. However, the state-of-the-art graph mining systems remain largely oblivious to the shape (or pattern) of the subgraphs that they mine. This causes them to: (a) explore unnecessary subgraphs; (b) perform expensive computations on the explored subgraphs; and, (c) hold intermediate partial subgraphs in memory; all of which affect their overall performance. Furthermore, their programming models are often tied to their underlying exploration strategies, which makes it difficult for domain users to express complex mining tasks. In this paper, we develop Peregrine, a pattern-aware graph mining system that directly explores the subgraphs of interest while avoiding exploration of unnecessary subgraphs, and simultaneously bypassing expensive computations throughout the mining process. We design a pattern-based programming model that treats "graph patterns" as first class constructs and enables Peregrine to extract the semantics of patterns, which it uses to guide its exploration. Our evaluation shows that Peregrine outperforms state-of-the-art distributed and single machine graph mining systems, and scales to complex mining tasks on larger graphs, while retaining simplicity and expressivity with its "pattern-first" programming approach.

  • 3 authors
·
Apr 5, 2020

On the Power of the Weisfeiler-Leman Test for Graph Motif Parameters

Seminal research in the field of graph neural networks (GNNs) has revealed a direct correspondence between the expressive capabilities of GNNs and the k-dimensional Weisfeiler-Leman (kWL) test, a widely-recognized method for verifying graph isomorphism. This connection has reignited interest in comprehending the specific graph properties effectively distinguishable by the kWL test. A central focus of research in this field revolves around determining the least dimensionality k, for which kWL can discern graphs with different number of occurrences of a pattern graph P. We refer to such a least k as the WL-dimension of this pattern counting problem. This inquiry traditionally delves into two distinct counting problems related to patterns: subgraph counting and induced subgraph counting. Intriguingly, despite their initial appearance as separate challenges with seemingly divergent approaches, both of these problems are interconnected components of a more comprehensive problem: "graph motif parameters". In this paper, we provide a precise characterization of the WL-dimension of labeled graph motif parameters. As specific instances of this result, we obtain characterizations of the WL-dimension of the subgraph counting and induced subgraph counting problem for every labeled pattern P. We additionally demonstrate that in cases where the kWL test distinguishes between graphs with varying occurrences of a pattern P, the exact number of occurrences of P can be computed uniformly using only local information of the last layer of a corresponding GNN. We finally delve into the challenge of recognizing the WL-dimension of various graph parameters. We give a polynomial time algorithm for determining the WL-dimension of the subgraph counting problem for given pattern P, answering an open question from previous work.

  • 2 authors
·
Sep 29, 2023

A Survey on Machine Learning Solutions for Graph Pattern Extraction

A subgraph is constructed by using a subset of vertices and edges of a given graph. There exist many graph properties that are hereditary for subgraphs. Hence, researchers from different communities have paid a great deal of attention in studying numerous subgraph problems, on top of the ordinary graph problems. Many algorithms are proposed in studying subgraph problems, where one common approach is by extracting the patterns and structures of a given graph. Due to the complex structures of certain types of graphs and to improve overall performances of the existing frameworks, machine learning techniques have recently been employed in dealing with various subgraph problems. In this article, we present a comprehensive review on five well known subgraph problems that have been tackled by using machine learning methods. They are subgraph isomorphism (both counting and matching), maximum common subgraph, community detection and community search problems. We provide an outline of each proposed method, and examine its designs and performances. We also explore non-learning-based algorithms for each problem and a brief discussion is given. We then suggest some promising research directions in this area, hoping that relevant subgraph problems can be tackled by using a similar strategy. Since there is a huge growth in employing machine learning techniques in recent years, we believe that this survey will serve as a good reference point to relevant research communities.

  • 6 authors
·
Apr 3, 2022

Understanding Graph Databases: A Comprehensive Tutorial and Survey

This tutorial serves as a comprehensive guide for understanding graph databases, focusing on the fundamentals of graph theory while showcasing practical applications across various fields. It starts by introducing foundational concepts and delves into the structure of graphs through nodes and edges, covering different types such as undirected, directed, weighted, and unweighted graphs. Key graph properties, terminologies, and essential algorithms for network analysis are outlined, including Dijkstras shortest path algorithm and methods for calculating node centrality and graph connectivity. The tutorial highlights the advantages of graph databases over traditional relational databases, particularly in efficiently managing complex, interconnected data. It examines leading graph database systems such as Neo4j, Amazon Neptune, and ArangoDB, emphasizing their unique features for handling large datasets. Practical instructions on graph operations using NetworkX and Neo4j are provided, covering node and edge creation, attribute assignment, and advanced queries with Cypher. Additionally, the tutorial explores common graph visualization techniques using tools like Plotly and Neo4j Bloom, which enhance the interpretation and usability of graph data. It also delves into community detection algorithms, including the Louvain method, which facilitates clustering in large networks. Finally, the paper concludes with recommendations for researchers interested in exploring the vast potential of graph technologies.

  • 3 authors
·
Nov 15, 2024

Extending Bootstrap AMG for Clustering of Attributed Graphs

In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges as proposed in [1, 2]. The augmented graph is then embedded in a Euclidean space associated to its Laplacian and we cluster vertices via a modified K-means algorithm, using a new vector-valued distance in the embedding space. Main novelty of our method, which can be classified as an early fusion method, i.e., a method in which additional information on vertices are fused to the structure information before applying clustering, is the interpretation of attributes as new realizations of graph vertices, which can be dealt with as coordinate vectors in a related Euclidean space. This allows us to extend a scalable generalized spectral clustering procedure which substitutes graph Laplacian eigenvectors with some vectors, named algebraically smooth vectors, obtained by a linear-time complexity Algebraic MultiGrid (AMG) method. We discuss the performance of our proposed clustering method by comparison with recent literature approaches and public available results. Extensive experiments on different types of synthetic datasets and real-world attributed graphs show that our new algorithm, embedding attributes information in the clustering, outperforms structure-only-based methods, when the attributed network has an ambiguous structure. Furthermore, our new method largely outperforms the method which originally proposed the graph augmentation, showing that our embedding strategy and vector-valued distance are very effective in taking advantages from the augmented-graph representation.

  • 3 authors
·
Sep 20, 2021

Can LLMs Convert Graphs to Text-Attributed Graphs?

Graphs are ubiquitous structures found in numerous real-world applications, such as drug discovery, recommender systems, and social network analysis. To model graph-structured data, graph neural networks (GNNs) have become a popular tool. However, existing GNN architectures encounter challenges in cross-graph learning where multiple graphs have different feature spaces. To address this, recent approaches introduce text-attributed graphs (TAGs), where each node is associated with a textual description, which can be projected into a unified feature space using textual encoders. While promising, this method relies heavily on the availability of text-attributed graph data, which is difficult to obtain in practice. To bridge this gap, we propose a novel method named Topology-Aware Node description Synthesis (TANS), leveraging large language models (LLMs) to convert existing graphs into text-attributed graphs. The key idea is to integrate topological information into LLMs to explain how graph topology influences node semantics. We evaluate our TANS on text-rich, text-limited, and text-free graphs, demonstrating its applicability. Notably, on text-free graphs, our method significantly outperforms existing approaches that manually design node features, showcasing the potential of LLMs for preprocessing graph-structured data in the absence of textual information. The code and data are available at https://github.com/Zehong-Wang/TANS.

  • 6 authors
·
Dec 13, 2024

From Graphs to Hypergraphs: Hypergraph Projection and its Remediation

We study the implications of the modeling choice to use a graph, instead of a hypergraph, to represent real-world interconnected systems whose constituent relationships are of higher order by nature. Such a modeling choice typically involves an underlying projection process that maps the original hypergraph onto a graph, and is common in graph-based analysis. While hypergraph projection can potentially lead to loss of higher-order relations, there exists very limited studies on the consequences of doing so, as well as its remediation. This work fills this gap by doing two things: (1) we develop analysis based on graph and set theory, showing two ubiquitous patterns of hyperedges that are root to structural information loss in all hypergraph projections; we also quantify the combinatorial impossibility of recovering the lost higher-order structures if no extra help is provided; (2) we still seek to recover the lost higher-order structures in hypergraph projection, and in light of (1)'s findings we propose to relax the problem into a learning-based setting. Under this setting, we develop a learning-based hypergraph reconstruction method based on an important statistic of hyperedge distributions that we find. Our reconstruction method is evaluated on 8 real-world datasets under different settings, and exhibits consistently good performance. We also demonstrate benefits of the reconstructed hypergraphs via use cases of protein rankings and link predictions.

  • 2 authors
·
Jan 16, 2024

Graph Prompt Learning: A Comprehensive Survey and Beyond

Artificial General Intelligence (AGI) has revolutionized numerous fields, yet its integration with graph data, a cornerstone in our interconnected world, remains nascent. This paper presents a pioneering survey on the emerging domain of graph prompts in AGI, addressing key challenges and opportunities in harnessing graph data for AGI applications. Despite substantial advancements in AGI across natural language processing and computer vision, the application to graph data is relatively underexplored. This survey critically evaluates the current landscape of AGI in handling graph data, highlighting the distinct challenges in cross-modality, cross-domain, and cross-task applications specific to graphs. Our work is the first to propose a unified framework for understanding graph prompt learning, offering clarity on prompt tokens, token structures, and insertion patterns in the graph domain. We delve into the intrinsic properties of graph prompts, exploring their flexibility, expressiveness, and interplay with existing graph models. A comprehensive taxonomy categorizes over 100 works in this field, aligning them with pre-training tasks across node-level, edge-level, and graph-level objectives. Additionally, we present, ProG, a Python library, and an accompanying website, to support and advance research in graph prompting. The survey culminates in a discussion of current challenges and future directions, offering a roadmap for research in graph prompting within AGI. Through this comprehensive analysis, we aim to catalyze further exploration and practical applications of AGI in graph data, underlining its potential to reshape AGI fields and beyond. ProG and the website can be accessed by https://github.com/WxxShirley/Awesome-Graph-Prompt, and https://github.com/sheldonresearch/ProG, respectively.

  • 6 authors
·
Nov 28, 2023

TempME: Towards the Explainability of Temporal Graph Neural Networks via Motif Discovery

Temporal graphs are widely used to model dynamic systems with time-varying interactions. In real-world scenarios, the underlying mechanisms of generating future interactions in dynamic systems are typically governed by a set of recurring substructures within the graph, known as temporal motifs. Despite the success and prevalence of current temporal graph neural networks (TGNN), it remains uncertain which temporal motifs are recognized as the significant indications that trigger a certain prediction from the model, which is a critical challenge for advancing the explainability and trustworthiness of current TGNNs. To address this challenge, we propose a novel approach, called Temporal Motifs Explainer (TempME), which uncovers the most pivotal temporal motifs guiding the prediction of TGNNs. Derived from the information bottleneck principle, TempME extracts the most interaction-related motifs while minimizing the amount of contained information to preserve the sparsity and succinctness of the explanation. Events in the explanations generated by TempME are verified to be more spatiotemporally correlated than those of existing approaches, providing more understandable insights. Extensive experiments validate the superiority of TempME, with up to 8.21% increase in terms of explanation accuracy across six real-world datasets and up to 22.96% increase in boosting the prediction Average Precision of current TGNNs.

  • 2 authors
·
Oct 30, 2023

Linguistic and Structural Basis of Engineering Design Knowledge

Artefact descriptions are the primary carriers of engineering design knowledge that is both an outcome and a driver of the design process. While an artefact could be described in different connotations, the design process requires a description to embody engineering design knowledge, which is expressed in the text through intricate placement of entities and relationships. As large-language models learn from all kinds of text merely as a sequence of characters/tokens, these are yet to generate text that embodies explicit engineering design facts. Existing ontological design theories are less likely to guide the large-language models whose applications are currently limited to ideation and learning purposes. In this article, we explicate engineering design knowledge as knowledge graphs from a large sample of 33,881 patent documents. We examine the constituents of these knowledge graphs to understand the linguistic and structural basis of engineering design knowledge. In terms of linguistic basis, we observe that entities and relationships could be generalised to 64 and 24 linguistic syntaxes. While relationships mainly capture attributes ('of'), structure ('in', 'with'), purpose ('to', 'for'), hierarchy ('include'), exemplification ('such as'), and behaviour ('to', 'from'), the hierarchical relationships could specifically be identified using 75 unique syntaxes. To understand the structural basis, we draw inspiration from various studies on biological/ecological networks and discover motifs from patent knowledge graphs. We identify four 3-node and four 4-node patterns that could further be converged and simplified into sequence [->...->], aggregation [->...<-], and hierarchy [<-...->]. Expected to guide large-language model based design tools, we propose few regulatory precepts for concretising abstract entities and relationships within subgraphs, while explicating hierarchical structures.

  • 2 authors
·
Dec 11, 2023

The Underappreciated Power of Vision Models for Graph Structural Understanding

Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models' ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.

  • 9 authors
·
Oct 27, 2025 5

Efficient Maximum Fair Clique Search over Large Networks

Mining cohesive subgraphs in attributed graphs is an essential problem in the domain of graph data analysis. The integration of fairness considerations significantly fuels interest in models and algorithms for mining fairness-aware cohesive subgraphs. Notably, the relative fair clique emerges as a robust model, ensuring not only comprehensive attribute coverage but also greater flexibility in distributing attribute vertices. Motivated by the strength of this model, we for the first time pioneer an investigation into the identification of the maximum relative fair clique in large-scale graphs. We introduce a novel concept of colorful support, which serves as the foundation for two innovative graph reduction techniques. These techniques effectively narrow the graph's size by iteratively removing edges that do not belong to relative fair cliques. Furthermore, a series of upper bounds of the maximum relative fair clique size is proposed by incorporating consideration of vertex attributes and colors. The pruning techniques derived from these upper bounds can significantly trim unnecessary search space during the branch-and-bound procedure. Adding to this, we present a heuristic algorithm with a linear time complexity, employing both a degree-based greedy strategy and a colored degree-based greedy strategy to identify a larger relative fair clique. This heuristic algorithm can serve a dual purpose by aiding in branch pruning, thereby enhancing overall search efficiency. Extensive experiments conducted on six real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms.

  • 6 authors
·
Dec 7, 2023

Simplicial Closure and higher-order link prediction

Networks provide a powerful formalism for modeling complex systems by using a model of pairwise interactions. But much of the structure within these systems involves interactions that take place among more than two nodes at once; for example, communication within a group rather than person-to person, collaboration among a team rather than a pair of coauthors, or biological interaction between a set of molecules rather than just two. Such higher-order interactions are ubiquitous, but their empirical study has received limited attention, and little is known about possible organizational principles of such structures. Here we study the temporal evolution of 19 datasets with explicit accounting for higher-order interactions. We show that there is a rich variety of structure in our datasets but datasets from the same system types have consistent patterns of higher-order structure. Furthermore, we find that tie strength and edge density are competing positive indicators of higher-order organization, and these trends are consistent across interactions involving differing numbers of nodes. To systematically further the study of theories for such higher-order structures, we propose higher-order link prediction as a benchmark problem to assess models and algorithms that predict higher-order structure. We find a fundamental differences from traditional pairwise link prediction, with a greater role for local rather than long-range information in predicting the appearance of new interactions.

  • 5 authors
·
Feb 19, 2018

Retrieval-Augmented Generation with Graphs (GraphRAG)

Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic "nodes connected by edges" nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diverse-formatted and domain-specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up-to-date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities. Our survey repository is publicly maintained at https://github.com/Graph-RAG/GraphRAG/.

  • 18 authors
·
Dec 31, 2024

Invariant Graph Transformer

Rationale discovery is defined as finding a subset of the input data that maximally supports the prediction of downstream tasks. In graph machine learning context, graph rationale is defined to locate the critical subgraph in the given graph topology, which fundamentally determines the prediction results. In contrast to the rationale subgraph, the remaining subgraph is named the environment subgraph. Graph rationalization can enhance the model performance as the mapping between the graph rationale and prediction label is viewed as invariant, by assumption. To ensure the discriminative power of the extracted rationale subgraphs, a key technique named "intervention" is applied. The core idea of intervention is that given any changing environment subgraphs, the semantics from the rationale subgraph is invariant, which guarantees the correct prediction result. However, most, if not all, of the existing rationalization works on graph data develop their intervention strategies on the graph level, which is coarse-grained. In this paper, we propose well-tailored intervention strategies on graph data. Our idea is driven by the development of Transformer models, whose self-attention module provides rich interactions between input nodes. Based on the self-attention module, our proposed invariant graph Transformer (IGT) can achieve fine-grained, more specifically, node-level and virtual node-level intervention. Our comprehensive experiments involve 7 real-world datasets, and the proposed IGT shows significant performance advantages compared to 13 baseline methods.

  • 7 authors
·
Dec 12, 2023

GAugLLM: Improving Graph Contrastive Learning for Text-Attributed Graphs with Large Language Models

This work studies self-supervised graph learning for text-attributed graphs (TAGs) where nodes are represented by textual attributes. Unlike traditional graph contrastive methods that perturb the numerical feature space and alter the graph's topological structure, we aim to improve view generation through language supervision. This is driven by the prevalence of textual attributes in real applications, which complement graph structures with rich semantic information. However, this presents challenges because of two major reasons. First, text attributes often vary in length and quality, making it difficulty to perturb raw text descriptions without altering their original semantic meanings. Second, although text attributes complement graph structures, they are not inherently well-aligned. To bridge the gap, we introduce GAugLLM, a novel framework for augmenting TAGs. It leverages advanced large language models like Mistral to enhance self-supervised graph learning. Specifically, we introduce a mixture-of-prompt-expert technique to generate augmented node features. This approach adaptively maps multiple prompt experts, each of which modifies raw text attributes using prompt engineering, into numerical feature space. Additionally, we devise a collaborative edge modifier to leverage structural and textual commonalities, enhancing edge augmentation by examining or building connections between nodes. Empirical results across five benchmark datasets spanning various domains underscore our framework's ability to enhance the performance of leading contrastive methods as a plug-in tool. Notably, we observe that the augmented features and graph structure can also enhance the performance of standard generative methods, as well as popular graph neural networks. The open-sourced implementation of our GAugLLM is available at Github.

  • 4 authors
·
Jun 17, 2024

Disentangled Structural and Featural Representation for Task-Agnostic Graph Valuation

With the emergence of data marketplaces, the demand for methods to assess the value of data has increased significantly. While numerous techniques have been proposed for this purpose, none have specifically addressed graphs as the main data modality. Graphs are widely used across various fields, ranging from chemical molecules to social networks. In this study, we break down graphs into two main components: structural and featural, and we focus on evaluating data without relying on specific task-related metrics, making it applicable in practical scenarios where validation requirements may be lacking. We introduce a novel framework called blind message passing, which aligns the seller's and buyer's graphs using a shared node permutation based on graph matching. This allows us to utilize the graph Wasserstein distance to quantify the differences in the structural distribution of graph datasets, called the structural disparities. We then consider featural aspects of buyers' and sellers' graphs for data valuation and capture their statistical similarities and differences, referred to as relevance and diversity, respectively. Our approach ensures that buyers and sellers remain unaware of each other's datasets. Our experiments on real datasets demonstrate the effectiveness of our approach in capturing the relevance, diversity, and structural disparities of seller data for buyers, particularly in graph-based data valuation scenarios.

  • 2 authors
·
Aug 22, 2024

Graphlets correct for the topological information missed by random walks

Random walks are widely used for mining networks due to the computational efficiency of computing them. For instance, graph representation learning learns a d-dimensional embedding space, so that the nodes that tend to co-occur on random walks (a proxy of being in the same network neighborhood) are close in the embedding space. Specific local network topology (i.e., structure) influences the co-occurrence of nodes on random walks, so random walks of limited length capture only partial topological information, hence diminishing the performance of downstream methods. We explicitly capture all topological neighborhood information and improve performance by introducing orbit adjacencies that quantify the adjacencies of two nodes as co-occurring on a given pair of graphlet orbits, which are symmetric positions on graphlets (small, connected, non-isomorphic, induced subgraphs of a large network). Importantly, we mathematically prove that random walks on up to k nodes capture only a subset of all the possible orbit adjacencies for up to k-node graphlets. Furthermore, we enable orbit adjacency-based analysis of networks by developing an efficient GRaphlet-orbit ADjacency COunter (GRADCO), which exhaustively computes all 28 orbit adjacency matrices for up to four-node graphlets. Note that four-node graphlets suffice, because real networks are usually small-world. In large networks on around 20,000 nodes, GRADCOcomputesthe28matricesinminutes. Onsixrealnetworksfromvarious domains, we compare the performance of node-label predictors obtained by using the network embeddings based on our orbit adjacencies to those based on random walks. We find that orbit adjacencies, which include those unseen by random walks, outperform random walk-based adjacencies, demonstrating the importance of the inclusion of the topological neighborhood information that is unseen by random walks.

  • 3 authors
·
May 23, 2024

IRWE: Inductive Random Walk for Joint Inference of Identity and Position Network Embedding

Network embedding, which maps graphs to distributed representations, is a unified framework for various graph inference tasks. According to the topology properties (e.g., structural roles and community memberships of nodes) to be preserved, it can be categorized into the identity and position embedding. However, existing methods can only capture one type of property. Some approaches can support the inductive inference that generalizes the embedding model to new nodes or graphs but relies on the availability of attributes. Due to the complicated correlations between topology and attributes, it is unclear for some inductive methods which type of property they can capture. In this study, we explore a unified framework for the joint inductive inference of identity and position embeddings without attributes. An inductive random walk embedding (IRWE) method is proposed, which combines multiple attention units to handle the random walk on graph topology and simultaneously derives identity and position embeddings that are jointly optimized. In particular, we demonstrate that some random walk statistics can be informative features to characterize node identities and positions while supporting the inductive embedding inference. Experiments validate the superior performance of IRWE beyond various baselines for the transductive and inductive inference of identity and position embeddings.

  • 2 authors
·
Dec 31, 2023

One for All: Towards Training One Graph Model for All Classification Tasks

Designing a single model to address multiple tasks has been a long-standing objective in artificial intelligence. Recently, large language models have demonstrated exceptional capability in solving different tasks within the language domain. However, a unified model for various graph tasks remains underexplored, primarily due to the challenges unique to the graph learning domain. First, graph data from different areas carry distinct attributes and follow different distributions. Such discrepancy makes it hard to represent graphs in a single representation space. Second, tasks on graphs diversify into node, link, and graph tasks, requiring distinct embedding strategies. Finally, an appropriate graph prompting paradigm for in-context learning is unclear. We propose One for All (OFA), the first general framework that can use a single graph model to address the above challenges. Specifically, OFA proposes text-attributed graphs to unify different graph data by describing nodes and edges with natural language and uses language models to encode the diverse and possibly cross-domain text attributes to feature vectors in the same embedding space. Furthermore, OFA introduces the concept of nodes-of-interest to standardize different tasks with a single task representation. For in-context learning on graphs, OFA introduces a novel graph prompting paradigm that appends prompting substructures to the input graph, which enables it to address varied tasks without fine-tuning. We train the OFA model using graph data from multiple domains (including citation networks, molecular graphs, knowledge graphs, etc.) simultaneously and evaluate its ability in supervised, few-shot, and zero-shot learning scenarios. OFA performs well across different tasks, making it the first general-purpose across-domains classification model on graphs.

  • 7 authors
·
Sep 29, 2023

SiMilarity-Enhanced Homophily for Multi-View Heterophilous Graph Clustering

With the increasing prevalence of graph-structured data, multi-view graph clustering has been widely used in various downstream applications. Existing approaches primarily rely on a unified message passing mechanism, which significantly enhances clustering performance. Nevertheless, this mechanism limits its applicability to heterophilous situations, as it is fundamentally predicated on the assumption of homophily, i.e., the connected nodes often belong to the same class. In reality, this assumption does not always hold; a moderately or even mildly homophilous graph is more common than a fully homophilous one due to inevitable heterophilous information in the graph. To address this issue, in this paper, we propose a novel SiMilarity-enhanced Homophily for Multi-view Heterophilous Graph Clustering (SMHGC) approach. By analyzing the relationship between similarity and graph homophily, we propose to enhance the homophily by introducing three similarity terms, i.e., neighbor pattern similarity, node feature similarity, and multi-view global similarity, in a label-free manner. Then, a consensus-based inter- and intra-view fusion paradigm is proposed to fuse the improved homophilous graph from different views and utilize them for clustering. The state-of-the-art experimental results on both multi-view heterophilous and homophilous datasets collectively demonstrate the strong capacity of similarity for unsupervised multi-view heterophilous graph learning. Additionally, the consistent performance across semi-synthetic datasets with varying levels of homophily serves as further evidence of SMHGC's resilience to heterophily.

  • 7 authors
·
Oct 4, 2024

Towards Data-centric Machine Learning on Directed Graphs: a Survey

In recent years, Graph Neural Networks (GNNs) have made significant advances in processing structured data. However, most of them primarily adopted a model-centric approach, which simplifies graphs by converting them into undirected formats and emphasizes model designs. This approach is inherently limited in real-world applications due to the unavoidable information loss in simple undirected graphs and the model optimization challenges that arise when exceeding the upper bounds of this sub-optimal data representational capacity. As a result, there has been a shift toward data-centric methods that prioritize improving graph quality and representation. Specifically, various types of graphs can be derived from naturally structured data, including heterogeneous graphs, hypergraphs, and directed graphs. Among these, directed graphs offer distinct advantages in topological systems by modeling causal relationships, and directed GNNs have been extensively studied in recent years. However, a comprehensive survey of this emerging topic is still lacking. Therefore, we aim to provide a comprehensive review of directed graph learning, with a particular focus on a data-centric perspective. Specifically, we first introduce a novel taxonomy for existing studies. Subsequently, we re-examine these methods from the data-centric perspective, with an emphasis on understanding and improving data representation. It demonstrates that a deep understanding of directed graphs and their quality plays a crucial role in model performance. Additionally, we explore the diverse applications of directed GNNs across 10+ domains, highlighting their broad applicability. Finally, we identify key opportunities and challenges within the field, offering insights that can guide future research and development in directed graph learning.

  • 6 authors
·
Nov 28, 2024

AceMap: Knowledge Discovery through Academic Graph

The exponential growth of scientific literature requires effective management and extraction of valuable insights. While existing scientific search engines excel at delivering search results based on relational databases, they often neglect the analysis of collaborations between scientific entities and the evolution of ideas, as well as the in-depth analysis of content within scientific publications. The representation of heterogeneous graphs and the effective measurement, analysis, and mining of such graphs pose significant challenges. To address these challenges, we present AceMap, an academic system designed for knowledge discovery through academic graph. We present advanced database construction techniques to build the comprehensive AceMap database with large-scale academic entities that contain rich visual, textual, and numerical information. AceMap also employs innovative visualization, quantification, and analysis methods to explore associations and logical relationships among academic entities. AceMap introduces large-scale academic network visualization techniques centered on nebular graphs, providing a comprehensive view of academic networks from multiple perspectives. In addition, AceMap proposes a unified metric based on structural entropy to quantitatively measure the knowledge content of different academic entities. Moreover, AceMap provides advanced analysis capabilities, including tracing the evolution of academic ideas through citation relationships and concept co-occurrence, and generating concise summaries informed by this evolutionary process. In addition, AceMap uses machine reading methods to generate potential new ideas at the intersection of different fields. Exploring the integration of large language models and knowledge graphs is a promising direction for future research in idea evolution. Please visit https://www.acemap.info for further exploration.

  • 26 authors
·
Mar 4, 2024

UniGraph: Learning a Unified Cross-Domain Foundation Model for Text-Attributed Graphs

Foundation models like ChatGPT and GPT-4 have revolutionized artificial intelligence, exhibiting remarkable abilities to generalize across a wide array of tasks and applications beyond their initial training objectives. However, graph learning has predominantly focused on single-graph models, tailored to specific tasks or datasets, lacking the ability to transfer learned knowledge to different domains. This limitation stems from the inherent complexity and diversity of graph structures, along with the different feature and label spaces specific to graph data. In this paper, we recognize text as an effective unifying medium and employ Text-Attributed Graphs (TAGs) to leverage this potential. We present our UniGraph framework, designed to learn a foundation model for TAGs, which is capable of generalizing to unseen graphs and tasks across diverse domains. Unlike single-graph models that use pre-computed node features of varying dimensions as input, our approach leverages textual features for unifying node representations, even for graphs such as molecular graphs that do not naturally have textual features. We propose a novel cascaded architecture of Language Models (LMs) and Graph Neural Networks (GNNs) as backbone networks. Additionally, we propose the first pre-training algorithm specifically designed for large-scale self-supervised learning on TAGs, based on Masked Graph Modeling. We introduce graph instruction tuning using Large Language Models (LLMs) to enable zero-shot prediction ability. Our comprehensive experiments across various graph learning tasks and domains demonstrate the model's effectiveness in self-supervised representation learning on unseen graphs, few-shot in-context transfer, and zero-shot transfer, even surpassing or matching the performance of GNNs that have undergone supervised training on target datasets.

  • 4 authors
·
Feb 21, 2024

Large Language Models on Graphs: A Comprehensive Survey

Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.

  • 6 authors
·
Dec 5, 2023

GraphPrompter: Multi-stage Adaptive Prompt Optimization for Graph In-Context Learning

Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and inferior model performance. Additionally, due to the gap between pre-training and testing graphs, when the number of classes in the testing graphs is much greater than that in the training, the in-context learning ability will also significantly deteriorate. To tackle the aforementioned challenges, we develop a multi-stage adaptive prompt optimization method GraphPrompter, which optimizes the entire process of generating, selecting, and using graph prompts for better in-context learning capabilities. Firstly, Prompt Generator introduces a reconstruction layer to highlight the most informative edges and reduce irrelevant noise for graph prompt construction. Furthermore, in the selection stage, Prompt Selector employs the k-nearest neighbors algorithm and pre-trained selection layers to dynamically choose appropriate samples and minimize the influence of irrelevant prompts. Finally, we leverage a Prompt Augmenter with a cache replacement strategy to enhance the generalization capability of the pre-trained model on new datasets. Extensive experiments show that GraphPrompter effectively enhances the in-context learning ability of graph models. On average across all the settings, our approach surpasses the state-of-the-art baselines by over 8%. Our code is released at https://github.com/karin0018/GraphPrompter.

  • 9 authors
·
May 4, 2025

Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning

Leveraging generative Artificial Intelligence (AI), we have transformed a dataset comprising 1,000 scientific papers into an ontological knowledge graph. Through an in-depth structural analysis, we have calculated node degrees, identified communities and connectivities, and evaluated clustering coefficients and betweenness centrality of pivotal nodes, uncovering fascinating knowledge architectures. The graph has an inherently scale-free nature, is highly connected, and can be used for graph reasoning by taking advantage of transitive and isomorphic properties that reveal unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, propose never-before-seen material designs, and predict material behaviors. We compute deep node embeddings for combinatorial node similarity ranking for use in a path sampling strategy links dissimilar concepts that have previously not been related. One comparison revealed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. In another example, the algorithm proposed a hierarchical mycelium-based composite based on integrating path sampling with principles extracted from Kandinsky's 'Composition VII' painting. The resulting material integrates an innovative set of concepts that include a balance of chaos/order, adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across science, technology and art, revealing a nuanced ontology of immanence that reveal a context-dependent heterarchical interplay of constituents. Graph-based generative AI achieves a far higher degree of novelty, explorative capacity, and technical detail, than conventional approaches and establishes a widely useful framework for innovation by revealing hidden connections.

  • 1 authors
·
Mar 18, 2024

Scalable and Interpretable Identification of Minimal Undesignable RNA Structure Motifs with Rotational Invariance

RNA design aims to find a sequence that folds with highest probability into a designated target structure. However, certain structures are undesignable, meaning no sequence can fold into the target structure under the default (Turner) RNA folding model. Understanding the specific local structures (i.e., "motifs") that contribute to undesignability is crucial for refining RNA folding models and determining the limits of RNA designability. Despite its importance, this problem has received very little attention, and previous efforts are neither scalable nor interpretable. We develop a new theoretical framework for motif (un-)designability, and design scalable and interpretable algorithms to identify minimal undesignable motifs within a given RNA secondary structure. Our approach establishes motif undesignability by searching for rival motifs, rather than exhaustively enumerating all (partial) sequences that could potentially fold into the motif. Furthermore, we exploit rotational invariance in RNA structures to detect, group, and reuse equivalent motifs and to construct a database of unique minimal undesignable motifs. To achieve that, we propose a loop-pair graph representation for motifs and a recursive graph isomorphism algorithm for motif equivalence. Our algorithms successfully identify 24 unique minimal undesignable motifs among 18 undesignable puzzles from the Eterna100 benchmark. Surprisingly, we also find over 350 unique minimal undesignable motifs and 663 undesignable native structures in the ArchiveII dataset, drawn from a diverse set of RNA families. Our source code is available at https://github.com/shanry/RNA-Undesign and our web server is available at http://linearfold.org/motifs.

  • 5 authors
·
Feb 26, 2024

H4G: Unlocking Faithful Inference for Zero-Shot Graph Learning in Hyperbolic Space

Text-attributed graphs are widely used across domains, offering rich opportunities for zero-shot learning via graph-text alignment. However, existing methods struggle with tasks requiring fine-grained pattern recognition, particularly on heterophilic graphs. Through empirical and theoretical analysis, we identify an over-abstraction problem: current approaches operate at excessively large hyperbolic radii, compressing multi-scale structural information into uniform high-level abstractions. This abstraction-induced information loss obscures critical local patterns essential for accurate predictions. By analyzing embeddings in hyperbolic space, we demonstrate that optimal graph learning requires faithful preservation of fine-grained structural details, better retained by representations positioned closer to the origin. To address this, we propose H4G, a framework that systematically reduces embedding radii using learnable block-diagonal scaling matrices and M\"obius matrix multiplication. This approach restores access to fine-grained patterns while maintaining global receptive ability with minimal computational overhead. Experiments show H4G achieves state-of-the-art zero-shot performance with 12.8\% improvement on heterophilic graphs and 8.4\% on homophilic graphs, confirming that radius reduction enables faithful multi-scale representation for advancing zero-shot graph learning.

  • 9 authors
·
Oct 13, 2025

From Cities to Series: Complex Networks and Deep Learning for Improved Spatial and Temporal Analytics*

Graphs have often been used to answer questions about the interaction between real-world entities by taking advantage of their capacity to represent complex topologies. Complex networks are known to be graphs that capture such non-trivial topologies; they are able to represent human phenomena such as epidemic processes, the dynamics of populations, and the urbanization of cities. The investigation of complex networks has been extrapolated to many fields of science, with particular emphasis on computing techniques, including artificial intelligence. In such a case, the analysis of the interaction between entities of interest is transposed to the internal learning of algorithms, a paradigm whose investigation is able to expand the state of the art in Computer Science. By exploring this paradigm, this thesis puts together complex networks and machine learning techniques to improve the understanding of the human phenomena observed in pandemics, pendular migration, and street networks. Accordingly, we contribute with: (i) a new neural network architecture capable of modeling dynamic processes observed in spatial and temporal data with applications in epidemics propagation, weather forecasting, and patient monitoring in intensive care units; (ii) a machine-learning methodology for analyzing and predicting links in the scope of human mobility between all the cities of Brazil; and, (iii) techniques for identifying inconsistencies in the urban planning of cities while tracking the most influential vertices, with applications over Brazilian and worldwide cities. We obtained results sustained by sound evidence of advances to the state of the art in artificial intelligence, rigorous formalisms, and ample experimentation. Our findings rely upon real-world applications in a range of domains, demonstrating the applicability of our methodologies.

  • 2 authors
·
Jun 1, 2022

Are Code Pre-trained Models Powerful to Learn Code Syntax and Semantics?

Analysis of pre-trained code models also has revealed that they can effectively learn program syntax. However, these works are limited in analyzing code syntax and their distance-based approaches are not accurate due to the curse of high dimensionality. Furthermore, the study of the learnt program semantics of these models is rarely discussed. To further understand the code features learnt by these models, in this paper, we target two well-known representative code pre-trained models (i.e., CodeBERT and GraphCodeBERT) and devise a set of probing tasks for the syntax and semantics analysis. Specifically, on one hand, we design two probing tasks (i.e., syntax pair node prediction and token tagging prediction) to manipulate AST for the understanding of learnt program syntax. On the other hand, we design two tasks (i.e., semantic relationship prediction and semantic propagation prediction(inGraph) ) on the constructed control flow graph (CFG), data dependency graph (DDG) and control dependency graph (CDG) for the learnt program semantic analysis. In addition, to understand which kind of program semantics these pre-trained models can comprehend well, we conduct the statistical analysis for attention weights learnt by different heads and layers. Through extensive analysis in terms of program syntax and semantics, we have the following findings: 1) Both CodeBERT and GraphCodeBERT can learn the program syntax well. 2) Both CodeBERT and GraphCodeBERT can learn program semantics to different extents. GraphCodeBERT is superior to CodeBERT in learning program control flow and data dependency information but has a similar capability to CodeBERT in learning control dependency information. 3) Both CodeBERT and GraphCodeBERT can capture program semantics in the final layer of representation, but different attention heads and layers exhibit different roles in learning program semantics.

  • 8 authors
·
Dec 20, 2022

FOS: A Large-Scale Temporal Graph Benchmark for Scientific Interdisciplinary Link Prediction

Interdisciplinary scientific breakthroughs mostly emerge unexpectedly, and forecasting the formation of novel research fields remains a major challenge. We introduce FOS (Future Of Science), a comprehensive time-aware graph-based benchmark that reconstructs annual co-occurrence graphs of 65,027 research sub-fields (spanning 19 general domains) over the period 1827-2024. In these graphs, edges denote the co-occurrence of two fields in a single publication and are timestamped with the corresponding publication year. Nodes are enriched with semantic embeddings, and edges are characterized by temporal and topological descriptors. We formulate the prediction of new field-pair linkages as a temporal link-prediction task, emphasizing the "first-time" connections that signify pioneering interdisciplinary directions. Through extensive experiments, we evaluate a suite of state-of-the-art temporal graph architectures under multiple negative-sampling regimes and show that (i) embedding long-form textual descriptions of fields significantly boosts prediction accuracy, and (ii) distinct model classes excel under different evaluation settings. Case analyses show that top-ranked link predictions on FOS align with field pairings that emerge in subsequent years of academic publications. We publicly release FOS, along with its temporal data splits and evaluation code, to establish a reproducible benchmark for advancing research in predicting scientific frontiers.

  • 7 authors
·
Nov 23, 2025

TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs

Text-Attributed Graphs (TAGs) augment graph structures with natural language descriptions, facilitating detailed depictions of data and their interconnections across various real-world settings. However, existing TAG datasets predominantly feature textual information only at the nodes, with edges typically represented by mere binary or categorical attributes. This lack of rich textual edge annotations significantly limits the exploration of contextual relationships between entities, hindering deeper insights into graph-structured data. To address this gap, we introduce Textual-Edge Graphs Datasets and Benchmark (TEG-DB), a comprehensive and diverse collection of benchmark textual-edge datasets featuring rich textual descriptions on nodes and edges. The TEG-DB datasets are large-scale and encompass a wide range of domains, from citation networks to social networks. In addition, we conduct extensive benchmark experiments on TEG-DB to assess the extent to which current techniques, including pre-trained language models, graph neural networks, and their combinations, can utilize textual node and edge information. Our goal is to elicit advancements in textual-edge graph research, specifically in developing methodologies that exploit rich textual node and edge descriptions to enhance graph analysis and provide deeper insights into complex real-world networks. The entire TEG-DB project is publicly accessible as an open-source repository on Github, accessible at https://github.com/Zhuofeng-Li/TEG-Benchmark.

  • 9 authors
·
Jun 14, 2024

GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs

Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures? We introduce GraphShaper, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47\% accuracy improvements on citation networks and 7.63\% on social networks in zero-shot settings.

  • 9 authors
·
Oct 13, 2025

Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning

Representation learning on text-attributed graphs (TAGs) has become a critical research problem in recent years. A typical example of a TAG is a paper citation graph, where the text of each paper serves as node attributes. Initial graph neural network (GNN) pipelines handled these text attributes by transforming them into shallow or hand-crafted features, such as skip-gram or bag-of-words features. Recent efforts have focused on enhancing these pipelines with language models (LMs), which typically demand intricate designs and substantial computational resources. With the advent of powerful large language models (LLMs) such as GPT or Llama2, which demonstrate an ability to reason and to utilize general knowledge, there is a growing need for techniques which combine the textual modelling abilities of LLMs with the structural learning capabilities of GNNs. Hence, in this work, we focus on leveraging LLMs to capture textual information as features, which can be used to boost GNN performance on downstream tasks. A key innovation is our use of explanations as features: we prompt an LLM to perform zero-shot classification, request textual explanations for its decision-making process, and design an LLM-to-LM interpreter to translate these explanations into informative features for downstream GNNs. Our experiments demonstrate that our method achieves state-of-the-art results on well-established TAG datasets, including Cora, PubMed, ogbn-arxiv, as well as our newly introduced dataset, tape-arxiv23. Furthermore, our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv. Lastly, we believe the versatility of the proposed method extends beyond TAGs and holds the potential to enhance other tasks involving graph-text data. Our codes and datasets are available at: https://github.com/XiaoxinHe/TAPE.

  • 6 authors
·
May 30, 2023

Contextualized Messages Boost Graph Representations

Graph neural networks (GNNs) have gained significant attention in recent years for their ability to process data that may be represented as graphs. This has prompted several studies to explore their representational capability based on the graph isomorphism task. Notably, these works inherently assume a countable node feature representation, potentially limiting their applicability. Interestingly, only a few study GNNs with uncountable node feature representation. In the paper, a new perspective on the representational capability of GNNs is investigated across all levelsx2014node-level, neighborhood-level, and graph-levelx2014when the space of node feature representation is uncountable. Specifically, the injective and metric requirements of previous works are softly relaxed by employing a pseudometric distance on the space of input to create a soft-injective function such that distinct inputs may produce similar outputs if and only if the pseudometric deems the inputs to be sufficiently similar on some representation. As a consequence, a simple and computationally efficient soft-isomorphic relational graph convolution network (SIR-GCN) that emphasizes the contextualized transformation of neighborhood feature representations via anisotropic and dynamic message functions is proposed. Furthermore, a mathematical discussion on the relationship between SIR-GCN and key GNNs in literature is laid out to put the contribution into context, establishing SIR-GCN as a generalization of classical GNN methodologies. To close, experiments on synthetic and benchmark datasets demonstrate the relative superiority of SIR-GCN, outperforming comparable models in node and graph property prediction tasks.

  • 4 authors
·
Mar 19, 2024

RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases

Relational databases (RDBs) are composed of interconnected tables, where relationships between them are defined through foreign keys. Recent research on applying machine learning to RDBs has explored graph-based representations of RDBs, where rows of tables are modeled as nodes, and foreign key relationships are modeled as edges. RDB-to-graph modeling helps capture cross-table dependencies, ultimately leading to enhanced performance across diverse tasks. However, there are numerous ways to model RDBs as graphs, and performance varies significantly depending on the chosen graph model. In our analysis, applying a common heuristic rule for graph modeling leads to up to a 10% drop in performance compared to the best-performing graph model, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 9 automatic RDB-to-graph modeling methods on the 12 tasks over 600x faster than on-the-fly evaluation, which requires repeated model training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling.

kaistdata KAIST Data Mining Lab
·
Jun 2, 2025

View-based Explanations for Graph Neural Networks

Generating explanations for graph neural networks (GNNs) has been studied to understand their behavior in analytical tasks such as graph classification. Existing approaches aim to understand the overall results of GNNs rather than providing explanations for specific class labels of interest, and may return explanation structures that are hard to access, nor directly queryable.We propose GVEX, a novel paradigm that generates Graph Views for EXplanation. (1) We design a two-tier explanation structure called explanation views. An explanation view consists of a set of graph patterns and a set of induced explanation subgraphs. Given a database G of multiple graphs and a specific class label l assigned by a GNN-based classifier M, it concisely describes the fraction of G that best explains why l is assigned by M. (2) We propose quality measures and formulate an optimization problem to compute optimal explanation views for GNN explanation. We show that the problem is Σ^2_P-hard. (3) We present two algorithms. The first one follows an explain-and-summarize strategy that first generates high-quality explanation subgraphs which best explain GNNs in terms of feature influence maximization, and then performs a summarization step to generate patterns. We show that this strategy provides an approximation ratio of 1/2. Our second algorithm performs a single-pass to an input node stream in batches to incrementally maintain explanation views, having an anytime quality guarantee of 1/4 approximation. Using real-world benchmark data, we experimentally demonstrate the effectiveness, efficiency, and scalability of GVEX. Through case studies, we showcase the practical applications of GVEX.

  • 6 authors
·
Jan 4, 2024

Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks

We present an agentic, autonomous graph expansion framework that iteratively structures and refines knowledge in situ. Unlike conventional knowledge graph construction methods relying on static extraction or single-pass learning, our approach couples a reasoning-native large language model with a continually updated graph representation. At each step, the system actively generates new concepts and relationships, merges them into a global graph, and formulates subsequent prompts based on its evolving structure. Through this feedback-driven loop, the model organizes information into a scale-free network characterized by hub formation, stable modularity, and bridging nodes that link disparate knowledge clusters. Over hundreds of iterations, new nodes and edges continue to appear without saturating, while centrality measures and shortest path distributions evolve to yield increasingly distributed connectivity. Our analysis reveals emergent patterns, such as the rise of highly connected 'hub' concepts and the shifting influence of 'bridge' nodes, indicating that agentic, self-reinforcing graph construction can yield open-ended, coherent knowledge structures. Applied to materials design problems, we present compositional reasoning experiments by extracting node-specific and synergy-level principles to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that transcend rote summarization and strengthen the framework's potential for open-ended scientific discovery. We discuss other applications in scientific discovery and outline future directions for enhancing scalability and interpretability.

  • 1 authors
·
Feb 18, 2025

Learning to Represent Programs with Heterogeneous Graphs

Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing researches use abstract syntax trees (AST). A group of works add additional edges to ASTs to convert source code into graphs and use graph neural networks to learn representations for program graphs. Although these works provide additional control or data flow information to ASTs for downstream tasks, they neglect an important aspect of structure information in AST itself: the different types of nodes and edges. In ASTs, different nodes contain different kinds of information like variables or control flow, and the relation between a node and all its children can also be different. To address the information of node and edge types, we bring the idea of heterogeneous graphs to learning on source code and present a new formula of building heterogeneous program graphs from ASTs with additional type information for nodes and edges. We use the ASDL grammar of programming language to define the node and edge types of program graphs. Then we use heterogeneous graph neural networks to learn on these graphs. We evaluate our approach on two tasks: code comment generation and method naming. Both tasks require reasoning on the semantics of complete code snippets. Experiment results show that our approach outperforms baseline models, including homogeneous graph-based models, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics.

  • 5 authors
·
Dec 7, 2020

Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models

The need to analyze graphs is ubiquitous across various fields, from social networks to biological research and recommendation systems. Therefore, enabling the ability of large language models (LLMs) to process graphs is an important step toward more advanced general intelligence. However, current LLM benchmarks on graph analysis require models to directly reason over the prompts describing graph topology, and are thus limited to small graphs with only a few dozens of nodes. In contrast, human experts typically write programs based on popular libraries for task solving, and can thus handle graphs with different scales. To this end, a question naturally arises: can LLMs analyze graphs like professionals? In this paper, we introduce ProGraph, a manually crafted benchmark containing 3 categories of graph tasks. The benchmark expects solutions based on programming instead of directly reasoning over raw inputs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. To bridge this gap, we propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries. By augmenting closed-source LLMs with document retrieval and fine-tuning open-source ones on the codes, we show 11-32% absolute improvements in their accuracies. Our results underscore that the capabilities of LLMs in handling structured data are still under-explored, and show the effectiveness of LLM4Graph in enhancing LLMs' proficiency of graph analysis. The benchmark, datasets and enhanced open-source models are available at https://github.com/BUPT-GAMMA/ProGraph.

  • 12 authors
·
Sep 29, 2024