I am an ELLIS PhD student in the Computer Vision Lab at the University of Amsterdam, advised by Prof. Dimitris Tzionas and Prof. Theo Gevers. My research focuses on 3D Human-Object Interaction (HOI) synthesis, and I am also interested in reconstructing 4D HOIs from videos. Before joining UvA, I had the great opportunity to spend 4 months as a research intern at Simon Fraser University, working with Prof. Manolis Savva. Prior to that, I completed my Master's degree at the University of Patras, collaborating with Prof. Emmanouil Psarakis, while also working as a Lead Quality Assurance Engineer at the Hellenic Air Force. I am also a passionate windsurfer; when the sea and wind do not cooperate, I enjoy running or going to the gym.
Publications
Reconstructing 3D Human-Object Interaction from an RGB image is essential for
perceptive systems. Yet, this remains challenging as it requires capturing the
subtle physical coupling between the body and objects. While current methods rely on
sparse, binary contact cues, these fail to model the continuous proximity and dense
spatial relationships that characterize natural interactions. We address this
limitation via InterFields, a representation that encodes dense,
continuous proximity across the entire body and object surfaces. However, inferring
these fields from single images is inherently ill-posed. To tackle this, we build on the
intuition that interaction patterns are characteristically structured by the action
and object geometry. We capture this structure in LEXIS, a novel discrete
manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a
diffusion framework that leverages LEXIS signatures to estimate human and object
meshes alongside their InterFields. Notably, these InterFields enable a guided
refinement that ensures physically plausible, proximity-aware
reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and
BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in
reconstruction, contact, and proximity quality. Our approach not only improves
generalization but also yields reconstructions perceived as more
realistic, moving us closer to holistic 3D scene understanding. Code and models will be
made publicly available at https://anticdimi.github.io/lexis/.
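To illustrate the core idea behind InterFields, the sketch below turns nearest-neighbor distances between two surfaces into a dense, continuous proximity value per point; the function name, the exponential decay, and the plain point clouds are illustrative assumptions, not the paper's exact formulation.

import torch

def proximity_fields(body_verts, obj_verts, sigma=0.05):
    """Dense, continuous proximity between two surfaces (illustrative).

    body_verts: (B, 3) points on the body surface
    obj_verts:  (O, 3) points on the object surface
    Returns per-point proximity in (0, 1] for both surfaces, where 1 means
    touching and values decay smoothly with distance (scale sigma, in meters).
    """
    d = torch.cdist(body_verts, obj_verts)           # (B, O) pairwise distances
    body_to_obj = d.min(dim=1).values                # nearest object point per body point
    obj_to_body = d.min(dim=0).values                # nearest body point per object point
    # An exponential decay yields soft, continuous proximity values, in contrast
    # to a binary contact mask obtained by thresholding distances.
    return torch.exp(-body_to_obj / sigma), torch.exp(-obj_to_body / sigma)

# Example with random point clouds standing in for body and object meshes.
body = torch.rand(6890, 3)
obj = torch.rand(2048, 3) + torch.tensor([0.5, 0.0, 0.0])
p_body, p_obj = proximity_fields(body, obj)
print(p_body.shape, p_obj.shape)                     # torch.Size([6890]) torch.Size([2048])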
@article{antic2026lexis,
title = {{LEXIS}: {LatEnt} {ProXimal} Interaction Signatures for {3D} {HOI} from an Image},
author = {Anti\'{c}, Dimitrije and Budria, Alvaro and Paschalidis, Georgios and Dwivedi, Sai Kumar and Tzionas, Dimitrios},
journal = {arXiv preprint arXiv:2604.20800},
year = {2026},
}
Reconstructing people, objects, and their interactions in 3D is a long-standing and
fundamental goal for intelligent systems. Often the input is RGB video from a moving
camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each
other, and camera and object motion entangle to create apparent motion. Most
prior work addresses humans or objects in isolation, ignoring their interplay, or
assumes known 3D shapes or cameras, which is impractical for real-world applications.
We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel
three-step framework that recovers in 3D a human, novel (unseen) manipulated object,
and static scene in a common world frame from a monocular RGB video. First, we
leverage 3D-aware foundation models to obtain cues that stabilize
Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse
shape and apparent motion of a manipulated object from foreground pixels, and a coarse
scene shape and camera motion from background pixels. Second, we estimate a
human in the camera frame via an off-the-shelf method, and subtract the
camera motion from apparent motion to extract the object motion; this registers
the human, object, and coarse scene shapes into a common world frame. Third, we refine
shapes using a compositional neural field with per-component signed-distance fields.
The latter further enables differentiable contact priors that attract surfaces while
penalizing interpenetration, improving the physical plausibility of the final
reconstruction. For evaluation, we capture a new dataset of handheld monocular videos
synchronized with a volumetric 4D capture stage, providing ground-truth shape and
camera motion. RHINO outperforms state-of-the-art baselines on novel-view
synthesis and 4D reconstruction. Ablations show that each stage contributes
substantially.
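The sketch below illustrates how a differentiable contact prior over a signed-distance field might attract nearby surface points while penalizing interpenetration, as in the refinement stage; the thresholds, weights, and toy sphere SDF are illustrative assumptions, not RHINO's actual formulation.

import torch

def contact_loss(human_pts, object_sdf, attract_thresh=0.02, w_attract=1.0, w_pen=10.0):
    """Illustrative contact prior on top of a signed-distance field.

    human_pts:  (N, 3) points sampled on the human surface
    object_sdf: callable mapping (N, 3) points to signed distances to the object
                surface (negative inside, positive outside).
    """
    d = object_sdf(human_pts)                        # (N,) signed distances
    # Attraction: pull points that are already close (within a couple of cm)
    # towards the surface, encouraging contact where it is plausible.
    near = (d > 0) & (d < attract_thresh)
    attract = (d[near] ** 2).sum() if near.any() else d.new_zeros(())
    # Penetration: penalize points with negative distance (inside the object).
    pen = (torch.relu(-d) ** 2).sum()
    return w_attract * attract + w_pen * pen

# Example with a unit sphere as a stand-in for a learned per-component SDF.
sphere_sdf = lambda p: p.norm(dim=-1) - 1.0
pts = torch.randn(1000, 3, requires_grad=True)
loss = contact_loss(pts, sphere_sdf)
loss.backward()                                      # gradients flow back to the points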
@inproceedings{xue2026rhino,
title = {RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos},
author = {Xue, Lixin and Zheng, Chengwei and Paschalidis, Georgios and Guo, Chen and Kaufmann, Manuel and Zarate, Juan and Tzionas, Dimitrios},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
Recovering 3D object pose and shape from a single image is a challenging and
ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities,
the vast intra- and inter-class shape variance, and the lack of 3D ground truth
for natural images. Existing deep-network methods are trained on synthetic datasets
to predict 3D shapes, so they often struggle to generalize to real-world images.
Moreover, they lack an explicit feedback loop for refining noisy estimates, and
primarily focus on geometry without directly considering pixel alignment. To tackle
these limitations, we develop a novel render-and-compare optimization framework,
called SDFit. This has three key innovations: First, it uses a learned category-specific
and morphable signed-distance-function (mSDF) model, and fits this to an image by
iteratively refining both 3D pose and shape. The mSDF robustifies inference by
constraining the search on the manifold of valid shapes, while allowing for arbitrary
shape topologies. Second, SDFit retrieves an initial 3D shape that likely matches
the image, by exploiting foundational models for efficient look-up into 3D shape
databases. Third, SDFit initializes pose by establishing rich 2D-3D correspondences
between the image and the mSDF through foundational features. We evaluate SDFit on
three image datasets, i.e., Pix3D, Pascal3D+, and COMIC. SDFit performs on par with
SotA feed-forward networks for unoccluded images and common poses, but is uniquely
robust to occlusions and uncommon poses. Moreover, it requires no retraining for
unseen images. Thus, SDFit contributes new insights for generalizing in the wild.
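The sketch below illustrates the render-and-compare idea in its simplest form: jointly refining a shape latent and a pose hypothesis so that decoded surface points agree with image-space evidence. The toy decoder, the pinhole projection, and the reprojection loss are stand-ins for the mSDF, differentiable rendering, and foundational-feature correspondences used by SDFit.

import torch

# Toy stand-in for a morphable shape model: maps a latent code to surface points.
def decode_surface_points(z, n=256):
    torch.manual_seed(0)                             # fixed basis so every call is consistent
    basis = 0.1 * torch.randn(n, 3, z.shape[0])
    return torch.einsum('npk,k->np', basis, z)       # (n, 3) surface samples

def project(points_3d, f=500.0):                     # pinhole projection, fixed focal length
    return f * points_3d[:, :2] / points_3d[:, 2:3].clamp(min=1e-3)

z = torch.zeros(8, requires_grad=True)               # shape latent on the model's manifold
t = torch.tensor([0.0, 0.0, 3.0], requires_grad=True)  # translation (rotation omitted)
target_2d = project(decode_surface_points(torch.ones(8)) + torch.tensor([0.1, 0.0, 3.0]))

opt = torch.optim.Adam([z, t], lr=0.05)
for step in range(200):                              # iterative refinement loop
    opt.zero_grad()
    pts = decode_surface_points(z) + t               # current shape + pose hypothesis
    loss = (project(pts) - target_2d).pow(2).mean()  # compare in image space
    loss.backward()
    opt.step()
print(float(loss))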
@inproceedings{antic2025sdfit,
title = {{SDFit}: {3D} Object Pose and Shape by Fitting a Morphable {SDF} to a Single Image},
author = {Anti\'{c}, Dimitrije and Paschalidis, Georgios and Tripathi, Shashank and Gevers, Theo and Dwivedi, Sai Kumar and Tzionas, Dimitrios},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
}
Synthesizing 3D whole bodies that realistically grasp objects is useful for
animation, mixed reality, and robotics. This is challenging, because the hands
and body need to look natural w.r.t. each other, the grasped object, as well as
the local scene (i.e., a receptacle supporting the object). Moreover, training
data for this task is scarce, while capturing new data is expensive.
Recent work goes beyond finite datasets via a divide-and-conquer approach; it
first generates a “guiding” right-hand grasp, and then searches for bodies that
match this. However, the guiding-hand synthesis lacks controllability and
receptacle awareness, so it likely has an implausible direction (i.e., a body
can’t match this without penetrating the receptacle) and needs corrections
through major post-processing. Moreover, the body search needs exhaustive
sampling and is expensive. These are strong limitations. We tackle these with a
novel method called CWGrasp. Our key idea is that performing geometry-based
reasoning “early on,” instead of “too late,” provides rich “control” signals
for inference. To this end, CWGrasp first samples a plausible
reaching-direction vector (used later for both the arm and hand) from a
probabilistic model built via ray-casting from the object and collision
checking. Then, it generates a reaching body with a desired arm direction, as
well as a “guiding” grasping hand with a desired palm direction that complies
with the arm’s one. Eventually, CWGrasp refines the body to match the “guiding”
hand, while plausibly contacting the scene. Notably, generating
already-compatible “parts” greatly simplifies the “whole”. Moreover, CWGrasp
uniquely tackles both right- and left-hand grasps. We evaluate on the GRAB and
ReplicaGrasp datasets. CWGrasp outperforms baselines, at lower runtime and
budget, while all components help performance. Code and models are available
for research.
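The sketch below illustrates the "geometry-based reasoning early on" idea: building a distribution over reaching directions by casting rays from the object and keeping only those that stay collision-free, then sampling one direction from it. The ray discretization, the collision-check callable, and the table example are illustrative assumptions, not CWGrasp's actual probabilistic model.

import numpy as np

def sample_reach_direction(obj_center, is_free, n_dirs=512, ray_len=0.8, rng=None):
    """Illustrative reaching-direction sampling via ray casting.

    Casts rays from the object center along candidate directions, keeps those
    whose samples stay collision-free (e.g., do not hit the receptacle), and
    samples one direction uniformly from the surviving set.
    is_free: callable taking (N, 3) points and returning a boolean mask of
             collision-free points (a stand-in for scene collision checking).
    """
    rng = rng or np.random.default_rng(0)
    dirs = rng.normal(size=(n_dirs, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)        # unit directions on the sphere
    ts = np.linspace(0.05, ray_len, 8)                         # sample points along each ray
    pts = obj_center + dirs[:, None, :] * ts[None, :, None]    # (n_dirs, 8, 3)
    free = is_free(pts.reshape(-1, 3)).reshape(n_dirs, -1).all(axis=1)
    probs = free / free.sum()                                   # uniform over collision-free rays
    return dirs[rng.choice(n_dirs, p=probs)]

# Example: an object just above a table at z = 0.7; rays dipping below it are rejected.
table_height = 0.7
direction = sample_reach_direction(
    np.array([0.0, 0.0, 0.75]),
    is_free=lambda p: p[:, 2] > table_height,
)
print(direction)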
@inproceedings{paschalidis2025cwgrasp,
title = {{3D} {W}hole-Body Grasp Synthesis with Directional Controllability},
author = {Paschalidis, Georgios and Wilschut, Romana and Anti\'{c}, Dimitrije and Taheri, Omid and Tzionas, Dimitrios},
booktitle = {International Conference on 3D Vision (3DV)},
year = {2025}
}