Open-Source Data
A scaled-up embodied AI dataset containing multiple types of agents and perceptual modalities.
TIP-Editor is a 3D scene editing framework that combines text, image prompts, and a 3D bounding box to enable accurate control over the appearance and placement of the edited content, utilizing a stepwise 2D personalization strategy and 3D Gaussian splatting for precise local editing.
This repository contains the implementation of our paper “To Err like Human: Affective Bias-Inspired Measures for Visual Emotion Recognition Evaluation”, which introduces a new evaluation metric for Visual Emotion Recognition (VER).
We present ExtDM, a new diffusion model that extrapolates video content from current frames by accurately modeling distribution shifts towards future frames.
CatVTON is a simple and efficient virtual try-on diffusion model with 1) a lightweight network (899.06M parameters in total), 2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (< 8G VRAM at 1024×768 resolution).
HCP-Diffusion is a toolbox for Stable Diffusion models based on Diffusers. Compared with webui and sd-scripts, it offers more flexible configuration and component support for training.
This project contains the official PyTorch implementation, pre-trained models, fine-tuning code, and inference demo for OV-DINO.
A large-scale dataset with rationale annotations to enhance model reasoning by predicting justifications for correct or incorrect responses.
A new Vision-Language Navigation paradigm that uses ChatGPT and CLIP for open-world landmark discovery. A learnable co-occurrence scoring module corrects the prior knowledge, improving navigation accuracy and outperforming existing methods on benchmarks such as R2R and R4R.
A parameter-efficient in-domain training strategy that enables large language models to guide navigational decision-making in Vision-and-Language Navigation tasks.
A robot manipulation framework that improves natural language instruction understanding and physical action execution by explicitly modeling action and scene predictions in a multi-modal world model.
DreamEditor is a novel framework that enables controlled editing of neural fields using text prompts, allowing localized edits in specific regions of real-world scenes while maintaining consistency and generating realistic textures and geometry.
Fashion Matrix is dedicated to bridging various visual and language models and continuously refining its capabilities as a comprehensive fashion AI assistant.