Institute of Multi-Agent and Embodied Intelligence, Department of Network Intelligence, Peng Cheng Laboratory

Open-Source Data

ARIO
A Scaling Up Embodied AI Dataset Containing Multiple Types of Agents and Perceptual Modalities

TIP-Editor
TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

TIP-Editor is a 3D scene editing framework that combines text, image prompts, and a 3D bounding box to enable accurate control over the appearance and placement of the edited content, utilizing a stepwise 2D personalization strategy and 3D Gaussian splatting for precise local editing.

To Err like Human
To Err like Human: Affective Bias-Inspired Measures for Visual Emotion Recognition Evaluation

This repository contains the implementation of our paper “To Err like Human: Affective Bias-Inspired Measures for Visual Emotion Recognition Evaluation”, which introduces a new evaluation metric for Visual Emotion Recognition (VER).

ExtDM
ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction

We present ExtDM, a new diffusion model that extrapolates video content from current frames by accurately modeling distribution shifts towards future frames.

CatVTON
CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

CatVTON is a simple and efficient virtual try-on diffusion model with 1) a lightweight network (899.06M parameters in total), 2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (< 8 GB VRAM at 1024×768 resolution).

HCP-Diffusion

HCP-Diffusion is a toolbox for Stable Diffusion models based on Diffusers. Compared with webui and sd-scripts, it offers more flexible configuration and component support for training.

OV-DINO
OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

This project contains the official PyTorch implementation, pre-trained models, fine-tuning code, and inference demo for OV-DINO.

REVERIE
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

A large-scale dataset with rationale annotations to enhance model reasoning by predicting justifications for correct or incorrect responses.

CONSOLE
Correctable Landmark Discovery via Large Models for Vision-Language Navigation

A new Vision-Language Navigation paradigm that uses ChatGPT and CLIP for open-world landmark discovery and corrects their prior knowledge with a learnable co-occurrence scoring module, improving navigation accuracy and outperforming existing methods on benchmarks such as R2R and R4R.

NavCoT
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

A parameter-efficient in-domain training strategy that enables large language models to guide navigational decision-making in Vision-and-Language Navigation tasks.

Surfer
Surfer: Progressive Reasoning with World Models for Robotic Manipulation

A robot manipulation framework that improves natural language instruction understanding and physical action execution by explicitly modeling action and scene predictions in a multi-modal world model.

DreamEditor
DreamEditor: Text-Driven 3D Scene Editing with Neural Fields

DreamEditor is a novel framework that enables controlled editing of neural fields using text prompts, allowing localized edits in specific regions of real-world scenes while maintaining consistency and generating realistic textures and geometry.

Fashion Matrix
Fashion Matrix: Editing Photos by Just Talking

Fashion Matrix is dedicated to bridging various visual and language models and continuously refining its capabilities as a comprehensive fashion AI assistant.