SIMS, MOSI, MOSEI


Contents
  • SIMS: Chinese multimodal sentiment analysis dataset
    • Label
    • Features
    • Dataset structure
    • Statistics
  • MOSI: English multimodal sentiment analysis dataset
    • Label
    • Features
    • Dataset structure
    • Statistics
  • MOSEI
    • Label
    • Feature Extraction
    • Dataset structure
    • Statistics

SIMS: Chinese multimodal sentiment analysis dataset

Label

**Sentiment state**

| emotion | label |
| --- | --- |
| negative | -1 |
| neutral | 0 |
| positive | 1 |

**Regression task:** the five annotators' labels are averaged, so each regression label takes one of the values
{-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0}.

These values can also be divided into five classes:

| emotion | label values |
| --- | --- |
| negative | {-1.0, -0.8} |
| weakly negative | {-0.6, -0.4, -0.2} |
| neutral | {0.0} |
| weakly positive | {0.2, 0.4, 0.6} |
| positive | {0.8, 1.0} |
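
For illustration, the mapping from an averaged regression score to one of the five classes can be written as a few threshold checks. This is a minimal sketch of the binning implied by the table above, not necessarily the exact rule used to produce the classification_labels field:

def sims_five_class(score: float) -> str:
    """Map an averaged SIMS regression score in [-1, 1] to a 5-class label."""
    if score <= -0.7:        # {-1.0, -0.8}
        return 'negative'
    elif score <= -0.1:      # {-0.6, -0.4, -0.2}
        return 'weakly negative'
    elif score < 0.1:        # {0.0}
        return 'neutral'
    elif score < 0.7:        # {0.2, 0.4, 0.6}
        return 'weakly positive'
    else:                    # {0.8, 1.0}
        return 'positive'

print(sims_five_class(-0.4))  # weakly negative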

Features

Text

BERT-base word embeddings (768-dimensional word vectors).
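
As an illustration, token-level 768-dimensional embeddings of this kind could be obtained roughly as follows, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint (the released feature files were produced by the dataset authors, so this is only a sketch):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')  # assumption: Chinese BERT-base
model = BertModel.from_pretrained('bert-base-chinese')
model.eval()

text = '闭嘴,不是来抓你的。'
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (1, seq_len, 768) token-level word vectors
print(outputs.last_hidden_state.shape)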

Audio

The LibROSA speech toolkit is used with default parameters to extract acoustic features at a 22050 Hz sampling rate.
In total, 33-dimensional frame-level acoustic features are extracted, including the 1-dimensional logarithmic fundamental frequency (log F0), 20-dimensional Mel-frequency cepstral coefficients (MFCCs), and 12-dimensional Constant-Q chromagram (CQT).
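
A rough librosa sketch of comparable frame-level features is shown below; the paper only states that default parameters were used, so the specific function calls (pyin for F0, chroma_cqt for CQT) and the placeholder file name are assumptions:

import numpy as np
import librosa

y, sr = librosa.load('segment.wav', sr=22050)               # 'segment.wav' is a placeholder path

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)          # (20, frames)
cqt = librosa.feature.chroma_cqt(y=y, sr=sr, n_chroma=12)   # (12, frames)
f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                        fmax=librosa.note_to_hz('C7'))      # (frames,), NaN where unvoiced
log_f0 = np.log(np.nan_to_num(f0, nan=1.0))[None, :]        # 1-dimensional log F0

# Trim to a common frame count and stack into (frames, 33)
frames = min(mfcc.shape[1], cqt.shape[1], log_f0.shape[1])
features = np.concatenate([log_f0[:, :frames], mfcc[:, :frames], cqt[:, :frames]], axis=0).T
print(features.shape)  # (frames, 33)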

Vision

Frames are extracted from the video segments at 30 Hz.
The MTCNN face detection algorithm is used to extract aligned faces.
The MultiComp OpenFace 2.0 toolkit is then used to extract 68 facial landmarks, 17 facial action units, head pose, head orientation, and eye gaze. In total, 709-dimensional frame-level visual features are extracted.

Dataset structure

import pickle
import numpy as np

with open('data/SIMS/unaligned_39.pkl', 'rb') as f:
    data = pickle.load(f)

print(data.keys())
output:
dict_keys(['train', 'valid', 'test'])
print(data['train'].keys())
output:
dict_keys(['raw_text', 'text_bert', 'audio_lengths', 'vision_lengths', 'classification_labels', 'regression_labels', 'classification_labels_T', 'regression_labels_T', 'classification_labels_A', 'regression_labels_A', 'classification_labels_V', 'regression_labels_V', 'text', 'audio', 'vision', 'id'])
print(data['train']['raw_text'][0])
output:
闭嘴,不是来抓你的。
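It can also help to print the array shapes of each modality; the exact dimensions depend on the pickle version, so they are not listed here:

for key in ['text', 'text_bert', 'audio', 'vision']:
    print(key, data['train'][key].shape)
print('regression label example:', data['train']['regression_labels'][0])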
Reading the per-split data:
use_bert = True  # choose BERT token inputs or pre-extracted word vectors for the text modality
for mode in ['train', 'valid', 'test']:
    # convert features to float32
    if use_bert:
        text = data[mode]['text_bert'].astype(np.float32)
    else:
        text = data[mode]['text'].astype(np.float32)

    vision = data[mode]['vision'].astype(np.float32)
    audio = data[mode]['audio'].astype(np.float32)
    raw_text = data[mode]['raw_text']
    ids = data[mode]['id']
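
These fields map naturally onto a PyTorch Dataset. Below is a minimal sketch assuming PyTorch is installed; the class name SIMSDataset and the returned dictionary keys are illustrative rather than part of the original pickle or any particular codebase:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SIMSDataset(Dataset):
    """Wrap one split ('train', 'valid', or 'test') of the loaded pickle."""
    def __init__(self, data, mode='train', use_bert=True):
        split = data[mode]                       # reuses the `data` dict loaded above
        key = 'text_bert' if use_bert else 'text'
        self.text = split[key].astype(np.float32)
        self.audio = split['audio'].astype(np.float32)
        self.vision = split['vision'].astype(np.float32)
        self.labels = split['regression_labels'].astype(np.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'text': torch.from_numpy(self.text[idx]),
            'audio': torch.from_numpy(self.audio[idx]),
            'vision': torch.from_numpy(self.vision[idx]),
            'label': torch.tensor(self.labels[idx]),
        }

loader = DataLoader(SIMSDataset(data, 'train'), batch_size=32, shuffle=True)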

Statistics

print(len(data['train']['id']))
print(len(data['valid']['id']))
print(len(data['test']['id']))

output:
1368
456
457

MOSI: English multimodal sentiment analysis dataset

Label

| emotion | label |
| --- | --- |
| strongly positive | +3 |
| positive | +2 |
| weakly positive | +1 |
| neutral | 0 |
| weakly negative | -1 |
| negative | -2 |
| strongly negative | -3 |
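
The annotated sentiment intensity is a continuous score in [-3, +3]. One common convention for recovering the seven classes above from a regression score is rounding and clipping; this sketch shows that convention, which may differ from the rule used to build the classification_labels field:

import numpy as np

def mosi_seven_class(score: float) -> int:
    """Round a regression score in [-3, 3] to the nearest integer class (NumPy rounds ties to even)."""
    return int(np.clip(np.round(score), -3, 3))

print(mosi_seven_class(2.4))   # 2  -> positive
print(mosi_seven_class(-0.6))  # -1 -> weakly negative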

Features

Audio and visual features are automatically extracted from the MPEG files, at a frame rate of 1000 Hz for audio and 30 Hz for video.

Visual

16 facial action units, 68 facial landmarks, head pose and orientation, 6 basic emotions, and eye gaze.

Audio

COVAREP features: pitch, energy, NAQ (normalized amplitude quotient), MFCCs (Mel-frequency cepstral coefficients), peak slope, and energy slope.

Dataset structure

import pickle
import numpy as np

with open('data/MOSI/aligned_50.pkl', 'rb') as f:
    data = pickle.load(f)

print(data.keys())
output:
dict_keys(['train', 'valid', 'test'])
print(data['train'].keys())
output:
dict_keys(['raw_text', 'audio', 'vision', 'id', 'text', 'text_bert', 'annotations', 'classification_labels', 'regression_labels'])
print(data['train']['raw_text'][0])
output:
A LOT OF SAD PARTS
Reading the per-split data:
use_bert = True
for mode in ['train', 'valid', 'test']:
    if use_bert:
        text = data[mode]['text_bert'].astype(np.float32)
    else:
        text = data[mode]['text'].astype(np.float32)

    vision = data[mode]['vision'].astype(np.float32)
    audio = data[mode]['audio'].astype(np.float32)
    raw_text = data[mode]['raw_text']
    ids = data[mode]['id']

Statistics

print(len(data['train']['id']))
print(len(data['valid']['id']))
print(len(data['test']['id']))

output:
1284
229
686

MOSEI

Label

| emotion | label |
| --- | --- |
| strongly positive | +3 |
| positive | +2 |
| weakly positive | +1 |
| neutral | 0 |
| weakly negative | -1 |
| negative | -2 |
| strongly negative | -3 |

Feature Extraction

Text

All videos have manual transcriptions; GloVe word embeddings are used for the word representations.
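
As a small illustration, GloVe vectors for a transcript could be looked up as follows; the file name glove.840B.300d.txt and the 300-dimensional choice are assumptions, not stated in the source:

import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove('glove.840B.300d.txt')  # assumed local path
sentence = 'a lot of sad parts'.split()
embedded = np.stack([glove.get(w, np.zeros(300, dtype=np.float32)) for w in sentence])
print(embedded.shape)  # (5, 300)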

Visual

Frames are extracted from the full videos at 30 Hz.

The bounding box of the face is extracted using the MTCNN face detection algorithm.

Facial action units are extracted according to the Facial Action Coding System (FACS).

A set of six basic emotions is predicted purely from static faces using Emotient FACET.

MultiComp OpenFace is used to extract the set of 68 facial landmarks, 20 facial shape parameters, facial HoG features, head pose, head orientation, and eye gaze.

Face embeddings are extracted from commonly used facial recognition models such as DeepFace, FaceNet, and SphereFace.

Acoustic

The COVAREP software is used to extract acoustic features including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients.

Dataset structure

import pickle
import numpy as np

with open('data/MOSEI/aligned_50.pkl', 'rb') as f:
    data = pickle.load(f)

print(data.keys())
output:
dict_keys(['train', 'valid', 'test'])
print(data['train'].keys())
output:
dict_keys(['raw_text', 'audio', 'vision', 'id', 'text', 'text_bert', 'annotations', 'classification_labels', 'regression_labels'])
print(data['train']['raw_text'][0])
output:
Key is part of the people that we use to solve those issues, whether it's stretch or outdoor resistance or abrasions or different technical aspects that we really need to solve to get into new markets, they've been able to bring solutions.
Reading the per-split data:
use_bert = True
for mode in ['train', 'valid', 'test']:
    if use_bert:
        text = data[mode]['text_bert'].astype(np.float32)
    else:
        text = data[mode]['text'].astype(np.float32)

    vision = data[mode]['vision'].astype(np.float32)
    audio = data[mode]['audio'].astype(np.float32)
    raw_text = data[mode]['raw_text']
    ids = data[mode]['id']

Statistics

print(len(data['train']['id']))
print(len(data['valid']['id']))
print(len(data['test']['id']))

output:
16326
1871
4659