SIMS, MOSI, MOSEI
- SIMS: Chinese multimodal sentiment recognition dataset
  - Label
  - Feature
  - Dataset structure
  - Statistics
- MOSI: English multimodal sentiment recognition dataset
  - Label
  - Feature
  - Dataset structure
  - Statistics
- MOSEI
  - Label
  - Feature Extraction
  - Dataset structure
  - Statistics
SIMS: Chinese multimodal sentiment recognition dataset
Label
**Sentiment state (3 classes)**
| Emotion | Label |
| --- | --- |
| negative | -1 |
| neutral | 0 |
| positive | 1 |
**Regression task: the label is the average of the five annotators' results.**
The averaged score takes one of the values {-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0}.
These values are further divided into 5 classes:
| Emotion | Score values |
| --- | --- |
| negative | {-1.0, -0.8} |
| weakly negative | {-0.6, -0.4, -0.2} |
| neutral | {0.0} |
| weakly positive | {0.2, 0.4, 0.6} |
| positive | {0.8, 1.0} |
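A minimal sketch of this mapping, following the bin boundaries in the table above (the function name is only for illustration):
def sims_five_class(score):
    # map an averaged SIMS regression score in [-1, 1] to one of the 5 classes
    if score <= -0.8:
        return 'negative'
    elif score <= -0.2:
        return 'weakly negative'
    elif score < 0.2:
        return 'neutral'
    elif score < 0.8:
        return 'weakly positive'
    else:
        return 'positive'
print([sims_five_class(s) for s in (-1.0, -0.4, 0.0, 0.6, 1.0)])
output:
['negative', 'weakly negative', 'neutral', 'weakly positive', 'positive']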
Feature
Text
BERT-base word embeddings (768-dimensional word vectors)
Audio
The LibROSA speech toolkit is used with default parameters to extract acoustic features at 22050 Hz.
In total, 33-dimensional frame-level acoustic features are extracted, including the 1-dimensional logarithmic fundamental frequency (log F0), 20-dimensional Mel-frequency cepstral coefficients (MFCCs), and 12-dimensional Constant-Q chromagram (CQT).
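A rough librosa-based sketch of these 33-dimensional features; this approximates the pipeline described above, it is not the authors' exact script, and 'segment.wav' and the hop length are assumptions:
import librosa
import numpy as np

y, sr = librosa.load('segment.wav', sr=22050)   # placeholder path for one video segment's audio track
hop = 512                                        # librosa's default hop length
f0 = librosa.yin(y, fmin=65, fmax=1000, sr=sr, hop_length=hop)              # per-frame fundamental frequency
log_f0 = np.log(f0)[np.newaxis, :]                                          # 1-dim log F0
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)          # 20-dim MFCCs
cqt = librosa.feature.chroma_cqt(y=y, sr=sr, n_chroma=12, hop_length=hop)   # 12-dim CQT chroma
n = min(log_f0.shape[1], mfcc.shape[1], cqt.shape[1])                       # guard against off-by-one frame counts
features = np.concatenate([log_f0[:, :n], mfcc[:, :n], cqt[:, :n]], axis=0).T  # (frames, 33)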
Vision
Frames are extracted from the video segments at 30 Hz.
The MTCNN face detection algorithm is used to extract aligned faces.
The MultiComp OpenFace 2.0 toolkit is then used to extract 68 facial landmarks, 17 facial action units, head pose, head orientation, and eye gaze. In total, 709-dimensional frame-level visual features are extracted.
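For the face detection step, here is a minimal sketch using OpenCV for frame extraction and the facenet-pytorch implementation of MTCNN; this is only one possible MTCNN implementation, the OpenFace 2.0 feature extraction itself is a separate toolkit and is not reproduced here, and 'segment.mp4' is a placeholder:
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224, margin=0)      # returns aligned face crops
cap = cv2.VideoCapture('segment.mp4')        # placeholder path
faces = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    face = mtcnn(Image.fromarray(rgb))       # cropped/aligned face tensor, or None if no face found
    if face is not None:
        faces.append(face)
cap.release()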
Dataset structure
import pickle
import numpy as np
with open('data/SIMS/unaligned_39.pkl', 'rb') as f:
    data = pickle.load(f)
print(data.keys())
output:
dict_keys(['train', 'valid', 'test'])
print(data['train'].keys())
output:
dict_keys(['raw_text', 'text_bert', 'audio_lengths', 'vision_lengths', 'classification_labels', 'regression_labels', 'classification_labels_T', 'regression_labels_T', 'classification_labels_A', 'regression_labels_A', 'classification_labels_V', 'regression_labels_V', 'text', 'audio', 'vision', 'id'])
print(data['train']['raw_text'][0])
output:
闭嘴,不是来抓你的。
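The modality features are stored as padded arrays; a quick way to inspect them is shown below (the exact shapes depend on the pickle version, so the comments are only indicative):
print(data['train']['text_bert'].shape)    # typically (N, 3, seq_len): input ids / attention mask / segment ids
print(data['train']['audio'].shape)        # (N, max_audio_len, 33)
print(data['train']['vision'].shape)       # (N, max_vision_len, 709)
print(data['train']['audio_lengths'][:5])  # true sequence lengths before padding (unaligned data only)
print(data['train']['vision_lengths'][:5])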
Loading the data
use_bert = True  # True: use BERT token ids, False: use pre-extracted word vectors
for mode in ['train', 'valid', 'test']:
    # cast features to float32
    if use_bert:
        text = data[mode]['text_bert'].astype(np.float32)
    else:
        text = data[mode]['text'].astype(np.float32)
    vision = data[mode]['vision'].astype(np.float32)
    audio = data[mode]['audio'].astype(np.float32)
    rawText = data[mode]['raw_text']
    ids = data[mode]['id']
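As a usage example, the loaded arrays can be wrapped in a small PyTorch Dataset. This is a minimal sketch assuming the fields shown above (class name and label choice are illustrative, not part of any official loader):
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SIMSDataset(Dataset):
    def __init__(self, data, mode='train', use_bert=True):
        split = data[mode]
        key = 'text_bert' if use_bert else 'text'
        self.text = torch.tensor(split[key].astype(np.float32))
        self.audio = torch.tensor(split['audio'].astype(np.float32))
        self.vision = torch.tensor(split['vision'].astype(np.float32))
        self.labels = torch.tensor(split['regression_labels'].astype(np.float32))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.text[idx], self.audio[idx], self.vision[idx], self.labels[idx]

loader = DataLoader(SIMSDataset(data, 'train'), batch_size=32, shuffle=True)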
Statistics
print(len(data['train']['id']))
print(len(data['valid']['id']))
print(len(data['test']['id']))
output:
1368
456
457
MOSI: English multimodal sentiment recognition dataset
Label
| Emotion | Label |
| --- | --- |
| strongly positive | +3 |
| positive | +2 |
| weakly positive | +1 |
| neutral | 0 |
| weakly negative | -1 |
| negative | -2 |
| strongly negative | -3 |
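Papers on MOSI commonly report a 7-class accuracy obtained by clipping the continuous score to [-3, 3] and rounding; a minimal sketch of that convention (not necessarily how the classification_labels in the pickle below were built):
import numpy as np

def to_seven_class(score):
    # clip to [-3, 3] and round to the nearest integer label
    return int(np.clip(np.round(score), -3, 3))

print([to_seven_class(s) for s in (-3.0, -1.4, 0.2, 2.6)])
output:
[-3, -1, 0, 3]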
Feature
Audio and visual features are automatically extracted from the MPEG files, with frame rates of 1000 for audio and 30 for video.
Visual
16 Facial Action Units, 68 Facial Landmarks, Head Pose and Orientation, 6 Basic Emotions, and Eye Gaze
Audio
COVAREP: pitch, energy, NAQ (Normalized Amplitude Quotient), MFCCs (Mel-frequency Cepstral Coefficients), Peak Slope, Energy Slope
Dataset structure
import pickle
import numpy as np
with open('data/MOSI/aligned_50.pkl', 'rb') as f:
    data = pickle.load(f)
print(data.keys())
output:
dict_keys(['train', 'valid', 'test'])
print(data['train'].keys())
output:
dict_keys(['raw_text', 'audio', 'vision', 'id', 'text', 'text_bert', 'annotations', 'classification_labels', 'regression_labels'])
print(data['train']['raw_text'][0])
output:
A LOT OF SAD PARTS
Loading the data
use_bert = True  # True: use BERT token ids, False: use pre-extracted word vectors
for mode in ['train', 'valid', 'test']:
    if use_bert:
        text = data[mode]['text_bert'].astype(np.float32)
    else:
        text = data[mode]['text'].astype(np.float32)
    vision = data[mode]['vision'].astype(np.float32)
    audio = data[mode]['audio'].astype(np.float32)
    rawText = data[mode]['raw_text']
    ids = data[mode]['id']
Statistics
print(len(data['train']['id']))
print(len(data['valid']['id']))
print(len(data['test']['id']))
output:
1284
229
686
MOSEI
Label
| Emotion | Label |
| --- | --- |
| strongly positive | +3 |
| positive | +2 |
| weakly positive | +1 |
| neutral | 0 |
| weakly negative | -1 |
| negative | -2 |
| strongly negative | -3 |
Feature Extraction
Text
All videos have manual transcriptions. GloVe word embeddings are used to represent the transcribed words.
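A minimal sketch of looking up GloVe vectors for a transcript (the embedding file name and the 300-dimensional size are assumptions, not fixed by the dataset):
import numpy as np

glove = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:   # placeholder embedding file
    for line in f:
        parts = line.rstrip().split(' ')
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

def embed(tokens, dim=300):
    # unknown words fall back to a zero vector
    return np.stack([glove.get(t.lower(), np.zeros(dim, dtype=np.float32)) for t in tokens])

vectors = embed("a lot of sad parts".split())   # shape (5, 300)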
Visual
Frames are extracted from the full videos at 30 Hz.
The bounding box of the face is extracted using the MTCNN face detection algorithm.
Facial action units are extracted through the Facial Action Coding System (FACS).
A set of six basic emotions is extracted purely from static faces using Emotient FACET.
MultiComp OpenFace is used to extract the set of 68 facial landmarks, 20 facial shape parameters, facial HoG features, head pose, head orientation, and eye gaze.
Face embeddings are extracted from commonly used facial recognition models such as DeepFace, FaceNet, and SphereFace.
Acoustic
The COVAREP software is used to extract acoustic features, including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients.
Dataset structure
import pickle
import numpy as np
with open('data/MOSEI/aligned_50.pkl', 'rb') as f:
    data = pickle.load(f)
print(data.keys())
output:
dict_keys(['train', 'valid', 'test'])
print(data['train'].keys())
output:
dict_keys(['raw_text', 'audio', 'vision', 'id', 'text', 'text_bert', 'annotations', 'classification_labels', 'regression_labels'])
print(data['train']['raw_text'][0])
output:
Key is part of the people that we use to solve those issues, whether it's stretch or outdoor resistance or abrasions or different technical aspects that we really need to solve to get into new markets, they've been able to bring solutions.
Loading the data
use_bert = True  # True: use BERT token ids, False: use pre-extracted word vectors
for mode in ['train', 'valid', 'test']:
    if use_bert:
        text = data[mode]['text_bert'].astype(np.float32)
    else:
        text = data[mode]['text'].astype(np.float32)
    vision = data[mode]['vision'].astype(np.float32)
    audio = data[mode]['audio'].astype(np.float32)
    rawText = data[mode]['raw_text']
    ids = data[mode]['id']
Statistics
print(len(data['train']['id']))
print(len(data['valid']['id']))
print(len(data['test']['id']))
output:
16326
1871
4659