Survey of related databases
The 50 Salads dataset captures 25 people preparing 2 mixed salads each and contains over 4 hours of annotated accelerometer and RGB-D video data. With detailed annotations, multiple sensor types, and two sequences per participant, the dataset may be used for research in areas such as activity recognition, activity spotting, sequence analysis, progress tracking, sensor fusion, transfer learning, and user adaptation. http://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/
The dataset comprises two views of various scenarios of people acting out different interactions. Ten basic scenarios were acted out by members of the Vision Group team: InGroup (IG), Approach (A), WalkTogether (WT), Split (S), Ignore (I), Following (FO), Chase (C), Fight (FI), RunTogether (RT), and Meet (M). Many of the interactions in the video sequences are labelled accordingly.
The data is captured at 25 frames per second. The resolution is 640x480. The videos are available either as AVI’s or as a numbered set of JPEG single image files.
Most (but not all) of the video sequences have ground-truth bounding boxes for the pedestrians in the scene.
The Berkeley Multimodal Human Action Database (MHAD) contains 11 actions performed by 7 male and 5 female subjects in the range 23-30 years of age except for one elderly subject. All the subjects performed 5 repetitions of each action, yielding about 660 action sequences which correspond to about 82 minutes of total recording time. In addition, we have recorded a T-pose for each subject which can be used for the skeleton extraction; and the background data (with and without the chair used in some of the activities).
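The stated sequence count follows directly from the recording protocol (11 actions, 12 subjects, 5 repetitions); a quick sanity check:

```python
# Sanity check of the Berkeley MHAD sequence count:
# 11 actions x (7 male + 5 female) subjects x 5 repetitions.
actions = 11
subjects = 7 + 5
repetitions = 5
total = actions * subjects * repetitions
print(total)  # 660, matching the "about 660 action sequences" figure
```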
The specified set of actions comprises the following: (1) actions with movement in both upper and lower extremities, e.g., jumping in place, jumping jacks, throwing; (2) actions with high dynamics in the upper extremities, e.g., waving hands, clapping hands; and (3) actions with high dynamics in the lower extremities, e.g., sitting down, standing up.
8 interaction classes: bow, boxing, handshake, high-five, hug, kick, pat and push. 400 video clips.
CAD120 (Cornell Activity Datasets)
The CAD-120 dataset comprises 120 RGB-D video sequences of humans performing activities, recorded using the Microsoft Kinect sensor. 4 subjects: two male, two female.
10 high-level activities: making cereal, taking medicine, stacking objects, unstacking objects, microwaving food, picking objects, cleaning objects, taking food, arranging objects, having a meal
10 sub-activity labels: reaching, moving, pouring, eating, drinking, opening, placing, closing, scrubbing, null
12 object affordance labels: reachable, movable, pourable, pourto, containable, drinkable, openable, placeable, closable, scrubbable, scrubber, stationary
CAD60 (Cornell Activity Datasets)
The CAD-60 dataset comprises 60 RGB-D video sequences of humans performing activities, recorded using the Microsoft Kinect sensor. 4 subjects: two male, two female. There are 5 different environments: office, kitchen, bedroom, bathroom and living room, and 12 activities: rinsing mouth, brushing teeth, wearing contact lenses, talking on the phone, drinking water, opening a pill container, cooking (chopping), cooking (stirring), talking on a couch, relaxing on a couch, writing on a whiteboard, working on a computer.
CASIA (action database for recognition)
The CASIA action database is a collection of sequences of human activities captured outdoors by video cameras from different angles of view. There are 1446 sequences in all, containing eight types of single-person actions (walk, run, bend, jump, crouch, faint, wander and punch a car), each performed by 24 subjects, and seven types of two-person interactions (rob, fight, follow, follow and gather, meet and part, meet and gather, overtake), each performed by pairs of subjects.
For the CAVIAR project a number of video clips were recorded acting out the different scenarios of interest. These include people walking alone, meeting with others, window shopping, entering and exiting shops, fighting, passing out and, last but not least, leaving a package in a public place. The ground truth for these sequences was obtained by hand-labeling the images.
CMU MMAC (CMU Multi-Modal Activity Database)
The CMU Multi-Modal Activity Database (CMU-MMAC) contains multimodal measures of the human activity of subjects performing the tasks involved in cooking and food preparation. A kitchen was built and to date twenty-five subjects have been recorded cooking five different recipes: brownies, pizza, sandwich, salad, and scrambled eggs.
CMU MoCap (CMU Graphics Lab Motion Capture Database)
CMU MoCap is a database of human interaction with the environment and locomotion. There are 2605 trials in 6 categories and 23 subcategories, which include common two-person interactions, activities such as walking and running, and sports such as playing basketball and dancing.
CONVERSE (Human Conversational Interaction Dataset)
This is a human interaction recognition dataset intended for the exploration of classifying naturally executed conversational scenarios between a pair of individuals via the use of pose- and appearance-based features. The motivation behind CONVERSE is to present the problem of classifying subtle and complex behaviors between participants with pose-based information, classes which are not easily defined by the poses they contain. A pair of individuals are recorded performing natural dialogues across 7 different conversational scenarios by use of commercial depth sensor, providing pose-based representation of the interactions in the form of the extracted human skeletal models. Baseline classification results are presented in the associated publication to allow cross-comparison with future research into pose-based interaction recognition.
Drinking/Smoking (Drinking & Smoking action annotation)
The annotation describes each action by a cuboid in space-time, a keyframe and the position of the head on the keyframe. 308 events with labels PersonDrinking (159) and PersonSmoking (149) are extracted from movies.
ETISEO focuses on the treatment and interpretation of videos involving pedestrians and vehicles, indoors or outdoors, obtained from fixed cameras.
ETISEO aims at studying the dependency between algorithms and the video characteristics.
G3D (gaming dataset)
G3D dataset contains a range of gaming actions captured with Microsoft Kinect. The Kinect enabled us to record synchronised video, depth and skeleton data. The dataset contains 10 subjects performing 20 gaming actions: punch right, punch left, kick right, kick left, defend, golf swing, tennis swing forehand, tennis swing backhand, tennis serve, throw bowling ball, aim and fire gun, walk, run, jump, climb, crouch, steer a car, wave, flap and clap. The 20 gaming actions are recorded in 7 action sequences. Most sequences contain multiple actions in a controlled indoor environment with a fixed camera, a typical setup for gesture based gaming.
G3Di (gaming dataset)
G3Di is a realistic and challenging human interaction dataset for multiplayer gaming, containing synchronised colour, depth and skeleton data. This dataset contains 12 people split into 6 pairs. Each pair interacted through a gaming interface showcasing six sports: boxing, volleyball, football, table tennis, sprint and hurdles. The interactions can be collaborative or competitive depending on the specific sport and game mode. In this dataset volleyball was played collaboratively and the other sports in competitive mode. In most sports the interactions were explicit and can be decomposed into an action and counter-action, but in the sprint and hurdles the interactions were implicit, as the players competed with each other for the fastest time. The actions for each sport are: boxing (right punch, left punch, defend), volleyball (serve, overhand hit, underhand hit, and jump hit), football (kick, block and save), table tennis (serve, forehand hit and backhand hit), sprint (run) and hurdles (run and jump).
HMDB51 (A Large Human Motion Database)
HMDB was collected from various sources, mostly from movies, and a small proportion from public databases such as the Prelinger archive, YouTube and Google Videos. The dataset contains 6849 clips divided into 51 action categories, each containing a minimum of 101 clips. The action categories can be grouped into five types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction and body movements for human interaction.
General facial actions: smile, laugh, chew, talk.
Facial actions with object manipulation: smoke, eat, drink.
General body movements: cartwheel, clap hands, climb, climb stairs, dive, fall on the floor, backhand flip, handstand, jump, pull up, push up, run, sit down, sit up, somersault, stand up, turn, walk, wave.
Body movements with object interaction: brush hair, catch, draw sword, dribble, golf, hit something, kick ball, pick, pour, push something, ride bike, ride horse, shoot ball, shoot bow, shoot gun, swing baseball bat, sword exercise, throw.
Body movements for human interaction: fencing, hug, kick someone, kiss, punch, shake hands, sword fight.
Hollywood (Hollywood Human Actions dataset)
Hollywood dataset contains video samples with human action from 32 movies. Each sample is labeled according to one or more of 8 action classes: AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp. The dataset is divided into a test set obtained from 20 movies and two training sets obtained from 12 movies different from the test set. The Automatic training set is obtained using automatic script-based action annotation and contains 233 video samples with approximately 60% correct labels. The Clean training set contains 219 video samples with manually verified labels. The test set contains 211 samples with manually verified labels.
Hollywood2 (Hollywood-2 Human Actions and Scenes dataset)
Hollywood2 contains 12 classes of human actions and 10 classes of scenes distributed over 3669 video clips and approximately 20.1 hours of video in total. The dataset intends to provide a comprehensive benchmark for human action recognition in realistic and challenging settings. It is composed of video clips extracted from 69 movies and contains approximately 150 samples per action class and 130 samples per scene class in the training and test subsets.
The dataset contains around 650 video clips across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies that extend to the 3D data are also proposed. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset, including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
The HumanEVA-I dataset contains 7 calibrated video sequences (4 grayscale and 3 color) that are synchronized with 3D body poses obtained from a motion capture system. The database contains 4 subjects performing 6 common actions (e.g. walking, jogging, gesturing). Error metrics for computing error in 2D and 3D pose are provided to participants. The dataset contains training, validation and testing (with withheld ground truth) sets.
HUMANEVA-II contains only 2 subjects (both of whom also appear in the HUMANEVA-I dataset) performing an extended sequence of actions that we call Combo. In this sequence a subject starts by walking along an elliptical path, then continues to jog in the same direction, and concludes by alternately balancing on each of the two feet roughly in the center of the viewing volume.
The HUMANEVA-I training and validation data is intended to be shared across the two datasets with test results primarily being reported on HUMANEVA-II.
IXMAS (INRIA Xmas Motion Acquisition Sequences)
INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multiview dataset for view-invariant human action recognition. There are 13 daily-life motions, each performed 3 times by 11 actors. The actors freely chose their position and orientation.
Framewise ground truth labeling: 0 - nothing, 1 - check watch, 2 - cross arms, 3 - scratch head, 4 - sit down, 5 - get up, 6 - turn around, 7 - walk, 8 - wave, 9 - punch, 10 - kick, 11 - point, 12 - pick up, 13 - throw (over head), 14 - throw (from bottom up).
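As a sketch of how the framewise labels above might be consumed, the integer-to-name mapping can be captured in a small lookup table; `decode_frames` is a hypothetical helper, not part of the dataset's own tooling:

```python
# Integer-to-name mapping, following the IXMAS framewise listing above.
IXMAS_LABELS = {
    0: "nothing", 1: "check watch", 2: "cross arms", 3: "scratch head",
    4: "sit down", 5: "get up", 6: "turn around", 7: "walk",
    8: "wave", 9: "punch", 10: "kick", 11: "point",
    12: "pick up", 13: "throw (over head)", 14: "throw (from bottom up)",
}

def decode_frames(frame_labels):
    """Map a sequence of per-frame integer labels to action names."""
    return [IXMAS_LABELS[label] for label in frame_labels]

print(decode_frames([0, 7, 7, 4]))  # ['nothing', 'walk', 'walk', 'sit down']
```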
JPL (JPL First-Person Interaction dataset)
JPL First-Person Interaction dataset (JPL-Interaction dataset) is composed of human activity videos taken from a first-person viewpoint. The dataset particularly aims to provide first-person videos of interaction-level activities, recording how things visually look from the perspective (i.e., viewpoint) of a person/robot participating in such physical interactions.
We attached a GoPro2 camera to the head of our humanoid model, and asked human participants to interact with the humanoid by performing activities. In order to emulate the mobility of a real robot, we also placed wheels below the humanoid and had an operator move the humanoid by pushing it from behind.
There are 7 different types of activities in the dataset, including 4 positive (i.e., friendly) interactions with the observer, 1 neutral interaction, and 2 negative (i.e., hostile) interactions. ‘Shaking hands with the observer’, ‘hugging the observer’, ‘petting the observer’, and ‘waving a hand to the observer’ are the four friendly interactions. The neutral interaction is a situation where two persons have a conversation about the observer while occasionally pointing at it. ‘Punching the observer’ and ‘throwing objects at the observer’ are the two negative interactions. Videos were recorded continuously during human activities, and each video sequence contains 0 to 3 activities. The videos are in 320x240 resolution at 30 fps.
KTH (Action Database)
The database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). Currently the database contains 2391 sequences. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were downsampled to a spatial resolution of 160x120 pixels and have a length of four seconds on average.
LIRIS (The LIRIS human activities dataset)
The LIRIS human activities dataset contains (gray/RGB/depth) videos showing people performing various activities taken from daily life (discussing, telephone calls, giving an item, etc.). The dataset is fully annotated; the annotation contains not only the action class but also its spatial and temporal position in the video.
The dataset has been shot with two different cameras:
Subset D1 has been shot with a MS Kinect module mounted on a remotely controlled Wany robotics Pekee II mobile robot which is part of the LIRIS-VOIR platform.
Subset D2 has been shot with a Sony consumer camcorder.
The indoor motion capture dataset (MPI08) consists of:
sequences : multi-view sequences obtained from 8 calibrated cameras.
silhouettes : binary segmented images obtained with chroma-keying.
meshes : 3D laser scans for each of the four actors in the dataset, as well as the registered meshes with inserted skeleton.
projection matrices : one for each of the 8 cameras.
orientation data : raw and calibrated sensor orientation data (5 sensors)
All takes have been recorded in a lab environment using eight calibrated video cameras and five inertial sensors fixed at the two lower legs, the two hands, and the neck. Our evaluation data set comprises various actions including standard motions such as walking, sitting down and standing up, as well as fast and complex motions such as jumping, throwing, arm rotations, and cartwheels.
MPII Cooking (MPII (Max Planck Institute for Informatics) Cooking Activities dataset)
The dataset records 12 participants performing 65 different cooking activities, such as cut slices, pour, or spice. To record realistic behavior we did not record activities individually but asked participants to prepare one to six of a total of 14 dishes such as fruit salad or cake containing several cooking activities. In total we recorded 44 videos with a total length of more than 8 hours or 881,755 frames.
We also provide an annotated body pose training and test set. This allows working not only on the raw data but also on higher-level modeling of activities. Activities are distinguished by fine-grained body motions that have low inter-class variability and high intra-class variability due to diverse subjects and ingredients.
We record a dataset containing different cooking activities. We discard some of the composite activities in the script corpus which are either too elementary to form a composite activity (e.g. how to secure a chopping board), are duplicates with slightly different titles, or require ingredients of limited availability (e.g. butternut squash). This resulted in 41 composite cooking activities for evaluation. For each composite activity, we asked the subjects to give tutorial-like sequential instructions for executing the respective kitchen task. The instructions had to be divided into sequential steps with at most 15 steps per sequence. We selected 53 relevant kitchen tasks as composite activities by mining the tutorials for basic kitchen tasks on the webpage "Jamie's Home Cooking Skills". All those tasks are steps to process ingredients or to use certain kitchen tools. In addition to the data we collected in this experiment, we use data from the OMICS corpus for 6 kitchen-related composite activities. This results in a corpus with 2124 sequences in total, containing 12958 event descriptions.
This is a data set used for human action-detection experiments. It consists of a number of video sequences we have recorded.
It contains 16 video sequences and has in total 63 actions: 14 hand clapping, 24 hand waving, and 25 boxing, performed by 10 subjects. Each sequence contains multiple types of actions. Some sequences contain actions performed by different people. There are both indoor and outdoor scenes. All of the video sequences are captured with cluttered and moving backgrounds. Each video has a low resolution of 320x240 and a frame rate of 15 frames per second. Their lengths are between 32 and 76 seconds. To evaluate performance, we manually labeled a spatio-temporal bounding box for each action. The ground truth labeling can be found in the groundtruth.txt file. The ground truth format of each labeled action is "X width Y height T length".
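The "X width Y height T length" format describes a spatio-temporal cuboid: a spatial box plus a starting frame and duration. A minimal parsing sketch, assuming whitespace-separated integer fields (the actual layout of groundtruth.txt may differ):

```python
# Hypothetical parser for the "X width Y height T length" cuboid format.
from typing import NamedTuple

class ActionCuboid(NamedTuple):
    x: int       # left edge of the spatial bounding box
    width: int
    y: int       # top edge of the spatial bounding box
    height: int
    t: int       # starting frame index
    length: int  # duration in frames

def parse_line(line: str) -> ActionCuboid:
    x, width, y, height, t, length = (int(v) for v in line.split())
    return ActionCuboid(x, width, y, height, t, length)

print(parse_line("120 60 80 100 45 30"))
```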
Microsoft Research Action Data Set II is an extended version of the Microsoft Research Action Data Set. It consists of 54 video sequences recorded in a crowded environment. Each video sequence consists of multiple actions. There are three action types: hand waving, handclapping, and boxing. These action types are overlapped with the KTH data set. One could perform cross-data-set action recognition by using the KTH data set for training while using this data set for testing.
MSR-Action3D dataset contains twenty actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick up & throw. There are 10 subjects, each subject performs each action 2 or 3 times. There are 567 depth map sequences in total. The resolution is 320x240. The data was recorded with a depth sensor similar to the Kinect device.
DailyActivity3D dataset is a daily activity dataset captured by a Kinect device. There are 16 activity types: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, sit down. There are 10 subjects. Each subject performs each activity twice, once in standing position, and once in sitting position. There is a sofa in the scene. Three channels are recorded: depth maps (.bin), skeleton joint positions (.txt), and RGB video (.avi).
MSR Gesture 3D Dataset (weak correlation)
The dataset was captured by a Kinect device. There are 12 dynamic American Sign Language (ASL) gestures, and 10 people. Each person performs each gesture 2-3 times. There are 336 files in total, each corresponding to a depth sequence. The hand portion (above the wrist) has been segmented.
MuHAVi (Multicamera Human Action Video Data)
We have collected a large body of human action video (MuHAVi) data using 8 cameras. There are 17 action classes performed by 14 actors: WalkTurnBack, RunStop, Punch, Kick, ShotGunCollapse, PullHeavyObject, PickupThrowObject, WalkFall, LookInCar, CrawlOnKnees, WaveArms, DrawGraffiti, JumpOverFence, DrunkWalk, ClimbLadder, SmashObject, JumpOverGap.
The Olympic Sports Dataset contains videos of athletes practicing different sports. We have obtained all video sequences from YouTube and annotated their class label with the help of Amazon Mechanical Turk.
The current release contains 16 sports: high jump, long jump, triple jump, pole vault, discus throw, hammer throw, javelin throw, shot put, basketball lay-up, bowling, tennis serve, platform diving, springboard diving, snatch (weightlifting), clean-and-jerk (weightlifting), and gymnastic vault.
The POETICON video dataset is used for several experiments. It is separated into 6 activities, with the segmented actions that describe each activity saved in separate zip files: Cleaning, Make Salad, Make Sangria, Packing a Parcel, Planting and Table Setting.
For example, the actions of the Cleaning activity are: sweeping with broom (ub1), clean chair with cloth (ub2), clear trash bin (I), clean lamp with cloth (ub1), clean glasses with cloth (T), change light bulb (ub2), fold napkin (T), clean small table with cloth (K), change clock batteries (ub2), adjust clock time (ub2).
Rochester AoDL (University of Rochester Activities of Daily Living Dataset)
A high-resolution video dataset was recorded of activities of daily living, such as answering a phone, dialing a phone, looking up a phone number in a telephone directory, writing a phone number on a whiteboard, drinking a glass of water, eating snack chips, peeling a banana, eating a banana, chopping a banana, and eating food with silverware.
These activities were each performed three times by five different people. These people were all members of the computer science department, and were naive to the details of our model when the data was collected.
SBU Kinect Interaction
We collect eight interactions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands, from seven participants forming 21 two-actor sets. The entire dataset has approximately 300 interactions in total. It comprises RGB-D video sequences of humans performing interaction activities, recorded using the Microsoft Kinect sensor. In our dataset, color-depth video and motion capture data have been synchronized and annotated with an action label for each frame.
Stanford 40 Actions
The Stanford 40 Action Dataset contains images of humans performing 40 actions. In each image, we provide a bounding box of the person who is performing the action indicated by the filename of the image. There are 9532 images in total with 180-300 images per action class.
The TUM Kitchen Data Set is provided to foster research in the areas of marker less human motion capture, motion segmentation and human activity recognition. It should aid researchers in these fields by providing a comprehensive collection of sensory input data that can be used to try out and to verify their algorithms. It is also meant to serve as a benchmark for comparative studies given the manually annotated “ground truth” labels of the underlying actions. The recorded activities have been selected with the intention to provide realistic and seemingly natural motions, and consist of everyday manipulation activities in a natural kitchen environment.
UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. This data set is an extension of UCF50 data set which has 50 action categories.
With 13320 videos from 101 action categories, UCF101 offers the largest diversity in terms of actions, and with the presence of large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc., it is the most challenging data set to date. As most of the available action recognition data sets are not realistic and are staged by actors, UCF101 aims to encourage further research into action recognition by learning and exploring new realistic action categories.
The videos in 101 action categories are grouped into 25 groups, where each group can consist of 4-7 videos of an action. The videos from the same group may share some common features, such as similar background, similar viewpoint, etc.
The action categories can be divided into five types: 1) Human-Object Interaction, 2) Body-Motion Only, 3) Human-Human Interaction, 4) Playing Musical Instruments, 5) Sports.
It contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog.
This data set is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc.
For each category, the videos are grouped into 25 groups, each with more than 4 action clips. The video clips in the same group share some common features, such as the same actor, similar background, similar viewpoint, and so on.
The videos are in MS MPEG-4 format. You need to install the right codec (e.g. the K-Lite Codec Pack contains a collection of codecs) to access them.
UCF50 is an action recognition data set with 50 action categories, consisting of realistic videos taken from YouTube. This data set is an extension of the YouTube Action data set (UCF11), which has 11 action categories.
Most of the available action recognition data sets are not realistic and are staged by actors. In our data set, the primary focus is to provide the computer vision community with an action recognition data set consisting of realistic videos taken from YouTube. Our data set is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. For all 50 categories, the videos are grouped into 25 groups, where each group consists of more than 4 action clips. The video clips in the same group may share some common features, such as the same person, similar background, similar viewpoint, and so on.
UCF Sports dataset consists of a set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN. The video sequences were obtained from a wide range of stock footage websites including BBC Motion gallery and GettyImages.
The dataset includes a total of 150 sequences with the resolution of 720 x 480. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. By releasing the data set we hope to encourage further research into this class of action recognition in unconstrained environments. Since its introduction, the dataset has been used for numerous applications such as: action recognition, action localization, and saliency detection.
The dataset includes the following 10 actions: Diving (14 videos), Golf Swing (18 videos), Kicking (20 videos), Lifting (6 videos), Riding Horse (12 videos), Running (13 videos), SkateBoarding (12 videos), Swing-Bench (20 videos), Swing-Side (13 videos), Walking (22 videos).
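The per-class counts listed above can be checked against the stated total of 150 sequences:

```python
# Per-class clip counts for UCF Sports, as listed above.
counts = {
    "Diving": 14, "Golf Swing": 18, "Kicking": 20, "Lifting": 6,
    "Riding Horse": 12, "Running": 13, "SkateBoarding": 12,
    "Swing-Bench": 20, "Swing-Side": 13, "Walking": 22,
}
print(sum(counts.values()))  # 150, matching the stated total
```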
The UMPM Benchmark is a collection of video recordings together with ground truth based on motion capture data. It is intended to be used for assessing the quality of methods for recognizing the poses of multiple persons from video data, using either a single camera or multiple cameras.
The (UMPM) benchmark includes synchronized motion capture data and video sequences from multiple viewpoints for multi-person motion including multi-person interaction. The data set is available to the research community to promote research in multi-person articulated human motion analysis.
The recordings should also show the main challenges of multi-person motion, which are visibility (a person, or part of one, is not visible because of occlusions by other persons or static objects, or because of self-occlusion) and ambiguity (body parts are identified ambiguously when persons are close to each other). The body poses and gestures are classified as natural (commonly used in daily life) and synthetic (special human movements for some particular purpose, such as human-computer interaction, sports or gaming). Each of these two classes is subdivided into a few scenarios. In total, our data set consists of 9 different scenarios. Each scenario is recorded with 1, 2, 3 and 4 persons in the scene, and is recorded multiple times to provide variations, i.e. different subject combinations, orders of poses and motion patterns. For natural motion we defined 5 different scenarios where the subjects (1) walk, jog and run in an arbitrary way among each other, (2) walk along a circle or triangle of a predetermined size, (3) walk around while one of them sits or hangs on a chair, (4) sit, lie, hang or stand on a table or walk around it, and (5) grab objects from a table. These scenarios include individual actions, but the number of subjects moving around in the restricted area causes inter-person occlusions. We also include two scenarios with interaction between the subjects: (6) a conversation with natural gestures, and (7) the subjects throw or pass a ball to each other while walking around. The scenarios with synthetic motions include poses as shown in Figure 1, performed when the subjects (8) stand still and (9) move around. These scenarios are recorded without any static occluders in order to focus only on inter-person occlusions.
The UT-Interaction dataset contains videos of continuous executions of 6 classes of human-human interactions: shake hands, point, hug, push, kick and punch. Ground truth labels for these interactions are provided, including time intervals and bounding boxes. There is a total of 20 video sequences whose lengths are around 1 minute. Each video contains at least one execution per interaction, providing 8 executions of human activities per video on average. Several participants with more than 15 different clothing conditions appear in the videos. The videos are taken at a resolution of 720x480 at 30 fps, and the height of a person in the video is about 200 pixels.
We divide the videos into two sets. Set 1 is composed of 10 video sequences taken in a parking lot. The videos of set 1 are taken with slightly different zoom rates, and their backgrounds are mostly static with little camera jitter. Set 2 (i.e. the other 10 sequences) was taken on a lawn on a windy day. The background moves slightly (e.g. trees move), and these videos contain more camera jitter. In sequences 1 to 4 and 11 to 13, only two interacting persons appear in the scene. In sequences 5 to 8 and 14 to 17, both interacting persons and pedestrians are present in the scene. In sequences 9, 10, 18, 19, and 20, several pairs of interacting persons execute the activities simultaneously. Each set has a different background, scale, and illumination.
We provide a large body of synthetic video data generated for evaluating silhouette-based human action recognition algorithms. The data consist of 20 action classes, 9 actors, and up to 40 synchronized perspective camera views. It is well known that for action recognition algorithms based purely on human body masks, where other image properties such as colour and intensity are not used, it is important to obtain accurate silhouette data from the video frames. This problem is usually considered not as part of action recognition itself but as a lower-level problem in motion tracking and change detection. Hence, for researchers working on the recognition side, access to reliable Virtual Human Action Silhouette (ViHASi) data is both a necessity and a relief: such data allow comprehensive experimentation with and evaluation of the methods under study, which may even lead to their improvement.
The dataset is designed to be more realistic, natural, and challenging for video surveillance domains than existing action recognition datasets in terms of its resolution, background clutter, scene diversity, and human activity/event categories.
Data was collected in natural scenes showing people performing normal actions in standard contexts, with uncontrolled, cluttered backgrounds. There are frequent incidental movers and background activities. Actions performed by directed actors were minimized; most are actions performed by the general population. Data was collected at multiple sites distributed throughout the USA. A variety of camera viewpoints and resolutions were included, and actions are performed by many different people. Diverse types of human actions and human-vehicle interactions are included, with a large number of examples (>30) per action class. Many applications such as video surveillance operate across a wide range of spatial and temporal resolutions; the dataset is designed to capture these ranges, with frame rates of 2–30 Hz and person heights of 10–200 pixels. The dataset provides both the original HD-quality videos and versions downsampled both spatially and temporally. Both ground-camera and aerial videos are collected and released as part of the VIRAT Video Dataset.
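The spatio-temporal downsampling described above (e.g., reducing a 30 Hz clip with 200-pixel-tall people toward the 2 Hz, 10-pixel end of the range) can be illustrated with a minimal numpy sketch. This is an illustrative nearest-neighbour subsampler, not the pipeline used to produce the released VIRAT versions; the function name and parameters are our own.

```python
import numpy as np

def downsample_clip(frames, temporal_step=2, spatial_step=2):
    """Produce a lower-rate, lower-resolution version of a clip by
    keeping every `temporal_step`-th frame and every `spatial_step`-th
    pixel row/column (nearest-neighbour subsampling).

    frames: array of shape (T, H, W, C).
    Returns an array of shape (ceil(T/temporal_step),
    ceil(H/spatial_step), ceil(W/spatial_step), C).
    """
    return frames[::temporal_step, ::spatial_step, ::spatial_step]

# Example: 1 second of 30 Hz video reduced to 2 Hz, with a 200-pixel
# person height shrunk to 10 pixels (steps 15 and 20 respectively).
clip = np.zeros((30, 200, 160, 3), dtype=np.uint8)
small = downsample_clip(clip, temporal_step=15, spatial_step=20)
```

In practice one would low-pass filter (blur) before spatial subsampling to avoid aliasing; plain striding keeps the sketch self-contained.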
We collected a database of 90 low-resolution (180×144, deinterlaced, 50 fps) video sequences showing nine different people, each performing 10 natural actions such as “run,” “walk,” “skip,” “jumping-jack” (or, shortly, “jack”), “jump-forward-on-two-legs” (“jump”), “jump-in-place-on-two-legs” (“pjump”), “gallop-sideways” (“side”), “wave-two-hands” (“wave2”), “wave-one-hand” (“wave1”), and “bend.” To obtain space-time shapes of the actions, we subtracted the median background from each of the sequences and used simple thresholding in color-space.
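The median-background subtraction and thresholding step above can be sketched in a few lines of numpy. This is a minimal illustration of the general technique, assuming a static camera; the function name and the threshold value are our own, not taken from the dataset's release.

```python
import numpy as np

def extract_silhouettes(frames, threshold=30):
    """Estimate the static background as the per-pixel temporal median,
    then mark pixels whose colour deviates from it by more than
    `threshold` (summed over channels) as foreground.

    frames: uint8 array of shape (T, H, W, 3).
    Returns a boolean array of shape (T, H, W) of silhouette masks.
    """
    background = np.median(frames, axis=0)               # (H, W, 3)
    diff = np.abs(frames.astype(np.int16) - background)  # per-channel deviation
    return diff.sum(axis=-1) > threshold                 # colour-space threshold

# Toy example: a bright 3x2 "person" moving across a black background.
T, H, W = 5, 12, 12
frames = np.zeros((T, H, W, 3), dtype=np.uint8)
for t in range(T):
    frames[t, 2:5, 2 * t:2 * t + 2] = 255
masks = extract_silhouettes(frames)
```

The temporal median works here because each pixel is covered by the moving subject in only a minority of frames, so the median recovers the empty background.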
As part of our research on real-time multi-view human action recognition in a camera network, we collected data of subjects performing several actions from different views using a network of 8 embedded cameras. This data could be potentially useful for related research on activity recognition.
Dataset 1: This dataset was used to evaluate recognition of unit actions – each sample consists of a subject performing only one action, the start and end times for each action are known, and the input provided is exactly equal to the duration of an action. The subject performs a set of 12 actions at approximately the same pace. The data was collected at a rate of 20 fps with 640×480 resolution.
Dataset 2: This dataset was used for evaluating interleaved sequences of actions. Each sequence consists of multiple unit actions, and each unit action may be of varying duration. The data was collected at a rate of 20 fps with 960×720 resolution.
The multi-camera network system consists of 8 cameras that provide completely overlapping coverage of a rectangular region R (about 50×50 feet) from different viewing directions.