In this post, we will quickly check what information is actually contained in the video files produced by digital recording systems and extract some useful still images. The procedure applies to almost every common video file format. Inside a video file we normally find the video itself, a soundtrack, and some metadata. Video imagery is nothing more (and nothing less, as it is quite complex) than a collection of still images reproduced in succession. For machine learning applications, it is generally necessary to split the moving sequence into separate frames.
We will use this video file in this post:
First of all, we will use the lesser-known ffprobe command from the FFmpeg utilities to check the content and format of our video; we need, of course, FFmpeg installed (it comes preinstalled on Google Colab; note that pip's "ffmpeg" package does not provide the FFmpeg binaries, so elsewhere it should be installed through the system package manager):
!apt-get install -y ffmpeg
!ffprobe /content/Irizar_trimmed.mp4
The probing should return something similar to this:
This probing report shows the type of input file (the video container format), the metadata standard, the duration, the bitrate, and a list of the streams contained in the file: in this case Stream #0:0, with undetermined language, a video stream with the listed characteristics (h264 encoding here), and Stream #0:1, an mp3 audio stream. The video was taken from a drone with a microphone, a mid-level commercial drone rather than a professional or military one, so it carries no encoded mission metadata. We can, however, show what the result would look like for a video that does, or for a commercial drone video stream into which some MISB KLV data has been multiplexed:
!ffprobe /content/video_with_MISB_mdata.ts
This new probing returns this print:
This file is an MPEG Transport Stream (TS) container. This container type can hold multiple programs with video, audio, and data streams. In this case, Streams #0:1 and #0:3 in Program 1 carry KLV-encoded data, and ffprobe also warns that it cannot decode the content of these data streams (1, 3, and 5). We will post a method to extract this data as soon as we have shareable videos with a KLV data stream embedded; we cannot do that today.
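Beyond reading ffprobe's console report, the same stream information can be obtained in machine-readable form. A minimal sketch, assuming the same file path as above, that asks ffprobe for JSON output and lists the codec type of each stream:
import json
import subprocess

# Ask ffprobe for a JSON report of the container and its streams
# (-print_format/-show_format/-show_streams are standard ffprobe options).
result = subprocess.run(
    ['ffprobe', '-v', 'quiet', '-print_format', 'json',
     '-show_format', '-show_streams', '/content/video_with_MISB_mdata.ts'],
    capture_output=True, text=True)
report = json.loads(result.stdout)

print('Container:', report['format']['format_name'])
for stream in report['streams']:
    # codec_type is 'video', 'audio', or 'data' (the KLV streams show up as data)
    print(f"Stream #{stream['index']}: {stream['codec_type']} "
          f"({stream.get('codec_name', 'unknown codec')})")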
Today we can split our own video into frames. To do that, we need to import OpenCV for Python:
import cv2
We will point the video capture to our video file and name a folder for the captured frames. OpenCV's VideoCapture tool lets us iterate over the individual frames of the video file, and we can first cross-check the metadata we already know from our probing by counting the frames that make up the video:
source = '/content/Irizar_trimmed.mp4'
frames_dir = '/content/frames'

vidcap = cv2.VideoCapture(source)
fps = vidcap.get(cv2.CAP_PROP_FPS)                       # frames per second reported by the container
frame_count = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))  # total number of frames
duration = int(frame_count // fps)                       # approximate duration in seconds
print(f'Duration: {duration} seconds - {frame_count} frames')
vidcap = cv2.VideoCapture(source)                        # re-open the capture so reading starts at the first frame
We have 2,104 frames in this 70-second video. Let's create the directory to hold our frames and set the starting second, the ending second, and the sampling interval (how many seconds between captured frames):
import os

# Capture window, in seconds:
capture_from = 40   # first second to capture
capture_to = 50     # last second to capture
every_n = 1         # capture one frame every n seconds

if not os.path.exists(frames_dir):
    os.mkdir(frames_dir)
Now we have to iterate over the video capture object:
count = 0
while count < capture_to * fps:
    success, image = vidcap.read()
    count += 1
    if not success:                        # skip frames that could not be decoded
        continue
    if count < capture_from * fps:         # before the capture window
        continue
    if count % int(fps * every_n) != 0:    # not on the sampling interval
        continue
    print('Capturing:', count)
    cv2.imwrite(frames_dir + "/frame_%d.jpg" % count, image)  # save frame as JPEG file
vidcap.release()
We count the frames for control purposes. As a trial, capturing one frame per second from second 40 to second 50 produces fast and good results: we should end up with 10 jpg files in our "frames" folder, depicting different moments in the flight of our drone and showing different elements of interest. Two examples follow.
Frame 1200:
And frame 1500:
And these are the images we would use for labeling and bounding box generation, probably extracting the whole video as individual frames and manually bounding elements of interest, a time-intensive task:
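If we did want to dump the whole video as individual frames rather than a sampled window, FFmpeg can do it in a single command. A sketch, assuming the same source file and a frames_all output folder (the folder name is just an example):
!mkdir -p /content/frames_all
# Extract every frame of the video as a numbered JPEG (about 2,100 files for this clip)
!ffmpeg -i /content/Irizar_trimmed.mp4 -qscale:v 2 /content/frames_all/frame_%05d.jpg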
With the information on the video known and the frames extracted, it is a matter of working through the frames and labeling the elements of interest; these labeled frames will later become the training set for a machine learning application. The good news is that a preliminary model trained on them can help generate the bounding boxes for many other videos; as soon as a model labels reliably, the human trainer only needs to correct the good and bad labels so that a more advanced production model can perform the actual inference.
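As an illustration of what those labels can look like, below is a minimal sketch that writes one annotation in the common YOLO text format (one line per object: class id plus a box centre, width, and height normalised by the image size). The frame path, class id, and pixel coordinates are hypothetical placeholders, not values measured from this video:
import cv2

frame_path = '/content/frames/frame_1200.jpg'   # one of the extracted frames
image = cv2.imread(frame_path)
img_h, img_w = image.shape[:2]

# Hypothetical bounding box in pixel coordinates: (x_min, y_min, x_max, y_max)
x_min, y_min, x_max, y_max = 420, 310, 560, 395
class_id = 0   # e.g. "vehicle"; the class list is up to the labeller

# Convert to YOLO format: normalised centre x, centre y, width, height
x_center = (x_min + x_max) / 2 / img_w
y_center = (y_min + y_max) / 2 / img_h
box_w = (x_max - x_min) / img_w
box_h = (y_max - y_min) / img_h

# One .txt label file per frame, with the same base name as the image
with open(frame_path.replace('.jpg', '.txt'), 'w') as f:
    f.write(f'{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}\n')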
If you require quantitative model development, deployment, verification, or validation, do not hesitate to contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges applied to asset management, automation, or intelligence gathering from satellite, drone, or fixed-point imagery.
The notebook, in Google Colab, for this post is located here.