Generate a depth map from any input image
Transcribe audio or YouTube videos into text
Predict a depth map from a single image