Why copying text from videos is genuinely hard

People reach for the same instinct they use on a webpage: triple-click, copy, paste. Then they remember a video isn't a document. The text on screen is part of the picture. There is no underlying string the operating system can hand back to you, because the player is showing you a sequence of pixel-based frames decoded from H.264 or HEVC. Each frame is just a bitmap. The terminal output, the slide bullet, the on-screen lower-third — all of it is paint, not text.
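To make the "just a bitmap" point concrete, here is a minimal AVFoundation sketch of pulling one decoded frame out of a video file. The function name and the file path in the usage comment are ours, purely illustrative:

```swift
import AVFoundation
import CoreGraphics

// Pull one decoded frame out of a video file as a plain bitmap.
// There is no text object in the result, just width x height pixels.
func grabFrame(from url: URL, atSeconds seconds: Double) throws -> CGImage {
    let generator = AVAssetImageGenerator(asset: AVURLAsset(url: url))
    // Ask for the exact frame, not the nearest keyframe.
    generator.requestedTimeToleranceBefore = .zero
    generator.requestedTimeToleranceAfter = .zero
    let time = CMTime(seconds: seconds, preferredTimescale: 600)
    return try generator.copyCGImage(at: time, actualTime: nil)
}

// Usage (hypothetical path): the returned CGImage is all OCR ever sees.
// let frame = try grabFrame(from: URL(fileURLWithPath: "/tmp/tutorial.mp4"),
//                           atSeconds: 12.5)
```

Whatever the player, this is the shape of the data underneath: a grid of pixels with no string attached.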

Three things make this harder than a normal screenshot:

  1. Compression. The codec has already thrown detail away, and low-bitrate streams smear exactly the thin strokes recognition depends on.
  2. Scaling. The player resizes the frame to fit your window, so text recorded at a readable size can reach your screen at a fraction of it.
  3. The delivery path. Video is often composited on the GPU, and sometimes DRM-protected, outside the layers that text-selection features can inspect.

None of these are problems your OCR engine can solve. They're upstream of OCR, in the way the video is being delivered to your screen. The good news: most legitimate use cases — programming tutorials on YouTube, online lecture slides, recorded webinar Q&A, internal Loom walkthroughs — render normally and capture cleanly. The trick is knowing which tool to reach for once you've paused on the right frame.

Try Live Text first (and know exactly where it works)

If the video is playing inside Safari, Apple's Live Text is the easiest path. macOS detects text inside the paused video frame and lets you select it directly:

  1. Pause the video.
  2. Right-click on the paused frame.
  3. Choose Show Live Text (the menu item appears when Live Text has detected something selectable).
  4. Drag-select the text you want and copy it.

This works because Safari is built on WebKit, which Apple integrates tightly with the Vision framework. When you pause, WebKit hands the current frame to the system as a still image, Vision processes it, and the player overlays a selection layer. No extra app required, no hotkey to memorize.
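The Vision request involved is the same one any Mac app can make against a still image. A minimal sketch, assuming you already hold a frame as a CGImage (the function name is ours, not Apple's):

```swift
import Vision
import CoreGraphics

// Run Apple Vision's text recognizer over a single bitmap frame.
func recognizeText(in image: CGImage) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate   // slower, better for code and slides
    request.usesLanguageCorrection = false // don't "fix" identifiers in code

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // Each observation carries ranked candidate strings; take the best one.
    return (request.results ?? []).compactMap {
        $0.topCandidates(1).first?.string
    }
}
```

Turning off language correction is worth noting for tutorial footage: Vision's corrector is tuned for natural language and will happily "repair" a variable name into a dictionary word.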

The catch — and this is the part most articles bury — is that Live Text on video frames is Safari-only. Apple has not exposed this same hook to other browsers, and the GPU compositing path Chrome and Firefox use to render video bypasses the layer Live Text inspects. As of macOS Sonoma and Sequoia, paused video in Chrome, Firefox, Brave, Arc, and Edge does not get a Live Text affordance. Chrome on Mac can do Live Text on regular images, but not on the video element's current frame.

Where Live Text quietly fails

Once you step outside Safari, Live Text on video frames is gone. In our experience that covers the majority of real-world cases:

  1. Any non-Safari browser. Chrome, Firefox, Brave, Arc, and Edge show no Live Text affordance on a paused video.
  2. Desktop apps with their own players, such as the Loom client and Zoom recording playback.
  3. Course platforms and webinar portals, which almost always sit inside one of those browsers.

This is not a knock on Live Text. For the cases it covers — Photos, Preview, Notes, Safari images and video — it's an excellent piece of system software. It's just not the right tool when your tutorial is in Chrome, your team's recording is in Loom, or the slides you need are on a Zoom replay.

The workflow that actually covers everything

The approach we settled on for our own use, and the one we built Cheese! OCR around, treats the OCR step as a system-level operation rather than a browser feature:

  1. Pause the video. Hit space. Give the player a beat to settle.
  2. Trigger the OCR hotkey. Default in Cheese! OCR is ⇧⌘E. The screen dims and a crosshair appears.
  3. Drag-select the region with the text. The terminal window in the tutorial, the bullet on the slide, the Q&A panel on the webinar — whatever you need.
  4. Paste. The recognized text is on your clipboard. ⌘V into your notes, your code editor, your Slack reply.

This works regardless of player because Cheese! OCR doesn't ask the browser, the video element, or the meeting client what they're showing. It uses the macOS screen capture API to grab pixels off the display and runs Apple Vision on those pixels locally. From the OS's perspective it's the same operation as ⇧⌘4 — anything that draws to the screen and isn't actively DRM-protected can be captured and OCR'd.
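The whole hotkey path can be sketched in a few lines of Swift. This is our reading of the approach, not Cheese! OCR's actual code, and the region coordinates are placeholders; it also assumes the app has been granted the Screen Recording permission:

```swift
import AppKit
import Vision

// Sketch of the hotkey path: grab a screen region, OCR it locally,
// leave the result on the clipboard.
func ocrScreenRegion(_ region: CGRect) throws -> String {
    // Grabs pixels off the main display; requires Screen Recording permission.
    guard let screen = CGDisplayCreateImage(CGMainDisplayID()),
          let cropped = screen.cropping(to: region) else { return "" }

    // Apple Vision runs entirely on-device; the pixels never leave the Mac.
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    try VNImageRequestHandler(cgImage: cropped, options: [:]).perform([request])
    return (request.results ?? [])
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")
}

// Put recognized text where a normal copy would leave it.
func copyToClipboard(_ text: String) {
    NSPasteboard.general.clearContents()
    NSPasteboard.general.setString(text, forType: .string)
}
```

Because the capture happens at the display level, the browser, the video element, and the meeting client are all irrelevant; they already did their job by painting pixels.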

Two practical consequences worth flagging. First, this is on-device. Apple Vision runs entirely on your Mac, the captured frame never leaves your machine, and Cheese! OCR has no network entitlements at all. That matters more than usual when you're reading from confidential meeting recordings or unreleased course material. Second, you keep a history. Cheese! OCR stores recent recognitions in a searchable list, so if you OCR'd four consecutive code snippets out of a tutorial, you don't have to lose the earlier ones to capture the latest.

Real use cases this fixes

Some workflows where this comes up most often, in our own usage and in user feedback:

Programming tutorials

YouTube, Udemy, Coursera, Egghead, Frontend Masters. The instructor types a snippet on screen and you want the exact code in your editor without retyping. Pause, OCR, paste. Apple Vision is good enough on modern programming-font screencasts that you usually only need to fix indentation and the occasional l/1 confusion in low-bitrate streams.

Online lecture slides

You're watching a recorded course and the lecturer is reading from a slide that has a definition, a formula, or a citation you need. The slide passes in 20 seconds and the next chapter starts. Pause, OCR the slide region, move on. Faster than scrubbing back and pausing again to retype.

Webinar Q&A and chat panels

The chat in a recorded webinar often contains the most useful question the speaker addressed live but didn't repeat. OCR the chat panel and you have the exact wording.

Bullet-point presentation videos

Conference talks on YouTube, internal all-hands recordings, sales kickoff videos. The speaker barrels through a slide with five bullets. You want the bullets in your meeting notes. OCR the slide, paste, done.

Loom and Zoom walkthroughs from teammates

Someone records a Loom showing how to configure a tool, with command-line snippets on screen. The recording is in Chrome or the Loom desktop app. Live Text doesn't help. A hotkey OCR tool does.

Workflow tips that actually move the needle

Use IINA for frame-by-frame stepping. IINA is a polished open-source video player for Mac built on mpv. The arrow keys step one frame at a time. When the text you want flashes on screen for less than a second, IINA lets you land exactly on the frame where the text is sharpest, then OCR off that frame.

Pause at high-resolution moments. If a tutorial cuts between a wide shot of the speaker and a zoomed-in screen recording of code, OCR the zoomed-in frame. The same code in the wide shot is too small for the codec to preserve cleanly.

Zoom the player before pausing if the text is tiny. Most browsers honor ⌘+ on YouTube and other web players. A single zoom step often takes 12px caption text up to 18px and dramatically improves OCR accuracy.

Use the OCR history for multi-frame captures. Long code blocks frequently span more than one frame. Capture each frame as you reach it; Cheese! OCR keeps every recognition in a searchable list, so you can stitch them together in your editor afterward without losing intermediate captures.

Switch to closed captions when they exist. If the video has CC and you only need the spoken content, captions are always going to be more accurate than OCR'ing burned-in subtitles. OCR on video frames is the right tool for on-screen text the captions don't include — code, slide bullets, chat panels, screen-shared documents — not for replacing transcripts.

Caveats worth knowing before you trust the output

OCR on video frames is good, not magical. A few honest limits:

  1. Compression artifacts. Low-bitrate streams blur glyph edges, which is where confusions like l/1 and O/0 come from.
  2. Layout. Recognition returns lines of text, not structure, so indentation in code blocks usually needs a quick manual pass.
  3. Tiny text. Glyphs the codec rendered only a few pixels tall may not be recoverable at all; zoom the player before you pause.

None of these are dealbreakers. They're just the same trade-offs you'd hit with any screenshot OCR — slightly amplified by the fact that video is doubly compressed (once by the codec, once by whatever scaling the player applies). With a deliberate pause and a clean frame, modern Apple Vision OCR on a tutorial screencast is reliably good enough to skip retyping.

The summary, if you only remember one thing: video is just a sequence of pictures, so OCR works on it whenever you can capture a clean picture of it. Pause first, capture second, OCR third. The rest is choosing the right tool for the player you're in.