Why copying text from videos is genuinely hard
People reach for the same instinct they use on a webpage: triple-click, copy, paste. Then they remember a video isn't a document. The text on screen is part of the picture. There is no underlying string the operating system can hand back to you, because the player is showing you a sequence of pixel-based frames decoded from H.264 or HEVC. Each frame is just a bitmap. The terminal output, the slide bullet, the on-screen lower-third — all of it is paint, not text.
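To make the "frames are just bitmaps" point concrete, here's a minimal Swift sketch that pulls a single frame out of a video file with AVFoundation. The file path and timestamp are hypothetical; the point is that what comes back is a plain `CGImage` — pixels, with no text layer attached:

```swift
import AVFoundation

// Hypothetical local file; any H.264/HEVC video behaves the same way.
let url = URL(fileURLWithPath: "/tmp/tutorial.mp4")
let asset = AVURLAsset(url: url)

let generator = AVAssetImageGenerator(asset: asset)
generator.appliesPreferredTrackTransform = true
// Ask for the exact frame, not the nearest keyframe.
generator.requestedTimeToleranceBefore = .zero
generator.requestedTimeToleranceAfter = .zero

let time = CMTime(seconds: 12.5, preferredTimescale: 600)
let frame = try generator.copyCGImage(at: time, actualTime: nil)
// An ordinary bitmap: no underlying string for the OS to hand back.
print(frame.width, frame.height)
```

Everything from here on — Live Text, screenshot OCR, any other tool — is a way of running text recognition over a bitmap like this one.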
Three things make this harder than a normal screenshot:
- Compression artifacts. Video codecs throw away high-frequency detail to save bandwidth. Text edges that look crisp in a static screenshot end up blurry in a video frame, especially when the bitrate is low or the player is upscaling 720p source to a Retina display.
- Motion. Even if you intend to pause, you sometimes catch the player a frame too early, while the text is still in motion or mid-transition, and it smears.
- DRM. Netflix, Disney+, certain Zoom recordings, and some enterprise webinar tools deliberately render video into a protected layer. macOS honors the protection bit and your screenshot comes back as a black rectangle.
None of these are problems your OCR engine can solve. They're upstream of OCR, in the way the video is being delivered to your screen. The good news: most legitimate use cases — programming tutorials on YouTube, online lecture slides, recorded webinar Q&A, internal Loom walkthroughs — render normally and capture cleanly. The trick is knowing which tool to reach for once you've paused on the right frame.
Try Live Text first (and know exactly where it works)
If the video is playing inside Safari, Apple's Live Text is the easiest path. macOS detects text inside the paused video frame and lets you select it directly:
- Pause the video.
- Right-click on the paused frame.
- Choose Show Live Text (the menu item appears when Live Text has detected something selectable).
- Drag-select the text you want and copy it.
This works because Safari is built on WebKit, which Apple integrates tightly with the Vision framework. When you pause, WebKit hands the current frame to the system as a still image, Vision processes it, and the player overlays a selection layer. No extra app required, no hotkey to memorize.
The catch — and this is the part most articles bury — is that Live Text on video frames is Safari-only. Apple has not exposed this same hook to other browsers, and the GPU compositing path Chrome and Firefox use to render video bypasses the layer Live Text inspects. As of macOS Sonoma and Sequoia, paused video in Chrome, Firefox, Brave, Arc, and Edge does not get a Live Text affordance. Chrome on Mac can do Live Text on regular images, but not on the video element's current frame.
Where Live Text quietly fails
Once you step outside Safari, Live Text on video frames is gone. In our experience that covers the majority of real-world cases:
- Chrome, Firefox, Brave, Arc, Edge. Most people watch tutorials and webinars in Chrome. Live Text on video frames doesn't engage. Right-click gives you the standard browser menu.
- Native Mac players. VLC, IINA, MPV, and QuickTime Player render video through their own pipelines. Live Text doesn't reach into them.
- Zoom, Microsoft Teams, Webex meeting clients and recordings. Even when the recording renders normally, the meeting clients use custom rendering layers. Live Text isn't available.
- Loom playback in the desktop app. The web playback in Safari can sometimes work; the desktop app and Chrome playback don't.
- DRM-protected services. Netflix, Disney+, Apple TV+, HBO Max, Amazon Prime Video. The frame captures as black, so even if you switched to a different OCR tool, there's nothing to read.
This is not a knock on Live Text. For the cases it covers — Photos, Preview, Notes, Safari images and video — it's an excellent piece of system software. It's just not the right tool when your tutorial is in Chrome, your team's recording is in Loom, or the slides you need are on a Zoom replay.
The workflow that actually covers everything
The approach we settled on for our own use, and the one we built Cheese! OCR around, treats the OCR step as a system-level operation rather than a browser feature:
- Pause the video. Hit space. Give the player a beat to settle.
- Trigger the OCR hotkey. Default in Cheese! OCR is ⇧⌘E. The screen dims and a crosshair appears.
- Drag-select the region with the text. The terminal window in the tutorial, the bullet on the slide, the Q&A panel on the webinar — whatever you need.
- Paste. The recognized text is on your clipboard. ⌘V into your notes, your code editor, your Slack reply.
This works regardless of player because Cheese! OCR doesn't ask the browser, the video element, or the meeting client what they're showing. It uses the macOS screen capture API to grab pixels off the display and runs Apple Vision on those pixels locally. From the OS's perspective it's the same operation as ⇧⌘4 — anything that draws to the screen and isn't actively DRM-protected can be captured and OCR'd.
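As an illustration of that system-level path — not Cheese! OCR's actual implementation, just a minimal sketch of the same idea — here is how you might capture a screen region and run Apple Vision's text recognizer on it in Swift. The region coordinates are hypothetical stand-ins for the rectangle you'd drag-select:

```swift
import CoreGraphics
import Vision

// Hypothetical region: the part of the paused video you'd drag-select.
let region = CGRect(x: 400, y: 300, width: 800, height: 200)

// Grab pixels off the display, the same path a regular screenshot uses.
// (CGWindowListCreateImage is the classic API; newer code may use ScreenCaptureKit.)
guard let capture = CGWindowListCreateImage(region, .optionOnScreenOnly,
                                            kCGNullWindowID, .bestResolution) else {
    fatalError("Capture failed — a DRM-protected surface, for example, yields nothing")
}

// Run Vision's text recognizer on the captured bitmap, fully on-device.
let request = VNRecognizeTextRequest { request, _ in
    let observations = request.results as? [VNRecognizedTextObservation] ?? []
    let lines = observations.compactMap { $0.topCandidates(1).first?.string }
    print(lines.joined(separator: "\n"))   // ready to place on the clipboard
}
request.recognitionLevel = .accurate

let handler = VNImageRequestHandler(cgImage: capture, options: [:])
try handler.perform([request])
```

Because the capture and the recognition both happen locally, nothing in this path requires the player's cooperation or a network connection.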
Two practical consequences worth flagging. First, this is on-device. Apple Vision runs entirely on your Mac, the captured frame never leaves your machine, and Cheese! OCR has no network entitlements at all. That matters more than usual when you're reading from confidential meeting recordings or unreleased course material. Second, you keep a history. Cheese! OCR stores recent recognitions in a searchable list, so if you OCR'd four consecutive code snippets out of a tutorial, you don't have to lose the earlier ones to capture the latest.
Real use cases this fixes
Some workflows where this comes up most often, in our own usage and in user feedback:
Programming tutorials
YouTube, Udemy, Coursera, Egghead, Frontend Masters. The instructor types a snippet on screen and you want the exact code in your editor without retyping. Pause, OCR, paste. Apple Vision is good enough on modern programming-font screencasts that you usually only need to fix indentation and the occasional l/1 confusion in low-bitrate streams.
Online lecture slides
You're watching a recorded course and the lecturer is reading from a slide that has a definition, a formula, or a citation you need. The slide passes in 20 seconds and the next chapter starts. Pause, OCR the slide region, move on. Faster than scrubbing back and pausing again to retype.
Webinar Q&A and chat panels
The chat in a recorded webinar often contains the most useful question the speaker addressed live but didn't repeat. OCR the chat panel and you have the exact wording.
Bullet-point presentation videos
Conference talks on YouTube, internal all-hands recordings, sales kickoff videos. The speaker barrels through a slide with five bullets. You want the bullets in your meeting notes. OCR the slide, paste, done.
Loom and Zoom walkthroughs from teammates
Someone records a Loom showing how to configure a tool, with command-line snippets on screen. The recording is in Chrome or the Loom desktop app. Live Text doesn't help. A hotkey OCR tool does.
Workflow tips that actually move the needle
Use IINA for frame-by-frame stepping. IINA is a polished open-source video player for Mac built on MPV. The arrow keys step one frame at a time. When the text you want flashes on screen for less than a second, IINA lets you land exactly on the frame where the text is sharpest, then OCR that frame.
Pause at high-resolution moments. If a tutorial cuts between a wide shot of the speaker and a zoomed-in screen recording of code, OCR the zoomed-in frame. The same code in the wide shot is too small for the codec to preserve cleanly.
Zoom the player before pausing if the text is tiny. Most browsers honor ⌘+ on YouTube and other web players. A single zoom step often takes 12px caption text up to 18px and dramatically improves OCR accuracy.
Use the OCR history for multi-frame captures. Long code blocks frequently span more than one frame. Capture each frame as you reach it; Cheese! OCR keeps every recognition in a searchable list, so you can stitch them together in your editor afterward without losing intermediate captures.
Switch to closed captions when they exist. If the video has CC and you only need the spoken content, captions are always going to be more accurate than OCR'ing burned-in subtitles. OCR on video frames is the right tool for on-screen text the captions don't include — code, slide bullets, chat panels, screen-shared documents — not for replacing transcripts.
Caveats worth knowing before you trust the output
OCR on video frames is good, not magical. A few honest limits:
- Highly compressed video produces worse OCR than a screenshot. A 480p stream of a 1080p screen recording is going to lose strokes on small CJK characters and confuse O with 0. Capture the highest available quality.
- Tiny text may need a zoom pass first. If the video player is rendering 10px text on a 13" Retina display, OCR will struggle. Zoom the page or the player before pausing.
- Animated text needs a deliberate pause. If a callout slides in over half a second, pause when the animation completes, not while it's still moving.
- DRM-protected recordings can't be helped from the OCR side. If the screenshot comes back as a black rectangle, no OCR tool will rescue it. The fix is to get an unprotected version of the content.
None of these are dealbreakers. They're just the same trade-offs you'd hit with any screenshot OCR — slightly amplified by the fact that video is doubly compressed (once by the codec, once by whatever scaling the player applies). With a deliberate pause and a clean frame, modern Apple Vision OCR on a tutorial screencast is reliably good enough to skip retyping.
The summary, if you only remember one thing: video is just a sequence of pictures, so OCR works on it whenever you can capture a clean picture of it. Pause first, capture second, OCR third. The rest is choosing the right tool for the player you're in.