What "on-device" actually means
"On-device OCR" is a phrase marketing copy throws around a lot, so it's worth being precise about what it means in the Apple Vision case. When an app on macOS asks Vision to recognize text in an image, the entire pipeline — image preprocessing, the convolutional and transformer networks that detect and decode glyphs, the language model that resolves ambiguity — runs on hardware physically inside your Mac. The CPU, the integrated GPU, and on Apple Silicon the Neural Engine, do all of the work. No bytes leave the machine.
Compare that to cloud OCR. Google Cloud Vision, AWS Textract, and Azure Computer Vision all expect you to ship the image off the machine, typically as a base64 blob over HTTPS. The image lands in the provider's data center, runs through their model, and the recognized text comes back. That round trip is bounded by the speed of light, your network, and the provider's queue. It also means the image is, however briefly, on someone else's hard drive — and subject to whatever logging, retention, and ML-training policies that vendor has at the moment.
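To make that round trip concrete, here is roughly what a cloud OCR call looks like from Swift, sketched against Google Cloud Vision's images:annotate endpoint. This is a sketch, not production code: "YOUR_API_KEY" is a placeholder, and real use needs error handling, retries, and quota management.

```swift
import Foundation

// Sketch of the cloud round trip: base64-encode the image, POST it to the
// provider, and wait for the recognized text to come back over the network.
// "YOUR_API_KEY" is a placeholder.
func cloudOCR(imageData: Data) {
    let url = URL(string: "https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = [
        "requests": [[
            "image": ["content": imageData.base64EncodedString()],  // the upload step
            "features": [["type": "TEXT_DETECTION"]]
        ]]
    ]
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)
    URLSession.shared.dataTask(with: request) { data, _, _ in
        // The recognized text arrives in the JSON response.
        if let data = data { print(String(decoding: data, as: UTF8.self)) }
    }.resume()
}
```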
On-device avoids all of that. There is no upload step, no network entitlement required for the app, no API key to manage, no quota to monitor, no per-call billing, and no privacy policy to read. The trade-off is that you are bound by the model the OS ships and the silicon you bought. Most Mac users find that trade-off easy to make for screenshot OCR.
A short history of Apple Vision Framework
Apple Vision launched at WWDC 2017 as part of iOS 11 and macOS 10.13 High Sierra. The first version handled face detection, rectangle detection, barcode reading, and text detection (finding where text sits in an image without reading it) — useful but not yet OCR. Real text recognition arrived with iOS 13 and macOS 10.15 Catalina in 2019, when Apple added VNRecognizeTextRequest. That was when "Vision can read text" stopped being a marketing claim and became something you could call from a few lines of Swift.
Live Text, the user-facing feature most people associate with Apple's on-device OCR, arrived two years later with iOS 15 and macOS Monterey. Live Text didn't replace the framework — it just put a UI in front of it. Underneath, the same VNRecognizeTextRequest call is doing the work whether you long-press a photo in Messages, hover over text in Safari, or run a third-party app like Cheese! OCR. Apple iterated on the recognition model heavily through macOS Ventura, Sonoma, and Sequoia, adding more languages and improving accuracy on each major release.
By 2026 the framework supports dozens of languages, including the CJK scripts that historically gave OCR engines the most trouble. Vision is now also used as a building block for Photos search ("show me photos with text"), VoiceOver image descriptions, and the Visual Look Up feature.
The VNRecognizeTextRequest API
The class developers actually call is VNRecognizeTextRequest. You create it with a completion handler, configure it, and hand it, along with the input image, to a VNImageRequestHandler; when the request completes, the completion handler receives an array of VNRecognizedTextObservation objects, each containing recognition candidates and a bounding box.
Two configuration choices matter most. The first is recognitionLevel, which has two modes: .fast uses a smaller model optimized for speed and is the right choice for live camera frames or scrolling video; .accurate uses a larger model and is the right choice for static screenshots and PDF pages. Most Mac OCR utilities, including Cheese! OCR, default to .accurate because the user is staring at a single image and a few hundred extra milliseconds is invisible.
The second is recognitionLanguages, an ordered array of BCP-47 language tags like "en-US", "zh-Hans", "ja-JP", "ko-KR". The order is a hint to the language model when it disambiguates similar shapes — for example, distinguishing the Latin letter "o" from the digit "0" or the Cyrillic "о". Passing English plus the three CJK tags is the right default for a general-purpose Mac OCR tool whose user might paste anything from a Notion screenshot to a Japanese manga panel into the same hotkey.
There's also usesLanguageCorrection, a Boolean that turns on the post-recognition language model. With it on, "teh" becomes "the" and recognized strings are smoothed against vocabulary statistics. With it off, the engine returns whatever it sees pixel-for-pixel, which is sometimes preferable for code, license plates, or product SKUs.
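Because the supported language list grows with each OS release, it's safer to ask the framework at runtime than to hard-code assumptions. A minimal sketch, using the instance method that requires macOS 12 or later (note that the list depends on the recognition level, since .fast has historically supported fewer languages than .accurate):

```swift
import Vision

// Query the OS for the languages this configuration can recognize.
// Requires macOS 12+ / iOS 15+; set the recognition level first, because
// the supported set can differ between .fast and .accurate.
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
if let languages = try? request.supportedRecognitionLanguages() {
    print(languages)  // BCP-47 tags, e.g. ["en-US", ..., "zh-Hans", "ja-JP", "ko-KR"]
}
```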
A minimal Swift example
Here is the smallest useful chunk of Swift that runs Apple Vision OCR on a CGImage and prints out the recognized strings. This is the same call path Cheese! OCR uses internally, stripped of error handling and UI plumbing.
```swift
import Vision

func recognizeText(in cgImage: CGImage) {
    // The completion handler receives the results once perform(_:) runs.
    let request = VNRecognizeTextRequest { request, error in
        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            return
        }
        // Take the single best candidate for each detected line of text.
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        print(lines.joined(separator: "\n"))
    }
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en-US", "zh-Hans", "ja-JP", "ko-KR"]
    request.usesLanguageCorrection = true
    // Bind the image and run the request synchronously.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```
A few things are worth noting. perform(_:) is synchronous and the completion handler fires on whatever queue called it, so a UI app would run the request on a background queue and dispatch back to the main thread before updating the clipboard or showing a toast. topCandidates(1) returns the top guess; you can ask for more if you want to surface ambiguous lines to the user. The VNImageRequestHandler takes a CGImage here, but it can also accept a CVPixelBuffer, CIImage, or file URL — useful when you're feeding it from a video frame or a PDF page.
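If you do ask for more than one candidate, each VNRecognizedText also carries a confidence score, which is useful for deciding what to surface. A sketch of that pattern (not Cheese! OCR's actual code):

```swift
import Foundation
import Vision

// Surface up to three candidates per line, each with the model's confidence
// score, then hop back to the main thread before touching any UI state.
let request = VNRecognizeTextRequest { request, _ in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    let lines = observations.map { observation in
        observation.topCandidates(3).map { (text: $0.string, confidence: $0.confidence) }
    }
    DispatchQueue.main.async {
        print(lines)  // hand off to the clipboard, a toast, or a review UI here
    }
}
```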
That's the whole API surface for the common case. There is no SDK to install, no model file to ship, no auth flow, no usage dashboard. The OS provides the model, the silicon runs it, and your app just calls a function.
Why the Neural Engine matters on Apple Silicon
The Neural Engine is the dedicated machine-learning accelerator inside every Apple Silicon chip. The M1 launched with 16 cores; M2 and M3 kept that core count; M4 brought architectural improvements to throughput and shared memory bandwidth. On an Apple Silicon Mac, Vision's text recognition models are compiled to run on the Neural Engine by default, with the GPU as a fallback for parts of the pipeline the Neural Engine can't accelerate.
The practical effect is twofold. First, OCR is fast — fast enough that a screenshot region of a few hundred pixels feels instantaneous when you press a hotkey. Second, OCR is energy-efficient. The Neural Engine is much more power-efficient per inference than the CPU or GPU, which means running OCR on hundreds of screenshots in a row doesn't noticeably drain the battery on a MacBook. The same hardware also accelerates Siri, Photos search, transcription in Voice Memos, and Apple Intelligence features — Vision is just one of many tenants.
On Intel Macs, the framework still works. It falls back to CPU and GPU paths and is somewhat slower, but for typical screenshot-sized images the difference is rarely felt by a human user. Vision was deliberately designed to remain usable across the entire Mac lineup, not just the latest hardware.
Privacy: why on-device matters
Privacy is the part of on-device OCR that most often gets glossed over. The technical claim is simple: when you OCR a screenshot using Apple Vision, the bytes of that screenshot never leave your Mac. Apple does not see them. Your network does not see them. The app vendor does not see them, even if their app is the one calling Vision.
That last point matters because "on-device OCR" is sometimes used loosely. An app can use Apple Vision and still phone home with the recognized text, telemetry, or screenshots for "training." The way to verify on-device behavior is to inspect the app's sandbox entitlements (the codesign command-line tool can dump them with its -d --entitlements flags). An app distributed through the Mac App Store must declare every network capability it uses; an app with no network entitlements at all cannot open a network connection of any kind, period — the operating system blocks them at the kernel level.
Cheese! OCR is in that latter category. It has no network entitlements, no telemetry SDK, no analytics, and no crash reporting service that ships data off the machine. The Vision call goes to the OS, the recognized text goes to the system clipboard, and that is the entire data flow. If you're using OCR for screenshots that contain confidential business documents, internal Slack threads, or personal correspondence, this matters quite a lot.
Apple Vision vs cloud OCR — comparison table
To put the trade-offs in one place, here is how Apple Vision compares to the three big cloud OCR APIs. Cloud pricing is given as a range because the major vendors restructure their tiers regularly and the exact number depends on volume, region, and feature flags; treat these as ballpark figures and check the vendor's pricing page for current numbers.
| Feature | Apple Vision | Google Cloud Vision | AWS Textract | Azure Computer Vision |
|---|---|---|---|---|
| Pricing | Free with macOS / iOS | Roughly $1.50 per 1,000 calls (after free tier) | Roughly $1.50 per 1,000 pages | Roughly $1.00 per 1,000 transactions |
| Languages | Dozens, including CJK + Latin scripts | 50+ printed languages, handwriting subset | Strong on English / Latin scripts; growing CJK | Broad coverage including CJK |
| Latency | Local; no network round trip | Network bound + processing | Network bound + processing | Network bound + processing |
| Privacy | 100% on-device, image never leaves Mac | Image uploaded to Google data center | Image uploaded to AWS data center | Image uploaded to Azure data center |
| Best at | Real-time UI OCR, screenshots, Live Text | Document AI, large-scale pipelines | Forms, tables, structured documents | Mixed enterprise OCR + image analysis |
The summary, for an end-user Mac app: Apple Vision wins on cost, latency, and privacy. The cloud APIs win when you need server-side batch processing, structured form parsing, or languages that Apple doesn't ship.
What Apple Vision still struggles with
Pretending Apple Vision is perfect would be dishonest, and the framework's limitations are worth knowing if you're picking an OCR strategy.
Cursive handwriting. Vision can sometimes pull characters out of neat block printing, but cursive English, handwritten kanji, or any free-form note-taking is hit or miss. If your workflow centers on scanning handwritten notes, Apple Vision alone won't be enough — a multimodal LLM or a specialized handwriting recognizer will do better.
Vertical CJK text. Japanese books, Chinese classical texts, and traditional posters often run text top-to-bottom, right-to-left. Vision's text detector is heavily biased toward left-to-right horizontal lines and frequently fragments vertical columns or reads them in the wrong order. There's no public configuration flag to tell it the text is vertical.
Severely skewed or warped text. Photos of book pages bent at the spine, signs photographed at sharp angles, or screenshots of zoomed perspective-projected text can confuse the detector. Vision does some deskewing but it's not as aggressive as some cloud services.
Decorative typography. Stylized fonts on game UI, calligraphy, vintage signage, or heavily kerned logo type degrade accuracy. The model is trained on common typefaces; the long tail of decorative ones is a known weak spot.
Math and chemistry notation. Equations, integrals, chemical structures, and any 2D layout that doesn't follow linear reading order get flattened to a linear string. For LaTeX or MathML output you need a specialized model — Mathpix, for instance, exists for exactly this reason.
None of these limitations mean on-device OCR is wrong for the common case. They just mean it isn't a universal solvent.
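One practical mitigation: Vision reports a confidence score for every recognized line, so an app can route low-confidence results (handwriting, vertical columns, decorative fonts) to manual review or a different engine. A sketch of that idea, where the 0.5 threshold is an arbitrary assumption:

```swift
import Vision

// Partition recognized lines by Vision's reported confidence so a caller
// can route shaky lines to review or another recognizer.
// The 0.5 cutoff is an arbitrary, illustrative threshold.
func splitByConfidence(_ observations: [VNRecognizedTextObservation],
                       threshold: Float = 0.5) -> (trusted: [String], suspect: [String]) {
    var trusted: [String] = []
    var suspect: [String] = []
    for observation in observations {
        guard let best = observation.topCandidates(1).first else { continue }
        if best.confidence >= threshold {
            trusted.append(best.string)
        } else {
            suspect.append(best.string)
        }
    }
    return (trusted, suspect)
}
```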
Why this matters for tools like Cheese! OCR
The reason any of this matters for a Mac OCR utility is that the framework choice determines what the app can and can't promise. A cloud-backed OCR app that markets "lightning-fast OCR" has a fundamental floor at network latency and a permanent privacy asterisk. A local-only OCR app built on Apple Vision inherits the framework's floor (very fast) and its ceiling (the languages and accuracy Apple ships).
Cheese! OCR is built directly on Apple Vision. Press the hotkey, drag-select a region of the screen, and the resulting CGImage is handed to VNRecognizeTextRequest with the four default languages enabled. The recognized text lands in the clipboard a moment later and gets logged to a local SQLite history database. There is no server. There are no network entitlements. There is no telemetry. The app is running the same model as Live Text, with a different — and frankly more useful — UI bolted on top.
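The general shape of that flow looks something like the sketch below. This is illustrative rather than the app's actual source; NSPasteboard is the standard macOS clipboard API.

```swift
import AppKit
import Vision

// Illustrative screenshot-to-clipboard flow: recognize text in a captured
// region, then copy the result so it's immediately pasteable.
func copyRecognizedText(from cgImage: CGImage) {
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        let text = observations
            .compactMap { $0.topCandidates(1).first?.string }
            .joined(separator: "\n")
        DispatchQueue.main.async {
            let pasteboard = NSPasteboard.general
            pasteboard.clearContents()                    // take ownership of the clipboard
            pasteboard.setString(text, forType: .string)  // text is now pasteable anywhere
        }
    }
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en-US", "zh-Hans", "ja-JP", "ko-KR"]
    try? VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
}
```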
That choice has implications a user can feel. The app works on a plane. It works on a corporate VPN that blocks half the internet. It works in a country where Google services are unreachable. It doesn't slow down when your home Wi-Fi flakes. And when you OCR a screenshot of a confidential contract or a private message, you don't have to think about which third party has now seen it.
If you're picking an OCR strategy for your own Mac, the framework conversation is the one that should come first. Cloud and local each have legitimate use cases. For everyday screenshot OCR on a personal machine, on-device with Apple Vision is hard to beat — and the questions below cover the ones we hear most often.