How to Listen to Academic Papers (and Actually Retain Them)

8 min read · by Harkable

Listening to a research paper sounds great until you actually try it. You hit play on a 35-page PDF, the synthetic voice ploughs straight into "Figure 1 (a)," reads the entire reference list out loud, and ten minutes later you have absorbed nothing. Most people try this once, decide audio doesn't work for academic content, and go back to highlighting PDFs they will never re-read.

That's the wrong conclusion. Audio works fine for papers, but papers were not written to be read aloud. They were written for silent scanning, with figures, equations, and footnotes that communicate structurally rather than linearly. If you don't account for that, text-to-speech (TTS) is useless. If you do, you can move through substantially more papers per week than reading allows, and retain more of each one.

This guide is the workflow we've landed on after a few years of doing this in earnest. It covers what to strip out before you press play, how to actually listen so the material lands, the playback-speed sweet spot most people miss, and a four-step retention method that beats just re-reading.

Strip the paper before you press play

The single biggest reason TTS-on-papers fails is that nobody strips the paper first. A typical journal article contains roughly 20–30% text that is hostile to audio: figure captions that get read out mid-sentence, table contents, equation labels, bibliography entries, and "see Section 4.2.1" cross-references that go nowhere when you can't see the page.

Before converting, do a 90-second pass: copy the paper into a text editor, delete the references section, delete the figure and table contents (keep the captions, those are usually useful prose), collapse inline equations into a short verbal description in brackets (e.g. "[an equation defining loss as the sum of two terms]"), and remove anything in a sidebar or footnote unless it's load-bearing.

For most papers this takes a minute or two and cuts roughly a quarter of the length. The remaining 70–80% reads cleanly because it was already prose. If you're listening to a lot of papers in one field, build a small text-cleaner script; it's the highest-leverage 30 lines of Python you'll ever write.
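As a starting point, here is a minimal sketch of what such a cleaner could look like. It is not Harkable's cleaner, and it assumes the paper is already plain text, that the bibliography starts at a line reading "References" (or similar), that citations are numeric like "[12]", and that inline math sits between dollar signs. The name clean_paper and every pattern below are our own illustrative choices; expect to tune them for your field.

```python
import re

# Assumption: the bibliography begins at a line that is just a heading like
# "References" or "7. References". Everything from the last such line on is dropped.
REFERENCES_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(?:references|bibliography|works cited)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

# Parenthetical pointers such as "(see Section 4.2.1)" or "(Figure 1a)".
CROSS_REF = re.compile(
    r"\s*\((?:see\s+)?(?:Section|Sec\.|Figure|Fig\.|Table|Equation|Eq\.)"
    r"\s+[\w.]+(?:\s*\([a-z]\))?\)",
    re.IGNORECASE,
)

# Numeric citation markers such as "[12]" or "[3, 7-9]".
NUMERIC_CITATION = re.compile(r"\s*\[\d+(?:\s*[,\u2013-]\s*\d+)*\]")

# Inline LaTeX-style math between single dollar signs.
INLINE_MATH = re.compile(r"\$[^$\n]+\$")


def clean_paper(text: str) -> str:
    """Return a copy of `text` with the most audio-hostile parts removed."""
    headings = list(REFERENCES_HEADING.finditer(text))
    if headings:
        # Use the last match so a table-of-contents entry does not truncate the paper.
        text = text[: headings[-1].start()]
    text = CROSS_REF.sub("", text)
    text = NUMERIC_CITATION.sub("", text)
    # A generic placeholder; a hand-written description is better when you have time.
    text = INLINE_MATH.sub("[equation]", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text.strip()


if __name__ == "__main__":
    sample = (
        "We minimise $L = a + b$ over all runs (see Section 4.2.1) "
        "as in prior work [12, 13].\n\nReferences\n[12] Someone. A paper.\n"
    )
    print(clean_paper(sample))
```

Note that cutting at the references heading also drops any appendix printed after it, and that the "[equation]" placeholder is a blunt substitute for the short verbal descriptions suggested above. Treat the output as a first pass to skim, not a finished script.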

Voice and speed actually matter

For technical content, neutral voices outperform expressive ones. OpenAI's onyx, echo, and sage all work well; the more "characterful" voices (think anything described as warm or playful) are great for fiction and tiring for academic writing. You want a voice you stop noticing within thirty seconds.

Playback speed is the lever most people get wrong. The default feels too slow within five minutes, so people crank it to 2× and then comprehension cliff-dives on anything with dense terminology. The sweet spot for technical material is roughly 1.25×–1.4×. That's meaningfully faster than reading aloud, but slow enough that your brain can resolve unfamiliar terms in real time. Save 1.75× and 2× for second listens.
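To see what the speed setting buys you, a rough back-of-the-envelope calculation helps. The sketch below assumes the voice reads about 150 words per minute at 1×; that figure is our assumption, not a measured property of any particular voice, and listening_minutes is a hypothetical helper name.

```python
def listening_minutes(word_count: int, speed: float = 1.3, base_wpm: float = 150.0) -> float:
    """Estimate listening time in minutes.

    base_wpm is an assumed 1x speaking rate; measure your own voice and adjust.
    """
    if word_count < 0 or speed <= 0 or base_wpm <= 0:
        raise ValueError("word_count must be >= 0; speed and base_wpm must be > 0")
    return word_count / (base_wpm * speed)


if __name__ == "__main__":
    words = 9000  # an illustrative cleaned-paper length
    for speed in (1.0, 1.3, 2.0):
        print(f"{speed:.2f}x: {listening_minutes(words, speed):.0f} min")
```

Under that assumption, going from 1× to 1.3× saves about a quarter of the listening time, while the further jump to 2× saves less in absolute minutes than you might expect and costs far more comprehension.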

If you find yourself rewinding more than once or twice per minute, slow down. The whole point is to absorb the material, not to clear a queue.

Listen actively, not as background noise

Audio can be background-able for podcasts and fiction. It generally is not for technical papers. In our experience, you retain roughly half as much from a paper you "listened to while doing the dishes" as from one you listened to on a walk with no other input. Pick the listening context deliberately.

The best contexts in our experience are walking, commuting on transit (not driving, too cognitively expensive), light exercise like a stationary bike, and the dead 20 minutes before bed. The worst are working on something else, driving, and any environment with people talking nearby. If you only have a noisy environment available, save the paper for later and listen to something non-technical instead.

It's also worth taking voice notes when something lands. Holding a thought across a 30-minute listen without recording it is unrealistic: most of what struck you at minute 8 is gone by minute 22. A 10-second voice memo saved to a "papers" folder is enough.

A four-step retention loop

This is the loop that, in practice, has given us the best odds of actually remembering a paper a month later:

Step one, read the abstract on the page before listening. Two minutes. This loads the structural context: what the paper claims, roughly how, and what to listen for.

Step two, listen to the cleaned paper end to end at 1.3×, ideally on a walk. Don't pause to look things up; let it run. If something seems important, voice-memo a single sentence.

Step three, within 24 hours, write five sentences from memory: what's the claim, what's the method, what's the result, what would falsify it, what's the strongest objection. If you can't write them, you haven't absorbed the paper yet; listen to the discussion section again.

Step four, a week later, look at the figures you skipped. By now you have the verbal model in your head, and the figures land hard instead of being decorative.
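If you track papers in a script or notes tool, the two deadlines in the loop are easy to compute. This is only a sketch: retention_schedule and RECALL_PROMPTS are hypothetical names, and it reads "within 24 hours" as the next calendar day and "a week later" as seven days after the listen.

```python
from datetime import date, timedelta

# The five recall questions from step three.
RECALL_PROMPTS = (
    "What is the claim?",
    "What is the method?",
    "What is the result?",
    "What would falsify it?",
    "What is the strongest objection?",
)


def retention_schedule(listened_on: date) -> dict[str, date]:
    """Return the follow-up dates for steps three and four."""
    return {
        "write_recall_by": listened_on + timedelta(days=1),    # step three
        "review_figures_on": listened_on + timedelta(days=7),  # step four
    }


if __name__ == "__main__":
    schedule = retention_schedule(date.today())
    print("Write five sentences by:", schedule["write_recall_by"].isoformat())
    for prompt in RECALL_PROMPTS:
        print("  -", prompt)
    print("Look at the figures on:", schedule["review_figures_on"].isoformat())
```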

How Harkable solves this

Harkable was built specifically for documents like this: long PDFs that need to become MP3s you keep on your phone, listen to once or twice, and then archive. You pay roughly $0.40–$1.00 to convert a typical paper, not a $139/year subscription. The audio is a normal MP3 file, so it plays in your car, on a flight, in airplane mode, anywhere a music file plays. There is no app you have to open.

Our text cleaner already handles the common offenders (figure labels, equation tags, reference sections) so most papers don't need much pre-processing. You get 2 free MP3s every month forever, which is enough to test the workflow on real material before spending anything. If you want a deeper dive into the use case, we wrote one up at Harkable for researchers.

Try Harkable free

Upload a PDF, Word doc, or pasted text. Get an MP3 you keep forever. 2 free conversions every month. No subscription.