
For the transcription of audio recordings, I use the following command:

fish -c 'echo $fish_pid > ~/.kill_pid;exec arecord -f cd' | \
curl --url https://api.openai.com/v1/audio/transcriptions \
     --request POST \
     --header "Authorization: Bearer $OPENAI_API_KEY" \
     --header "Content-Type: multipart/form-data" \
     --form file=@-\;filename=f.wav \
     --form model=gpt-4o-mini-transcribe \
     --form response_format=text

Explanation

The first command of the pipe (fish -c ...) starts the audio recorder arecord, which records from the microphone in WAV format and streams the recorded audio to stdout (until it is manually killed by a SIGTERM using the PID stored in ~/.kill_pid).

The second command of the pipe (curl) uploads the audio data to OpenAI's transcription endpoint. The endpoint receives the data as a multipart/form-data form and, besides the actual data, expects a filename with a suffix (in my case: f.wav) so it can detect the format of the transmitted audio. Hence the curl parameter: --form file=@-\;filename=f.wav

The problem

The problem is that curl does not start transmitting/uploading the data until the pipe closes. It apparently insists on setting the Content-Length header, for which it needs to know the total size of the file data up front. But I want the data to be uploaded continuously while the audio is being recorded, so that I don't have to wait for the upload after the recording ends. (I don't care about the Content-Length header; I just want the audio data to be transmitted while it is being created/recorded.)
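For comparison, this is the behaviour I'm after. A minimal sketch using only Python's standard library streams a request body with Transfer-Encoding: chunked (no Content-Length) to a throwaway local socket server; everything here (the names, the dummy chunks) is made up for illustration, and whether a real endpoint accepts such an upload is of course up to the server:

```python
# Sketch only: a chunked POST from a generator, i.e. data leaves as it is
# produced, with no Content-Length. The "server" is a throwaway local socket
# that just records the raw bytes it receives.
import http.client
import socket
import threading

def audio_chunks():
    """Stand-in for a live recorder: yields data as it is 'captured'."""
    yield b"RIFF...."   # fake WAV header
    yield b"chunk-1"
    yield b"chunk-2"

def sink(server, captured):
    conn, _ = server.accept()
    data = b""
    while not data.endswith(b"0\r\n\r\n"):   # terminal chunk of a chunked body
        data += conn.recv(4096)
    captured.append(data)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
captured = []
t = threading.Thread(target=sink, args=(server, captured))
t.start()

# No Content-Length can be determined for a generator, so http.client falls
# back to Transfer-Encoding: chunked automatically (Python >= 3.6).
conn = http.client.HTTPConnection("127.0.0.1", server.getsockname()[1])
conn.request("POST", "/upload", body=audio_chunks())
resp = conn.getresponse()
t.join()
server.close()
raw = captured[0]
print(b"Transfer-Encoding: chunked" in raw)   # True
```

So streaming uploads per se are unproblematic at the HTTP level; it's curl's --form mode that insists on buffering.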

Is there a tool or a library that allows doing that?
(Or maybe just another magic parameter to curl?)

What didn't work out so far

  • httpie does not recognize - as a file specifier for stdin. Specifying /dev/stdin instead runs into two problems: (a) it does not understand the additional filename=f.wav specifier, and (b) exchanging that for type=audio/wav returns Request body (from stdin, --raw or a file) and request data (key=value) cannot be mixed.
  • The HTTP libraries of Python and Perl do not accept stdin for form-POSTing file data.
    (No matter what AI assistants may claim.)
  • In your test with HTTPie, you need to add --ignore-stdin to avoid the “cannot be mixed” error. Commented Oct 30 at 9:03
  • @StephenKitt Ending up with http: error: OSError: [Errno 29] Illegal seek - as you already pointed out in a comment to my other question. Thx. Commented 2 days ago
  • @Min-SooPipefeet It's not clear why you would need to go the stdin route if you're using Perl or Python, but yeah, they definitely do; I've written more than one Python HTTP client that took data from stdin. You do need to know how to read things from stdin and put them into the request. Anyway, the easier way here would seem to be to simply use any library that reads samples from your sound system in a loop and then sends them out, instead of trying to push audio samples through stdout/stdin. Commented 2 days ago

1 Answer


It seems you're headed down a dead end: you want to stream audio, as it is being recorded, to someone else's server, for them to transcribe it on the fly, and you assume that the other server will accept such a stream.

However, you have to work with the functionality that the other server actually offers.

OpenAI's HTTP transcription endpoint only supports a maximum file size of 25 MB. Your approach cannot work, no matter what trickery you apply to HTTP, what client you use, or how you parameterize that client. OpenAI simply does not offer streaming of arbitrary-length recordings via HTTP; that's all very explicitly documented (I literally just put https://api.openai.com/v1/audio/transcriptions into a search engine and found out in under 10 seconds that what you want to do is impossible. I think your research here might have been going in the wrong direction).

What it does support is streaming via WSS (WebSockets), which makes a whole lot more sense, because, as grawity wrote, the HyperText Transfer Protocol might not be the right thing to use if you're streaming unterminated data. There's literally a section titled "Streaming the transcription of an ongoing audio recording" in OpenAI's endpoint docs, which explains what you need to do. It's not what you've been doing, so a change in approach is necessary.

Also, for someone who's interested in streaming voice, you're sure using the worst possible audio format: a stereo, uncompressed, 44100 Hz sampled PCM stream. That makes each second of audio 1.346 Mibit large, which certainly makes your upload latency large, for absolutely no gain whatsoever: stereo doesn't help transcription, and the first thing that happens is that your audio stream gets resampled to 16 kHz by OpenAI before it is even fed into the model. (Ask yourself: have you read up on what you're using there, enough to understand its inputs and outputs and to know how it can be used? And have you read up on the -f cd option of arecord?)
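The arithmetic behind that figure, assuming the 16-bit samples that arecord -f cd produces:

```python
# Back-of-the-envelope check of the bandwidth figure above.
cd_rate = 44_100 * 2 * 16         # sample rate * channels * bits per sample
mono_16k = 16_000 * 1 * 16        # roughly what the model actually consumes
print(cd_rate)                    # 1411200 bit/s
print(round(cd_rate / 2**20, 3))  # 1.346 Mibit per second of audio
print(cd_rate / mono_16k)         # 5.5125: >5x more raw data than needed,
                                  # before any compression is even considered
```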

Really, this is a few lines of Python. (And read the documentation; letting an AI assistant write code for you has not served you well, has it? It has just converted your time into more work in dead-end directions, plus the time of real people actually answering your questions, instead of just writing something that sounds as if it were an answer, which is literally the definition of what an LLM does.):

  • open the connection to the endpoint via WebSockets, prepare for streaming data
  • open the audio device
  • set up a signal handler in case you really want to quit via SIGTERM or something; it would seem to me there'd be more elegant ways.
  • while (no signal handled): samples = get samples from audio system; compressed = encode as webm containing opus(samples); send to server (compressed);
  • finally, close the audio device and the connection properly.
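The loop in those steps can be sketched as follows. The actual audio capture, Opus/WebM encoding, and WebSocket send are stubbed out with placeholders; in a real client those parts would come from, say, sounddevice, an Opus encoder, and the websockets package (my assumptions, not names from OpenAI's docs):

```python
# Skeleton of the record-encode-send loop. All I/O components are stubs
# so the control flow can be seen (and run) in isolation.
import signal
import threading

stop = threading.Event()

def on_sigterm(signum, frame):
    # Matches the question's kill-by-PID workflow; nicer shutdowns exist.
    stop.set()

signal.signal(signal.SIGTERM, on_sigterm)

def stream(read_chunk, encode, send):
    """Read samples, encode, send -- until signalled or the source dries up."""
    sent = 0
    while not stop.is_set():
        samples = read_chunk()
        if samples is None:            # recorder closed
            break
        send(encode(samples))
        sent += 1
    return sent

# Stubs standing in for the real components:
chunks = [b"pcm-1", b"pcm-2", b"pcm-3"]
out = []
n = stream(
    read_chunk=lambda: chunks.pop(0) if chunks else None,
    encode=lambda b: b"opus:" + b,     # placeholder for real Opus encoding
    send=out.append,                   # placeholder for websocket send()
)
print(n)                               # 3 chunks sent
```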
  • AI classification: "The tone of the message is blunt, condescending, and corrective, with moments of technical thoroughness. - The degree of friendliness is low — the author shows little empathy or patience for the recipient’s misunderstandings and frequently uses phrasing that comes across as dismissive or belittling." Commented yesterday
  • Hi Min-Soo! I don't care in the least what my effect on an AI is, because an AI is not human and has neither understanding nor feelings. So, miss me with that nonsense! But! I do care about whether you are OK with this. I think I helped you a lot, because, well, you're working in a dead end? Commented yesterday
  • I agree with the AI classification. Commented yesterday
  • cool, why don't you just then say that? Getting an AI to "rate" a human's text is incredibly offensive, quite honestly. Commented yesterday
  • So, back on topic: Did this post help you find your way out of the dead end? Commented yesterday
