Feat/hzjh-AudioOps #496
Open
starlight6336 wants to merge 4 commits into
Audio Operators Final Overview
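This document describes the audio operators that were added and retained in the final version. It covers only each operator's responsibility, expected inputs and outputs, model usage, and runtime dependency boundaries. It does not include test steps, and it does not explain implementations that were added and later removed along the way.
General conventions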
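Operator type: Mapper operators; at runtime each one processes a single input sample and returns a single output sample.
Audio input: audio bytes in sample["data"]; when there are no upstream bytes, the operator reads sample["filePath"].
Analysis results: written to sample["ext_params"], while the current audio is kept wherever possible so that downstream operators can continue processing it.
Text results: written to sample["text"], usually exported as .txt.
Dependency directories: local_libs, bin, lib, site-packages.
Models: located under /models/AudioOperations/... and not bundled into the operator as a Python package.
Audio-to-text pipeline operators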
AudioFormatConvert
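Input: audio bytes in sample["data"].
Output: converted audio written back to sample["data"], with the target format recorded in sample["target_type"] for downstream LID/ASR.
Parameters: targetFormat, sampleRate, channels.
Dependencies: pydub==0.25.1, soundfile==0.12.1, numpy==2.2.6; the system ffmpeg is provided by the DataMate environment.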
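The input and output contract above, together with the sample["filePath"] fallback from the general conventions, can be sketched as follows. This is a minimal sketch and not the operator's code: the helper names resolve_audio_bytes and write_converted are hypothetical, and the actual conversion step (pydub and ffmpeg in the real operator) is deliberately left out.

```python
from pathlib import Path
from typing import Any, Dict


def resolve_audio_bytes(sample: Dict[str, Any]) -> bytes:
    """Prefer upstream bytes in sample["data"]; otherwise read sample["filePath"]."""
    data = sample.get("data")
    if data:
        return bytes(data)
    # Assumption: filePath points at a readable local audio file.
    return Path(sample["filePath"]).read_bytes()


def write_converted(sample: Dict[str, Any], converted: bytes,
                    target_format: str) -> Dict[str, Any]:
    """Store the converted audio and record its format for downstream LID/ASR."""
    sample["data"] = converted
    sample["target_type"] = target_format
    return sample
```

Keeping the two steps separate makes the fallback rule reusable by every operator that takes audio bytes as input.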
AudioGtcrnDenoise
Input: audio bytes in sample["data"].
Output: denoised audio written back to sample["data"]; sample["target_type"] is wav.
Parameters: modelPath, device, sampleRate, and other denoising-related settings.
Model: /models/AudioOperations/gtcrn/gtcrn.onnx.
Dependencies: onnxruntime==1.19.2, soundfile==0.12.1, numpy==2.2.6, scipy==1.13.1.
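AudioAnomalyFilter
Input: audio bytes in sample["data"].
Output: quality results written to sample["ext_params"]["audio_quality"]. Downstream audio operators can use quality_flag and skip_downstream to decide whether to soft-skip the sample.
Parameters: minDur, maxDur, silenceRatioTh, silenceRmsRatioTh, skipInvalidDownstream.
Dependencies: torchaudio==2.8.0, with soundfile==0.12.1 as a fallback.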
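The duration and silence checks implied by the parameters above might look like the sketch below. The exact rules are not stated in this document, so everything here is an assumption: the default thresholds, the frame length, the "ok"/"invalid" values for quality_flag, the reason strings, and the rule that a frame counts as silent when its RMS falls below a fraction of the loudest frame.

```python
import math
from typing import Any, Dict, List


def assess_quality(samples: List[float], sample_rate: int,
                   min_dur: float = 0.5, max_dur: float = 600.0,
                   silence_ratio_th: float = 0.8,
                   silence_rms_ratio_th: float = 0.1,
                   skip_invalid_downstream: bool = True,
                   frame_len: int = 400) -> Dict[str, Any]:
    """Flag clips that are too short, too long, or mostly silent (assumed rules)."""
    duration = len(samples) / sample_rate
    # RMS of each non-overlapping frame; the last frame may be shorter.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    rms = [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]
    peak = max(rms, default=0.0)
    # Assumption: "silent" means at or below a fraction of the loudest frame's RMS.
    silent = sum(1 for r in rms if r <= silence_rms_ratio_th * peak)
    silence_ratio = silent / len(rms) if rms else 1.0

    reasons = []
    if duration < min_dur:
        reasons.append("too_short")
    if duration > max_dur:
        reasons.append("too_long")
    if silence_ratio > silence_ratio_th:
        reasons.append("mostly_silent")
    return {
        "duration": duration,
        "silence_ratio": silence_ratio,
        "quality_flag": "ok" if not reasons else "invalid",  # hypothetical values
        "reasons": reasons,
        "skip_downstream": bool(reasons) and skip_invalid_downstream,
    }


def mark_sample(sample: Dict[str, Any], result: Dict[str, Any]) -> Dict[str, Any]:
    """Attach the result without touching the audio, so later operators can soft-skip."""
    sample.setdefault("ext_params", {})["audio_quality"] = result
    return sample
```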
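AudioFastLangId
Function: identifies the audio language as zh or en, which ASR uses to select a model automatically.
Input: audio bytes in sample["data"].
Output: the language is written to sample["ext_params"]["audio_lid"]["lang"]; __lid_zh or __lid_en can optionally be appended to the file name.
Parameters: modelSource, modelSavedir, device, batchSize, maxSeconds.
Model: /models/AudioOperations/lid/speechbrain_lang-id-voxlingua107-ecapa.
Dependencies: torch==2.8.0, torchaudio==2.8.0, speechbrain==1.0.3, HyperPyYAML==1.2.2.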
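How the detected language is recorded and how the file name is tagged can be illustrated with a small sketch. The functions record_lang and tag_file_name are hypothetical, and the sketch assumes the tag is inserted before the file extension. The language identification itself is not shown.

```python
from pathlib import PurePosixPath
from typing import Any, Dict

SUPPORTED = ("zh", "en")


def record_lang(sample: Dict[str, Any], lang: str) -> Dict[str, Any]:
    """Write the detected language to sample["ext_params"]["audio_lid"]["lang"]."""
    if lang not in SUPPORTED:
        raise ValueError(f"unsupported language: {lang}")
    ext = sample.setdefault("ext_params", {})
    ext.setdefault("audio_lid", {})["lang"] = lang
    return sample


def tag_file_name(file_name: str, lang: str) -> str:
    """Append __lid_<lang> to the stem (assumed to go before the extension)."""
    path = PurePosixPath(file_name)
    return str(path.with_name(f"{path.stem}__lid_{lang}{path.suffix}"))
```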
AudioFastLangIdText
Output: sample["text"] is set to zh or en and finally exported as .txt; the language is also written to sample["ext_params"]["audio_lid"]["lang"].
Parameters: modelSource, modelSavedir, device, batchSize, maxSeconds.
Model: /models/AudioOperations/lid/speechbrain_lang-id-voxlingua107-ecapa.
Dependencies: torch==2.8.0, torchaudio==2.8.0, speechbrain==1.0.3, HyperPyYAML==1.2.2.
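AudioAsrTranscribe
Input: audio bytes in sample["data"]; the upstream audio_lid.lang value can be read to choose the Chinese or English ASR model.
Output: the transcript is written to sample["text"] and finally exported as the .txt corresponding to the current input; run information is written to sample["ext_params"]["audio_asr_transcribe"].
Parameters: language, zhModelDir, enModelDir, device, mode, batchSize, maxSegmentSeconds, referenceTextPath, keepArtifacts.
Models: the Chinese default model directory is /models/AudioOperations/asr/aishell; the English default model directory is /models/AudioOperations/asr/librispeech.
Dependencies: torch==2.8.0, torchaudio==2.8.0, numpy==2.2.6, PyYAML==6.0.2, sentencepiece==0.2.1, loguru==0.7.3; wenet.bin.recognize must be provided by the DataMate runtime environment.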
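The model selection described above can be sketched as a small lookup. The function select_model_dir is hypothetical, and two behaviours are assumptions rather than documented facts: that language accepts an "auto" value meaning "use the upstream LID result", and that Chinese is the fallback when no LID result is present.

```python
from typing import Any, Dict

DEFAULT_ZH_MODEL_DIR = "/models/AudioOperations/asr/aishell"
DEFAULT_EN_MODEL_DIR = "/models/AudioOperations/asr/librispeech"


def select_model_dir(sample: Dict[str, Any], language: str = "auto",
                     zh_model_dir: str = DEFAULT_ZH_MODEL_DIR,
                     en_model_dir: str = DEFAULT_EN_MODEL_DIR) -> str:
    """Pick the ASR model directory from an explicit language or the LID result."""
    lang = language
    if lang == "auto":  # assumption: "auto" defers to the upstream LID operator
        lid = sample.get("ext_params", {}).get("audio_lid", {})
        lang = lid.get("lang", "zh")  # assumption: fall back to Chinese
    return en_model_dir if lang == "en" else zh_model_dir
```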
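AudioAsrPipeline
Output: the result is written to sample["text"] and exported as .txt; intermediate artifact paths, language, model, and the optional report are written to sample["ext_params"]["audio_asr"].
Parameters: doDenoise, denoiseModelPath, doAnomalyFilter, minDur, maxDur, silenceRatioTh, lidModelSource, lidDevice, lidMaxSeconds, maxSegmentSeconds, asrDevice, referencePath.
Dependencies: pydub, soundfile, numpy, onnxruntime, torch, torchaudio, speechbrain, wenet, and the system ffmpeg.
Emotion recognition, summarization, and classification operators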
AudioEmotionRecognize
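Input: audio bytes in sample["data"].
Output: results written to sample["ext_params"]["audio_emotion"]; an emotion tag can be appended to the final file name.
Model: under /models/AudioOperations/...
Dependencies: torch==2.8.0, torchaudio==2.8.0, numpy==2.2.6, and other runtime environment packages.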
AudioTextSummarize
Input: sample["text"]; when it is empty, a txt/md/json/jsonl text file can be read instead.
Output: the summary is written to sample["text"]; run details are written to sample["ext_params"]["audio_text_summarize"].
Parameters: method, maxSummaryCharsZh, maxSummaryWordsEn, minSummaryWordsEn, lineMode, preserveKeys, onnxModelDir, cpuThreads.
Model: bert_onnx, which reads the ONNX model and tokenizer under /models/AudioOperations/summary/summary-model.
Dependencies: numpy==2.2.6, jieba==0.42.1; ONNX mode additionally requires onnxruntime==1.19.2 and transformers==4.57.6.
AudioSoundClassify
Input: audio bytes in sample["data"].
Output: results written to sample["ext_params"]["audio_sound_classify"]; __sound_<macro_class> can be appended to the final file name.
Parameters: backend, astCheckpoint, pannsCheckpoint, astMacroMap, macroMap, device, topK, humanSpeechThreshold, segmentSeconds, hopSeconds.
Models: /models/AudioOperations/recog/audioset_10_10_0.4593.pth; the PANNs model /models/AudioOperations/panns/Cnn14_16k_mAP=0.438.pth is also supported for compatibility.
Dependencies: torch==2.8.0, torchlibrosa==0.0.4, timm==1.0.26, librosa==0.10.2.post1, numpy==2.2.6, soundfile==0.12.1, scipy==1.13.1, panns-inference==0.1.1.
Other industry single-function audio operators
AudioDcOffsetRemoval
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"]; sample["target_type"] is wav.
Dependencies: soundfile==0.12.1, numpy==2.2.6.
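DC offset removal is commonly done by subtracting the mean of the signal. The sketch below shows that idea on a plain list of float samples; whether the operator uses exactly this method is an assumption, and the function name remove_dc_offset is hypothetical.

```python
from typing import List


def remove_dc_offset(samples: List[float]) -> List[float]:
    """Subtract the mean so the waveform is centred on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [x - mean for x in samples]
```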
AudioHumNotch
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Parameters: freqHz, quality, harmonics, and other notch settings.
Dependencies: soundfile==0.12.1, numpy==2.2.6, scipy==1.13.1.
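To show how freqHz, quality, and harmonics could fit together, here is a pure-Python sketch of a cascaded biquad notch using the widely published Audio EQ Cookbook coefficients. It is an illustration only: the real operator depends on scipy, and both the default values and the reading of harmonics as "number of multiples of freqHz to notch, including the fundamental" are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Coeffs = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def notch_coeffs(freq_hz: float, sample_rate: int, quality: float) -> Coeffs:
    """Normalised biquad notch coefficients (Audio EQ Cookbook form)."""
    w0 = 2.0 * math.pi * freq_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * quality)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def biquad(samples: Sequence[float], b: Sequence[float], a: Sequence[float]) -> List[float]:
    """Direct form I second-order IIR filter."""
    out: List[float] = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def hum_notch(samples: Sequence[float], sample_rate: int, freq_hz: float = 50.0,
              quality: float = 30.0, harmonics: int = 1) -> List[float]:
    """Notch freq_hz and its multiples up to `harmonics` (assumed meaning)."""
    out = list(samples)
    for k in range(1, harmonics + 1):
        f = freq_hz * k
        if f >= sample_rate / 2:  # stop at the Nyquist frequency
            break
        b, a = notch_coeffs(f, sample_rate, quality)
        out = biquad(out, b, a)
    return out
```

A higher quality value narrows the notch, which preserves more of the neighbouring spectrum but makes the filter take longer to settle.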
AudioNoiseGate
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Dependencies: soundfile==0.12.1, numpy==2.2.6.
AudioPreEmphasis
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Dependencies: soundfile==0.12.1, numpy==2.2.6.
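Pre-emphasis is usually the first-order filter y[n] = x[n] - a * x[n-1]. The sketch below implements that standard form; the coefficient name coeff and its default of 0.97 are assumptions, since this document lists no parameters for the operator.

```python
from typing import List, Sequence


def pre_emphasis(samples: Sequence[float], coeff: float = 0.97) -> List[float]:
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    out: List[float] = []
    prev = 0.0
    for i, x in enumerate(samples):
        out.append(x if i == 0 else x - coeff * prev)
        prev = x
    return out
```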
AudioQuantizeEncode
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Dependencies: soundfile==0.12.1, numpy==2.2.6.
AudioRmsLoudnessNormalize
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Dependencies: soundfile==0.12.1, numpy==2.2.6.
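RMS loudness normalization scales the signal so its root-mean-square level matches a target. A minimal sketch under assumptions: the target_rms parameter and its default are hypothetical, samples are floats in [-1, 1], and the result is hard-clipped to that range as a simple safety measure.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of the signal (0.0 for an empty signal)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_normalize(samples: Sequence[float], target_rms: float = 0.1) -> List[float]:
    """Scale to the target RMS; silent input is returned unchanged."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    # Assumption: clip to [-1, 1] so the gain cannot overflow full scale.
    return [max(-1.0, min(1.0, x * gain)) for x in samples]
```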
AudioSimpleAgc
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Dependencies: soundfile==0.12.1, numpy==2.2.6.
AudioSoftPeakLimiter
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Dependencies: soundfile==0.12.1, numpy==2.2.6.
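One common way to limit peaks softly is to leave samples below a knee untouched and compress anything above it with a tanh curve. The sketch below shows that approach as an illustration; the operator's actual curve is not described here, and the threshold parameter and its default are assumptions.

```python
import math
from typing import List, Sequence


def soft_limit_sample(x: float, threshold: float = 0.8) -> float:
    """Linear below the threshold, tanh-compressed above it, never beyond 1.0."""
    mag = abs(x)
    if mag <= threshold:
        return x
    headroom = 1.0 - threshold
    limited = threshold + headroom * math.tanh((mag - threshold) / headroom)
    return math.copysign(limited, x)


def soft_peak_limit(samples: Sequence[float], threshold: float = 0.8) -> List[float]:
    """Apply the soft limiter to every sample."""
    return [soft_limit_sample(x, threshold) for x in samples]
```

The curve is continuous at the knee and has unit slope there, so material below the threshold is left bit-exact while peaks are rounded off instead of clipped.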
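AudioTelephonyBandpass
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Parameters: lowHz, highHz, order; the default passband is approximately 300-3400 Hz.
Dependencies: soundfile==0.12.1, numpy==2.2.6, scipy==1.13.1.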
AudioTrimSilenceEdges
Input: audio bytes in sample["data"].
Output: processed audio written back to sample["data"].
Parameters: frameMs, hopMs, threshDb, padMs.
Dependencies: soundfile==0.12.1, numpy==2.2.6.
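The four parameters above suggest a frame-based energy detector: slide a frame of frameMs over the signal every hopMs, treat frames whose RMS exceeds threshDb as active, and keep everything between the first and last active frame plus padMs of padding. The sketch below follows that reading. The defaults, the interpretation of threshDb as dB relative to full scale 1.0, and returning all-silent input unchanged are assumptions.

```python
import math
from typing import List, Sequence


def trim_silence_edges(samples: Sequence[float], sample_rate: int,
                       frame_ms: float = 25.0, hop_ms: float = 10.0,
                       thresh_db: float = -40.0, pad_ms: float = 50.0) -> List[float]:
    """Drop leading and trailing silence, keeping pad_ms around the active region."""
    frame = max(1, int(sample_rate * frame_ms / 1000))
    hop = max(1, int(sample_rate * hop_ms / 1000))
    pad = int(sample_rate * pad_ms / 1000)
    thresh = 10.0 ** (thresh_db / 20.0)  # dB relative to full scale -> linear RMS

    active_starts = []
    for start in range(0, max(1, len(samples) - frame + 1), hop):
        chunk = samples[start:start + frame]
        level = math.sqrt(sum(x * x for x in chunk) / len(chunk)) if chunk else 0.0
        if level > thresh:
            active_starts.append(start)

    if not active_starts:
        # Assumption: keep the audio as-is when nothing rises above the threshold.
        return list(samples)
    begin = max(0, active_starts[0] - pad)
    end = min(len(samples), active_starts[-1] + frame + pad)
    return list(samples[begin:end])
```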