I have extracted the audio embeddings from the Google AudioSet corpus (https://research.google.com/audioset/dataset/index.html). Each embedding record contains a list of "bytes_list" features, similar to the following:
feature { bytes_list { value: "<128 bytes of quantized embedding data, printed as escaped binary>" } }
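For reference, this is roughly how I am reading these records (a minimal sketch in Python, assuming the frame-level feature files from the AudioSet download; the file path is a placeholder):

import tensorflow as tf

# Decode one AudioSet frame-level record into a (num_frames, 128) uint8 array.
# Each record is a tf.SequenceExample; each "audio_embedding" entry holds
# 128 bytes, i.e. one quantized 8-bit value per embedding dimension.
def decode_record(serialized):
    context, sequences = tf.io.parse_single_sequence_example(
        serialized,
        context_features={
            "video_id": tf.io.FixedLenFeature([], tf.string),
            "labels": tf.io.VarLenFeature(tf.int64),
        },
        sequence_features={
            "audio_embedding": tf.io.FixedLenSequenceFeature([], tf.string),
        },
    )
    embeddings = tf.io.decode_raw(sequences["audio_embedding"], tf.uint8)
    embeddings = tf.reshape(embeddings, [-1, 128])
    labels = tf.sparse.to_dense(context["labels"])
    return context["video_id"], labels, embeddings

dataset = tf.data.TFRecordDataset("bal_train/00.tfrecord")  # placeholder path
for raw in dataset.take(1):
    vid, labels, emb = decode_record(raw)
    print(vid.numpy(), labels.numpy(), emb.shape)  # e.g. (10, 128) for a 10 s clip

Printed raw (as in the snippet above) the bytes look garbled, but decoded this way each frame is just a 128-dimensional uint8 vector.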
From the documentation and forum discussions, I learnt that these embeddings are the output of a pretrained model (a CNN over MFCC-like features) applied to the 10-second chunks of the respective YouTube videos. I have also learnt that these embeddings make it easier to work with deep learning models. How do they help ML engineers?
My confusion is this: if these audio embeddings already come from a pretrained model, what is their utility? In other words, how can I use these embeddings to train more advanced models for Sound Event Detection?
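For example, is the intended workflow something like the following minimal sketch, where the embeddings are treated as frozen features and only a small classifier is trained on top? (X, Y, and the hyperparameters are illustrative stand-ins, not real data:)

import numpy as np
import tensorflow as tf

NUM_CLASSES = 527  # size of the AudioSet ontology

# Illustrative stand-in data: X holds dequantized embeddings with shape
# (clips, 10 frames, 128 dims) -- the uint8 values would first be scaled
# to floats, e.g. divided by 255 -- and Y holds multi-hot labels per clip.
X = np.random.rand(1000, 10, 128).astype("float32")
Y = (np.random.rand(1000, NUM_CLASSES) > 0.99).astype("float32")

# Small classifier trained on top of the frozen embeddings.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 128)),
    tf.keras.layers.GlobalAveragePooling1D(),          # pool frames over time
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),  # multi-label
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
model.fit(X, Y, batch_size=64, epochs=5, validation_split=0.1)

That is, is the idea that the expensive feature extraction has already been done for me, so I only need to train a lightweight model on top?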
submitted by /u/sab_1120