Voice Modeling Methods for Automatic Speaker Recognition

Stadelmann, Thilo

Title:	Voice Modeling Methods for Automatic Speaker Recognition
Author:	Stadelmann, Thilo
Contributors:	Freisleben, Bernd (Prof. Dr.)
Published:	2010
URI:	https://archiv.ub.uni-marburg.de/diss/z2010/0465
DOI:	https://doi.org/10.17192/z2010.0465
URN:	urn:nbn:de:hebis:04-z2010-04657
DDC:	Computer science
*Title (trans.):*	Methoden der Stimmmodellierung für die automatische Sprechererkennung
Publication date:	2010-08-02
License:	https://rightsstatements.org/vocab/InC-NC/1.0/

Keywords:
General didactics, Algorithm, Speaker recognition, MFCC, Method, Temporal Modeling, Automatic speaker recognition, GMM, Pattern recognition, Time Model, Eidetic Design, Artificial Intelligence: AI
Abstract:
Building a voice model means capturing the characteristics of a speaker's voice in a data structure. This data structure is then used by a computer for further processing, such as comparison with other voices. Voice modeling is a vital step in automatic speaker recognition, which itself is the foundation of several applied technologies: (a) biometric authentication, (b) speech recognition and (c) multimedia indexing. Several challenges arise in the context of automatic speaker recognition. First, there is the problem of data shortage, i.e., the unavailability of sufficiently long utterances for speaker recognition. It stems from the fact that the speech signal conveys different aspects of the sound in a single, one-dimensional time series: linguistic (what is said?), prosodic (how is it said?), individual (who said it?), locational (where is the speaker?) and emotional features of the speech sound itself (to name a few) are contained in the speech signal, as well as acoustic background information. To analyze a specific aspect of the sound regardless of the other aspects, analysis methods have to be applied to a specific time scale (length) of the signal in which this aspect stands out from the rest. For example, linguistic information (i.e., which phone or syllable has been uttered?) is found in very short time spans of only milliseconds in length. In contrast, speaker-specific information emerges more clearly the longer the analyzed sound is. Long utterances, however, are not always available for analysis. Second, the speech signal is easily corrupted by background sound sources (noise, such as music or sound effects). If present, their characteristics tend to dominate a voice model, so that model comparison may then be driven mainly by background features instead of speaker characteristics.
Current automatic speaker recognition works well under relatively constrained circumstances, such as studio recordings, or when prior knowledge about the number and identity of occurring speakers is available. Under more adverse conditions, such as in feature films or amateur material on the web, the achieved speaker recognition scores drop below a rate that is acceptable for an end user or for further processing. For example, the typical speaker turn duration of only one second and the sound effect background in cinematic movies render most current automatic analysis techniques useless. In this thesis, methods for voice modeling that are robust with respect to short utterances and background noise are presented. The aim is to facilitate movie analysis with respect to occurring speakers. Therefore, algorithmic improvements are suggested that (a) improve the modeling of very short utterances, (b) facilitate voice model building even in the case of severe background noise and (c) allow for efficient voice model comparison to support the indexing of large multimedia archives. The proposed methods improve the state of the art in terms of recognition rate and computational efficiency. Going beyond selective algorithmic improvements, subsequent chapters also investigate the question of what is lacking in principle in current voice modeling methods. By reporting on a study with human participants, it is shown that the exclusion of time coherence information from a voice model induces an artificial upper bound on the recognition accuracy of automatic analysis methods. A proof-of-concept implementation confirms the usefulness of exploiting this kind of information by halving the error rate. This result questions the general speaker modeling paradigm of the last two decades and points to a promising new direction. The approach taken to arrive at the previous results is based on a novel methodology of algorithm design and development called “eidetic design”.
It uses a human-in-the-loop technique that analyzes existing algorithms in terms of their abstract intermediate results. The aim is to detect flaws or failures in them intuitively and to suggest solutions. The intermediate results often consist of large matrices of numbers whose meaning is not clear to a human observer. Therefore, the core of the approach is to transform them to a suitable domain of perception (such as the auditory domain of speech sounds in the case of speech feature vectors) where their content, meaning and flaws are intuitively clear to the human designer. This methodology is formalized, and the corresponding workflow is explicated by several use cases. Finally, the use of the proposed methods in video analysis and retrieval is presented. This shows the applicability of the developed methods and the accompanying software library sclib by means of improved results using a multimodal analysis approach. The source code of sclib is available to the public upon request to the author. A summary of the contributions together with an outlook on short- and long-term future work concludes this thesis.
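
The notion of a voice model as a data structure that can be compared with other voices can be illustrated with a minimal sketch. It is not the method of the thesis or the API of sclib: it assumes a single diagonal-covariance Gaussian per speaker (a drastic simplification of the GMM named in the keywords), synthetic three-dimensional feature vectors standing in for real speech features such as MFCCs, and a symmetric Kullback-Leibler divergence as the comparison measure. All names (`VoiceModel`, `build_voice_model`, `model_distance`, `synthetic_frames`) are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VoiceModel:
    """Hypothetical voice model: one diagonal Gaussian over feature vectors."""
    mean: List[float]
    var: List[float]


def build_voice_model(frames: Sequence[Sequence[float]],
                      var_floor: float = 1e-6) -> VoiceModel:
    """Estimate per-dimension mean and variance from a speaker's frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so that the distance below never divides by zero.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
           for d in range(dim)]
    return VoiceModel(mean, var)


def model_distance(a: VoiceModel, b: VoiceModel) -> float:
    """Symmetric KL divergence, KL(a||b) + KL(b||a), of two diagonal Gaussians."""
    total = 0.0
    for ma, va, mb, vb in zip(a.mean, a.var, b.mean, b.var):
        diff2 = (ma - mb) ** 2
        total += 0.5 * (va / vb + vb / va - 2.0)
        total += 0.5 * diff2 * (1.0 / va + 1.0 / vb)
    return total


def synthetic_frames(center: Sequence[float], n: int,
                     rng: random.Random) -> List[List[float]]:
    """Assumed stand-in for real speech features: Gaussian noise around a center."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]


rng = random.Random(0)
# Two utterances of the same (synthetic) speaker and one of a different speaker.
speaker_a_1 = build_voice_model(synthetic_frames([0.0, 1.0, -1.0], 500, rng))
speaker_a_2 = build_voice_model(synthetic_frames([0.0, 1.0, -1.0], 500, rng))
speaker_b = build_voice_model(synthetic_frames([2.0, -1.0, 0.5], 500, rng))

same = model_distance(speaker_a_1, speaker_a_2)
different = model_distance(speaker_a_1, speaker_b)
print("same speaker:", same)
print("different speakers:", different)
```

In this toy setting the distance between two models of the same synthetic speaker is much smaller than the distance to the other speaker. Because such a model only stores per-dimension statistics, it discards the order of the frames, which is the kind of time coherence information the abstract argues is missing from conventional voice models.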

Bibliographie / References

T. Hofmann. Probabilistic Latent Semantic Analysis. In Proceedings of the 15 th Annual Conference in Uncertainty in Artificial Intelligence (UAI'99), pages 289–296, Stockholm, Sweden, July 1999.
M. Köppen. The Curse of Dimensionality. In Proceedings of the 5 th Online World Conference on Soft Computing in Industrial Applications (WSC5), Held on the internet, September 2000.
Bibliography LASG Forum. Explanations of Misconception in a Wavelet Analysis Pa- per by Torrence and Compo and in EMD-HHT Method by Huang. On- line web resource, September 2004. URL http://bbs.lasg.ac.cn/bbs/ thread-3380-1-1.html. Visited 05. March 2010.
G. Wang, A. V. Kossenkov, and M. F. Ochs. LS-NMF: A Modified Non-Negative Matrix Factorization Algorithm Utilizing Uncertainty Estimates. BMC Bioin- formatics, 7(175), March 2006a. URL http://bioinformatics.fccc.edu/ software/OpenSource/LS-NMF/lsnmf.shtml. Visited 18. March 2010.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufman Publishers, San Francisco, CA, USA, 2 nd edition, 2005.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, 2006.
P. Viola and M. J. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision, 57(2):137–154, 2004.
W.-H. Tsai and H.-M. Wang. Automatic Singer Recognition of Popular Music Recordings via Estimation and Modeling of Solo Voice Signals. IEEE Trans- actions on Audio, Speech, and Language Processing, 14:330–331, 2006.
M. Vlachos, G. Kollios, and D. Gunopulos. Discovering Similar Multidimensional Trajectories. In Proceedings of the 18 th International Conference on Data En- gineering (ICDE'02), pages 673–684, San Jose, CA, USA, February 2002.
K. Grauman and T. Darrell. Fast Contour Matching Using Approximate Earth Mover's Distance. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), pages 220–227, Wash- ington DC, USA, June 2004. IEEE.
K. J. Han and S. S. Narayanan. A Robust Stopping Criterion for Agglomera- tive Hierarchical Clustering in a Speaker Diarization System. In Proceedings of the 8 th Interspeech'07—Eurospeech, pages 1853–1856, Antwerpen, Belgium, August 2007. ISCA.
N. Srinivasamurthy and S. Narayanan. Language-Adaptive Persian Speech Recognition. In Proceedings of the 8 th European Conference on Speech Commu- nication and Technology (Eurospeech'03), pages 3137–3140, Geneva, Switzer- land, September 2003. ISCA.
W.-H. Tsai and H.-M. Wang. On the Extraction of Vocal-related Information to Facilitate the Management of Popular Music Collections. In Proceedings of the Joint Conference on Digital Libraries (JCDL'05), pages 197–206, Denver, CO, USA, June 2005.
J. Goldberger and H. Aronowitz. A Distance Measure Between GMMs Based on the Unscented Transform and its Application to Speaker Recognition. In Proceedings of the 9 th European Conference on Speech Communication and Technology (Interspeech'05–Eurospeech), pages 1985–1989, Lisbon, Portugal, September 2005. ISCA.
E. Levina and P. Bickel. The Earth Mover's Distance is the Mallows Distance: Some Insights from Statistics. In Proceedings of the 8 th IEEE International Conference on Computer Vision (ICCV'01), volume 2, pages 251–256, Van- couver, BC, Canada, July 2001. IEEE.
L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
T. Chi, P. Ru, and S. A. Shamma. Multiresolution Spectrotemporal Analysis of Complex Sounds. Journal of the Acoustical Society of America, 118(2):887–906, 2005.
J. P. Campbell. Speaker Recognition: A Tutorial. Proceedings of the IEEE, 85: 1437–1462, 1997.
S. Heinzl, M. Mathes, T. Friese, M. Smith, and B. Freisleben. Flex-SwA: Flexible Exchange of Binary Data Based on SOAP Messages with Attachments. In Proceedings of the IEEE International Conference on Web Services (ICWS'06), pages 3–10, Chicago, USA, September 2006. IEEE Press.
M. Heidt, T. Dörnemann, K. Dörnemann, and B. Freisleben. Omnivore: Integra- tion of Grid Meta-Scheduling and Peer-to-Peer Technologies. In Proceedings of the 8 th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'08), pages 316–323, Lyon, France, May 2008.
S. Guruprasad, N. Dhananjaya, and B. Yegnanarayana. AANN Model for Speaker Recognition Based on Difference Cepstrals. In Proceedings of the International Joint Conference on Neural Networks (IJCNN'03), pages 692–697, Portland, OR, USA, July 2003. IEEE.
W. Jiang, C. Cotton, S.-F. Chang, D. Ellis, and A. Loui. Short-Term Audio- Visual Atoms for Generic Video Concept Classification. In Proceedings of the ACM International Conference on Multimedia (ACMMM'09), pages 5–14, Bei- jing, China, October 2009. ACM. Best paper candidate.
L. Ferrer, H. Bratt, V. R. R. Gadde, S. Kajarekar, E. Shriberg, K. Sönmez, A. Stolcke, and A. Venkataraman. Modeling Duration Patterns for Speaker Recognition. In Proceedings of the 8 th European Conference on Speech Commu- nication and Technology (Eurospeech'03), pages 2017–2020, Geneva, Switzer- land, September 2003. ISCA.
E. Juhnke, D. Seiler, T. Stadelmann, T. Dörnemann, and B. Freisleben. LCDL: An Extensible Framework for Wrapping Legacy Code. In Proceedings of Inter- national Workshop on @WAS Emerging Research Projects, Applications and Services (ERPAS'09), pages 638–642, Kuala Lumpur, Malaysia, December 2009.
R. Ewerth, M. Mühling, T. Stadelmann, J. Gllavata, M. Grauer, and B. Freisleben. Videana: A Software Toolkit for Scientific Film Studies. In Proceedings of the International Workshop on Digital Tools in Film Studies, pages 1–16, Siegen, Germany, 2007b. Transcript Verlag.
T. Stadelmann, S. Heinzl, M. Unterberger, and B. Freisleben. WebVoice: A Toolkit for Perceptual Insights into Speech Processing. In Proceedings of the 2 nd International Congress on Image and Signal Processing (CISP'09), pages 4358–4362, Tianjin, China, October 2009.
T. Stadelmann and B. Freisleben. Unfolding Speaker Clustering Potential: A Biomimetic Approach. In Proceedings of the ACM International Conference on Multimedia (ACMMM'09), pages 185–194, Beijing, China, October 2009. ACM.
M. Mühling, R. Ewerth, T. Stadelmann, B. Shi, C. Zöfel, and B. Freisleben. University of Marburg at TRECVID 2007: Shot Boundary Detection and High- Level Feature Extraction. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'07). Available online, 2007b. URL http://www-nlpir. nist.gov/projects/tvpubs/tv.pubs.org.htm.
B. Yegnanarayana, K. S. Reddy, and S. P. Kishore. Source and System Fea- tures for Speaker Recognition using AANN Models. In Proceedings of the 26 th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), pages 409–413, Salt Lake City, UT, USA, May 2001. IEEE.
T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the 16 th International Conference on Machine Learning (ICML'99), pages 200–209, Bled, Slovenia, June 1999.
W. Verhelst and M. Roelands. An Overlap-Add Technique on Waveform Sim- ilarity (WSOLA) For High Quality Time-Scale Modification of Speech. In Proceedings of the 18 th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'93), volume 2, pages 554–557, Minneapolis, MN, USA, April 1993. IEEE.
J. Schmidhuber. Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, At- tention, Curiosity, Creativity, Art, Science, Music, Jokes. In Proceedings of the 12 th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, volume 1 of Lecture Notes in Artificial Intelligence 5177, pages 11–45, Zagreb, Croatia, 2008. Springer.
J. Foote. Visualizing Music and Audio Using Self-Similarity. In Proceedings of the 7 th ACM International Conference on Multimedia (ACMMM'99), pages 77–80, Orlando, FL, USA, October 1999. ACM.
M. Nishida and T. Kawahara. Speaker Indexing for News Articles, Debates and Drama in Broadcast TV Programs. In Proceedings of the IEEE International Bibliography Conference on Multimedia Computing and Systems (ICMCS'99), volume 2, pages 466–471, Los Alamitos, CA, USA, June 1999. IEEE.
B. T. Logan and A. J. Robinson. Enhancement and Recognition of Noisy Speech Within an Autoregressive Hidden Markov Model Framework Using Noise Es- timates from the Noisy Signal. In Proceedings of the 22 nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), vol- ume 2, pages 843–846, Munich, Germany, April 1997. IEEE.
T. M. Mitchell. Machine Learning. WCB/McGraw-Hill, USA, 1997.
D. W. Griffin and J. S. Lim. Signal Estimation from Modified Short-Time Fourier Transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32:236–243, 1984.
Y. Suna, M. S. Kamel, A. K. C. Wong, and Y. Wang. Cost-Sensitive Boosting for Classification of Imbalanced Data. Pattern Recognition, 40(12):3358–3378, December 2007.
C. Joder, S. Essid, and G. Richard. Temporal Integration for Audio Classification With Application to Musical Instrument Classification. IEEE Transactions on Audio, Speech, and Language Processing, 17:174–186, 2009.
E. Yilmaz and J. A. Aslam. Estimating Average Precision with Incomplete and Imperfect Judgments. In Proceedings of the 15 th ACM International Confer- ence on Information and Knowledge Management (CIKM'06), pages 102–111, Arlington, VA, USA, November 2006.
A. Ultsch. U * -Matrix: a Tool to Visualize Clusters in High Dimensional Data. Technical Report 36, University of Marburg, Dept. of Computer Science, DataBionics Research Lab, 2003c.
Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, and B. Xiang. The SuperSID Project: Exploiting High-Level Information for High-Accuracy Speaker Recognition. In Proceedings of the 28 th IEEE Interna- tional Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), volume 4, pages 784–787, Hong Kong, China, April 2003. IEEE.
G. Rigoll and S. Müller. Statistical Pattern Recognition Techniques for Multi- modal Human Computer Interaction and Multimedia Information Processing. In Proceedings of the International Workshop on Speech and Computer, pages 60–69, Moscow, Russia, October 1999. Survey paper.
G. Friedland, O. Vinyals, Y. Huang, and C. Müller. Prosodic and other Long- Term Features for Speaker Diarization. IEEE Transactions on Speech and Audio Processing, 17:985–993, 2009.
Z. Zhao and H. Liu. Searching for Interacting Features. In Proceedings of the 20 th International Joint Conference on Artificial Intelligence (IJCAI'07), pages 1156–1161, Hyderabad, India, January 2007.
M. Pardo and G. Sberveglieri. Learning From Data: A Tutorial With Emphasis on Modern Pattern Recognition Methods. IEEE Sensors Journal, 2(3):203– 217, 2002.
S. B. Davis and P. Mermelstein. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28:357–366, 1980.
S. S. Skiena. The Algorithm Design Manual. Springer, London, UK, 2 nd edition, 2008.
S. E. Tranter and D. A. Reynolds. An Overview of Automatic Speaker Diarization Systems. IEEE Transactions on Audio, Speech, and Language Processing, 14: 1557–1565, 2006.
Z. Goh, K.-C. Tan, , and B. T. G. Tan. Postprocessing Method for Suppress- ing Musical Noise Generated by Spectral Subtraction. IEEE Transactions on Speech and Audio Processing, 6:287–292, 1998.
R. Vogt, S. Sridharan, and M. Mason. Making Confident Speaker Verification Decisions with Minimal Speech. In Proceedings of the International Confer- ence on Spoken Language Processing (ICSLP Interspeech'08), pages 1405–1408, Brisbane, Australia, September 2008b. ISCA.
R. Typke, P. Giannopoulos, R. C. Veltkamp, FransWiering, and R. van Oost- rum. Using Transportation Distances for Measuring Melodic Similarity. In Proceedings of the 4 th International Conference on Music Information Retrieval (ISMIR'03), pages 107–114, Washington, D.C., USA, October 2003.
Y. Freund and R. E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
L. Saul and M. Rahim. Markov Processes on Curves for Automatic Speech Recog- nition. In Proceedings of the Conference on Advances in Neural Information Processing Systems II, pages 751–757. MIT Press, 1998.
V. Kartik, D. Srikrishna Satish, and C. Chandra Sekhar. Speaker Change Detec- tion using Support Vector Machines. In Proceedings of ISCA Tutorial and Re- search Workshop on Non-Linear Speech Processing (NOLISP'05), pages 130– 136, Barcelona, Spain, April 2005. ISCA.
R. C. Holte. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning, 11:63–90, 1993.
M. Przybocki and A. Martin. NIST Speaker Recognition Evaluation Chron- icles. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey'04), Toledo, Spain, May 2004. ISCA.
S. Furui. 50 Years of Progress in Speech and Speaker Recognition. In Proceedings of the 10 th International Conferences Speech and Computer (SPECOM'05), pages 1–9, Patras, Greece, October 2005.
C. G. M. Snoek, M. Worring, J. C. van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of the ACM International Confer- ence on Multimedia (ACMMM'06), pages 421–430, Sanat Barbara, CA, USA, October 2006. ACM.
A. F. Smeaton, P. Over, and W. Kraaij. Evaluation Campaigns and TRECVid. In Proceedings of the 8 th ACM International Workshop on Multimedia Infor- mation Retrieval (MIR'06), pages 321–330, Santa Barbara, CA, USA, October 2006. ACM.
W.-H. Tsai, S.-S. Chen, and H.-M. Wang. Automatic Speaker Clustering using a Voice Characteristic Reference Space and Maximum Purity Estimation. IEEE Transactions on Audio, Speech, and Language Processing, 15:1461–1474, 2007.
R. Weber, U. Ritterfeld, and K. Mathiak. Does Playing Violent Video Games Induce Aggression? Empirical Evidence of a Functional Magnetic Resonance Imaging Study. Media Psychology, 8:39–60, 2006.
R. Ewerth and B. Freisleben. Video Cut Detection Without Thresholds. In Proceedings of the 11 th Workshop on Signals, Systems and Image Processing (IWSSIP'04), pages 227–230, Poznan, Poland, September 2004. PTETiS.
T. Ganchev, N. Fakotakis, and G. Kokkinakis. Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task. In Proceedings of the 10 th International Conference on Speech and Computer (SPECOM'05), pages 191–194, Patras, Greece, October 2005.
S.-X. Zhang, M.-W. Mak, and H. M. Meng. Speaker Verification via High-Level Feature-Based Phonetic-Class Pronunciation Modeling. IEEE Transactions on Computers, 56(9):1189–1198, 2007.
H. A. Murthy and V. Gadde. The Modified Group Delay Function and its Ap- plication to Phoneme Recognition. In Proceedings of the 28 th IEEE Interna- tional Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), volume 1, pages 68–71, Hong Kong, China, April 2003. IEEE.
E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards Parameter-Free Data Mining. In Proceedings of the 10 th International Conference on Knowledge Discovery and Data Mining (KDD'04), pages 206–215, Seattle, USA, August 2004. ACM SIGKDD.
Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40: 99–121, 2000.
J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang. A For- mal Study of Shot Boundary Detection. IEEE Transactions on Circuits and Systems for Video Technology, 17(2):168–186, 2007.
T. Merlin, J.-F. Bonastre, and C. Fredouille. Non Directly Acoustic Process for Costless Speaker Recognition and Indexation. In Proceedings of the Interna- tional Workshop on Intelligent Communication Technologies and Applications, With Emphasis on Mobile Communications (COST 254), Neuchatel, Switzer- land, May 1999.
A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry. Feature Selection in Face Recog- nition: A Sparse Representation Perspective. Technical Report UCB/EECS- 2007-99, EECS Department, University of California, Berkeley, August 2007. Bibliography F. Yates. Contingency Table Involving Small Numbers and the χ 2 Test. Journal of the Royal Statistical Society, 1(2):217–235, 1934.
M. J. F. Gales, D. Y. Kim, P. C. Woodland, H. Y. Chan, D. Mrva, R. Sinha, and S. E. Tranter. Progress in the CU-HTK Broadcast News Transcription System. IEEE Transactions on Audio, Speech, and Language Processing, 14 (5):1513–1525, September 2006.
S. Salcedo-Sanz, A. Gallardo-Antolín, J. M. Leiva-Murillo, and C. Bousoño- Calzón. Offline Speaker Segmentation Using Genetic Algorithms and Mutual Information. IEEE Transactions on Evolutionary Computation, 10(2):175–186, 2006.
P. Deléglise, Y. Estève, S. Meignier, and T. Merlin. The LIUM Speech Transcrip- tion System: A CMU Sphinx III-based System for French Broadcast News. In Proceedings of the 9 th European Conference on Speech Communication and Technology (Interspeech'05–Eurospeech), Lisbon, Portugal, September 2005. ISCA. URL http://cmusphinx.sourceforge.net/. Visited 18. March 2010.
D. Wu, J. Li, and H. Wu. α-Gaussian Mixture Modelling for Speaker Recognition. Pattern Recognition Letters, 2009. doi: 10.1016/j.patrec.2008.12.013.
L. Mary and B. Yegnanarayana. Extraction and Representation of Prosodic Features. Speech Communication, 2008. doi: 10.1016/j.specom.2008.04.010.
B. Lindblom, R. Diehl, and C. Creeger. Do 'Dominant Frequencies' Explain the Listener's Response to Formant and Spectrum Shape Variations? Speech Communication, 2008. doi: 10.1016/j.specom.2008.12.003.
T. Kinnunen and H. Li. An Overview of Text-Independent Speaker Recognition: from Features to Supervectors. Speech Communication, 52:12–40, 2010. doi: 10.1016/j.specom.2009.08.009.
A. F. Smeaton, P. Over, and W. Kraaij. High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements. In A. Divakaran, editor, Multimedia Content Analysis, Theory and Applications, pages 151–174.
M. Mühling, R. Ewerth, T. Stadelmann, B. Freisleben, R. Weber, and K. Math- iak. Semantic Video Analysis for Psychological Research on Violence in Com- puter Games. In Proceedings of the ACM International Conference on Image Bibliography and Video Retrieval (CIVR'07), pages 611–618, Amsterdam, The Netherlands, July 2007a. ACM.
P. Macklin. EasyBMP: Cross-Platform Windows Bitmap Library . Online weg re- source, 2006. URL http://easybmp.sourceforge.net/. Visited 22. February 2010.
F. van der Heijden, R. P. W. Duin, D. de Ridder, and D. M. J. Tax. Classification, Parameter Estimation and State Estimation: An Engineering Approach using MATLAB R . John Wiley & Sons, West Sussex, England, 2004.
O. B. Tüzün, M. Demirekler, and K. B. Bakiboglu. Comparison of Parametric and Non-Parametric Representations of Speech for Recognition. In Proceedings of the 7 th Mediterranean Electrotechnical Conference (Melecon'94), volume 1, pages 65–68, Antalya, Turkey, April 1994.
R. Vogt and S. Sridharan. Minimising Speaker Verification Utterance Length through Confidence Based Early Verification Decisions. Lecture Notes in Com- puter Science, 5558/2009:454–463, 2009.
Bibliography A. Ramsperger. Strukturanalyse der Riboflavin Synthase aus Methanococcus jannaschii. PhD thesis, Technischen Universität München, München, Germany, December 2005.
P. Rose. Technical Forensic Speaker Recognition: Evaluation, Types and Testing of Evidence. Computer Speech and Language, 20:159–191, 2006.
S. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book (for HTK Version 3.3). Cambridge University Engineering Department, Cambridge, UK, 2005. URL http://htk.eng.cam.ac.uk/. Visited 18. March 2010.
V. Moschou, M. Kotti, E. Benetos, and C. Kotropoulos. Systematic Comparison of BIC-Based Speaker Segmentation Systems. In Proceedings of th 9 th Inter- national Workshop Multimedia Signal Processing (MMSP'07), Chania, Greece, October 2007.
T. Su and J. G. Dy. In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering. Intelligent Data Analysis, 11:319–338, 2007. Sun Developer Network. LiveConnect Support in the Next Generation Java TM Plug-In Technology Introduced in Java SE 6 update 10. Online web resource, 2010. URL http://java.sun.com/javase/6/webnotes/6u10/ plugin2/liveconnect/index.html. Visited 18. March 2010.
W.-H. Tsai, D. Rodgers, and H.-M. Wang. Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics. Computer Music Journal, 28(3):68–78, 2004.
B. Schouten, M. Tistarelli, C. Garcia-Mateo, F. Deravi, and M. Meints. Nineteen Urgent Research Topics in Biometrics and Identity Management. Lecture Notes in Computer Science, 5372:228–235, 2008.
L. Lu, H.-J. Zhang, and S. Z. Li. Content-Based Audio Classification and Segmen- tation by Using Support Vector Machines. Multimedia Systems, 8(6):482–492, April 2003.
Y. Ephraim, H. Lev-Ari, and W. J. J. Roberts. A Brief Survey of Speech En- hancement. In The Electronic Handbook. CRC Press, April 2005.
A. Haubold and J. R. Kender. Accomodating Sample Size Effect on Similarity Measures in Speaker Clustering. In Proceedings of the IEEE International Conference on Multimedia & Expo (ICME'08), pages 1525–1528, Hannover, Germany, June 2008. IEEE.
S. E. Bou-Ghazale and J. H. L. Hansen. A Comparative Study of Traditional and Newly Proposed Features for Recognition of Speech Under Stress. IEEE Transactions on Speech and Audio Processing, 8:429–442, 2000.
G. Fant. Acoustic Theory of Speech Production. Mouton & Co, The Hague, The Netherlands, 1960.
D. H. Klatt. A Digital Filterbank For Spectral Matching. In Proceedings of the 1 st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'76), pages 573–576, Philadelphia, PA, USA, April 1976. IEEE.
R. Sedgewick. Algorithms in C. Computer Science. Addison Wesley, December 1990.
M. K. Soenmez, L. Heck, M. Weintraub, and E. Shriberg. A Lognormal Tied Mixture Model of Pitch for Prosody-Based Speaker Recognition. In Proceed- ings of the 5 th European Conference on Speech Communication and Technology (Eurospeech'97), pages 1391–1394, Rhodes, Greece, September 1997. ISCA.
Y. Rubner, C. Tomasi, and L. J. Guibas. A Metric for Distributions with Ap- plications to Image Databases. In Proceedings of the 6 th IEEE International Conference on Computer Vision (ICCV'98), pages 59–66, Bombay, India, Jan- uary 1998. IEEE.
B. T. Logan and A. Salomon. A Music Similarity Function Based On Signal Analysis. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'01), pages 190–193, Tokyo, Japan, August 2001. IEEE. Bibliography L. Lu and H.-J. Zhang. Real-Time Unsupervised Speaker Change Detection. In Proceedings of the 16 th International Conference on Pattern Recognition (ICPR'02), volume 2, pages 358–261, Quebec City, Canada, August 2002.
M. F. Porter. An Algorithm for Suffix Stemming. Program, 14(3):130–137, July 1980.
J. W. Cooley and J. W. Tukey. An Algorithm for the Machine Calculation of Comples Fourier Series. Mathematics of Computation, 19(90):297–301, 1965.
Y. Linde, A. Buzo, and R. M. Gray. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications, 28(1):84–95, 1980.
G. Friedland. Analytics for experts. ACM SIGMM Records, 1(1), March 2009.
M. J. F. Gales and S. Young. An Improved Approach to the Hidden Markov Model Decomposition of Speech and Noise. In Proceedings of the 17 th IEEE Interna- tional Conference on Acoustics, Speech, and Signal Processing (ICASSP'92), volume 1, pages 233–236, San Francisco, CA, USA, March 1992. IEEE.
J. W. Sammon. A Nonlinear Mapping for Data Structure Analysis. IEEE Trans- actions on Computers, C-18(5):401–4009, May 1969.
M. Hegde, H. A. Murthy, and G. V. R. Rao. Application of the Modified Group Delay Function to Speaker Identification and Discrimination. In Proceedings of the 29 th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), volume I, pages 517–520, Montreal, QC, Canada, May 2004. IEEE.
Z. Tufekci, J. N. Gowdy, S. Gurbuz, and E. Patterson. Applied Mel-Frequency Wavelet Cofficients and Parallel Model Compensation for Noise-Robust Speech Recognition. Speech Communication, 48:1294–1307, 2006.
D. A. Reynolds and P. Torres-Carrasquillo. Approaches and Applications of Audio Diarization. In Proceedings of the 30 th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), volume 5, pages 953–956, Philadelphia, PA, USA, March 2005. IEEE.
G. Cybenko. Approximation by Superpositions of a Sigmoidal Function. Mathe- matics of Control, Signals, and Systems, 2:303–314, 1989.
W.-H. Tsai and H.-M. Wang. A Query-by-Example Framework to Retrieve Music Documents by Singer. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'04), pages 1863–1866, Taipei, Taiwan, June 2004. IEEE.
L. Cosmides and J. Tooby. Are Humans Good Intuitive Statisticians After All? Rethinking Some Conclusions from the Literature on Judgment under Uncer- tainty. Cognition, 58:1–73, 1996.
D. Talkin. A Robust Algorithm for Pitch Tracking (RAPT). In W. B. Klejin and K. K. Paliwal, editors, Speech Coding and Synthesis, chapter 3, pages 495–518. Elsevier Science, Amsterdam, NL, 1995.
B. Goertzel and C. Pennachin. Artificial General Intelligence. Springer, Berlin, Heidelberg, Germany, 2007.
S. Heinzl, D. Seiler, E. Juhnke, T. Stadelmann, R. Ewerth, M. Grauer, and B. Freisleben. A Scalable Service-Oriented Architecture for Multimedia Ana- lysis, Synthesis, and Consumption. International Journal of Web and Grid Services, 5(3):219–260, 2009b. Inderscience Publishers.
N. Otsu. A Threshold Selection Method from Gray Level Histograms. IEEE Transactions on Systems, Man and Cybernetics, 9:62–66, March 1979.
R. D. Patterson. Auditory Images: How Complex Sounds are Represented in the Auditory System. Journal of the Acoustical Society of Japan, 21(4):183–190, 2000.
R. Munkong and B.-H. Juang. Auditory Perception and Cognition. IEEE Signal Processing Magazine, pages 98–117, May 2008.
J. Foote. Automatic Audio Segmentation Using a Measure of Audio Novelty. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'00), volume 1, pages 452–455, New York, NY, USA, July 2000. IEEE.
I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2nd edition, December 2003.
H. Jin, F. Kubala, and R. Schwartz. Automatic Speaker Clustering. In Proceedings of the DARPA Speech Recognition Workshop, pages 108–111, 1997.
E. Jafer and A. E. Mahdi. A Wavelet-based Voiced/Unvoiced Classification Algorithm. In Proceedings of the 4th EURASIP Conference Focused on Video/Image Processing and Multimedia Communications (EC-VIP-MC'03), volume 2, pages 667–672, 2003.
S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 2001.
S. Heinzl, M. Mathes, and B. Freisleben. A Web Service Communication Policy for Describing Non-Standard Application Requirements. In Proceedings of the IEEE/IPSJ Symposium on Applications and the Internet (Saint'08), pages 40–47, Turku, Finland, July 2008a. IEEE Computer Society Press.
H.-J. Sacht. Bausteine für BASIC-Programme. Humboldt-Taschenbuchverlag, 1988.
D. A. Reynolds, E. Singer, B. A. Carlson, G. C. O'Leary, J. J. McLaughlin, and M. A. Zissman. Blind Clustering of Speech Utterance Based on Speaker and Language Characteristics. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98), pages 3193–3196, Sydney, Australia, November 1998. ISCA.
H. Nyquist. Certain Topics in Telegraph Transmission Theory. Proceedings of the IEEE, 90(2):280–305, February 2002. Reprint of classic paper from 1928.
A. V. Oppenheim and R. W. Schafer. From Frequency to Quefrency: A History of the Cepstrum. IEEE Signal Processing Magazine, pages 95–106, September 2004.
C. C. Sekhar and M. Palaniswami. Classification of Multidimensional Trajectories for Acoustic Modeling Using Support Vector Machines. In Proceedings of the International Conference on Intelligent Sensing and Information Processing (ICISIP'04), pages 153–158, Chennai, India, January 2004.
A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish. Clustering Speakers by Their Voices. In Proceedings of the 23rd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'98), pages 757–760, Seattle, WA, USA, May 1998. IEEE.
V. Gupta, P. Kenny, P. Ouellet, G. Boulianne, and P. Dumouchel. Combining Gaussianized/Non-Gaussianized Features to Improve Speaker Diarization of Telephone Conversations. IEEE Signal Processing Letters, 14(12):1040–1043, 2007.
F. Zheng, G. Zhang, and Z. Song. Comparison of Different Implementations of MFCC. Journal of Computer Science and Technology, 16:582–589, 2001.
Curriculum Vitae. Personal data: name Thilo Stadelmann; date of birth 02.06.1980 in Lemgo; contact thilo.stadelmann@gmail.com. Education: 10/2004–04/2010 Philipps-Universität Marburg, doctoral studies, doctorate (Dr. rer. nat.) with the grade "sehr gut" (very good).
C. A. Pickover. Computers, Pattern, Chaos, and Beauty: Graphics from an Unseen World. St. Martin's Press, New York, NY, USA, 1990.
C. A. Pickover and A. Khorasani. Fractal Characterization of Speech Waveform Graphs. Computers & Graphics, 10(1):51–61, 1986.
Y. Li, S. S. Narayanan, and C.-C. J. Kuo. Content-Based Movie Analysis and Indexing Based on Audiovisual Cues. IEEE Transactions on Circuits and Systems for Video Technology, 14:1073–1085, 2004.
J. Eulenberg and C. Wood. CSD 232—Descriptive Phonetics Spring Semester 2010. Online web resource, March 2010. URL https://www.msu.edu/course/asc/232/index.html. Visited 03. March 2010.
U. Küsters. Data Mining Methoden: Einordnung und Überblick. In H. Hippner, U. Küsters, M. Meyer, and K. Wilde, editors, Handbuch Data Mining im Marketing, Knowledge Discovery in Marketing Databases, pages 95–130. Vieweg, 2001.
MPEG 7 Requirement Group. Description of MPEG 7 Content Set. ISO/IEC JTC1/SC29/WG11/N2467, 1998.
L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs, NJ, USA, 1978.
S. W. Smith. Digital Signal Processing—A Practical Guide for Engineers and Scientists. Newnes, USA, 2003.
T. Stadelmann and B. Freisleben. Dimension-Decoupled Gaussian Mixture Model for Short Utterance Speaker Recognition. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR'10), accepted for publication, Istanbul, Turkey, August 2010a. IAPR.
A. S. Malegaonkar, A. M. Ariyaeeinia, P. Sivakumaran, and S. G. Pillay. Discrimination Effectiveness of Speech Cepstral Features. Lecture Notes in Computer Science, 5372:91–99, 2008.
F. Pachet and P. Roy. Exploring Billions of Audio Features. In Proceedings of the 5th International Workshop on Content-Based Multimedia Indexing (CBMI'07), pages 227–235, Bordeaux, France, June 2007. IEEE, Eurasip.
K. L. Kroeker. Face Recognition Breakthrough. Communications of the ACM, 52(8):18–19, August 2009.
R. Vogt, C. J. Lustri, and S. Sridharan. Factor Analysis Modelling for Speaker Verification with Short Utterances. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey'08), Stellenbosch, South Africa, January 2008a. ISCA.
T. Stadelmann and B. Freisleben. Fast and Robust Speaker Clustering Using the Earth Mover's Distance and MixMax Models. In Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06), volume 1, pages 989–992, Toulouse, France, April 2006. IEEE.
A. Erell and M. Weintraub. Filterbank-Energy Estimation Using Mixture and Markov Models for Recognition of Noisy Speech. IEEE Transactions on Speech and Audio Processing, 1(1):68–76, January 1993.
P. Rose. Forensic Speaker Identification. Taylor & Francis, London and New York, 2002.
M. Jessen and M. Jessen. Forensische Sprechererkennung und Tonträgerauswertung in Praxis und Forschung—Teil 2. Die Kriminalpolizei, 1:30–33, 2009.
P. Prandoni and M. Vetterli. From Lagrange to Shannon... and Back: Another Look at Sampling. IEEE Signal Processing Magazine, pages 138–144, September 2009.
L. R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Upper Saddle River, NJ, USA, 1993.
T. Munakata. Fundamentals of the New Artificial Intelligence: Neural, Evolutionary, Fuzzy and More. Springer, London, UK, 2nd edition, 2008.
A. Morris, D. Wu, and J. Koreman. GMM based Clustering and Speaker Separability in the TIMIT Speech Database. Technical Report Saar-IP-08-08-2004, Saarland University, 2004.
T. Thiruvaran, E. Ambikairajah, and J. Epps. Group Delay Features for Speaker Recognition. In Proceedings of the 6th International Conferences on Information, Communications and Signal Processing (ICICS'07), pages 1–5, Singapore, December 2007. IEEE.
A. P. Varga and R. K. Moore. Hidden Markov Model Decomposition of Speech and Noise. In Proceedings of the 15th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'90), pages 845–848, Albuquerque, NM, USA, April 1990. IEEE.
K. Jingqiu, L. Yibing, M. Zhiyong, and Y. Keguo. Improved Algorithm of Correlation Dimension Estimation and its Application in Fault Diagnosis for Industrial Fan. In Proceedings of the 25th Chinese Control Conference (CCC'06), pages 1291–1296, Harbin, Heilongjiang, China, August 2006.
R. E. Schapire and Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297–336, December 1999.
D. A. Keim. Information Visualization and Visual Data Mining. IEEE Transactions on Visualization and Computer Graphics, 7(1):100–107, January–March 2002.
R. C. Rose, E. M. Hofstetter, and D. A. Reynolds. Integrated Models of Signal and Background with Application to Speaker Identification in Noise. IEEE Transactions on Speech and Audio Processing, 2:245–258, 1994.
M. Ester and J. Sander. Knowledge Discovery in Databases: Techniken und Anwendungen. Springer, 2000.
S. Rahman. Kristallographische Methoden zur Digitalen Stimmanalyse: Die Stimme als Realstruktur (Teil I). In Proceedings of the 17th Annual Meeting of the German Association for Crystallography, pages 100–101, Hannover, Germany, March 2009a. Oldenbourg Verlag.
B. Schölkopf and A. J. Smola. Learning With Kernels. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Massachusetts, USA, 2002.
I. V. McLoughlin. Line Spectral Pairs. Signal Processing, 88:448–467, 2008.
P. A. Keating and C. Esposito. Linguistic Voice Quality. UCLA Working Papers in Phonetics, 105:85–91, 2007.
S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. Machine Learning: A Review of Classification and Combining Techniques. Artificial Intelligence Review, 26(3):159–190, November 2006.
F. Camastra and A. Vinciarelli. Machine Learning for Audio, Image and Video Analysis—Theory and Applications. Springer, Berlin, Germany, 2008.
S. Heinzl, D. Seiler, M. Unterberger, A. Nonenmacher, and B. Freisleben. MIRO: A Mashup Editor Leveraging Web, Grid and Cloud Services. In Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services (iiWAS'09), pages 15–22, Kuala Lumpur, Malaysia, December 2009c. ACM and OCG.
M. J. F. Gales. Model-Based Techniques for Noise Robust Speech Recognition. PhD thesis, Cambridge University, UK, 1996.
J. M. Martinez, R. Koenen, and F. Pereira. MPEG-7: The Generic Multimedia Content Description Standard, Part 1. IEEE MultiMedia, 9(2):78–87, April–June 2002.
H.-J. Zhang. Multimedia Content Analysis and Search: New Perspectives and Approaches. In Proceedings of the ACM International Conference on Multimedia (ACMMM'09), page 1, Beijing, China, October 2009. ACM. Keynote talk.
H. Jayanna and S. M. Prasanna. Multiple Frame Size and Rate Analysis for Speaker Recognition under Limited Data Condition. IET Signal Processing, 3(3):189–204, 2009.
K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1979.
A. Erell and D. Burshtein. Noise Adaptation of HMM Speech Recognition Systems Using Tied-Mixtures in the Spectral Domain. IEEE Transactions on Speech and Audio Processing, 5(1):72–74, January 1997.
B. A. Mellor and A. P. Varga. Noise Masking in a Transformed Domain. In Proceedings of the 18th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'93), volume 2, pages 87–90, Minneapolis, MN, USA, April 1993. IEEE.
M. Faúndez-Zanuy, G. Kubin, W. B. Kleijn, P. Maragos, S. McLaughlin, A. Esposito, A. Hussain, and J. Schoentgen. Nonlinear Speech Processing: Overview and Applications. International Journal of Control and Intelligent Systems, 30(2):1–10, 2002.
H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge University Press, Cambridge, UK, 2nd edition, 2004.
S. Kuroiwa, Y. Umeda, S. Tsuge, and F. Ren. Nonparametric Speaker Recognition Method Using Earth Mover's Distance. IEICE Transactions on Information and Systems, E89–D(3):1074–1081, March 2006.
W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, 1988.
D. M. J. Tax. One-Class Classification—Concept-Learning in the Absence of Counter-Examples. PhD thesis, Technische Universiteit Delft, The Netherlands, 2001.
G. Rilling, P. Flandrin, and P. Gonçalvès. On Empirical Mode Decomposition and its Algorithms. In Proceedings of the 6th IEEE/Eurasip Workshop on Nonlinear Signal and Image Processing (NSIP'03), Grado, Italy, June 2003.
D. Liu and F. Kubala. Online Speaker Clustering. In Proceedings of the 29th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), volume 1, pages 333–336, Montreal, QC, Canada, May 2004. IEEE.
Mozilla Developer Center. LiveConnect. Online web resource, 2009. URL https://developer.mozilla.org/en/LiveConnect. Visited 18. March 2010.
J. Dabkowski and A. Posiewnik. On Some Method of Analysing Time Series. Acta Physica Polonica B, 29(6):1791–1794, 1998.
R. T. Rato, M. D. Ortigueira, and A. G. Batista. On the HHT, its Problems, and Some Solutions. Mechanical Systems and Signal Processing, 22(6):1374–1394, 2008.
S. Kizhner, T. P. Flatley, N. E. Huang, K. Blank, and E. Conwell. On the Hilbert-Huang Transform Data Processing System Development. In Proceedings of the IEEE Aerospace Conference (IEEEAC'04), pages 1961–1979, 2004.
T. Stadelmann and B. Freisleben. On the MixMax Model and Cepstral Features for Noise-Robust Voice Recognition. Technical report, University of Marburg, Marburg, Germany, July 2010b.
D.-S. Kim. On the Perceptually Irrelevant Phase Information in Sinusoidal Representation of Speech. IEEE Transactions on Speech and Audio Processing, 9(8):900–905, November 2001.
L. Liu and J. He. On the Use of Orthogonal GMM in Speaker Recognition. In Proceedings of the 24th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'99), volume 2, pages 845–848, Phoenix, AZ, USA, March 1999. IEEE.
C. A. Pickover. On the Use of Symmetrized Dot Patterns for the Visual Characterization of Speech Waveforms and Other Sampled Data. Journal of the Acoustical Society of America, 80:955–960, 1986.
W. Verhelst. Overlap-Add Methods for Time-Scaling of Speech. Speech Communication, 30(4):207–221, April 2000.
A. Ultsch. Pareto Density Estimation: A Density Estimation for Knowledge Discovery. In Innovations in Classification, Data Science, and Information Systems - Proceedings 27th Annual Conference of the German Classification Society (GfKL'03), pages 91–100. Springer, 2003a.
A. Ultsch. Maps for the Visualization of High Dimensional Data Spaces. In Proceedings of the Workshop on Self Organizing Maps (WSOM'03), pages 225–230, Kitakyushu, Japan, September 2003b.
M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones, and T. R. Leek. Phonetic Speaker Recognition with Support Vector Machines. Advances in Neural Processing Systems, 16:1377–1384, 2004.
R. Vergin and D. O'Shaughnessy. Pre-Emphasis and Speech Recognition. In Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering (CCECE/CCGEI'95), volume 2, pages 1062–1065, Montréal, Canada, September 1995. IEEE.
A. Ultsch. Proof of Pareto's 80/20 Law and Precise Limits for ABC-Analysis. Technical Report 02/c, Databionics Research Group, University of Marburg, Marburg, Germany, 2002.
B. C. J. Moore. Psychology of Hearing, Fifth Edition. Elsevier Academic Press, London, UK, 2004.
T. Stadelmann, Y. Wang, M. Smith, R. Ewerth, and B. Freisleben. Rethinking Algorithm Development and Design in Speech Processing. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR'10), accepted for publication, Istanbul, Turkey, August 2010. IAPR.
J.-F. Bonastre, F. Bimbot, L.-J. Boë, J. P. Campbell, D. A. Reynolds, and I. Magrin-Chagnolleau. Person Authentication by Voice: A Need for Caution. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech'03), pages 33–36, Geneva, Switzerland, September 2003. ISCA.
D. A. Reynolds and R. C. Rose. Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing, 3:72–83, 1995.
R. Ewerth. Robust Video Content Analysis via Transductive Learning Methods. PhD thesis, University of Marburg, Marburg, Germany, 2008.
J. K. Shah, A. N. Iyer, B. Y. Smolenski, and R. E. Yantorno. Robust Voiced/Unvoiced Classification using Novel Features and Gaussian Mixture Model. In Proceedings of the 29th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), Montreal, QC, Canada, May 2004. IEEE.
H. Gish, M.-H. Siu, and R. Rohlicek. Segregation of Speakers for Speech Recognition and Speaker Identification. In Proceedings of the 16th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'91), volume 2, pages 873–876, Toronto, Canada, May 1991. IEEE.
Y. Peng, Z. Lu, and J. Xiao. Semantic Concept Annotation Based on Audio PLSA Model. In Proceedings of the ACM International Conference on Multimedia (ACMMM'09), pages 841–845, Beijing, China, October 2009. ACM.
A. Larcher, J.-F. Bonastre, and J. S. D. Mason. Short Utterance-based Video Aided Speaker Recognition. In Proceedings of the 10th IEEE Workshop on Multimedia Signal Processing (MMSP'08), pages 897–901, Cairns, Queensland, Australia, October 2008. IEEE.
J. W. Picone. Signal Modeling Techniques in Speech Recognition. Proceedings of the IEEE, 81(9):1215–1247, September 1993.
K. J. Han and S. S. Narayanan. Signature Cluster Model Selection for Incremental Gaussian Mixture Cluster Modeling in Agglomerative Hierarchical Speaker Clustering. In Proceedings of the 10th Annual Conference of the International Speech Communication Association (Interspeech'09), Brighton, UK, September 2009. ISCA.
Yang. Similar Speaker Recognition using Nonlinear Analysis. Chaos, Solitons and Fractals, 21(21):159–164, 2004.
M. Fink, M. Covell, and S. Baluja. Social- and Interactive-Television Applications Based on Real-Time Ambient-Audio Identification. In Proceedings of the 4th European Interactive TV Conference (Euro-ITV'06), Athens, Greece, May 2006. Best paper award.
D. R. Hill. Speaker Classification Concepts: Past, Present and Future. In C. Müller, editor, Speaker Classification I – Fundamentals, Features, and Methods, volume 4343 of LNAI, chapter 2, pages 21–46. Springer, 2007.
S. Zhang, W. Hu, T. Wang, J. Liu, and Y. Zhang. Speaker Clustering Aided by Visual Dialogue Analysis. In Proceedings of the 9th Pacific Rim Conference on Multimedia (PCM'08), volume 5353, pages 693–702, Tainan, Taiwan, December 2008. Springer.
B. Fergani, M. Davy, and A. Houacine. Speaker Diarization using One-Class Support Vector Machines. Speech Communication, 50:355–365, 2008.
A. N. Iyer, U. O. Ofoegbu, R. E. Yantorno, and B. Y. Smolenski. Speaker Distinguishing Distances: A Comparative Study. International Journal of Speech Technology, 10:95–107, 2007.
D. A. Reynolds. Speaker Identification and Verification using Gaussian Mixture Speaker Models. Speech Communication, 17:91–108, 1995.
D. E. Sturim, D. A. Reynolds, E. Singer, and J. P. Campbell. Speaker Indexing in Large Audio Databases using Anchor Models. In Proceedings of the 26th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), pages 429–432, Salt Lake City, UT, USA, May 2001. IEEE.
D. E. Sturim, W. M. Campbell, Z. N. Karam, D. A. Reynolds, and F. S. Richardson. The MIT Lincoln Laboratory 2008 Speaker Recognition System. In Proceedings of the 10th Annual Conference of the International Speech Communication Association (Interspeech'09), Brighton, UK, September 2009. ISCA.
M. Kotti, V. Moschou, and C. Kotropoulos. Speaker Segmentation and Clustering. Signal Processing, 88:1091–1124, 2008b.
D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, 10:19–41, 2000.
D. A. Reynolds, W. Andrews, J. P. Campbell, J. Navratil, B. Peskin, A. G.
Texas Instruments. Specifications for the Analog to Digital Conversion of Voice by 2 400 Bit/Second Mixed Excitation Linear Prediction. Draft, May 1998.
W. B. Kleijn and K. K. Paliwal. Speech Coding and Synthesis. Elsevier Science Inc., New York, NY, USA, 1995.
M. Hasegawa-Johnson and A. Alwan. Speech Coding: Fundamentals and Applications. In J. Proakis, editor, Wiley Encyclopedia of Telecommunications and Signal Processing. Wiley, 2002.
W. Wang, X. Liv, and R. Zhang. Speech Detection Based on Hilbert-Huang Transform. In Proceedings of the 1st International Multi-Symposium on Computer and Computational Sciences (IMSCCS'06), volume 1, pages 290–293, Hangzhou, Zhejiang, China, June 2006b.
D. Burshtein and S. Gannot. Speech Enhancement Using a Mixture-Maximum Model. IEEE Transactions on Speech and Audio Processing, 10:341–351, 2002.
H. Huang and J. Pan. Speech Pitch Determination Based on Hilbert-Huang Transform. Signal Processing, 86:792–803, 2006.
A. Nádas, D. Nahamoo, and M. A. Picheny. Speech Recognition Using Noise-Adaptive Prototypes. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:1495–1503, 1989.
M. Faúndez-Zanuy and E. Monte-Moreno. State-of-the-Art in Speaker Recognition. IEEE Aerospace and Electronic Systems Magazine, 20:7–12, 2005.
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, and L. Besacier. Step-by-Step and Integrated Approaches in Broadcast News Speaker Diarization. Computer Speech and Language, 20:303–330, 2006.
K. J. Han, S. Kim, and S. S. Narayanan. Strategies to Improve the Robustness of Agglomerative Hierarchical Clustering Under Data Source Variation for Speaker Diarization. IEEE Transactions on Audio, Speech, and Language Processing, 16:1590–1601, 2008.
–09/2004 Fachhochschule Giessen-Friedberg, studies in computer science; graduated as "Diplom-Informatiker (FH)" (grade 1.0)
H. Stöcker, editor. Taschenbuch Mathematischer Formeln und Moderner Verfahren. Verlag Harri Deutsch, Frankfurt am Main, Germany, 3rd revised and expanded edition, 1995.
D. A. Reynolds, W. M. Campbell, T. P. Gleason, C. B. Quillen, D. E. Sturim, P. A. Torres-Carrasquillo, and A. G. Adami. The 2004 MIT Lincoln Laboratory Speaker Recognition System. In Proceedings of the 30th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), volume 1, pages 177–180, Philadelphia, PA, USA, March 2005. IEEE.
R. Rifkin and A. Klautau. In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141, 2004.
D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition. Addison Wesley, USA, 1998.
I. Kokkinos and P. Maragos. Nonlinear Speech Analysis Using Models for Chaotic Systems. IEEE Transactions on Speech and Audio Processing, 13(6):1098–1109, November 2005.
D. G. Childers, D. P. Skinner, and R. C. Kemerait. The Cepstrum: A Guide to Processing. Proceedings of the IEEE, 65(10):1428–1443, October 1977.
W. M. Fisher, G. R. Doddington, and K. M. Goudie-Marshall. The DARPA Speech Recognition Research Database: Specification and Status. In Proceedings of the DARPA Speech Recognition Workshop, pages SAIC–86/1546, Palo-Alto, CA, USA, February 1986. DARPA.
Y. Rubner, L. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. In Proceedings of the DARPA Image Understanding Workshop, pages 661–668, New Orleans, LA, USA, May 1997. DARPA.
N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, C. C. Tung, and H. H. Liu. The Empirical Mode Decomposition and the Hilbert Spectrum for Nonlinear and Non-Stationary Time Series Analysis. Proceedings of the Royal Society London A, 454:903–995, 1998.
B. B. Mandelbrot. The Fractal Geometry of Nature. Henry Holt, updated edition, 2000.
S. Heinzl, M. Mathes, and B. Freisleben. The Grid Browser: Improving Usability in Service-Oriented Grids by Automatically Generating Clients and Handling Data Transfers. In Proceedings of the 4th IEEE International Conference on eScience, pages 269–276, Indianapolis, IN, USA, December 2008b. IEEE Press.
D. A. Reynolds and P. Torres-Carrasquillo. The MIT Lincoln Laboratory RT-04F Diarization Systems: Applications to Broadcast News and Telephone Conversations. In Proceedings of the NIST Rich Transcription Workshop (RT'04). NIST, November 2004.
K. Sjölander. The Snack Sound Toolkit. Online web resource, 2004. URL http://www.speech.kth.se/snack/. Visited 22. February 2010.
S. Kajarekar, N. Scheffer, M. Graciarena, E. Shriberg, A. Stolcke, L. Ferrer, and T. Bocklet. The SRI NIST 2008 Speaker Recognition Evaluation System. In Proceedings of the 34th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'09), pages 4205–4208, Taipei, Taiwan, April 2009. IEEE.
S. Heinzl, M. Mathes, T. Stadelmann, D. Seiler, M. Diegelmann, H. Dohmann, and B. Freisleben. The Web Service Browser: Automatic Client Generation and Efficient Data Transfer for Web Services. In Proceedings of the 7th IEEE International Conference on Web Services (ICWS'09), pages 743–750, Los Angeles, CA, USA, July 2009a. IEEE Press.
R. J. Niederjohn and J. A. Heinen. Understanding Speech Corrupted by Noise. In Proceedings of the IEEE International Conference on Industrial Technology (ICIT'96), pages P1–P5, Shanghai, 1996. IEEE.
L. Lu and H.-J. Zhang. Unsupervised Speaker Segmentation and Tracking in Real-Time Audio Content Analysis. Multimedia Systems, 10:332–343, 2005.
K. Thearling, B. Becker, D. DeCoste, B. Mawby, M. Pilote, and D. Sommerfield. Visualizing Data Mining Models. In U. Fayyad, G. G. Grinstein, and A. Wierse, editors, Information Visualization in Data Mining and Knowledge Discovery, pages 205–222. Morgan Kaufmann Publishers, San Francisco, CA, USA, 2001.
P. Ladefoged. Vowels and Consonants. Blackwell Publishing, 2nd edition, 2005.
S. Krstulovic and R. Gribonval. MPTK: Matching Pursuit Made Tractable. In Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06), volume 3, pages 496–499, Toulouse, France, May 2006. IEEE. URL http://mptk.irisa.fr/. Visited 18. March 2010.
D. Seiler, R. Ewerth, S. Heinzl, T. Stadelmann, M. Mühling, B. Freisleben, and M. Grauer. Eine Service-Orientierte Grid-Infrastruktur zur Unterstützung Medienwissenschaftlicher Filmanalyse. In Proceedings of the Workshop on Gemeinschaften in Neuen Medien (GeNeMe'09), pages 79–89, Dresden, Germany, September 2009.
D. Wu. Discriminative Preprocessing of Speech: Towards Improving Biometric Authentication. PhD thesis, Saarland University, Germany, 2006.
M. Kotti, E. Benetos, and C. Kotropoulos. Computationally Efficient and Robust BIC-Based Speaker Segmentation. IEEE Transactions on Audio, Speech, and Language Processing, 16:920–933, 2008a.
S. R. M. Prasanna, C. S. Gupta, and B. Yegnanarayana. Extraction of Speaker-Specific Excitation Information from Linear Prediction Residual of Speech. Speech Communication, 48:1243–1261, 2006.
B. Yegnanarayana and S. P. Kishore. AANN: An Alternative to GMM for Pattern Recognition. Neural Networks, 15(3):459–469, April 2002.
J. C. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In B. Schölkopf, C. J. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press, April 1999.
D. A. van Leeuwen, A. F. Martin, M. A. Przybocki, and J. S. Bouten. NIST and NFI-TNO Evaluations of Automatic Speaker Recognition. Computer Speech and Language, 20:128–158, 2006.
P. Schwarz. Phoneme Recognition based on Long Temporal Context. PhD thesis, Brno University of Technology, Czech Republic, 2009. URL http://speech.fit.vutbr.cz/en/software/phoneme-recognizer-based-long-temporal-context. Visited 18. March 2010.
S. Furui. Digital Speech Processing, Synthesis, and Recognition. Second Edition, Revised and Expanded. Marcel Dekker, New York, Basel, 2001.
S. Furui. 40 Years of Progress in Automatic Speaker Recognition. In Proceedings of the 3rd International Conference on Advances in Biometrics (ICB'09), pages 1050–1059, Sassari, Italy, June 2009. IAPR/IEEE.
M. Mühling, R. Ewerth, T. Stadelmann, B. Shi, and B. Freisleben. University of Marburg at TRECVID 2008: High-Level Feature Extraction. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'08). Available online, 2008. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
R. Ewerth, C. Behringer, T. Kopp, M. Niebergall, T. Stadelmann, and B. Freisleben. University of Marburg at TRECVID 2005: Shot Boundary Detection and Camera Motion Estimation Results. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'05). Available online, 2005. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
T. Stadelmann. Free Web Resources Contributing to the Movie Audio Classification Corpus. Online web resources, 2006. URL http://www.acoustica.com/sounds.htm; http://www.alcljudprod.se/english/ljud.php; http://nature-downloads.naturesounds.ca/; http://www.ljudo.com/; http://www.meanrabbit.com/wavhtml/wavepage.htm; http://www.partnersinrhyme.com/; http://www.stonewashed.net/sfx.html; http://www.soundhunter.com/. Visited 24. February 2010.
B. C. J. Moore. World of the mind: hearing. Online web resource, 1987. URL http://www.answers.com/topic/hearing-7. Visited 03. March 2010.
B. Neumann. C++ Implementierung der MD5-Prüfsumme. Online web resource, 2007. URL http://www.ben-newman.de/com/MD5.php. Visited 23. February 2010.
V. Dellwo, M. Huckvale, and M. Ashby. How is Individuality Expressed in Voice? An Introduction to Speech Production and Description for Speaker Classification. In C. Müller, editor, Speaker Classification I, Lecture Notes in Artificial Intelligence, chapter 1, pages 1–20. Springer, 2007.
Encyclopaedia Britannica. eidetic reduction. Online web resource, 2009. URL http://www.britannica.com/EBchecked/topic/180957/eidetic-reduction. Visited 19. February 2010.
C. Tomasi. Code for the Earth Movers Distance (EMD). Online web resource, 1998. URL http://www.cs.duke.edu/~tomasi/software/emd.htm. Visited 22. February 2010.
P. Delacourt and C. J. Wellekens. DISTBIC: A Speaker-Based Segmentation for Audio Data Indexing. Speech Communication, 32:111–126, 2000.
Globus Alliance. The Globus Toolkit Homepage. Online web resource, 2010. URL http://www.globus.org/toolkit/. Visited 23. February 2010.
M. Böhme. Using libavformat and libavcodec. Online web resource, 18. February 2004. URL http://www.inb.uni-luebeck.de/~boehme/using_libavcodec.html. Visited 19. February 2010.
NIST/SEMATECH. e-Handbook of Statistical Methods. Online web resource, 2003. URL http://www.itl.nist.gov/div898/handbook/. Visited 10. March 2010.
Linguistic Data Consortium. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. Online web resource, 1990. URL http://www.ldc.upenn.edu/Catalog/readme_files/timit.readme.html. Visited 24. February 2010.
E. de Castro Lopo. Secret Rabbit Code. Online web resource, 2010. URL http://www.mega-nerd.com/SRC/index.html. Visited 22. February 2010.
R. Hegger, H. Kantz, and T. Schreiber. Practical Implementation of Nonlinear Time Series Methods: The TISEAN Package. Chaos, 9:413–435, 1999. URL http://www.mpipks-dresden.mpg.de/~tisean/. Visited 18. March 2010.
Y. Hu and P. C. Loizou. Subjective Comparison of Speech Enhancement Algorithms. In Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06), volume 1, pages 153–156, Toulouse, France, May 2006. IEEE.
J. He. Jialong He's Speaker Recognition (Identification) Tool. Online web resource, 1997. URL http://www.speech.cs.cmu.edu/comp.speech/Section6/Verification/jialong.html. Visited 19. February 2010.
J.-M. Valin. Speex: A Free Codec For Free Speech. Online web resource, 2010. URL http://www.speex.org/. Visited 22. February 2010.
O. Parviainen. Time and Pitch Scaling in Audio Processing. Software Developer's Journal, 4, 2006. URL http://www.surina.net/soundtouch/. Visited 18. March 2010.
R. Ewerth, M. Mühling, and B. Freisleben. Self-Supervised Learning of Face Appearances in TV Casts and Movies. International Journal on Semantic Computing, 1(2):185–204, 2007a.
B. Fergani, M. Davy, and A. Houacine. Unsupervised Speaker Indexing Using One-Class Support Vector Machines. In Proceedings of the 14th European Signal Processing Conference (EUSIPCO'06), Florence, Italy, September 2006. Eurasip.
B. Milner and X. Shao. Clean Speech Reconstruction from MFCC Vectors and Fundamental Frequency using an Integrated Front-End. Speech Communication, 48:697–715, 2006.
B. Milner and X. Shao. Speech Reconstruction from Mel-Frequency Cepstral Coefficients using a Source-Filter Model. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP Interspeech'02), pages 2421–2424, Denver, CO, USA, September 2002. ISCA.

The document is freely accessible on the Internet - notes on usage rights