Speech corpus

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models, which can then be used with a speech recognition or speaker identification engine. In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

A corpus is one such database. Corpora is the plural of corpus (that is, several such databases).

There are two types of speech corpora:

Read speech – which includes:
- Book excerpts
- Broadcast news
- Lists of words
- Sequences of numbers
Spontaneous speech – which includes:
- Dialogs – conversations between two or more people (including meetings; one such corpus is the KEC);
- Narratives – a person telling a story (one such corpus is the Buckeye Corpus);
- Map tasks – one person explains a route on a map to another;
- Appointment tasks – two people try to find a common meeting time based on their individual schedules.
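
As a rough illustration of the definition and the two types listed above, a speech corpus can be modelled as a collection of records that each pair an audio file with its transcription. The sketch below is hypothetical: the names `Utterance`, `SpeechType`, `load_corpus` and `by_type`, as well as the file paths and transcripts, are invented for this example and do not come from any of the corpora mentioned in this article.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class SpeechType(Enum):
    """The two broad corpus types described in the article."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file paired with its text transcription."""
    audio_path: Path
    transcript: str
    speech_type: SpeechType


def load_corpus(rows):
    """Build Utterance records from (path, transcript, type) tuples.

    Transcripts are stripped of surrounding whitespace; the type string
    must be "read" or "spontaneous", otherwise ValueError is raised.
    """
    return [Utterance(Path(p), t.strip(), SpeechType(k)) for p, t, k in rows]


def by_type(corpus, speech_type):
    """Return only the utterances of the given speech type."""
    return [u for u in corpus if u.speech_type is speech_type]


# Hypothetical sample data; no audio files are actually opened.
corpus = load_corpus([
    ("audio/utt001.wav", "one two three four", "read"),
    ("audio/utt002.wav", "so um we turn left at the bridge ", "spontaneous"),
])
read_only = by_type(corpus, SpeechType.READ)
```

Real corpora usually carry far more metadata per recording (speaker, dialect, recording conditions, time alignment), so this structure is only a starting point.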

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.

See also

Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.

External links

Santa Barbara Corpus of Spoken American English
The Buckeye Corpus of Conversational Speech
The KEC: the Karl Eberhards Corpus of spontaneously spoken southern German in dialogues (audio and articulatory recordings)
Spoken Language Corpora at the Research Center on Multilingualism
The Spoken Turkish Corpus at METU Ankara
Spoken Corpus Klient with the Corp-Oral Corpus at ILTEC Lisbon
VoxForge – open source speech corpora
OLAC: Open Language Archives Community
BAS Bavarian Archive for Speech Signals
Simmortel Speech Recognition Corpus for Indian English and Hindi
ELRA: the European Language Resources Association
The PELCRA Conversational Corpus of Polish
The Arabic Speech Corpus
Corpus of Political Speeches: free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library
Large Multimodal Corpus of Human Speech
