
Abstract







In speech technology and linguistics, corpus acquisition is a crucial and often costly process. Experiments require large amounts of speech data for analysis. Several kinds of corpus tools exist, and each has specific advantages and disadvantages, so the analyst should choose the tool that suits the speech corpus in order to gather the data without wasting time or incurring high cost. The focus of this project is therefore to build a usable web-based tool for collecting a ‘Read-Speech’ corpus. The system asks volunteers to read a text prompt that appears on the webpage and records their speech signal, together with a personal information profile and a description of the environment in which the recording is made. The experimenters determine the text to be read by the participants according to the type of speech corpus required.




Acknowledgments









First and foremost, I would like to thank * for helping me to finish this work and for giving me the patience to complete this dissertation.
Then, I would like to thank my supervisor Dr * for his enthusiasm, wise words and continuing support throughout this project. His wide knowledge and logical way of thinking have been of great value to me, and his understanding, encouragement and personal guidance have provided a good basis for the present project.
Additionally, many thanks to the University of * for being such a great university, offering anything a student might need, especially academic excellence.
Finally, I am very grateful to the K-R -G of I, which gave me this opportunity and supported me financially.






Chapter two: Background Review

To introduce the text-speech corpora tool, it is necessary to consider the significance of corpora tools and the domains in which they are used. Each domain plays a significant role in various scientific technologies, and several types of speech corpora have therefore been implemented. This chapter reviews previous work related to this project and describes the factors by which speech data are classified according to how they are collected.
The chapter is divided into several sections. Section 2.1 explains the term ‘corpora’ and defines related expressions used in this research, such as text corpora and speech corpora; it also discusses the structure of the text corpora and speech corpora used in previous tools, in speech technology and in linguistic study. Section 2.2 discusses the classification of speech data and the factors that influence speech data collection. Section 2.3 explains the types of speech recorded during data collection. Section 2.4 focuses on the methods of collecting data and the tools that have been implemented for this purpose. Section 2.5 points out the importance of text-speech corpora tools and identifies the fields and technologies in which they are used: in speech technology, speech corpora are used to create acoustic models, whereas in linguistics they are used for studying transcription, phonetics and conversation analysis in a specific language. Section 2.6 concentrates on some significant tools that have been implemented previously and on the impact of corpus tool development. Finally, the chapter closes with a section summarizing the evaluations and results obtained.


  2.1 Definition and Terminology



This section discusses the word ‘corpora’ and related expressions such as text corpus and speech corpus, because both are central to the study of speech recognition and linguistics in a particular language.
The word ‘corpus’ is the singular form of ‘corpora’. It is a Latin word meaning ‘body’. The word has many meanings, which vary with the field in which it is used. According to Farlex, Inc. (2004), in literary and literary-critical terms it means “a collection or body of writings”, whereas in linguistics it means “the body of data”. In life science and anatomy it denotes the main part of an organ or structure. In economics, accounting and finance, Farlex, Inc. (2004) gives it as “a capital or principal sum, as contrasted with a derived income”. The word is used for several other purposes as well.
The word ‘corpora’ has many scientific definitions, but all of them have nearly the same meaning. According to Crystal (1992), the best-known definition in linguistics is “A collection of linguistic data, either compiled as written texts or as a transcription of recorded speech”. The basic purpose of a corpus is to test a hypothesis about language, for instance to determine how the usage of a specific sound, word or syntactic construction differs (McArthur, 1992). This shows that corpora include both text and speech, so text corpora and speech corpora need to be introduced and discussed in more detail.


  2.1.1 Text Corpora


A text corpus comprises the texts that participants read while their speech is recorded. The type of corpus that will be collected, and the method of collecting it, therefore depend on these texts, and this section identifies the texts used to collect a speech corpus. Generally, the word ‘text’ covers words, phrases, sentences, numbers, codes and paragraphs. According to freetechexams.com (2005), “Text corpus is the technique which is used in linguistics and mainly it is used for the purpose of referring to the texts which had been stored and processed with the help of some electronically held”. Most linguists share this understanding: a text corpus is a large, structured set of texts that is usually stored and processed electronically.
There are open questions about how to select the text corpus in a text-corpora tool and about the purpose of the text provided by the tool. Some authors state that lexicographers, linguists and researchers may choose the texts to be read (Aijmer and Altenberg 1991). For instance, the recording page at http://www.voxforge.org/home/read is a good example of a tool that determines the text to be read by the participants. It is crucial to state how the text corpus is selected, because a read-speech (RS) corpus depends entirely on the text that has been chosen to elicit the sound recorded for speech technology and linguistics. The structure of the text therefore affects the speech that will be collected, because text structure differs and changes from language to language. For instance, the structure of an English text is shown in Appendix A to explain the classification of text in this language; according to previous studies, most languages have a broadly similar structure.
Beyond the structure of the text, some atomic features also influence the type of tool and the researchers' results, for instance the gender and age of the author, the period and place of publication of the text, and the language variety (Holmes-Higgin et al., 2004).


  2.1.2 Speech Corpus


A speech corpus is also called a ‘spoken corpus’. As with the word ‘corpus’ itself, several expressions define it. A speech corpus is a collection of speech preserved as captured audio. Such collections are helpful for linguistic studies and for developing speech software (wiseGEEK, 2003). Furthermore, Gibbon, Moore and Winski (1997) stated that “any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organizations” is known as a spoken corpus. The term is also defined by Crystal (1991) as “a collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language”. This definition shows that a speech corpus contains a database of speech audio files and text transcriptions, and it has been repeated and confirmed in many other references. Generally, speech corpora are used in two areas. In speech technology, they support speech recognition. In linguistics, spoken corpora are used to study phonetics, conversation analysis, dialectology and other related fields (McEnery & Wilson, 1996). The rest of this section discusses the advantages of speech corpora and some of the ways in which they are used.
In general, the purpose of using speech corpora, such as spontaneous speech corpora, is to provide real-life examples of a particular language and to serve as a useful tool for accessing the language as it is actually used. In addition, digitized speech data facilitate the pedagogic adaptation and analysis of the corpus and the development of language teaching materials. A speech corpus is thus a resource that helps learners who intend to learn a foreign language; the method includes designing, compiling and annotating corpora. As a result, speech corpora have been used as example material in the classroom as well as for examining learners (Moreno-Sandoval, Campillos, Dong et al., 2012).
Moreover, Trippel (2008) indicated that speech corpora are used in several areas of linguistics, for example for research and development in phonetics and phonology, morphology, lexicography, syntax, semantics, pragmatics and sociolinguistics in a particular language. They also allow a language to be studied through quantitative statements, for example by counting the words in a language. In phonetics, speech corpora help researchers to deal with real speech sounds, their physical properties, the possible combinations of sounds, and transcription tasks.
Corpora may also be used for research in another direction. As discussed, speech corpora are also relied on in speech technology. Gibbon et al. (1997) summarized the information sources that are commonly present in a speech corpus, such as transduced signals (i.e. the acoustic speech signal), assessment parameters as assessment results, descriptors such as a profile of the speakers, and some additional general information. Speech corpora are also used for constructing the acoustic models of speech recognition engines.
For these reasons, this report discusses the development of a web-based tool for collecting a large speech corpus by having volunteers read text. As Gibbon et al. (1997) pointed out, a speech corpus depends on the profile of the speakers who participate as volunteers (such as gender, age and place of birth), the place of recording, the type of microphone, and the type of speech material. The system should be developed according to these conditions and should be able to retrieve this information to help the researcher when analyzing the data.
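The conditions listed above can be pictured as one metadata record stored alongside each recording. The following is only a minimal sketch under stated assumptions: the class names, field names and sample values are hypothetical and are not taken from the tool described in this report.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SpeakerProfile:
    # Speaker descriptors named in the text; the field names are hypothetical.
    gender: str
    age: int
    place_of_birth: str


@dataclass
class RecordingRecord:
    # One read-speech recording plus the conditions under which it was captured.
    prompt_text: str
    audio_file: str
    speaker: SpeakerProfile
    recording_place: str   # e.g. "home", "office", "studio"
    microphone_type: str   # e.g. "headset", "built-in"
    material_type: str     # e.g. "sentence", "digits", "word list"


def to_json(record: RecordingRecord) -> str:
    """Serialize a record so it can be stored next to the audio file."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
record = RecordingRecord(
    prompt_text="Please read this sentence aloud.",
    audio_file="rec_0001.wav",
    speaker=SpeakerProfile(gender="female", age=29, place_of_birth="London"),
    recording_place="home",
    microphone_type="headset",
    material_type="sentence",
)
print(to_json(record))
```

Keeping the speaker profile as a nested structure means the same profile can be reused across all recordings made by one volunteer, which is one possible design among several.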


  2.2 Data Speech Classifications



The previous section established that text-speech corpora include both speech and text corpora, stored in a database that holds the text and the captured audio. There are several techniques for collecting the data from which corpora are created, and the type of corpus is classified according to the system used to accumulate the speech data and associated information.
According to Vianaz et al. (1998), speech data collection is divided into several types, and this classification is generally determined by factors related to the method by which the data are collected. This section investigates some of these factors that affect the process of collecting data.


  2.2.1 Data speech clarity or visibility


There is no doubt that data clarity and data visibility are significant for collecting any type of data. In general, data clarity is a systematic approach to continuously improving and maintaining the health of data. The analysis of a data corpus focuses on the clarity of diction in the recorded audio. Moreno-Sandoval et al. (2012) support the view that clarity of diction affects the analysis of the data. The clarity and visibility of data in a corpus are measured by the speed of the data and the profile of the person who participates in the data collection; a tool that captures high-quality audio can therefore be developed to assist the analyst.
Data clarity or visibility is also directly related to data policy, that is, whether the recipient has permission to use the data or the data are kept confidential. For instance, NeoSpeech, Inc. recognizes that privacy is important. Its privacy policy applies to participants' use of the NeoSpeech Text-To-Speech Web Service, subject to the web service terms of use. By using the NS TTS web service, participants allow their personal information to be collected and used (NeoSpeech, 2011).


  2.2.2 Data speech environment


The quality of the collected data depends on the environment in which the data are recorded. For example, the data may be collected in a laboratory, in a studio or over the telephone. The speech may be recorded in an anechoic room, in private homes or offices, or elsewhere. The location certainly affects the speech data (Vianaz et al., 1998).
Work is ongoing to find ways of using standard speech corpora for development and evaluation purposes. Aijun, Xiaoxia, Guohua et al. (2000) focused on the method and the environment used to collect a corpus according to the goal of the analysis; for instance, the casual speech that they gathered and annotated was recorded in ordinary rooms. The use of standard corpora is one of the major factors behind the progress in automatic speech processing, particularly in speech and speaker recognition. Possibly the major advantage of standard corpora is that they allow researchers to compare the performance of different techniques on common data, making it easier to determine which approaches are most promising to pursue (Campbell and Reynolds, 1999). For this purpose, a suitable acoustic environment is essential for recording speech data. The recording process often changes according to the technique used to collect the data. For example, data collected in a lab or studio differ greatly from data collected over the telephone or in ordinary rooms not specifically built for recording (I'd Rather Be Writing, 2010). Collecting speech in a lab or studio helps experimenters to control the audio according to their purpose and to avoid gathering noisy data; noise can be added after the data have been collected if required.
Different types of microphone, such as a high-quality microphone or a portable recorder, are used to capture audio in a range of different environments (Ma, Milner and Smith, 2006). According to I'd Rather Be Writing (2010), efforts have been made to find and prepare the best possible environment for obtaining accurate sound data, known as an acoustic room. Such a room has particular features for recording audio, for instance cloth panelling on the walls, isolation from other people, a lockable door and no windows.


  2.2.3 Data speech control


This dimension concerns the way in which audio is recorded and how participants interact with the tool, for instance whether speech is collected as random speech, spontaneous dialogue, an interview or a recording of read text. Random and spontaneous speech play a major role in speech corpora for speech recognition technology. Improving speech recognition tools depends significantly on raising recognition performance for spontaneous speech, so large spontaneous speech corpora need to be built for establishing acoustic and language models.
In addition, various tools have been built around different kinds of voice-application system. Lane et al. (2010) discussed and gave examples of corpus collection tools, for example web-based, lab-based and smartphone-based systems, and compared these voice-application systems with respect to capturing speech. A web-based system allows recordings to be collected remotely on PCs: volunteers access the URL of the tool from their own location at any time. This type of system helps experimenters to collect a large amount of data from any place, but the recording conditions cannot be controlled. A lab-based system, by contrast, is controlled during data collection, but it is hard for experimenters to find a large number of volunteers to take part.
For this reason, experimenters focus on advances in the field of spontaneous speech and on methods of collecting random and spontaneous speech. Because spontaneous speech has specific characteristics, such as random pauses, corrections, hesitations and repetitions, new technologies are required to recognize it. For example, a technique named automatic summarization includes indexing, a process that extracts the important and reliable parts of the automatic transcription (Furui, 2005).


  2.2.4 Data speech monitoring and validation


Monitoring of speech data is implemented through online or offline processes. Online monitoring helps to control the data while they are being recorded and to adjust technical and phonetic characteristics during capture. Offline monitoring is performed after data collection to check the data and to separate usable data from useless data; it checks the recorded speech after it has been uploaded by the users (Vianaz et al., 1998).
Validation in a software system takes place during testing, after the system has been developed. Validation of a data corpus is similar, because the stakeholder needs to validate the data after the corpus has been captured (LREC Workshop, 2004). In a web-based tool, for example, the analyst checks the data offline once the recording process is complete. Monitoring, on the other hand, means observing and checking the system while the process is running (Tarala, 2011). Monitoring in speech corpus collection works in the same way as in a software system: the stakeholder checks the system during the collection process. This is possible for a lab-based system because the volunteers are supervised by the developer during the procedure.
Validation is thus closely related to a posteriori evaluation of the recorded material. For instance, read-speech data are recorded in a studio and monitored by someone outside the recording room who has audio contact with the speaker. This arrangement is very difficult to achieve in a software system, because individual monitors cannot supervise it and the time of recording is not restricted. Control of phonetic characteristics is therefore limited to cases in which the pronunciation is not naturally assumed. The one exception is the reading of sentences with self-monitoring, in which the participant has the chance to listen to the recorded sentences and to correct them by rereading any part that was read incorrectly. In telephone recordings and dialogue, validation was used instead of monitoring (Russell, Corley, and Lickley, 2011).
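As an illustration of offline validation, the sketch below flags uploaded recordings that are too short, completely silent or heavily clipped, so that usable data can be separated from useless data. It is a hypothetical example, not the validation procedure of any cited tool: the function name, the thresholds and the assumption of 16-bit PCM samples are all choices made for this illustration.

```python
from typing import List, Sequence

INT16_PEAK = 32767  # largest magnitude treated as clipping for 16-bit PCM (assumption)


def validate_recording(
    samples: Sequence[int],
    sample_rate: int,
    min_seconds: float = 1.0,           # hypothetical minimum length
    clip_fraction_limit: float = 0.01,  # hypothetical tolerated share of clipped samples
) -> List[str]:
    """Return a list of problems found in one recording (empty list = usable)."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")

    problems: List[str] = []

    # Duration check: very short uploads are unlikely to contain the full prompt.
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        problems.append("too short")

    if samples:
        # Silence check: an all-zero signal usually means the microphone was muted.
        if max(abs(s) for s in samples) == 0:
            problems.append("silent")

        # Clipping check: many samples at full scale indicate an overdriven input.
        clipped = sum(1 for s in samples if abs(s) >= INT16_PEAK)
        if clipped / len(samples) > clip_fraction_limit:
            problems.append("clipped")

    return problems


# Usage: one second of moderate-level signal at 16 kHz passes all checks.
print(validate_recording([1000, -1000] * 8000, sample_rate=16000))
```

A check of this kind can only cover technical properties of the signal; whether the participant actually read the prompt correctly still requires listening or self-monitoring as described above.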


  2.2.5 Data speech channel


Generally, the channels used for recording speech data are divided into two main types. A single channel uses a single microphone to record the speech, whereas multiple channels use more than one microphone (Vianaz et al., 1998). An audio channel is a single path of audio data. Multi-channel audio is any audio that uses more than one channel simultaneously, allowing more audio data to be transmitted than single-channel audio. Driver support on the user's machine is a problem, because the standard sound interfaces of many operating systems were designed before multi-channel recording became common and allow only up to two channels of recording (Microsoft, 2012). An audio-visual channel also counts as multi-channel because it carries both video and audio. Each part of this shared channel affects the collection of the corpus, but only the audio part can be used directly as corpus data (Belmudez et al., 2009). Table 2.1 describes some corpora tools used in collecting data.






















































| Tool Name | Domain | Type of speech | Microphones | Channels | Acoustic Environment |
|---|---|---|---|---|---|
| TIMIT and Derivatives (LDC) | The TIMIT corpus was designed to supply speech data for acoustic-phonetic studies and for developing and evaluating automatic speech recognition systems. | Read sentences | Fixed wideband headset | Wideband / clean | Sound booth |
| SIVA (ELRA) | The Italian speech corpus Speaker Identification and Verification Archive includes male and female users and impostors. | Prompted digits, short questions and read text | Variable telephone handsets | PSTN | Home / office |
| POLYCOST (ELRA) | It consists of read and spontaneous French and Swiss speech. | Read and prompted digits, words, sentences, questions and spontaneous speech | Variable telephone handsets | PISN (possibly ISDN) | Home / office |
| YOHO (LDC) | It is designed to support text-dependent speaker verification evaluation for Government secure access applications. | Prompted digits and phrases | Fixed high-quality in headset | 3.8 kHz / clean | Home |
| Speaker Recognition corpus (OGI) | The Center for Spoken Language Understanding is collecting a large speech database for speaker recognition research. | Prompted phrases, digits and monologue | Variable telephone handsets | PSTN | Home / office |

Table 2.1: Comparison of some corpora tools (adapted from Campbell and Reynolds, 1999)
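The distinction between single-channel and multi-channel recordings can be checked programmatically once audio is stored as WAV files. The sketch below uses Python's standard `wave` module; the helper names `make_wav` and `describe_channels` are hypothetical, and the in-memory silent file exists only so that the example is self-contained.

```python
import io
import wave


def make_wav(n_channels: int, n_frames: int = 160, sample_rate: int = 16000) -> bytes:
    """Build a small silent 16-bit PCM WAV file in memory (demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(n_channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        # One frame holds one sample per channel.
        wav.writeframes(b"\x00\x00" * n_channels * n_frames)
    return buffer.getvalue()


def describe_channels(wav_bytes: bytes) -> str:
    """Classify a WAV file as single-channel or multi-channel."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        n_channels = wav.getnchannels()
    if n_channels == 1:
        return "single-channel"
    return f"multi-channel ({n_channels} channels)"


print(describe_channels(make_wav(1)))
print(describe_channels(make_wav(2)))
```

A collection tool could run a check like this on every upload to record the channel count alongside the other recording conditions.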


  2.3 Speech Type


Typically, speech corpora are divided into common types according to the method by which the data are obtained. As argued in 2.2.3, speech data are classified according to the control exercised over the speech during collection. This control affects the data collection and determines the speech type. The tool used for gathering the data also determines how much control there is over the speech; for example, collecting data with a phone-based tool differs from collecting it with a web-based tool. The following subsections discuss the two main types of speech, read speech and spontaneous speech (Helgason, 2010).



  2.3.1 Read speech


Read speech is used for collecting data in speech corpora, and some speech corpus tools have been developed to present text for reading and to record the speech. A read speech corpus includes professional high-quality recordings of a speaker’s voice (Alam et al., 2010). This type of speech corpus includes:

  1. Book excerpts: based on a manuscript that usually consists of a passage, a quotation or a piece of text taken from books and journals.

  2. Broadcast news: read speech can also be collected from TV and radio broadcasts, in which case the text is not constructed specifically for the collection of data.

  3. Lists of words: some tools ask speakers to read out words that have been chosen specifically for the experiment.

  4. Sequences of numbers: some tools ask the participant to read a sequence of numbers.


Several tools have been developed to gather read speech corpora, as either web-based or lab-based systems. For instance, VoxForge, which was developed as a web-based system, gives volunteers the text to read (Campbell and Reynolds, 1999). This project therefore focuses on developing a tool that collects a read speech corpus in order to assist experimenters.


  2.3.2 Spontaneous speech


Though speech is spontaneous in almost any situation, recognition of spontaneous speech is an area that has only recently emerged in the field of automatic speech recognition. Raising recognition performance for spontaneous speech is crucial for developing applications of speech recognition. Spontaneous speech therefore needs to be analyzed and modeled using spontaneous speech databases, because spontaneous speech and read speech are significantly different (Furui et al., 2005).
Spontaneous speech can also be described as non-careful speech, and it makes understanding how speech and communication work considerably more complex. Spontaneous speech often includes sequences with reduction phenomena so strong that one could never have predicted them, and that are surprising to see when studying the spectrogram (Prins and Bastiaanse, 2004). Figure 2.1 shows spontaneous speech with multiple deletions, reduction of stops to fricatives, and changes to vowel qualities.

Figure 2.1: Spontaneous speech before and after deletion of stops (extracted from Warner, 2009)


According to Richey (2002), this type of speech corpus also includes:

  1. Dialogs and meetings: entirely free conversations between two or more speakers. In almost all cases these consist of impromptu speech (i.e. no manuscript).

  2. Narratives: speech by one person who narrates a story or an incident.

  3. Appointment tasks: a conversation that takes place between two or more people.

  4. Telephone conversations: no particular text is written or read, but the conversation can still be directed towards a specific subject.


For accuracy, linguists usually try to collect their data in a phonetics laboratory with high-quality recording equipment. As a result, it is hard to recruit enough participants and to collect the realistic samples that their research needs.
For read speech corpora, on the other hand, the text is presented on paper or on a computer screen. As technology has developed, most tools have come to use a web interface to present the text, which also makes it easier for the experimenter to collect data. Several previous tools (Voxforge, 2006; Gruenstein, 2009; Schultz, 2007) collect speech data remotely via web-based interfaces.


  2.4 The collection of data speech corpora



Collecting and labeling naturally spoken language can be time-consuming and expensive for the experimenter. Typically, recordings are made of subjects performing a task that involves speech, which must be transcribed later. This section describes some approaches to speech data collection online or via cell phone.
Linguists and researchers focus on the technique of collecting data and on the type of data collected. Computers and electronic equipment are used to collect the data, and as technology has developed, corpora have become able to store huge amounts of data. In addition, Durant and Smith (2007) reported that 87% of readers depend primarily on the Internet. This factor motivated developers to create online corpus tools for collecting data.
Many kinds of tools have been developed to gather speech corpora. As Lane et al. (2010) discussed, Gruenstein (2009) describes a web-based memory game that collects audio as spontaneous speech: the system does not supply speech prompts, but instead uses a voice-based memory game to gather and partially annotate spontaneous speech. In contrast, the SPICE project (Schultz, 2007) presents a set of web-based tools that enable developers to create voice-based applications for less common languages. This system also includes tools for collecting speech data from volunteers via a web page. In addition, Table 2.2 lists several tools used for collecting data.
According to LumenVox (2011), building a speech corpus requires at least a thousand different speakers, divided evenly between male and female. The number of speakers needed varies according to the analysis for which the corpus will be used. To ensure the greatest diversity of speaking styles, the speakers should represent a variety of ages. This shows that the requirements for creating a corpus differ within speech technology, and that each linguist has their own conditions for collecting data. The developer should therefore provide a tool that can record participants' personal information along with their speech.
In addition, in linguistics and lexicography, a corpus is a body of texts, utterances or other specimens considered more or less representative of a language, usually stored as an electronic database. Computer corpora may currently store many millions of running words, whose features can be analyzed by means of tagging (McArthur, 1992).
It is important to use computers and electronic equipment to recognize speech and identify phonemes when studying a language, since they offer a variety of benefits for handling text and speech. Researchers can manipulate data easily, perform rapid searching and sorting, process data accurately, consistently and reliably without human bias, and annotate data automatically. Electronic corpus tools are therefore useful for collecting examples for linguists, data resources for lexicographers, and training material for natural language processing (NLP) applications (Dickinson, 2008).
On the other hand, collecting data poses some problems for linguists. Linguists must keep all captured audio, because it supports functional analysis of the speech data, and as much situational information as possible must be retained. As a result, a detailed log is required for every recording. Moreover, the sociolinguistic interview does not provide a perfect picture of natural speech interaction, although some researchers argue that interviews can also represent talk in action well (Labov, 2008).
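A collection tool that stores each participant's gender and age could report how close the corpus is to these targets. The sketch below is a hypothetical illustration: the function names, the ten-year age bins and the 5% tolerance are assumptions made here, not requirements stated by LumenVox.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_speakers(speakers: Iterable[Tuple[str, int]], age_bin_width: int = 10) -> dict:
    """Count speakers by gender and by age bin (e.g. 20 means ages 20 to 29)."""
    speakers = list(speakers)
    by_gender = Counter(gender for gender, _ in speakers)
    by_age_bin = Counter((age // age_bin_width) * age_bin_width for _, age in speakers)
    return {
        "total": len(speakers),
        "by_gender": dict(by_gender),
        "by_age_bin": dict(sorted(by_age_bin.items())),
    }


def is_gender_balanced(summary: dict, tolerance: float = 0.05) -> bool:
    """True if male and female counts differ by at most `tolerance` of all speakers."""
    total = summary["total"]
    if total == 0:
        return False
    male = summary["by_gender"].get("male", 0)
    female = summary["by_gender"].get("female", 0)
    return abs(male - female) / total <= tolerance


# Illustrative (gender, age) pairs only.
speakers = [("female", 23), ("male", 35), ("female", 41), ("male", 67)]
summary = summarize_speakers(speakers)
print(summary)
print(is_gender_balanced(summary))
```

Reporting the age bins alongside the gender split lets the experimenter see which groups are under-represented and recruit accordingly.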


  2.5 Significance of Corpora tools


Because collecting corpora is very expensive, several tools have been developed for this purpose. Each tool has been implemented to perform a specific task and is provided as a voice application system. According to Kilgarriff and Kosem (2012), corpus tools generally need to store and manipulate the data, and some tools are also able to annotate the speech corpus. The design and development method of these systems may differ, but each tool plays a significant part in the analysis of the corpus.
Speech corpus tools are significant because they allow linguists to collect data with little effort. These tools are generally used for creating a speech corpus from either audio recordings or text-based transcriptions. Recordings may be made with sound storage technologies and stored to create a corpus. Experts agree that speech corpora serve two significant fields. First, they are useful in speech technology, where among other things they are used to build acoustic models. Second, in linguistics, spoken corpora support research on phonetics, conversation analysis and dialectology, as well as the translation and transcription of speech (voxforge, 2006).
To construct huge speech corpora with linguistic and signal annotations, corpus constructors use annotation recorder tools that reduce slow and time-consuming tasks. These tools help constructors to build very large, linguistically annotated corpora and provide a set of functions such as linguistic processing, signal processing, grapheme-to-phoneme conversion, automatic phonetic alignment and even language model generation. Earlier speech tools, however, dealt only with the phonetic and signal levels, and later ones were built for just a single type of application (Kim et al. 2000).


  1. The development of previous corpus tools


    1. The impact of corpus tool development




First, the term 'corpus' appeared in linguistics at the beginning of the 1980s (Leech and Fligelstone, 1992). The history of the corpus, however, dates back to the pre-Chomskyan period, when it was used by linguists such as Boas (1940) and by linguists of the structuralist tradition, including Newman, Bloomfield and Pike (Biber and Finegan, 1991). At that time linguists used shoeboxes filled with paper slips rather than computers, and the 'corpora' were simple collections of written or transcribed texts.
With advances in technology, and especially the growth of more powerful computers offering ever-increasing processing power and enormous storage capacity at comparatively low cost, the exploitation of huge corpora became feasible. As electronic technology progressed, corpus-based studies increased dramatically, and many projects have been implemented and developed for collecting speech corpora.
For example, the Speech Resources Consortium (NII-SRC) was created to collect and distribute speech data for research on speech recognition and speech synthesis. The NII system is able to cope with the speech of different people in different locations and situations. According to this project, it is necessary to collect as many samples as possible, covering different aspects of the data such as environment, age and language. The NII system shows that speech corpora are used in speech technology, language education, and linguistics. All data used in the experiments depended on the collected data, as shown in figure 2.2 (Itahashi, 2005).



Figure 2.2: Various aspects of the data

Each of the tools that have been developed has a specific task in relation to speech corpora. Some are responsible for collecting large speech corpora, and others are adapted to annotating them. Generally, an annotation tool is designed to handle and analyze the huge speech corpora that have been gathered by collection tools (Bird & Harrington, 2001). The different types of collection tool were discussed in the previous section, and some important collection tools are compared in the next section.


  1. Comparison of previous corpus tools


As discussed previously, several tools have been developed. Some are devoted to collecting speech corpora and others to analyzing the collected data. These tools differ in their design, their implementation and the infrastructure on which they are built. For instance, VoxForge is implemented as a web-based system for collecting speech corpora. Table 2.2 identifies several of the tools that have been developed. All of these tools have had a great influence on the development of new corpus tools such as this project, which is devoted to implementing a web-based corpus tool.
Each project has its own method of collecting and analyzing data. The main purpose behind developing such a tool is to help experimenters collect natural data and obtain it more easily through a web-based system. Before the technology was developed, it was hard for linguists and experimenters to collect large amounts of data. Once the technology developed, corpus tools came into widespread use. The tools collect and analyze data through different techniques; for instance, the CHILDES Project is a tool for analyzing speech that was developed by Brian MacWhinney at Carnegie Mellon University.
On the other hand, some experimenters focus on projects that run on a server. Participants can access such a project easily, it is not hard for the researcher to obtain the data, and programmers can develop the program according to the experimenter's requirements. Users can participate from anywhere in the world, which helps the researcher to receive a massive number of samples within a short time and without incurring huge costs. A client-server tool collects information about all participants and stores it in a database, because experimenters handle each individual's speech according to his or her personal data.
For instance, the VoxForge tool runs on a server and collects audio together with participant information. The tool can gather data in several languages, such as English, French, Bulgarian and Dutch. It randomly generates the text to be read by the user, and it can record and play back audio so that users can check their recording before submitting it. An important feature of the tool is that it collects information related to the user, for example gender, age range, pronunciation dialect and microphone type. The user selects the type of microphone from a drop-down box (see figure 2.3).

Figure 2.3: Drop-down box showing the types of microphone in VoxForge

In general, most of the previous tools were established for a single application domain. As the technology progressed, several tools were developed to collect and analyze data, but few of them ran as client-server systems or were developed at the request of the experts who use them. Each tool has been built by a particular developer, organization or university. Another aspect on which the tools can be compared is natural language processing, which is a field of computer science, machine learning and linguistics concerned with the interaction between computers and human languages (Charniak, 1985). Table 2.2 shows some previous tools and their developers, together with the method used to collect the data and the domains covered.








































| Tool Name | Developer | Collecting data | Domain Covered |
| --- | --- | --- | --- |
| VoxForge | VoxForge organization | Developed as a client-server system. The user can participate from any location. The participant reads sentences, records his/her speech, and then submits it. | Subversion repository containing VoxForge speech audio files, acoustic models, and scripts. |
| The Speech Accent Archive (Weinberger, 2012) | Steven Weinberger (George Mason University) | Website: the user uploads the audio and then submits it. The data includes user information and the previously recorded audio file. | Native and non-native speakers of English read the same paragraph and are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers. |
| A Self-Labeling Speech Corpus (McGraw, Gruenstein and Sutherland, 2009) | Ian McGraw, Alexander Gruenstein and Andrew Sutherland | Collecting and transcribing speech data by using online educational games. | Using the AMT (Amazon Mechanical Turk) majority labels as a reference transcription. |
| (Gruenstein, 2009) | Alexander Gruenstein | Spontaneous speech is collected via a web-based memory game. | Partially annotating and gathering spontaneous speech. |
| NeoSpeech (Neospeech, 2011) | NeoSpeech | TTS Web Service; the server automatically records information and speech data. | Providing speech-enabled solutions based on a suite of best-of-breed core capabilities in speech recognition, speech synthesis, speaker verification and voice animation. |

Table 2.2: The previous text-speech corpora tools


  1. Summary


To conclude this chapter, several aspects were discussed, with a focus on some works that have already been implemented. The literature review concentrated mainly on the lessons these works offer for providing a better tool in the future. The beginning of the chapter introduced the topic to the reader together with its advantages, and it was then argued in which domains the tool is applicable.
Moreover, the chapter discussed the significance of classifying collected data, which helps researchers to investigate speech sounds in speech technology and linguistics. As a result, several points were drawn about collecting audio with the best tools and in different environments. The method of collecting the data was determined, as well as how it changes according to the investigation. Ultimately, this review showed that developing text-speech corpora has played a major role in speech technology and language pronunciation. These tools are able to collect thousands of hours of speech and to analyze the collected speech data.

Chapter three: Technical Background Review


The previous chapter discussed how the data is collected and the types of tool that have been developed to create speech corpora. This chapter discusses the technical methods used for developing these tools. It includes several sections. Section 3.1 technically discusses and evaluates two previous tools that help to collect speech corpora. Section 3.2 discusses the programming language used to develop corpora tools. Section 3.3 is allocated to the Java applet and its significance for developing corpora tools on the web server. Section 3.4 describes how the Java applet runs on the server and the performance of the client-server side of corpora tools. The final section explains data storage in corpora tools in general.

3.1 The technical corpora tools overview

Several tools have been developed to gather and annotate corpus data, but each tool is provided as a different program and uses a different programming language. This report focuses on tools that run as client-server, web-based systems. Other approaches have been developed as applications that can be used as lab-based or phone-based systems. It is therefore necessary to examine the techniques needed to develop these tools: the programming language, the type of database used to save the data, and the type of server used to run the system.
Although table 2.1 and table 2.2 discussed and compared several kinds of speech corpus collection tool, this report focuses only on the VoxForge tool and the Speech Accent Archive system, because both have been implemented as web-based systems. Some pros and cons of these tools are presented and compared in order to obtain useful points for the project.
A significant example is VoxForge, which was discussed in the earlier chapter. It was developed to collect speech data via web-based interfaces on a client-server system, and was established by Ken MacLean, who is the creator, maintainer and administrator of the VoxForge.org website. This website collects data and creates a repository of transcribed speech audio files, acoustic models and scripts for use with open source speech recognition software. In other words, VoxForge is built to gather transcribed data for use in open source speech recognition engines. Users can submit their speech audio recordings to VoxForge for the creation of GPL speech corpora and acoustic models. GPL stands for General Public License, a free software licence. The licences for most software and other practical works are designed to take away the freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee the freedom to share and change all versions of a program and to make sure it remains free software for all its users (Smith, 2007). According to MicrosoftResearch (2012), “Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform”.
In an acoustic model, each of these statistical representations is assigned a label called a phoneme. The English language has about forty distinct sounds, and therefore about forty different phonemes, which are useful for speech recognition in speech technology. For this purpose, the VoxForge.org system uses a Java applet (figure 3:1 shows the Java applet in VoxForge). To record speech and obtain the participant's information, the developer provides the system as a Java applet because an applet can be integrated easily with a web server and embedded into any web environment without installing a separate client application. In addition, it is very easy to integrate with HTML files, since the developer can embed it with a few lines of code. As a result, many developers depend on it to build their systems (Oracle, 2012).



Figure 3:1 The Java applet in VoxForge

The Speech Accent Archive system is another example of a web tool. It was developed by Steven Weinberger at George Mason University and was discussed in the previous chapter. The website is implemented in the PHP programming language. The Speech Accent Archive was established to collect speech corpus data and transcribe it, but it differs from VoxForge: VoxForge collects data online and uses a Java applet, whereas the Speech Accent Archive uses a program called PolderbitS Sound Recorder, which is shown in figure 3:2. Participants must download the program to record their speech. After completing the recording, they upload the captured audio via the website. There are a number of other differences between the two systems. For instance, VoxForge presents the text randomly, without the participant choosing it or knowing in advance what type of text it will be, as shown in figure 3:1. The Speech Accent Archive, however, obliges each participant to download a particular paragraph to read.

Figure 3:2 PolderbitS Sound Recorder program interface to record audio (Weinberger, 2012)

These two tools have advantages and disadvantages that can assist developers in implementing a new tool, and their strong points should be used in this project. VoxForge depends completely on the Java applet: the applet shows the text that the user reads during recording, acts as the audio recorder, and collects the volunteer's profile and the recording environment. Relying on a Java applet for recording is acceptable, but the applet runs only in some browsers, and the volunteer's PC must have the plug-in needed to run it. Conversely, the Speech Accent Archive requires the user to download an application, the PolderbitS Sound Recorder program, and then to upload the audio after recording. This approach is less suitable because it takes volunteers more time and the recorder is not part of the system itself. For this reason, the first approach is better than the second, although the second application has a nicer user interface. Overall, both tools appear to lack good user interface design.

3.2 The Programming Language for implementing corpora tool

Speech corpora tools should collect speech together with the participant's information and profile, depending on the requirements and the type of corpus tool. For this purpose, the tool must be able to capture audio and store it on the server so that the analyzer can collect it easily. Several programming languages can be used to build a system capable of this task, for example Java, C, C++, C# and MATLAB. Most web-based tools are implemented with technologies such as the Java programming language, Flash recorders and JavaScript. This research focuses on the Java programming language and on how to capture speech audio in the web browser and send it to the database store.

3.2.1 Capturing Audio by Java programming Language

To capture sound in Java, the programmer needs to be familiar with the methods that Java provides for this purpose. To begin with, the developer needs to understand the structure of recorded sound.
From the point of view of the Java Sound API, the word 'sound' takes on a somewhat different meaning. Nonetheless, it is fair to say that the purpose of the Sound API is to help a developer write programs that cause sound pressure waves to impinge upon the ears at particular times (Solution Inc, 2009). Baldwin (2003) quotes Sun's statement that the Java Sound “API is a low-level API for effecting and controlling input and output of audio media. It provides explicit control over the capabilities commonly required for audio input and output in a framework that promotes extensibility and flexibility.”
The Java Sound API supplies the lowest level of audio support on the Java platform. It gives application programs a high degree of control over audio operations, and it is extensible. For example, it provides mechanisms for installing, accessing and operating system resources such as audio mixers, MIDI devices, file readers and writers, and sound format converters. It does not include advanced sound editors or GUI tools, but it provides a set of capabilities on which such programs can be built. It concentrates on low-level control beyond what the end user normally expects; consequently, end users benefit from higher-level interfaces built on top of Java Sound (Oprea, 2005). Several quite complex issues are involved in using the Sound API, and the remainder of this section briefly introduces those that relate to capturing audio in Java.


  1. Packages


To begin with, a program must use the particular Java API packages that are provided for sound. According to Baldwin (2003), the API supports two significant types of audio (or sound) data:

First: Sampled audio data


Sampled audio data can be thought of as a series of digital values that represent the amplitude or intensity of sound pressure waves. For instance, the graph in figure 3:3 might represent a set of sampled audio data generated by a wide-band noise source, such as the noise at an airport.



Figure 3:3 Sampled audio data

Thus, this type of audio data is supported by the following two main Java packages:

  • javax.sound.sampled

  • javax.sound.sampled.spi


These two packages specify interfaces for capturing, mixing, and playing back digital (sampled) audio.
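As a minimal sketch of what sampled audio data looks like in code, the following class builds one second of a sine tone as raw PCM bytes and wraps it in an AudioInputStream from javax.sound.sampled. The class name SampledAudioSketch, the 16 kHz, 16-bit, mono format and the tone itself are illustrative assumptions and are not taken from any of the tools discussed here.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import java.io.ByteArrayInputStream;

/** Hypothetical helper: builds sampled audio data in memory (no sound card needed). */
final class SampledAudioSketch {

    /** Assumed recording format: 16 kHz, 16-bit, mono, signed, little-endian PCM. */
    static AudioFormat speechFormat() {
        return new AudioFormat(16000f, 16, 1, true, false);
    }

    /** Returns a sine tone as raw PCM bytes; assumes the 16-bit mono format above. */
    static byte[] sineTone(AudioFormat format, double frequencyHz, int durationMs) {
        int sampleRate = (int) format.getSampleRate();
        int frames = sampleRate * durationMs / 1000;
        byte[] pcm = new byte[frames * format.getFrameSize()];
        for (int i = 0; i < frames; i++) {
            double t = (double) i / sampleRate;
            // Each sample is one digital value describing the wave's amplitude at time t.
            short sample = (short) (Math.sin(2 * Math.PI * frequencyHz * t) * 0.5 * Short.MAX_VALUE);
            pcm[2 * i] = (byte) (sample & 0xFF);            // low byte first (little-endian)
            pcm[2 * i + 1] = (byte) ((sample >> 8) & 0xFF); // high byte
        }
        return pcm;
    }

    /** Wraps raw PCM bytes so that other Java Sound classes can read them as audio. */
    static AudioInputStream asStream(AudioFormat format, byte[] pcm) {
        long frames = pcm.length / format.getFrameSize();
        return new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
    }
}
```

With these assumed settings, one second of audio occupies 16000 frames of two bytes each, which gives a rough idea of how quickly recorded speech grows in size.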

Second: MIDI data

MIDI stands for Musical Instrument Digital Interface. MIDI data may be used for ordinary musical sound or for special sound effects. This kind of audio data is supported by the following two Java packages:

  • javax.sound.midi

  • javax.sound.midi.spi


These two packages supply interfaces for MIDI synthesis, sequencing, and event transport.
Moreover, the spi packages allow service providers to create custom components that can be installed on the system. Another term that arises in this context is Digital Signal Processing (DSP), because DSP techniques are often used in the processing of sampled audio data. DSP takes real-world signals such as audio, video, pressure or position that have been digitized and then manipulates them mathematically. A digital signal processor is designed to perform mathematical functions such as addition, subtraction, multiplication and division very quickly (Analogue Devices, 1995).
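The mathematical manipulation mentioned above can be illustrated with two very simple operations on 16-bit samples: measuring the root-mean-square level and applying a gain. This is only a sketch under the assumption of 16-bit little-endian PCM; the class DspSketch and its method names are hypothetical and do not belong to the Java Sound API.

```java
/** Hypothetical helper illustrating basic DSP arithmetic on 16-bit PCM samples. */
final class DspSketch {

    /** Decodes 16-bit little-endian PCM bytes into signed samples. */
    static short[] decode(byte[] pcm) {
        short[] samples = new short[pcm.length / 2];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = (short) ((pcm[2 * i] & 0xFF) | (pcm[2 * i + 1] << 8));
        }
        return samples;
    }

    /** Root-mean-square level: multiply, add, divide, then take the square root. */
    static double rms(short[] samples) {
        if (samples.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (short s : samples) {
            sum += (double) s * s;
        }
        return Math.sqrt(sum / samples.length);
    }

    /** Multiplies every sample by gain and clamps the result to the 16-bit range. */
    static short[] applyGain(short[] samples, double gain) {
        short[] out = new short[samples.length];
        for (int i = 0; i < samples.length; i++) {
            long scaled = Math.round(samples[i] * gain);
            out[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, scaled));
        }
        return out;
    }
}
```

A collection tool could use a level check of this kind to warn a volunteer that a recording is too quiet or is clipping, although none of the tools reviewed here is documented as doing so.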


  1. Mixers and Lines


The Java Sound API is based on the concept of mixers and lines. According to astralsound (2003), “An audio mixer is a device that mixes two or more separate signals. Mixers range from a couple of variable resistors with knobs to the big and complicated-looking consoles used in the largest multi-performance events.” In the Java Sound API, a mixer combines audio from multiple input lines and has at least one output line. The former are often instances of classes that implement SourceDataLine, and the latter of classes that implement TargetDataLine. Port objects are also either source lines or target lines (Pfistere and Bomers, 2005).
Thus, a mixer is really a traffic manager. A signal is connected to an input, and the mixer routes it to one of several possible outputs. Some mixers mix in several stages, where inputs are mixed into submixes, or groups, and the groups are then mixed into a stereo output. Moreover, Petrauskas (2005) defined a line as “an element of the digital audio "pipeline," such as an audio input or output port, a mixer, or an audio data path into or out of a mixer. The audio data flowing through a line can be mono or multichannel (for example, stereo). ... A line can have controls, such as gain, pan, and reverb.”
A simple audio input system usually consists of four parts, as shown in figure 3:4: a Mixer object is configured with one or more ports, some controls, and a TargetDataLine object.

Figure 3:4 an audio input system

Data flows into the mixer from one or more input ports, usually the microphone or the line-in jack. Controls such as gain and pan are applied, and the mixer delivers the captured data to an application program via the mixer's target data line. A target data line is an output of the mixer and contains the mixture of the streamed input sounds. The simplest mixer has only one target data line, but some mixers are able to deliver captured data to multiple target data lines concurrently (Wang, 2001). The data provided by the TargetDataLine object can be pushed into some other program construct in real time. The actual destination of the audio data can be, for example, an audio file, a network connection, or a buffer in memory. TargetDataLine is a sub-interface of DataLine, which in turn is a sub-interface of Line; the different types of line are defined by sub-interfaces of the basic Line interface. The interface hierarchy in figure 3:5 shows the relation between DataLine and TargetDataLine.

Figure 3:5 the Line Interface Hierarchy
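The hierarchy in figure 3:5 can be checked directly against the javax.sound.sampled interfaces with a few lines of reflection. The sketch below is illustrative only; the class LineHierarchySketch and its two methods are hypothetical names introduced for this example.

```java
import javax.sound.sampled.Line;

/** Hypothetical helper that inspects the Line interface hierarchy by reflection. */
final class LineHierarchySketch {

    /** True when child is a proper sub-interface of parent. */
    static boolean extendsInterface(Class<?> child, Class<?> parent) {
        return child != parent && parent.isAssignableFrom(child);
    }

    /** Follows the first super-interface upwards and stops once Line is reached. */
    static String chain(Class<?> type) {
        StringBuilder sb = new StringBuilder(type.getSimpleName());
        Class<?> current = type;
        while (current != Line.class && current.getInterfaces().length > 0) {
            current = current.getInterfaces()[0];
            sb.append(" -> ").append(current.getSimpleName());
        }
        return sb.toString();
    }
}
```

For example, LineHierarchySketch.chain(javax.sound.sampled.TargetDataLine.class) walks from TargetDataLine through DataLine to Line, which matches the relation described in the text.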

The Java Sound API does not assume a particular audio hardware configuration; it is designed to allow various kinds of audio components to be installed on a system and accessed through the API. It supports common functionality such as input and output from a sound card (for recording and playback of sound files) as well as the mixing of multiple audio streams. Figure 3:6 shows an example of a typical audio architecture.

Figure 3:6 A Typical Audio Architecture

In the example shown in figure 3:6, a device such as a sound card has various input and output ports, and mixing is provided in software. The mixer might receive data that has been read from a file, streamed from a network, generated on the fly by an application program, or produced by a MIDI synthesizer. The mixer combines all its audio inputs into a single stream, which can be sent to an output device for rendering (Oracle, 1995).
To conclude, a TargetDataLine receives audio data from a mixer. As mentioned before, the mixer captures audio data from a port such as a microphone; it might process or mix this captured audio before placing the data in the target data line's buffer. The TargetDataLine interface provides methods for reading the data from the target data line's buffer and for determining how much data is currently available for reading. Conversely, a SourceDataLine receives audio data for playback, which every simple sound system requires. It provides methods for writing data to the source data line's buffer for playback and for determining how much data the line is prepared to receive without blocking.
Finally, a Clip is a data line into which audio data can be loaded prior to playback. Because the data is pre-loaded rather than streamed, the clip's duration is known before playback, and users can choose any starting position in the media. Clips can also be looped, meaning that upon playback all the data between two specified loop points repeats a specified number of times, or indefinitely (Oracle Java Technology, 1993).
The purpose of this discussion of audio capture is to develop an application that can capture speech audio under the control of a GUI and run as an applet in the web browser, in order to create the web-based speech corpus collection tool.
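The capture path described in this section (microphone port, mixer, TargetDataLine, application buffer) can be sketched as follows. This is a minimal illustration, not the code of VoxForge or of this project: the class CaptureSketch, its method names and the fixed recording time are assumptions, and the record method needs a working capture device at run time.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import java.io.ByteArrayOutputStream;

/** Hypothetical helper: reads captured audio from a TargetDataLine into memory. */
final class CaptureSketch {

    /** Describes the kind of line wanted: a TargetDataLine that supports the format. */
    static DataLine.Info targetInfo(AudioFormat format) {
        return new DataLine.Info(TargetDataLine.class, format);
    }

    /** Rounds a buffer size down to a whole number of frames, as line.read requires. */
    static int frameAlignedBufferSize(AudioFormat format, int approxBytes) {
        int frame = format.getFrameSize();
        return Math.max(frame, (approxBytes / frame) * frame);
    }

    /**
     * Records for roughly the given time from the default capture device.
     * AudioSystem.getLine throws IllegalArgumentException when no matching line exists.
     */
    static byte[] record(AudioFormat format, long millis) throws LineUnavailableException {
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(targetInfo(format));
        line.open(format);
        line.start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[frameAlignedBufferSize(format, 4096)];
        long end = System.currentTimeMillis() + millis;
        try {
            while (System.currentTimeMillis() < end) {
                int n = line.read(buffer, 0, buffer.length); // blocks until data arrives
                if (n > 0) {
                    out.write(buffer, 0, n);
                }
            }
        } finally {
            line.stop();
            line.close();
        }
        return out.toByteArray();
    }
}
```

In a real tool the loop would normally run on its own thread and stop when the user presses a button rather than after a fixed time.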


  1. GUI


A GUI should be used to control the Java Sound API. GUI stands for “Graphical User Interface” and is pronounced "gooey". It refers to the graphical interface of a computer, which permits users to click and drag objects on the screen (Gladden, 2000).
Baldwin (2003) pointed out that a simple Java Sound API program can be controlled by a simple GUI with three buttons, as shown in figure 3:7. Clicking the Capture button captures input data from a microphone and saves it in a ByteArrayOutputStream object. Data capture stops when the user clicks the Stop button. Playback of the captured data begins when the Playback button is clicked.



Figure 3:7 Simple GUI example for controlling Java Sound API
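A three-button control panel of the kind shown in figure 3:7 could be wired up roughly as below. This sketch is not Baldwin's code: the class RecorderControls is a hypothetical name, Swing is assumed as the GUI toolkit, and the capture, stop and playback actions are passed in as callbacks so that the audio code stays separate from the interface.

```java
import javax.swing.JButton;
import javax.swing.JPanel;

/** Hypothetical three-button panel: Capture, Stop and Playback. */
final class RecorderControls {
    final JButton capture = new JButton("Capture");
    final JButton stop = new JButton("Stop");
    final JButton playback = new JButton("Playback");
    final JPanel panel = new JPanel();

    RecorderControls(Runnable onCapture, Runnable onStop, Runnable onPlayback) {
        // Initially only Capture makes sense: nothing is recording or recorded yet.
        setEnabled(true, false, false);
        capture.addActionListener(e -> {
            setEnabled(false, true, false); // while recording, only Stop is available
            onCapture.run();
        });
        stop.addActionListener(e -> {
            setEnabled(true, false, true);  // recording finished, playback is possible
            onStop.run();
        });
        playback.addActionListener(e -> onPlayback.run());
        panel.add(capture);
        panel.add(stop);
        panel.add(playback);
    }

    private void setEnabled(boolean canCapture, boolean canStop, boolean canPlay) {
        capture.setEnabled(canCapture);
        stop.setEnabled(canStop);
        playback.setEnabled(canPlay);
    }
}
```

The panel can then be added to an applet or a frame, with the callbacks starting and stopping a capture thread.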


3.2.2 Java Applet


Java applets and GUIs are closely related, because a Java applet is usually given a GUI. Many references have discussed and defined Java applets. Their wording differs, but they all describe applets as programs written in the Java programming language that are able to run in the web browser. Bishop (2006) stated that Java applets are programs written in the Java programming language that can be embedded into web pages. A Java applet is a Java program capable of doing more complex tasks than JavaScript. The applet still needs to be run in a web browser, and it does not have the full access to the machine that a stand-alone Java program has.
Similarly, Oracle Corporation for Sun Developer Network (2010) indicated that “An applet is a program written in the Java programming language that can be included in an HTML page, much in the same way an image is included in a page. When you use a Java technology-enabled browser to view a page that contains an applet, the applet's code is transferred to your system and executed by the browser's Java Virtual Machine (JVM).”

These descriptions show that applets have their own advantages and disadvantages. An important property of the Java applet is strong security: Java addresses this issue by restricting applets to the Java execution environment and prohibiting access to system resources (Janalta Interactive Inc, 2012). Moreover, Java applets are cross-platform: they can run on Windows, Mac OS and Linux, and they work with all versions of the Java Plug-in. Most web browsers, such as Firefox, Internet Explorer, Google Chrome and Safari, support applets. Nonetheless, applets have some problems that have not yet been solved. For example, running an applet requires the Java plug-in to be installed, and the applet requires a JVM, so it initially takes significant start-up time. It is also difficult for developers to design and build an attractive user interface in applets compared with HTML technology (Rose India, 2007).


For instance, the Java applet interface depends on a GUI toolkit, with which it is difficult to design an attractive interface. This is evident in the VoxForge interface, which appears to lack interface design.
A Java applet helps web developers to build web voice recording and allows participants to capture their voice from a web site. It can compress the voice and send it to the web server via HTTP. Moreover, it can play back the recorded voice from the server, through an embedded voice streaming player or a separate player. The recording applet starts by capturing voice from the sound card and then uploads the voice file to the web server via HTTP. A server-side component, such as a script on Apache or a servlet on Apache Tomcat, receives the voice file on the server. The captured sound can then be saved and played back from the web server. These general steps occur in every web application that captures speech by Java applet (VIMAS Technologies, 2007). In addition, a Java applet can be signed with jarsigner to keep the applet secure and to satisfy the Java security policy. For these reasons, this type of applet is a good choice for this project.
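The two client-side steps described above, packaging the captured voice as a file and uploading it via HTTP, might look like the following sketch. It uses uncompressed WAV rather than a compressed format, and the class UploadSketch, the plain POST body and the audio/wav content type are assumptions for illustration; the endpoint URL is whatever server script the tool provides.

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper: wraps captured PCM as a WAV file and POSTs it to a server. */
final class UploadSketch {

    /** Adds a WAV header to raw PCM bytes so the server receives a playable file. */
    static byte[] toWav(byte[] pcm, AudioFormat format) throws IOException {
        long frames = pcm.length / format.getFrameSize();
        AudioInputStream in = new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        AudioSystem.write(in, AudioFileFormat.Type.WAVE, out);
        return out.toByteArray();
    }

    /** Sends the file as the body of an HTTP POST and returns the response status code. */
    static int upload(URL endpoint, byte[] wav) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "audio/wav");
        conn.setFixedLengthStreamingMode(wav.length);
        try (OutputStream body = conn.getOutputStream()) {
            body.write(wav);
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }
}
```

The server-side script or servlet would then read the request body and save it next to the participant's profile.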

3.3 The client-server task to run on corpora-tool

Client-server technology is directly related to networking: it is a distributed computing system in which tasks and computing power are split between servers and clients. Usually, the servers store processed data that is common to the clients across the organisation, and these data can be accessed by any user. The server also hosts applications that have been developed for a specific purpose. In this kind of network, requests are made by different clients to the server. The server then processes each request and provides the desired result to the client. The client-server architecture is versatile, supports GUIs (e.g. a Java applet) and has a modular infrastructure. The technology is also described as a way of reducing cost (MakeUseOf, 2010).
According to FreeFeast.info (2012), “Client-server computing is a software engineering technique often used within distributed computing that allows two independent processes to exchange information, through a dedicated connection, following an established protocol”. For example, functions such as recording the client's voice, playback and download, as well as web access and the database access discussed in the next section, are built on the client-server model.
The first way to connect client applets to a server is via a URL. In the basic scenario, the server creates an HTML document for the client, based on responses to a form or simply on a URL. For this purpose, the page must contain an element that embeds the applet and that works in each browser, for example in HTML 4 or HTML 5. The versions of HTML differ somewhat in how a Java applet is embedded in the browser. The following HTML code is taken from (Shayne Steele, 2011):

  1. Java applet in HTML 4.01


<applet code="Name of class that creates the jar file" archive="name of jar file.jar" Height="300" Width="550"> Applet failed to run. No Java plug-in was found. </applet>

  1. Java Applet in HTML5 (uses object tag)


<object type="application/x-java-applet" height="300" width="550"> <param name="code" value="Name of class that creates the jar file" /> <param name="archive" value="name of jar file.jar" /> Applet failed to run. No Java plug-in was found. </object>

Java supports multithreaded applications within the language. An applet running in a browser can control multiple threads. Java threads are mapped by the browser onto the threads of the underlying operating system when that system supports them (wisdomjobsgulf.com, 2010). Hopson and Ingram (1996) indicated that applets reside on servers and are downloaded to clients when referenced via a URL. Applets execute at remote sites under the control of the browser that downloaded them. Code execution and address space are modelled in a process called the Java Virtual Machine (JVM). The JVM defines how Java applet code is executed and how the byte codes in an applet are parsed. In general, applets run under the scrutiny of a security manager, and by default they are not permitted to record sound. Therefore, a Java applet must be granted this permission, for example by being signed, before it can capture sound and save the recording on the server. The applet itself, however, works on the client side. VIMAS Technologies (2011) describe the technique of recording audio on the web: “We offer a few web audio recorders to record audio from the web site and uploading audio files to web server via HTTP. They can be easy integrated to web page. The audio recorders implemented as Java applets which use native code to audio capture/playback and encoding/decoding.” Moreover, the VoxForge system runs on a server and uses a Java applet to provide a TTS corpora tool that can collect speech and send information to the database system.


  1. Database storage corpora tool


As discussed in chapter two, speech corpus tools hold a large amount of data, consisting of the text and the speech together with personal information about the users, depending on the corpus tool. Developers therefore have to allocate suitable space to store the data, and most tools have been designed to collect the data in a database so that it can be exchanged more easily. Elmasri and Navathe (2010) noted that using a database management system (DBMS) is advantageous because it provides concurrent, distributed access to large volumes of data and a uniform, logical model for representing data (the relational data model). It also supplies a powerful, uniform language for querying and updating data (i.e. SQL) and permits significant optimisations for query efficiency, such as indexing and query transformation. An important task of the database is to ensure data integrity within a single application, for example through checking and recovery, and to ensure data integrity when the system serves multiple applications, i.e. through concurrency control. A text-speech corpora tool needs a dedicated database because it collects more than a hundred hours of data. For example, Shmyrev (2010) indicated that the VoxForge database itself holds about seventy-five hours of read speech taken from various sources, collected by a web collection application. Since the VoxForge tool uses text files as a database for the personal information and the information related to the audio capture, it is hard to modify or delete the data. This shows that a corpora tool needs a database capable of storing very large amounts of data. This project will therefore use a different database, a recent version of MySQL 5, to collect the data and make it easy to control and update.
The database technology of the speech corpus collection tool is described in the architecture section of chapter five. Yeşim Aksan and Mustafa Aksan (2009), who designed and implemented a corpora tool, indicated that a corpus is “a collection of texts stored in an electronic database. A corpus represents varieties of spoken or written text types, sampling language in use, providing researchers the most fundamental database upon which they can search various aspects of language.” Building an electronic database also involves the collection, computerization and checking of corpus data. In conclusion, all the information is stored on the server, which acts as a database and returns the relevant information to the client.
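The difference between a flat text file and keyed database rows can be illustrated with a small Java sketch. It is an in-memory stand-in for a database table, not the project's MySQL schema: the class RecordingStore and the fields speaker, gender, microphone and wavPath are assumptions chosen for illustration. The point is that each recording is addressed by an identifier, so a single entry can be updated, deleted or queried without rewriting the rest of the data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a table of recording metadata.
final class RecordingStore {
    // One row; the field names are illustrative assumptions.
    static final class Recording {
        final int id;
        final String speaker;
        final String gender;
        final String microphone;
        final String wavPath;

        Recording(int id, String speaker, String gender, String microphone, String wavPath) {
            this.id = id;
            this.speaker = speaker;
            this.gender = gender;
            this.microphone = microphone;
            this.wavPath = wavPath;
        }
    }

    private final Map<Integer, Recording> rows = new LinkedHashMap<>();
    private int nextId = 1;

    // Comparable to an SQL INSERT with an auto-increment key.
    int insert(String speaker, String gender, String microphone, String wavPath) {
        int id = nextId++;
        rows.put(id, new Recording(id, speaker, gender, microphone, wavPath));
        return id;
    }

    // Comparable to an SQL UPDATE on one row; returns false if the id is unknown.
    boolean updateMicrophone(int id, String microphone) {
        Recording old = rows.get(id);
        if (old == null) return false;
        rows.put(id, new Recording(id, old.speaker, old.gender, microphone, old.wavPath));
        return true;
    }

    // Comparable to an SQL DELETE on one row.
    boolean delete(int id) {
        return rows.remove(id) != null;
    }

    // Comparable to an SQL SELECT with a WHERE clause on gender.
    List<Recording> findByGender(String gender) {
        List<Recording> out = new ArrayList<>();
        for (Recording r : rows.values()) {
            if (r.gender.equals(gender)) out.add(r);
        }
        return out;
    }
}
```

In a real deployment these four operations would be SQL statements executed against the database server; the in-memory map is used here only so that the sketch runs without one.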


  1. Summary



To summarize, this chapter has given a technical review of some tools that have been developed. It has discussed strong and weak points of the approaches used for developing speech corpora tools, especially web-based systems. It has considered significant aspects of the technical overview and of the programming language for developing the system, such as using Java to capture audio and to help the developer build the system according to the requirements. The chapter has also shown how previous web-based tools run on the server and store the data without losing speech data. In conclusion, this chapter will guide the new project in collecting a data corpus through a web-based system with a wide range of functionality.
Robert answered on Dec 20 2021
Abstract
    In speech technology and linguistics, corpus acquisition is a crucial and often costly process. Experiments need to collect a huge amount of speech data in order to analyze speech corpora. There are several kinds of corpus tools, and each type has specific advantages and disadvantages. As a result, the analyst should choose the tool that suits the speech corpus, so that the data can be gathered without wasting time or incurring undue expense. For this purpose, the focus of this project is to build a usable web-based tool for collecting a ‘Read-Speech’ corpus. The system asks volunteers to read a text prompt that appears on the webpage; their speech signal is recorded together with a personal information profile and a description of the environment in which the recording was made. The experimenters determine the text to be read by the participants, depending on the type of speech corpus.
Acknowledgments
    First and foremost, I would like to thank * for helping me to finish this work and for encouraging me to stay patient while completing this dissertation.
Then, I would like to thank my supervisor Dr * for his enthusiasm, wise words and continuous support throughout this project. His wide knowledge and his logical way of thinking have been of great value to me. His understanding, encouragement and personal guidance have provided a strong basis for the present project.
Additionally, many thanks to the University of * for being such a great university, offering anything a student might need, especially in the realm of academic excellence.
Finally, I am really grateful to the K-R -G of I, which gave me this opportunity and supported me financially.
Chapter two: Background Review
To illustrate the text-speech corpora tool, it is necessary to consider the significance of corpora tools and the domains in which they are used. Each domain plays a significant role in various scientific technologies, and several types of speech corpora have therefore been implemented. This chapter is devoted to reviewing the previous work related to this project, and it also describes the factors by which speech data are classified according to how the data are collected.
This chapter is divided into several sections. Section 2.1 explains the term corpora and defines some expressions used in the research, such as text corpora and speech corpora; it also discusses the structure of the text corpora and speech corpora used in previous tools, in speech technology and in linguistic study. Section 2.2 discusses the classification of speech data and the factors that influence speech data collection. Section 2.3 explains the types of speech recorded during data collection. Section 2.4 focuses on the methods of collecting data and the tools that have been implemented for this purpose. Section 2.5 points out the importance of the text-speech corpora tool and identifies the fields and technologies in which it is used: in speech technology, speech corpora are used to create acoustic models, whereas in linguistics they are used for studying transcription, phonetics and conversation analysis in a specific language. Section 2.6 concentrates on some significant tools that have been implemented before and on the impact of developing corpora tools. Finally, the chapter closes with a section that summarizes the evaluations and results obtained.
2.1 Definition and Terminology
This section discusses the word corpora and the expressions related to it, such as text corpus and speech corpus, because both are central to the study of speech recognition and linguistics in a particular language.
The word ‘Corpus’ is the singular form of ‘Corpora’. It is a Latin word meaning body. The word has several meanings, which vary with the field in which it is used. According to Farlex, Inc. (2004), in literary and literary-critical terms it means “a collection or body of writings.” In linguistics, it means “the body of data”. In life science and anatomy, it is the main part of an organ or structure. Farlex, Inc. (2004) also gives the sense “a capital or principal sum, as contrasted with a derived income” as used in economics, accounting and finance. The word is used for several other purposes as well.
The word Corpora has many scientific definitions, but all of them carry nearly the same meaning. According to Crystal (1992), the best-known definition in linguistics is “A collection of linguistic data, either compiled as written texts or as a transcription of recorded speech”. The basic aim of a corpus is to test a hypothesis about language, for instance to determine how the usage of a specific sound, word, or syntactic construction varies (McArthur, 1992). This shows that corpora include both text and speech. For this reason, text corpora and speech corpora are each introduced and discussed in detail below.
2.1.1 Text Corpora
A text corpus includes the kinds of texts that are read aloud during the recording of audio speech, so the method of collecting the corpus depends on it. This section is therefore devoted to identifying the texts that are used in collecting a speech corpus. Generally, the word ‘Text’ covers words, phrases, sentences, numbers, codes, and paragraphs. According to freetechexams.com (2005), “Text corpus is the technique which is used in linguistics and mainly it is used for the purpose of referring to the texts which had been stored and processed with the help of some electronically held”. Most linguists share this understanding: a text corpus is a large, structured set of texts that is usually stored and processed electronically.
There are open questions about how to select the text corpus in a text-corpora tool and about the purpose of the text provided by such tools. Some sources mention that lexicographers, linguists and researchers can choose the texts to be read (Aijmer and Altenberg 1991). For instance, the audio recorder at http://www.voxforge.org/home/read is a good example of a tool that determines the text to be read by the participants. It is crucial to state the method of selecting the text corpus, because a read-speech (RS) corpus depends completely on the text chosen to elicit the sound that is recorded for speech technology and linguistics. As a result, the text structure affects the audio speech that is collected, because the structure and context of a text differ from language to language. The structure of an English text is shown in appendix-A- to illustrate the classification of text in this language; according to previous studies, most languages have broadly similar contexts.
Besides the structure and context of the text, some basic features of a text influence the type of tool and the researchers' results, for instance the gender and age of the author, the period and location of publication, the language variety, etc. (Holmes-Higgin et al., 2004).
2.1.2 Speech-Corpus
A Speech-Corpus is also called a ‘spoken corpus’; like the word Corpus itself, it has several definitions. A speech corpus is a collection of speech preserved as captured audio. Such collections are helpful for linguistic studies and for developing speech software (wiseGEEK, 2003). Furthermore, Gibbon, Moore and Winski (1997) stated that “any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organizations” is known as a spoken corpus. The term is also defined by Crystal (1991) as “a collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language”. These definitions show that a speech corpus comprises a database of speech audio files and text transcriptions, and many other references give similar definitions. Generally, speech corpora are used in two areas. In speech technology, they are used for speech recognition. In linguistics, a spoken corpus is used to study phonetics, conversation analysis, dialectology and other related fields (McEnery & Wilson, 1996). This section therefore discusses the advantages of speech corpora and some of the ways in which they are used.
In general, the purpose of a speech corpus, such as a corpus of spontaneous speech, is to supply real-life examples of a particular language and to act as a useful tool for accessing the language as it is actually used. In addition, digitized speech data facilitate pedagogic analysis of the corpus and the development of linguistic teaching materials. A speech corpus thus serves as a method that helps individuals who intend to learn a foreign language. The method includes designing, compiling, and annotating corpora. As a result, speech corpora have been used as example material in the classroom as well as to examine learners (Moreno-Sandoval, Campillos, Dong et al., 2012).
Moreover, Trippel (2008) indicated that speech corpora are used in several ways and in several fields of linguistics. For example, they are used for research on the phonetics and phonology, morphology, lexicography, syntax, semantics, pragmatics and sociolinguistics of a particular language. They also support quantitative statements about the language, for example by counting the words in it. In phonetics, speech corpora help to deal with real speech sounds, the properties of speech sounds, physical properties, possible sound combinations and transcription tasks.
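A quantitative statement of the kind mentioned above can be as simple as a word-frequency count over a transcription. The following Java sketch is only an illustration: the class name WordCounter is hypothetical, and the tokenization rule (lower-casing and splitting on anything that is not a letter, a digit or an apostrophe) is an assumption that a real corpus study would replace with language-specific rules.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts how often each word occurs in a transcription.
final class WordCounter {
    static Map<String, Integer> count(String transcript) {
        Map<String, Integer> freq = new TreeMap<>();
        // Split on runs of characters that are not letters, digits or apostrophes.
        for (String token : transcript.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

For the sentence "The cat saw the other cat." this gives a count of 2 for "the" and "cat" and a count of 1 for "saw" and "other".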
Corpora may also be used for research in other directions. As discussed, speech corpora are considered reliable in speech technology. Gibbon et al. (1997) summarized the sources of information that a speech corpus commonly provides, such as transduced signals (i.e. the acoustic speech signal), assessment parameters such as assessment results, descriptors such as speaker profiles, and some additional information. Speech corpora are also used for constructing the acoustic models of a speech recognition engine.
For these reasons, this report discusses how to develop a web-based tool that collects a large speech corpus from read text. As Gibbon et al. (1997) emphasized, a speech corpus depends on the profile of the speakers who participate as volunteers (such as gender, age, place of birth, etc.), the place of recording, the type of microphone, and the type of speech material. The system should be developed according to these conditions and should be able to retrieve this information to help the researcher during analysis of the data.
2.2 Data Speech Classifications
The previous section established that text-speech corpora include both speech and text corpora, held in a database that stores the text and the captured audio. There are several techniques for collecting the data used to create corpora, and corpora are classified according to the system used to accumulate the speech data and the associated information.
According to Vianaz et al. (1998), speech data collection is divided into several types, and this classification is generally determined by factors related to the method through which the data are collected. This section investigates some of the dimensions that affect the process of collecting data.
2.2.1 Data speech clarity or visibility
There is no doubt that data clarity and data visibility are significant for collecting any type of data. In general, data clarity is a systematic approach to continuously improving and maintaining the health of data. The analysis of a data corpus depends on the clarity of diction in the recorded audio. Moreno-Sandoval et al. (2012) reported that the clarity of diction affects the analysis of the data. The clarity and visibility of corpus data are determined by the speed of the data and the profile of the people who take part in the data collection; a tool should therefore capture audio at high quality to assist the analyst.
On the other hand, data clarity or visibility is directly related to data policy, that is, whether the recipient has permission to use the data or the data are kept secret. For instance, NeoSpeech, Inc. recognizes that privacy is important. Its privacy policy applies to a participant's use of the NeoSpeech Text-To-Speech Web Service, subject to the web service terms of use. By using the NS TTS web service, participants allow NeoSpeech to collect and use their personal information (NeoSpeech, 2011).
2.2.2 Data speech environment
The quality of the collected data depends on the environment in which the data are recorded. For example, the data may be collected in a laboratory, in a studio, over the telephone, etc. Speech data may be recorded in an anechoic room, in private homes or offices, or elsewhere. The location certainly affects the speech data (Vianaz et al., 1998).
Experiments continue to look for ways of using standard speech corpora for development and evaluation. Aijun, Xiaoxia, Guohua et al. (2000) focused on the method and the environment needed to collect a data corpus according to the goal of the analysis; for instance, they gathered and annotated casual speech recorded in normal rooms. Standard corpora are one of the major factors behind progress in automatic speech processing, particularly in speech and speaker recognition. Possibly the major advantage of using standard corpora is that they allow researchers to compare the performance of different techniques on common data, making it easier to determine which approaches are most promising to pursue (Campbell and Reynolds, 1999). For this purpose, a suitable acoustic environment is needed to record speech data. The recording process changes with the technique used to collect the data. For example, collecting data in a lab or studio differs greatly from collecting data over the telephone or in ordinary rooms not built for this purpose (I'd Rather Be Writing, 2010). Collecting speech in a lab or studio helps experimenters to control the audio and minimizes the chance of gathering noisy data; noise can still be added after the data have been collected.
The type of microphone also varies with the environment: high-quality microphones and portable recorders have been used to capture audio in a range of different environments (Ma, Milner and Smith, 2006). According to I'd Rather Be Writing (2010), obtaining accurate sound requires finding and preparing a suitable environment, known as an acoustic room. Such a room has special features for recording audio, for instance cloth panelling on the walls, isolation from other people, a lockable door, and no windows.
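The remark that noise can be added after clean data have been collected can be made concrete with a short Java sketch. It is an illustration under stated assumptions rather than a method taken from the cited sources: the class NoiseMixer is hypothetical, the signals are arrays of samples of equal length, and the noise is scaled so that the signal-to-noise ratio, defined as 10·log10(P_signal / P_noise) with P the mean squared amplitude, equals a chosen value in decibels.

```java
// Hypothetical helper: mixes noise into a clean recording at a chosen SNR.
final class NoiseMixer {
    // Mean power (mean squared amplitude) of a signal.
    static double power(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v * v;
        }
        return sum / x.length;
    }

    // Scales the noise so that 10*log10(P_clean / P_scaledNoise) == snrDb, then adds it.
    static double[] mix(double[] clean, double[] noise, double snrDb) {
        if (clean.length != noise.length) {
            throw new IllegalArgumentException("signals must have the same length");
        }
        double targetNoisePower = power(clean) / Math.pow(10.0, snrDb / 10.0);
        double gain = Math.sqrt(targetNoisePower / power(noise));
        double[] out = new double[clean.length];
        for (int i = 0; i < clean.length; i++) {
            out[i] = clean[i] + gain * noise[i];
        }
        return out;
    }
}
```

Recording in a controlled room and adding noise afterwards in this way keeps the clean version available, whereas noise captured during recording cannot be removed reliably.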
2.2.3 Data speech control
This dimension concerns the way audio is recorded, or the interaction between participants and the tool: for instance, speech may be collected as random speech, spontaneous dialogue, an interview or a recording of read text. Random and spontaneous speech play a major role in speech corpora for speech recognition technology. Improving speech recognition tools depends significantly on raising recognition performance for spontaneous speech. Accordingly, large spontaneous speech corpora need to be built for establishing acoustic and language models.
In addition, various tools have been developed on top of voice application systems. Lane et al. (2010) discussed and gave examples of corpus collection tools, describing them as web-based, lab-based and smartphone-based systems. The methods of capturing speech differ in some respects and are similar in others. For instance, a web-based system allows recordings to be collected remotely on PCs, with volunteers accessing the URL of the tool from their own location at any time. This type of system helps experimenters to collect a large amount of data from any place, but the recording cannot be controlled. A lab-based system, by contrast, is controlled during data collection, but it is hard for experimenters to find a large number of volunteers to take part in collecting the corpus.
That is why experiments focus on advances in spontaneous speech and on methods of collecting random and spontaneous speech. Because spontaneous speech has specific characteristics, such as random pauses, corrections, hesitations and repetitions, new technologies are required to recognize it. For example, a technique called automatic summarization includes indexing and a process that extracts the important and reliable parts of the automatic transcription (Furui, 2005).
2.2.4 Data speech monitoring and validation
Monitoring of speech data is implemented through online or offline processes. Online monitoring helps to control the data during recording and to adjust technical and phonetic characteristics on-line, i.e. during the process of capturing. Offline monitoring is performed after data collection to check the data and separate usable data from useless data; it checks the recorded speech after it has been uploaded by the users (Vianaz et al., 1998).
Validation in a software system takes place during testing, after the system has been developed. Validation of corpus data is similar, because a stakeholder needs to validate the data after the corpus has been captured (LREC Workshop, 2004). For a web-based tool, the analyst should therefore check the data offline once the recording process is complete. Monitoring, on the other hand, means observing and checking the system during the process (Tarala, 2011). Monitoring in speech corpus collection is the same as in a software system: the stakeholder checks the system during the collection process. This is possible for a lab-based system because volunteers are supervised by the developer during the procedure.
Validation is therefore closely related to a posteriori evaluation of the recorded material. For instance, the speech data of a read corpus are recorded in a studio and monitored by someone outside the recording room, who has audio contact with the speaker. This situation is very difficult to reproduce in a web-based software system, because the recordings cannot be supervised by individual monitors and the time of recording is not limited. Control of phonetic characteristics is likewise limited to cases in which the pronunciation is not naturally assumed. The exception is self-monitoring while reading sentences: the participant has the chance to listen to the recorded sentences and to correct a misread part by rereading it. Validation was used instead of monitoring in telephone recording and dialogue (Russell, Corley, and Lickley, 2011).
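Part of the offline check that separates usable recordings from useless ones can be automated. The Java sketch below is one possible approach and not a procedure described in the cited sources: the class RecordingValidator is hypothetical, the input is assumed to be 16-bit PCM samples, and the thresholds (a minimum duration, an RMS level below 1% of full scale counted as silence, and more than 1% of samples at the extreme values counted as clipping) are illustrative assumptions.

```java
// Hypothetical offline check for uploaded 16-bit PCM recordings.
final class RecordingValidator {
    // Root-mean-square level relative to full scale (0.0 = silence, about 1.0 = maximum).
    static double rms(short[] samples) {
        double sum = 0.0;
        for (short s : samples) {
            double v = s / 32768.0;
            sum += v * v;
        }
        return Math.sqrt(sum / samples.length);
    }

    // Fraction of samples sitting at the extreme values, a sign of clipping.
    static double clippedFraction(short[] samples) {
        int clipped = 0;
        for (short s : samples) {
            if (s == Short.MAX_VALUE || s == Short.MIN_VALUE) {
                clipped++;
            }
        }
        return (double) clipped / samples.length;
    }

    // The thresholds below are assumptions chosen for illustration only.
    static boolean isUsable(short[] samples, int sampleRate, double minSeconds) {
        if (samples.length == 0 || samples.length < minSeconds * sampleRate) {
            return false; // empty or too short
        }
        if (rms(samples) < 0.01) {
            return false; // practically silent
        }
        return clippedFraction(samples) <= 0.01; // reject heavily clipped audio
    }
}
```

A check of this kind only filters out technically unusable files; whether the speaker actually read the prompted text still has to be validated by a person or by a separate process.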
2.2.5 Data speech channel
Generally, the channel used for recording speech data is of two main types. The first, a single channel, uses a single microphone to record the speech, while the second, multiple channels, uses more than one microphone (Vianaz et al., 1998). An audio channel is a single path of audio data. Multi-channel audio is any audio which uses...