Minimal Transcript Coding(1) (revised 7/24/01)

Clifton Pye
© 2001


Coding language transcripts for computer analysis can be a daunting task. Researchers can code phonetic, morphological, syntactic, semantic and pragmatic information and still not capture every nuance of spontaneous language use. A single five-minute section of a language sample can easily absorb hours of time in the coding process with no guarantee that the coding will ever be complete. What is worse, the attempt to capture the maximal amount of information in the code will lead to the maximum number of coding errors as the coders (human or computer) try to determine which codes apply to a particular section of the sample. Coding errors come in two types--omission and commission. Checking for such errors adds further hours to the coding process.


I have developed a technique for transcript coding that I refer to as 'minimal coding'. The goal of minimal coding is to reduce transcript coding to the minimum necessary to extract the maximal amount of information via computer. My interests have centered around the morphological and syntactic structure of children's sentences. Minimal coding allows me to assess the production of all morphology in obligatory contexts as well as the syntactic and semantic realization of various verb argument structures. These basic results can be analyzed further to determine what interactions might exist between morphology and syntax or between semantics and discourse.


In this paper I describe the minimal coding approach and provide examples of coded language transcripts and the resulting output. I compare the minimal coding approach with more exhaustive approaches (e.g., CHAT; MacWhinney 1995) to provide readers with a better understanding of the strengths and weaknesses inherent in minimal coding.


Minimal Coding


The primary goal of minimal transcript coding is to provide a computer-readable transcript with a minimum of coding. This approach minimizes the amount of coding by relying on computer-intensive transcript processing. Fortunately, computer programs are now widely available that allow researchers to perform these analyses fairly easily. I developed minimal coding in the course of developing my own transcript analysis software (PAL). This software is available online, and will perform the types of analyses I describe.


Minimal coding starts with a transcription of the language sample that includes at least a speaker identification label for each speaker in the sample. I have adopted the convention that all utterances in the transcript begin with a speaker code that consists of a series of upper or lower case letters followed by a colon, and separated by a following space from the speaker's utterance. I provide an example of an English child language transcript in (1) that follows these conventions. The label 'chi:' identifies the child's original utterances and 'res:' identifies another speaker. Researchers commonly add other information to their transcripts such as information about the recording and comments. All lines that begin with an asterisk ('*') indicate information that will not be processed by the analysis programs, e.g., comment lines and information about the child and the transcription. The transcription can also include blank lines. Information about doubtful transcriptions or interpretations should be enclosed in parentheses or angle brackets ('< >'). Comments about the utterances should be enclosed in square ('[ ]') or curly ('{ }') brackets as shown after the child's first utterance in (1).


(1) Sample of a raw transcript (modified from a sample in Brown 1973)


*Testfile


*This file can be used to test the minimal coding programs.


chi: one (busses). [? bosses]
res: one.
chi: two busses.
chi: I saw two training.
chi: I talk about some trains.
chi: I ( ) cut dis.
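

The line conventions described above can be sketched in a few lines of Python. This is an illustration only and is not taken from PAL: the function name classify_line, the kind labels, and the two regular expressions are assumptions read off the conventions in the text.

```python
import re

# Assumed patterns, read off the conventions in the text (not taken from PAL).
SPEAKER = re.compile(r"^([A-Za-z]+): (.*)$")   # letters, colon, space, utterance
TIER = re.compile(r"^(%[A-Za-z]+): (.*)$")     # coding tier such as '%mor:'

def classify_line(line):
    """Return (kind, label, text) for one transcript line."""
    stripped = line.strip()
    if not stripped:
        return ("blank", None, "")
    if stripped.startswith("*"):
        # Information lines are kept out of the analysis.
        return ("info", None, stripped[1:].strip())
    for kind, pattern in (("utterance", SPEAKER), ("tier", TIER)):
        match = pattern.match(stripped)
        if match:
            return (kind, match.group(1), match.group(2))
    return ("other", None, stripped)

for line in ["*Testfile", "", "chi: one (busses). [? bosses]", "res: one."]:
    print(classify_line(line))
```

A program that works this way can skip the 'info' and 'blank' lines and hand only the utterance lines to later steps.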


I prefer to add coding to a copy of the subject's original utterances rather than marking the original. This practice allows me to easily read the original utterances and quickly assess the accuracy of the coding. The easiest way to produce a code for the original utterance is to copy the utterance into a new line below the original. This line requires its own label to allow the computer to find it quickly. I have adopted the convention of adding the label '%mor:' for 'morphology tier', but any label will be adequate. This step is easily accomplished by computer. The program 'addtier.exe' simply copies a speaker's original utterance and inserts a '%mor:' label in place of the speaker's label. It also deletes any comments enclosed in square or curly brackets that follow the speaker's utterance. Adding '%mor:' tiers for all the subjects allows all of the utterances in the transcript to be analyzed, but does not require that the utterances for every speaker be coded. The researcher should determine which speakers are to be coded for any given analysis. I provide an example of an English child language sample that is ready for coding in (2).


(2) Sample of a transcript with an additional coding tier


*Testfile


*This file can be used to test the minimal coding programs.


chi: one (busses). [? bosses]
%mor: one (busses).
res: one.
%mor: one.
chi: two busses.
%mor: two busses.
chi: I saw two training.
%mor: I saw two training.
chi: I talk about some trains.
%mor: I talk about some trains.
chi: I ( ) cut dis.
%mor: I ( ) cut dis.
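

The tier-adding step can be approximated as follows. This is a minimal sketch of the behavior described for 'addtier.exe', not its actual code: the function name add_tiers and both regular expressions are assumptions, and the sketch removes bracketed comments wherever they occur in the utterance.

```python
import re

# Assumed speaker-label pattern: letters, a colon, a space, then the utterance.
SPEAKER = re.compile(r"^([A-Za-z]+): (.*)$")
# Comments in square or curly brackets, plus any whitespace before them.
COMMENT = re.compile(r"\s*(\[[^\]]*\]|\{[^}]*\})")

def add_tiers(lines, label="%mor:"):
    """Copy each utterance onto a new coding tier directly below it."""
    out = []
    for line in lines:
        out.append(line)
        match = SPEAKER.match(line)
        if match:
            utterance = COMMENT.sub("", match.group(2)).strip()
            out.append(f"{label} {utterance}")
    return out

raw = ["*Testfile", "", "chi: one (busses). [? bosses]", "res: one."]
print("\n".join(add_tiers(raw)))
```

Information lines and blank lines pass through unchanged because they do not match the speaker pattern.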


The transcript in (2) is ready for minimal coding. I limit coding to identifying the underlying morphemes and verbs as well as making some distinctions between different types of morpheme errors. I insert a hyphen ('-') between bound morphemes and add a slash ('/') in front of verb roots. I use an asterisk ('*') to indicate a morpheme that was omitted in an obligatory context and an exclamation mark ('!') to indicate inappropriate or overgeneralized morphemes. I use the pound sign ('#') to indicate a morpheme that appears in the obligatory context for another morpheme. The final utterance in the transcript shows that it is sometimes necessary to add material to the original utterance. In this case, I interpret the child's production of the irregular verb 'cut' to be in the past tense. Since this verb does not have an overt past tense morpheme, I added a past tense marker in the morpheme tier to indicate that the verb has past tense. I did not regularize the child's production of the word 'this'. Such alterations are only necessary if the morpheme cannot be distinguished from another morpheme by its surface form. By this standard, it would be necessary to distinguish between the present and past tense forms of the verb 'cut,' but not necessary to alter the child's production of 'this.' I regularized the past tense form of the verb 'saw' to enable the analysis program to group all uses of the verb 'see' together in one place. All irregular uses of past tense verbs will be grouped together under the heading 'PAST' while all regular uses of past verbs will be grouped together under the heading 'ed'. Applying these principles to the sample transcript in (2) yields the minimally coded transcript shown in (3).


(3) Transcript with minimal coding


*Testfile


*This file can be used to test the minimal coding programs.


chi: one (busses). [? bosses]
%mor: one (bus-!s).
res: one.
%mor: one.
chi: two busses.
%mor: two bus-s.
chi: I saw two training.
%mor: I /see-PAST two train-#s.
chi: I talk about some trains.
%mor: I /talk-*ed about some train-s.
chi: I ( ) cut dis.
%mor: I ( ) /cut-PAST dis.
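

To make the marker conventions concrete, the following Python sketch splits a coded tier into roots and morphemes. It is an illustration under stated assumptions rather than the PAL routine: the names parse_word and parse_tier and the status labels (including 'substituted' for '#') are hypothetical, and the sketch assumes the root comes first in each word, as in the English sample.

```python
# Assumed status labels for the three error markers described in the text.
MARKERS = {"*": "omitted", "!": "inappropriate", "#": "substituted"}

def parse_word(word):
    """Return (root, is_verb, [(morpheme, status), ...]) or None for empty tokens."""
    word = word.strip("().,?")          # drop punctuation and doubt parentheses
    if not word:
        return None
    is_verb = word.startswith("/")      # a slash marks a verb root
    root, *affixes = word.lstrip("/").split("-")
    morphemes = []
    for affix in affixes:
        if not affix:
            continue
        if affix[0] in MARKERS:
            morphemes.append((affix[1:], MARKERS[affix[0]]))
        else:
            morphemes.append((affix, "correct"))
    return (root, is_verb, morphemes)

def parse_tier(coded):
    """Parse every word on a coded tier, skipping empty tokens."""
    return [p for p in map(parse_word, coded.split()) if p is not None]

print(parse_tier("I /talk-*ed about some train-s."))
```

A language with prefixes would need a way to identify the root other than taking the first piece of each word.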


The transcript in (3) is now ready for analysis. One typical analysis is to determine the proportion of grammatical morphemes that were produced in their obligatory contexts. The best way to determine this proportion from a minimally coded transcript is to produce a morphological concordance from the morphological tier. The concordance should display every morpheme the subject produced as well as each of the subject's utterances that contained the morpheme. Adding a transcript identifier makes it easier to locate the concordance forms in the original transcript. I also like to display the coded tier as an aid to the analysis. This makes it easy to see where the child substituted one morpheme for another. A concordance for the transcript that provides information about the child's inflectional morphology is shown in (4).


(4) Output from a morphological concordance


PAST
testfil chi: I saw two training. (= I /see-PAST two train-#s.)
testfil chi: I ( ) cut dis. (= I ( ) /cut-PAST dis.)


ed
testfil chi: I talk about some trains. (= I /talk-*ed about some train-s.)


s
testfil chi: one (busses). [? bosses] (= one (bus-!s).)
testfil chi: two busses. (= two bus-s.)
testfil chi: I saw two training. (= I /see-PAST two train-#s.)
testfil chi: I talk about some trains. (= I /talk-*ed about some train-s.)
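

A concordance of this kind can be built with a short script. The sketch below is one possible approach, not the PAL implementation: the function name morpheme_concordance, the pairing of each speaker line with the '%mor:' line directly beneath it, and the dictionary layout are all assumptions.

```python
from collections import defaultdict

def morpheme_concordance(lines, file_id, speaker="chi"):
    """Map each bound morpheme to the utterances whose coded tier contains it."""
    concordance = defaultdict(list)
    utterance = None
    for line in lines:
        if line.startswith(speaker + ": "):
            utterance = line
        elif line.startswith("%mor: ") and utterance is not None:
            coded = line[len("%mor: "):]
            seen = []
            for word in coded.split():
                # Everything after the first hyphen-separated piece is a morpheme.
                for affix in word.strip("().,?").split("-")[1:]:
                    name = affix.lstrip("*!#")
                    if name and name not in seen:
                        seen.append(name)
            for name in seen:
                concordance[name].append(f"{file_id} {utterance} (= {coded})")
            utterance = None
        else:
            utterance = None
    return dict(concordance)

transcript = [
    "chi: one (busses). [? bosses]", "%mor: one (bus-!s).",
    "res: one.", "%mor: one.",
    "chi: two busses.", "%mor: two bus-s.",
    "chi: I saw two training.", "%mor: I /see-PAST two train-#s.",
    "chi: I talk about some trains.", "%mor: I /talk-*ed about some train-s.",
    "chi: I ( ) cut dis.", "%mor: I ( ) /cut-PAST dis.",
]
conc = morpheme_concordance(transcript, "testfil")
for heading in sorted(conc):
    print(heading)
    print("\n".join(conc[heading]), end="\n\n")
```

Only the speaker named in the speaker argument is included, which mirrors the idea that the researcher chooses which speakers to analyze.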


This output can be copied to a spreadsheet program for further analysis. I provide an example of such an analysis in (5). The spreadsheet program can then compute the total number of times the morpheme was used correctly and the total number of obligatory contexts for the morpheme. Researchers can easily return to these spreadsheets to check particular utterance codes or to refine their analysis. For example, a simple cut and paste with this spreadsheet would allow an investigator to quickly separate the plurals into regular and irregular forms.


(5) Spreadsheet analysis of concordance data
Plural Use
Correct  Omitted  Inappropriate  Utterance
                  1              one busses. (= one bus-!s.)
1                                two busses. (= two bus-s.)
                  1              I saw two training. (= I /see-PAST two train-#s.)
1                                I talk about some trains. (= I /talk about some train-s.)
2                 2              Totals
100%                             Percent correct in obligatory contexts (= C / (C + O))
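
The same totals can be computed directly from the coded forms instead of in a spreadsheet. The Python sketch below is an illustration only: the names tally and percent_correct are hypothetical, and the decision to count '#' forms separately and leave them out of the denominator is an assumption that follows the formula C / (C + O).

```python
import re

# Assumed mapping from the marker in front of a morpheme to a tally column.
MARKS = {"": "correct", "*": "omitted", "!": "inappropriate", "#": "substituted"}

def tally(coded_utterances, morpheme):
    """Count how often a morpheme appears with each marker on the coded tiers."""
    counts = {status: 0 for status in MARKS.values()}
    # A hyphen, an optional marker, the morpheme, and no further letters.
    pattern = re.compile(r"-([*!#]?)" + re.escape(morpheme) + r"(?![A-Za-z])")
    for coded in coded_utterances:
        for mark in pattern.findall(coded):
            counts[MARKS[mark]] += 1
    return counts

def percent_correct(counts):
    """Percent correct in obligatory contexts, C / (C + O); None if no contexts."""
    contexts = counts["correct"] + counts["omitted"]
    return None if contexts == 0 else 100.0 * counts["correct"] / contexts

coded = [
    "one (bus-!s).",
    "two bus-s.",
    "I /see-PAST two train-#s.",
    "I /talk-*ed about some train-s.",
]
plural = tally(coded, "s")
print(plural, percent_correct(plural))
```

An analyst who treats substituted forms as missed obligatory contexts would add them to the denominator instead.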

Researchers often need to combine data from two or more transcriptions to supply a sufficient number of contexts for morpheme use. I often merge my concordances together for this purpose. I provide the program 'merge.exe' to combine the data in two concordances. For illustration purposes I copied the information in the file shown in (3) to a new file that I labeled 'testfil2'. I relabeled the original file 'testfil1'. The merge program produced the output shown in (6) for the plural morpheme.


(6) Output of the merger for concordance files 'testfil1.con' and 'testfil2.con'


s
testfil1 chi: one (busses). [? bosses] (= one (bus-!s).)
testfil1 chi: two busses. (= two bus-s.)
testfil1 chi: I saw two training. (= I /see-PAST two train-#s.)
testfil1 chi: I talk about some trains. (= I /talk-*ed about some train-s.)
testfil2 chi: one (busses). [? bosses] (= one (bus-!s).)
testfil2 chi: two busses. (= two bus-s.)
testfil2 chi: I saw two training. (= I /see-PAST two train-#s.)
testfil2 chi: I talk about some trains. (= I /talk-*ed about some train-s.)
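

Merging amounts to collecting the entries from each concordance under a shared heading. The sketch below shows one way to do this in Python. It is not the code of 'merge.exe': the function names are hypothetical, and the parser assumes that a heading is a single token on its own line, as in the output shown above.

```python
def parse_concordance(text):
    """Read concordance text into a dict of heading -> list of entry lines."""
    entries = {}
    heading = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if " " not in line:            # assumed: a heading is a single token
            heading = line
            entries.setdefault(heading, [])
        elif heading is not None:
            entries[heading].append(line)
    return entries

def merge_concordances(*concordances):
    """Combine concordances, keeping the entries in the order the files are given."""
    merged = {}
    for concordance in concordances:
        for heading, lines in concordance.items():
            merged.setdefault(heading, []).extend(lines)
    return merged

first = parse_concordance("s\ntestfil1 chi: two busses. (= two bus-s.)\n")
second = parse_concordance(
    "s\ntestfil2 chi: two busses. (= two bus-s.)\n\n\n"
    "ed\ntestfil2 chi: I talk about some trains. (= I /talk-*ed about some train-s.)\n"
)
merged = merge_concordances(first, second)
for heading, lines in merged.items():
    print(heading)
    print("\n".join(lines), end="\n\n")
```

Because each entry carries its transcript identifier, the merged entries can still be traced back to their source files.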


Adding a marker in front of verb roots allows investigators to quickly extract basic information about the syntactic and semantic structure of the subject's sentences. Such information can be obtained from a concordance of the subject's utterances that contain verbs, such as the one shown in (7).


(7) Output of verb concordance


cut
testfil chi: I ( ) cut dis. (= I ( ) /cut-PAST dis.)


see
testfil chi: I saw two training. (= I /see-PAST two train-#s.)


talk
testfil chi: I talk about some trains. (= I /talk-*ed about some train-s.)
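

Because every verb root is preceded by a slash, a verb concordance needs only a simple pattern search. The Python sketch below illustrates the idea. It is not the PAL routine: the name verb_concordance, the regular expression, and the input format (pairs of original utterance and coded tier) are assumptions.

```python
import re
from collections import defaultdict

# A slash followed by the root, which ends at a hyphen, space, or punctuation.
VERB_ROOT = re.compile(r"/([^\s().,?-]+)")

def verb_concordance(pairs, file_id):
    """Map each verb root to the utterances whose coded tier contains it."""
    concordance = defaultdict(list)
    for utterance, coded in pairs:
        for root in VERB_ROOT.findall(coded):
            concordance[root].append(f"{file_id} {utterance} (= {coded})")
    return dict(concordance)

pairs = [
    ("chi: two busses.", "two bus-s."),
    ("chi: I saw two training.", "I /see-PAST two train-#s."),
    ("chi: I talk about some trains.", "I /talk-*ed about some train-s."),
    ("chi: I ( ) cut dis.", "I ( ) /cut-PAST dis."),
]
verbs = verb_concordance(pairs, "testfil")
for root in sorted(verbs):
    print(root)
    print("\n".join(verbs[root]), end="\n\n")
```

Utterances without a marked verb root, such as 'two busses.', simply do not appear in the result.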


If one were interested in the argument structure of the subject's utterances, it would again be possible to copy the verb concordance to a spreadsheet for additional coding (see 8). I provide separate columns in the table for utterances in which the child produced or omitted the sentence subject. Other options can be added where necessary. For example, it would be easy to add another column to count the number of times the child used pronouns versus noun phrases as sentence subjects. Analyzing syntactic data in this fashion also reveals rather quickly whether specific verbs are more productive or more restricted in the child's grammar.


(8) Analysis of verb argument structure
Verb argument structure
SVO    *SVO
saw
1             I saw two train. (= I /saw two train-*s.)
talk
1             I talk about some trains. (= I /talk about some train-s.)
cut
1             I ( ) cut dis. (= I ( ) /cut-ed dis.)
3             Total
100%          Percent SVO (= SVO / (SVO + *SVO))

*Indicates the S was not produced in an obligatory context.


Minimal Versus Maximal Coding


The coding scheme any investigator employs is ultimately determined by the type of analysis the investigator wishes to conduct. Minimal coding is flexible enough to provide quick analyses for a wide range of research issues. I have used the approach to investigate children's productions of morphemes, basic word orders, lexical development, and interactions between aspect and verb agreement. Minimal coding will also work with a wide array of language types, from the isolating types such as English, Mandarin and Thai, to polysynthetic languages like K'iche' Maya.


It is best to consider the advantages of minimal coding in contrast with maximal coding approaches. Maximal coding would require the investigator to develop codes for each aspect of the transcript they wished to investigate. In (9) I provide an example of what a maximally coded version of the sample transcript might look like.


(9) A Maximally Coded Transcript


chi: one (busses). [? bosses]
%mor: Num (N-!pl_reg).
res: one.
%mor: Num.
chi: two busses.
%mor: Num N-pl.
chi: I saw two training.
%mor: Pro_1sgnom/subj V_past_irreg Num N-nom_for_pl/obj.
chi: I talk about some trains.
%mor: Pro_1sgnom/subj V_pres Part Quant N-pl/obj.
chi: I ( ) cut dis.
%mor: Pro_1sgnom/subj ( ) V_past_irreg Dem/obj.


Maximal coding is extremely tedious to produce and very difficult to check against the subject's original utterances. It is easy to omit indicators of omitted or inappropriate morphology and very difficult for a coder to keep track of the individual coding conventions. The advantage of maximal coding is that once it has been completed, it is easy to use a computer to count the number of first person singular nominative pronouns used as subjects, etc. Researchers must decide whether it is more efficient to produce a maximum amount of coding at the beginning of transcript analysis to allow simple computer searches, or minimal coding that requires more complex computer searches. It may be possible to provide the computer with a lexicon that permits some automation of the coding process, but this step requires the production of the lexicon and subsequent checking of the computer's output for coding errors of both types. In practice, computer tagging still requires the investigator to check every line of a transcript for coding errors so there is no clear advantage to the use of computer coding methods. Another advantage of maximal coding is that it allows the investigator to preserve all of the coding in the transcription. I think this quickly turns into a disadvantage if the transcript cannot be read easily. It is just as easy to keep the analyses produced by minimal coding together with the minimally-coded transcription in a folder on a computer.
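

The kind of simple search that maximal coding permits can be illustrated with a short Python sketch. The function name count_tags is hypothetical, and the tag strings are the illustrative ones from the maximally coded sample, not an established tag set.

```python
from collections import Counter

def count_tags(lines):
    """Count every tag that appears on the coding tiers of a transcript."""
    counts = Counter()
    for line in lines:
        if line.startswith("%mor"):
            # Drop the tier label (with or without a colon), then count each tag.
            for token in line.split()[1:]:
                tag = token.strip("().")
                if tag:
                    counts[tag] += 1
    return counts

coded = [
    "%mor: Num (N-!pl_reg).",
    "%mor: Num.",
    "%mor: Num N-pl.",
    "%mor: Pro_1sgnom/subj V_past_irreg Num N-nom_for_pl/obj.",
    "%mor: Pro_1sgnom/subj V_pres Part Quant N-pl/obj.",
    "%mor: Pro_1sgnom/subj ( ) V_past_irreg Dem/obj.",
]
tags = count_tags(coded)
print(tags["Pro_1sgnom/subj"])
```

The search itself is trivial. The cost lies in producing and checking a tag for every word before the search can be run.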


I find it necessary to remind investigators who are using the minimal coding technique for the first time to think of a way that they can perform an analysis with minimal coding. It is difficult for investigators to break a habit of supplying a new code or gloss for every morpheme in a transcript, but maximal coding can add years of time to transcript analysis with little to show for the investment. In adopting the minimal coding practice it is necessary to adopt the philosophy of trying a minimal coding before resorting to any further coding distinctions.


References


Brown, R. 1973. A First Language. Cambridge, MA: Harvard University Press.


MacWhinney, B. 1995. The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Erlbaum.


Pye, C. 1985. PAL: Pye Analysis of Language. Ms. The University of Kansas.


Endnote


1. During the development of the PAL routines I had the benefit of many conversations with David Ingram on the analysis of child language and practical methods for establishing phonological, morphological and syntactic productivity. The PAL routines directly incorporate his work in this area of research. I would like to thank Robert Hsu for introducing me to the Spitbol programming language and its many advantages over BASIC in the area of text analysis. Conversations with Penny Brown, Sonia Eisenbeiss, and Bhuvana Narasimhan at the Max Planck Institute for Psycholinguistics encouraged me to write down my ideas about minimal coding. John Haviland suggested some additional features to enhance the programs' output.