Table of Contents
- Language Word List Database
- Query Word Lists
- Submitting Word Lists
- Editing Word Lists
- Regular Expressions
- Bug/Feature Tracker
Didn't find the answer?
Visit our FAQ page to see answers to common questions, or contact us with your question.
ComparaLex Help
Language Word List Database
Metadata - Data about data
The website database stores specific information about each language word list. This information is provided by the registered users of ComparaLex and checked by the administrators. Here is an explanation of each field:
| 1 | Language name | Usually the language name will be used here. You can have more than one word list for each language but the name field for each must be unique. |
| 2 | Country | The country where the language is spoken |
| 3 | Region | Description of the region where the language is spoken (e.g. north-east corner, near a city, etc.) |
| 4 | ISO code | The 3 letter ISO 639-3 code for the language (see www.ethnologue.org) |
| 5 | Language family | The name of the greater language family that the language belongs to (see www.ethnologue.org/family_index.asp) |
| 6 | Standard list used | The standard word list used for elicitation (e.g. Swadesh, Comparative African Word List, etc.) |
| 7 | Date elicited | Approximate date the word list was elicited |
| 8 | Acknowledgments | Organizations and/or individuals who should be acknowledged regarding this word list |
| 9 | Elsewhere published | Any other places where this word list is published |
| 10 | Comments | Any other information that may be helpful to understand the data. |
| 11 | Contributing researcher(s) | The name of the researcher who prepared this data. |
| 12 | Copyright holder | Individual/organization that owns the copyright for the word list data and media files. |
| 13 | May we display the researcher's name? | If 'Yes' then the name of the researcher will appear with the language word list information. Otherwise it will be hidden. |
| 14 | May we display the audio recordings? | If 'Yes' then the audio recordings will be displayed for the public. Otherwise they will be hidden. |
| 15 | Do you have permission of the copyright holder to submit this word list? | If 'No' please provide an explanation why this data should be considered for publishing in #17. |
| 16 | Do you have written permission from your data sources to distribute transcriptions and recordings of their utterances? | Answer 'Yes' or 'No'. If 'No' please provide an explanation why this data should be considered for publishing in #17. |
| 17 | If 'No' to #15 or #16 please provide an explanation why this data should be considered for publishing | Here you provide an explanation for #15 and/or #16. |
| 18 | Are you finished editing the word list? | Answer 'Yes' or 'No'. If 'No' then you cannot answer 'Yes' to #19. |
| 19 | Do you agree to the terms of service and grant permission to CanIL to publish your word list and associated sound files? | Answer 'Yes' or 'No'. Your word list cannot be approved for publishing until you answer 'Yes'. To review the terms of service click here |
Note: ComparaLex reserves the right to NOT publish submitted data on the ComparaLex website for any reason. Refer to the ComparaLex Terms of Service for more information.
Status
Before your word list can be published it needs to go through the approval process. You can view the status by clicking the "Edit My Account" button and viewing the status column of your word list. There are three stages:
Not finished | Indicates that you have not completed your data submission/edits and have not given us permission to publish the data on ComparaLex. |
Finished but not approved | Indicates that you have finished editing and given consent to publish your data but it has not yet been approved by the ComparaLex staff. |
Finished and approved | Indicates that a ComparaLex reviewer has approved the project. The project is now available for public use in ComparaLex. |
Note: If for some reason you want to remove your project from public access, you can change the status from green to red after logging in.
Data Fields
The website database stores specific information about each word. This information is provided by the registered users of ComparaLex and checked by the administrators. Here is an explanation of each field:
| 1 | Standard Word List ID | This is the reference code of the standard word list that was used for elicitation. |
| 2 | Gloss | The definition (in English) used for collection. |
| 3 | Audio | An audio file (.WAV and/or .MP3) of someone pronouncing the word. For more information, see the section on audio formats. |
| 4 | Phonetic | Phonetic representation of the word (segmental form only). |
| 5 | Phonemic | Phonemic representation of the word (segmental form only). |
| 6 | Phonetic Pitch | Surface pitch transcribed as numbers 1-7 (1=lowest, 7=highest). The numbers will be displayed graphically as lines in between square brackets. |
| 7 | Phonemic Tone | Analyzed Surface tone (representation is flexible but please explain). |
| 8 | Word Category | Noun, verb, etc. |
| 9 | Noun/Verb Class | Identify the noun or verb class of the word if the language has a noun or verb class system. |
| 10 | Phonetic Plural | Plural segmental form only. |
| 11 | Phonemic Plural | Plural segmental form only. |
| 12 | Plural Phonetic Pitch | Surface pitch. For more information, see the section on phonetic pitch |
| 13 | Plural Phonemic Tone | Analyzed surface tone. |
| 14 | Noun/Verb Class Plural | |
| 15 | Orthographic | Orthographic representation. |
| 16 | Comments |
Query Word Lists
In order to access the lexical data in the ComparaLex database you need to perform a query. This is done by specifying the criteria you want in the query form.
Language Data Selection
Here you see a list of all the languages that are in the ComparaLex database. The part in brackets is the code we have assigned for the standard word list that was used to elicit the data. The languages are sorted alphabetically and grouped by language family.
Languages marked with an asterisk (*) have not yet been approved by a ComparaLex administrator. An unapproved language appears only to the registered user who submitted it. Once it has been approved, the asterisk will not appear and the language will be available to all visitors.
You can select one or more language word lists to compare/view. To select more than one hold down the <ctrl> key and click on the ones you want. If you hold down the <shift> key you can select a beginning and ending and it will select all the items in between. ComparaLex will allow searches of up to 5 language word lists at once.
Quick Tip: If you just want to view all the words for one language, double-click on the language name.
Column/Field Selection
Here you select the fields you want to display for each language. For a description of all the fields see above. You can select more than one by holding down the <ctrl> key. If you hold down the <shift> key you can select a beginning and ending and it will select all the items in between.
Filter Selection
Here you can choose how you want the query results filtered. There are four options:
- Word List - This filters the results to match the standard word list you chose from the list. If the language doesn't have a word that matches the standard word list, blanks are inserted. In some cases there may be entire rows that don't have any language data. To hide these blank lines put a check on either Hide partially empty rows or Hide completely empty rows. You can select the English list gloss and/or French list gloss. You can restrict the results even more by typing in the standard word list id numbers in the input box. Ranges are accepted (eg. 3,6,12, 25-75, 77).
- Domain - This filters the results according to the semantic domain you choose from the list. These semantic domains are taken from the Dictionary Development Process.
- Search - This filters the results to show only records that contain a search string that you specify. You select the field to search on and the text to look for. Regular expressions are supported.
- None - This doesn't apply any filter and shows the language word lists 'as is'. If only one language is selected, then the list is displayed according to the standard word list that was used for elicitation. If more than one language is selected, they are referenced together and sorted alphabetically by gloss.
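The ID input box of the Word List filter accepts single numbers and ranges (e.g. 3,6,12, 25-75, 77). How ComparaLex itself parses this input is not documented here; the following is only a minimal Python sketch of one way such a string could be expanded, and the function name parse_id_ranges is hypothetical.

```python
def parse_id_ranges(spec: str) -> list[int]:
    """Expand a spec such as '3,6,12, 25-75, 77' into a sorted list of IDs."""
    ids: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range runs backwards: {part!r}")
            ids.update(range(start, end + 1))  # ranges are inclusive
        else:
            ids.add(int(part))
    return sorted(ids)


selected = parse_id_ranges("3,6,12, 25-75, 77")
print(len(selected), selected[:4], selected[-1])
```

Collecting the IDs in a set means that overlapping entries (for example "5, 3-7") are counted only once.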
Download Query Results
When you click one of the download links, ComparaLex will generate a tab-separated file of the query results and send it to your computer. If you click 'Text' it will try to open it on your computer using your default text editor. If you click 'Excel' it will instruct your browser to try to open it with Microsoft Excel. In either case, the results will depend on how your computer and web browser are set up.
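If you would rather process the downloaded file with a script, a tab-separated file can be read with Python's standard csv module. The sketch below is an illustration only: the column names (ID, Gloss, Phonetic) and the placeholder values are assumptions, since the actual columns depend on the fields you selected in the query form.

```python
import csv
import io


def read_query_results(handle):
    """Return the rows of a tab-separated download as a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Stand-in for a downloaded file. For a real file, use something like
# open(path, encoding="utf-8", newline="") instead (UTF-8 is an assumption).
sample = io.StringIO(
    "ID\tGloss\tPhonetic\n"
    "1\tbody\tPLACEHOLDER-1\n"
    "2\thead\tPLACEHOLDER-2\n"
)

rows = read_query_results(sample)
for row in rows:
    print(row["ID"], row["Gloss"], row["Phonetic"])
```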
Submitting Language Word Lists
Preparing Word List Data for Submission
Word list data must be organized following one of the standard word lists in the ComparaLex database (e.g., SIL Comparative African Word List, Swadesh 100 Word List, etc. See all standard word lists). Transcription should be carried out using IPA symbols, with any deviations from the IPA spelled out in the “Additional Comments” section of the metadata form. For additional information, see the data encoding and data format sections below. Your data must minimally contain the following fields for each item:
- Standard list number - Data in this field are the numbers for each word employed by the standard word list the data has been gathered against.
- Gloss - Data in this field are the glosses of the standard word list against which the data has been gathered.
- Phonetic - Data in this field are phonetic transcriptions of each word in the list, including phonetic pitch if the language is tonal or has a pitch-accent system. If the “Phonetic Pitch” field is employed, then phonetic pitch does not need to be included in the “Phonetic” field.
In addition, the following fields are highly recommended:
- Audio - Data in this field are the file names of the individual digital recordings, and each entry must match its file name exactly. The preferred filename is one that includes both the standard list number and the gloss (e.g., 0001_body.wav). Each word must have a separate audio file. Ideally, the recordings should include the gloss and one token of each word. For additional information, see the audio format section below.
- Phonetic Pitch - Data in this field are preferably encoded using the numerals 1-7 as follows. For each level pitch, type in one numeral (1 = lowest pitch, 7 = highest pitch) with a single space between each of the pitches. For more information, see the section on phonetic pitch.
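Before uploading, you may want to check that your Phonetic Pitch values follow the level-pitch convention described in the list above (numerals 1-7 separated by single spaces). The following is a minimal Python sketch under that assumption only. It does not cover any other notation that the phonetic pitch section may allow, and the function name is hypothetical.

```python
import re

# One numeral 1-7 per level pitch, with a single space between pitches.
LEVEL_PITCH = re.compile(r"[1-7]( [1-7])*")


def is_level_pitch_sequence(value: str) -> bool:
    """True if value looks like '1 3 5': numerals 1-7, single-spaced."""
    return LEVEL_PITCH.fullmatch(value) is not None


for candidate in ["1 3 5", "7", "1  3", "8 2", ""]:
    print(repr(candidate), is_level_pitch_sequence(candidate))
```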
Submission Procedure
Submitting data to the server involves four steps:
- Metadata form - Supply the details about the language word list such as name, country, ISO code, etc. See above for a complete description of all the fields.
- Upload data file(s) - The word list data and sound files are uploaded and stored in a temporary location on our server.
- Verify import process - Here you specify the field definitions for your data and verify that it imported correctly and make adjustments if necessary.
- Await approval - Your word list will wait in a queue until one of our staff approves it and enters it into the database.
To find out what stage your language submission is at, click here.
Data File Format
ComparaLex can accept files in the following formats:
- Field Linguist’s Toolbox (.db) - See the Resources menu tab to download a sample database
- Comma Separated (.txt, .csv)
- Tab Separated (.txt, .tsv)
Files may be zipped to save space. The maximum upload file size is 10 MB. You can upload multiple data files and each data set will be added to the language.
Tab separated is recommended over comma separated because it is more likely that the comma will be used in the data in addition to serving as the data separator. If you nevertheless still choose to use comma separated and you have fields that contain commas, please enclose the entire field in double quotation marks (").
Example: 26,"voice box, larynx, Adam's apple"
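If you generate your comma-separated file with a script, most CSV libraries apply this quoting for you. As an illustration (not a ComparaLex tool), Python's standard csv module quotes any field that contains the separator and reads it back as a single field:

```python
import csv
import io

# Write one record; the second field contains commas.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([26, "voice box, larynx, Adam's apple"])
line = buffer.getvalue()
print(line, end="")

# Reading the line back yields two fields, not four.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```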
If your data is stored in a spreadsheet like Microsoft Excel then you can convert it to tab or comma separated format by clicking File:Save As and changing the format to "Unicode Text" or something similar.
Data Encoding
Unicode files are preferred but ANSI/Windows-1252 and SIL IPA93 data can also be imported and automatically converted to UTF-8.
Warning: If you have been using a hacked font based on another encoding standard, your data cannot be imported into this system. You will have to convert it to UTF-8. A good tool to help with this is SIL TECkit.
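If you prefer to convert a Windows-1252 file to UTF-8 yourself before uploading, this can be done in a few lines. The sketch below is an illustration in Python and covers only standard code pages; it does not handle SIL IPA93 or custom font encodings, which need a mapping tool. The function name to_utf8 is hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a standard code page and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


legacy = "café".encode("cp1252")   # the kind of bytes a Windows-1252 file holds
converted = to_utf8(legacy)
print(legacy, converted)
```

For a whole file, read it in binary mode, pass the bytes through the function, and write the result to a new file so that the original is kept.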
Audio File Format
ComparaLex can accommodate both MP3 and WAV audio files.
MP3 - lossy, low bandwidth format for listening online.
WAV - lossless format to download to your computer for detailed analysis.
Recordings are preferred in WAV format with a sampling rate of 44 kHz or higher and a bit depth of 16 bits. ComparaLex will automatically generate MP3 files from your WAV files. MP3 files are created at 32 kHz sampling, variable bit rate, quality=4, mono. The maximum allowed size of an audio clip for a word is 300 kB. If you only have MP3 files, please upload them anyway.
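To check a batch of recordings against these preferences before uploading, something like the following Python sketch could be used. It relies only on the standard wave module. The thresholds are taken from the paragraph above, reading "300 kB" as 300,000 bytes is an assumption, and the function names are hypothetical.

```python
import io
import wave

MAX_CLIP_BYTES = 300 * 1000     # "300 kB" read as 300,000 bytes (assumption)
MIN_SAMPLE_RATE = 44_000        # "44 kHz or higher"
PREFERRED_SAMPLE_WIDTH = 2      # 16 bits = 2 bytes per sample


def check_wav(data: bytes) -> list[str]:
    """Return a list of warnings for one WAV clip (empty if none)."""
    warnings = []
    if len(data) > MAX_CLIP_BYTES:
        warnings.append("clip is larger than 300 kB")
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() < MIN_SAMPLE_RATE:
            warnings.append(f"sampling rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != PREFERRED_SAMPLE_WIDTH:
            warnings.append(f"sample width is {8 * wav.getsampwidth()} bits")
    return warnings


def make_silent_wav(rate: int, width: int = 2, frames: int = 100) -> bytes:
    """Build a tiny silent mono WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * width * frames)
    return buffer.getvalue()


print(check_wav(make_silent_wav(44100)))
print(check_wav(make_silent_wav(22050)))
```

For a file on disk, read it with open(path, "rb").read() and pass the bytes to check_wav.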
Multiple files can be zipped together to save uploading time. The maximum upload file size is 10 MB. You can upload files more than once and each one will be added to the collection that already exists for that language. If a file with the same name already exists, it will be overwritten.
When you upload audio files, ComparaLex will automatically search through the audio field of the language word list and look for matches between sound file names and audio field names. If a match is found, then a link is created. If an audio file is specified but cannot be found, then a question mark will be displayed.
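To see in advance which audio entries would end up without a link, you can compare your audio field values with your file names yourself. The Python sketch below assumes an exact, case-sensitive comparison, which may not be precisely how ComparaLex matches names. The function name and the second file name are hypothetical.

```python
def link_audio(audio_field_values, uploaded_file_names):
    """Map each non-empty audio field value to True (file found) or False."""
    uploaded = set(uploaded_file_names)
    return {value: value in uploaded for value in audio_field_values if value}


links = link_audio(
    ["0001_body.wav", "0002_head.wav", ""],   # values from the Audio field
    ["0001_body.wav"],                        # files actually uploaded
)
for name, found in links.items():
    print(name, "linked" if found else "missing")
```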
Viewing & Editing Your Language Word Lists
Editing Word List Metadata
If you have already submitted a language but need to make some changes, it’s easy with ComparaLex. To edit your language metadata, log in and click on the Edit your account button at the top right of the screen. This will take you to your user account page, and at the bottom you should see a table of all the languages that belong to you.
Finished | This shows a check mark when you have clicked 'Yes' to the question "... are you finished editing this word list?". |
Consent | This shows a check mark when you have clicked 'Yes' to the question "...grant permission to CanIL to publish". |
Status | This shows the publishing status of the word list. For more information about publishing status, click here. |
Edit Details | Takes you to the edit window where you can edit the metadata for the language (fields like name, country, iso code... etc.) |
Edit Words | Takes you to a page where you can edit all the fields of this language. Use this page to add or delete records. |
Upload | Use this to upload new data and/or audio files to the language. |
Delete Words | Deletes all the words from this word list. |
Delete Audio | Deletes all the audio files from this word list. |
Delete All | Deletes the entire word list and all related files. |
Editing Words
ComparaLex has a built in editor that allows you to make changes to your word list after you have uploaded your data. We think you'll find the editor quite easy to use. In fact, you could even create an entire word list from scratch without uploading anything.
If you are logged in and you have submitted a language word list then anytime the language appears on ComparaLex the data should have a light-green background. Double-clicking on one of these cells will switch it to editing mode.
Double-click the cell to switch to edit mode. Click the save icon or press 'ENTER' to save. Click the cancel icon or press 'ESC' to cancel. If you're editing the word list from the 'Edit your account' page then you'll see a column at the far right with two more icons.
Special Behavior
Standard List ID - The standard list ID field will be turned into a drop-down list selector of the standard word list being used. Select the appropriate list record and click the save button.
Phonetic Pitch - Editing phonetic pitch fields is straightforward. Double-click the cell and the idealized pitch trace will be converted to a sequence of numbers 1-7. Make the changes you need and save. The values will be converted back to an idealized pitch trace. You don't have to type the square brackets as these will be added automatically when you save. For more information, see the section on phonetic pitch.
Audio - Double click the audio cell and it will switch to editing mode. Click the browse button to upload a new audio file for the word. The maximum allowed size of an audio clip for a word is 300 kB. If you upload a WAV file it will automatically create an MP3 for you. For more information, see the section on audio formats.
Adding or Deleting records
The only way to add or delete records is from your Account Details page. Click the Edit your account button at the top right corner of the page. Click the Edit Words icon to edit words in that language. The word list with all fields will appear, and at the far right will be two icons: one to add a new word/row below the current row, and one to delete the word/row.
Regular Expressions
Regular expressions are a powerful way of specifying a pattern for a complex search. Here is a chart to help you get started with understanding the codes:
| Category | Code | Description | Examples | |
| Counts Applies to the previous character | ||||
| + | One or more occurrences | a+rt | matches art, aart, aaart... etc | |
| * | Zero or more occurrences | da*d | matches 'dd', 'dad', 'daaaad'... etc | |
| ? | Either zero or one occurrence | be?an | matches 'ban' and 'bean' but NOT 'beean' | |
| {min,max} | Match between min and max occurrences | n{1,3} | matches on 'n', 'nn', 'nnn' but NOT 'nnnn' | |
| Note: The ? also modifies any of the above to be 'non-greedy'. Useful when used with wildcards like . or classes [...] | .+?z | matches as few characters as possible until it reaches the first 'z' | ||
| Position | ||||
| ^ | Beginning of a string | ^The | matches on 'The' when it occurs at the beginning of the string (Note: ^ must be the first letter of the search string) | |
| $ | End of a string | beard$ | matches on 'beard' when it occurs at the end of the string (Note: $ must be the last letter of the search string) | |
| \b | Word boundary | \barm | matches 'arm' and 'army' but NOT 'farm' | |
| Class & Group Any one character within the range | ||||
| . | Any character (including carriage return and newline) | b.d | matches 'bad', 'bed', 'bZd' | |
| [...] | Any single character within the brackets | [4-9]th | matches '4th', '5th', '6th' etc. | |
| [^...] | Any single character except those within the brackets | b[^ae]d | matches 'bid', 'bud' but NOT 'bed' or 'bad' | |
| (...) | Treat the contents of (...) as a single unit Also stores the contents to be referred to later | band(stand)? | matches 'band' and 'bandstand' | |
| Other | ||||
| | | Separates alternate possibilities | jogg(ing|ed) | matches 'jogging' and 'jogged' | |
| \ | Literal. When used before one of the special characters (above) it treats it as a literal | 1\+1 | matches '1+1' | |
| \s | Whitespace characters (space, tab, line break, carriage return) | \s{2,} | matches 2 or more spaces, tabs or new lines | |
| \d | Digits 0-9 | \d\d | matches any two consecutive digits ('00' through '99') | |
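To try the codes in the chart outside ComparaLex, any regular expression engine will do. The sketch below uses Python's re module as a stand-in. Which engine ComparaLex uses is not stated here, so small details may differ. Each case pairs a pattern from the chart with a sample text and whether a match is expected.

```python
import re


def found(pattern: str, text: str) -> bool:
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# (pattern, text, should it match?) based on the chart above.
CASES = [
    (r"a+rt", "aaart", True),
    (r"da*d", "dd", True),
    (r"be?an", "ban", True),
    (r"be?an", "beean", False),
    (r"^The", "The end", True),
    (r"^The", "At The end", False),
    (r"beard$", "grey beard", True),
    (r"beard$", "bearded", False),
    (r"\barm", "army", True),
    (r"\barm", "farm", False),
    (r"[4-9]th", "5th", True),
    (r"b[^ae]d", "bed", False),
    (r"band(stand)?", "bandstand", True),
    (r"jogg(ing|ed)", "jogged", True),
    (r"1\+1", "1+1", True),
    (r"\s{2,}", "a b", False),
]

for pattern, text, expected in CASES:
    status = "ok" if found(pattern, text) == expected else "UNEXPECTED"
    print(f"{pattern!r:18} {text!r:14} {status}")

# Greedy versus non-greedy matching on the same text.
lazy = re.search(r".+?z", "abzcz").group()
greedy = re.search(r".+z", "abzcz").group()
print(lazy, greedy)
```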
Examples
As you can see, when you combine the power of the above codes, you can do some amazing searches. For example:
- '\b\d{1,3}\b' matches any whole number of one to three digits (0 to 999)
- 'sep[ae]rate' finds seperate and separate
- '(\.|\?) [a-z]' finds sentences starting with a lowercase letter
- ' {2,}' finds double (or more) spaces
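Combined patterns like these can be tested on a sample sentence before using them in a search. The sketch below again uses Python's re module as a stand-in for whatever engine ComparaLex runs, and the sample text is invented. Because | has the lowest precedence of all the codes, the sentence pattern groups the two punctuation marks; the (?:...) form is a Python grouping that does not store its contents, used here so that findall returns the whole match.

```python
import re

text = "Items 7 and 1234 were seperate. they  were not separate? yes."

numbers = re.findall(r"\b\d{1,3}\b", text)                # 1-3 digit numbers
spellings = re.findall(r"sep[ae]rate", text)              # both spellings
lowercase_starts = re.findall(r"(?:\.|\?) [a-z]", text)   # '.' or '?' then lowercase
space_runs = re.findall(r" {2,}", text)                   # two or more spaces

print(numbers)
print(spellings)
print(lowercase_starts)
print(space_runs)
```

Note that 1234 is not reported by the first pattern, because the word boundaries require the whole number to be at most three digits long.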
For more information, there are many regular expression resources on the web. Please note that several variations of regular expressions exist; they share most of the core syntax shown here, but some details differ.
Bug/Feature Tracker
The Tracker is like a 'to do' list for ComparaLex. It's where you go when you want to report a problem or make a suggestion for a change/improvement to the site. It's where the developers go when they want to know what to work on next. It's also a place where you can discuss changes and offer comments on the development of ComparaLex.
Anyone can browse the list of tracker items and read the comments. Only registered users can contribute new items and make comments on others. Click here to register for a new account.
How does it work?
Here's a sample scenario of how the tracker feature should be used:
- Someone is using ComparaLex and encounters a bug or has a suggestion for change/improvement
- The person logs in and creates a new item in the tracker, filling out all the fields (see below)
- An email is automatically sent to the administrator, notifying them of the new tracker item
- The administrator checks over the tracker item and...
- edits it for clarity and accuracy
- emails the user for more info if needed
- assigns a priority for this item
Over time, the tracker will accumulate more and more items. This continues until a milestone is reached. Then a developer is contracted to implement the tracker items in order of priority. Once the item has been completed/implemented it will be marked as closed.
Tracker Fields
- Type - A tracker item can be one of five types:
- Bug - A problem, error, or crash has been found
- Change - A suggestion for a change in behaviour
- New Feature - A completely new feature that would be useful
- Appearance - A suggestion for change in appearance or wording
- Other - Anything else
- Title - A short summary/title of what you're suggesting
- Object - The page/part of ComparaLex that this tracker item applies to
- Description - A full description of what you're suggesting. Here are some tips:
- Please be as specific as possible. Don't use vague/general terms (e.g. "Once in a while when I click on it, it gives me an error message")
- Provide a URL to the page you're referring to (the http://comparalex.canil.ca... part in your browser).
- Provide exact steps to reproduce the problem. If you can't reproduce it, we probably can't either.
- Browser - The web browser you are using (this will usually be filled in automatically)
- Priority - The priority level is assigned by the ComparaLex administration team. There is no timeline here so the specifications are a bit vague.
- Urgent - Critical item that must be fixed ASAP
- Next minor release - Quite important item that's not too difficult to implement, but it can wait a while
- Next major release - Big changes that will take some time to work out
- Rainy day - Only if there's nothing else important to do
- Status - The status is assigned by the ComparaLex administration team. Can be one of two values:
- Open - This item is being considered and has not been implemented yet. This is the default for all new items
- Closed - This item is no longer being considered (either because
Please!
Before creating a new item it would be helpful if you would browse the tracker items to see if someone else has already reported the same thing. This cuts down on our work.