######################################
#                                    #
#  The SamPDA cursive word database  #
#                                    #
#          - release 1.0 -           #
#   containing sets SamPDA_[0,1,2]   #
#                                    #
#         - DOCUMENTATION -          #
#                                    #
######################################

## - Contact information - ##

Collector: Jakob Sternby
Centre for Mathematical Sciences
Lund Institute of Technology
jakob@maths.lth.se

## - SamPDA Concept - ##

To evaluate a cursive recognition engine, one would ideally test all
plausible letter combinations. To that end, bigram, trigram, and tetragram
letter statistics were generated from the dictionary. These statistics were
then used to produce a compact list of words containing as many of the
statistically relevant letter combinations as possible. The SamPDA data sets
collected here are samples from this list.

The SamPDA collection strategy also involves playing audio files of the
words to be written instead of displaying them on screen. This avoids
influencing the way the user writes by showing the word in a computer font.

Each of the sets SamPDA_[0,1,2] contains 20 different words from the list
described above. All of the data was collected on a COMPAQ iPAQ PDA running
the SamPDA software, which played an audio file of each word to be written,
as described above. Only a baseline in a box was displayed for reference.
This baseline corresponds to the y-coordinate 64.

## - Database contents - ##

This release of SamPDA contains three different datasets, each based on a
list of 20 different words. Because some writers misinterpreted the audio
file, a few other words have also been written, such as 'lamp' or 'lamb'
instead of 'land', 'wart' instead of 'want', and 'rage' instead of 'range'.
Some misspellings are also present, such as 'business' written as 'buisness'
and 'address' written as 'adress'.
-------------------------------------------------------
| Dataset  | Number of writers | Unique words | Total |
|-----------------------------------------------------|
| SamPDA_0 | 29                | 23           | 580   |
| SamPDA_1 | 14                | 23           | 280   |
| SamPDA_2 | 7                 | 23           | 140   |
-------------------------------------------------------

Since this database was not originally created in the UNIPEN format, each
data file contains essentially only the coordinate information. Some extra
information can, however, be extracted from the filenames, which have the
structure described below.

## - Filename structure - ##

sampda_[lang]_[type][wordlist]_[writername].unipen

The [wordlist] marker in each filename corresponds to the suffix _[0, 1, 2]
of the SamPDA_[0, 1, 2] set name.

# lang #
ae = American English

# type #
mw = users were instructed to write neatly in 'as much cursive writing as
     possible'. Many writers without a cursive tradition are present.

# wordlist #
The number [0, 1, 2] corresponds to the three different wordlists used.
Each wordlist contains 20 words.

# writername #
The initials of the test subject.
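
The filename structure above can be split into its fields with a short
script. The following Python sketch is only an illustration and not part of
the database distribution. The names FILENAME_RE, SamPDAName,
parse_sampda_filename and dataset_for are hypothetical, as is the example
filename in the usage note. The pattern assumes that [lang] and [type]
consist of letters only, that [wordlist] is a single digit 0, 1 or 2, and
that [writername] contains no underscore or dot.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for sampda_[lang]_[type][wordlist]_[writername].unipen.
# Letters-only lang/type and a single-digit wordlist are assumptions.
FILENAME_RE = re.compile(
    r"^sampda_(?P<lang>[a-z]+)_(?P<type>[a-z]+)(?P<wordlist>[0-2])"
    r"_(?P<writer>[^_.]+)\.unipen$",
    re.IGNORECASE,
)


class SamPDAName(NamedTuple):
    """Fields recovered from one filename (hypothetical helper type)."""
    lang: str          # e.g. "ae" = American English
    writing_type: str  # e.g. "mw"
    wordlist: int      # 0, 1 or 2
    writer: str        # initials of the test subject


def parse_sampda_filename(name: str) -> Optional[SamPDAName]:
    """Return the parsed fields, or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return SamPDAName(
        lang=m.group("lang").lower(),
        writing_type=m.group("type").lower(),
        wordlist=int(m.group("wordlist")),
        writer=m.group("writer"),
    )


def dataset_for(parsed: SamPDAName) -> str:
    """Map the wordlist marker to its set name, e.g. 0 -> 'SamPDA_0'."""
    return f"SamPDA_{parsed.wordlist}"
```

For a made-up filename such as sampda_ae_mw0_xy.unipen, the parser would
yield lang "ae", type "mw", wordlist 0 and writer "xy", and dataset_for
would map the result to SamPDA_0. Names that do not fit the assumed pattern
return None rather than raising, so unexpected files can be skipped.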