blog
Posts Tagged ‘data’
Wednesday, April 1st, 2009
Recent work in the Ubiquity internationalization realm has focused on the upcoming Ubiquity parser, which will bring some great new features to Ubiquity, including support for overlord verbs and semi-automatic localization of commands via semantic roles. It’s possible, though, that these new features will break backwards compatibility with the current command specification and noun types. Creative destruction for the win.
As we look to incorporate the next-generation parser into Ubiquity proper, it becomes important to look at the current command ecosystem and see how disruptive this move might be. To this end, last night I wrote a quick Perl script to scrape the commands cached on the herd and get some quantitative answers to my questions.
(Read more…)
Tags: arguments, code, data, herd, localization, Mozilla Planet, nountypes, parser, ubiquity, verbs
Posted in projects | No Comments »
Tuesday, March 24th, 2009
At the end of my blog post yesterday I hinted at an exciting possible approach to Ubiquity’s localization:
In the future we ideally could build a web-based system to collect these “utterances.” We could … generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation process.
If we build this type of “command-bank” of common Ubiquity input translated into various languages, we could build a tool that learns the relevant features of each language and generates each parser, essentially learning the language from data. Today I’ll elaborate on how I believe this could be possible, by analogy to another language-learning device: the human.
(Read more…)
Tags: analogy, automation, code, data, deduction, Dutch, linguistics, Mozilla Planet, parser, patterns, ubiquity
Posted in projects | 3 Comments »
Monday, March 23rd, 2009
I have recently traveled a fair deal and have met many people excited about the Ubiquity project and its localization efforts. “I want to help,” say the people, but many are unsure where to start.
For a linguist, studying a language involves looking at instances of that language as data. To this end, we as Ubiquity internationalizers need to get at some examples of target utterances. Here’s an example survey which could be a good starting point for native speakers who want to contribute information on their language, based on Blair’s list of common Ubiquity verbs.
(Read more…)
Tags: collaboration, commands, contribute, data, linguistics, Mozilla Planet, parser, survey, ubiquity
Posted in projects | 21 Comments »
Wednesday, February 18th, 2009
Earlier today I blogged on three different strategies languages use to mark the roles of different arguments: word order, marking on the arguments, and marking on the verbs.
I gathered some data from the fantastic World Atlas of Language Structures to put together a survey of many of the languages on the Internet. For each language, I recorded the canonical word order and whether the language marks the roles of its arguments on the verb, on the arguments themselves, or both.
As you can see, there are a number of data points that are still missing. Please contribute information on the languages you speak! You can edit the spreadsheet on Google Docs. Thanks!
Tags: arguments, coding properties, contribute, data, grammatical relations, language, linguistics, Mozilla Planet, ubiquity
Posted in projects | 15 Comments »
Thursday, September 18th, 2008
Bailey just asked me what the difference is between 回収 (kaishū) and 収集 (shūshū), two words that would both map to the English verb “collect.” I intuitively came up with a hypothesis to explain the distinction:
- 回収 may take things away from others when collecting while 収集 does not have that implication.
- Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.1
Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: Google.2 To test my hypothesis, I chose two “objects of collection”: one that you can take away (and that is often distributed first) and one that you can’t take away, namely アンケート (ankēto “survey,” from the French enquête) and 意見 (iken “opinion”). I then searched Google for each of the four resulting collocations3 in quotes (“•”) and recorded how many hits there were.
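The two-by-two design described here can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the post: the object particle を joining each object to each verb is an assumption, the function names are made up for this sketch, and the hit counts themselves would have to be read off the search results by hand.

```python
from itertools import product

# The two verbs and the two "objects of collection" come from the post.
VERBS = ["回収", "収集"]
OBJECTS = ["アンケート", "意見"]

# Assumption: each collocation is formed as object + を + verb.
OBJECT_PARTICLE = "を"


def quoted_queries(objects=OBJECTS, verbs=VERBS, particle=OBJECT_PARTICLE):
    """Return one exact-phrase (quoted) query per (object, verb) pair."""
    return {
        (obj, verb): f'"{obj}{particle}{verb}"'
        for obj, verb in product(objects, verbs)
    }


def first_verb_share(hits, obj, verbs=VERBS):
    """Fraction of an object's total hits that use the first verb (回収).

    `hits` maps (object, verb) pairs to hit counts recorded by hand.
    Returns None when the object has no hits at all.
    """
    total = sum(hits[(obj, verb)] for verb in verbs)
    if total == 0:
        return None
    return hits[(obj, verbs[0])] / total


if __name__ == "__main__":
    for pair, query in quoted_queries().items():
        print(pair, query)
```

If the hypothesis holds, one would expect the share of 回収 to be noticeably higher for アンケート than for 意見; comparing `first_verb_share` across the two objects makes that contrast explicit without depending on the raw totals.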
(Read more…)
Tags: Bailey, cognitive linguistics, corpora, corpus, data, English, frame semantics, Google, Japanese language, language, language learning, linguistics, synonymy, translation
Posted in life, observation | 1 Comment »