Abstract: |
The goal of this paper is to describe a method to automatically extract all basic attributes namely actor, action, object, time and location which belong to an activity, and the relationships (transition and cause) between activities in each sentence retrieved from Japanese CGM (consumer generated media). Previous work had some limitations, such as high setup cost, inability of extracting all attributes, limitation on the types of sentences that can be handled, insufficient consideration of interdependency among attributes, and inability of extracting causes between activities. To resolve these problems, this paper proposes a novel approach that treats the activity extraction as a sequence labeling problem, and automatically makes its own training data. This approach has advantages such as domain-independence, scalability, and unnecessary hand-tagged data. Since it is unnecessary to fix the positions and the number of the attributes in activity sentences, this approach can extract all attributes and relationships between activities by making only a single pass over its corpus. Additionally, by converting to simpler sentences, removing stop words, utilizing html tags, google map api, and wikipedia, the proposed approach can deal with complex sentences retrieved from Japanese CGM. |