Current Version: 1.0
Developed by: Claude Pasquier
Copyright: © 2010 The University of Nice Sophia Antipolis
GeniaJ is a Java implementation of the Genia tagger (part-of-speech tagging and shallow parsing for biomedical texts), version 3.0.1 of April 16, 2007. The original version was developed in C++ by Yoshimasa Tsuruoka of the Tsujii Laboratory at the University of Tokyo and is distributed under the modified BSD licence. GeniaJ uses the same datasets as the original C++ version, and its output should be identical to that of the original.
For more information about the original software, see:
- Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii, Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.
- GeniaJ.jar (Java version 1.0 of July 1, 2010) is freely available on request.
Prepare a text file containing one sentence per line, then execute the program with:
java -Xmx500m -jar GeniaJ.jar < RAWTEXT > TAGGEDTEXT
The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.
word1   base1   POStag1   chunktag1   NEtag1
word2   base2   POStag2   chunktag2   NEtag2
  :       :        :          :          :
Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).
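As a minimal sketch of how the five-column format above could be read from Java: the class `TaggedLineParser`, its nested `Token` type, and the method `parseLine` are hypothetical names chosen for this illustration and are not part of GeniaJ. The sketch assumes one token per line with exactly five tab-separated fields and treats anything else (for example, a blank line between sentences, if the tagger emits one) as "no token".

```java
final class TaggedLineParser {

    /** One output row. Hypothetical helper type, not part of GeniaJ. */
    static final class Token {
        final String word;
        final String base;
        final String pos;
        final String chunk;
        final String ne;

        Token(String word, String base, String pos, String chunk, String ne) {
            this.word = word;
            this.base = base;
            this.pos = pos;
            this.chunk = chunk;
            this.ne = ne;
        }
    }

    /**
     * Parses one tab-separated line (word, base form, POS tag, chunk tag, NE tag).
     * Returns null for empty lines or lines that do not have exactly five columns.
     */
    static Token parseLine(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null;
        }
        String[] fields = line.split("\t", -1);
        if (fields.length != 5) {
            return null;
        }
        return new Token(fields[0], fields[1], fields[2], fields[3], fields[4]);
    }

    public static void main(String[] args) {
        Token t = parseLine("reversed\treverse\tVBD\tB-VP\tO");
        // Prints: reversed -> reverse (VBD, B-VP, O)
        System.out.println(t.word + " -> " + t.base + " (" + t.pos + ", " + t.chunk + ", " + t.ne + ")");
    }
}
```

In practice the lines would come from the TAGGEDTEXT file or from the tagger's standard output rather than from a string literal.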
> echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | java -Xmx500m -jar GeniaJ.jar
Inhibition       Inhibition       NN   B-NP  O
of               of               IN   B-PP  O
NF-kappaB        NF-kappaB        NN   B-NP  B-protein
activation       activation       NN   I-NP  O
reversed         reverse          VBD  B-VP  O
the              the              DT   B-NP  O
anti-apoptotic   anti-apoptotic   JJ   I-NP  O
effect           effect           NN   I-NP  O
of               of               IN   B-PP  O
isochamaejasmin  isochamaejasmin  NN   B-NP  O
.                .                .    O     O
You can easily extract four noun phrases ("Inhibition", "NF-kappaB activation", "the anti-apoptotic effect", and "isochamaejasmin") from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.
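The extraction described above can be sketched in Java as follows. This is an illustration only: `SpanExtractor`, `extractSpans`, and `SAMPLE` are hypothetical names and not part of GeniaJ. The sketch assumes five tab-separated columns per token line, and it assumes that the NE column follows the same B-/I-/O convention as the chunk column (the sample output shows only a single-token `B-protein`). A span starts at a `B-` tag of the requested type, is extended by following `I-` tags of the same type, and is closed by any other tag.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanExtractor {

    /** The tagged sentence from the example, one token per line, tab-separated. */
    static final List<String> SAMPLE = List.of(
        "Inhibition\tInhibition\tNN\tB-NP\tO",
        "of\tof\tIN\tB-PP\tO",
        "NF-kappaB\tNF-kappaB\tNN\tB-NP\tB-protein",
        "activation\tactivation\tNN\tI-NP\tO",
        "reversed\treverse\tVBD\tB-VP\tO",
        "the\tthe\tDT\tB-NP\tO",
        "anti-apoptotic\tanti-apoptotic\tJJ\tI-NP\tO",
        "effect\teffect\tNN\tI-NP\tO",
        "of\tof\tIN\tB-PP\tO",
        "isochamaejasmin\tisochamaejasmin\tNN\tB-NP\tO",
        ".\t.\t.\tO\tO");

    /**
     * Collects the words of every span of the given type.
     * column 3 is the chunk tag, column 4 is the NE tag (0-based indices).
     */
    static List<String> extractSpans(List<String> lines, int column, String type) {
        List<String> spans = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            // Lines without five columns (e.g. blank lines) are treated as outside any span.
            String tag = fields.length == 5 ? fields[column] : "O";
            if (tag.equals("B-" + type)) {
                if (current != null) {
                    spans.add(current.toString());
                }
                current = new StringBuilder(fields[0]);
            } else if (tag.equals("I-" + type) && current != null) {
                current.append(' ').append(fields[0]);
            } else {
                if (current != null) {
                    spans.add(current.toString());
                }
                current = null;
            }
        }
        if (current != null) {
            spans.add(current.toString());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Prints: [Inhibition, NF-kappaB activation, the anti-apoptotic effect, isochamaejasmin]
        System.out.println(extractSpans(SAMPLE, 3, "NP"));
        // Prints: [NF-kappaB]
        System.out.println(extractSpans(SAMPLE, 4, "protein"));
    }
}
```

The same method serves both columns because only the column index and the tag type differ between noun-phrase and protein-name extraction.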
1. Pasquier C: Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet Allocation. In 5th international workshop on semantic evaluation. Uppsala: Association for Computational Linguistics; 2010:154–157.