Teraman: A Tool for Word N-gram Extraction
|Inserted by||Ing. Zdeněk Češka, Ph.D.|
|Date last modified||29.12.2013|
Teraman is a tool for extracting word N-grams from large text datasets. It processes the input in batches, so it can handle text collections that are much larger than the available memory. The process consists of three steps: text pre-processing and indexing, N-gram counting, and de-indexing. The tool is written in C# and requires the .NET Framework 2.0 to run. More details about Teraman are available in our paper "Teraman: A Tool for N-gram Extraction from Large Datasets", published at the IEEE ICCP 2007 international conference.
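The three steps named above can be illustrated with a minimal Python sketch. This is not Teraman's C# code: the function names (`index_tokens`, `count_ngrams`, `deindex`, `extract_ngrams`) and the tokenization rule are assumptions made for illustration, and the sketch keeps everything in memory, so it leaves out the batch processing that lets the real tool handle inputs larger than memory.

```python
import re
from collections import Counter


def index_tokens(text):
    """Step 1 (pre-processing and indexing): lowercase the text, split it
    into words, and replace each distinct word with an integer id.
    The regex tokenizer is an assumption, not Teraman's actual rule."""
    words = re.findall(r"\w+", text.lower())
    vocab = {}
    ids = [vocab.setdefault(word, len(vocab)) for word in words]
    return ids, vocab


def count_ngrams(ids, n):
    """Step 2 (counting): count every run of n consecutive word ids."""
    return Counter(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))


def deindex(counts, vocab):
    """Step 3 (de-indexing): translate id tuples back into word strings."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return {
        " ".join(id_to_word[idx] for idx in gram): count
        for gram, count in counts.items()
    }


def extract_ngrams(text, n):
    """Run the three steps in order on a single in-memory string."""
    ids, vocab = index_tokens(text)
    return deindex(count_ngrams(ids, n), vocab)


if __name__ == "__main__":
    for gram, count in extract_ngrams("the cat sat on the cat", 2).items():
        print(gram, count)
```

Working on integer ids rather than strings during the counting step keeps the keys compact, which is one plausible reason for separating indexing and de-indexing from the counting itself.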