1) Download the trunk from http://seman.sourceforge.net/. I checked it out to p:\SEMAN\. Build the Debug configuration only. Don't try the other configurations; they fail even worse than Debug does.
2) Download RusLemmatizer.zip and MorphWizard.zip from http://aot.ru/download.php. Install both with the default paths.
3) Delete everything from c:\Rml\Dicts\.
4) Copy p:\SEMAN\Dicts\Morph\ to c:\Rml\Dicts\.
5) Copy p:\SEMAN\Dicts\SrcMorph\ to c:\Rml\Dicts\.
6) Take MorphGen.exe from p:\SEMAN\Source\MorphGen\Debug and use it to replace the one in c:\Rml\Bin\.
7) Now run eng_gen.bat, ger_gen.bat and rus_gen.bat. This is slow; allow 2-4 hours.
8) Use the resulting binaries with Lemmatizer.NET, which can be built from p:\SEMAN\Source\LemmatizerNET.sln.
9) Lemmatizer.NET needs a small workaround, though. In Lemmatizer.cs, in the LoadDictionariesRegistry function, replace the following:
    _useStatistic = true;
    _statistic.Load(this, "l", manager);
with:
    if (Language == InternalMorphLanguage.morphRussian)
    {
        _useStatistic = true;
        _statistic.Load(this, "l", manager);
    }
    else
    {
        _useStatistic = false;
    }
10) To work with the dictionaries, make a separate folder such as p:\Lemmatize\Rml and copy c:\Rml\Bin and c:\Rml\Dicts into it.
11) A minimal test program for lemmatizing looks like this:
    using System;
    using LemmatizerNET; // namespace assumed; adjust to match your build of Lemmatizer.NET

    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Enter Russian word (0 to exit)");
            ILemmatizer lem = LemmatizerFactory.Create(MorphLanguage.Russian);
            var manager = FileManager.GetFileManager(@"p:\Lemmatize\Rml"); // make it relative!
            lem.LoadDictionariesRegistry(manager);
            while (true)
            {
                Console.Write("> ");
                string word = Console.ReadLine();
                // Stop on "0" or end of input, before lemmatizing the sentinel itself.
                if (word == null || word == "0")
                    break;
                var paradigmList = lem.CreateParadigmCollectionFromForm(word, false, true);
                for (var i = 0; i < paradigmList.Count; i++)
                {
                    // Print the normal form (lemma) of each matching paradigm.
                    var paradigm = paradigmList[i];
                    Console.WriteLine("\t" + paradigm.Norm);
                }
            }
        }
    }
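The "make it relative!" note and step 10 can be combined: keep the Rml folder next to the executable and resolve it at run time. Below is a minimal sketch of that idea, not part of Lemmatizer.NET. The names RmlFolder, ResolveRoot, CopyDirectory and Stage are hypothetical helpers, and the layout (a folder called "Rml" beside the .exe containing Bin and Dicts) is an assumption. It uses only System.IO.

```csharp
using System;
using System.IO;

// Hypothetical helper; none of these names come from Lemmatizer.NET.
static class RmlFolder
{
    // Assumed layout: a folder called "Rml" next to the running executable.
    const string FolderName = "Rml";

    // Builds the path to the Rml folder relative to the application's base directory.
    public static string ResolveRoot()
    {
        return Path.Combine(AppDomain.CurrentDomain.BaseDirectory, FolderName);
    }

    // Recursively copies sourceDir into targetDir, overwriting files that already exist.
    public static void CopyDirectory(string sourceDir, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        foreach (string file in Directory.GetFiles(sourceDir))
        {
            File.Copy(file, Path.Combine(targetDir, Path.GetFileName(file)), true);
        }
        foreach (string dir in Directory.GetDirectories(sourceDir))
        {
            CopyDirectory(dir, Path.Combine(targetDir, Path.GetFileName(dir)));
        }
    }

    // Copies Bin and Dicts from an installed Rml folder (e.g. c:\Rml)
    // into the local Rml folder and returns the local folder's path.
    public static string Stage(string installedRml)
    {
        string root = ResolveRoot();
        CopyDirectory(Path.Combine(installedRml, "Bin"), Path.Combine(root, "Bin"));
        CopyDirectory(Path.Combine(installedRml, "Dicts"), Path.Combine(root, "Dicts"));
        return root;
    }
}
```

With this in place, calling RmlFolder.Stage(@"c:\Rml") once does the copying from step 10, and the test program can pass RmlFolder.ResolveRoot() to FileManager.GetFileManager instead of the hard-coded p:\Lemmatize\Rml path.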