This week your task is to implement a TF-IDF based search engine. You are given 3078 recipes (in Finnish) from the Pirkka recipe web site. Each document is preprocessed so that it is represented as a list of word indices and their counts in the document. You are also given a program that turns any document into (word number, word count) pairs. This is useful for translating the query string into the same representation as the documents.
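To make the scoring step concrete, here is a minimal Python sketch of one common TF-IDF variant. It is an illustration under assumptions, not the required solution: each document is assumed to be already loaded as a dict mapping word index to count, term frequency is the raw count, idf is log(N / df), and relevance is cosine similarity. The function names build_idf, tfidf_vector and cosine are hypothetical, and the actual file format of the preprocessed documents is described in the README, not here.

```python
import math
from collections import Counter

def build_idf(docs):
    """docs: list of {word_index: count} dicts (assumed layout).
    Returns {word_index: idf} with idf = log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # each document counts a word at most once
    return {w: math.log(n_docs / d) for w, d in df.items()}

def tfidf_vector(counts, idf):
    """Weight raw counts by idf; words never seen in the collection are dropped."""
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With these pieces, the query is converted to (word number, word count) pairs, weighted with tfidf_vector, and compared against every document vector with cosine. Other weightings (for example logarithmic term frequency) fit the same structure.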
Your program should follow the by now customary convention of reading from stdin and writing to stdout. The input is the query, and the output is the document numbers in ranked order: the most relevant on the first line, the second most relevant on the second line, and so on. Please print only the document numbers, nothing else. You may feed the numbers to the show_txts.sh script to see the corresponding texts.
(The answer may well be wrong; this just illustrates the format.)
./hako.sh teriyaki porsaankyljykset | head -1 | ./show_txts txt/
KYLJYS-VIHANNESPANNU Naturel porsaankyljyksiä suolaa ja mustapippuria sipuli purjo porkkanaa lanttua ... ...
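The output convention described above (one document number per line, best match first, nothing else) can be sketched in Python as follows. This is only an illustration: it assumes relevance scores have already been computed and stored in a dict keyed by document number, and the names rank_documents and format_ranking are hypothetical. How document numbers are assigned is defined by the provided material, so the numbers used here are placeholders.

```python
def rank_documents(scores):
    """scores: {doc_number: relevance} (assumed layout).
    Returns doc numbers, highest relevance first; ties broken by doc number."""
    return sorted(scores, key=lambda d: (-scores[d], d))

def format_ranking(scores):
    """One document number per line and nothing else, ready for stdout."""
    return "".join(f"{doc_number}\n" for doc_number in rank_documents(scores))
```

A program would typically read the query with sys.stdin.read(), compute the scores, and finish with sys.stdout.write(format_ranking(scores)). Breaking ties by document number keeps the output deterministic, which makes results easier to compare between runs.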
Submit your program, including its source code, as a gzipped tar file (or ZIP file) in Moodle. Name the gzipped tar file "yourusername"-hw6.pdf, where "yourusername" is your username in the CS system.
All the available material should be in this gzipped tar file. You may extract it in the department CSL system with "tar xvfz hako.tgz". Consult the README file for details. If there is something you do not understand, please do not hesitate to ask Tomi Silander for details and assistance.