GIGAWORDCORPUS
Goal of the Project:
Retrieve the top-ranked terms from a corpus, ranked by the TF*IDF measure.
Brief description of the problem:
The problem is to retrieve a ranked list of R distinct interesting terms from a very large corpus, where "interesting" is measured by TF*IDF and term lengths range from M to N (N > M). R, M, and N are input parameters to the program.
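This page does not say which TF*IDF variant the program uses. As an assumption only, one common definition scores a term t in a file d as shown below, where tf(t, d) is the number of times t occurs in d, D is the number of files in the corpus, and df(t) is the number of files that contain t. These symbols are illustrative and are not taken from the project itself.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{D}{\mathrm{df}(t)}
```

Under this definition, a term that occurs in every file gets a score of zero, while a term that is frequent in only a few files scores high. How per-file scores are combined into a single ranking of R terms is not stated here.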
Hardware Details:
Software Details:
- C language with MPI Package for parallelization.
Project Guide:
Dr. Ted Pedersen
Keywords:
- Suffix Arrays
- Quick Sort
- Binomial Reduction
How we implemented it:
- First, we create one suffix array per file and use quicksort to sort the positions in the suffix array by the words they point to in the file.
- After the suffix array is sorted, we eliminate duplicate positions (different positions that point to identical words).
- Once each processor has computed its final suffix array, all processors perform a binomial reduction, a tree-structured merge that takes about log2(P) rounds for P processors, so that at the end the processor with id zero holds the merged suffix arrays of all processors.
- Finally, we retrieve the top R interesting terms.
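The first two steps above (build, sort, remove duplicates) can be sketched in C roughly as follows. This is a minimal illustration, not the project's actual code: the function name `build_unique_suffix_array`, the layout of the file as an array of already-tokenized words, and the single fixed term length `len` are all assumptions made for the example. The standard `qsort()` is used for sorting; the C standard does not guarantee that it is a quicksort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Context for the comparator: standard qsort() takes no user-data argument,
   so this sketch keeps the current file in file-scope variables. */
static const char **g_words; /* tokenized file (hypothetical layout) */
static int g_len;            /* term length in words (hypothetical parameter) */

/* Compare the g_len-word terms that start at token positions a and b. */
static int cmp_terms(int a, int b)
{
    for (int k = 0; k < g_len; k++) {
        int c = strcmp(g_words[a + k], g_words[b + k]);
        if (c != 0)
            return c;
    }
    return 0;
}

/* qsort() comparator: order by term, then by position so that the earliest
   occurrence of each term comes first in its run of equal terms. */
static int cmp_positions(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    int c = cmp_terms(a, b);
    return c != 0 ? c : (a > b) - (a < b);
}

/* Fill sa with every position where a full len-word term starts, sort the
   positions by the terms they point to, then keep one position per distinct
   term. sa must have room for n_words entries. Returns the number kept. */
int build_unique_suffix_array(const char **words, int n_words, int len, int *sa)
{
    assert(words != NULL && sa != NULL && n_words >= 0 && len > 0);
    int count = n_words >= len ? n_words - len + 1 : 0;

    g_words = words;
    g_len = len;
    for (int i = 0; i < count; i++)
        sa[i] = i;
    qsort(sa, (size_t)count, sizeof sa[0], cmp_positions);

    /* After sorting, identical terms are adjacent: drop all but the first. */
    int kept = 0;
    for (int i = 0; i < count; i++) {
        if (kept == 0 || cmp_terms(sa[kept - 1], sa[i]) != 0)
            sa[kept++] = sa[i];
    }
    return kept;
}
```

For the five-word file `a b a b c` and `len = 1`, the sorted positions are 0, 2, 1, 3, 4, and removing duplicates leaves 0, 1, 4. Note that the length of each run of equal terms is that term's frequency in the file; this sketch discards it, whereas a TF*IDF computation would need to record it before dropping the duplicates.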
Downloads
Authors