Fmars Blog: Find the 20th most popular URL

Find the 20th most popular URL

Given a file containing a huge number of URLs, find the 20th most popular URL, that is, the URL with the 20th highest occurrence count.

First, use a hash table (std::unordered_map; std::map also works, but it is a balanced tree rather than a hash, so each update costs O(log N)) to count how many times each URL occurs. The problem then reduces to this: given an unordered array, find the 20th largest element.

Two intuitive approaches come to mind.
Assume there are N distinct URLs in total and we want the Kth most popular one.

QuickSelect

Use the quickselect algorithm directly to find the Kth URL.
Time complexity: O(N) on average.
It can degrade to O(N*N) in the worst case.
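As a rough illustration, here is a minimal quickselect sketch over (URL, count) pairs. The type alias UrlCount, the function name quick_select, and the fixed random seed are assumptions made for this example, not part of the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

using UrlCount = std::pair<std::string, std::size_t>;  // (URL, occurrences)

// Returns the entry with the k-th largest count (k is 1-based).
// Reorders `items` in place. Average O(N), worst case O(N*N).
UrlCount quick_select(std::vector<UrlCount>& items, std::size_t k) {
    assert(k >= 1 && k <= items.size());
    std::mt19937 rng(12345);  // fixed seed keeps the sketch reproducible
    std::size_t lo = 0;
    std::size_t hi = items.size() - 1;
    const std::size_t target = k - 1;  // index of the answer in descending order
    while (lo < hi) {
        // Pick a random pivot and park it at the end of the current range.
        std::uniform_int_distribution<std::size_t> pick(lo, hi);
        std::swap(items[pick(rng)], items[hi]);
        const std::size_t pivot = items[hi].second;
        // Lomuto partition: entries with a larger count move to the left.
        std::size_t store = lo;
        for (std::size_t i = lo; i < hi; ++i) {
            if (items[i].second > pivot) {
                std::swap(items[i], items[store]);
                ++store;
            }
        }
        std::swap(items[store], items[hi]);  // pivot lands in its final slot
        if (store == target) {
            break;
        } else if (target < store) {
            hi = store - 1;  // answer is in the left part
        } else {
            lo = store + 1;  // answer is in the right part
        }
    }
    return items[target];
}
```

One caveat with this simple partition: when many URLs share the same count, the split becomes lopsided and the running time drifts toward the quadratic worst case. A three-way partition, or the standard library's std::nth_element, is a reasonable alternative in that situation.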

Heap

First build a max-heap over all N counts, then pop the top element K-1 times.

The top element that remains is the Kth URL.

Building the heap takes O(N) time with bottom-up (divide-and-conquer) heap construction.

Each pop takes O(log N).

The total time complexity is O(N) + O(K log N).
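The max-heap steps above can be sketched with the standard heap algorithms. The names UrlCount and kth_by_max_heap are assumptions for illustration; the vector is taken by value so the caller's data is left untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using UrlCount = std::pair<std::string, std::size_t>;  // (URL, occurrences)

// Returns the entry with the k-th largest count (k is 1-based).
UrlCount kth_by_max_heap(std::vector<UrlCount> items, std::size_t k) {
    assert(k >= 1 && k <= items.size());
    // "Less than" on the count makes the standard algorithms build a max-heap.
    auto by_count = [](const UrlCount& a, const UrlCount& b) {
        return a.second < b.second;
    };
    std::make_heap(items.begin(), items.end(), by_count);  // linear-time build
    for (std::size_t i = 0; i + 1 < k; ++i) {
        // pop_heap moves the current maximum to the back; pop_back drops it.
        std::pop_heap(items.begin(), items.end(), by_count);  // O(log N)
        items.pop_back();
    }
    return items.front();  // the top after K-1 pops is the K-th largest
}
```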

(If we instead maintain a min-heap with a fixed size of K elements, there is no need to build the entire heap, and each insertion runs in O(log K). The total complexity is then O(N log K).)
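A minimal sketch of the fixed-size variant follows, assuming the counts are stored as (count, URL) pairs so that the default pair ordering compares counts first. CountUrl and kth_by_bounded_heap are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using CountUrl = std::pair<std::size_t, std::string>;  // (occurrences, URL)

// Returns the URL with the k-th largest count (k is 1-based),
// keeping at most k entries in the heap at any time.
std::string kth_by_bounded_heap(const std::vector<CountUrl>& items, std::size_t k) {
    assert(k >= 1 && k <= items.size());
    // std::greater turns priority_queue into a min-heap: top() is the smallest kept entry.
    std::priority_queue<CountUrl, std::vector<CountUrl>, std::greater<CountUrl>> heap;
    for (const CountUrl& item : items) {
        if (heap.size() < k) {
            heap.push(item);                          // still filling the first k slots
        } else if (item.first > heap.top().first) {
            heap.pop();                               // evict the smallest of the kept k
            heap.push(item);                          // O(log K)
        }
    }
    return heap.top().second;  // smallest of the k largest is the k-th largest
}
```

Because the heap never grows beyond K entries, this version also works while streaming over the counts instead of holding a full N-element heap.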

Comparing the two methods: if K is small, like 20, the heap method is the better choice, since the O(K log N) popping cost is negligible and a heap has no quadratic worst case. If N is large and K is large too, like 100000, quickselect could be better.

Following is the implementation for both algorithms.
