My Blog List

find 20th popularist URLs

Given a file containing a huge volume of URLs, find the 20th popularist URL.

First use hash ( std::map ) to count the occurrence time of each URL. Then the problem is transformed into the problem that give an unorderred array, find out the 20th largest element in the array.

Two intuitive ideas are formed.
Let's assume in total N distinct URLs and find out the Kth URL


  • QuickSelect 


    • Direct use quick select algorithm to find out the 20th URL.
    • Time complexity: O(N) for average time complexity
    • Could be O(N*N) in the worst case.

  • Heap
    •  First build up the maximum heap and then pop up the top element K-1 times
    • Then the next top element will be the Kth URL
    • Build up heap runs in O(N) time if use divide and conquer strategy
    • Each time pop up runs in O(logN)
    • Total time complexity is O(N)+O(K*logN)
    • (If we maintain a heap with fixed size of K elements, then there is no need to build up the entire heap and each insertion runs in O(logK). Thus the total complexity is O(N*logK)
Compare the two method, we can find if the K is small like 20, the second method using a heap would be better. If the N is big and so is the K like 100000 then QuickSelect could be better.

Following is the implementation for both algorithms.


No comments:

Post a Comment

Enter you comment