  1. #1

    Creating a site link list

    I've been trying to create a test/function that will crawl through my site and return ALL the links on the site's domain. This is more for the learning experience and fun than for any specific job-related function. Here are the criteria I set:
    a. All links within the domain (e.g. mydomain.com) must be visited.
    b. All links (internal and external to the domain) within those pages visited must be recorded to a master list.
    c. All links within the master list must be unique (check for duplication before entry).
    d. Once all links have been found the script must exit gracefully (no spinning around endlessly on a set of common menu links) and print the links to the log.
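
    Criteria (a) through (d) can be sketched as a queue of in-domain pages plus a dictionary of every link seen. This is only a minimal sketch, not the attached test: `site`, `LinksOnPage`, and `CrawlSite` are hypothetical names, and the `site` dictionary stands in for real pages so the logic can be run with cscript outside QTP. In a real test, `LinksOnPage` would have to navigate to the URL and read the links off the page.

    ```vbscript
    Option Explicit

    ' Hypothetical stand-in for the real site: page URL -> links found on that page.
    Dim site
    Set site = CreateObject("Scripting.Dictionary")
    site.Add "http://mydomain.com/", Array("http://mydomain.com/a", "http://other.com/x")
    site.Add "http://mydomain.com/a", Array("http://mydomain.com/", "http://mydomain.com/b")
    site.Add "http://mydomain.com/b", Array("http://mydomain.com/a", "http://other.com/x")

    ' Hypothetical helper: returns the links on a page (empty array for unknown pages).
    Function LinksOnPage(url)
        If site.Exists(url) Then
            LinksOnPage = site(url)
        Else
            LinksOnPage = Array()
        End If
    End Function

    ' Visits every in-domain page once and returns a Dictionary whose keys
    ' are all unique links found (internal and external).
    Function CrawlSite(startUrl, domain)
        Dim seen, queue(), pos, link
        Set seen = CreateObject("Scripting.Dictionary") ' master list; keys are unique
        ReDim queue(0)
        queue(0) = startUrl                             ' in-domain pages to visit
        pos = 0
        Do While pos <= UBound(queue)
            For Each link In LinksOnPage(queue(pos))
                If Not seen.Exists(link) Then
                    seen.Add link, True
                    ' Naive substring test for "in domain"; queue each page only once.
                    If InStr(1, link, domain, vbTextCompare) > 0 And link <> startUrl Then
                        ReDim Preserve queue(UBound(queue) + 1)
                        queue(UBound(queue)) = link
                    End If
                End If
            Next
            pos = pos + 1
        Loop
        Set CrawlSite = seen
    End Function

    Dim found, key
    Set found = CrawlSite("http://mydomain.com/", "mydomain.com")
    For Each key In found.Keys
        WScript.Echo key
    Next
    ```

    Because a link is queued only the first time it is seen, the loop ends once no page yields anything new, so shared menu links cannot keep it spinning.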

    I've created some functions (with the help of this forum) and a base test, and I get a lot of the links, but I'm having trouble creating the logic to navigate through all of the site links. The current (attached) version doesn't meet requirements (a) and (b) above. Anybody got any ideas?
    Attached Files
    A good rule of thumb is to never measure with your thumb.

  2. #2

    Re: Creating a site link list

    We have discussed this many times in this forum, and I think we have many solutions.
    I am attaching 3 different solutions.
    Try them and see if they are helpful.

    BTW: I don't think I wrote any of the attached functions myself. I guess Tarun wrote most or all of them.
    Attached Files
    "I realize it's an error, but no one is going to try to do that!"
    From "Top 10 Stupid Comments from Developers".

  3. #3

    Re: Creating a site link list

    Thanks for the post, chilly. Yeah, that CheckLinks function is what sent me down this rabbit hole to begin with.

    I think I've actually solved my own problem. I was struggling with the logic for getting all the site page links. I was trying to do some For/Next looping over the site links found on each page and then move to the next deeper level, but you never know how many levels there may be. So instead I used a Do...Loop structure, and it seems to work pretty well in the meager testing I've done on it. It (currently) won't handle links that open in new browser windows, however.

    Here's the meat of the code for anyone who's read this far and is interested:

    Code:
    ```vbscript
    Set BO = Browser("micClass:=Browser", "application version:=internet explorer 6")
    Set PO = BO.Page("micClass:=Page")
    Reporter.Filter = 3 ' rfDisableAll: disable the log for now
    '******************************************************************
    ReDim AllLinks(0)  ' master list of every unique link found
    ReDim SiteLinks(0) ' unique links that contain rooturl (pages to visit)
    increment = 0
    MyHome = Browser("micclass:=Browser").Page("micclass:=Page").GetROProperty("url")
    rooturl = "my.testdomain.com"
    MyPageLinks = ReturnLinks()

    moreLinks = InsertLinks(MyPageLinks, AllLinks) ' inserts any unique links into the master list (AllLinks)
    result = GetSiteLinks(MyPageLinks, rooturl, SiteLinks) ' inserts any unique links that contain rooturl into SiteLinks

    Do
        BO.Navigate SiteLinks(increment) ' move to the next page
        CloseJSDialog ' close any JavaScript error popups that may occur
        increment = increment + 1
        Erase MyPageLinks
        MyPageLinks = ReturnLinks() ' grab all the links on that page
        moreLinks = InsertLinks(MyPageLinks, AllLinks) ' insert any unique links
        result = GetSiteLinks(MyPageLinks, rooturl, SiteLinks) ' insert any unique site links
    Loop While UBound(SiteLinks) > increment
    ```
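
    The helpers `InsertLinks` and `GetSiteLinks` are not shown in the post, so the following is only a guess at what they might look like, written to match how they are called above. Everything here is an assumption: the return values (a count of added links), the case-insensitive comparison, and the treatment of the single empty slot left by `ReDim AllLinks(0)`. It is plain VBScript and can be run with cscript outside QTP.

    ```vbscript
    Option Explicit

    ' Hypothetical: append each entry of newLinks to masterList (ByRef) unless
    ' it is already present; returns the number of entries added.
    Function InsertLinks(newLinks, ByRef masterList)
        Dim link, i, found, added
        added = 0
        For Each link In newLinks
            found = False
            For i = 0 To UBound(masterList)
                If StrComp(masterList(i), link, vbTextCompare) = 0 Then
                    found = True
                    Exit For
                End If
            Next
            If Not found Then
                ' Reuse the empty slot created by ReDim x(0); otherwise grow by one.
                If Not (UBound(masterList) = 0 And IsEmpty(masterList(0))) Then
                    ReDim Preserve masterList(UBound(masterList) + 1)
                End If
                masterList(UBound(masterList)) = link
                added = added + 1
            End If
        Next
        InsertLinks = added
    End Function

    ' Hypothetical: insert only the links that contain rootUrl into siteList.
    Function GetSiteLinks(pageLinks, rootUrl, ByRef siteList)
        Dim link, added
        added = 0
        For Each link In pageLinks
            If InStr(1, link, rootUrl, vbTextCompare) > 0 Then
                added = added + InsertLinks(Array(link), siteList)
            End If
        Next
        GetSiteLinks = added
    End Function

    Dim AllLinks(), SiteLinks(), pageLinks
    ReDim AllLinks(0)
    ReDim SiteLinks(0)
    pageLinks = Array("http://my.testdomain.com/a", "http://elsewhere.example/", "http://my.testdomain.com/a")
    InsertLinks pageLinks, AllLinks
    GetSiteLinks pageLinks, "my.testdomain.com", SiteLinks
    WScript.Echo "All:  " & Join(AllLinks, ", ")
    WScript.Echo "Site: " & Join(SiteLinks, ", ")
    ```

    A linear scan is fine for a few hundred links; for larger sites a Scripting.Dictionary keyed on the URL would avoid rescanning the whole list on every insert.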

  4. #4

    Re: Creating a site link list

    I have started creating my own variant, but after a while I get a "Permission Denied" error when the script tries to read the element.tagname property. It works at first, but then gives this error. Any thoughts?

    Code:
    ```vbscript
    Dim Visited
    Visited = Array() ' Array to store all visited links (so we can check for duplicates)

    ' Array for links from a given page
    ReDim LinkList(0, 2) ' 1 row, 3 columns (upper bounds are inclusive)

    Function CheckLinks(theBrowser, BrowserPage, searchDomain)
        Dim i, hostname, pathname, fullpath
        i = 0 ' index of the current link on the page

        ' Save the current page as visited
        ReDim Preserve Visited(UBound(Visited) + 1)
        Visited(UBound(Visited)) = theBrowser.GetROProperty("url")
        CheckLinks = True

        IHTML = BrowserPage.Object.Body.innerHTML
        ' If the page is not valid
        If (InStr(IHTML, "HTTP 404") <> 0) Or (InStr(IHTML, "cannot be displayed") <> 0) Or (InStr(IHTML, "Error") <> 0) Then
            ' Browser back
            ' Return false
            CheckLinks = False
            Exit Function
        End If

        Set taglist = Nothing
        ' Get all links on the page
        Set taglist = BrowserPage.Object.links

        ' For each link
        For Each element In taglist
            If UCase(element.tagname) = "A" And (element.InnerText) <> "" Then
                ' Create a path for the current link
                fullpath = element.hostname
                If element.pathname <> "" Then
                    fullpath = fullpath & "/" & element.pathname
                End If
                ' If in search domain and not visited
                If (InStr(fullpath, searchDomain) > 0) And UBound(Filter(Visited, fullpath)) < 0 Then
                    ' Add to visited list
                    ReDim Preserve Visited(UBound(Visited) + 1)
                    Visited(UBound(Visited)) = fullpath
                    ' Click link
                    BrowserPage.Link("micClass:=Link", "index:=" & i).Click
                    MsgBox("Visiting " & fullpath)
                    ' Recurse into the page the click opened
                    If Not CheckLinks(theBrowser, BrowserPage, searchDomain) Then
                        ' Log link as no good
                    End If
                End If
                ' Otherwise the link is outside the search domain or already visited.
                ' If it's a link, increment i so we keep track of the index number
                i = i + 1
            End If
        Next
    End Function

    SystemUtil.Run "http://www.google.com/webhp", "", "open"
    Set theBrowser = Browser("index:=0")
    Set BrowserPage = theBrowser.Page("micClass:=Page")

    CheckLinks theBrowser, BrowserPage, "google.com"

    MsgBox("All Done!")
    ```
    Scott Genevish
    Principal Consultant
    Designed Quality




Copyright BetaSoft Inc.