Can a script be written for this task??

rickenjus

In the zone
I don't know how to write a script, and I wonder whether a script is even possible for this task.


I want to fetch all the links from a webpage and then edit them, except for a few, in the following way:

Say this is the link --> http:/www.*.com/abc/xyz

Replace the "xyz" portion of the link with another word, like "pqr".

All the links on the webpage are uniform. I want to repeat this process for all the links and have them open in another browser tab, or at least save them to a file.



Sorry if this sounds a little weird, but I want to edit nearly 1000 links, not all at once, but in batches.
 
rickenjus

In the zone
Well, it has to be a right-click and then "copy link". The text is there...

like this
 

cute.bandar

Cyborg Agent
Easy peasy.
Fetch the page using curl, then use your favorite language to regex the links out. If regex is not your cup of tea, try something like QueryPath (for PHP).
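
For what it's worth, here is a rough Python sketch of the "regex the links out" step. It assumes the page source is already in a string (fetched with curl or anything else). The pattern and the name extract_links are made up for illustration, and regex matching on HTML is fragile, so treat this as a starting point rather than a robust parser.

```python
import re

# Assumed pattern: captures the quoted href value of an <a ...> tag.
# It handles single or double quotes but not unquoted attribute values.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return every quoted href value found in anchor tags, in page order."""
    return [match.group(2) for match in HREF_RE.finditer(html)]


if __name__ == "__main__":
    # Hypothetical page source, used only to show the call.
    sample = '<p><a href="http://example.com/abc/xyz">one</a></p>'
    for link in extract_links(sample):
        print(link)
```

If the markup gets messier than this pattern can cope with, an HTML parsing library is the safer route.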
 

vickybat

I am the night...I am...
I think harshil is right about jsoup. It's a library that has the right set of methods to parse HTML.

The technical term for what you want is a "web crawler".
You simply want to go through the URLs attached to the anchor tags in the HTML, i.e. the values of their 'href' attributes.
Then, with each URL, you want to perform an operation, like replacing a portion of it with a different string.
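
To make those two steps concrete, here is a small sketch that uses only Python's standard html.parser module: it collects the 'href' of every anchor tag and then applies an operation to each URL. The names AnchorCollector, collect_links and transform_links are hypothetical, chosen just for this illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def collect_links(html):
    """Return the href values of all anchor tags in html, in page order."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def transform_links(html, operation):
    """Apply operation (any function taking and returning a URL) to each link."""
    return [operation(url) for url in collect_links(html)]


if __name__ == "__main__":
    # Hypothetical page source, used only to show the call.
    page = '<a href="http://example.com/abc/xyz">a</a>'
    print(transform_links(page, lambda url: url.replace("/xyz", "/pqr")))
```

The operation is passed in as a function, so the same collector can be reused for any kind of edit.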

During my initial days of learning programming, I started with the basics of Python and studied a web crawler.
Python has well-defined methods and is perhaps the easiest language in which to implement a web crawler.

I guess the same can be done in Java using jsoup, as Harshil showed. I have never used jsoup, though, and have no idea of its API and methods.

But I have the Python code that I learned way back. I will share it here:

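Python:
# Simple web crawler: collects every URL reachable from a seed page.

def get_page(url):
    # Placeholder: return the HTML source of the page at `url`.
    # Any URL other than the one below yields an empty string.
    if url == "*xyz.com":
        return '..... '  # The entire HTML of the page goes here
    return ""

def get_next_target(page):
    # Return the first link in `page` and the index of its closing quote,
    # or (None, 0) when no link is left.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def union(p, q):
    # Append to p every element of q that p does not already contain.
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page):
    # Collect every link target found in `page`, in order of appearance.
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:  # an empty href="" must not end the scan early
            break
        links.append(url)
        page = page[endpos:]
    return links

def crawl_web(seed):
    # Follow links starting from `seed`; return every URL that was crawled.
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled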

What this program does is simply crawl through all the URLs reachable from a supplied page and store them in a list.
I was taught this in one of Udacity's courses.

It does not modify the individual URLs, which is what you want. I will try to refactor this code to perform the required function.
Will Python be OK for you, or do you need a different implementation?
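
In the meantime, here is a rough sketch of what the URL-editing part could look like, assuming the links have already been collected into a list. It swaps the last path segment, leaves a chosen set of links alone, splits the result into batches, and can save to a file or open browser tabs. All function names here are hypothetical, and the example.com URLs only stand in for the real site.

```python
import webbrowser
from urllib.parse import urlsplit, urlunsplit


def replace_last_segment(url, old, new):
    """Return url with its final path segment changed from old to new."""
    parts = urlsplit(url)
    head, sep, last = parts.path.rpartition("/")
    if last != old:
        return url  # leave URLs that do not end in `old` untouched
    return urlunsplit(parts._replace(path=head + sep + new))


def rewrite_links(links, old, new, skip=()):
    """Rewrite every link except the ones listed in skip."""
    return [
        link if link in skip else replace_last_segment(link, old, new)
        for link in links
    ]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def save_links(links, path):
    """Write one link per line to the file at path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(links) + "\n")


def open_in_tabs(links):
    """Open each link in a new tab of the default browser."""
    for link in links:
        webbrowser.open_new_tab(link)


if __name__ == "__main__":
    found = ["http://www.example.com/abc/xyz", "http://www.example.com/def/xyz"]
    edited = rewrite_links(found, "xyz", "pqr", skip={"http://www.example.com/def/xyz"})
    for batch in batches(edited, 50):
        print(batch)  # or save_links(batch, "links.txt") / open_in_tabs(batch)
```

Working in batches keeps the browser from being asked to open a thousand tabs at once; the batch size of 50 is just a guess.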
 
Please elaborate a little, I've never used jsoup before.

Jsoup is a Java-based HTML parsing library. It lets you fetch a URL's source code very easily and parse it like a breeze. You can then do anything with the obtained source code. Suppose you extracted another URL from the obtained code: you can get its code too in a similar way. Some programming skill is needed.
 
rickenjus

In the zone
@vickybat: thanks a lot, buddy. No problem with Python.

@harshil: thanks.
I will give it a try. I know C, C++ and a little Java. Will that be sufficient?
 

nisargshah95

Your Ad here
As mprahladka said, this can be achieved using Python (and the urllib2 library). In fact, there is a script for extracting images from 9gag that implements extracting links from an HTML page. You might want to look at this GitHub page.
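
A note for anyone on Python 3: urllib2 is a Python 2 module, and its functionality now lives in urllib.request. A minimal sketch of the fetching step might look like the following; the name fetch_html and the 10-second timeout are assumptions for illustration only.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download url and return its body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html on the page address gives back the source as a string, which can then be handed to whatever link-extraction code you prefer.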

Sorry to bump into this thread late :D
 