I don't know how to write a script, and I wonder if a script is even possible for this task.
I want to fetch all the links from a webpage and then edit all of them, except a few, in the following way:
say, this is the link --> http://www.*.com/abc/xyz
replace the "xyz" portion of the link with another word, like "pqr"
All the links on the webpage are uniform. I want to repeat this process for every link, and then either open them in another browser tab or at least save them to a file.
Sorry if this sounds a little weird, but I want to edit nearly 1000 links, not all at once, but in batches.
I think Harshil is right about jsoup. It's a library with the right set of methods for parsing HTML.
The technical term for what you want is a "web crawler" in computer science.
You simply want to crawl through the URLs attached to anchor tags in the HTML, reading each one from its href attribute.
Then, for each URL, you want to perform an operation, such as replacing a portion of it with a different string.
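For instance, the per-URL operation on its own is tiny in Python (the domain and path here are just placeholders from your example):
Python:
# Hypothetical link in the shape described in the question.
url = "http://www.example.com/abc/xyz"

# Split off the last path segment and replace it with the new word.
new_url = url.rsplit("/", 1)[0] + "/pqr"

print(new_url)  # prints http://www.example.com/abc/pqr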
During my initial days of learning programming, I started with the basics of Python and studied a web crawler.
Python has well-defined string methods and is perhaps the easiest language in which to implement a web crawler.
I guess the same can be done in Java with jsoup, which Harshil showed, but I have never used jsoup and have no idea of its API and methods.
I do still have the Python code that I learned back then, so I will share it here:
Python:
# Crawl the web from a seed page, collecting every link encountered.

def get_page(url):
    # Course stub: returns hard-coded HTML for one known URL.
    # A real crawler would fetch the page over HTTP here instead.
    try:
        if url == "*xyz.com":
            return '..... '  # The entire HTML of the page goes here
    except:
        return ""
    return ""

def get_next_target(page):
    # Find the next '<a href=' tag and return the quoted URL,
    # plus the position where the next search should resume.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def union(p, q):
    # Append to p every element of q that p does not already contain.
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page):
    # Collect every href URL in the page, in order of appearance.
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed):
    # Visit each page exactly once, queuing any new links found on it.
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
What this program does is simply crawl through all the URLs reachable from a supplied seed page and store them in a list.
I was taught this in one of Udacity's courses.
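For reference, running the crawler is just this (a sketch, assuming get_page() is adapted to return real HTML rather than the stub above; the seed URL is a placeholder):
Python:
# crawl_web() returns the list of every page reached from the seed.
pages = crawl_web("http://www.example.com/abc/index.html")
print(pages)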
It does not yet mutate the individual URLs with a different value, which is what you want. I will try to refactor this code to perform the required function; a first rough sketch is below.
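This sketch fetches a single page with Python 3's urllib, reuses get_all_links() from above, rewrites the last path segment of each link, and saves the results to a file; the page URL, the replacement word, and the output filename are all placeholders you would change.
Python:
import urllib.request

def rewrite_links(page_url, new_segment, outfile):
    # Fetch the page's HTML; get_all_links() is the function defined above.
    html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")
    with open(outfile, "w") as f:
        for url in get_all_links(html):
            # Swap the trailing segment (e.g. "xyz") for the new word (e.g. "pqr").
            new_url = url.rsplit("/", 1)[0] + "/" + new_segment
            f.write(new_url + "\n")
            # To open each link in a browser tab instead, the standard library's
            # webbrowser module works: webbrowser.open_new_tab(new_url)

# Hypothetical usage -- the URL and filename are placeholders:
# rewrite_links("http://www.example.com/abc/", "pqr", "rewritten_links.txt")
Writing to a file first is probably safer than opening tabs directly, since 1000 tabs at once would swamp the browser; you can then work through the saved list in batches, as you wanted.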
Will Python be OK for you, or do you need a different implementation?