I don't know how to write a script, and I wonder if a script is even possible for this task.
I want to fetch all the links from a webpage and then edit all of them, except a few, in the following way:
say, this is the link --> http://www.*.com/abc/xyz
replace the "xyz" portion of the link with another word, like "pqr"
All the links on the webpage are uniform. I want to repeat this process for every link, and then either open them in another browser tab or at least save them to a file.
Sorry if this sounds a little weird, but I want to edit nearly 1000 links, not all at once, but in batches.
I think Harshil is right about jsoup. It's a library with the right set of methods for parsing HTML.
The technical term for what you want is a "web crawler" in computer science.
You simply want to crawl through the URLs attached to anchor tags in the HTML, reading each one from its href attribute.
Then, for each URL, you want to perform an operation, such as replacing a portion of it with a different string.
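For instance, the per-URL operation on its own is tiny in Python (the domain and path here are just placeholders from your example):
Python:
# Hypothetical link in the shape described in the question.
url = "http://www.example.com/abc/xyz"

# Split off the last path segment and replace it with the new word.
new_url = url.rsplit("/", 1)[0] + "/pqr"

print(new_url)  # prints http://www.example.com/abc/pqr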
During my initial days of learning programming, I started with the basics of Python and studied a web crawler.
Python has well-defined string methods and is perhaps the easiest language in which to implement a web crawler.
I guess the same can be done in Java with jsoup, which Harshil showed, but I have never used jsoup and have no idea of its API and methods.
I do still have the Python code that I learned back then, so I will share it here:
Python:
# Crawl the web from a seed page, collecting every link encountered.

def get_page(url):
    # Course stub: returns hard-coded HTML for one known URL.
    # A real crawler would fetch the page over HTTP here instead.
    try:
        if url == "*xyz.com":
            return '..... '  # The entire HTML of the page goes here
    except:
        return ""
    return ""

def get_next_target(page):
    # Find the next '<a href=' tag and return the quoted URL,
    # plus the position where the next search should resume.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def union(p, q):
    # Append to p every element of q that p does not already contain.
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page):
    # Collect every href URL in the page, in order of appearance.
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed):
    # Visit each page exactly once, queuing any new links found on it.
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
What this program does is simply crawl through all the URLs reachable from a supplied seed page and store them in a list.
I was taught this in one of Udacity's courses.
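For reference, running the crawler is just this (a sketch, assuming get_page() is adapted to return real HTML rather than the stub above; the seed URL is a placeholder):
Python:
# crawl_web() returns the list of every page reached from the seed.
pages = crawl_web("http://www.example.com/abc/index.html")
print(pages)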
It does not yet mutate the individual URLs with a different value, which is what you want. I will try to refactor this code to perform the required function; a first rough sketch is below.
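This sketch fetches a single page with Python 3's urllib, reuses get_all_links() from above, rewrites the last path segment of each link, and saves the results to a file; the page URL, the replacement word, and the output filename are all placeholders you would change.
Python:
import urllib.request

def rewrite_links(page_url, new_segment, outfile):
    # Fetch the page's HTML; get_all_links() is the function defined above.
    html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")
    with open(outfile, "w") as f:
        for url in get_all_links(html):
            # Swap the trailing segment (e.g. "xyz") for the new word (e.g. "pqr").
            new_url = url.rsplit("/", 1)[0] + "/" + new_segment
            f.write(new_url + "\n")
            # To open each link in a browser tab instead, the standard library's
            # webbrowser module works: webbrowser.open_new_tab(new_url)

# Hypothetical usage -- the URL and filename are placeholders:
# rewrite_links("http://www.example.com/abc/", "pqr", "rewritten_links.txt")
Writing to a file first is probably safer than opening tabs directly, since 1000 tabs at once would swamp the browser; you can then work through the saved list in batches, as you wanted.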
Will Python be OK for you, or do you need a different implementation?