[01:33] <willwh> hey guys
[01:33] <willwh> all the chatter died down I see
[01:33] <willwh> quick question though..... :)
[01:34] <MagicFab> willwh, ask away
[01:35] <willwh> well - I am using a real simple bash script atm to scrape pages for image links.
[01:35] <willwh> I have to do this for all sorts of sites
[01:35] <willwh> so, sometimes, it'll be full paths, sometimes relative, etc.
[01:35] <willwh> one sec let me throw it somewhere you can see it :]
[01:35] <willwh> I'm looking for tips for expanding it
[01:36] <willwh> complete tangent - anyone tried irssi-xmpp? :]
[01:37] <willwh> I <3 my irssi setup, using screen
[01:38] <MagicFab> sorry, I thought your question was about Ubuntu. Perhaps someone else can help.
[01:38] <willwh> :)
[01:38] <willwh> It's not
[01:38] <MagicFab> I've had my share of scripting this week ;)
[01:38] <willwh> thanks for the offer ofc
[01:38] <willwh> I'm sure you have!
[01:39] <willwh> mine is a really simple 2 liner atm
[01:39] <willwh> but I'd love to really fill it out
[01:39] <willwh> http://home.willskills.com/~willwh/crunch.txt
[01:39] <willwh> which allows me to just paste a URL and throw out any line containing http:// (and grep the string 'paris')
[01:39] <willwh> in this case
[01:40] <willwh> and I can just keep pasting and getting output
[01:46] <bregma> willwh, what else do you want to do with your script?
[01:50] <willwh> bregma: I guess I'd like to check for any <img* links
[01:50] <willwh> if it's a relative path, perhaps print the whole version of the link out
[01:50] <willwh> or just print it, if it's full path
[01:51] <willwh> i.e. links to images, as well as images on the page
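
A rough sketch of what willwh describes here (collect <img> sources plus links that point at images, and expand relative paths to full URLs), using only the Python standard library. The names `ImageLinkParser`, `extract_image_links` and `IMAGE_EXTENSIONS` are made up for illustration, and the extension list is an assumption; this is not the contents of crunch.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of extensions that count as "image links"; adjust as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


class ImageLinkParser(HTMLParser):
    """Collects absolute URLs for <img src=...> and <a href=...image>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being scraped
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # urljoin leaves full URLs alone and resolves relative ones.
            self.links.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            href = attrs["href"]
            if href.lower().endswith(IMAGE_EXTENSIONS):
                self.links.append(urljoin(self.base_url, href))


def extract_image_links(html, base_url):
    """Return image URLs found in html, in document order."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Fetching the page (for example with `urllib.request.urlopen`) and filtering on a string such as 'paris' would sit on either side of `extract_image_links`; the sketch leaves both out.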
[01:52] <willwh> someone I was speaking to in channel a while back (I'd have to grep my logs)
[01:52] <willwh> was going down the perl route - apparently a library that's pretty good for what I want to do
[01:52] <willwh> I'd like to stick to bash purely for learning :)
[01:53] <bregma> bash doesn't do good complex text handling
[01:53] <willwh> ah
[01:53] <bregma> python would probably be ideal, perl if you have no other choice
[01:53] <willwh> ok.
[01:54] <bregma> the classic approach was to use awk for the text processing in a shell script
[01:55] <willwh> yes
[01:55] <willwh> http://stackoverflow.com/questions/5927031/python-get-image-link-from-html - I guess this is an ok primer
[01:55] <willwh> kind of similar to what I want to do
[01:59] <bregma> yeah, xpath is the technology you want for extracting stuff from xhtml, and maybe well-formed html
[02:00] <bregma> not my realm of expertise, though
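
A small illustration of the XPath idea bregma mentions, using the limited XPath subset in Python's standard `xml.etree.ElementTree`. It only works on well-formed markup, which matches the caveat above. The sample document and the `img_sources` helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample; it must be well-formed XML or ET.fromstring raises ParseError.
SAMPLE_XHTML = """<html><body>
<p><img src="a.png" alt="A"/></p>
<div><a href="b.jpg"><img src="thumbs/b.jpg"/></a></div>
<p><img alt="no source"/></p>
</body></html>"""


def img_sources(xhtml_text):
    """Return the src attribute of every <img> that has one."""
    root = ET.fromstring(xhtml_text)
    # ".//img[@src]" selects img elements at any depth that carry a src attribute.
    # Note: a document declaring the XHTML namespace would need
    # namespace-qualified tag names in this expression.
    return [img.get("src") for img in root.findall(".//img[@src]")]


print(img_sources(SAMPLE_XHTML))
```

Real-world HTML is often not well-formed, which is where a more forgiving parser (such as the BeautifulSoup or lxml libraries mentioned in this log) tends to be used instead.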
[04:25] <dscassel> willwh: I've used BeautifulSoup (mentioned in your link).
[04:25] <dscassel> Works well, I've found.
[04:25] <dscassel> Mostly if you know where the element is in the tree.  You might need a bit of code to find it, if there's not an easy API call
[06:03] <willwh> dscassel: thank you
[06:03] <willwh> I've read though that it's not being maintained any longer?
[06:03] <willwh> lxml looks like it might do what I want too
[12:31] <BluesKaj> Howdy
[22:23] <dscassel> willwh: Whatever works. :)
[22:24] <dscassel> I think the Gnome people might be losing their minds.
[22:24] <dscassel> But then, maybe it's genius I just can't see.