willwh | hey guys | 01:33 |
---|---|---|
willwh | all the chatter died down I see | 01:33 |
willwh | quick question though..... :) | 01:33 |
MagicFab | willwh, ask away | 01:34 |
willwh | well - I am using a real simple bash script atm to scrape pages for image links. | 01:35 |
willwh | I have to do this for all sorts of sites | 01:35 |
willwh | so, sometimes, it'll be full paths, sometimes relative, etc. | 01:35 |
willwh | one sec let me throw it somewhere you can see it :] | 01:35 |
willwh | I'm looking for tips for expanding it | 01:35 |
willwh | complete tangent - anyone tried irssi-xmpp? :] | 01:36 |
willwh | I <3 my irssi setup, using screen | 01:37 |
MagicFab | sorry, I thought your question was about Ubuntu. Perhaps someone else can help. | 01:38 |
willwh | :) | 01:38 |
willwh | It's not | 01:38 |
MagicFab | I've had my share of scripting this week ;) | 01:38 |
willwh | thanks for the offer ofc | 01:38 |
willwh | I'm sure you have! | 01:38 |
willwh | mine is a really simple 2 liner atm | 01:39 |
willwh | but I'd love to really fill it out | 01:39 |
willwh | http://home.willskills.com/~willwh/crunch.txt | 01:39 |
willwh | which allows me to just paste a URL and throw out any line containing http:// (and grep the string 'paris') | 01:39 |
willwh | in this case | 01:39 |
willwh | and I can just keep pasting and getting output | 01:40 |
bregma | willwh, what else do you want to do with your script? | 01:46 |
willwh | bregma: I guess I'd like to check for any <img* links | 01:50 |
willwh | if it's a relative path, perhaps print the whole version of the link out | 01:50 |
willwh | or just print it, if it's full path | 01:50 |
willwh | i.e. links to images, as well as images on the page | 01:51 |
willwh | someone I was speaking to in channel a while back (I'd have to grep my logs) | 01:52 |
willwh | was going down the perl route - apparently a library that's pretty good for what I want to do | 01:52 |
willwh | I'd like to stick to bash purely for learning :) | 01:52 |
bregma | bash doesn't do good complex text handling | 01:53 |
willwh | ah | 01:53 |
bregma | python would probably be ideal, perl if you have no other choice | 01:53 |
willwh | ok. | 01:53 |
bregma | the classic approach was to use awk for the text processing in a shell script | 01:54 |
willwh | yes | 01:55 |
willwh | http://stackoverflow.com/questions/5927031/python-get-image-link-from-html - I guess this is an ok primer | 01:55 |
willwh | kind of similar to what I want to do | 01:40 |
bregma | yeah, xpath is the technology you want for extracting stuff from xhtml, and maybe well-formed html | 01:59 |
bregma | not my realm of expertise, though | 02:00 |
dscassel | willwh: I've used BeautifulSoup (mentioned in your link). | 04:25 |
dscassel | Works well, I've found. | 04:25 |
dscassel | Mostly if you know where the element is in the tree. You might need a bit of code to find it, if there's not an easy API call | 04:25 |
willwh | dscassel: thank you | 06:03 |
willwh | I've read though that it's not being maintained any longer? | 06:03 |
willwh | lxml looks like it might do what I want too | 06:03 |
=== maverickpi is now known as maverickpi[afk] | ||
BluesKaj | Howdy | 12:31 |
=== maverickpi[afk] is now known as maverickpi | ||
dscassel | willwh: Whatever works. :) | 22:23 |
dscassel | I think the Gnome people might be losing their minds. | 22:24 |
dscassel | But then, maybe it's genius I just can't see. | 22:24 |
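
Editor's note: willwh's goal in the log (collect `<img>` sources and links to images, printing relative paths in their full form) can be sketched with only the Python standard library, for readers who do not want to install BeautifulSoup or lxml. This is a rough illustration, not the script from crunch.txt; the class name, the function name, the extension list, and the example.com URLs are all assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: which link targets count as "images" is decided by file extension.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


class ImageLinkParser(HTMLParser):
    """Collects image URLs from <img src> and from <a href> pointing at image files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self._add(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            href = attrs["href"]
            # Ignore any query string when checking the extension.
            if href.lower().split("?", 1)[0].endswith(IMAGE_EXTENSIONS):
                self._add(href)

    def _add(self, link):
        # urljoin leaves absolute URLs alone and resolves relative ones
        # against the URL of the page they were found on.
        absolute = urljoin(self.base_url, link)
        if absolute not in self.urls:
            self.urls.append(absolute)


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in html, in order of appearance."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    sample = (
        '<a href="/pics/a.JPG"><img src="thumbs/a.png"></a>'
        '<img src="http://other.example/b.gif">'
        '<a href="page.html">not an image</a>'
    )
    for url in extract_image_urls(sample, "http://example.com/gallery/index.html"):
        print(url)
```

Fetching the page itself is left out on purpose; the HTML could come from `urllib.request.urlopen` or from a file saved by the existing bash script. Resolving relative links with `urljoin` is the part that tends to be painful in plain bash, which fits bregma's advice to move to Python for this kind of text handling.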