/srv/irclogs.ubuntu.com/2011/05/19/#ubuntu-ca.txt

willwh	hey guys	01:33
willwh	all the chatter died down I see	01:33
willwh	quick question though..... :)	01:33
MagicFab	willwh, ask away	01:34
willwh	well - I am using a real simple bash script atm to scrape pages for image links.	01:35
willwh	I have to do this for all sorts of sites	01:35
willwh	so, sometimes, it'll be full paths, sometimes relative, etc.	01:35
willwh	one sec let me throw it somewhere you can see it :]	01:35
willwh	I'm looking for tips for expanding it	01:35
willwh	complete tangent - anyone tried irssi-xmpp? :]	01:36
willwh	I <3 my irssi setup, using screen	01:37
MagicFab	sorry, I thought your question was about Ubuntu. Perhaps someone else can help.	01:38
willwh	:)	01:38
willwh	It's not	01:38
MagicFab	I've had my share of scripting this week ;)	01:38
willwh	thanks for the offer ofc	01:38
willwh	I'm sure you have!	01:38
willwh	mine is a really simple 2 liner atm	01:39
willwh	but I'd love to really fill it out	01:39
willwh	http://home.willskills.com/~willwh/crunch.txt	01:39
willwh	which allows me to just paste a URL and throw out any line containing http:// (and grep the string 'paris')	01:39
willwh	in this case	01:39
willwh	and I can just keep pasting and getting output	01:40
bregma	willwh, what else do you want to do with your script?	01:46
willwh	bregma: I guess I'd like to check for any <img* links	01:50
willwh	if it's a relative path, perhaps print the whole version of the link out	01:50
willwh	or just print it, if it's full path	01:50
willwh	i.e. links to images, as well as images on the page	01:51
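A minimal sketch of what willwh describes here, in Python rather than bash (the language suggested later in the log) and using only the standard library. It collects `<img>` sources and links that point at image files, and expands relative paths against the page URL. The names `ImageLinkParser`, `extract_image_urls` and `IMAGE_EXTENSIONS` are hypothetical, the extension list is an assumption, and fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of extensions that count as "an image link".
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


class ImageLinkParser(HTMLParser):
    """Collects absolute URLs for <img src> and for <a href> pointing at images."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # urljoin leaves full URLs alone and expands relative ones.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            # Only keep links whose path ends in a known image extension.
            if urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS):
                self.images.append(url)


def extract_image_urls(html, base_url):
    """Return image URLs found in html, resolved against base_url."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.images
```

Calling `extract_image_urls(page_source, page_url)` returns the URLs in the order they appear on the page. `urljoin` handles the full-path versus relative-path cases in one place.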
willwh	someone I was speaking to in channel a while back (I'd have to grep my logs)	01:52
willwh	was going down the perl route - apparently a library that's pretty good for what I want to do	01:52
willwh	I'd like to stick to bash purely for learning :)	01:52
bregma	bash doesn't handle complex text processing well	01:53
willwh	ah	01:53
bregma	python would probably be ideal, perl if you have no other choice	01:53
willwh	ok.	01:53
bregma	the classic approach was to use awk for the text processing in a shell script	01:54
willwh	yes	01:55
willwh	http://stackoverflow.com/questions/5927031/python-get-image-link-from-html - I guess this is an ok primer	01:55
willwh	kind of similar to what I want to do	01:55
bregma	yeah, xpath is the technology you want for extracting stuff from xhtml, and maybe well-formed html	01:59
bregma	not my realm of expertise, though	02:00
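A rough sketch of the XPath idea bregma mentions, assuming the input is well-formed XHTML with no XML namespace declared. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset. The function name `img_sources` is hypothetical. Ordinary HTML from the web is often not well-formed and would fail to parse here.

```python
import xml.etree.ElementTree as ET


def img_sources(xhtml):
    """Return the src attribute of every <img> that has one, in document order."""
    root = ET.fromstring(xhtml)
    # ".//img[@src]" selects all img descendants that carry a src attribute.
    return [img.get("src") for img in root.findall(".//img[@src]")]
```

For a document that declares the XHTML namespace, the path would need to include that namespace as well. For messy real-world HTML, a tolerant parser is the safer choice.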
dscassel	willwh: I've used BeautifulSoup (mentioned in your link).	04:25
dscassel	Works well, I've found.	04:25
dscassel	Mostly if you know where the element is in the tree. You might need a bit of code to find it, if there's not an easy API call	04:25
willwh	dscassel: thank you	06:03
willwh	I've read though that it's not being maintained any longer?	06:03
willwh	lxml looks like it might do what I want too	06:03
=== maverickpi is now known as maverickpi[afk]
BluesKaj	Howdy	12:31
=== maverickpi[afk] is now known as maverickpi
dscassel	willwh: Whatever works. :)	22:23
dscassel	I think the Gnome people might be losing their minds.	22:24
dscassel	But then, maybe it's genius I just can't see.	22:24
