
Slash Open Source Project


Snagging Headlines

posted by pudge on 08:47 AM March 21st, 2000
bdv writes "I'd like to grab headlines from other sites, in the Slashdot tradition. However, some of the sites I'm interested in grabbing headlines from and linking to don't appear to offer XML or RDF files to make it easy for me. Not to worry, though--I can do it in Perl. 8^)

Still, it is probably in my best interest to get permission first. And this raises my main question: any advice on convincing the owners of an existing site to make it easy for me? How far along should I be with my site before I go asking for such consideration? I'd appreciate the wisdom of your collected experience."

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by Anonymous Coward
    I've done this to a few sites to post headlines on my page using Perl. Most web sites posting news headlines use a script that just spits out some HTML when they add a new headline, so the HTML surrounding each headline is the same. For instance...

    Headline A
    Headline B

    Download their HTML document, search it for the familiar FONT and TABLE tags that surround each headline, remove that familiar HTML, and you get the headlines.
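
The approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the FONT wrapper, the sample page, and the name extract_headlines are all made up here, since the tags from the original example did not survive in this archive. A real site will need its own pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper: assume the site's script emits every headline as
#   <FONT SIZE="2"><A HREF="...">Headline</A></FONT><BR>
# Real sites use different tags; adjust the pattern to match them.
sub extract_headlines {
    my ($html) = @_;
    my @headlines;
    # Each match captures the link target ($1) and the headline text ($2).
    while ($html =~ m{<FONT SIZE="2"><A HREF="([^"]*)">(.*?)</A></FONT>}gis) {
        push @headlines, { link => $1, title => $2 };
    }
    return @headlines;
}

# Made-up sample page standing in for the downloaded document.
my $page = <<'HTML';
<TABLE><TR><TD>
<FONT SIZE="2"><A HREF="/news/a.html">Headline A</A></FONT><BR>
<FONT SIZE="2"><A HREF="/news/b.html">Headline B</A></FONT><BR>
</TD></TR></TABLE>
HTML

for my $h (extract_headlines($page)) {
    print "$h->{title} => $h->{link}\n";
}
```

Matching the wrapper and capturing what sits inside it amounts to the same thing as deleting the familiar HTML, and it keeps each link paired with its headline.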
  • This is an issue that -- perhaps oddly -- I haven't seen come up too much. Probably it is because most sites WANT you to link to them, and if you are just getting headlines that require people to go to these other sites for the entire story, how could that be a bad deal for these other sites?

    Also, there is the idea of "fair use." That is, it is accepted practice in the news business (print, broadcast, etc.) to reference stories in other media outlets, even to read their headlines and small pieces of their stories. I don't really think anyone can legally stop you from printing their headlines, especially if you link back to their story on their site.

    (Of course, I am not a lawyer, I just took one media law class for my journalism degree, and that was a long time ago, so consult a real attorney if you decide to go against the wishes of the site.)

    However, legal issues aside, it is probably best to ask permission if they do not explicitly provide the data for you to use. I would probably start off by just sending them a note (either to the appropriate contact address from their site, or to webmaster@ their site, or something) saying I would like to provide links to their site in the form of headlines from their site. If they say Yes, great. If they say No, you may want to point out that it can only give them more hits, not fewer. If they still say No, you have a decision to make: abide by their decision or not.
  • Here's a script I use on my slashsite. I modify it to fit any decent page of education headlines. Be careful, though: I've been caught out a few times when people have changed the styling of their news pages.

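    #!/usr/bin/perl -w
    use LWP::Simple;
    use DBI;

    $URL=' default.stm';
    # Fetch the headline page; give up if the download fails.
    unless (defined ($content = get $URL)) {
        die "could not get $URL\n";
    }

    # Normalize the markup so the anchors are easy to match.
    $content =~ s#</A.*?<A#</A><A#gi;
    $content =~ s#"#'#gi;
    $content =~ s#<BR>##gi;
    $content =~ s#<A.*?stm'><IMG##gi;
    # printf "%s\n", $content;
    $spout = "";

    # Collect the link text and the matching link targets.
    @titles = $content =~ m#stm'>.*?</A#gi;
    @links = $content =~ m#<A HREF=.*?stm'>#gi;

    # Build a list item for up to 15 headlines, stopping early if fewer were found.
    for ($i = 0; $i <= 14 && $i <= $#titles; $i++) {
        $titles[$i] =~ s#stm'>##gi;
        $titles[$i] =~ s#</A##gi;
        $titles[$i] =~ s#'#''#gi;    # double single quotes for the SQL string below
        $links[$i] =~ s#<A HREF='/low##gi;
        $links[$i] =~ s#'>##gi;
        $spout = $spout . "<LI><A HREF=\"" . $links[$i]. "\" TARGET=\"_top\">" . $titles[$i] . "</A></li>\n";
    }

    # printf "%s\n", $spout;

    $dbh = DBI->connect('DBI:mysql:slash','user', 'pass')
        or die "could not connect: $DBI::errstr\n";

    # $spout is interpolated straight into the SQL; a DBI placeholder would be safer.
    $dbh->do("UPDATE blocks SET block = '$spout' where bid = 'bbc'");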
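
Since a change in a page's styling can silently break a scraper like this, one option is to check the result before overwriting the stored block. The following is a minimal sketch under assumptions, not part of the script above: the helper name scrape_looks_sane, the default threshold of 5, and the sample titles and links are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide whether a scrape looks sane before the
# stored block is overwritten. The default minimum of 5 is arbitrary.
sub scrape_looks_sane {
    my ($titles, $links, $min) = @_;
    $min = 5 unless defined $min;
    return 0 if @$titles < $min;        # too few headlines found
    return 0 if @$titles != @$links;    # titles and links out of step
    for my $t (@$titles) {
        return 0 if $t =~ /[<>]/;       # leftover markup in a title
        return 0 if $t !~ /\S/;         # empty title
    }
    return 1;
}

# Made-up data standing in for the arrays a scraper would produce.
my @titles = ('Schools get new funding', 'Exam results published');
my @links  = ('/education/1.stm', '/education/2.stm');

if (scrape_looks_sane(\@titles, \@links, 2)) {
    print "OK to update the block\n";
} else {
    warn "page layout may have changed; keeping the old block\n";
}
```

If the check fails, the old block stays in place and the warning shows up in the cron mail or log, which is usually better than publishing an empty or mangled list.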

  • I found what I was looking for on []. Thanks for the pointer. Selecting the RSS flavor from them worked best: portald parsed that flavor perfectly, while I had trouble with the XML and CDF flavors. Someone with more experience with this stuff might be able to say whether it's the XML from Moreover that's wrong, or the parsing as used in portald. FYI, their headline links point back to their site, which then counts hits and redirects to the sites they've 'harvested' headlines from.

    I had another pr

    Brian Del Vecchio bdv @
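
For readers unfamiliar with the RSS flavor mentioned above, here is a naive sketch of pulling headline titles and links out of an RSS 0.91-style file. It is not how portald works; the feed contents, the example.com URLs, and the name rss_items are assumptions made up for illustration. It uses regexes only to stay self-contained, so it ignores CDATA, entities, and namespaces; a real XML parser (for example the XML::RSS module from CPAN) is the more robust choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive sketch: pull the <title>/<link> pair out of each <item> with regexes.
sub rss_items {
    my ($xml) = @_;
    my @items;
    while ($xml =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my $body = $1;
        my ($title) = $body =~ m{<title>\s*(.*?)\s*</title>}is;
        my ($link)  = $body =~ m{<link>\s*(.*?)\s*</link>}is;
        next unless defined $title && defined $link;   # skip incomplete items
        push @items, { title => $title, link => $link };
    }
    return @items;
}

# Made-up feed standing in for a downloaded RSS file.
my $feed = <<'RSS';
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example Feed</title>
    <link>http://www.example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/2</link>
    </item>
  </channel>
</rss>
RSS

# Print the items as list entries, the way a headline box might show them.
for my $item (rss_items($feed)) {
    print qq{<LI><A HREF="$item->{link}">$item->{title}</A></LI>\n};
}
```

Because the channel's own title and link sit outside any item element, they are not picked up as headlines.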
  • In case you did not know, there are a few sites that provide free news/web feeds, and they look fairly stable. I have most recently been shown []

    Don't know much about them specifically, but I know a few people who are trying to pull the news feeds into their commercial sites.

  • I hereby grant any and all websites in the world permission to grab headlines from my site, formatted as a nice, clean RSS file, by retrieving


    Doug Man

  • I recently contacted Yahoo! and some other sites for this very reason. I wanted to get headlines and link back. I was told by Yahoo that they did not have the ability to distribute their headlines. After another inquiry to find out why, I was told that since Yahoo aggregates the information themselves, and often pays for the information, they cannot distribute it.

    I am sure you can rip it with Perl, but you may want to look at who is the guy behind the guy. I wanted up-to-the-minute data that Yahoo was paying for.

    I guess another thing to do would be find another source that supplies what you need. There is always another alternative.

    I forget the name of the other source but mynetscape can link you to a bunch of channels that you could post.

    What sort of headlines are you looking for?
    If you get a chance visit Thanks.