php

[php] Scraping web pages with curl

Without further ado, here is the code.

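<?php
session_start();
header("Content-type: text/html;charset=utf-8");

// Fetch the raw feed; replace "URL" with the real address.
$contents = file_get_contents("URL");

// With PREG_SET_ORDER, $result[$i][1] holds the captured text of the i-th match.
preg_match_all("/<\/?title>(.*?)<\/?title>/", $contents, $result, PREG_SET_ORDER);
preg_match_all("/<\/?pubDate>(.*?)<\/?pubDate>/", $contents, $resultdate, PREG_SET_ORDER);

echo $contents; // debug: dump the raw response
echo "Liberty Times news ticker";
// Print at most the first 10 titles.
for ($i = 0; $i < 10 && $i < count($result); $i++)
{
    echo ($i + 1) . ". " . $result[$i][1];
    echo "<br>";
    // $link and $next are never defined in this snippet, so these lines stay disabled:
    // echo "www.plurk.com/m/p/" . $link[1][$i];
}
// echo "Next page" . $next[1][0];
?>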
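The title mentions curl, while the snippet above fetches the page with file_get_contents. As a minimal sketch of the curl variant (assuming the curl extension is enabled; fetch_page is a hypothetical helper name, not something from the original code), the fetch step could look roughly like this:

```php
<?php
// Hypothetical helper (not part of the original post): fetch a URL with curl.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    // curl_exec() returns false on failure; normalise that to null.
    // The handle is released when $ch goes out of scope.
    return $body === false ? null : $body;
}

// Usage (replace "URL" with the real address):
// $contents = fetch_page("URL");
```

The returned string can then be passed to preg_match_all exactly like the result of file_get_contents.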

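Note that in a preg_match_all pattern the dot (.) does not match newline characters by default, so if the text you want to capture spans several lines, you need to add the s modifier.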

For example,

preg_match_all("/<\/?title>(.*?)<\/?title>/", $contents, $result, PREG_SET_ORDER);

has to be changed to

preg_match_all("/<\/?title>(.*?)<\/?title>/s", $contents, $result, PREG_SET_ORDER);
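To see the difference, here is a small sketch that runs both patterns against a made-up string ($sample is a hypothetical input, not taken from a real feed) whose title spans two lines:

```php
<?php
// Hypothetical sample input: the title text contains a newline.
$sample = "<title>Breaking\nnews</title>";

// Without the s modifier, "." stops at the newline, so nothing matches.
$without = preg_match_all("/<\/?title>(.*?)<\/?title>/", $sample, $m1, PREG_SET_ORDER);

// With the s modifier, "." also matches the newline.
$with = preg_match_all("/<\/?title>(.*?)<\/?title>/s", $sample, $m2, PREG_SET_ORDER);

echo $without, PHP_EOL; // 0 matches
echo $with, PHP_EOL;    // 1 match
echo $m2[0][1], PHP_EOL; // the captured title, newline included
```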

Alternatively, I found that extracting the data through DOM operations is also very pleasant:

<?php
$xml = <<< XML
<?xml version="1.0" encoding="utf-8"?>
<books>
 <book>Patterns of Enterprise Application Architecture</book>
 <book>Design Patterns: Elements of Reusable Software Design</book>
 <book>Clean Code</book>
</books>
XML;
 
$dom = new DOMDocument;
$dom->loadXML($xml);
$books = $dom->getElementsByTagName('book');
foreach ($books as $book) {
    echo $book->nodeValue, PHP_EOL;
}
?>

Source: http://php.net/manual/en/domdocument.getelementsbytagname.php
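The same DOM approach can replace the title and pubDate regexes from the first script. The following is only a sketch: $rss is a made-up feed fragment standing in for the fetched $contents, and the real feed may be structured differently.

```php
<?php
// Hypothetical RSS fragment standing in for the fetched $contents.
$rss = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
 <channel>
  <title>Sample feed</title>
  <item>
   <title>First headline</title>
   <pubDate>Mon, 01 Jan 2024 08:00:00 +0800</pubDate>
  </item>
  <item>
   <title>Second headline</title>
   <pubDate>Mon, 01 Jan 2024 09:00:00 +0800</pubDate>
  </item>
 </channel>
</rss>
XML;

$dom = new DOMDocument;
$dom->loadXML($rss);

$headlines = array();
foreach ($dom->getElementsByTagName('item') as $item) {
    // Search inside each <item>, so the channel-level <title> is skipped.
    $title = $item->getElementsByTagName('title')->item(0)->nodeValue;
    $date  = $item->getElementsByTagName('pubDate')->item(0)->nodeValue;
    $headlines[] = array('title' => $title, 'date' => $date);
}

foreach ($headlines as $n => $h) {
    echo ($n + 1) . ". " . $h['title'] . " (" . $h['date'] . ")", PHP_EOL;
}
```

Because the lookup is scoped to each item element, multi-line titles need no special handling, unlike the regex version.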

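Or go straight to the PHP Simple HTML DOM Parser library (file_get_html comes from its simple_html_dom.php):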

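// Requires the PHP Simple HTML DOM Parser library.
require_once 'simple_html_dom.php';

// file_get_html() already returns a parsed DOM object,
// so the extra str_get_html() call is not needed.
$html = file_get_html('http://localhost/get.php');
foreach ($html->find('tr') as $element)
{
    // Collect the header cells of this row.
    $th = array();
    foreach ($element->find('th') as $cell)
    {
        $th[] = $cell->plaintext;
    }
    print_r($th);

    // Collect the data cells of this row.
    $td = array();
    foreach ($element->find('td') as $cell)
    {
        $td[] = $cell->plaintext;
    }
    print_r($td);
}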