htmlSQL - Version 0.5

htmlSQL is an experimental PHP library that lets you access HTML values using an SQL-like syntax. This means you don't have to write complex functions or regular expressions to extract specific values.

htmlSQL queries look like this:

    SELECT href,title FROM a WHERE $class == "list"
           ^               ^       ^
           |               |       search query (can be empty)
           |               HTML tag to search in ("*" is possible = all tags)
           attributes to return

This query should return an array of all links (a tags) whose class attribute equals "list", giving the href and title of each.
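
To make the meaning of the query concrete, here is a minimal sketch of the same selection written against PHP's bundled DOM extension (listed under related projects). It is not how htmlSQL works internally, and it assumes PHP 7 or newer. The function name `select_links_by_class`, the sample markup, and the row layout (one associative array per link, keyed by attribute name) are assumptions made for illustration.

```php
<?php
// Illustrative only: uses PHP's DOM extension, not htmlSQL itself.
// select_links_by_class() is a hypothetical helper name.
function select_links_by_class(string $html, string $class): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely well formed; collect libxml warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    // FROM a: walk every <a> element in the document.
    foreach ($doc->getElementsByTagName('a') as $a) {
        // WHERE $class == "list": keep exact matches only.
        if ($a->getAttribute('class') === $class) {
            // SELECT href,title: return just these two attributes.
            $rows[] = [
                'href'  => $a->getAttribute('href'),
                'title' => $a->getAttribute('title'),
            ];
        }
    }
    return $rows;
}

// Sample markup invented for this example.
$html = '<p><a class="list" href="/one" title="One">1</a>'
      . '<a class="other" href="/two" title="Two">2</a>'
      . '<a class="list" href="/three" title="Three">3</a></p>';

$rows = select_links_by_class($html, 'list');
print_r($rows);
```

Only the first and third links carry class="list", so the second link is skipped.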

Project Discontinued

htmlSQL was an experiment I did in 2006. I no longer support or extend the library; this repository exists only for historical purposes. Feel free to fork, modify, and study the source code. If you need a reliable library for data scraping, I recommend using one of the related projects listed below.

Related projects:

PHP: phpQuery, SimpleXML, DOM
Perl: WWW::Mechanize, pQuery
Python: Scrapy, Beautiful Soup
JavaScript: node.js
.NET: Html Agility Pack

Related links:

Stack Overflow: Options for HTML scraping?
Stack Overflow: HTML Scraping in PHP
Hacker News: PHP class to query the web by an SQL like language
Hacker News: Ask YC: What do you scrape? How do you scrape?

Requirements

Any flavor of PHP4+ should do
Snoopy PHP class - Version 1.2.3 (optional - required for web transfers)

You can find all Snoopy-related documents (copyright, readme, etc.) in the snoopy_data/ subdirectory.

Usage

Just include the "snoopy.class.php" and "htmlsql.class.php" files in your PHP scripts and look at the examples to get an idea of how to use the htmlSQL library. It should be very simple :-)

Background / idea

I had this idea while extracting some data from a website. I realized that the algorithms and functions for extracting links and other tags are often the same, so I decided to combine them into a universally usable library. While drinking a coffee and thinking about it, I thought it would be cool to access HTML elements using SQL. So I started creating this library...

Warning

The eval() function is used for the WHERE statement. Make sure that all user data is checked and filtered against malicious PHP code. Never trust any user input!
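
One way to follow this advice is to never pass raw user text into a query, and instead accept only values that match a strict whitelist. The sketch below assumes PHP 7 or newer and a scenario where the user supplies only a class name. `build_class_query` is a hypothetical helper, not part of htmlSQL, and the 64-character limit is an arbitrary choice.

```php
<?php
// Hypothetical helper (not part of htmlSQL): builds a query string from
// an untrusted class name, rejecting anything outside a strict whitelist.
function build_class_query(string $userClass): string
{
    // Allow only letters, digits, underscores and hyphens (1-64 chars),
    // so quotes, semicolons, parentheses and "$" can never reach eval().
    if (!preg_match('/\A[A-Za-z0-9_-]{1,64}\z/', $userClass)) {
        throw new InvalidArgumentException('Invalid class name');
    }
    return 'SELECT href,title FROM a WHERE $class == "' . $userClass . '"';
}

// A harmless value passes through unchanged.
echo build_class_query('list'), "\n";

// An injection attempt is rejected before any query is built.
try {
    build_class_query('x" || system("id") || "');
} catch (InvalidArgumentException $e) {
    echo 'Rejected: ', $e->getMessage(), "\n";
}
```

Whitelisting the allowed characters is safer than trying to strip dangerous ones, because anything unexpected is refused by default.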

Todo

Enhance the HTML parser
Test htmlSQL with invalid and bad HTML files
Replace the ugly eval() method for the WHERE statement with a custom method
Add more error checks
Add unit tests
Add a LIMIT function like in SQL

Author

Jonas John

License

htmlSQL uses a modified BSD license. You can find the full license text in "htmlsql.class.php".
