imdbfetch.pl - query movie details from IMDB and store them in a Tellico collection application file
imdbfetch [--version] | [-?|-h|--help] | [-d|--debug] [-i|--noimages] [-d|--directory value] [query query ...]
This program lets you interactively query the Internet Movie Database (IMDB) and write the selected titles into a Tellico file. Tellico is a KDE collection management application by Robby Stephenson, until recently known under the name Bookcase; see http://www.periapsis.org/tellico/.
IMDB is known to change its page layout regularly. Care has been taken to make HTML table parsing as robust as possible. The currently valid table locations in title pages are hard coded, but if they fail to retrieve the correct values, all tables on a title page are scanned for the recognition strings. The recognition strings can be changed in the code; search the source code for the string 'update'.
To avoid overloading the server, this program does not and will not support bulk downloading.
You can run this program in fully interactive mode by invoking it with the program name only. In addition, you can pass it movie titles or their substrings as command line arguments which are processed first.
For each query string, a list of matches is returned in this order: most popular, exact, and partial. You can select one or more of them for storing. An empty query string stops querying, and the stored movie details are written into a bookcase XML file.
The default place to store the output file is the working directory. Use the --directory option to change that. You can use relative or absolute paths.
Use the --noimages option if you do not want to retrieve cover images at all.
The --version option prints out a line with the program name and version number.
The --help option shows this help.
If the program fails to retrieve movie title information, use the --debug option to find out why.
This option creates a file named 'imdbfetch_out.html' in the working directory containing the movie title page.
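The option handling described above could be sketched with the core Getopt::Long module. This is only an illustration, not the program's actual code: the sub name parse_args and the hash keys are hypothetical, and the short form -d is left out here because the synopsis lists it for both --debug and --directory.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: split an argument list into an options hash
# and the remaining query strings.
sub parse_args {
    my @args = @_;
    # Defaults: write into the working directory, fetch images, no debug.
    my %opt = (directory => '.', noimages => 0, debug => 0);
    GetOptionsFromArray(
        \@args,
        'directory=s' => \$opt{directory},   # takes a value
        'noimages|i'  => \$opt{noimages},    # boolean flag
        'debug'       => \$opt{debug},       # boolean flag
    ) or die "Bad command line options\n";
    # Whatever is left in @args are the movie title queries.
    return (\%opt, @args);
}

my ($opt, @queries) = parse_args('--noimages', '--directory', '/tmp', 'Dr. No');
print "dir=$opt->{directory} noimages=$opt->{noimages} queries=@queries\n";
```

Getopt::Long accepts unique abbreviations by default, so in this sketch --dir would be treated as --directory.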
0.0.0, 15 Mar 2004, start of the project
0.9.0, 22 Mar 2004, first pre-release
1.0.0, 23 Mar 2004, first public release
1.0.1, 24 Mar 2004, - fixed table finding subroutine calling
- more docs and debug code
1.0.2, 25 Mar 2004, - include the program's URL into docs
- survive missing director
- do not use 'and' in 'Black & White'
- 'Unrated (USA)' needs to be 'U (USA)'?
1.0.3, 26 Mar 2004, - Alt title field format needs to be '2' (title)
1.0.4, 12 Apr 2004, - fail better on incomplete records (thanks to
Gonzalo Porcel)
1.0.5, 17 Apr 2004, - documentation fixes
1.0.6, 09 Nov 2004, - rename Bookcase->Tellico
1.0.7, 12 Nov 2004, - options --noimages and --dir; contributed by Dylan
Brewis
1.1.0, 15 Nov 2004, - retrieve full cast list and full plot summary
1.2.0, 21 Jan 2005, - new IMDB query web page; fix for alternative titles
1.2.1, 20 Jun 2005, - HTML::Entities is in the module HTML::Parser together
with HTML::TokeParser which is already used, so let's
use it to fix non-ASCII characters in film titles.
1.2.2, 30 Aug 2005, - Correction for the change of the title of the search page
1.2.3, 17 Mar 2005, - Correction for the change of the title of the movie
1.2.4, 29 Mar 2006, - The web site had changed slightly again
Please report bugs to the author.
You may distribute this program under the same terms as perl itself.
Heikki Lehvaslaiho, heikki a ebi ac uk
Dylan Brevis dylan a dylan me uk
You can get the latest version of this program at http://heikki.lehvaslaiho.googlepages.com/progs/
The rest of the documentation details each of the subroutines this program is composed of.
Example : init();
Description: Initialize non-standard perl modules and fail
gracefully if any of them is missing.
Checks that the output directory exists and is writable.
Returns : true on success
Exceptions : dies on fail
Caller : query()
see the LWP::UserAgent manpage
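The two checks that init() is described as doing, loading modules and validating the output directory, might look roughly like the sketch below. The sub names missing_modules, check_output_dir and init_sketch are hypothetical, and the usage line passes a core module and the system temporary directory only so that the example runs anywhere.

```perl
use strict;
use warnings;
use File::Spec;

# Return the names of the modules that cannot be loaded.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

# Die with a readable message unless the directory exists and is writable.
sub check_output_dir {
    my ($dir) = @_;
    die "Output directory '$dir' does not exist\n" unless -d $dir;
    die "Output directory '$dir' is not writable\n" unless -w $dir;
    return 1;
}

# Hypothetical stand-in for init(): true on success, dies on failure.
sub init_sketch {
    my ($dir, @modules) = @_;
    my @missing = missing_modules(@modules);
    die "Please install the missing modules: @missing\n" if @missing;
    return check_output_dir($dir);
}

my $ok = eval { init_sketch(File::Spec->tmpdir, 'MIME::Base64') };
print $ok ? "init ok\n" : "init failed: $@";
```

Collecting every missing module before dying lets the user install them all in one go instead of discovering them one at a time.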
Arg [1] : string to append to IMDB base URL
Example : imdb_query('film name');
imdb_query('/title/tt0056628/');
Description: Retrieve page from IMDB.
Unless the query string starts with '/',
'/find?' will be appended to base URL.
Returns : HTML page string or undef
Exceptions : none
Caller : query()
see the LWP::UserAgent manpage
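The URL rule in the imdb_query() description can be illustrated as follows. The sketch only builds the URL and does not fetch anything. The base URL, the q= parameter name and the sub name build_url are assumptions made for the example; the documentation above states only that '/find?' is used for non-path queries.

```perl
use strict;
use warnings;

my $BASE_URL = 'http://www.imdb.com';   # assumed base URL

# Hypothetical helper: turn a query string into a full URL.
sub build_url {
    my ($query) = @_;
    # A string starting with '/' is already a site path, e.g. a title page.
    return $BASE_URL . $query if $query =~ m{^/};
    # Otherwise percent-encode it (byte strings assumed) and search for it.
    (my $escaped = $query) =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return "$BASE_URL/find?q=$escaped";   # 'q=' is an assumed parameter name
}

print build_url('film name'), "\n";
print build_url('/title/tt0056628/'), "\n";
```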
Arg [1] : IMDB movie title id string
Example : imdb_cast_page('tt0056628');
Description: Retrieve the full cast page from IMDB.
Returns : HTML page string or undef
Exceptions : none
Caller : parse_entry()
see the LWP::UserAgent manpage
Arg [1] : IMDB movie title id string
Example : imdb_summary_page('tt0056628');
Description: Retrieve the plot summary page from IMDB.
Returns : HTML page string or undef
Exceptions : none
Caller : parse_entry()
see the LWP::UserAgent manpage
Arg [1] : HTML text string
Arg [2] : depth of the table in the string
Arg [3] : order no. of the table in the string
Arg [4] : boolean to keep the HTML, optional, default false
Example : extract_table($string, $depth, $count, $keep);
Description: Extract a table from an HTML text string
Returns : string, the HTML table content
Exceptions : none
Caller : query()
see the HTML::TableExtract manpage.
Arg [1] : HTML text string
Example : extract_title_links($string);
Description: Extract movie title links from an HTML text string
Returns : hashref containing an array 'text' (movie title strings) and
a hash 'url' (title => url)
Exceptions : none
Caller : query()
see the HTML::TokeParser manpage.
Arg [1] : HTML text string
Arg [2] : array of query strings
Example : find_table($string, @queries);
Description: Find coordinates for HTML tables containing a query string.
Coordinates can be used to retrieve a table for content parsing.
Coordinates are used by HTML::TableExtract.
Query strings should be selected so that only one table matches,
but the function returns an array of matches so that this
can be tested.
Returns : a hashref of arrays of table coordinate arrayrefs
Exceptions : none
Caller : parse_entry
see the HTML::TableExtract manpage.
Arg [1] : HTML text string
Example : parse_entry($string);
Description: Parse movie details from an IMDB page
Returns : hashref of movie details
Exceptions : none
Caller : select_a_movie()
Arg [1] : string, URL of the thumbnail image
Example : cover_image($url);
Description: Retrieves a tiny jpg image used to represent the movie
Returns : string, base64 encoded jpg cover image or 0
Exceptions : none
Caller : into_xml()
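The return convention of cover_image(), a base64 string or 0, can be sketched with the core MIME::Base64 module. The download step is left out on purpose; the hypothetical encode_cover() below takes image bytes that are assumed to have been fetched already.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: encode already-downloaded image bytes,
# or return 0 when there is nothing to encode.
sub encode_cover {
    my ($bytes) = @_;
    return 0 unless defined $bytes && length $bytes;
    # The empty second argument suppresses the default line breaks.
    return encode_base64($bytes, '');
}

# The first three bytes of a JPEG file, used here as stand-in data.
print encode_cover("\xFF\xD8\xFF"), "\n";
```

Returning 0 rather than undef mirrors the documented return value, so a caller can test the result in boolean context.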
Arg [1] : arrayref to a list of movie detail hashrefs
Example : into_xml($selected_movies);
Description: Convert data structure into bookcase version 5 XML
Returns : XML string
Exceptions : none
Caller : main()
Arg [1] : arrayref to a list of found movie titles
Arg [2] : prompt string
Arg [3] : integer, default value
Arg [4] : boolean, is empty selection disallowed?
Arg [5] : string, shown if empty selection was made
Example : picklist($items,$prompt,$default,
$req_nonempty,$empty_warning);
Description: Show a few of the items from a list at a time and
allow selecting one or many items. Display window size
is hard coded to 7. Note: Here the subroutine is
called in scalar mode, allowing only one item
Returns : array of picked list items
Exceptions : none
Caller : pick_a_movie()
Method copied from CPAN::FirstTime; see the CPAN::FirstTime manpage. I have modified it to allow the user to press 0 and not select anything. The number of items shown is settable by a global variable.
Arg [1] : arrayref to a list of found movie titles
Arg [2] : display window size
Arg [3] : current position
Example : display_some($items, $limit, $pos)
Description: Helper routine for picklist().
Prints out the list.
Returns : new position in the list
Exceptions : none
Caller : picklist()
Method copied from CPAN::FirstTime, See the CPAN::FirstTime manpage.
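The windowed display that picklist() and display_some() describe can be sketched as below. This is an independent illustration, not the code copied from CPAN::FirstTime; the sub name show_window and the sample titles are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: print up to $limit items starting at index $pos,
# numbered from 1, and return the position where the next window starts.
sub show_window {
    my ($items, $limit, $pos) = @_;
    $pos ||= 0;
    my $last = $pos + $limit - 1;
    $last = $#$items if $last > $#$items;   # do not run past the end
    for my $i ($pos .. $last) {
        printf "(%d) %s\n", $i + 1, $items->[$i];
    }
    return $last + 1;
}

my @titles = ('Dr. No (1962)', 'Goldfinger (1964)', 'Thunderball (1965)');
my $pos = show_window(\@titles, 2, 0);   # shows items 1 and 2
$pos    = show_window(\@titles, 2, $pos); # shows item 3
```

Feeding the returned position back into the next call is what lets the caller page through a long result list one window at a time.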
Arg [1] : arrayref to a list of movie titles
Example : pick_a_movie($exacts);
Description: Allow user to select one of the search results
Returns : string, URL to a selected title, or undef
Exceptions : none
Caller : query()
Arg [1] : HTML string, IMDB title page
Example : select_a_movie($html_page);
Description: Parse the selected movie title page and
ask user if it is the right one; store it in memory.
Returns : boolean, true if a movie title was selected and stored
Exceptions : none
Caller : query()
Arg [1] : string, movie title string
Example : query($string);
Description: Do the movie title web query and ask user which of
the matches are worth keeping
Returns : nil
Exceptions : none
Caller : main()
Example : store_into_file
Description: write the kept details into bookcase XML file
in the working directory
Returns : true or false
Exceptions : none
Caller : main()
Arg [1] : string, movie title string
Example :
Description: loop over command line arguments,
prompt for query strings,
and store the entries into a bookcase XML file
Returns : nil
Exceptions : none
Caller :