NAME

imdbfetch.pl - query movie details from IMDB and store them in a Tellico collection application file


SYNOPSIS

imdbfetch [-v|--version] | [-?|-h|--help] | [-d|--debug] [-i|--noimages] [-d|--directory value] [query query ...]


DESCRIPTION

This program lets you interactively query the Internet Movie Database (IMDB) and write the selected titles into a Tellico file. Tellico, known until recently as Bookcase, is a KDE collection management application by Robby Stephenson; see http://www.periapsis.org/tellico/.

IMDB is known to change its page layout regularly. Care has been taken to make HTML table parsing as robust as possible. The currently valid table locations in title pages are hard coded, but if they fail to retrieve the correct values, all tables on a title page are scanned for the recognition strings. The recognition strings can be changed in the code: search the source code for the string 'update'.

To avoid overloading the server, this program does not and will not support bulk downloading.


RUNNING

You can run this program in fully interactive mode by invoking it with the program name only. In addition, you can pass it movie titles or their substrings as command line arguments which are processed first.

For each query string, a list of most popular, exact and partial matches is returned - in this order. You can select one or more of them for storing. An empty query string stops querying and the stored movie details are written into a bookcase XML file.


OPTIONS

-d | --directory value

The default place to store the output file is the working directory. Use this option to change that. You can use relative or absolute paths.

-i | --noimages

Use this option if you do not want to retrieve cover images at all.

-v | --version

Print out a line with the program name and version number.

-? | -h | --help

Show this help.

-d | --debug

If the program fails to retrieve movie title information, use this option to find out why.

This option creates a file named 'imdbfetch_out.html' in the working directory containing the movie title page.
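
The options above could be parsed along the following lines with the core Getopt::Long module. This is only a sketch, not the program's actual code: the helper name parse_args and the option hash keys are hypothetical. Because the short form -d is listed for both --directory and --debug, the sketch assumes that -d belongs to --directory and leaves --debug as a long option only.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: split an argument list into an options hash
# and the remaining query strings.
sub parse_args {
    my @args = @_;
    my %opt = (directory => '.', noimages => 0, debug => 0);
    GetOptionsFromArray(
        \@args,
        'directory|d=s' => \$opt{directory},  # assumption: -d means --directory
        'noimages|i'    => \$opt{noimages},
        'debug'         => \$opt{debug},      # assumption: long form only
        'version|v'     => \$opt{version},
        'help|h|?'      => \$opt{help},
    ) or die "Bad options\n";
    return (\%opt, \@args);                   # whatever is left are queries
}

my ($opt, $queries) = parse_args('--noimages', '-d', '/tmp', 'Dr. No');
print "dir=$opt->{directory} noimages=$opt->{noimages} queries=@$queries\n";
```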


VERSION HISTORY

  0.0.0, 15 Mar 2004, start of the project
  0.9.0, 22 Mar 2004, first pre-release
  1.0.0, 23 Mar 2004, first public release
  1.0.1, 24 Mar 2004, - fixed table finding subroutine calling
                      - more docs and debug code
  1.0.2, 25 Mar 2004, - include the program's URL into docs
                      - survive missing director
                      - do not use 'and' in 'Black & White'
                      - 'Unrated (USA)' needs to be 'U (USA)'?
  1.0.3, 26 Mar 2004, - Alt title field format needs to be '2' (title)
  1.0.4, 12 Apr 2004, - fail better on incomplete records (thanks to
                        Gonzalo Porcel)
  1.0.5, 17 Apr 2004, - documentation fixes
  1.0.6, 09 Nov 2004, - rename Bookcase->Tellico
  1.0.7, 12 Nov 2004, - options --noimages and --dir; contributed by Dylan
                        Brewis
  1.1.0, 15 Nov 2004, - retrieve full cast list and full plot summary
  1.2.0, 21 Jan 2005, - new IMDB query web page; fix for alternative titles
  1.2.1, 20 Jun 2005, - HTML::Entities is in the module HTML::Parser together
                        with HTML::TokeParser which is already used, so let's
                        use it to fix non-ASCII characters in film titles.
  1.2.2, 30 Aug 2005, - Correction for the change of the title of the search page
  1.2.3, 17 Mar 2005, - Correction for the change of the title of the movie
  1.2.4, 29 Mar 2006, - The web site had changed slightly again


BUGS

Please report bugs to the author.


LICENSE

You may distribute this program under the same terms as perl itself.


AUTHOR

Heikki Lehvaslaiho, heikki a ebi ac uk


CONTRIBUTORS

Dylan Brevis dylan a dylan me uk


URL

You can get the latest version of this program at http://heikki.lehvaslaiho.googlepages.com/progs/


APPENDIX

The rest of the documentation details each of the subroutines this program is composed of.

init

  Example    : init();
  Description: Initialize non-standard perl modules and fail
               gracefully if any of them is missing.
               Checks that the output directory exists and is writable.
  Returns    : true on success
  Exceptions : dies on fail
  Caller     : query()

see the LWP::UserAgent manpage
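
The two checks described for init() can be sketched as below. This is an illustration under assumptions, not the program's code: the helper names missing_modules and dir_is_writable are hypothetical, and the module list is taken from the manpages referenced in this document. The real init() is documented to die on failure; the sketch only reports.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of modules that cannot be loaded.
sub missing_modules {
    return grep { !eval "require $_; 1" } @_;
}

# Hypothetical helper: true if the directory exists and is writable.
sub dir_is_writable {
    my ($dir) = @_;
    return (-d $dir && -w $dir) ? 1 : 0;
}

my @missing = missing_modules(
    qw(LWP::UserAgent HTML::TableExtract HTML::TokeParser));
print @missing ? "Missing: @missing\n" : "All modules found\n";
print "Working directory is ",
    (dir_is_writable('.') ? "" : "not "), "writable\n";
```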

imdb_query

  Arg [1]    : string to append to IMDB base URL
  Example    : imdb_query('film name');
               imdb_query('/title/tt0056628/');
  Description: Retrieve page from IMDB.
               Unless the query string starts with '/',
               '/find?' will be appended to base URL.
  Returns    : HTML page string or undef
  Exceptions : none
  Caller     : query()

see the LWP::UserAgent manpage
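
The URL rule described for imdb_query() (paths starting with '/' are used as they are, anything else becomes a '/find?' search) might look like the sketch below. The helper name build_imdb_url, the base URL value and the 'q' parameter name are assumptions made for illustration; the actual query string used by the program is not documented here. No network access is performed.

```perl
use strict;
use warnings;

# Assumption: illustrative base URL, not necessarily the program's value.
my $BASE_URL = 'http://www.imdb.com';

# Hypothetical helper: turn a query string into a full URL.
sub build_imdb_url {
    my ($query) = @_;

    # A leading '/' means the string is already a site path.
    return $BASE_URL . $query if $query =~ m{^/};

    # Otherwise percent-encode the title and build a search URL.
    my $escaped = $query;
    utf8::encode($escaped);    # work on UTF-8 bytes
    $escaped =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return "$BASE_URL/find?q=$escaped";    # assumption: parameter name 'q'
}

print build_imdb_url('film name'), "\n";
print build_imdb_url('/title/tt0056628/'), "\n";
```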

imdb_cast_page

  Arg [1]    : IMDB movie title id string
  Example    : imdb_cast_page('tt0056628');
  Description: Retrieve the full cast page from IMDB.
  Returns    : HTML page string or undef
  Exceptions : none
  Caller     : parse_entry()

see the LWP::UserAgent manpage

imdb_summary_page

  Arg [1]    : IMDB movie title id string
  Example    : imdb_summary_page('tt0056628');
  Description: Retrieve the plot summary page from IMDB.
  Returns    : HTML page string or undef
  Exceptions : none
  Caller     : parse_entry()

see the LWP::UserAgent manpage

extract_table

  Arg [1]    : HTML text string
  Arg [2]    : depth of the table in the string
  Arg [3]    : order no. of the table in the string
  Arg [4]    : boolean to keep the HTML, optional, default false
  Example    : extract_table($string, $depth, $count, $keep);
  Description: Extract a table from an HTML text string
  Returns    : string, the HTML table content
  Exceptions : none
  Caller     : query()

see the HTML::TableExtract manpage.

extract_title_links

  Arg [1]    : HTML text string
  Example    : extract_title_links($string);
  Description: Extract movie title links from an HTML text string
  Returns    : hashref containing an array 'text' (movie title strings) and
                                  a hash 'url' (title => url)
  Exceptions : none
  Caller     : query()

see the HTML::TokeParser manpage.

find_table

  Arg [1]    : HTML text string
  Arg [2]    : array of query strings
  Example    : find_table($string, @queries);
  Description: Find coordinates for HTML tables containing a query string.
               Coordinates can be used to retrieve a table for content parsing.
               Coordinates are used by HTML::TableExtract.
               Query strings should be selected so that only one table matches,
               but the function returns an array of matches so that this
               can be tested.
  Returns    : a hashref of arrays of table coordinate arrayrefs
  Exceptions : none
  Caller     : parse_entry

see the HTML::TableExtract manpage.

parse_entry

  Arg [1]    : HTML text string
  Example    : parse_entry($string);
  Description: Parse movie details from an IMDB page
  Returns    : hashref of movie details
  Exceptions : none
  Caller     : select_a_movie()

cover_image

  Arg [1]    : string, URL of the thumbnail image
  Example    : cover_image($url);
  Description: Retrieves a tiny jpg image used to represent the movie
  Returns    : string, base64 encoded jpg cover image or 0
  Exceptions : none
  Caller     : into_xml()
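
The encoding step of cover_image() can be illustrated with the core MIME::Base64 module. This sketch assumes the image bytes have already been downloaded; the helper name encode_cover is hypothetical, and whether the real program emits line breaks in the encoded string is not known (here they are suppressed).

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: base64-encode raw image bytes.
# Returns 0 when there is nothing to encode, mirroring the documented
# "base64 encoded jpg cover image or 0" return value.
sub encode_cover {
    my ($bytes) = @_;
    return 0 unless defined $bytes && length $bytes;
    return encode_base64($bytes, '');   # '' = no line breaks (assumption)
}

# Stand-in data: the first bytes of a JPEG file, not a real image.
my $fake_jpeg = "\xFF\xD8\xFF\xE0";
print encode_cover($fake_jpeg), "\n";
```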

into_xml

  Arg [1]    : arrayref to a list of movie detail hashrefs
  Example    : into_xml($selected_movies);
  Description: convert data structure into bookcase version 5 XML
  Returns    : XML string
  Exceptions : none
  Caller     : main()

picklist

  Arg [1]    : arrayref to a list of found movie titles
  Arg [2]    : prompt string
  Arg [3]    : integer, default value
  Arg [4]    : boolean, is empty selection disallowed?
  Arg [5]    : string, shown if empty selection was made
  Example    : picklist($items,$prompt,$default,
                        $req_nonempty,$empty_warning);
  Description: Show a few of the items from a list at a time and
               allow selecting one or many items. Display window size
               is hard coded to 7.  Note: Here the subroutine is
               called in scalar mode, allowing only one item
  Returns    : array of picked list items
  Exceptions : none
  Caller     : pick_a_movie()

Method copied from CPAN::FirstTime; see the CPAN::FirstTime manpage. I have modified it to allow the user to press 0 and not select anything. The number of items shown is settable by a global variable.

display_some

  Arg [1]    : arrayref to a list of found movie titles
  Arg [2]    : display window size
  Arg [3]    : current position
  Example    : display_some($items, $limit, $pos)
  Description: Helper routine for picklist().
               Prints out the list.
  Returns    : new position in the list
  Exceptions : none
  Caller     : picklist()

Method copied from CPAN::FirstTime, See the CPAN::FirstTime manpage.
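
The windowed listing that display_some() performs for picklist() can be sketched as follows. This is not the code copied from CPAN::FirstTime; it is a simplified illustration with a hypothetical name, show_window, and an assumed numbering format. It prints at most $limit items starting at $pos and returns the position where the next window would begin.

```perl
use strict;
use warnings;

# Hypothetical helper: print one window of a list and return the
# position just past the last item shown.
sub show_window {
    my ($items, $limit, $pos) = @_;
    $pos ||= 0;
    my $end = $pos + $limit;
    $end = @$items if $end > @$items;    # do not run past the list
    for my $i ($pos .. $end - 1) {
        printf "%2d) %s\n", $i + 1, $items->[$i];
    }
    return $end;
}

my $titles = ['Dr. No', 'Goldfinger', 'Thunderball'];
my $next = show_window($titles, 2, 0);   # prints the first two titles
show_window($titles, 2, $next);          # prints the remaining one
```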

pick_a_movie

  Arg [1]    : arrayref to a list of movie titles
  Example    : pick_a_movie($exacts);
  Description: Allow user to select one of the search results
  Returns    : string, URL to a selected title, or undef
  Exceptions : none
  Caller     : query()

select_a_movie

  Arg [1]    : HTML string, IMDB title page
  Example    : select_a_movie($html_page);
  Description: Parse the selected movie title page and
               ask user if it is the right one; store it in memory.
  Returns    : boolean, true if a movie title was selected and stored
  Exceptions : none
  Caller     : query()

query

  Arg [1]    : string, movie title string
  Example    : query($string);
  Description: Do the movie title web query and ask user which of
               the matches are worth keeping
  Returns    : nil
  Exceptions : none
  Caller     : main()

store_into_file

  Example    : store_into_file
  Description: write the kept details into a bookcase XML file
               in the working directory
  Returns    : true or false
  Exceptions : none
  Caller     : main()

main

  Arg [1]    : string, movie title string
  Example    : 
  Description: loop over command line arguments,
               prompt for query strings,
               and store the entries into a bookcase XML file
  Returns    : nil
  Exceptions : none
  Caller     :
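
The control flow described for main() (command line arguments first, then interactive prompts until an empty query) can be sketched as below. This is an assumption-laden outline, not the program's code: run_queries is a hypothetical name, and the line reader and query handler are passed in as code references so the loop can be shown without reading from a terminal or contacting IMDB.

```perl
use strict;
use warnings;

# Hypothetical helper: process the given queries, then keep reading
# lines until an empty one (or end of input) is seen.
sub run_queries {
    my ($args, $read_line, $handle) = @_;

    $handle->($_) for @$args;            # command line queries go first

    while (defined(my $line = $read_line->())) {
        $line =~ s/^\s+|\s+$//g;         # trim whitespace and newline
        last if $line eq '';             # empty query stops querying
        $handle->($line);
    }
    return;
}

# Canned input stands in for the interactive prompt.
my @typed = ("Goldfinger\n", "\n", "never reached\n");
run_queries(
    ['Dr. No'],
    sub { shift @typed },
    sub { print "querying: $_[0]\n" },
);
```

In a real program the reader would typically be something like sub { scalar <STDIN> } and the handler would call query(), with store_into_file() invoked once the loop ends.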