Yes, this. I've been trying to find a general way to automatically chunk legislation semantically for a while now, partly to diff versions/amendments, but also to graph connections to other referenced legislation.
Most of the time I end up having to just take half an hour to manually regex and format plain text.
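For the case where the heading format is already known, the chunk-then-diff step can be sketched roughly as below. This is only an illustration, not a general solution: the `HEADING` pattern (a section number, some spaces, then a title on one line) is an assumption about the plain-text layout, and `chunk_sections` and `diff_sections` are hypothetical helper names.

```python
import difflib
import re

# Assumed layout: a section starts on a line like "12  Definitions" or "12A  Penalties".
# Body lines that happen to start with a number would be misread as headings.
HEADING = re.compile(r"^(\d+[A-Z]*)[ \t]+(\S.*)$", re.MULTILINE)


def chunk_sections(text):
    """Split plain text into {section_number: heading_plus_body}."""
    matches = list(HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.start():end].strip()
    return sections


def diff_sections(old_text, new_text):
    """Return {section_number: unified diff lines} for sections that changed."""
    old, new = chunk_sections(old_text), chunk_sections(new_text)
    changes = {}
    for number in sorted(set(old) | set(new)):
        before = old.get(number, "").splitlines()
        after = new.get(number, "").splitlines()
        if before != after:
            changes[number] = list(
                difflib.unified_diff(
                    before, after, fromfile="old", tofile="new", lineterm=""
                )
            )
    return changes
```

Keying the diff on section number keeps an inserted or removed section from shifting every later section out of alignment, which is the usual problem with diffing the whole text at once.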
A particular case I have is where there is a draft bill put out for industry/community consultation. Quickly diffing the releases is the goal, but for now it usually relies on one (preferably two) subject matter experts reading the whole thing top to bottom to build an understanding. I don't think these would be available via the means you've secured. They are usually hosted on the relevant government entity's website as PDFs.
One last question/comment, have you considered adding some additional reference info like the federal list of entities?[1]
> A particular case I have is where there is a draft bill put out for industry/community consultation. Quickly diffing the releases is the goal, but for now it usually relies on one (preferably two) subject matter experts reading the whole thing top to bottom to build an understanding. I don't think these would be available via the means you've secured. They are usually hosted on the relevant government entity's website as PDFs.
It's possible that they're in my database. I have included the as-made version of every bill on the Federal Register of Legislation. However, if they haven't had a first reading yet, then probably not.
For processing PDFs, I recommend using `pdfplumber`, which is what I used to build the Corpus. Happy to discuss further if you'd like.
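A minimal sketch of pulling text out of a PDF with `pdfplumber`, as a starting point only. It is not the pipeline used to build the Corpus: `extract_pages` and `tidy` are hypothetical helper names, and the cleanup rules in `tidy` are assumptions about what a typical text-layer PDF needs.

```python
import re


def extract_pages(pdf_path):
    """Return the extracted text of each page, one string per page."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # Pages without a text layer (e.g. scans) may yield None or "".
        return [page.extract_text() or "" for page in pdf.pages]


def tidy(page_text):
    """Light cleanup of one page of extracted text."""
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", page_text)
    # Collapse runs of spaces and tabs left over from the page layout.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.splitlines()).strip()
```

Something like `[tidy(page) for page in extract_pages("draft_bill.pdf")]` would then give per-page plain text, where `draft_bill.pdf` is a placeholder path. Note that the hyphen rule will also merge genuinely hyphenated terms that happen to break across lines.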
> One last question/comment, have you considered adding some additional reference info like the federal list of entities?
Do you mean adding additional metadata? At the moment, I've kept the number of metadata attributes as low as possible. Every attribute added equates to more work to keep it standardised across all the jurisdictions and document types. My plan is to slowly add more attributes as I have time. I'd really like to associate a date with documents, but even that is a hurdle. I have to decide which date should be the date of a document: the time it was issued, the time it was published, the time it came into force, the time the latest version was issued, and so on. And what happens when a document doesn't have a date? Should I extract it from its citation? How do I preserve time zone information?
I've used a number of PDF libraries in Python and C# over the years, and none have worked as reliably as needed (that's just PDF, I guess). I haven't used pdfplumber, though, so I'll be sure to give it a go. Thanks for the suggestion.
Yes, additional metadata. I totally understand it adds a lot of complexity, but it could help for fine-tuning an LLM.
With regard to dates, I'm not a lawyer, but for Federal I would go with "Start Date"; it's always the day following the End Date of the previous compilation. The Date of Assent (well, the year at least) is in the title, but also the first start date. The registration date can fall either before or after the start date. [1][2]
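As a rough illustration of that Start Date / End Date relationship, the rule can be run in reverse to fill in end dates when only start dates are known. The `Compilation` class and `fill_end_dates` function are hypothetical names for this sketch, not fields from the Register.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Compilation:
    number: int
    start: date
    end: Optional[date] = None  # None means no later compilation is known


def fill_end_dates(compilations):
    """Set each compilation's end to the day before the next one starts."""
    ordered = sorted(compilations, key=lambda c: c.start)
    for current, following in zip(ordered, ordered[1:]):
        current.end = following.start - timedelta(days=1)
    return ordered
```

This only covers whole-document compilations; it says nothing about sections with their own commencement dates.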
The tricky part is when sections have different commencement dates that are detailed in the text. I don't know of anywhere that this is easily accessible. And, if you think about it, it's usually the most important information for, say, businesses being regulated.
I wouldn't worry about time zones per se; it's relative to each particular state.[3] That's why polling in a federal election closes at 6pm in each state rather than being coordinated with the ACT.
[1] https://www.finance.gov.au/government/managing-commonwealth-...